Inferential statistics

Up until now we have explored probability and statistics very abstractly, without ever applying it to a real world events; even when we calculated probabilities about dice rolls or medical tests, we did it by studying theoretical models, for example we assumed that the probability of rolling a 1 was 161\over 6 without needing any real world data to support this hypothesis.
In this chapter we will begin to explore inferential statistics, that is the statistics that deals with how to look at real world data and deduce valuable informations.

In general our goal when doing inferential statistics is to observe a random phenomenon, measure several occurrences of that phenomenon, and then deduce the distribution of the random variable that represents it or some property of that distribution.
To be precise we will mostly be doing parametric statistics, that is we will assume that it's distribution is within some family, for example the family of the normal distributions, and then deduce which distribution best fits the data, for example finding the mean and variance of the normal distribution.

Statistical hypothesis test

Let's begin with an example: we think that the average width of the petals of the Iris flower is 1.5 cm, to test that hypothesis we will pick and measure 100 flowers and we decide that if the average width is between 1.4 cm and 1.6 cm we will consider our hypothesis to be correct, otherwise we will consider it to be incorrect.

We have just constructed an hypothesis test, formally we would say that:

  • we have taken a random sample, i.e. the petal width of the 100 flowers
  • we have defined a null hypothesis, i.e. the average petal width is 1.5 cm, and an alternative hypothesis, i.e. the average petal width is not 1.5 cm
  • we have defined a region of rejection, i.e. all the combinations of petal widths whose average isn't between 1.4 cm and 1.6 cm

Using this procedure we can experimentally verify, with some degree of certainty, that our hypothesis is correct. At this point you are probably asking yourself "How can we be sure that the range 1.4 cm to 1.6cm is the right one? Isn't it too large, or isn't it too small?", luckily for us statistics has a few tools to assess the quality of an hypothesis test, the most important ones being:

  • the significance level of a test, usually denoted by α\alpha, is the probability of rejecting the null hypothesis even though in reality it's true
  • the power of a test, usually denoted by 1β1-\beta, is the probability of rejecting the null hypothesis when it's false

To evaluate any hypothesis test we just need to compute it's significance level and it's power, a good test will have a low significance level and a high power; generally the test is built by choosing the significance level and then maximizing the power.

In our floral example we will make a couple extra assumptions: the distribution of the petal widths is a normal distribution (this is quite reasonable, in fact this is true for many measurements of natural occurring phenomena) with variance of 0.16 cm (this assumption is needed for didactical purposes, later on we won't need this simplification anymore).
So we know that the distribution of any petal width is N(μ,0.16)\text{N}(\mu, 0.16), and we can write the null hypothesis as μ=1.5\mu=1.5; now we can compute the significance level:

α=P[Xˉn>1.6 or Xˉn<1.4μ=1.5]\alpha = \mathbb{P}[\bar X_n>1.6\text{ or } \bar X_n<1.4\mid \mu = 1.5]

To compute the significance level we assume the null hypothesis to be true, so we know that the petal width has distribution N(1.5,0.16)N(1.5, 0.16), thus the average of the 100 measurements will have normal distribution with mean 1.51.5 and variance 0.16/100=0.00160.16/100 = 0.0016, so Z=Xˉn1.50.0016Z={\bar X_n - 1.5 \over \sqrt{0.0016}} has distribution N(0,1)N(0,1). Let's try to rewrite the previous expression in terms of ZZ:

α=P[Xˉn>1.6 or Xˉn<1.4μ=1.5]=P[Xˉn1.50.0016>1.61.50.0016 or Xˉn1.50.0016<1.41.50.0016μ=1.5]=P[Z>2.5 or Z<2.5]=P[Z>2.5]+P[Z<2.5]\begin{aligned} \alpha &= \mathbb{P}\left[\bar X_n>1.6\text{ or } \bar X_n<1.4\bigm\vert \mu = 1.5\right]\\ &=\mathbb{P}\left[{\bar X_n - 1.5\over \sqrt{0.0016}}> {1.6-1.5\over \sqrt{0.0016}} \text{ or }{\bar X_n - 1.5\over \sqrt{0.0016}} < {1.4-1.5\over \sqrt{0.0016}}\biggm\vert \mu=1.5\right]\\ &=\mathbb{P}[Z>2.5\text{ or }Z<-2.5]\\ &=\mathbb{P}[Z>2.5]+\mathbb{P}[Z<-2.5] \end{aligned}

To finish the calculations we need to compute the two probabilities at the last term, since the distribution of ZZ is known they can be computed by integrating the probability density function of N(0,1)N(0,1). Computing these integrals by hand is hard and tedious, so we usually rely on computers to do the calculations for us:

α=P[Z>2.5]+P[Z<2.5]=2.5+12πe(xμ)22  dx+2.512πe(xμ)22  dx=0.0062+0.0062=0.0124\begin{aligned} \alpha &=\mathbb{P}[Z>2.5]+\mathbb{P}[Z<-2.5]\\ &=\int_{2.5}^{+\infty}{1\over\sqrt{2\pi}} e^{-{(x-\mu)^2\over 2}} \; dx + \int_{-\infty}^{-2.5}{1\over\sqrt{2\pi}} e^{-{(x-\mu)^2\over 2}} \; dx\\ &=0.0062 + 0.0062\\ &=0.0124 \end{aligned}

Our test thus has a significance level of about 1%, this is a very good significance level: depending on the field of study the significance level for a test to be considered good usually has to be below 5% or in some cases below 1%.

You may find it weird that the primary focus when designing an hypothesis test is keeping the significance level low, shouldn't we be more concerned in making sure that the power is high? Isn't it more important to be sure that if our hypothesis is correct we know that it's correct? The confusion comes from the fact that often we choose as null hypothesis the one we hope is false and thus we want to be certain that we don't reject it when it's true.
For example if we want to test the effectiveness of a fertilizer the null hypothesis could be that the plant's height is unaffected by the fertilizer, thus rejecting the null hypothesis means that our fertilizer is effective, so for our test to be meaningful we must make sure that the probability of rejecting the null hypothesis when it's true is low.
Unfortunately in this case we don't have a simple rule: in other cases the null hypothesis is the common behavior (e.g. the petal width of similar species) and in other cases it's chosen in yet another way, it's up to your good judgment choosing the most sensible null hypothesis.

What if the samples are not normally distributed?

You may argue that the hypothesis test we've seen only works because the sample is normally distributed and that's not always true, so what do we do in such cases? We use the Central Limit Theorem.

Suppose we have a complex reaction between two chemical agents and we know that it has a certain probability to succeed, we know from similar reactions that it should be less than 5%, but we want to verify this empirically; we have data about 50 reactions and we want to build a hypothesis test.

In this case the random variables XiX_i representing each reaction can only have 2 values: 1 for successful reaction and 0 for failed reaction, this definitely isn't a normally distributed variable, but we'll see that we can leverage the Central Limit Theorem (CLT) to still be dealing with a normal distribution in the hypothesis test. The first step to apply the CLT is knowing the average and the variance of the random variables, so let's compute them.

The average can be computed as for any discrete random variable:

E[Xi]=1P(Xi=1)+0P(Xi=0)=p\mathbb{E}[X_i] = 1\cdot \mathbb{P}(X_i=1) + 0\cdot \mathbb{P}(X_i=0) = p

Now we can compute the variance:

Var(Xi)=E[(Xip)2]=(1p)2P(Xi=1)+(0p)2P(Xi=0)=(1p)2p+p2(1p)=p(1p)(1p+p)=p(1p)\begin{aligned} \text{Var}(X_i)&=\mathbb{E}[(X_i-p)^2]\\ &=(1-p)^2\cdot \mathbb{P}(X_i=1)+(0-p)^2\cdot\mathbb{P}(X_i=0)\\ &=(1-p)^2\cdot p + p^2\cdot (1-p)\\ &=p(1-p)(1-p+p)\\ &=p(1-p) \end{aligned}

This kind of distribution, that can only have value 00 or 11 and has probability pp to be 11, is very common in statistics and even has its own name: Bernoulli distribution, often shortened to Be(p)\text{Be}(p).

We know that the random variables XiX_i representing each reaction all have distribution Be(p)\text{Be}(p), in short we write XiBe(p)X_i\sim \text{Be}(p), and our null hypothesis is p0.05p\le0.05, we can now choose a region of rejection and compute the significance level.
The first step in choosing a region of rejection is summarizing all the sample data in a single metric, in this case the natural choice is the average of the samples Xˉn\bar X_n; it's the natural choice because E[Xˉn]=p\mathbb{E}[\bar X_n]=p and our hypothesis is about pp. We can now choose the region of rejection to be, for example, Xˉn>0.1\bar X_n>0.1.

Proceding as in the previous example we write out explicitly the significance level:

α=P[Xˉn>0.1p0.05]\alpha = \mathbb{P}[\bar X_n > 0.1\mid p\le 0.05]

The first issue here is that the null hypothesis doesn't choose a specific value for pp: the probability of the sample being in the region of rejection is different if p=0.01p=0.01 rather than p=0.04p=0.04, but both cases are included in the null hypothesis. We thus compute the probability in the worst case, that is the one in which that probability is higher. In this example the worst case is p=0.05p=0.05 and in general it's almost always the equality corresponding to the boundary of the null hypothesis.
We thus want to compute

α=P[Xˉn>0.1p=0.05]\alpha = \mathbb{P}[\bar X_n > 0.1\mid p = 0.05]

The distribution of Xˉn\bar X_n is quite hard to deal with, so we use the Central Limit Theorem (CLT) to make a normal distribution appear: from the CLT we know that Zn=Xˉnμσ2/nZ_n={\bar X_n - \mu \over \sqrt{\sigma^2/n}} tends to a normal distribution, since nn is quite big (the rule of thumb is n>30n>30) we can approximate the distribution of ZZ to the limit distribution, that is N(0,1)\text{N}(0,1). Notice that in the definition of ZnZ_n we use μ\mu and σ2\sigma^2, those are the expected value and variance of the variables XiX_i that we have computed before: μ=p\mu=p and σ2=p(1p)\sigma^2=p(1-p), furthermore in our case n=50n=50, so we have Z50=Xˉnpp(1p)/50Z_{50} = {\bar X_n - p\over \sqrt{p(1-p)/50}}.
Now we just need to rewrite the expression for α\alpha with respect to Z50Z_{50}:

α=P[Xˉn>0.1p=0.05]=P[Xˉnpp(1p)/50>0.1pp(1p)/50p=0.05]=P[Z50>0.10.050.05(10.05)/50]=P[Z50>0.64]\begin{aligned} \alpha &= \mathbb{P}[\bar X_n > 0.1\mid p = 0.05]\\ &=\mathbb{P}\left[{{\bar X}_n-p\over \sqrt{p(1-p)/50}} > {0.1-p\over\sqrt{p(1-p)/50}}\Bigm\vert p=0.05\right]\\ &=\mathbb{P}\left[Z_{50} > {0.1-0.05\over\sqrt{ 0.05(1-0.05)/50}}\right]\\ &=\mathbb{P}[Z_{50}>0.64] \end{aligned}

Knowing that Z50Z_{50} has distribution N(0,1)\text{N}(0,1), we can conclude the computation by looking up the computer calculated value for P[Z50>0.64]\mathbb{P}[Z_{50}>0.64], which comes out to be 5,26%; depending on our field of work we could now accept the 5,26% significance or try to build a test with a lower level of significance.

This technique of replacing a hard to deal with distribution with the approximation given by the Central Limit Theorem can be applied to a wide range of cases provided that the sample is big enough (n>30n>30).