The Gaussian distribution

Perhaps the most important distribution in all of statistics is the Gaussian distribution, also known as the normal distribution. It has a bell shape and is supported on the whole real line: for every real value $x$ we have $f_X(x)>0$.

To be precise, we don't have one single normal distribution; rather, we have an infinite family of normal distributions, each characterized (or more formally "parameterized") by $\mu$ and $\sigma^2$. For this reason we usually denote a normal distribution as $N(\mu,\sigma^2)$.


The probability density function of a normal distribution is

$$f(x)={1\over\sqrt{2\pi\sigma^2}}\,e^{-{(x-\mu)^2\over 2\sigma^2}}$$

It may look scary at first, but let's unpack it bit by bit:

  • the fraction ${1\over\sqrt{2\pi\sigma^2}}$ is just a normalization term: it is there only to make sure that the area under the curve is $1$. This is a convention throughout probability, and it doesn't fundamentally change the essence of the distribution
  • the second term is the exponential of a negative quantity, so it approaches $0$ very fast as the exponent grows in magnitude
  • in the exponent we have $(x-\mu)^2$, which is symmetric with respect to $\mu$ and thus makes the whole distribution symmetric with respect to $\mu$
  • finally, in the denominator of the exponent we have $\sigma^2$, so for larger values of $\sigma^2$ the exponent grows more slowly and the whole distribution approaches $0$ more slowly
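As a sanity check, here is a small sketch of the density above (plain Python standard library; the function name is ours), verifying numerically that the density is positive everywhere and that the normalization term really makes the area under the curve equal to $1$:

```python
import math

def normal_pdf(x, mu=0.0, sigma2=1.0):
    """Density of N(mu, sigma2) evaluated at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# The density is strictly positive at every real x, even far from the mean.
assert normal_pdf(10.0) > 0

# Approximate the area under the curve with a Riemann sum on a wide
# interval: thanks to the normalization term it comes out very close to 1.
step = 0.001
area = sum(normal_pdf(-10 + i * step) * step for i in range(int(20 / step)))
print(area)  # very close to 1
```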

The probability distribution is defined with $\mu$ and $\sigma^2$ as parameters, which may look confusing because $\mu$ and $\sigma^2$ are also used to refer to the mean and the variance of a random variable. However, it turns out that if we have a random variable $X$ with distribution $N(\mu,\sigma^2)$, which we can also compactly write as $X\sim N(\mu,\sigma^2)$, then $\mathbb{E}[X]=\mu$ and $Var(X)=\sigma^2$. We can easily prove the statement about the expected value by leveraging the symmetry of the distribution; the one about the variance is less straightforward.


Let's start by considering $\mu=0$: take a random variable $X\sim N(0,\sigma^2)$; then by definition its expected value is

$$\mathbb{E}[X]=\int_{-\infty}^{+\infty} x f(x)\; dx$$

As we noticed above, the function $f$ is symmetric with respect to $\mu$, so in this case we have $f(x)=f(-x)$. We can leverage this fact by splitting the integral into two parts and then performing a change of variable in the first one:

$$\begin{aligned} \mathbb{E}[X]&=\int_{-\infty}^{0} x f(x)\; dx \; + \; \int_{0}^{+\infty} x f(x)\; dx\\ &=\int_{0}^{+\infty} -x f(-x)\; dx \; + \; \int_{0}^{+\infty} x f(x)\; dx\\ &=\int_{0}^{+\infty} -x f(x)\; dx \; + \; \int_{0}^{+\infty} x f(x)\; dx\\ &=-\int_{0}^{+\infty} x f(x)\; dx \; + \; \int_{0}^{+\infty} x f(x)\; dx\\ &=0 \end{aligned}$$

To generalize the proof to the case of an arbitrary $\mu$, let's first notice that given any random variable $X$ and a constant $k$ we have

$$\begin{aligned} \mathbb{E}[X+k]&=\int (x + k) f(x)\; dx\\ &=\int x f(x)\; dx \; + \; \int k f(x)\; dx\\ &=\int x f(x)\; dx \; +\; k\int f(x)\; dx\\ &=\mathbb{E}[X] + k \cdot 1\\ &=\mathbb{E}[X] + k \end{aligned}$$

Then we can conclude by noticing that if $X\sim N(\mu, \sigma^2)$ then $X-\mu\sim N(0, \sigma^2)$, since we can easily verify that $f_{N(\mu,\sigma^2)}(x)=f_{N(0,\sigma^2)}(x-\mu)$.
Trivially $X=X-\mu+\mu$, so $\mathbb{E}[X]=\mathbb{E}[X-\mu+\mu]=\mathbb{E}[X-\mu]+\mu$; but we have proven that $\mathbb{E}[X-\mu]=0$, so $\mathbb{E}[X]=\mu$.
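The result $\mathbb{E}[X]=\mu$ can also be checked numerically. Below is a quick throwaway sketch (our own code, with an arbitrary choice of $\mu$ and $\sigma^2$) that approximates the integral of $x f(x)$ with a Riemann sum:

```python
import math

def normal_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2) evaluated at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

mu, sigma2 = 3.0, 4.0

# Riemann-sum approximation of the integral of x * f(x) dx
# over a wide interval centered at mu.
step = 0.001
mean = 0.0
x = mu - 20.0
while x < mu + 20.0:
    mean += x * normal_pdf(x, mu, sigma2) * step
    x += step
print(mean)  # close to mu = 3.0
```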

From the properties of mean and variance that we discovered in the previous chapter, it follows that if we have a random variable $X$ with distribution $N(\mu, \sigma^2)$, then:

  • $X+k$ has distribution $N(\mu+k, \sigma^2)$
  • $X/h$ has distribution $N\left({\mu\over h}, {\sigma^2\over h^2}\right)$
  • ${X+k\over h}$ has distribution $N\left({\mu+k\over h}, {\sigma^2\over h^2}\right)$
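These facts are easy to check by simulation; here is a small sketch for the last one (example values chosen arbitrarily, standard library only):

```python
import random, statistics

random.seed(0)
mu, sigma2 = 2.0, 9.0   # X ~ N(2, 9)
k, h = 5.0, 3.0

# Draw from N(mu, sigma2) and apply the transformation (X + k) / h.
# Note: random.gauss takes the standard deviation, hence the square root.
samples = [(random.gauss(mu, sigma2 ** 0.5) + k) / h for _ in range(200_000)]

print(statistics.mean(samples))      # close to (mu + k) / h = 7/3
print(statistics.variance(samples))  # close to sigma2 / h**2 = 1
```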

Stability of the Gaussian distribution

An interesting property of the normal distribution is that any linear combination of normally distributed random variables is still normally distributed: if we take independent random variables $X\sim N(\mu_x, \sigma_x^2)$ and $Y\sim N(\mu_y, \sigma_y^2)$, then $Z=aX+bY$ (where $a$ and $b$ are real numbers) has distribution $N(\mu_z,\sigma_z^2)$. This property is called stability of the distribution.
Furthermore, from the properties of the expected value and the variance seen in the previous chapter it follows that $\mu_z=a\mu_x+b\mu_y$ and $\sigma_z^2=a^2\sigma_x^2+b^2\sigma_y^2$.
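Both formulas can be verified with a quick simulation sketch (coefficients and parameters below are arbitrary example values):

```python
import random, statistics

random.seed(1)
a, b = 2.0, -1.0
mu_x, var_x = 1.0, 4.0   # X ~ N(1, 4)
mu_y, var_y = 3.0, 1.0   # Y ~ N(3, 1)

# Sample Z = aX + bY from independent draws of X and Y.
z = [a * random.gauss(mu_x, var_x ** 0.5) + b * random.gauss(mu_y, var_y ** 0.5)
     for _ in range(200_000)]

print(statistics.mean(z))      # close to a*mu_x + b*mu_y = -1
print(statistics.variance(z))  # close to a**2*var_x + b**2*var_y = 17
```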

Down below you can see what happens when you sum two normally distributed random variables:

(Interactive plot: the sum of two normal random variables, with adjustable means and variances.)

Notice that not all distributions have the stability property; for example, the uniform distribution is not stable:

(Interactive plot: the sum of two uniform random variables, with adjustable means and variances.)
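To see the failure of stability concretely, here is a toy sketch of our own summing two independent Uniform(0, 1) variables: if the uniform family were stable the sum would again be uniform, but the histogram counts show that the middle of the range is far more likely than the edges.

```python
import random

random.seed(2)
n = 200_000
sums = [random.random() + random.random() for _ in range(n)]

# Split [0, 2] into four equal bins. A Uniform(0, 2) variable would fill
# them evenly, but the sum piles up in the middle (triangular density).
bins = [0, 0, 0, 0]
for v in sums:
    bins[min(int(v * 2), 3)] += 1
print(bins)  # the middle bins hold roughly three times the outer ones
```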


Convergence in distribution

If shown the sequence $1,{1\over2},{1\over3},{1\over4},\ldots$, even without formal mathematical training, you'd easily recognize that the sequence is approaching $0$, or more formally that the limit of the sequence is $0$; we can define a similar concept of limit for random variables too.

In the last chapter we have seen that a random variable is identified by its cumulative distribution function (CDF), so it seems natural to define the convergence of a sequence of random variables by looking at the sequence of their CDFs: given a sequence of random variables $X_1,X_2,\ldots$ we say that the sequence has the random variable $X$ as its limit if

$$\forall x\in\mathbb{R}.\; \lim_{n\to+\infty} F_{X_n}(x)=F_X(x)$$

This means that the value of the CDF of $X_n$ at any point approaches the value of the CDF of $X$ at that same point (strictly speaking, the limit is only required to hold at the points where $F_X$ is continuous). Visually it looks like this
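The definition can also be checked numerically on a toy sequence of our own choosing: take $X_n\sim \text{Uniform}(0, 1+1/n)$, whose limit is $X\sim \text{Uniform}(0,1)$. At each fixed point the CDF values of $X_n$ approach the CDF value of $X$:

```python
def F_n(x, n):
    """CDF of Uniform(0, 1 + 1/n)."""
    return max(0.0, min(1.0, x / (1 + 1 / n)))

def F(x):
    """CDF of the limit Uniform(0, 1)."""
    return max(0.0, min(1.0, x))

# At each fixed x, F_n(x) gets closer to F(x) as n grows.
for x in (0.25, 0.5, 0.9):
    print(x, F_n(x, 10), F_n(x, 1000), F(x))
```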


The Central Limit Theorem

Many real-life random variables, for example height and weight, approximately follow a normal distribution, so studying its properties gives us more tools to study real-life phenomena. This alone would make the Gaussian distribution a very important distribution, but its applicability is far wider thanks to the Central Limit Theorem, whose statement is the following:

Given $n$ independent random variables $X_1,\ldots,X_n$, all having the same distribution with mean $\mu$ and variance $\sigma^2$, define $Z_n={\bar X_n -\mu\over \sqrt{\sigma^2/n}}$ where $\bar X_n$ is the average of the variables $X_i$; then the limit of the sequence $Z_n$ is a random variable with distribution $N(0,1)$.

Let's unpack all that jargon:

  • we take $n$ independent random variables with any distribution, not necessarily normal, but it has to be the same for all the variables
  • we average all the variables, just like we did in the law of large numbers. This time, however, we shift the result by subtracting $\mu$, which makes $\mathbb{E}[Z_n]=0$ (you can verify that with a bit of calculation if you feel like it), and then we scale the result by dividing it by $\sqrt{\sigma^2/n}$, which makes $Var(Z_n)=1$ (again, you can check it for yourself)
  • the incredible result is that as $n$ gets larger and larger the distribution of this "modified" average tends to a normal distribution, meaning that its cumulative distribution function (CDF) gets closer and closer to the CDF of the normal distribution $N(0,1)$

Unfortunately the proof of this theorem requires some advanced concepts, so we won't work through it, but we will use it extensively in the following chapters; it will be particularly useful to approximate the distribution of sums of independent random variables when working with the exact distribution is too hard.

For example, if we choose the random variables $X_i$ to represent $n$ coin flips, then the distribution of $Z_n$ compared to the $N(0,1)$ distribution looks as follows:
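The coin-flip comparison can be reproduced with a short simulation sketch (standard library only; the plot is replaced here by summary numbers). Each $X_i$ is a fair coin flip worth $0$ or $1$, so $\mu=1/2$ and $\sigma^2=1/4$:

```python
import random, statistics

random.seed(3)
mu, sigma2 = 0.5, 0.25   # mean and variance of a single fair coin flip
n = 100                  # coin flips averaged in each Z_n
reps = 20_000            # number of Z_n samples drawn

def z_sample():
    """One draw of Z_n = (X̄_n - mu) / sqrt(sigma2 / n)."""
    xbar = sum(random.randint(0, 1) for _ in range(n)) / n
    return (xbar - mu) / (sigma2 / n) ** 0.5

zs = [z_sample() for _ in range(reps)]

print(statistics.mean(zs))      # close to 0, like N(0, 1)
print(statistics.variance(zs))  # close to 1, like N(0, 1)
```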