Statistical Inference – The One with the CLT

So far, we have learned about descriptive statistics, which help us describe and summarize data in meaningful ways. However, descriptive statistics do not allow us to draw conclusions beyond the data we analyzed. They are the first step of any analysis, before we even think of fitting a model to the data: an eyeball test on the summary statistics gives you a feel for the data beforehand. Then you can move on to inferential analysis using various models and estimations. Before that, it is important to know the difference between a population and a sample, and why we need both. For example, say we calculate the mean and standard deviation of GRE scores for 1000 students. The group that includes all 1000 students is called the population, and the mean and standard deviation of the population are called parameters.

Often you do not need access to the whole population. Say you are only interested in the GRE scores of students from India. It is not feasible to measure the GRE scores of every student in India, but it is possible to measure a smaller set of students as a representative of the larger population of Indian students. Hence, we use a set of 200 Indian students as a sample and generalize the results to the population from which the sample was drawn. The mean and standard deviation of the sample data are called statistics.
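As a quick illustration of the difference between parameters and statistics, here is a minimal Python sketch using simulated GRE-like scores (the numbers 310 and 12 are made up for illustration): the parameters are computed from the full population of 1000 scores, while the statistics come from a random sample of 200.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population: GRE scores of 1000 students
population = rng.normal(loc=310, scale=12, size=1000).round()

# Parameters: computed from the whole population
mu = population.mean()
sigma = population.std()

# Statistics: computed from a random sample of 200 students
sample = rng.choice(population, size=200, replace=False)
x_bar = sample.mean()
s = sample.std(ddof=1)   # ddof=1 gives the sample standard deviation

print(f"Population parameters: mu = {mu:.2f}, sigma = {sigma:.2f}")
print(f"Sample statistics:     x-bar = {x_bar:.2f}, s = {s:.2f}")
```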

[Image: clt1]

Please answer the following questions based on your observations of the picture.

  1. What would have caused the window to break?
  2. Who do you think broke the window?

Well, it is highly likely you guessed that the boy broke the window while playing with the ball. Even though the story is not spelled out completely, you figured it out from the clues in the picture. This is called inference.

Statistical Inference:

It is the procedure by which inferences about a population are made on the basis of results obtained from a sample drawn from that population.

There are three different types of inferences:

  1. Estimation – estimate the value of a population parameter using a sample
  2. Testing – perform a hypothesis test to make a decision about a population parameter
  3. Regression – model relationships between variables to make predictions and forecasts


To understand the estimation and testing types of inference, you need to know about sampling distributions and the central limit theorem. Before that, take a look at the symbols that are frequently used to denote statistics and parameters.

                     Sample Statistic      Population Parameter
Mean                 x̄ (“x-bar”)           μ
Proportion           p̂ (“p-hat”)           p
Variance             s²                    σ²
Standard deviation   s                     σ

Sample statistics are random variables, and their distribution is called a sampling distribution: the probability distribution of a sample statistic computed over all random samples of the same size from the same population. For example, in our case of GRE scores, to find the GRE scores of Indian students you would take repeated samples of students from different cities and compute the average, i.e. the mean test score, for each sample. The distribution of those sample means is the sampling distribution of the average GRE score. The picture below shows the difference between the frequency histogram of the population and the distribution of the sample mean x̄ across many random samples.

[Image: frequency histogram of the population vs. sampling distribution of the sample mean. Source: Wolfram]
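To see what a sampling distribution looks like in practice, here is a small simulation sketch (the skewed population and the sample size of 50 are arbitrary choices): it draws many random samples of the same size, records each sample's mean, and plots the population alongside the distribution of x̄.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Hypothetical right-skewed population (clearly not normal)
population = rng.exponential(scale=10.0, size=100_000)

# Draw many random samples of the same size and record each sample mean
n, num_samples = 50, 2_000
sample_means = np.array([rng.choice(population, size=n).mean()
                         for _ in range(num_samples)])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(population, bins=60)
ax1.set_title("Population (skewed)")
ax2.hist(sample_means, bins=40)
ax2.set_title(f"Sampling distribution of x-bar (n={n})")
plt.tight_layout()
plt.show()
```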

Central Limit Theorem (CLT): as the sample size ‘n’ gets large enough, the sampling distribution of the mean becomes approximately normal regardless of the shape of the population.

Now we move on to the central limit theorem and how it is used for statistical inference from a sampling distribution. The larger the sample size, the closer the sampling distribution of the mean gets to a normal distribution, and many statisticians use the guideline of ‘n’ greater than or equal to 30, beyond which the sampling distribution of the mean will be approximately normal regardless of the shape of the original (population) distribution. For a population with mean μ and standard deviation σ, the central limit theorem tells us that:

  1. The sampling distribution of the sample mean is approximately normal.
  2. The mean of the x-bars is μ.
  3. The standard deviation of the x-bars is σ/√n, so the z-score of a sample mean is Z = (x-bar – μ)/(σ/√n).

  σ/√n, the standard error (SE), is the standard deviation of the sampling distribution.
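You can check the second and third properties numerically. This is a minimal sketch assuming a hypothetical normal population with μ = 100 and σ = 15 and samples of size n = 36: the mean of the simulated x-bars should come out close to μ, and their standard deviation close to the standard error σ/√n.

```python
import numpy as np

rng = np.random.default_rng(1)

mu, sigma, n = 100.0, 15.0, 36                       # hypothetical population parameters
samples = rng.normal(mu, sigma, size=(10_000, n))    # 10,000 samples of size n
x_bars = samples.mean(axis=1)                        # one mean per sample

print("Mean of x-bars:          ", x_bars.mean())        # close to mu
print("SD of x-bars:            ", x_bars.std())          # close to sigma/sqrt(n)
print("Standard error sigma/√n: ", sigma / np.sqrt(n))    # 15/6 = 2.5
```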

The formulas can look intimidating, but they are less work than they appear. Here is how this differs from what we learned for the population.

For a population distribution (descriptive statistics), you have a normally distributed X with mean μ and standard deviation σ. We calculate the probability of falling above or below a value of X using the z-table, where Z = (x – μ)/σ.

For a sampling distribution (inferential statistics), you have a normally distributed X with mean μ and standard deviation σ. We calculate the probability that a sample of size n (from the population) has a mean above or below a value of x-bar using the z-table, where Z = (x-bar – μ)/(σ/√n).
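Here is a worked example of the two formulas, using hypothetical GRE-like numbers (μ = 310, σ = 12, a cut-off of 313, and a sample of n = 36); it uses scipy's normal CDF in place of a printed z-table.

```python
from math import sqrt
from scipy.stats import norm

mu, sigma = 310, 12      # hypothetical population mean and standard deviation
x = 313                  # value of interest
n = 36                   # sample size

# Single observation: Z = (x - mu) / sigma
z_individual = (x - mu) / sigma                  # 0.25
p_individual = 1 - norm.cdf(z_individual)        # about 0.40

# Sample mean: Z = (x-bar - mu) / (sigma / sqrt(n))
z_sample_mean = (x - mu) / (sigma / sqrt(n))     # 1.5
p_sample_mean = 1 - norm.cdf(z_sample_mean)      # about 0.07

print(f"P(X > {x})     = {p_individual:.3f}")    # one student scoring above 313
print(f"P(x-bar > {x}) = {p_sample_mean:.3f}")   # a sample of 36 averaging above 313
```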

Example: Tossing a fair coin ‘n’ times

You can observe that x-bar is approximately normally distributed for larger ‘n’.

[Image: clt5]
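The coin example is easy to reproduce. The sketch below simulates 5,000 repetitions of tossing a fair coin n times for a few values of n (5, 30 and 200 are arbitrary choices) and plots the resulting x-bars; the histograms look increasingly bell-shaped as n grows.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, n in zip(axes, [5, 30, 200]):
    # 5,000 experiments: each tosses a fair coin n times and records the mean (proportion of heads)
    x_bars = rng.binomial(n, 0.5, size=5_000) / n
    ax.hist(x_bars, bins=25)
    ax.set_title(f"x-bar of {n} tosses")
plt.tight_layout()
plt.show()
```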

Given a population with a finite mean μ and a finite non-zero variance σ², the sampling distribution of the mean approaches a normal distribution with a mean of μ and a variance of σ²/n as the sample size n increases.

Estimation:

  1. Point estimation – calculating a single value of a statistic from the sample data is called point estimation. For example, the sample mean x̄ is a point estimate of the population mean μ. Similarly, the sample proportion p̂ is a point estimate of the population proportion p.
  2. Interval estimation – calculating an interval of possible values of an unknown population parameter is called interval estimation. It is generally defined by two numbers between which the population parameter is said to lie. For example, a < μ < b is an interval estimate of the population mean μ; it indicates that the population mean is greater than a and less than b. (A code sketch of both types follows this list.)
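Here is a minimal sketch contrasting the two types of estimate, assuming a hypothetical sample of 200 GRE-like scores; the interval uses the standard error from the previous section together with the conventional 1.96 multiplier for a roughly 95% interval.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical sample of 200 GRE-like scores
sample = rng.normal(310, 12, size=200)
n = len(sample)

# Point estimate: a single number
x_bar = sample.mean()

# Interval estimate: a range a < mu < b built from the standard error
se = sample.std(ddof=1) / np.sqrt(n)
a, b = x_bar - 1.96 * se, x_bar + 1.96 * se   # roughly 95% interval

print(f"Point estimate:    x-bar = {x_bar:.2f}")
print(f"Interval estimate: {a:.2f} < mu < {b:.2f}")
```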

Confidence Interval:

The idea behind confidence intervals is that the sample mean alone is not enough to estimate the population mean. The sample mean is a single point, and by itself it gives no indication of how good the estimate of the population mean is. To assess the accuracy of the estimate, we use confidence intervals, which tell us how precise our estimation is. Statisticians use a confidence interval to express the precision and uncertainty associated with a particular sampling method. A confidence interval consists of three parts:

  • A confidence level – describes the uncertainty of the sampling method
  • A statistic – any sample statistic, usually the sample mean
  • A margin of error – the range of values above and below the sample statistic


For example, consider an election exit poll reporting that a candidate will receive 40% of the vote, with a 5% margin of error at a 95% confidence level. It can be read as: “We are 95% confident that the candidate will receive between 35% and 45% of the vote.”
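You can reproduce the exit-poll arithmetic with the usual margin-of-error formula for a proportion, z*·√(p̂(1 − p̂)/n). The sample size of 400 respondents is my assumption (the article does not state it); it happens to give a margin of error of roughly 5%.

```python
from math import sqrt

p_hat = 0.40        # reported share of the vote (sample proportion)
n = 400             # assumed number of poll respondents (not stated in the article)
z_star = 1.96       # z-value for a 95% confidence level

margin_of_error = z_star * sqrt(p_hat * (1 - p_hat) / n)
lower, upper = p_hat - margin_of_error, p_hat + margin_of_error

print(f"Margin of error: ±{margin_of_error:.1%}")              # roughly ±5%
print(f"95% confidence interval: {lower:.1%} to {upper:.1%}")  # roughly 35% to 45%
```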

Points to consider:

  • Statistical inference is the procedure by which inferences about a population are made on the basis of results obtained from a sample
  • A sampling distribution is the distribution of sample means rather than of the individual data values
  • The standard deviation of the sampling distribution is called the standard error
  • As the sample size increases, the sampling distribution becomes approximately normal, as per the central limit theorem
  • The sample mean is the point estimate, and the sample mean +/- margin of error gives the confidence interval or interval estimate