# Statistical Inference – The One with the CLT

So far, we have learned about descriptive statistics, which help describe and summarize data in meaningful ways. However, they do not allow us to draw conclusions beyond the data we analyzed. Descriptive statistics are the first step in any analysis, before we even think of fitting a model to the data. An eyeball test on the summary statistics gives you a feel for the data beforehand. Then you can move on to inferential analysis using various models and estimations. Prior to that, it is important to know the difference between a population and a sample, and why we need both. For example, say we calculate the mean and standard deviation of GRE scores for 1000 students. If those 1000 students are the entire group we care about, they form the population. The mean and standard deviation of a population are called parameters.

Often you cannot access the whole population. Say you are interested in the GRE scores of all students from India. It is not feasible to measure the GRE scores of every student in India, but it is possible to measure a smaller set of students that is representative of the larger population of Indian students. Hence, we use a set of 200 Indian students as a sample and generalize the results to the population from which the sample was drawn. The mean and standard deviation of the sample data are called statistics.
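To make the parameter versus statistic distinction concrete, here is a minimal Python sketch. The scores are synthetic (randomly generated, not real GRE data), and the score distribution, seed, and variable names are assumptions chosen only for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the illustration is reproducible

# Hypothetical population: 1000 synthetic scores (not real GRE data).
population = [random.gauss(310, 10) for _ in range(1000)]

# Parameters describe the whole population.
mu = statistics.mean(population)
sigma = statistics.pstdev(population)   # population standard deviation

# Statistics describe a sample drawn from that population.
sample = random.sample(population, 200)
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)            # sample standard deviation (divides by n - 1)

print(f"parameters: mu = {mu:.2f}, sigma = {sigma:.2f}")
print(f"statistics: x_bar = {x_bar:.2f}, s = {s:.2f}")
```

The two pairs of numbers should come out close but not identical, which is exactly why we treat the sample values as estimates of the population values.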

1. What would have caused the window to break?
1. Who do you think would have broken the window?

Well, it is highly likely you guessed that the boy broke the window while playing with the ball. Even though it is not completely explained in the story, you figured out something from the clues in the picture. This is called inference.

#### Statistical Inference:

It is the procedure where inference about a population is made on the basis of results obtained from the sample drawn from that population.

There are three different types of inferences:

1. Estimation – Estimate the value of a population parameter using a sample
1. Testing – Perform a test to make a decision about population parameters
1. Regression – Make predictions and forecasts about a statistic

In order to understand the estimation and testing types of inference, you need to know about sampling distributions and the central limit theorem. Before that, look at the symbols that are frequently used to denote statistics and parameters.

| | Sample Statistic | Population Parameter |
|---|---|---|
| Mean | x̄ (“x-bar”) | μ |
| Proportion | p̂ (“p-hat”) | p |
| Variance | s² | σ² |
| Standard Deviation | s | σ |

Sample statistics are random variables, and their distribution is called a sampling distribution: the probability distribution of a sample statistic over all random samples of the same size from the same population. For example, in our case of GRE scores, in order to estimate the average GRE score of Indian students, you would take repeated random samples of students from different cities and then compute the average, i.e. the mean test score, for each sample. The distribution of those sample means would give you the sampling distribution of the average GRE score. The picture below shows the difference between the frequency histogram of the population and the distribution of the sample mean x̄ across various random samples.
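The repeated-sampling idea can be sketched in a few lines of Python. This is only an illustration: the population is synthetic and deliberately skewed, and the helper `sample_means` is a hypothetical name, not something from a library.

```python
import random
import statistics

random.seed(0)

# Hypothetical, deliberately skewed population (exponential, mean about 10).
population = [random.expovariate(1 / 10) for _ in range(50_000)]

def sample_means(data, n, repeats):
    """Draw `repeats` random samples of size n and return each sample's mean."""
    return [statistics.mean(random.sample(data, n)) for _ in range(repeats)]

means = sample_means(population, n=50, repeats=2_000)

print(f"population mean:      {statistics.mean(population):.2f}")
print(f"mean of sample means: {statistics.mean(means):.2f}")
print(f"population std dev:   {statistics.pstdev(population):.2f}")
print(f"std dev of the means: {statistics.stdev(means):.2f}")
```

The sample means should center on the population mean while being far less spread out than the raw data. That narrower distribution of means is the sampling distribution.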

Central Limit Theorem (CLT): As the sample size ‘n’ gets large enough, the sampling distribution of the mean becomes almost normal regardless of the shape of the population.

Now we move on to the central limit theorem and how we use it for statistical inference. By the law of large numbers, the larger the sample size, the closer the sample mean tends to be to the population mean. The central limit theorem goes further and describes the shape of the sampling distribution: many statisticians use the guideline of ‘n’ greater than or equal to 30, beyond which the sampling distribution of the mean will be approximately normal regardless of the shape of the original distribution, i.e. the population. If the population itself is normal, the sampling distribution of the mean is normal for any ‘n’. For a population with mean μ and standard deviation σ, the central limit theorem tells us:

1. The sampling distribution of x-bar is (approximately) a normal distribution.
2. The mean of the x-bars is μ.
3. The standard deviation of the x-bars is σ/√n, so the z score of a sample mean is Z = (x-bar – μ)/(σ/√n).

σ/√n – The Standard error (SE) is the standard deviation of the sampling distribution.

The formulas can easily freak you out and look like a lot of work when they are not. Let me spell out how this differs from what we did for a population.

For a population distribution (descriptive statistics), you had a normally distributed X with mean μ and standard deviation σ. We calculated the probability of a value above or below a given x using the z-table, where Z = (x – μ)/σ.

For a sampling distribution (inferential statistics), you again have a normally distributed X with mean μ and standard deviation σ. Now we calculate the probability that a sample of size n (from the population) has a mean above or below a given value of x-bar, using the z-table with Z = (x-bar – μ)/(σ/√n).
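Here is a small worked comparison of the two z formulas, Z = (x – μ)/σ for a single value and Z = (x-bar – μ)/(σ/√n) for a sample mean. All the numbers (μ, σ, n, and the cutoff) are hypothetical, and `NormalDist` from Python's standard library stands in for the z-table.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical values, chosen only for illustration.
mu, sigma = 310, 10     # population mean and standard deviation
n = 25                  # sample size
cutoff = 313            # value of interest

std_normal = NormalDist()  # Z distribution: mean 0, standard deviation 1

# Single observation: Z = (x - mu) / sigma
z_single = (cutoff - mu) / sigma
p_single = 1 - std_normal.cdf(z_single)

# Sample mean: Z = (x_bar - mu) / (sigma / sqrt(n))
standard_error = sigma / sqrt(n)
z_mean = (cutoff - mu) / standard_error
p_mean = 1 - std_normal.cdf(z_mean)

print(f"P(X > {cutoff})     = {p_single:.4f}  (z = {z_single:.2f})")
print(f"P(x-bar > {cutoff}) = {p_mean:.4f}  (z = {z_mean:.2f}, SE = {standard_error:.2f})")
```

Under these assumed numbers, a single score above 313 is fairly common (about 38%), while a sample mean above 313 is rare (about 7%), because the standard error σ/√n is much smaller than σ.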

Example: Throwing a fair coin ‘n’ times

You can observe that x-bar, the proportion of heads, is approximately normally distributed for larger ‘n’.
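If you want to see this for yourself, the following sketch simulates the coin tosses and prints a rough text histogram of x-bar (the proportion of heads) for a few sample sizes. The sample sizes, the number of experiments, and the helper name `heads_proportions` are assumptions made for illustration.

```python
import random
from collections import Counter

random.seed(1)

def heads_proportions(n, experiments=10_000):
    """Toss a fair coin n times per experiment; return the proportion of heads in each."""
    return [sum(random.random() < 0.5 for _ in range(n)) / n
            for _ in range(experiments)]

for n in (2, 10, 50):
    props = heads_proportions(n)
    counts = Counter(props)
    print(f"\nn = {n}")
    for value in sorted(counts):
        bar = "#" * (counts[value] * 100 // len(props))  # one '#' per 1% of experiments
        print(f"  x-bar = {value:.2f} {bar}")
```

For n = 2 the histogram has only three blocky bars, while for n = 50 the bars should trace out a bell shape centered near 0.5.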

Given a population with a finite mean μ and a finite non-zero variance σ², the sampling distribution of the mean approaches a normal distribution with a mean of μ and a variance of σ²/N as N, the sample size, increases.

#### Estimation:

1. Point Estimation – Calculating a single value of a statistic from the sample data is called point estimation. For example, the sample mean x-bar is a point estimate of the population mean μ. Similarly, the sample proportion p-hat is a point estimate of the population proportion p.
1. Interval Estimation – Calculating an interval of plausible values for an unknown population parameter is called interval estimation. It is generally defined by two numbers, between which the population parameter is said to lie. For example, a < μ < b is an interval estimate of the population mean μ. It indicates that the population mean is estimated to be greater than a and less than b.

#### Confidence Interval:

The idea behind confidence intervals is that it is not enough to use just the sample mean to estimate the population mean. The sample mean by itself is a single point. This does not give people any idea of how good your estimate of the population mean is. To assess the accuracy of the estimate, we use confidence intervals, which tell us how good our estimate is. Statisticians use a confidence interval to express the precision and uncertainty associated with a particular sampling method. A confidence interval consists of three parts.

• A confidence level – describes the uncertainty of a sampling method
• A statistic – any statistic, usually the sample mean
• A margin of error – the range of values above and below the sample statistic

For example, consider an election exit poll which reports that a candidate will receive 40% of the vote. The exit poll reports a 5% margin of error and a confidence level of 95%. It can be read as “We are 95% confident that the candidate will receive between 35% and 45% of the vote”.
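To see where a margin of error like this can come from, here is a rough sketch using the usual normal-approximation interval for a proportion, p-hat ± z* × √(p-hat(1 – p-hat)/n). The poll's sample size is not given above, so `n = 369` is a hypothetical value picked because it yields a margin of roughly 5%.

```python
from math import sqrt
from statistics import NormalDist

p_hat = 0.40          # sample proportion reported by the poll
confidence = 0.95     # confidence level
n = 369               # hypothetical number of respondents (not given in the text)

# Critical z value that leaves (1 - confidence) / 2 in each tail (about 1.96 for 95%).
z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

standard_error = sqrt(p_hat * (1 - p_hat) / n)
margin_of_error = z_star * standard_error

lower, upper = p_hat - margin_of_error, p_hat + margin_of_error
print(f"z* = {z_star:.2f}, margin of error = {margin_of_error:.3f}")
print(f"{confidence:.0%} confidence interval: {lower:.1%} to {upper:.1%}")
```

Try changing `n`: a larger sample shrinks the standard error, so the same confidence level gives a narrower interval.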

Points to consider:

• Statistical inference is the procedure where inference about a population is made on the basis of results obtained from the sample
• A sampling distribution is the distribution of a sample statistic, such as the mean, across repeated samples, rather than the distribution of the raw data
• The standard deviation of the sampling distribution is called the Standard Error
• As the sample size increases, the sampling distribution of the mean becomes almost normal as per the Central Limit Theorem
• Sample mean is the point estimate and Sample mean +/- margin of error is the confidence interval or interval estimate

# Normal! Normal! – The One with all the Distributions

From my previous post, you would have understood the basics of the probability of an event. When the frequency of an event is divided by the total number of observed outcomes, you get the probability of that event. Let us start with different kinds of probabilities, like binomial. What are some examples of binomial probabilities that come to mind? Think of any event that has a dichotomous outcome. Yes. Tossing a coin ‘n’ times. Asking 100 people if they vote or not. As I mentioned, any event with a binary outcome like Head or Tail / Yes or No is binomial in nature.

If ‘p’ is the probability of success and ‘q’ is the probability of failure (so q = 1 – p) in a binomial experiment of ‘n’ trials, then the expected number of successes, i.e. the mean value of the binomial distribution, is np.
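As a quick check of the np rule, the sketch below builds the binomial distribution from its standard formula, P(k successes) = C(n, k) × p^k × q^(n – k), and computes the mean directly. The values n = 10 and p = 0.5 are hypothetical (ten tosses of a fair coin), and `binomial_pmf` is just an illustrative helper name.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    q = 1 - p
    return comb(n, k) * p**k * q**(n - k)

n, p = 10, 0.5   # hypothetical: 10 tosses of a fair coin

pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
mean = sum(k * prob for k, prob in enumerate(pmf))

print(f"sum of probabilities = {sum(pmf):.4f}")
print(f"mean from the distribution = {mean:.4f}, n * p = {n * p}")
```

The probabilities add up to 1, and the mean computed from the distribution matches n × p.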

Points to consider:

• Binomial outcomes are mutually exclusive
• Variables are represented as “counts” of success or failure
• The type of variable is discrete
• The graph resembles a histogram

Now that we know how to identify the distribution for a discrete random variable, we can move towards finding the probabilities and distribution for a continuous random variable, the normal. We say a distribution is normal if the values fall into a smooth continuous curve with a symmetric bell shape, with no skewness and no excess kurtosis.

I am sure you have seen something like this before: a curve diagram with all the z scores and t scores. Before trying to understand the intricacies of normal data, let us first understand what the word “normal” means in this context. Yes, we all know that the curve should be symmetrical, that it should have a bell shape, and so on. We have all learned that from various sources of information. But what does it actually mean?

We have looked at the frequency of certain events occurring in both the discrete and the continuous sense. We have various sources of data, from natural events like measuring height and weight to man-made events like analyzing financial data. Normality is when values near the average of the data, i.e. the mean, occur most frequently, and values farther away from the mean occur less and less frequently. In short, most of the data is centered around the mean. With the mean at the center, a smaller standard deviation results in a taller curve with narrower tails, and a larger standard deviation results in a flatter curve with wider tails. Hence the standard deviation defines the overall shape of the curve.
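A few lines of Python can show how the standard deviation changes the shape of the curve. The three σ values here are hypothetical, and `NormalDist` comes from Python's standard `statistics` module.

```python
from statistics import NormalDist

mu = 0
for sigma in (0.5, 1, 2):          # hypothetical standard deviations
    dist = NormalDist(mu, sigma)
    peak = dist.pdf(mu)            # height of the curve at the mean
    within_one = dist.cdf(mu + 1) - dist.cdf(mu - 1)  # area between mu - 1 and mu + 1
    print(f"sigma = {sigma}: peak height = {peak:.3f}, "
          f"area within +/-1 of the mean = {within_one:.3f}")
```

As σ grows, the peak gets lower and less of the area sits close to the mean, which is the "flatter and wider" curve described above.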

One of the most popular normal distributions is the Z distribution, which has a mean of zero and a standard deviation of 1; the area under its curve adds up to 1. A value on the Z-distribution signifies the number of standard deviations the data is above or below the mean; these values are called z scores. For example, z = 1 on the Z-distribution represents a value that is one standard deviation above the mean. Similarly, z = –1 represents a value that is one standard deviation below the mean.

So far, we have discussed only the probability of a single value. But more often, we need the probability that a value falls at or below some point, or within a range of values. The probability of falling at or below a given value is called cumulative probability. To find the probability of a range, you first identify the z scores at its ends and look up the matching cumulative probabilities in the Z table [Refer: Z Normal table ]. A z score of -1.0 gives a cumulative probability of 0.1587 and a z score of 0 gives a cumulative probability of 0.50. Hence the probability between two z scores is the difference between the higher and lower cumulative probabilities. For our example, the probability of Z between -1 and 0 is 0.50 – 0.1587 = 0.3413.

You will come across this many times: a range of values is often written as P (-1 <= Z <= 1), or between -1σ and 1σ.

P (-1 <= Z <= 1) = 68% probability, which is the sum of the areas from -1 to 0 and from 0 to 1 (each calculated above as 0.3413).
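If you do not have a Z table handy, the same lookups can be sketched with Python's standard library. Here `NormalDist().cdf(z)` plays the role of the table, returning the cumulative probability P(Z <= z); the variable names are just illustrative.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

p_below_minus1 = z.cdf(-1.0)   # cumulative probability P(Z <= -1)
p_below_0 = z.cdf(0.0)         # cumulative probability P(Z <= 0)

# Probability of a range = higher cumulative probability minus the lower one.
p_minus1_to_0 = p_below_0 - p_below_minus1
p_minus1_to_1 = z.cdf(1.0) - z.cdf(-1.0)

print(f"P(Z <= -1)      = {p_below_minus1:.4f}")
print(f"P(Z <= 0)       = {p_below_0:.4f}")
print(f"P(-1 <= Z <= 0) = {p_minus1_to_0:.4f}")
print(f"P(-1 <= Z <= 1) = {p_minus1_to_1:.4f}")
```

The last line is the familiar "about 68% of values fall within one standard deviation of the mean".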

Points to consider:

• Normal outcomes are mutually exclusive
• Variables are measurements of an event
• The type of variable is continuous
• The graph resembles a bell curve
• Values are converted to z scores, and normal z tables are used to find areas