Statistical Inference – The One with the CLT

So far, we have learned about descriptive statistics, which help us describe and summarize data in meaningful ways. However, they do not allow us to draw conclusions beyond the data we analyzed. Descriptive statistics is the first step of any analysis, before we even think of fitting a model to the data. An eyeball test of the summary statistics gives you a feel for the data beforehand. Then you can move on to inferential analysis using various models and estimations. Before that, it is important to know the difference between a population and a sample, and why we need both. For example, say we calculate the mean and standard deviation of GRE scores for 1000 students. The group that includes all 1000 students is the population, and the mean and standard deviation of the population are called parameters.

Often you would not need access to the whole population because you are only interested in, say, the GRE scores of students from India. It may not be feasible to measure the GRE scores of all students in India, but it is possible to measure a smaller set of students as a representative of the larger population of Indian students. Hence, we use a set of 200 Indian students as a sample and generalize the results to the population from which the sample was drawn. The mean and standard deviation of the sample data are called statistics.

[Image: clt1 – a boy with a ball beside a broken window]

Please answer the following questions based on your observations of the picture.

  1. What do you think caused the window to break?
  2. Who do you think broke the window?

Well, it is highly likely you guessed that the boy broke the window while playing with the ball. Even though the story is not completely spelled out, you figured out something from the clues in the picture. This is called inference.

Statistical Inference:

It is the procedure where inference about a population is made on the basis of results obtained from the sample drawn from that population.

There are three different types of inferences:

  1. Estimation – estimate the value of a population parameter using a sample
  2. Testing – perform a test to make a decision about a population parameter
  3. Regression – make predictions and forecasts about a statistic


In order to understand the estimation and testing types of inference, you need to know about sampling distributions and the central limit theorem. Before that, review the symbols that are frequently used to denote statistics and parameters.

                      Sample Statistic     Population Parameter
Mean                  x̄  ("x-bar")         μ
Proportion            p̂  ("p-hat")         p
Variance              s²                   σ²
Standard Deviation    s                    σ

Sample statistics are random variables, and they have a distribution called a sampling distribution: the probability distribution of a sample statistic computed over all random samples of the same size from the same population. For example, to study the GRE scores of Indian students, you would take repeated samples of students from different cities and compute the average (mean) test score for each sample. The distribution of those sample means is the sampling distribution of the average GRE score. The picture below shows the difference between the frequency histogram of the population and the distribution of the sample mean x̄ across various random samples.

Source: Wolfram

Central Limit Theorem (CLT): As the sample size n gets large enough, the sampling distribution of the mean becomes approximately normal regardless of the shape of the population.

Now we move on to the central limit theorem and how it is used for statistical inference: it is what justifies drawing inferences from a sampling distribution. By the law of large numbers, the larger the sample size, the closer the sampling distribution gets to normal, and many statisticians use the guideline of n ≥ 30, beyond which the sampling distribution of the mean is approximately normal regardless of the shape of the original (population) distribution. For a population with mean μ and standard deviation σ, the central limit theorem says:

  1. The sampling distribution of the mean is (approximately) a normal distribution
  2. The mean of the x̄'s is μ
  3. The standard deviation of the x̄'s is σ/√n, so the corresponding z-score is Z = (x̄ – μ)/(σ/√n)

  σ/√n – the standard error (SE) – is the standard deviation of the sampling distribution.

The formulas can easily freak you out and look like a lot of work when they're not. Let me clearly spell out how this differs from what we learned for a population.

For a population distribution (descriptive statistics), you have a normally distributed X with mean μ and standard deviation σ. We calculated the probability of a value falling above or below a given x using the z-table, where Z = (x – μ)/σ.

For a sampling distribution (inferential statistics), you draw a sample of size n from that population, and the sample mean x̄ is approximately normally distributed with mean μ and standard deviation σ/√n. We calculate the probability of the sample mean falling above or below a given value of x̄ using the z-table, where Z = (x̄ – μ)/(σ/√n).
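
As a minimal sketch of the difference (with made-up values for μ, σ, the cutoff score and n, and assuming scipy is installed), the same cutoff gives very different probabilities for a single observation versus a sample mean:

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical GRE population: mean 310, standard deviation 12 (illustrative numbers only)
mu, sigma = 310, 12
cutoff = 320

# One student: P(X > 320) uses Z = (x - mu) / sigma
z_one = (cutoff - mu) / sigma
p_one = 1 - norm.cdf(z_one)

# Mean of a sample of n = 36 students: P(x-bar > 320) uses Z = (x-bar - mu) / (sigma / sqrt(n))
n = 36
se = sigma / sqrt(n)             # standard error of the mean
z_mean = (cutoff - mu) / se
p_mean = 1 - norm.cdf(z_mean)

print(f"P(X > {cutoff}) for one student       = {p_one:.4f}")
print(f"P(x-bar > {cutoff}) for a sample of {n} = {p_mean:.6f}")  # far smaller: means vary less
```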

 Example: Throwing a fair coin ‘n’ times

You can observe that x̄ is approximately normally distributed for larger n.

[Figure: clt5 – distribution of x̄ for coin tosses with increasing n]

Given a population with a finite mean μ and a finite non-zero variance σ², the sampling distribution of the mean approaches a normal distribution with mean μ and variance σ²/n as the sample size n increases.
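
A quick simulation in the spirit of the coin example above (numpy assumed; the number of repeated samples is an arbitrary choice) shows the mean of the x̄'s staying near μ while their spread shrinks like σ/√n:

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_means(n, n_samples=10_000):
    """Toss a fair coin n times, record the proportion of heads (x-bar), repeat n_samples times."""
    flips = rng.integers(0, 2, size=(n_samples, n))  # 0 = tails, 1 = heads
    return flips.mean(axis=1)

for n in (2, 10, 30, 100):
    xbars = sample_means(n)
    # The mean of the x-bars stays near mu = 0.5; their spread shrinks like sigma / sqrt(n)
    print(f"n = {n:>3}: mean of x-bar = {xbars.mean():.3f}, "
          f"SD of x-bar = {xbars.std():.3f}, sigma/sqrt(n) = {0.5 / np.sqrt(n):.3f}")
```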

Estimation:

  1. Point estimation – Calculating a single value of a statistic from the sample data is called point estimation. For example, the sample mean x̄ is a point estimate of the population mean μ. Similarly, the sample proportion p̂ is a point estimate of the population proportion p.
  2. Interval estimation – Calculating an interval of possible values for an unknown population parameter is called interval estimation. It is generally defined by two numbers between which the parameter is said to lie. For example, a < μ < b is an interval estimate of the population mean μ; it indicates that the population mean is greater than a and less than b.

Confidence Interval:

The idea behind confidence intervals is that the sample mean alone is not enough to estimate the population mean. The sample mean by itself is a single point, and it does not tell anyone how good your estimate of the population mean is. To assess the accuracy of the estimate, we use confidence intervals, which tell us how good the estimation is. Statisticians use a confidence interval to express the precision and uncertainty associated with a particular sampling method. A confidence interval consists of three parts:

  • A confidence level – describes the uncertainty of the sampling method
  • A statistic – any statistic, usually the sample mean
  • A margin of error – the range of values above and below the sample statistic


For example, consider an election exit poll that reports a candidate will receive 40% of the vote, with a 5% margin of error at a 95% confidence level. This can be read as: "We are 95% confident that the candidate will receive between 35% and 45% of the vote".
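
As a rough sketch of where such numbers come from (the sample size of 400 respondents is an assumption, not part of the example), the margin of error for a proportion at 95% confidence is about 1.96 standard errors:

```python
from math import sqrt
from scipy.stats import norm

p_hat = 0.40     # the candidate's share in the exit poll
n = 400          # assumed number of respondents (not given in the example)

z = norm.ppf(0.975)                  # ~1.96 for a 95% confidence level
se = sqrt(p_hat * (1 - p_hat) / n)   # standard error of the sample proportion
margin = z * se                      # margin of error

print(f"margin of error = {margin:.3f}")                        # about 0.048, close to the quoted 5%
print(f"95% CI: {p_hat - margin:.3f} to {p_hat + margin:.3f}")  # roughly 0.35 to 0.45
```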

Points to consider:

  • Statistical inference is the procedure where inference about a population is made on the basis of results obtained from the sample
  • A sampling distribution is the distribution of sample means rather than of the individual data values
  • The standard deviation of sampling distribution is called the Standard Error
  • As the sample size increases, the sampling distribution becomes approximately normal, as per the Central Limit Theorem
  • Sample mean is the point estimate and Sample mean +/- margin of error is the confidence interval or interval estimate

Normal! Normal! – The One with all the Distributions

From my previous post, you would have understood the basics of the probability of an event: when the frequency of an event is divided by the total number of events, you get its probability. Let us start with one particular kind of probability, the binomial. What examples of binomial probabilities come to mind? Think of any event that has a dichotomous outcome. Yes: tossing a coin n times, or asking 100 people whether or not they vote. As I mentioned, any event with a binary outcome like Head or Tail / Yes or No is binomial in nature.

If ‘p’ is the probability of success of an event and ‘q’ is the probability of failure in a binomial experiment of ‘n’ trials, then the expected number of successes, i.e. the mean of the binomial distribution, is np.
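
A small illustrative check with scipy (a fair coin tossed 10 times, so p = q = 0.5) confirms that the mean of the binomial distribution is np:

```python
from scipy.stats import binom

n, p = 10, 0.5   # toss a fair coin 10 times; p = probability of heads

print("expected number of heads (np):", n * p)
print("mean of the binomial distribution:", binom.mean(n, p))  # agrees: 5.0

# Probability of exactly k heads for a few values of k
for k in (0, 5, 10):
    print(f"P(X = {k}) = {binom.pmf(k, n, p):.4f}")
```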

Points to consider:

  • Binomial outcomes are mutually exclusive
  • Variables are represented as “counts” of success or failure
  • The type of variable is discrete
  • The graph resembles a histogram

Now that we know how to identify the distribution for a discrete random variable, we can move on to finding the probabilities and distribution for a continuous random variable: the normal. We say a distribution is normal if the values fall into a smooth, continuous, symmetric bell-shaped curve with no skewness and no excess kurtosis.

Source: MIT EDU
Normal Distribution

I am sure you have seen a curve diagram like this before, with all the z-scores and t-scores. Before trying to understand the intricacies of normal data, let us first understand what the word "normal" means in this context. Yes, we all know that the curve should be symmetrical, that it should have a bell shape, and so on; we have all learned that from various sources. But what does it actually mean?

We have discussed the frequency with which certain events occur, in both the discrete and the continuous sense, and we have data from all kinds of sources, from natural measurements like height and weight to man-made events like financial transactions. Normality means that values near the average of the data, the mean, occur most frequently, while values farther away from the mean occur less and less often. In short, most of the data is centered around the mean. With the mean at the center, a smaller standard deviation results in a taller, narrower-tailed curve and a larger standard deviation results in a flatter, wider-tailed curve. Hence the standard deviation defines the overall shape of the curve.

One popular normal distribution is the Z-distribution (the standard normal), which has a mean of zero, a standard deviation of 1, and a total area under the curve of 1. A value on the Z-distribution signifies the number of standard deviations the data is above or below the mean; these are called z-scores. For example, z = 1 on the Z-distribution represents a value that is 1 standard deviation above the mean. Similarly, z = –1 represents a value that is one standard deviation below the mean.

So far, we have discussed only the probability attached to a single value. More often, there is a need to find the probability that a value falls anywhere up to (or between) certain points; this is called cumulative probability. To find the probability of a range of values, you first identify the z-scores and look up the matching cumulative probabilities in the Z table [Refer: Z Normal table]. A z-score of -1.0 gives a cumulative probability of 0.1587 and a z-score of 0 gives a cumulative probability of 0.50. The probability between two z-scores is therefore the difference of the higher and lower cumulative probabilities; for our example, the probability of Z falling between -1 and 0 is 0.3413.

Normal Distribution with Z scores

You will come across this many times: a range of values is often represented as P(-1 <= Z <= 1), i.e. between -1σ and 1σ.

P(-1 <= Z <= 1) = 68% probability, which is the sum of the areas from -1 to 0 and from 0 to 1 (each equal to the 0.3413 calculated above).
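
These table lookups can also be reproduced with a statistics library; here is a minimal sketch using scipy's standard normal CDF:

```python
from scipy.stats import norm

p_minus1 = norm.cdf(-1.0)   # ≈ 0.1587, the area to the left of z = -1
p_zero   = norm.cdf(0.0)    # 0.5000
p_plus1  = norm.cdf(1.0)    # ≈ 0.8413

print(f"P(-1 <= Z <= 0) = {p_zero - p_minus1:.4f}")    # ≈ 0.3413
print(f"P(-1 <= Z <= 1) = {p_plus1 - p_minus1:.4f}")   # ≈ 0.6827, the familiar 68%
```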

 Points to consider:

  • Normal outcomes are mutually exclusive
  • Variables are measurements of an event
  • The type of variable is continuous
  • The graph resembles a bell curve
  • Convert to z-scores and use the normal z-table for areas


Big Data – What Awaits Us?

In this super-visible world of exabytes and petabytes of data, we have gained the power to see what we had never seen before. It is said that everything is quantifiable, including human emotions. Now, with a microscope-like instrument to analyze what was there all along, what awaits us? A fading bloom or a major revolution?

Considered the holy grail of the 21st century, Big Data is believed to be one of the most important tools for the future. It is not really a new concept, and the term was probably coined to create hype. Data has existed ever since humans started communicating through words. We stored unchangeable data on stone and in letters; then we advanced to computer disks and pen drives, giving more storage capacity, more processing power and reusability. In this era of social media and the Internet of Things, data has gone beyond the realm of storage on a static device: it has transformed from a static, stored form into a fluid, flowing, real-time form. This rapidly flowing, huge amount of real-time data is termed Big Data. In simple terms, Big Data refers to those large datasets that are challenging to store and analyze using traditional relational databases and queries. And it's not new: specialists in fields like banking and telecommunications have been struggling with Big Data for decades. What is new is that technologies have now emerged that offer options to exploit it.

According to IBM, we create 2.5 quintillion bytes of data every day from various sources including social media posts, financial transactions, GPS locations and more. It is also observed that nine-tenths of the data that exists today was generated in the last two years, and more data processing was done in those two years than in the previous 3000. The more information we get, the more difficult the problems we can solve.

What can we do with Big Data?

To understand and gain insights from Big Data, there is a whole set of technologies, one of which is undeniably the Hadoop ecosystem, along with ETL concepts. These technologies can handle unstructured, complex data with great scalability. Now, let's look at some scenarios where big data is especially helpful in certain industries.

  • Sentiment analysis – A focus on millions of customers reveals insights into how they think or feel about a certain product or topic. For example, big data helped in understanding the power of advertising spots in the Super Bowl based on social media data.
  • Predictive modelling – This is one of the most important capabilities of big data. Forecasts and predictions are used extensively in the banking industry. For example, an airline can predict the maximum price customers will pay and develop a pricing strategy to increase revenue.
  • Recommendations – Using big data, we can promote recommendations and product bundles based on customers' purchase histories, and make customized offers through online ads. This can enhance customer loyalty and increase sales. For example, Netflix makes custom movie recommendations using past behavior and preferences.

What can we not do with Big Data?

Even though big data can be used to do some truly amazing things, there are various scenarios where it fails or is not applicable, yet. Let's look at some cases where applying big data will end in failure.

  • Predict a certain future – Even with all those predictive modelling techniques, we can reach perhaps 90% accuracy at best. We can never reach 100% accuracy, even with more advanced machine learning, and hence predicting a definite future is impossible for now.
  • Read your mind – Big data technologies are machines that work on commands, and we cannot expect these algorithms to read our minds. They are used to understand customer behavior as closely as possible, but they cannot know for certain what a customer will purchase or how they will behave.
  • Solve non-quantifiable problems – One of the biggest use cases of big data is understanding customer behavior, but human behavior still cannot be fully quantified. A group of individuals can be analyzed and observed for trends, but it is statistically impossible for now to give a score to an individual's behavior.

Big Data – A Marketing Perspective

John Wanamaker, an American retailer, quoted “Half the money I spend on advertising is wasted; the trouble is I don’t know which half.”

With a significant initial investment, there are nearly endless opportunities for a marketer to derive insights and do better than ever. A recent analysis by eMarketer observed that spending on "data-driven marketing" has increased and that marketers are learning to leverage data to improve their approach. As per the report on data-driven advertising and marketing tactics used by U.S. B2B marketers, cross-channel, cross-device and lookalike targeting are the three most popular data-driven tactics for 2017. For example, analytics can help identify which leads will turn into revenue and save marketers from spending huge sums on low-quality leads.

Consider the winners of the Data Creativity category of the Marketing New Thinking Awards 2016: inspired by 20 years' worth of customer information about their journeys, the marketing team at EasyJet decided to create emotional stories as dynamic emails customized for every individual, telling a story from their first flight through to recommended future flights, along with some facts about their travel behavior. They achieved open rates more than 100% higher than their previous campaigns and 25% higher click rates.

These are some of the cases that took data marketing to another level. The huge success of this kind of storytelling is that the customer feels connected, special and noticed. This might be the beginning of a long-term relationship. Marketers should start thinking about using data effectively to build a customized experience in order to engage customers and drive retention. Going forward, marketers will need to cater to various sources of data, not just social media but everything from virtual reality to sensors. I believe the future of big data marketing is one where nothing is explicitly said or shared; everything is simply observed and learned.

Is it all good?

Well, big data is not all amazing and good; it has its own drawbacks. I would like to highlight here some of the major threats big data may bring to the world.

False prediction – The movie Minority Report, set in 2054, dealt with the concepts of machine learning and analyzing large data sets to predict human crimes. Pre-cogs, the super data scientists, can predict future crimes like murder along with the intricate details of who, when and where. The movie then revolves around a wrong prediction about its own head officer, and one should watch the movie to understand the impact of that.

Security – If that is what the future might bring, at present we are already facing an alarming increase in cyber crime. With more and more data, much of it personal information, how well is our data protected and encrypted? Given that much of the data-handling software is open source, like Hadoop, which is not secure by default, it is important to ensure the security of the data itself.

Facts are stubborn – To borrow the John Adams quote, "Facts are stubborn things": data analytics cannot understand creative thinking or, at the extreme, free will. The larger risk is a generation of purely logical thinkers who value analytics, which in turn values only a given set of metrics and undervalues things such as emotion and creativity.

Humans vs machines – Sooner or later we will officially be in a war of machines vs humans; we already know that automation and machine learning are taking over many of our jobs. This has the potential to do to white-collar jobs what mechanization did to blue-collar jobs in the 1950s. Consider a lab technician who analyzes thousands of data points from a genome or from cancer cells: he or she would take weeks to gather the data, analyze it and finally produce results, but this is now being done in days by machine learning algorithms. Gone are the days when banks had 15 to 20 staff such as cashiers, accountants and cheque handlers; computers have taken over those jobs, and the same will happen again to the people handling the computers, since now the computers can handle themselves.

Now we come to the question we have been trying to answer from the beginning: is Big Data a fading bloom, or can it create a major revolution? Having touched upon the definition of big data, its capabilities, its impact in my field of interest, and briefly its threats, I believe that Big Data is here to stay. Interestingly, I was recently asked about the impact of all this data analytics on everyday human life. I was asked to imagine being a pregnant girl under 18 with access to the internet and social media, and to imagine how she would feel when a retailer predicts her pregnancy and sends a personalized brochure of baby care products to her home, where her dad sees it. My mind instantly freaked out and said that's creepy. Apparently, that was not a hypothetical, but an actual incident in the US where the retail giant Target knew a girl was pregnant before her own father did.

Yes, it is creepy. That's why there is an increase in the number of ad blockers, which have become a huge hurdle for online advertising. But when I think about it in a broader sense, the concept of advertising has been there all along. When you compare the early stages of marketing and advertising through mass media, and how they opened a whole new gate to communicate with consumers and understand their likes and dislikes, with today's data-driven marketing, you end up seeing the similarities. TV ads would have been considered annoying when they were introduced, but eventually we grew past that and came to embrace and even love the ads for what they bring to the table. Similarly, the more customer data we get hold of, and the clearer and more sensible the insights, the more targeted, personalized and interesting online advertising will become. Hence, I believe there will come a time when online ads are as praised and anticipated as Super Bowl ads.

With over a decade having passed since we started using Big Data, I personally do not perceive any fading whatsoever. In fact, this year a new term was coined: Fast Data, data that is real-time and more rapid in nature. There is no stopping this; as far as I can see, more products keep popping out of that ecosystem and various job descriptions involving data analytics keep being created. Properly handled data might eliminate most of the shortcomings I mentioned above. There will be job losses to some extent, but there will also still be a need for the human touch: International Data Corporation predicts that more people with deep analytics skills will be needed in the next two years. Analytics has reached all levels of the enterprise and touched many different work environments, businesses are starting to embrace data science, and there are cars that drive themselves. Big data will continue to get bigger; we just have to learn to embrace the journey.

The Very Basics – The One with the Probabilities


On my new journey as a grad school student, one thing I realized was the importance of statistics knowledge before getting hands-on with all those interesting regressors and classifiers. A basic knowledge of statistics helps you understand the concepts and lets you see the problem and the solution inside out.

To start with, there are two different concepts of probability:
1) The frequency with which a particular event happens in the long run – this is called statistical probability
2) The degree of belief which it is reasonable to place in a proposition given the evidence – this is called inductive probability

For example,

If I toss a coin, what is the probability that it will turn up heads? Everyone will say "a half chance", assuming the coin is fair and the chances of heads or tails are equally likely. But what if we toss the coin lots of times? What will happen then? Will the coin still look fair? Well, an experiment by John Kerrich (carried out during World War II) explored these long-run relative frequencies by tossing a coin 10,000 times.

[Figure: running proportion of heads over 10,000 tosses in Kerrich's experiment]
At the end of the experiment, the count of heads was 5067 and the count of tails 4933. Even though there are fluctuations at the beginning, the graph settles down as the number of tosses grows: the fluctuations gradually die out and the proportion approaches one half, i.e. equally likely. This limiting value is the "statistical" probability of heads that you answered with in the previous question. If an event A occurs n(A) times in n trials, the proportion of times the event has occurred is n(A)/n, which is P(A), its probability. This is an empirical or experimental approach to probability: we can never know the probability of an event with certainty, and the results follow the law of large numbers.
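
A short simulation in the spirit of Kerrich's experiment (numpy assumed, random seed chosen arbitrarily) shows the running proportion of heads settling near one half:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toss a fair coin 10,000 times and track the running proportion of heads
tosses = rng.integers(0, 2, size=10_000)                   # 1 = heads, 0 = tails
running = tosses.cumsum() / np.arange(1, tosses.size + 1)  # proportion after each toss

for n in (10, 100, 1_000, 10_000):
    print(f"after {n:>5} tosses: proportion of heads = {running[n - 1]:.4f}")
```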

Most of the events we deal with in probability, or in any analysis in general, are numerical. Numerical variables which take different values with different probabilities are known as random variables. Variables such as the number of students in each class, which result from counting and can take only values like 1, 2 or 25, are discrete variables; variables such as the weight of the students in the class, which result from measuring and can take any value within a range, are continuous variables.


Frequency Distribution of SAT Scores

As you can see from the graph of the frequencies of various scores, as the number of observations increases, the frequency distribution approaches a probability distribution.

https://www.mathsisfun.com/data/images/histogram-heights.gif

The graph of tree heights above is called a histogram: the bars represent the probability of that particular interval. With more observations and smaller intervals, the histogram comes to look like a smooth curve, called the probability density curve, and for data like this the plot is the normal distribution.
This distribution has various properties, such as its location (center), variability and shape.

LOCATION:

Mean: The mean is simply the average of all the observations, i.e. the sum of n observations divided by n. The important distinction to keep in mind is that the mean of a sample (frequency distribution) is denoted x̄, while the mean of the whole population or probability distribution is denoted μ.

Median: The observation in the middle when the data are arranged in sorted order.

Mode: The value of the most frequent observation. The mode of the discrete probability distribution is the value which has the highest probability of occurring and the mode of the continuous probability distribution is the point at which the density function attains the maximum value.
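
A tiny example with Python's standard statistics module (the sample values are made up purely for illustration) computes all three location measures:

```python
import statistics as st

x = [2, 3, 3, 4, 5, 5, 5, 7, 9]   # made-up sample

print("mean  :", st.mean(x))       # sum of the observations divided by n
print("median:", st.median(x))     # middle value of the sorted data
print("mode  :", st.mode(x))       # most frequent value (5 here)
```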


VARIABILITY

Dispersion: A key feature of a random variable is its variability, and the mean deviation is the most obvious measure: the average of the absolute deviations from the mean. Another measure of dispersion is the interquartile range, the region between the lower and upper quartiles; these are the points at which the cumulative frequency distribution reaches ¼ and ¾ respectively. The third and most widely used measure of dispersion is the standard deviation, the square root of the average of the squared deviations from the mean.
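
The same made-up sample can be used to compute these three dispersion measures with numpy; this is just an illustrative sketch:

```python
import numpy as np

x = np.array([2, 3, 3, 4, 5, 5, 5, 7, 9])      # same made-up sample as above

mean_deviation = np.mean(np.abs(x - x.mean()))  # average absolute deviation from the mean
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1                                   # interquartile range
std_dev = x.std(ddof=1)                         # sample standard deviation

print(f"mean deviation      = {mean_deviation:.3f}")
print(f"interquartile range = {iqr:.3f}")
print(f"standard deviation  = {std_dev:.3f}")
```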

https://www.cdc.gov/ophss/csels/dsepd/ss1978/lesson2/images/figure2.7.jpg

After understanding the location and variability, now we focus on the shape of the distribution.

SHAPE

Skewness: It is a measure of the lack of symmetry. A distribution is symmetrical if it looks the same on both sides of the center. Distributions whose right tail is longer than the left are said to be skewed to the right, and distributions whose left tail is longer than the right are said to be skewed to the left.

https://www.safaribooksonline.com/library/view/clojure-for-data/9781784397180/graphics/7180OS_01_180.jpg

Kurtosis

It is a measure of the peakedness of the distribution. A normal distribution is mesokurtic; a leptokurtic distribution has heavier tails and a higher peak than the normal; a platykurtic distribution has a lower peak than the normal and lighter tails.
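
A small illustration (numpy and scipy assumed; the three distributions are arbitrary examples) compares sample skewness and excess kurtosis for symmetric, right-skewed and heavy-tailed data:

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(1)

samples = {
    "normal":      rng.normal(size=100_000),            # symmetric, mesokurtic
    "exponential": rng.exponential(size=100_000),       # long right tail (skewed to the right)
    "t (df=3)":    rng.standard_t(df=3, size=100_000),  # heavy tails (leptokurtic)
}

for name, data in samples.items():
    # kurtosis() reports excess kurtosis, which is 0 for a normal distribution
    print(f"{name:>11}: skewness = {skew(data):6.2f}, excess kurtosis = {kurtosis(data):6.2f}")
```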

http://mvpprograms.com/help/images/KurtosisPict.jpg

A normal distribution is a proper bell curve with zero skewness and zero excess kurtosis.
