This chapter continues from Basic Statistic Part I and focuses on the statistics required by any aspiring data scientist.
An experiment consists of a treatment and the observation of its effect on the subjects. Subjects in an experiment are called experimental units. This is how most scientific labs collect data: they put people into two or more groups (usually just two), called the control group and the experimental group. The control group is exposed to a certain environment and then observed. The experimental group is exposed to a different environment and then observed. The experimenter then aggregates data from both groups and decides which environment was more favorable (the experimenter gets to decide what counts as favorable).
Remember that statistics are the result of measuring a sample of a population. So we should talk about two very common ways to decide who gets the honor of being in the sample that we measure. We will start with the main type of sampling, called random sampling, which is the most common way to choose our sample members, and then look at unequal probability sampling.
Probability sampling is a way of sampling from a population in which every person has a known probability of being chosen, although that probability may differ from one person to another. The simplest (and probably the most common) probability sampling method is random sampling.
Suppose that we are running an A/B test and need to figure out who will be in group A and who will be in group B. Our data team makes the following three suggestions:
- Separate users based on location: users on the west coast are placed in group A, while users on the east coast are placed in group B
- Separate users based on the time of day they visit the site: users who visit between 7 p.m. and 4 a.m. are placed in group A, while the rest are placed in group B
- Make it completely random: every new user has a 50/50 chance of being placed in either group
The first two are valid options for choosing samples and are fairly simple to
implement, but they both have one fundamental flaw: they are both at risk of
introducing a sampling bias.
A sampling bias occurs when the way the sample is obtained systematically favors some outcome over the target outcome.
It is not difficult to see why choosing option 1 or option 2 might introduce bias. If we choose our groups based on where users live or what time they log in, we are setting up our experiment incorrectly and have much less control over the results.
Specifically, we are at risk of introducing a confounding factor into our analysis, which is bad news.
A confounding factor is a variable that we are not directly measuring but that connects the variables being measured.
In other words, a confounding factor is a missing element in our analysis: it is invisible, but it affects our results.
In this case, option 1 does not take into account the potential confounding factor of geographical taste. For example, if website A is generally unappealing to west coast users, it will affect our results drastically.
Similarly, option 2 might introduce a temporal (time-based) confounding factor. What if website B is better viewed in a nighttime environment (which was reserved for A), and users are turned off by the style purely because of the time of day? These are both factors that we want to avoid, so we should go with option 3, which is a random sample.
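Option 3 can be sketched in a few lines of Python. This is only a minimal illustration, not a production assignment scheme: the function name `assign_groups`, the user IDs, and the seed are all hypothetical and chosen for the example.

```python
import random


def assign_groups(user_ids, seed=None):
    """Place each user in group "A" or "B" with a 50/50 chance."""
    # The seed is optional; it only makes the split reproducible.
    rng = random.Random(seed)
    return {user_id: rng.choice(["A", "B"]) for user_id in user_ids}


# Hypothetical user IDs, used only for illustration
users = [f"user_{i}" for i in range(1000)]
groups = assign_groups(users, seed=42)

n_a = sum(1 for group in groups.values() if group == "A")
n_b = len(groups) - n_a
print(n_a, n_b)  # roughly, but not exactly, 500 each
```

Because the assignment ignores location and time of day, neither of those variables can systematically line up with group membership.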
Unequal probability sampling
Recall that probability sampling might have different probabilities for different potential sample members. But what if this actually introduced problems? Suppose we are interested in measuring the happiness level of our employees. We already know that we can’t ask every single person on the staff because that would be silly and exhausting. So, we need to take a sample. Our data team suggests random sampling, and at first everyone high fives because they feel very smart and statistical. But then someone asks a seemingly harmless question: “Does anyone know the percentage of men/women who work here?”
The high fives stop and the room goes silent. This question is extremely important because sex is likely to be a confounding factor. The team looks into it and discovers a split of 75% men and 25% women in the company.
This means that if we take a simple random sample, our sample will likely have a similar split and, thus, favor the results for men over women. To combat this, we can give women a higher chance than men of being included in our survey, so that the sample is less skewed toward men.
At first glance, introducing a favoring system in our random sampling seems like a bad idea. However, working to remove systematic bias related to gender, race, disability, and so on is much more important.
A simple random sample, where everyone has the same chance as everyone else, is very likely to drown out the voices and opinions of minority population members. Therefore, it can be okay to introduce such a favoring system in your sampling techniques.
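One way to sketch such a favoring system is weighted sampling with NumPy. Everything below is an assumption made for illustration: a hypothetical staff of 400 people with the 75% / 25% split from the text, a sample of 40, and weights chosen so that each group carries the same total weight.

```python
import numpy as np

# Hypothetical staff of 400 people matching the 75% / 25% split in the text
sex = np.array(["M"] * 300 + ["F"] * 100)

# Give each woman three times the selection weight of each man, so both
# groups carry the same total weight (300 * 1 == 100 * 3)
weights = np.where(sex == "F", 3.0, 1.0)
probabilities = weights / weights.sum()

rng = np.random.default_rng(0)  # fixed seed only for reproducibility
chosen = rng.choice(len(sex), size=40, replace=False, p=probabilities)

sample_sex = sex[chosen]
print((sample_sex == "F").sum(), (sample_sex == "M").sum())
```

Every employee still has a known probability of being chosen, so this remains probability sampling. The probabilities are simply no longer equal.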
How do we measure statistics?
Once we have our sample, it’s time to quantify our results. Suppose we wish to
generalize the happiness of our employees or we want to figure out whether salaries in the company are very different from person to person. These are some common ways of measuring our results.
Measures of center
Measures of center are how we define the middle, or center, of a dataset. We do this because sometimes we wish to make generalizations about data values. For example, perhaps we’re curious about what the average rainfall in Seattle is or what the median height for European males is. It’s a way to generalize a large set of data so that it’s easier to convey to someone.
A measure of center is a value in the “middle” of a dataset. However, this can mean different things to different people. Who’s to say where the middle of a dataset is? There are many different ways of defining the center of data. Let’s take a look at a few.
The arithmetic mean of a dataset is found by adding up all of the values and then dividing by the number of data values.
This is likely the most common way to define the center of data, but it can be flawed! Suppose we wish to find the mean of the following numbers using Python:
import numpy as np
np.mean([11, 15, 17, 14]) == 14.25
Simple enough: our average is 14.25 and all of our values are fairly close to it.
But what if we introduce a new value: 31?
np.mean([11, 15, 17, 14, 31]) == 17.6
This greatly affects the mean because the arithmetic mean is sensitive to outliers. The new value, 31, is almost twice as large as any of the other numbers and, therefore, skews the mean.
Another, and sometimes better, measure of center is the median.
The median is the number found in the middle of the dataset when it is sorted in order. When the dataset has an even number of values, the median is the average of the two middle values, as shown:
np.median([11, 15, 17, 14]) == 14.5
np.median([11, 15, 17, 14, 31]) == 15
Note how introducing 31 did not greatly affect the median of the dataset. This is because the median is less sensitive to outliers.
When working with datasets with many outliers, it is sometimes more useful to use the median of the dataset, while if your data does not have many outliers and the data points are mostly close to one another, then the mean is likely a better option.
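To make the two definitions concrete, here is a plain-Python sketch of both measures. The helper names `mean_sketch` and `median_sketch` are hypothetical; in practice `np.mean` and `np.median` do the same job.

```python
def mean_sketch(values):
    """Add up all of the values and divide by how many there are."""
    return sum(values) / len(values)


def median_sketch(values):
    """Return the middle value of the sorted data.

    With an even number of values, average the two middle values.
    """
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


print(mean_sketch([11, 15, 17, 14]))        # 14.25
print(median_sketch([11, 15, 17, 14]))      # 14.5
print(median_sketch([11, 15, 17, 14, 31]))  # 15
```

The even-length case explains why the median of the first list is 14.5: it is the average of the two middle values, 14 and 15.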
But how can we tell if the data is spread out? Well, we will have to introduce a new type of statistic.
Measures of variation
Measures of center are used to quantify the middle of the data, but now we will explore ways of measuring how “spread out” the data we collect is. This is a useful way to identify whether our data has many outliers lurking inside. Let’s start with an example.
Suppose we take a random sample of 24 of our friends on Facebook and write down how many friends each of them has on Facebook. Here’s the list:
friends = [109, 1017, 1127, 418, 625, 957, 89, 950, 946, 797, 981,
125, 455, 731, 1640, 485, 1309, 472, 1132, 1773, 906, 531, 742, 621]
np.mean(friends) # == 789.1 (rounded)
The average of this list is just over 789. So, we could say that according to this
sample, the average Facebook friend has 789 friends. But what about the person who only has 89 friends or the person who has over 1,600 friends? In fact, not a lot of these numbers are really that close to 789.
Well, how about we use the median, as shown, because the median generally is not as affected by outliers:
np.median(friends) == 769.5
The median is 769.5, which is fairly close to the mean. Hmm, good thought, but it still doesn’t really account for how drastically different a lot of these data points are from one another. Quantifying this is what statisticians call measuring the variation of data. Let’s start by introducing the most basic measure of variation: the range. The range is simply the maximum value minus the minimum value, as illustrated:
np.max(friends) - np.min(friends) == 1684
The range tells us how far apart the two most extreme values are. Typically, the range isn’t widely used, but it does have its uses. Sometimes we just wish to know how far apart the most extreme values are. This is most useful in scientific measurements or safety measurements.
Suppose a car company wants to measure how long it takes for an air bag to deploy. Knowing the average of that time is nice, but they also really want to know how far apart the slowest and fastest times are. This literally could be the difference between life and death.
Shifting back to the Facebook example, 1,684 is our range, but I’m not quite sure it’s saying too much about our data. Now, let’s take a look at the most commonly used measure of variation, the standard deviation.
I’m sure many of you have heard this term thrown around a lot and it might even incite a degree of fear, but what does it really mean? In essence, standard deviation, denoted by s when we are working with a sample of a population, measures how much data values deviate from the arithmetic mean.
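It’s basically a way to see how spread out the data is. There is a general formula to calculate the standard deviation, which is as follows:
$$ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}} $$
Here, $x_i$ is each value in the sample, $\bar{x}$ is the arithmetic mean, and $n$ is the number of points in the sample.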
Before you freak out, let’s break it down. For each value in the sample, we will take that value, subtract the arithmetic mean from it, square the difference, and, once we’ve added up every single point this way, we will divide the entire thing by n, the number of points in the sample. Finally, we take a square root of everything.
Without going into an in-depth analysis of the formula, think about it this way: it’s basically derived from the distance formula. Essentially, what the standard deviation is calculating is a sort of average distance of how far the data values are from the arithmetic mean.
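The steps described above can be followed literally in Python. This is a sketch that uses the Facebook friends list from earlier in the chapter and divides by n, which is also what `np.std` does by default. Passing `ddof=1` to `np.std` divides by n - 1 instead, a common convention for samples that gives a slightly larger value.

```python
import math

import numpy as np

friends = [109, 1017, 1127, 418, 625, 957, 89, 950, 946, 797, 981,
           125, 455, 731, 1640, 485, 1309, 472, 1132, 1773, 906, 531, 742, 621]

n = len(friends)
mean = sum(friends) / n

# For each value: subtract the mean and square the difference
squared_diffs = [(x - mean) ** 2 for x in friends]

# Add them up, divide by n, then take the square root
std_manual = math.sqrt(sum(squared_diffs) / n)

print(round(std_manual, 1))        # 425.2
print(round(np.std(friends), 1))   # 425.2, np.std divides by n by default
print(np.std(friends, ddof=1))     # divides by n - 1, so slightly larger
```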
Let’s go back to our Facebook example and begin to calculate the standard deviation. Recall that the arithmetic mean of the data was just about 789, so we’ll use 789 as the mean.
We start by taking the difference between each data value and the mean, squaring it, adding them all up, dividing by the number of values, and then taking the square root. This would look as follows:
$$ s = \sqrt{\frac{(109 - 789)^2 + (1017 - 789)^2 + \dots + (621 - 789)^2}{24}} \approx 425.2 $$
On the other hand, we can take the Python approach and do all this
programmatically (which is usually preferred).
np.std(friends) # == 425.2
What the number 425 represents is the spread of data. You could say that 425 is a kind of average distance the data values are from the mean. What this means, in simple words, is that this data is pretty spread out.
So, our standard deviation is about 425. This means that the number of friends that these people have on Facebook doesn’t seem to be close to a single number. That’s quite evident when we plot the data in a bar graph together with the mean and the standard deviation. In the following plot, every person is represented by a single bar, and the height of each bar represents the number of friends that person has:
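import matplotlib.pyplot as plt
y_pos = range(len(friends))
plt.bar(y_pos, friends) # one bar per person
plt.plot((0, 25), (789, 789), 'b-') # mean
plt.plot((0, 25), (789+425, 789+425), 'g-') # mean plus one standard deviation
plt.plot((0, 25), (789-425, 789-425), 'r-') # mean minus one standard deviation
plt.show()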
The blue line in the center is drawn at the mean (789), the red line on the bottom is drawn at the mean minus the standard deviation (789 - 425 = 364), and, finally, the green line towards the top is drawn at the mean plus the standard deviation (789 + 425 = 1,214).
Note how most of the data lives between the green and the red lines while the
outliers live outside the lines. Namely, there are three people who have friend counts below the red line and three people who have a friend count above the green line.
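The counts above can be checked directly. The following sketch uses the same friends list and treats anything more than one standard deviation from the mean as falling outside the lines. The variable names are chosen for this example only.

```python
import numpy as np

friends = np.array([109, 1017, 1127, 418, 625, 957, 89, 950, 946, 797, 981,
                    125, 455, 731, 1640, 485, 1309, 472, 1132, 1773, 906, 531,
                    742, 621])

mean = friends.mean()
std = friends.std()  # divides by n, matching np.std(friends) in the text

lower, upper = mean - std, mean + std  # the red and green lines

below = friends[friends < lower]
above = friends[friends > upper]
within = friends[(friends >= lower) & (friends <= upper)]

print(sorted(below.tolist()))  # [89, 109, 125]
print(sorted(above.tolist()))  # [1309, 1640, 1773]
print(len(within))             # 18
```

So 18 of the 24 people fall within one standard deviation of the mean, and three fall on each side of that band.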
It’s important to mention that the units for standard deviation are, in fact, the
same units as the data’s units. So, in this example, we would say that the standard deviation is 425 friends on Facebook.
Reference:
- Sinan Ozdemir, Principles of Data Science (Packt)