This chapter focuses on the statistics required by any aspiring data scientist.
The purpose of this chapter is to provide a comprehensive overview of the fundamentals of statistics that you’ll need to start your data science journey. Many articles on the topic already exist, but I’m aiming to make this one more concise.
We will explore ways of sampling and obtaining data without being affected by bias, and then use statistical measures to quantify and visualize our data. We will also see how to standardize data for both graphing and interpretability. The chapter covers the following topics:
- How to obtain and sample data
- The measures of center, variance, and relative standing
- Normalization of data using the z-score
- The empirical rule
Statistics can be a powerful tool when practicing data science. From a high-level view, statistics is the use of mathematics to perform technical analysis of data. A basic visualization such as a bar chart might give you some high-level information, but with statistics, we get to operate on the data in a much more information-driven and targeted way. The math involved helps us form concrete conclusions about our data rather than just guessing.
Using statistics, we can gain deeper and more fine-grained insights into how exactly our data is structured and, based on that structure, how we can best apply other data science techniques to get even more information. We’re going to look at the basic statistics concepts that data scientists need to know and how they can be applied most effectively.
What are statistics?
This might seem like an odd question to ask, but I am frequently surprised by the number of people who cannot answer this simple and yet powerful question: what are statistics? Statistics are the numbers you always see on the news and in the paper. Statistics are useful when trying to prove a point or trying to scare you, but what are they?
To answer this question, we need to back up for a minute and talk about why we even measure them in the first place. The goal of this field is to try to explain and model the world around us. To do that, we have to take a look at the population.
We can define a population as the entire pool of subjects of an experiment or a model. Essentially, your population is who you care about. Who are you trying to talk about? If you are trying to test if smoking leads to heart disease, your population would be the smokers of the world. If you are trying to study teenage drinking problems, your population would be all teenagers.
Now, suppose you want to ask a question about your population. For example, if your population is all of your employees (assume that you have 1,000 employees), perhaps you want to know what percentage of them use illicit drugs. The answer to this question is called a parameter.
We can define a parameter as a numerical measurement describing a characteristic of a population. For example, if you ask all 1,000 employees and 100 of them are using drugs, the rate of drug use is 10%. The parameter here is 10%.
However, let’s get real: you probably can’t ask every single employee whether they are using drugs. What if you have over 10,000 employees? It would be very difficult to track everyone down to get your answer. When this happens, measuring the parameter directly is impractical. Instead, we can estimate it.
First, we will take a sample of the population. We can define a sample of a population as a subset of the population; the subset does not have to be chosen at random.
So, perhaps we ask 200 of the 1,000 employees. Of these 200, suppose 26 use drugs, making the drug use rate 13%. Here, 13% is not a parameter because we didn’t get a chance to ask everyone. This 13% is an estimate of a parameter. Do you know what that’s called? That’s right, a statistic!
We can define a statistic as a numerical measurement describing a characteristic of a sample of a population.
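To make the difference between a parameter and a statistic concrete, here is a minimal Python sketch. It assumes a made-up population matching the employee example above (1,000 employees, 100 of whom use drugs) and draws a random sample of 200. The variable names and the fixed seed are illustrative choices, and the sampled rate will generally not equal the 13% used in the text.

```python
import random

# Hypothetical population: 1,000 employees, 100 of whom use drugs (True).
population = [True] * 100 + [False] * 900

# Parameter: measured on the entire population.
parameter = sum(population) / len(population)

# Statistic: measured on a sample of 200 employees, drawn without replacement.
rng = random.Random(42)  # fixed seed (arbitrary) so the run is repeatable
sample = rng.sample(population, 200)
statistic = sum(sample) / len(sample)

print(f"parameter (population rate): {parameter:.1%}")
print(f"statistic (sample rate):     {statistic:.1%}")
```

The parameter is always 10% here because it uses every employee. The statistic depends on which 200 employees happen to be sampled.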
A statistic is just an estimate of a parameter. It is a number that attempts to describe an entire population by describing a subset of that population. This is necessary because you can never hope to give a survey to every single teenager or to every single smoker in the world. That’s what the field of statistics is all about: taking samples of populations and running tests on these samples.
So, the next time you are given a statistic, just remember, that number only represents a sample of that population, not the entire pool of subjects.
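A short simulation can show why a statistic should be read as an estimate rather than an exact answer. The sketch below assumes a hypothetical population of 1,000 employees with a true drug use rate of 10%, draws 1,000 separate random samples of 200, and records the rate seen in each. The function name `sample_rate` and the seed are illustrative.

```python
import random
import statistics

# Hypothetical population: 1,000 employees, 100 of whom use drugs (True).
population = [True] * 100 + [False] * 900
rng = random.Random(0)  # fixed seed (arbitrary) so the run is repeatable


def sample_rate(pop, n, rng):
    """Return the share of True values in a random sample of size n."""
    sample = rng.sample(pop, n)
    return sum(sample) / n


# Repeat the survey many times; each run yields a different statistic.
estimates = [sample_rate(population, 200, rng) for _ in range(1000)]

print(f"smallest estimate: {min(estimates):.1%}")
print(f"largest estimate:  {max(estimates):.1%}")
print(f"mean of estimates: {statistics.mean(estimates):.1%}")
```

Individual estimates scatter around the true 10%, while their average lands close to it. That scatter is why a single reported statistic describes one sample, not the whole population.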
How do we obtain and sample data?
If statistics is about taking samples of populations, then knowing how we obtain these samples must be very important, and it is. Let’s focus on just a few of the many ways of obtaining and sampling data.
There are two main ways of collecting data for our analysis: observation and experimentation. Both have their pros and cons, of course. They each produce different types of behavior and, therefore, warrant different types of analysis.
We might obtain data through observational means, which consist of measuring specific characteristics without attempting to modify the subjects being studied. For example, if you have tracking software on your website that observes users’ behavior, such as the length of time spent on certain pages and the rate of clicking on ads, all without affecting the users’ experience, then that would be an observational study.
This is one of the most common ways to get data because it’s just plain easy. All you have to do is observe and collect data. However, observational studies are limited in the types of data you may collect. This is because the observer (you) is not in control of the environment. You may only watch and collect natural behavior. If you are looking to induce a certain type of behavior, an observational study would not be useful.
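As a rough illustration of what observational data from website tracking might look like, the sketch below summarizes a handful of made-up page-view records. The field names (`seconds_on_page`, `clicked_ad`) and the values are hypothetical and do not come from any particular tracking tool.

```python
from statistics import mean

# Hypothetical page-view records, as tracking software might log them.
# Nothing here changes what the user sees; we only record behavior.
page_views = [
    {"page": "/home", "seconds_on_page": 12.0, "clicked_ad": False},
    {"page": "/pricing", "seconds_on_page": 45.5, "clicked_ad": True},
    {"page": "/home", "seconds_on_page": 8.5, "clicked_ad": False},
    {"page": "/blog", "seconds_on_page": 120.0, "clicked_ad": False},
]

# Average time spent on a page across all observed views.
avg_seconds = mean(view["seconds_on_page"] for view in page_views)

# Share of observed views in which the user clicked an ad.
ad_click_rate = sum(view["clicked_ad"] for view in page_views) / len(page_views)

print(f"average time on page: {avg_seconds} seconds")
print(f"ad click rate: {ad_click_rate:.0%}")
```

Both numbers are statistics in the sense defined earlier: they describe only the page views that were observed, not every visit the site will ever receive.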