Impossible or Improbable — A Gentle Introduction to Probability

9 min read · Mar 20, 2020

True logic of this world lies in the calculus of probabilities — James Clerk Maxwell

Over the next few chapters, we will explore both probability and statistics as methods of examining both data-driven situations and real-world scenarios. The rules of probability govern the basics of prediction. We use probability to define the chances of the occurrence of an event.

In this chapter, we will look at the following topics:

What is probability?
The differences between the Frequentist approach and the Bayesian approach
How to visualize probability
How to utilize the rules of probability
Using confusion matrices to look at the basic metrics

Probability will help us model real-life events that include a sense of randomness and chance. Over the next two chapters, we will look at the terminology behind probability theorems and how to apply them to model situations that can appear unexpectedly.

Basic definitions

One of the most basic concepts of probability is the concept of a procedure. A procedure is an act that leads to a result, such as throwing dice or visiting a website.

An event is a collection of the outcomes of a procedure, such as getting heads on a coin flip or leaving a website after only 4 seconds. A simple event is an outcome or event of a procedure that cannot be broken down further. For example, rolling two dice can be broken down into two simple events: rolling die 1 and rolling die 2.

The sample space of a procedure is the set of all possible simple events. For example, suppose an experiment is performed in which a coin is flipped three times in succession. What is the size of the sample space for this experiment? The answer is eight, because the result could be any one of the possibilities in the following sample space: {HHH, HHT, HTT, HTH, TTT, TTH, THH, THT}.
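As a quick check on the three-flip example, the sample space can be enumerated in Python. This is only a minimal sketch; the variable names are illustrative and not part of the original text.

```python
from itertools import product

# Each flip has two simple outcomes: heads (H) or tails (T).
flip_outcomes = "HT"

# The sample space is every ordered sequence of three flips.
sample_space = ["".join(flips) for flips in product(flip_outcomes, repeat=3)]

print(sample_space)
print(len(sample_space))  # 8
```

Each extra flip doubles the number of sequences, which is why three flips give 2 × 2 × 2 = 8 outcomes.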

Probability

The probability of an event represents the frequency, or chance, that the event will happen. For notation, if A is an event, P(A) is the probability of the occurrence of the event. We can define the actual probability of an event, A, as follows:

P(A) = (number of ways A can occur) / (size of the sample space)

Here, A is the event in question. Think of an entire universe of events where anything is possible, and let’s represent it as a circle. We can think of a single event, A, as being a smaller circle within that larger universe, as shown in the following diagram:

Let’s now pretend that our universe is a research study on humans, and event A is the set of people in that study who have cancer.

If our study has 100 people and A has 25 people, the probability of A or P(A) is 25/100.

The maximum probability of any event is 1. This can be understood as the red circle growing so large that it is the size of the universe (the larger circle).

The most basic examples (I promise they will get more interesting) are coin flips. Let’s say we have two coins and we want the probability that we will flip two heads. We can very easily count the number of ways two coins could end up being two heads. There’s only one! Both coins have to be heads. But how many options are there in total? The result could be two heads, two tails, or a heads/tails combination.

First, let’s define A. It is the event in which two heads occur. The number of ways that A can occur is 1. The sample space of the experiment is {HH, HT, TH, TT}, where each two-letter word indicates the outcome of the first and second coin simultaneously. The size of the sample space is four. So, P(getting two heads) = 1/4.

Let’s refer to a quick visual table to prove it. The following table denotes the options for coin 1 as the columns and the options for coin 2 as the rows. In each cell, there is either a True or a False. A True value indicates that the cell satisfies the condition (both heads) and False indicates otherwise.

              Coin 1: H    Coin 1: T
Coin 2: H     True         False
Coin 2: T     False        False

So, we have one out of a total of four possible outcomes.
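The same count can be reproduced in code. The following is a small sketch of the classical calculation (ways the event can occur divided by the size of the sample space); the names `event_a` and `p_two_heads` are chosen here for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space for two coins: the first letter is coin 1, the second is coin 2.
sample_space = ["".join(pair) for pair in product("HT", repeat=2)]

# Event A: both coins land heads.
event_a = [outcome for outcome in sample_space if outcome == "HH"]

# Classical probability: ways A can occur / size of the sample space.
p_two_heads = Fraction(len(event_a), len(sample_space))

print(sample_space)   # ['HH', 'HT', 'TH', 'TT']
print(p_two_heads)    # 1/4
```

Using `Fraction` keeps the result as an exact 1/4 rather than a floating-point approximation.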

Bayesian versus Frequentist

The preceding example was almost too easy. In practice, we can hardly ever truly count the number of ways something can happen. For example, let’s say that we want to know the probability of a random person smoking cigarettes at least once a day. If we wanted to approach this problem using the classical way (the previous formula), we would need to figure out how many different ways a person is a smoker — someone who smokes at least once a day — which is not possible! When faced with such a problem, two main schools of thought are considered when it comes to calculating probabilities in practice: the Frequentist approach and the Bayesian approach. This chapter will focus heavily on the Frequentist approach while the subsequent chapter will dive into the Bayesian analysis.

Frequentist approach

In a Frequentist approach, the probability of an event is calculated through experimentation. It uses the past in order to predict the future chance of an event. The basic formula is as follows:

P(A) = (number of times A occurred) / (number of times the procedure was repeated)

Basically, we observe several instances of the event and count the number of times A was satisfied. Dividing that count by the number of observations gives an approximation of the probability. The Bayesian approach differs by dictating that probabilities must be discerned using theoretical means. Using the Bayesian approach, we would have to think a bit more critically about events and why they occur. Neither methodology is the correct answer 100% of the time. Usually, it comes down to the problem and the difficulty of using either approach.

The crux of the Frequentist approach is the relative frequency. The relative frequency of an event is how often an event occurs divided by the total number of observations.

Example — marketing stats

Let’s say that you are interested in ascertaining how often a person who visits your website is likely to return on a later date. This is sometimes called the rate of repeat visitors. In the previous definition, we would define our event A as a visitor coming back to the site. We would then have to calculate the number of ways a person can come back, which doesn’t really make sense at all!

In this case, many people would turn to a Bayesian approach; however, we can calculate what is known as the relative frequency. We can take the visitor logs and calculate the relative frequency of event A (repeat visitors). Let’s say that, of the 1,458 unique visitors in the past week, 452 were repeat visitors. We can calculate this as follows:

P(A) = 452/1,458 ≈ .31

So, about 31% of your visitors are repeat visitors.
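The repeat-visitor calculation is easy to check in code. This is a minimal sketch; `relative_frequency` is a hypothetical helper written for this example, not a function from any library.

```python
def relative_frequency(times_event_occurred, total_observations):
    """Return how often an event occurred as a fraction of all observations."""
    if total_observations <= 0:
        raise ValueError("total_observations must be positive")
    return times_event_occurred / total_observations


repeat_visitors = 452     # visitors who came back (event A)
unique_visitors = 1458    # all unique visitors in the past week

p_repeat = relative_frequency(repeat_visitors, unique_visitors)
print(f"{p_repeat:.2%}")  # 31.00%
```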

The law of large numbers

The reason the Frequentist approach works at all is the law of large numbers, which states that if we repeat a procedure over and over, the relative frequency will approach the actual probability. Let’s try to demonstrate this using Python.

If I were to ask you the average of the numbers 1 and 10, you would very quickly answer around 5. This question is identical to asking you to pick the average number between 1 and 10. Let’s design the experiment to be as follows:

Python will choose n random numbers between 1 and 10 and find their average.

We will repeat this experiment several times using a larger n each time, and then we will graph the outcome. The steps are as follows:

Pick a random number between 1 and 10 and find the average.
Pick two random numbers between 1 and 10 and find their average.
Pick three random numbers between 1 and 10 and find their average.
Pick 10,000 random numbers between 1 and 10 and find their average.
Graph the results.
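The steps above can be sketched as follows. This is one possible version under stated assumptions, not the author’s original code: it assumes whole numbers from 1 to 10 inclusive (so the true average is 5.5), uses a handful of sample sizes picked here for illustration instead of every n up to 10,000, and prints the averages rather than graphing them.

```python
import random
from statistics import mean


def average_of_random_numbers(n, rng):
    """Draw n random integers from 1 to 10 (inclusive) and return their average."""
    return mean(rng.randint(1, 10) for _ in range(n))


rng = random.Random(42)  # fixed seed so repeated runs give the same numbers

# Illustrative sample sizes; any increasing sequence would do.
sample_sizes = [1, 2, 3, 10, 100, 1_000, 10_000]
averages = {n: average_of_random_numbers(n, rng) for n in sample_sizes}

for n, avg in averages.items():
    print(f"n={n:>6}: average={avg:.3f}")
```

With small n the average jumps around, while for large n it should settle close to 5.5. To graph the outcome, the keys and values of `averages` can be handed to any plotting library.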

Compound events

Sometimes, we need to deal with two or more events. These are called compound events. A compound event is any event that combines two or more simple events. When this happens, we need some special notation.

Given events A and B:

The probability that A and B occur is P(A ∩ B) = P(A and B)
The probability that either A or B occurs is P(A ∪ B) = P(A or B)

Understanding why we use set notation for these compound events is very important. Remember how we represented events in a universe using circles earlier?

Let’s say that our universe is 100 people who showed up for an experiment, in which a new test for cancer is being developed:

In the preceding diagram, the red circle, A, represents 25 people who actually have cancer. Using the relative frequency approach, we can say that P(A) = number of people with cancer/number of people in study, that is, 25/100 = 1/4 = .25. This means that there is a 25% chance that someone in the study has cancer.

Let’s introduce a second event, called B, as shown, which contains people for whom the test was positive (it claimed that they had cancer). Let’s say that this is 30 people. So, P(B) = 30/100 = 3/10 = .3. This means that there is a 30% chance that the test came back positive for any given person:

These are two separate events, but they interact with each other. Namely, they might intersect or have people in common, as shown here:

Anyone in the space that both A and B occupy, otherwise known as A intersect B or A ∩ B, is a person for whom the test came back positive (B) and who actually does have cancer (A). Let’s say that’s 20 people, as shown here:

This means that P(A and B) = 20/100 = 1/5 = .2 = 20%.

Now suppose we want to count everyone who has cancer or whose test came back positive. This is the union of the two events: the 5 people in A only, the 20 people in both, and the 10 people in B only, which sum to 35. So, 35/100 people either have cancer or had a positive test outcome. That means P(A or B) = 35/100 = .35 = 35%.
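Python’s built-in sets mirror this notation directly. The sketch below assumes the 100 participants are simply numbered 0 to 99, with the ranges chosen so that the counts match the example (25 with cancer, 30 positive tests, 20 in both); the numbering itself is arbitrary.

```python
# Number the 100 participants 0..99 (an arbitrary labeling for illustration).
universe = set(range(100))
has_cancer = set(range(0, 25))        # event A: 25 people
tested_positive = set(range(5, 35))   # event B: 30 people, 20 of them also in A

both = has_cancer & tested_positive    # A ∩ B (A and B)
either = has_cancer | tested_positive  # A ∪ B (A or B)

p_a_and_b = len(both) / len(universe)
p_a_or_b = len(either) / len(universe)

print(p_a_and_b)  # 0.2
print(p_a_or_b)   # 0.35

# The union counts the overlap only once: |A or B| = |A| + |B| - |A and B|.
assert len(either) == len(has_cancer) + len(tested_positive) - len(both)
```

The final check shows why we cannot just add P(A) and P(B): the 20 people in both circles would be counted twice.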

All in all, we have people in the following four different classes:

Pink: This refers to the people who have cancer and had a negative test outcome
Purple (A intersect B): These people have cancer and had a positive test outcome
Blue: This refers to the people with no cancer and a positive test outcome
White: This refers to the people with no cancer and a negative test outcome

So, effectively, the only times the test was accurate were in the white and purple regions. In the blue and pink regions, the test was incorrect.
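These four classes are the cells of a confusion matrix. The sketch below fills in the counts from the example; the white count of 65 is derived here as 100 - 35, and the labels true positive, false negative, and so on are the standard names, which the text above does not use.

```python
# Counts for the four regions in the cancer-test example.
true_positive = 20    # purple: has cancer, test positive
false_negative = 5    # pink: has cancer, test negative
false_positive = 10   # blue: no cancer, test positive
true_negative = 65    # white: no cancer, test negative (derived: 100 - 35)

total = true_positive + false_negative + false_positive + true_negative

# Accuracy: the share of people for whom the test was correct (purple + white).
accuracy = (true_positive + true_negative) / total

print(total)     # 100
print(accuracy)  # 0.85
```

So in this example the test was correct for 85 of the 100 people.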

I hope I was able to teach and show you a little more about the importance of probability. I recommend that all aspiring data scientists go out and learn more. See also these sources:

Sinan Ozdemir, Principles of Data Science (Packt)
https://towardsdatascience.com/data-scientists-must-know-probability-7722cdd49d21
https://luminousmen.com/post/data-science-probability