What is Bayesian Statistics? The Beginner Math Guide (Part One)
Bayesian statistics is used in many fields, such as machine learning, engineering, programming, data science, physics, finance, and more.
Introduction
Life is full of uncertainty: even something predictable may not happen.
We cope with uncertainty by planning around it. For example, if you have an in-person job interview, you might leave home earlier to allow for delays. We all have a sense of how to deal with uncertainty, and when you think this way, you're starting to think in terms of probability.
Bayesian statistics helps us make better choices given our limited information. It's not just data scientists who benefit from this; engineers, programmers, salespeople, and marketers will all benefit from knowing Bayesian statistics.
Bayesian Thinking
A full Bayesian analysis consists of the following steps:
Observe data
Form a hypothesis
Update your beliefs based on the data
Observing Data
Before you draw any conclusion, you need to understand the data you're observing. We would write this as:
P(data example) = likely
where P is probability and the data are listed inside the parentheses. You would read this as: "The probability of this data example is likely.” For example, we can write:
P(snow) = very unlikely
This equation reads as: “The probability of snow is very unlikely.”
We can also have two pieces of data listed inside the parentheses. Such as:
P(snow, the temperature is cold) = very likely
You would read this equation as: "The probability of snow and the temperature being cold is very likely." We use commas to separate events when we're combining the probability of multiple events.
The probability of one of these events occurring on its own would be different.
Holding Prior Beliefs
Prior beliefs are beliefs we have built during our lifetime of experiences. You believe the sun will set because the sun sets every day. You might have a prior belief that when a traffic light is red, you would stop, but if it’s green, you would go.
Our prior beliefs say that the chance of snow is very unlikely. This could be different in other places such as Aomori City, where it snows most of the time. The probability of snow would be very likely.
We can enter our prior beliefs in the equation, separated with a | like:
P(snow, the temperature is cold | experience in San Francisco Bay Area) = very unlikely
We would read this as: "The probability of snow and the temperature being cold, given our experience in the San Francisco Bay Area, is very unlikely."
A probability that depends on a condition like this is called a conditional probability.
We typically use shorter variable names for events and conditions, such as:
D = snow, the temperature is cold
X = experience in San Francisco Bay Area
We can write this equation as P(D | X). This would make it easier to write.
We can add more than one piece of prior knowledge. Suppose it's January 29th and you are in the Bay Area. From prior experience, it does snow in the Bay Area (well, in the mountains). Given our experience in the Bay Area and the fact that it's January 29th, the probability of seeing snow is less unlikely. You can rewrite the equation as:
P(snow, the temperature is cold | January 29th, experience in San Francisco Bay Area) = unlikely
Our probability changed from 'very unlikely' to 'unlikely.'
It's essential to keep in mind that our understanding of the world comes from our prior experience in the world.
Forming Hypothesis
To explain what you have seen, you need some form of hypothesis: a model of how the world works that makes predictions. Hypotheses come in many forms:
If you believe that your favorite basketball team is the greatest, you will predict they will win more championships than other teams.
If you believe the Earth rotates, you predict the sun will set and rise at certain times.
Hypotheses can also be formal:
A scientist may hypothesize that a certain treatment can slow the spread of Covid.
A neural network may predict which images are traffic lights and which ones are stop signs.
When we’re talking about hypotheses in Bayesian statistics, we are concerned with how well the hypotheses are at predicting the data we observe.
For example, we could define our first hypothesis as:
H1 = it's snowing outside
Because H1 predicts the data D, the probability of the data increases. We can write this as:
P(white flakes, the temperature is cold | H1, X) >> P(white flakes, the temperature is cold | X)
The equation says: "The probability of seeing white flakes while the temperature is cold, given my hypothesis that this is snow and my prior experience, is much higher (indicated by >>) than just seeing white flakes while the temperature is cold without any explanation." We use the hypothesis to explain the data.
The key to probabilistic reasoning is to think about how you interpret data, form hypotheses, and change your beliefs. Without H1, you would be confused, since you would have no explanation for the data you've observed.
Gathering Evidence
To improve our knowledge, we need to gather more data, and to get more data, we need more observations.
In our scenario, let’s say that you have noticed a camera crew with a snow machine. Because of this, you have instantly changed your mind about what is happening. With the new evidence, you realize someone is just shooting out artificial snow.
Let's break it down. You started with your initial hypothesis:
H1 = it's snowing outside
The chance of having snow in your backyard during the spring (using Lake Tahoe as an example) is very unlikely. So we would write:
P(H1 | X) = very unlikely
When you observe the additional data, you realize there is another hypothesis:
H2 = a camera crew is shooting artificial snow outside my window
Now, the chance of someone shooting artificial snow outside your window is extremely low. So we would write:
P(H2 | X) = extremely low
We write "extremely low" because the chance of experiencing something like that is low even compared to having snow in your backyard during the spring in Lake Tahoe. Snow in a Lake Tahoe backyard is more likely than someone shooting artificial snow outside your window.
Comparing Hypotheses
Now, because there is another explanation (the person shooting artificial snow), you have formed an alternate hypothesis. Your updated data are:
D_updated = white flakes in my backyard, a camera crew with a snow machine
On observing the extra data, you have changed your conclusion: you now have a new hypothesis, H2, which explains the data better than H1 does. The key is to understand that we're comparing how well each of these hypotheses explains the observed data. Because the second hypothesis explains the data better, we use >>. We say one belief is more accurate than another because it provides a better explanation of what we observed. We would express this as:
P(D_updated | H2, X) >> P(D_updated | H1, X)
Data Informs Beliefs
The only absolute is your data. Your hypotheses change, and your experience (X) changes too. Your experience may differ from someone else's, but the data (D) are the same.
Bayesian thinking is about changing and updating your mind to understand the world. The data is real, so our beliefs need to shift to align with the data.
Uncertainty
Probability is a measurement of how strongly you believe things. For example, you might have said something such as “That is unlikely” or “I’m not sure about that”. All of these are probabilities.
We need to find out how strongly we believe in X, so we use values such as true and false. Computers represent true as 1 and false as 0. We can use this model with probability. P(X) = 0 is like saying X = false, and P(X) = 1 is like saying X is true. A value closer to 0 means that we are certain it’s false, while a value closer to 1 means that we are certain it’s true. It’s worth noting that 0.5 means that we’re unsure if it’s true or false.
There is also negation. When we say "not false," we mean true; when we say "not true," we mean false. We want probability to work the same way: since either X or not X must happen, P(X) + P(not X) = 1.
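To make this concrete, here is a small Python sketch (the snow value is a made-up belief, purely for illustration):

```python
# A probability is a number between 0 and 1 that measures how strongly
# we believe something: 0 means certainly false, 1 means certainly true.
p_snow = 0.05  # hypothetical belief that it will snow; not real data

# Negation: P(not X) = 1 - P(X), so P(X) and P(not X) always cover
# everything that could happen.
p_not_snow = 1 - p_snow
print(p_not_snow)  # 0.95
```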
Calculating Probabilities By Outcomes
A way to calculate probability is to count the outcomes of events. We have two sets of outcomes. The first is all possible outcomes that could happen. The second is the count of the outcomes that you're interested in. For example, for a coin toss, the set of all possible outcomes is "heads" or "tails." If you have decided that tails means you win, then you're only interested in outcomes involving tails.
The first step is to count all possible events (in this case, heads or tails). We use omega (Ω) to indicate the set of all events:
Ω = {heads, tails}
Suppose we want to know the probability of getting tails in a single toss, written as P(tails). We divide the count of outcomes we care about (1, just tails) by the number of total outcomes, which is 2 (since it's either heads or tails). So the probability of tails is:
P(tails) = 1/2
Now, what is the probability of getting tails if we toss two coins? To find out, we have to include all possible pairs of heads and tails:
{(heads, heads), (heads, tails), (tails, heads), (tails, tails)}
To figure out the probability of getting at least one tail, we would look at how many pairs match our condition, which in this case is:
{(tails, tails), (heads, tails), (tails, heads)}
We have 3 elements, and there are 4 possible pairs we could get. So this means that P(at least one tail) = 3/4.
Count the events you care about and the total possible events, and then you can come up with an easy probability.
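This counting recipe is easy to check with a few lines of Python (a sketch of the two-coin example above):

```python
from itertools import product

# Enumerate every possible outcome of tossing two coins.
outcomes = list(product(["heads", "tails"], repeat=2))

# Count the outcomes we care about: pairs containing at least one tail.
favorable = [pair for pair in outcomes if "tails" in pair]

# Probability = outcomes we care about / total possible outcomes.
p_at_least_one_tail = len(favorable) / len(outcomes)
print(p_at_least_one_tail)  # 0.75, i.e. 3/4
```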
Calculating Probabilities as Ratios of Beliefs
Counting events is useful for physical objects, but not great for real-life questions such as:
“What is the probability it will snow tomorrow?”
“Is that Michael Jordan?”
Making bets is a practical way to express how strongly we hold our beliefs: the odds you give on a bet express how strongly you believe in your hypothesis. Odds are commonly written as a ratio of how much you would receive for being correct about the outcome to how much you would pay if you were wrong.
For example, say the odds of 60-year-old Michael Jordan beating the current LeBron James in a 1v1 are 10 to 1. That means you would pay $1 to take the bet, and if you're right about Michael Jordan winning, you would get $10 in return.
I'm going to use 100 to 5 as an example instead of 10 to 1. We can write this as the ratio of the belief that Michael Jordan will lose, P(H_lose), to the belief that Michael Jordan will win, P(H_win):
P(H_lose) / P(H_win) = 100 / 5
In this case, for 100 to 5:
P(H_lose) / P(H_win) = 20
Solving Probabilities
We can write our equation like this:
P(H_lose) = 20 * P(H_win)
We read this as: "The probability of Michael Jordan losing is 20 times greater than the probability of Michael Jordan winning."
There are only two possibilities: Michael Jordan winning or losing. Because there are only two outcomes, the probability of winning is just 1 minus the probability of losing, so we can substitute P(H_win) with its value in terms of P(H_lose):
P(H_lose) = 20 * (1 - P(H_lose))
We can expand the equation by multiplying both parts in the parentheses by 20:
P(H_lose) = 20 - 20 * P(H_lose)
We can remove the P(H_lose) term from the right side of the equation by adding 20 * P(H_lose) to both sides:
21 * P(H_lose) = 20
We can divide both sides by 21:
P(H_lose) = 20/21
In case you're wondering why there is a 21: adding the terms on the left gives P(H_lose) + 20 * P(H_lose) = 21 * P(H_lose).
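The same odds-to-probability conversion works for any odds. Here is a Python sketch (the function name is mine, not standard notation):

```python
from fractions import Fraction

def odds_to_probability(odds_for, odds_against):
    """Convert odds of `odds_for` to `odds_against` into a probability."""
    # If P(A) / P(not A) = odds_for / odds_against and the two
    # probabilities sum to 1, then P(A) = odds_for / (odds_for + odds_against).
    return Fraction(odds_for, odds_for + odds_against)

# 100 to 5 odds that Michael Jordan loses:
print(odds_to_probability(100, 5))  # 20/21
```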
Logic of Uncertainty
In logic, there are three important operators:
AND
OR
NOT
Here is an example:
If it is snowing outside AND I’m going outside, I will need a snow jacket.
We can also add multiple operators like:
If it is NOT snowing OR if I'm NOT going outside, I will NOT need a snow jacket.
I already used NOT for probabilistic reasoning in the Michael Jordan example:
P(H_win) = 1 - P(H_lose)
Combining with AND
We use AND as a way to combine events. Such as:
It’s snowing AND you forgot to wear your snow jacket
Going 200 mph AND crashing the car
Let's start with a simple example involving a die and a coin.
Suppose we want to know the probability of getting tails AND rolling a six. We know the probabilities of each:
P(tails) = 1/2
P(six) = 1/6
The probability of both of these occurring is written as:
P(tails, six) = ?
When we flip a coin, there are two outcomes: heads or tails. For each possible coin flip, there are six possible results from the die. Using this, we can count our possible outcomes: there are 2 * 6 = 12 possible outcomes of flipping a coin and rolling a die, and only one of them is tails and a six, so:
P(tails, six) = 1/12
Product Rule of Probability
First, we need the probability of flipping tails. Because the probability of tails is 1/2, we can eliminate half of our possibilities. Then, for the remaining branches, there is a 1/6 chance of rolling a 6 on the die. If we multiply the two probabilities:
P(tails, six) = 1/2 * 1/6 = 1/12
This is the exact same answer as before. The general rule for combining probabilities with AND is this formula:
P(A, B) = P(A) * P(B)
Because we’re multiplying our results, we refer to this as the product rule of probability.
The rule can also be expanded to include more events. We can combine a third probability, P(C), by repeating the formula:
P(A, B, C) = P(A) * P(B) * P(C)
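The product rule is easy to verify by brute force in Python (a quick sketch of the coin-and-die example):

```python
from fractions import Fraction
from itertools import product

# Product rule: P(A, B) = P(A) * P(B) for independent events.
p_tails = Fraction(1, 2)
p_six = Fraction(1, 6)
print(p_tails * p_six)  # 1/12

# Double-check by enumerating every (coin, die) outcome.
outcomes = list(product(["heads", "tails"], range(1, 7)))
favorable = [o for o in outcomes if o == ("tails", 6)]
print(Fraction(len(favorable), len(outcomes)))  # 1/12
```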
Combining with OR
The other rule of logic is OR. Some examples:
Flipping tails on a coin OR rolling a 6 on a die
Snowing OR raining
Catching covid OR catching the flu
This is more complicated because events can either be mutually exclusive or not. Mutually exclusive means that one event happening implies the other possible events can't happen. For example, the possible outcomes of rolling a single die are mutuallyexclusive, since we can't roll both a 6 and a 2 on the same roll. Non-mutually exclusive events can happen at the same time. For example, you can run and sweat at the same time, so both events are happening at once.
Calculating OR
We know that the probability of rolling a 1 on a die is 1/6, and the same is true for rolling a 6:
P(one) = 1/6
P(six) = 1/6
We can add the two probabilities together and see that the combined probability of rolling a 1 OR a 6 is 2/6, or 1/3:
P(one OR six) = 1/6 + 1/6 = 2/6 = 1/3
This simple addition rule only applies to mutually exclusive events. For mutually exclusive events:
P(A, B) = 0
The probability of getting both A and B has to be 0, because one event prevents the other from happening, just as it's impossible to get both heads and tails on a single coin flip.
Using the Sum Rule
The rule for combining non-mutually exclusive probabilities with OR is known as the sum rule of probability:
P(A OR B) = P(A) + P(B) - P(A, B)
So the probability of rolling a six or flipping tails is:
P(tails OR six) = 1/2 + 1/6 - 1/12 = 7/12
where:
1/2 is the probability of flipping tails
1/6 is the probability of rolling a six on the die
1/12 is the probability of both happening, from the product rule (1/2 * 1/6)
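The sum rule can also be checked by enumeration in Python (a sketch of the tails-or-six example):

```python
from fractions import Fraction
from itertools import product

# Sum rule: P(A OR B) = P(A) + P(B) - P(A, B).
p_tails = Fraction(1, 2)
p_six = Fraction(1, 6)
p_both = p_tails * p_six  # product rule, since the events are independent

p_tails_or_six = p_tails + p_six - p_both
print(p_tails_or_six)  # 7/12

# Double-check by enumerating every (coin, die) outcome.
outcomes = list(product(["heads", "tails"], range(1, 7)))
favorable = [(c, d) for c, d in outcomes if c == "tails" or d == 6]
print(Fraction(len(favorable), len(outcomes)))  # 7/12
```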
Binomial Probability
Binomial probability is used to calculate the probability of a certain number of successful outcomes, given the probability of a successful outcome and the number of trials. The "bi" in binomial refers to two possible outcomes: an event happening and an event not happening. If there are more than two outcomes, the distribution is called a multinomial.
A Binomial has three parameters:
k: Number of outcomes we care about
n: Total number of trials
p: the probability of an event happening
So if we were calculating the probability of flipping two tails in three coin tosses:
k = 2, the number of outcomes that we want (two tails)
n = 3, the number of times the coin flipped
p = 1/2, probability of flipping tails
So the notation to express this would be:
B(k; n, p)
which for the three coin tosses is:
B(2; 3, 1/2)
Combinatorics
Combinatorics is just a name for advanced counting. There is an operation in combinatorics called the binomial coefficient, which counts the ways of selecting the outcomes we care about from the total number of trials. The notation looks like this:
(n choose k)
We read this as "n choose k." So for our three coin tosses, it would be "three tosses choose two tails."
The ! means factorial, which is the product of all whole numbers up to and including the number before the ! symbol, so 4! = 4 * 3 * 2 * 1, and 6! = 6 * 5 * 4 * 3 * 2 * 1. The generalized formula for the binomial coefficient is:
(n choose k) = n! / (k! * (n - k)!)
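In Python, you can compute the binomial coefficient directly from factorials, or use the built-in math.comb (a small sketch; the helper name is mine):

```python
import math

# "n choose k": the number of ways to pick which k of the n trials
# are the outcome we care about, i.e. n! / (k! * (n - k)!).
def choose(n, k):
    return math.factorial(n) // (math.factorial(k) * math.factorial(n - k))

print(choose(3, 2))     # 3: the ways to place two tails in three tosses
print(math.comb(3, 2))  # 3: Python's built-in binomial coefficient agrees
```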
We can put it all together and create a more general formula for the number of tails and the number of not-tails:
(n choose k) * P(tails)^k * P(heads)^(n - k)
We can replace P(tails) with p and P(heads) with (1 - p). This gives us a general solution for any k, n, and p:
B(k; n, p) = (n choose k) * p^k * (1 - p)^(n - k)
Now that we have this equation, we can solve any coin-toss probability. For example, let's find the probability of flipping 10 tails in 20 coin tosses:
B(10; 20, 1/2) = (20 choose 10) * (1/2)^10 * (1/2)^10 = 184756 / 1048576 ≈ 0.176
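Putting the formula into Python makes it easy to try other values of k, n, and p (a sketch; the function name is mine):

```python
from math import comb

def binomial_probability(k, n, p):
    # B(k; n, p) = (n choose k) * p^k * (1 - p)^(n - k)
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Two tails in three tosses of a fair coin:
print(binomial_probability(2, 3, 0.5))  # 0.375

# Ten tails in twenty tosses of a fair coin:
print(round(binomial_probability(10, 20, 0.5), 4))  # 0.1762
```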
[End of Part One]
Written by Brian Ball and Zeng.