# The Empirical Derivation of the Bayesian Formula

Guest contributor post by Jannes Klaas, June 18, 2019

Deep learning has been made practical through modern computing power, but it is not the only technique benefiting from this large increase in power. Bayesian inference is an up-and-coming technique whose recent progress is also powered by the increase in computing power.


Bayesian inference and probabilistic programming are two upcoming techniques whose recent progress is powered by increased computing power. Advances in the field have received significantly less press coverage than deep learning, yet to the financial practitioner, they might be even more useful. Bayesian models are interpretable and can express uncertainty naturally. They are less of a “black box” and make the modeler’s assumptions explicit.

This is an excerpt from the book *Machine Learning for Finance* written by Jannes Klaas. This book explores new advances in machine learning and shows how they can be applied in the financial sector. It explains the concepts and algorithms behind the main machine learning techniques and provides example Python code for implementing the models yourself. In this article, we’ll look into the mathematical expression of the Bayes formula.

*[Related Article: Bayesian Estimation, Group Comparison, and Workflow]*

## An intuitive guide to Bayesian inference

Before starting, we need to import NumPy and matplotlib; we can do this by running the following code:

```python
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
```

**Note:** This example is similar to one given in the great *Bayesian Methods for Hackers* but adapted to a financial context and rewritten so that mathematical concepts intuitively arise from code.

Imagine you have a security that can either pay $1 or nothing. The payoff depends on a two-step process: with 50% probability, the payoff is random, giving you a 50% chance of getting $1 and a 50% chance of getting nothing. With the other 50% probability, you get the dollar with a **true payoff probability** (**TPP**) *x*.

The payoff scheme works as follows: at **Start**, a fair coin flip decides the branch; in the **Random Payoff** branch, a second fair coin flip pays $1 or nothing, while in the other branch you receive $1 with probability *x*.
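In code, the two-step scheme for a single security looks like this (a minimal sketch; the helper name `single_payoff` and the random seed are ours, not from the book):

```python
import numpy as np

def single_payoff(x, rng):
    """Pay off one security under the two-step scheme."""
    if rng.random() < 0.5:              # Start: pick a branch with a fair coin
        return int(rng.random() < 0.5)  # Random Payoff branch: 50/50 chance of $1
    return int(rng.random() < x)        # Other branch: $1 with probability x (the TPP)

rng = np.random.default_rng(42)
payoffs = [single_payoff(0.54, rng) for _ in range(100_000)]
# The expected payoff rate is 0.5 * 0.5 + 0.5 * x
```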

You are interested in finding out what the true payoff ratio is, as it will inform your trading strategy. Your boss allows you to buy 100 units of the security. You do, and 54 of the 100 securities pay you a dollar.

But what is the actual TPP? In this case, there is an analytical solution to calculate the most likely TPP, but we will use a computational method, which also works for more complicated cases. We will **simulate** the security payoff process.
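For reference, the analytical point estimate is easy here. Each security pays with probability $0.25 + x/2$, so matching the observed share of payoffs gives (a sketch of the arithmetic, following the payoff scheme above):

```latex
P(\text{payoff} \mid x) = \frac{1}{2}\cdot\frac{1}{2} + \frac{1}{2}\,x = 0.25 + \frac{x}{2},
\qquad
0.25 + \frac{\hat{x}}{2} = 0.54 \;\Rightarrow\; \hat{x} = 0.58.
```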

## Flat prior

The variable *x* represents the TPP. We randomly sample 100 “truth” values, which are one if you would have gotten the dollar under the true payoff probability and zero otherwise. We also sample the two random choices at **Start** and **Random Payoff** in the preceding scheme. It is computationally more efficient to sample the random outcomes in one go for all securities, even though they are not all needed.

The result is then the truth value where the first random choice is one, and the second random choice’s value where the first random choice is zero. Finally, we sum up the payoffs and divide by the number of securities in our simulation to obtain the share of payoffs in the simulation.

The following code snippet runs one simulation. It’s important though to make sure that you understand how the computations follow from our security structure:

```python
def run_sim(x):
    # Payoffs under the true payoff probability x
    truth = np.random.uniform(size=100) < x
    # The two coin flips at Start and Random Payoff
    first_random = np.random.randint(2, size=100)
    second_random = np.random.randint(2, size=100)
    # Take the truth payoff where first_random is one,
    # the random payoff where it is zero, then average
    res = np.sum(first_random * truth + (1 - first_random) * second_random) / 100
    return res
```

We would like to try out many possible TPPs. So we sample a candidate TPP and run the simulation with the candidate probability. If the simulation outputs the same payoff as we observed in real life, then our candidate is a real possibility. The following `sample` method returns a real possibility, or None if the candidate it tried out was not suitable:

```python
def sample(data=0.54):
    x = np.random.uniform()
    if run_sim(x) == data:
        return x
```

We have to sample many possible TPPs. To do this faster, we use a library called joblib, which helps with parallel execution. We need to import the `Parallel` class, which helps to run loops in parallel, and the `delayed` method, which helps to execute functions in order inside the parallel loop:

```python
from joblib import Parallel, delayed
```

The details are not relevant for this example, but `Parallel(n_jobs=-1)` makes the job run with as many parallel executions as there are CPUs on the machine. `delayed(sample)() for i in range(100000)` runs the `sample` method 100,000 times.

We obtain a Python list `t`, which we turn into a NumPy array. As you can see in the following code snippet, about 98% of the array are None values (NaN after conversion to float). That means that 98% of the values the sampler tried out for *x* did not yield results matching our data:

```python
t = Parallel(n_jobs=-1)(delayed(sample)() for i in range(100000))
t = np.array(t, dtype=float)
share = np.sum(np.isnan(t)) / len(t) * 100
print(f'{share:.2f}% are throwaways')
```

We’ll now throw away all None values, leaving us with the possible values for *x*:

```python
t_flat = t[~np.isnan(t)]
plt.hist(t_flat, bins=30, density=True)
plt.title('Distribution of possible TPPs')
plt.xlim(0, 1);
```

As you can see, there is a *distribution* of possible TPPs. The most likely TPP is somewhere around 50% to 60%. Other values are possible but less likely.

Here you already see one big advantage of Bayesian methods: all estimates come as distributions, for which we can calculate credible intervals (the Bayesian counterpart of confidence intervals). This allows us to be more precise about how sure we are about things, and about which other values the parameters in our model could take. In financial applications, where millions are staked on the outputs of models, it is very advantageous to quantify such uncertainty.
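In practice, a credible interval is just a pair of percentiles of the posterior sample. A minimal sketch, using a stand-in Beta-distributed sample in place of `t_flat` (since the exact values depend on the simulation run):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for t_flat: a posterior-like sample concentrated around 0.55
posterior_sample = rng.beta(55, 47, size=2000)

# A 95% credible interval: the central 95% of the posterior sample
lo, hi = np.percentile(posterior_sample, [2.5, 97.5])
```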

## < 50% prior

You take your results to your boss, who is a domain expert on the security you are trading. He looks at your analysis and shakes his head. “The TPP cannot be more than 0.5,” he explains. “From the underlying business, it’s physically impossible to do more than that.”

How can you incorporate this fact into your simulation analysis? Well, the straightforward solution is to only try out candidate TPPs from zero to 0.5. All you have to do is limit the space from which you sample candidate values of *x*, which can be achieved by running the following code block:

```python
def sample(data=0.54):
    x = np.random.uniform(low=0, high=0.5)
    if run_sim(x) == data:
        return x
```

Now you can run the simulations exactly as before:

```python
t = Parallel(n_jobs=-1)(delayed(sample)() for i in range(100000))
t = np.array(t, dtype=float)
# Optional
share = np.sum(np.isnan(t)) / len(t) * 100
print(f'{share:.2f}% are throwaways')
```

```
99.10% are throwaways
```

```python
t_cut = t[~np.isnan(t)]
plt.hist(t_cut, bins=15, density=True)
plt.title('Distribution of possible TPPs')
plt.xlim(0, 1);
```

## Prior and posterior

Clearly, your choice of values to try out influenced the outcome of your simulation analysis. The choice also reflected your beliefs about possible values of *x*.

The first time around, you believed that all TPPs between 0 and 100% were equally likely before seeing any data. This is called a “flat” prior as the distribution of values is the same for all values and therefore “flat.” The second time, you believed that the TPPs had to be below 50%.

The distribution expressing your beliefs about *x* before seeing data is called the prior distribution, *P(TPP)*, or just the prior. The distribution of possible values of *x* obtained from simulation, that is, after seeing data *D*, is called the posterior distribution, *P(TPP|D)*, or just the posterior.

The following plots show samples from prior and posterior for the first and second rounds. The first plot shows the results with a flat prior:

```python
flat_prior = np.random.uniform(size=1000000)
plt.hist(flat_prior, bins=10, density=True, label='Prior')
plt.hist(t_flat, bins=30, density=True, label='Posterior')
plt.title('Distribution of $x$ with no assumptions')
plt.legend()
plt.xlim(0, 1);
```

The next plot shows the output of our sampler with a <50% prior. It is still the same sampler, but the outcome is quite different:

```python
cut_prior = np.random.uniform(low=0, high=0.5, size=1000000)
plt.hist(cut_prior, bins=10, density=True, label='Prior')
plt.hist(t_cut, bins=15, density=True, label='Posterior')
plt.title('Distribution of $x$ assuming TPP <50%')
plt.legend()
plt.xlim(0, 1);
```

Now notice something curious: the posterior values of the second round are roughly equal to the posterior values of the first round, but cut off at 0.5. This is because the second-round prior is zero for values above 0.5 and uniform everywhere else.

As we only keep simulation results that match the data, the number of kept simulation results shown in the histogram reflects the probability of running a simulation that yields the observed data *D* for a given TPP, *P(D|TPP)*.

The posterior probabilities *P(TPP|D)* we obtain from our simulations are equal to the probability that we observe the data when trying out a given TPP, *P(D|TPP)*, times the probability that we try out that TPP, *P(TPP)*. Mathematically: *P(TPP|D) = P(D|TPP)P(TPP)*.

When the data is naturally obtained, we might need to account for biases in our data collection method. Most of the time we do not have to worry about this and can simply leave it out, but sometimes the measurement can amplify certain outcomes. To mitigate this, we divide by the data distribution *P(D)* as a final addition to our posterior formula and arrive at:

*P(TPP|D) = P(D|TPP)P(TPP) / P(D)*


It’s Bayes’ formula! When running our simulation, we are sampling from the posterior. So why can’t we just use Bayes’ formula to calculate the posterior directly? Because evaluating *P(D)* requires integrating over all possible values of *TPP*, which is intractable in general. Our simulation method is a simple and convenient workaround.
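In this one-parameter case, though, we can cross-check the simulation against a grid approximation of Bayes’ formula. A sketch, assuming each security pays independently with probability 0.25 + x/2 (which follows from the payoff scheme), so that the likelihood of 54 payoffs out of 100 is binomial:

```python
import math
import numpy as np

xs = np.linspace(0, 1, 501)        # grid of candidate TPPs
ps = 0.25 + xs / 2                 # payoff probability per security given x
# Binomial likelihood of observing 54 payoffs out of 100, P(D|TPP)
likelihood = np.array([math.comb(100, 54) * p**54 * (1 - p)**46 for p in ps])
prior = np.ones_like(xs)           # flat prior, P(TPP)
posterior = likelihood * prior
posterior /= posterior.sum()       # normalization plays the role of P(D)
x_map = xs[np.argmax(posterior)]   # most probable TPP
```

The most probable TPP on the grid lands near 0.58, matching the analytical estimate and the peak of the simulated histogram.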