Whenever we’re working with data, there is necessarily uncertainty in our results. Firstly, we can’t collect all the possible data, so instead we randomly sample from a population. Accordingly, there is a natural variance and uncertainty in any data we collect. There is also uncertainty from missing data, systematic errors in the experiment or data collection, and many more sources of error, although most studies are designed to minimize these types of errors. We must understand and quantify this uncertainty so we can judge the quality of our final results. If there is low uncertainty in our results, we can be confident in our analysis. On the other hand, if the uncertainty is high, we should be wary.
As a test of our results in the face of uncertainty, traditionally we’d turn to hypothesis testing. For a classic example, in clinical drug trials, you compare a group of people given a drug with a group given a placebo. We’re looking to see if the drug produces a substantial change in some metric (cholesterol levels, for example) compared to the placebo. However, any difference we see might actually be due to the natural variance (uncertainty) in our samples.
Histograms of example data from a drug trial
Typically you’d use hypothesis testing and calculate things like t-statistics and eventually a p-value. However, hypothesis testing and p-values are in general a poor way of quantifying uncertainty. Firstly, p-values have sampling distributions, which means there is uncertainty in the p-value itself. With an effect size of d = 0.3 (that is, the difference of the sample means divided by the standard deviation is 0.3) and roughly 100 people in a trial, you’re about 50% likely to get p < 0.05 from any one experiment. For data like this, whether your experiment is a “success” or not is roughly a coin flip.
Sampling distribution of p for an experiment with effect size d = 0.3 and N = 100, simulated with 10,000 experiments.
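A simulation like this is easy to reproduce yourself. The sketch below simulates many two-group experiments with a true effect size of d = 0.3 (here 100 people per group, chosen for concreteness) and computes a standard two-sample p-value for each, using a normal approximation to the t distribution, which is reasonable at this sample size:

```python
import math
import random

def two_sample_p(a, b):
    """Two-sided p-value for a pooled two-sample t-test, using a normal
    approximation to the t distribution (fine for n around 100)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    sp = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    t = (ma - mb) / (sp * math.sqrt(1 / na + 1 / nb))
    return math.erfc(abs(t) / math.sqrt(2))  # standard normal two-sided tail

random.seed(42)
d, n, trials = 0.3, 100, 2000
hits = 0
for _ in range(trials):
    control = [random.gauss(0.0, 1.0) for _ in range(n)]
    treatment = [random.gauss(d, 1.0) for _ in range(n)]
    if two_sample_p(control, treatment) < 0.05:
        hits += 1

print(f"Fraction of experiments with p < 0.05: {hits / trials:.2f}")
```

Run this and the fraction of “successful” experiments comes out near a half, even though every simulated experiment has a real effect behind it.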
Also, p-values are a function of both the effect size and the sample size. You can have a small effect with a lot of data, or a medium effect with less data, and get the same p-value, say 0.04. Intuitively we trust the result with more data over the result with less data, since we’re more certain of results with more data. But this intuition isn’t reflected in hypothesis testing; p-values are in general opaque. Overall, p-values are not a measure of uncertainty and don’t provide enough information to truly judge our results.
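To make this concrete, here is a quick sketch (the particular effect sizes and sample sizes are invented for illustration) where two very different experiments land on nearly the same p-value:

```python
import math

def approx_p(d, n_per_group):
    """Two-sided p-value for an observed standardized difference d with
    n per group, via the normal approximation z = d * sqrt(n / 2)."""
    z = d * math.sqrt(n_per_group / 2)
    return math.erfc(z / math.sqrt(2))

# A small effect with lots of data ...
print(approx_p(0.2, 200))   # small effect, 200 per group
# ... and a much bigger effect with far less data
print(approx_p(0.63, 20))   # medium-large effect, 20 per group
```

Both come out near p ≈ 0.046, yet the two experiments carry very different amounts of evidence; the p-value alone can’t tell you which to trust more.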
As a response to this sort of criticism, many have promoted parameter estimation as a better method than hypothesis testing. For example, you could estimate the difference between the treatment and placebo groups, along with the 95% confidence interval as a measure of the uncertainty in that estimate. However, confidence intervals are often misinterpreted as meaning there is an X% chance the parameter falls in the interval. Instead, the correct interpretation is obtuse: if you did your experiment an infinite number of times and calculated the confidence interval for each of those experiments, then X% of the time the X% confidence interval would contain the true parameter. For any one experiment, though, there is no probability telling you whether your confidence interval actually contains the true parameter; the parameter is either in the interval or not.
The solution to these problems is Bayesian inference. Here our goal is to represent our beliefs with probability distributions and update our beliefs given data. The less certain we are about some prior belief, the wider the probability distribution representing that belief will be. Conversely, stronger beliefs lead to narrower probability distributions. We’ll use these probability distributions to quantify our knowledge about the parameters including the uncertainty in our conclusions.
For this, we use Bayes’ theorem to model our degrees of belief. That is, we have some beliefs (and knowledge) about these parameters before we see data, then we update our beliefs after seeing the data. Bayes’ theorem relates information given by the data (the data likelihood) with prior knowledge of the parameters before seeing data (called the prior). The data and the priors are used to calculate the posterior, a probability distribution of the parameters after seeing the data.
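As a minimal illustration of this updating, the sketch below uses the textbook conjugate normal-normal case, where the posterior can be written down in closed form (the prior and noise values here are arbitrary, chosen only to show the prior narrowing into a posterior):

```python
import math

def update_normal(mu0, tau0, data, sigma):
    """Posterior for an unknown mean with a Normal(mu0, tau0^2) prior
    and Normal(mu, sigma^2) data, sigma known. Returns (mean, sd)."""
    n = len(data)
    xbar = sum(data) / n
    prec = 1 / tau0**2 + n / sigma**2                    # posterior precision
    mu_post = (mu0 / tau0**2 + n * xbar / sigma**2) / prec
    return mu_post, math.sqrt(1 / prec)

# A vague prior (sd = 10) updated with a handful of observations near 1.0
mu, sd = update_normal(mu0=0.0, tau0=10.0,
                       data=[0.9, 1.1, 1.0, 1.2, 0.8], sigma=1.0)
print(mu, sd)  # posterior mean pulled to the data; posterior sd shrinks
```

The wide prior contributes almost nothing here, so the posterior concentrates near the sample mean, and collecting more data would shrink the posterior standard deviation further, which is exactly the sense in which the distribution quantifies our remaining uncertainty.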
With the model defined (our priors and data likelihood), we sample from the posterior distribution using Markov Chain Monte Carlo (MCMC) methods such as NUTS (the No-U-Turn Sampler). This gives us an actual probability distribution for parameters like the means of the control and treatment groups. We also get a distribution for the difference of these means from which we can get a 95% probability interval (called the credible region in Bayesian statistics) as well as calculate the probability that the difference is greater than zero.
Posterior distributions for the means of the control and treatment groups, as well as the difference in those means. For each distribution, the 95% Credible Region (CR) is labeled, as well as the mean of the distribution. For the difference of means, probabilities of the difference being less than or greater than 0 are provided (the difference is 96% likely to be greater than 0).
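To show the idea end to end, the sketch below uses a toy random-walk Metropolis sampler rather than NUTS (which is far more sophisticated, and what PyMC would actually run for you); the data, priors, and step size are all invented for illustration. It draws posterior samples for the two group means, then summarizes their difference with a 95% interval and the probability that the difference is greater than zero:

```python
import math
import random
import statistics

random.seed(1)
# Fake trial data: unit-variance groups with a true difference of 0.4
control = [random.gauss(0.0, 1.0) for _ in range(50)]
treatment = [random.gauss(0.4, 1.0) for _ in range(50)]

def log_post(mu, data, sigma=1.0, prior_sd=10.0):
    """Unnormalized log posterior: wide Normal(0, prior_sd^2) prior on the
    mean plus a Normal likelihood with known noise sd sigma."""
    lp = -mu**2 / (2 * prior_sd**2)
    lp += -sum((x - mu) ** 2 for x in data) / (2 * sigma**2)
    return lp

def metropolis(data, steps=6000, burn=1000, step_size=0.3):
    mu = statistics.mean(data)
    samples = []
    for _ in range(steps):
        prop = mu + random.gauss(0, step_size)   # random-walk proposal
        delta = log_post(prop, data) - log_post(mu, data)
        if delta >= 0 or random.random() < math.exp(delta):
            mu = prop                            # accept the proposal
        samples.append(mu)
    return samples[burn:]

mu_c = metropolis(control)
mu_t = metropolis(treatment)
diff = sorted(t - c for t, c in zip(mu_t, mu_c))
lo, hi = diff[int(0.025 * len(diff))], diff[int(0.975 * len(diff))]
p_gt0 = sum(d > 0 for d in diff) / len(diff)
print(f"95% interval for the difference: ({lo:.2f}, {hi:.2f}), "
      f"P(diff > 0) = {p_gt0:.2f}")
```

The interval and P(diff > 0) fall straight out of the posterior samples; no reference to imaginary repeated experiments is needed, which is exactly what the confidence-interval interpretation lacked.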
Using Bayesian inference, we are able to truly quantify the uncertainty in our results. From these posterior distributions, we get estimates of the parameters with actual probabilities which we can use to reason about our results and judge their validity. With Python packages such as PyMC and Sampyl, anyone can start using Bayesian inference. This is an essential skill scientists and analysts should learn to properly address the challenges in working with our uncertain world.
To learn more, tune into Mat Leonard’s webinar on Thursday, June 7th.