Previously in this series:
- The beta distribution
- Empirical Bayes estimation
- Credible intervals
- The Bayesian approach to false discovery rates
- Bayesian A/B testing
- Beta-binomial regression
- Understanding empirical Bayesian hierarchical modeling
- Mixture models and expectation-maximization
- The ebbr package
We’re approaching the end of this series on empirical Bayesian methods, and have touched on many statistical approaches for analyzing binomial (success / total) data, all with the goal of estimating the “true” batting average of each player. There’s one question we haven’t answered, though: do these methods actually work?
Even if we assume each player has a “true” batting average as our model suggests, we don’t know it, so we can’t see if our methods estimated it accurately. For example, we think that empirical Bayes shrinkage gets closer to the true probabilities than raw batting averages do, but we can’t actually measure the mean-squared error. This means we can’t test our methods, or examine when they work well and when they don’t.
In this post we’ll simulate some fake batting average data, which will let us know the true probabilities for each player, then examine how close our statistical methods get to the true solution. Simulation is a universally useful way to test a statistical method, to build intuition about its mathematical properties, and to gain confidence that we can trust its results. In particular, this post demonstrates the tidyverse approach to simulation, which takes advantage of the broom package to examine many combinations of input parameters.
Most of our posts started by assembling some per-player batting data. We’re going to be simulating (i.e. making up) our data for this analysis, so you might think we don’t need to look at real data at all. However, data is still necessary for estimating the parameters we’ll use in the simulation, which keeps the experiment realistic and ensures that our conclusions will be useful.
(Note that all the code in this post can be found here).
Choosing a distribution of p and AB
In the beta-binomial model we’ve been using for most of these posts, there are two values for each player $i$:

$$p_i \sim \mbox{Beta}(\alpha_0, \beta_0)$$

$$H_i \sim \mbox{Binom}(\mbox{AB}_i, p_i)$$

- $\alpha_0$ and $\beta_0$ are “hyperparameters”: two unobserved values that describe the entire distribution.
- $p_i$ is the true batting average for each player. We don’t observe this, but it’s the “right answer” for each batter that we’re trying to estimate.
- $\mbox{AB}_i$ is the number of at-bats the player had, which is observed. (You might recall we had a more complicated model in the beta-binomial regression post that had $p_i$ depend on $\mbox{AB}_i$; we’ll get back to that.)
Our approach is going to be to pick some “true” $\alpha_0$ and $\beta_0$, then simulate $p_i$ for each player. Since we’re just picking any $\alpha_0$ and $\beta_0$ to start with, we may as well estimate them from our data, since we know those are plausible values (though if we wanted to be more thorough, we could try a few other values and see how our accuracy changes).
To do this estimation, we can use our new ebbr package to fit the empirical Bayes prior.
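A sketch of that fit, assuming `career` is the per-player data frame with hit and at-bat columns `H` and `AB` used throughout this series:

```r
library(dplyr)
library(ebbr)

# Fit the beta prior across all players by maximum likelihood;
# `career` is assumed to have one row per player with columns H and AB
prior <- career %>%
  ebb_fit_prior(H, AB)

# tidy() shows the estimated hyperparameters (alpha and beta)
tidy(prior)
```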
These two hyperparameters are all we need to simulate a few thousand values of $p_i$, using the rbeta function.
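For instance, a sketch with round hypothetical values standing in for the fitted hyperparameters:

```r
set.seed(2017)

# hypothetical hyperparameters standing in for the fitted alpha0 and beta0
alpha0 <- 70
beta0 <- 200

# simulate 10,000 "true" batting averages from the beta prior
sim_p <- rbeta(10000, alpha0, beta0)

# the simulated averages should center near alpha0 / (alpha0 + beta0)
mean(sim_p)
hist(sim_p)
```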
There’s another component to this model: $\mbox{AB}_i$, the distribution of the number of at-bats. This is a much more unusual distribution:
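One way to look at it (a sketch, again assuming the same `career` data frame): at-bat counts span several orders of magnitude and are heavily right-skewed, so a histogram on a log scale is more informative than a linear one.

```r
library(ggplot2)

# AB ranges from a handful to thousands of at-bats,
# so plot the distribution on a log10 x-axis
ggplot(career, aes(AB)) +
  geom_histogram(bins = 50) +
  scale_x_log10()
```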
The good news is, we don’t need to simulate these $\mbox{AB}_i$ values, since we’re not trying to estimate them with empirical Bayes. We can just use the observed values we have! (In a different study, we might be interested in how the success of empirical Bayes depends on the distribution of the sample sizes.)
Thus, to recap, we will:
- Estimate $\alpha_0$ and $\beta_0$, which works because the parameters are not observed, but there are only a few and we can estimate them with confidence.
- Simulate $p_i$, based on a beta distribution, so that we can test our ability to estimate them.
- Use observed $\mbox{AB}_i$, since we know the true values and we might as well.
Shrinkage on simulated data
The beta-binomial model is easy to simulate, with applications of the rbeta and rbinom functions.
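A sketch of that simulation, with hypothetical hyperparameter values in place of the fitted ones, and assuming `career` supplies the observed `AB` column:

```r
library(dplyr)

set.seed(2017)

# hypothetical hyperparameters standing in for the fitted values
alpha0 <- 70
beta0 <- 200

# for each player, simulate a true average p from the beta prior,
# then simulate hits H from a binomial given the observed AB and that p
career_sim <- career %>%
  mutate(p = rbeta(n(), alpha0, beta0),
         H = rbinom(n(), AB, p))
```

From here, we can run shrinkage methods on the simulated `H` and `AB` columns and compare their estimates to the known `p`.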