

Simulation of empirical Bayesian methods (using baseball statistics)
ModelingPredictive Analyticsposted by Dave Robinson September 24, 2017 Dave Robinson

Previously in this series:
- The beta distribution
- Empirical Bayes estimation
- Credible intervals
- The Bayesian approach to false discovery rates
- Bayesian A/B testing
- Beta-binomial regression
- Understanding empirical Bayesian hierarchical modeling
- Mixture models and expectation-maximization
- The ebbr package
We’re approaching the end of this series on empirical Bayesian methods, and have touched on many statistical approaches for analyzing binomial (success / total) data, all with the goal of estimating the “true” batting average of each player. There’s one question we haven’t answered, though: do these methods actually work?
Even if we assume each player has a “true” batting average as our model suggests, we don’t know it, so we can’t see if our methods estimated it accurately. For example, we think that empirical Bayes shrinkage gets closer to the true probabilities than raw batting averages do, but we can’t actually measure the mean-squared error. This means we can’t test our methods, or examine when they work well and when they don’t.
In this post we’ll simulate some fake batting average data, which will let us know the true probabilities for each player, then examine how close our statistical methods get to the true solution. Simulation is a universally useful way to test a statistical method, to build intuition about its mathematical properies, and to gain confidence that we can trust its results. In particular, this post demonstrates the tidyverse approach to simulation, which takes advantage of the dplyr
, tidyr
, purrr
and broom
packages to examine many combinations of input parameters.
Setup
Most of our posts started by assembling some per-player batting data. We’re going to be simulating (i.e. making up) our data for this analysis, so you might think we don’t need to look at real data at all. However, data is still necessary to estimating the parameters we’ll use in the simulation, which keeps the experiment realistic and ensures that our conclusions will be useful.
(Note that all the code in this post can be found here).
library(Lahman)
library(dplyr)
library(tidyr)
# Grab career batting average of non-pitchers
# (allow players that have pitched <= 3 games, like Ty Cobb)
pitchers <- Pitching %>%
group_by(playerID) %>%
summarize(gamesPitched = sum(G)) %>%
filter(gamesPitched > 3)
# include the "bats" (handedness) and "year" column for later
career <- Batting %>%
filter(AB > 0) %>%
anti_join(pitchers, by = "playerID") %>%
group_by(playerID) %>%
summarize(H = sum(H), AB = sum(AB))
Choosing a distribution of p and AB
In the beta-binomial model we’ve been using for most of these posts, there are two values for each player ii:
α0;β0α0;β0 are “hyperparameters”: two unobserved values that describe the entire distribution. pipi is the true batting average for each player- we don’t observe this, but it’s the “right answer” for each batter that we’re trying to estimate. ABiABi is the number of at-bats the player had, which is observed. (You might recall we had a more complicated model in the beta-binomial regression post that had pipi depend on ABiABi: we’ll get back to that).
Our approach is going to be to pick some “true” α0;β0α0;β0, then simulate pipi for each player. Since we’re just picking any α0;β0α0;β0 to start with, we may as well estimate them from our data, since we know those are plausible values (though if we wanted to be more thorough, we could try a few other values and see how our accuracy changes).
To do this estimation, we can use our new ebbr package to fit the empirical Bayes prior.
library(ebbr)
prior <- career %>%
ebb_fit_prior(H, AB)
prior
## Empirical Bayes binomial fit with method mle
## Parameters:
## # A tibble: 1 × 2
## alpha beta
## <dbl> <dbl>
## 1 72.1 215
These two hyperparameters are all we need to simulate a few thousand values of pipi, using the rbeta
function:
alpha0 <- tidy(prior)$alpha
beta0 <- tidy(prior)$beta
qplot(rbeta(10000, alpha0, beta0))
There’s another component to this model: ABiABi, the distribution of the number of at-bats. This is a much more unusual distribution:
ggplot(career, aes(AB)) +
geom_histogram() +
scale_x_log10()
The good news is, we don’t need to simulate these ABiABi values, since we’re not trying to estimate them with empirical Bayes. We can just use the observed values we have! (In a different study, we may be interested in how the success of empirical Bayes depends on the distribution of the nns).
Thus, to recap, we will:
- Estimate α0;β0α0;β0, which works because the parameters are not observed, but there are only a few and we can predict them with confidence.
- Simulate pipi, based on a beta distribution, so that we can test our ability to estimate them.
- Use observed ABiABi, since we know the true values and we might as well.
Shrinkage on simulated data
The beta-binomial model is easy to simulate, with applications of the rbeta
and rbinom
functions:
# always set a seed when simulating
set.seed(2017)
career_sim <-