# Simulation of empirical Bayesian methods (using baseball statistics)

Modeling, Predictive Analytics | Posted by Dave Robinson, September 24, 2017

Previously in this series:

• The beta distribution
• Empirical Bayes estimation
• Credible intervals
• The Bayesian approach to false discovery rates
• Bayesian A/B...

We’re approaching the end of this series on empirical Bayesian methods, and have touched on many statistical approaches for analyzing binomial (success / total) data, all with the goal of estimating the “true” batting average of each player. There’s one question we haven’t answered, though: do these methods actually work?

Even if we assume each player has a “true” batting average as our model suggests, we don’t know it, so we can’t see if our methods estimated it accurately. For example, we think that empirical Bayes shrinkage gets closer to the true probabilities than raw batting averages do, but we can’t actually measure the mean-squared error. This means we can’t test our methods, or examine when they work well and when they don’t.

In this post we’ll simulate some fake batting average data, which will let us know the true probabilities for each player, then examine how close our statistical methods get to the true solution. Simulation is a universally useful way to test a statistical method, to build intuition about its mathematical properties, and to gain confidence that we can trust its results. In particular, this post demonstrates the tidyverse approach to simulation, which takes advantage of the dplyr, tidyr, purrr, and broom packages to examine many combinations of input parameters.

## Setup

Most of our posts started by assembling some per-player batting data. We’re going to be simulating (i.e. making up) our data for this analysis, so you might think we don’t need to look at real data at all. However, real data is still necessary for estimating the parameters we’ll use in the simulation, which keeps the experiment realistic and ensures that our conclusions will be useful.

(Note that all the code in this post can be found here).

### Choosing a distribution of p and AB

In the beta-binomial model we’ve been using for most of these posts, there are two values for each player $i$:

$$p_i \sim \mathrm{Beta}(\alpha_0, \beta_0)$$

$$H_i \sim \mathrm{Binom}(AB_i, p_i)$$

$\alpha_0$ and $\beta_0$ are “hyperparameters”: two unobserved values that describe the entire distribution. $p_i$ is the true batting average for each player: we don’t observe this, but it’s the “right answer” for each batter that we’re trying to estimate. $AB_i$ is the number of at-bats the player had, which is observed. (You might recall we had a more complicated model in the beta-binomial regression post that had $p_i$ depend on $AB_i$; we’ll get back to that.)

Our approach is going to be to pick some “true” $\alpha_0$ and $\beta_0$, then simulate $p_i$ for each player. Since we’re just picking any $\alpha_0$ and $\beta_0$ to start with, we may as well estimate them from our data, since we know those are plausible values (though if we wanted to be more thorough, we could try a few other values and see how our accuracy changes).

To do this estimation, we can use our new ebbr package to fit the empirical Bayes prior.
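As a rough illustration of what "fitting the prior" means, here is a base R sketch that uses the method of moments. This is a swapped-in technique chosen so that it runs without extra packages, and it is not necessarily the algorithm ebbr uses. The function name `estimate_beta_mm` and the values 100 and 290 are hypothetical. The check below draws proportions directly from a known beta distribution and confirms that the estimates land near the values used to generate them.

```r
# Method-of-moments estimate of Beta(alpha0, beta0) from a vector of proportions.
# A stand-in for the ebbr fit, not its algorithm.
estimate_beta_mm <- function(x) {
  m <- mean(x)
  v <- var(x)
  # For a beta distribution, var = m * (1 - m) / (alpha + beta + 1),
  # so alpha + beta = m * (1 - m) / v - 1
  total <- m * (1 - m) / v - 1
  c(alpha0 = m * total, beta0 = (1 - m) * total)
}

set.seed(2017)
# Hypothetical "true averages" drawn from a known prior, to check recovery
fake_averages <- rbeta(10000, 100, 290)
mm_fit <- estimate_beta_mm(fake_averages)
mm_fit
```

Applied to raw batting averages (H / AB) rather than to true probabilities, this estimator would overstate the variance of the prior, because raw averages also carry binomial noise.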

These two hyperparameters are all we need to simulate a few thousand values of $p_i$, using the rbeta function:
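A minimal sketch of that step follows. The hyperparameter values and the number of players are placeholders chosen for illustration, not the values fitted from the real batting data.

```r
set.seed(2017)

# Placeholder hyperparameters (assumed for illustration, not the fitted values)
alpha0 <- 100
beta0 <- 290
n_players <- 5000

# One simulated "true" batting average per player
true_p <- rbeta(n_players, alpha0, beta0)

summary(true_p)
```

The simulated values should cluster around alpha0 / (alpha0 + beta0), the mean of the beta prior.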

There’s another component to this model: $AB_i$, the distribution of the number of at-bats. This is a much more unusual distribution:

The good news is, we don’t need to simulate these $AB_i$ values, since we’re not trying to estimate them with empirical Bayes. We can just use the observed values we have! (In a different study, we may be interested in how the success of empirical Bayes depends on the distribution of the $n$s.)

Thus, to recap, we will:

• Estimate $\alpha_0$ and $\beta_0$, which works because the parameters are not observed, but there are only a few and we can predict them with confidence.
• Simulate $p_i$, based on a beta distribution, so that we can test our ability to estimate them.
• Use observed $AB_i$, since we know the true values and we might as well.

## Shrinkage on simulated data

The beta-binomial model is easy to simulate, with applications of the rbeta and rbinom functions:
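The sketch below puts the pieces together in base R under several assumptions. The hyperparameters are placeholders, the `AB` vector is made up (the post uses the observed at-bats instead), and the shrinkage step plugs in the true hyperparameters rather than re-estimating them from the simulated data. It compares the mean-squared error of raw averages with that of the shrunken estimate $(H_i + \alpha_0) / (AB_i + \alpha_0 + \beta_0)$.

```r
set.seed(2017)

# Placeholder hyperparameters (assumed for illustration, not the fitted values)
alpha0 <- 100
beta0 <- 290
n_players <- 5000

# Made-up at-bat counts spanning several orders of magnitude;
# the post uses the observed AB values here
AB <- ceiling(10 ^ runif(n_players, 0, 4))

# Simulate each player's true average, then their hits given AB
true_p <- rbeta(n_players, alpha0, beta0)
H <- rbinom(n_players, size = AB, prob = true_p)

sim <- data.frame(AB = AB, true_p = true_p, H = H)

# Raw batting average versus the shrunken (posterior mean) estimate
sim$raw <- sim$H / sim$AB
sim$shrunken <- (sim$H + alpha0) / (sim$AB + alpha0 + beta0)

# Mean-squared error of each estimate against the known truth
mse <- c(
  raw = mean((sim$raw - sim$true_p) ^ 2),
  shrunken = mean((sim$shrunken - sim$true_p) ^ 2)
)
mse
```

Because the true probabilities are known in a simulation, the error of each method can be measured directly, which is not possible with the real data.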

 
About the author: Dave Robinson. I am a Data Scientist at Stack Overflow. In May 2015 I received my PhD in Quantitative and Computational Biology from Princeton University, where I worked with Professor John Storey. My interests include statistics, data analysis, genomics, education, and programming in R.