# Simulation of empirical Bayesian methods (using baseball statistics)

Modeling, Predictive Analytics | Posted by Dave Robinson, September 24, 2017

Previously in this series:

• The beta distribution
• Empirical Bayes estimation
• Credible intervals
• The Bayesian approach to false discovery rates
• Bayesian A/B...

We’re approaching the end of this series on empirical Bayesian methods, and have touched on many statistical approaches for analyzing binomial (success / total) data, all with the goal of estimating the “true” batting average of each player. There’s one question we haven’t answered, though: do these methods actually work?

Even if we assume each player has a “true” batting average as our model suggests, we don’t know it, so we can’t see if our methods estimated it accurately. For example, we think that empirical Bayes shrinkage gets closer to the true probabilities than raw batting averages do, but we can’t actually measure the mean-squared error. This means we can’t test our methods, or examine when they work well and when they don’t.

In this post we’ll simulate some fake batting average data, which will let us know the true probabilities for each player, then examine how close our statistical methods get to the true solution. Simulation is a universally useful way to test a statistical method, to build intuition about its mathematical properties, and to gain confidence that we can trust its results. In particular, this post demonstrates the tidyverse approach to simulation, which takes advantage of the dplyr, tidyr, purrr, and broom packages to examine many combinations of input parameters.

## Setup

Most of our posts started by assembling some per-player batting data. We’re going to be simulating (i.e. making up) our data for this analysis, so you might think we don’t need to look at real data at all. However, real data is still necessary for estimating the parameters we’ll use in the simulation, which keeps the experiment realistic and ensures that our conclusions will be useful.

(Note that all the code in this post can be found here).

### Choosing a distribution of p and AB

In the beta-binomial model we’ve been using for most of these posts, there are two values for each player $i$:

$$p_i \sim \mathrm{Beta}(\alpha_0, \beta_0)$$

$$H_i \sim \mathrm{Binom}(AB_i, p_i)$$

$\alpha_0$ and $\beta_0$ are “hyperparameters”: two unobserved values that describe the entire distribution. $p_i$ is the true batting average for each player. We don’t observe this, but it’s the “right answer” for each batter that we’re trying to estimate. $AB_i$ is the number of at-bats the player had, which is observed. (You might recall we had a more complicated model in the beta-binomial regression post that had $p_i$ depend on $AB_i$; we’ll get back to that.)

Our approach is going to be to pick some “true” $\alpha_0$ and $\beta_0$, then simulate $p_i$ for each player. Since we’re just picking any $\alpha_0$ and $\beta_0$ to start with, we may as well estimate them from our data, since we know those are plausible values (though if we wanted to be more thorough, we could try a few other values and see how our accuracy changes).

To do this estimation, we can use our new ebbr package to fit the empirical Bayes prior.
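The ebbr call itself is not reproduced here. As a rough illustration of what fitting the prior involves, the following base R sketch estimates $\alpha_0$ and $\beta_0$ by maximum likelihood under the beta-binomial model. It is a swapped-in technique, not the ebbr implementation, and the data (`AB`, `H`), the "true" values 80 and 230, and all variable names are made up for the example.

```r
# Hypothetical stand-in for real career data: these numbers are invented
# for illustration and are NOT the estimates from the post.
set.seed(2017)
n_players <- 2000
true_alpha0 <- 80
true_beta0 <- 230
AB <- sample(50:5000, n_players, replace = TRUE)  # made-up at-bat counts
p <- rbeta(n_players, true_alpha0, true_beta0)    # unobserved true averages
H <- rbinom(n_players, AB, p)                     # observed hits

# Negative log-likelihood of the beta-binomial model.
# Parameters are optimized on the log scale so they stay positive.
neg_log_lik <- function(log_par) {
  alpha <- exp(log_par[1])
  beta <- exp(log_par[2])
  -sum(lchoose(AB, H) + lbeta(H + alpha, AB - H + beta) - lbeta(alpha, beta))
}

# Method-of-moments starting values computed from the raw averages.
m <- mean(H / AB)
v <- var(H / AB)
total <- m * (1 - m) / v - 1
start <- log(c(m * total, (1 - m) * total))

fit <- optim(start, neg_log_lik)
alpha0_hat <- exp(fit$par[1])
beta0_hat <- exp(fit$par[2])
c(alpha0 = alpha0_hat, beta0 = beta0_hat)
```

Because the fake data were generated from known hyperparameters, the fitted prior mean `alpha0_hat / (alpha0_hat + beta0_hat)` can be compared directly with `true_alpha0 / (true_alpha0 + true_beta0)`.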

These two hyperparameters are all we need to simulate a few thousand values of $p_i$, using the rbeta function.
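A minimal sketch of that step, assuming placeholder hyperparameters (the values 80 and 230 below are hypothetical, not the ones fitted in the post):

```r
set.seed(1)

# Hypothetical hyperparameters, chosen only for illustration.
alpha0 <- 80
beta0 <- 230

# Draw one "true" batting average per simulated player.
true_p <- rbeta(5000, alpha0, beta0)

# The simulated averages should centre near alpha0 / (alpha0 + beta0).
summary(true_p)
```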

There’s another component to this model: $AB_i$, the distribution of the number of at-bats. This is a much more unusual distribution.

The good news is, we don’t need to simulate these $AB_i$ values, since we’re not trying to estimate them with empirical Bayes. We can just use the observed values we have! (In a different study, we may be interested in how the success of empirical Bayes depends on the distribution of the $n$s.)

Thus, to recap, we will:

• Estimate $\alpha_0$ and $\beta_0$, which works because, although these parameters are not observed, there are only a few of them and we can estimate them with confidence.
• Simulate $p_i$, based on a beta distribution, so that we can test our ability to estimate them.
• Use observed $AB_i$, since we know the true values and we might as well.

## Shrinkage on simulated data

The beta-binomial model is easy to simulate, with applications of the rbeta and rbinom functions:
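The sketch below shows one way this simulation could look in base R, followed by a comparison of raw and shrunken estimates against the known truth. Several things here are assumptions made for the example: the hyperparameters are placeholders, `AB` is a made-up stand-in for the observed at-bats, and the shrinkage uses the true hyperparameters directly rather than re-estimating them from the simulated data.

```r
set.seed(42)

# Hypothetical hyperparameters and at-bat counts (not from the post).
alpha0 <- 80
beta0 <- 230
AB <- sample(1:3000, 5000, replace = TRUE)

# Simulate the beta-binomial model: true averages, then hits.
p <- rbeta(length(AB), alpha0, beta0)
H <- rbinom(length(AB), AB, p)

# Raw batting average versus the empirical Bayes (posterior mean) estimate.
raw <- H / AB
shrunk <- (H + alpha0) / (AB + alpha0 + beta0)

# Because p is known in a simulation, mean-squared error can be measured.
mse_raw <- mean((raw - p)^2)
mse_shrunk <- mean((shrunk - p)^2)
c(raw = mse_raw, shrunk = mse_shrunk)
```

Since the simulated players include many with very few at-bats, the raw averages are noisy and the shrunken estimates should show a lower mean-squared error. A fuller version would fit the prior on the simulated `H` and `AB` before shrinking.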

 
About the author: Dave Robinson. I am a Data Scientist at Stack Overflow. In May 2015 I received my PhD in Quantitative and Computational Biology from Princeton University, where I worked with Professor John Storey. My interests include statistics, data analysis, genomics, education, and programming in R.