From Idea to Insight: Using Bayesian Hierarchical Models to Predict Game Outcomes Part 2 From Idea to Insight: Using Bayesian Hierarchical Models to Predict Game Outcomes Part 2
What’s the best way to model the probability that one player beats another in a digital game a client of your... From Idea to Insight: Using Bayesian Hierarchical Models to Predict Game Outcomes Part 2

What’s the best way to model the probability that one player beats another in a digital game a client of your employer designed? This is the second of a two-part series in which you’re a data scientist at a fictional mobile game development company that makes money by monetizing digital gameplay of any kind. The last article summarized the data and EDA. In this one, I’ll select and dial in Bayesian hierarchical models and write a function that spits out the probability a player beats another given their matchup history. 

[Related Article: From Idea to Insight: Using Bayesian Hierarchical Models to Predict Game Outcomes Part 1]

The Function

Write a function that takes the following inputs:

    * Player 1

    * Player 2

    * Previous matches: Array of matches that previously occurred, where each

        match is represented as a 3-tuple containing the two participating

        players, along with the outcome (the string “player 1” if the first

        player won, or “player 2” if the second player won).

The function should use the given match history to predict the outcome of the match. In particular, it should output an estimate of the probability that player 1 will win.

Model Selection

Before I dive into coding the function, I need to investigate my options for predicting outcomes. In ascending order of complexity, ideas include:

  1. Naive heuristic where the predicted winner is always the player with greater skill.
  2. Logistic linear regression
  3. Logistic Bayesian Hierarchical model

Let’s establish a baseline for accuracy based on the naive heuristic of the predicted winner always being the more skilled. To start, I’ll split the total data into a training, test, and validation sets.

Choices will be compared using: * Standard metrics * Performance vs naive approach * # of upsets

Get baseline accuracy based on naive strategy.

The baseline should include as much data as possible so it’s as representative as possible, therefore I use x, not x_train. Roughly 42% of matchups can be predicted using only the player with the higher skill. Now that I have a baseline, I’ll run more sophisticated—hopefully more accurate—strategies.

Naive Model

Do I even need a multi-level model? I’ll test if it’s any better than a regular linear frequentist logistic regression with all the players pooled together, unlike its Bayesian counterpart which models un-pooled players.

The p-value from a likelihood ratio test in which the null hypothesis is the simpler, un-nested model is no worse than the hierarchical model is p < .0000001, confirming that the multi-level model is worth the complexity because it’s significantly better than a null model.

Chosen Model

I propose a Bayesian multilevel model of game outcomes to reflect the player-game hierarchy latent in our data. This will let me estimate outcomes between players without enough matchup history by borrowing information from their peers.

where μjis a vector of n linear predictors containing varying slopes and intercepts; Xj is an n x p matrix, with β being a vector of p regression coefficients; XSj is an n x q subset of X for q

random effects, which are represented by γj, a vector of q random effects. The γj random intercepts and slopes are drawn from a normal distribution with mean zero and correlate via covariance matrix ψ.

The random effects γj are expressed this way to highlight that (i) they are the result of a multivariate normal process (putting the “random” in “random effects”) and (ii) they are interpreted as how much the effect γjq from associate j differs from the typical effect of β. In fact, if we factor the Xj, the coefficients in the model are β+γj for each associate j. Yj contains estimates of the probability player j will win game inj. But each ith game within player j has probability yij modeled as

Notice: (i) the three fixed effects: the all-player intercept, μα; and unchanging slopes for SkillSpread represented as β’s; (ii) the random effects: a player j specific intercept, αj, and slope for SkillSpread andSkillSpread2 that vary by player; and (iii) the extent to which a factor influences game outcome (the coefficients) are themselves being modeled.

Model Summary

I fit a Bayesian logistic hierarchical model (estimated using MCMC sampling with 4 chains of 2,000 iterations and a warmup of 1000) to predict a victory for player 1 using skill spread and a quadratic term for skill spread. The model included player 1 as random effects (formula = ~ spread_12 | player1_id). Priors over parameters were set as normal (mean = 0.00, SD = 0.003) distributions. The Region of Practical Equivalence (ROPE) percentage was defined as the proportion of the posterior distribution within the [-0.18, 0.18] range. The 89% Credible Intervals (CIs) were based on Highest Density Intervals (HDI). Parameters were scaled by the mean and the SD of the response variable. Effect sizes were labeled following Cohen’s (1988) recommendations.

The model’s explanatory power is moderate (R2’s median = 0.20, 89% CI [0.17, 0.23]. Within this model, the explanatory power related to the fixed effects alone (marginal R2’s median) is of 0.195 (89% CI [0.17, 0.23]). The model’s intercept, corresponding to player 1 losing (outcome = 0) to an equal-skilled player two (skill spread = 0), is at -0.1 (probability of direction = 86%, sd = 0.1).

Within this model: The effect of skill spread on outcome has a probability of 100% of being positive and can be considered as medium (median = 0.017, 89% CI [0.015, 0.019]) and not significant (89% in ROPE).

Ok, So How Do I Actually Use it?

This section dives into actually using the model to predict game outcomes.

I built 3 functions, 2 of which—is_in_history() and get_historical_matchups()—are helper functions for the workhorse function that predicts game outcomes: predict_play1_wins()

Using predict_play1_wins()

# example usage:

alice   = Player('Alice',   254)

bob     = Player('Bob',     469)

charlie = Player('Charlie', 118)

dave    = Player('Dave',     67)

eve     = Player('Eve',     148)


previous_matches = [(alice,   bob, 'player 2'),

                    (charlie, eve,  'player 1'),

                    (charlie, dave, 'player 2'),

                    # ... many more entries ...

                    (alice,   dave, 'player 1')]


# arpad and mark are not in the list of previous matches

arpad = Player('Arpad', 365)

mark = Player('Mark', 341)


result = predict_outcome(arpad, mark, previous_matches)

print('The probability that Arpad will beat Mark is:', result)

To predict a single matchup, run:

predict_play1_wins(576, 129, historical_matchups[[1]])

Output will include the probability of player 1 winning, and will look like this:

##      X player1_id player2_id player1_skill player2_skill winner_id

## 1 1434        576 129       286.6 251.6   129

## 2 1768        576 129       289.4 275.6   576

##   spread_12    matchup p1_wins prob_1_wins

## 1      35.0 576 vs 129       0 0.694

## 2      13.8 576 vs 129       1 0.570

To predict multiple matchups, run:

pmap_dfr(.l = list(unique_matchups$p1, 



     .f = predict_play1_wins) -> predicted_matchups

If a player ID hasn’t played before, get_historical_matchups() will create an NA in the winner column of the 3-tuple that’s passed to the historical_matchups argument in predict_play1_wins(). The function works like this:

  1. If either player hasn’t played the other—or if either player has no history at all—the assigned probability is computed as either:
  • the population mean of winning across all the games all player 1’s have played, if both players lack any history
  • the probability of the player winning across all games they’ve played against everyone else, sans the player who lacks history.

2. If both players have played each other at least once, the probability of player 1 winning is estimated using the mean of winning over 500 draws from the posterior distribution derived from the Bayesian logistic hierarchical model.

## Confusion Matrix and Statistics


##           Reference

## Prediction   0 1

##          0 28 1

##          1 130 211


##                Accuracy : 0.6459               

##                  95% CI : (0.5948, 0.6947)     

##     No Information Rate : 0.573                

##     P-Value [Acc > NIR] : 0.002498             


##                   Kappa : 0.1925               


##  Mcnemar's Test P-Value : < 0.00000000000000022


##             Sensitivity : 0.9953               

##             Specificity : 0.1772               

##          Pos Pred Value : 0.6188               

##          Neg Pred Value : 0.9655               

##              Prevalence : 0.5730               

##          Detection Rate : 0.5703               

##    Detection Prevalence : 0.9216               

##       Balanced Accuracy : 0.5862               


##        'Positive' Class : 1                    


Discussion & Limitations

Bayesian stats affords our model flexibility to be more confident in predictions about matchups with more history. Less history introduces more uncertainty.

Why predictions might be off—adapting strategies among players with > 1 game who lose first, learn opponent, then win—experience with other players in interim—rusty bc haven’t played in a while

Conclusion and Future Work


A Bayesian logistic hierarchical model is able to outperform the naive approach of always picking the most skilled player by nearly 17 percentage points, resulting in ~ 22% (32 of 317 in test) more wins correctly predicted.

Future Work

  1. How do player’s advance in skill? Is there a ceiling to the number?
  2. Leverage more features to improve predictions. Ideas include:
    • Gameplay profiles like how often a player engages the app
    • Derived player type information like how “strategic” they play strategy games, how “aggressively” they play action-oriented games, etc. Estimating a player’s gameplay tendency may help predict outcomes better in matches where players differ in play style. Consider two players of equal skill level, 1 of which plays aggressively, while 2 plays defensively. The outcome of this game might be easier to predict than one in which two aggressive (defensive) players are paired.


This section outlines important but not critical info related to predicting player outcomes given historical matchups.

[Related Article: Building Your First Bayesian Model in R]


  1. Both players appear in the given match history.
  2. Every combination of players may not exist in given match history.

Brandon Dey, ODSC

Brandon is a Consulting Data Scientist at Avanade, the joint venture between Microsoft and Accenture, in Portland, Oregon. Outside of work, he wonders if he's actually fooling anyone by referring to himself in the third person. Plus, he likes backpacking, long distance trail running, aerial photography, writing creative non-fiction, and attempting to write short stories with characters suspiciously similar to himself...