Within soccer’s nascent analytics movement, one metric dominates most discussions. It’s called Expected Goals, or xG. Models for calculating xG differ, but the underlying concept is the same. In a nutshell, xG takes a shot’s characteristics (distance from goal, angle to goal, root cause, etc.) and assigns a probability that the shot will result in a goal. Adding up these probabilities reveals which team creates better scoring opportunities. Given a season of data, xG analysis is a powerful way to measure a team’s performance and generate valuable insight.
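To make the idea concrete, here is a minimal sketch of how per-shot probabilities roll up into a team’s xG for one match. The probabilities below are made up for illustration; a real model would estimate them from the shot characteristics described above.

```python
# Hypothetical per-shot scoring probabilities for one match.
# A real xG model would estimate these from distance, angle,
# root cause, and similar shot characteristics.
home_shots = [0.35, 0.08, 0.12, 0.05]
away_shots = [0.76, 0.03]


def expected_goals(shot_probabilities):
    """Sum per-shot scoring probabilities into a single xG value."""
    return sum(shot_probabilities)


home_xg = expected_goals(home_shots)
away_xg = expected_goals(away_shots)
print(f"home xG = {home_xg:.2f}, away xG = {away_xg:.2f}")
```

Note that the side with fewer shots can still come out ahead on xG if one of its chances was a very good one.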

In fact, one well-known coach recently used xG to defend his team’s performances. This surprising revelation suggests the metric is being used in the professional sphere and not just by data-literate fans. In the world of soccer analytics, a lot of work occurs behind closed doors, with the data stored behind even thicker ones. Detailed explanations of xG models are publicly accessible, but sufficient data is not.

Recently, however, @fussbALEXperte published aggregated match-level xG data for the first 30 rounds (out of 34) of the Bundesliga, Germany’s highest soccer league. Let’s take a look at how the majority of the 2015-2016 season played out from the perspective of xG. One caveat: since the data is aggregated per match rather than per shot, we can’t use simulations to get a more accurate picture of xG probabilities.
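For context, this is roughly what such a simulation could look like if shot-level data were available. It is only a sketch: the function name `simulate_match`, the shot probabilities, and the number of runs are all assumptions for illustration. Each shot is treated as an independent coin flip, and the match is replayed many times to estimate win, draw, and loss probabilities.

```python
import random


def simulate_match(home_shots, away_shots, n_sims=10_000, seed=0):
    """Estimate (home win, draw, away win) probabilities by replaying
    a match many times, treating each shot as an independent trial."""
    rng = random.Random(seed)
    home_wins = draws = away_wins = 0
    for _ in range(n_sims):
        # A shot becomes a goal when a uniform draw falls below its probability.
        home_goals = sum(rng.random() < p for p in home_shots)
        away_goals = sum(rng.random() < p for p in away_shots)
        if home_goals > away_goals:
            home_wins += 1
        elif home_goals < away_goals:
            away_wins += 1
        else:
            draws += 1
    return home_wins / n_sims, draws / n_sims, away_wins / n_sims


# Hypothetical shot probabilities; the published data has no such detail.
p_home, p_draw, p_away = simulate_match([0.35, 0.08, 0.12, 0.05], [0.76, 0.03])
print(p_home, p_draw, p_away)
```

With only one xG total per team per match, the individual probabilities in those lists are unknown, which is why this approach is off the table here.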

The first step is to create a long version of the data in addition to its wide representation.
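As a rough illustration of that reshaping step, the sketch below turns one-row-per-match records into one-row-per-team-per-match records. The column names (`home_team`, `away_team`, `home_xg`, `away_xg`) and the values are assumptions, not the layout of the published data, and plain Python is used here instead of a data-frame library.

```python
# Hypothetical wide data: one row per match.
wide_rows = [
    {"round": 1, "home_team": "Team A", "away_team": "Team B",
     "home_xg": 1.8, "away_xg": 0.6},
    {"round": 1, "home_team": "Team C", "away_team": "Team D",
     "home_xg": 0.9, "away_xg": 1.2},
]


def to_long(rows):
    """Reshape wide match rows into long rows: one per team per match."""
    long_rows = []
    for row in rows:
        for venue in ("home", "away"):
            long_rows.append({
                "round": row["round"],
                "team": row[f"{venue}_team"],
                "venue": venue,
                "xg": row[f"{venue}_xg"],
            })
    return long_rows


long_rows = to_long(wide_rows)
for r in long_rows:
    print(r)
```

The long form makes per-team summaries easy, since every team appears in a single `team` column regardless of where it played.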

From there we get a visual representation of each team’s xG performance in home and away games.

[Figure: Soccer Score Graphs]

It seems that most teams post higher xG numbers at home. Bayern Munich’s numbers especially stand out: they created 1.41 home xG on average compared to 1.15 xG in away games, a larger disparity than any other team’s. Only Darmstadt and Hannover created better chances on the road this season than at home. Meanwhile, Bayer Leverkusen plays the role of Atlas.

On average, Bayern Munich created chances worth 1.5 extra goals at home, while Bayer Leverkusen created chances of nearly the same quality at home and away (the difference was only 0.04).
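The home-versus-away comparison boils down to a grouped average followed by a subtraction. Here is a small sketch using invented numbers and placeholder team names; the function `home_away_gap` is hypothetical and assumes every team has at least one home and one away row.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical long-format rows: one per team per match.
rows = [
    {"team": "Team A", "venue": "home", "xg": 2.0},
    {"team": "Team A", "venue": "home", "xg": 1.0},
    {"team": "Team A", "venue": "away", "xg": 1.0},
    {"team": "Team B", "venue": "home", "xg": 0.5},
    {"team": "Team B", "venue": "away", "xg": 1.5},
]


def home_away_gap(long_rows):
    """Return average home xG minus average away xG for each team."""
    by_team = defaultdict(lambda: {"home": [], "away": []})
    for r in long_rows:
        by_team[r["team"]][r["venue"]].append(r["xg"])
    return {
        team: mean(xg["home"]) - mean(xg["away"])
        for team, xg in by_team.items()
    }


gaps = home_away_gap(rows)
print(gaps)
```

A positive gap means a team created better chances at home; a gap near zero corresponds to the kind of balance described for Bayer Leverkusen.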

Second to Bayern Munich is Borussia Dortmund. A look at the league table confirms this. However, the current league table covers more than 30 games, and I did not have a copy of the table as it stood after 30 games. So, I scraped the results for the first 30 rounds of play with RSelenium. With that data it was simple to recreate a snapshot of the table at that time. I also created a table based on expected goals.
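Rebuilding a table from match results is mostly bookkeeping. The sketch below is in Python with invented results rather than the scraped data, and the function name `league_table` is an assumption. It awards three points for a win and one for a draw, then sorts by points, goal difference, and goals scored. How the xG-based table was built is not spelled out here, so the sketch covers only the actual-results table.

```python
from collections import defaultdict

# Hypothetical results: (home team, home goals, away team, away goals).
results = [
    ("Team A", 2, "Team B", 0),
    ("Team B", 1, "Team C", 1),
    ("Team C", 0, "Team A", 1),
]


def league_table(match_results):
    """Build a league table sorted by points, goal difference, goals scored."""
    stats = defaultdict(lambda: {"points": 0, "goals_for": 0, "goals_against": 0})
    for home, home_goals, away, away_goals in match_results:
        stats[home]["goals_for"] += home_goals
        stats[home]["goals_against"] += away_goals
        stats[away]["goals_for"] += away_goals
        stats[away]["goals_against"] += home_goals
        if home_goals > away_goals:
            stats[home]["points"] += 3
        elif home_goals < away_goals:
            stats[away]["points"] += 3
        else:
            stats[home]["points"] += 1
            stats[away]["points"] += 1
    rows = [
        {"team": team, **s, "goal_diff": s["goals_for"] - s["goals_against"]}
        for team, s in stats.items()
    ]
    rows.sort(key=lambda r: (r["points"], r["goal_diff"], r["goals_for"]),
              reverse=True)
    return rows


table = league_table(results)
for rank, row in enumerate(table, start=1):
    print(rank, row["team"], row["points"], row["goal_diff"])
```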

When comparing the two tables, Bayern Munich, Dortmund, and Hannover (first, second, and last) rank as expected based on their corresponding xG. The other teams’ rankings deviate, sometimes widely, from what xG predicts. For example, Schalke is 7 places higher than expected, while Werder Bremen is 12 places lower.

Going deeper into the difference between actual and expected points and goal difference tells a more complete story. Schalke shows up again with 9 more points and a goal difference four goals higher than expected. Dortmund has 2 more points than expected, but a whopping 13-goal swing over expectations. Spare a thought for Werder Bremen, who has 20 fewer points than expected. Because they got the short end of the variance stick, the club is in danger of being relegated to the second division.
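The comparison itself is a join of the two tables followed by a few subtractions. A sketch with placeholder teams and invented numbers follows; the field names and the function `compare_tables` are assumptions for illustration.

```python
# Hypothetical snapshots of the actual and xG-based tables.
actual = {
    "Team A": {"rank": 1, "points": 60, "goal_diff": 30},
    "Team B": {"rank": 2, "points": 45, "goal_diff": 5},
    "Team C": {"rank": 3, "points": 30, "goal_diff": -10},
}
expected = {
    "Team A": {"rank": 1, "points": 58, "goal_diff": 25},
    "Team B": {"rank": 3, "points": 36, "goal_diff": 1},
    "Team C": {"rank": 2, "points": 50, "goal_diff": 4},
}


def compare_tables(actual_table, expected_table):
    """Return per-team differences between actual and expected standings."""
    rows = []
    for team, act in actual_table.items():
        exp = expected_table[team]
        rows.append({
            "team": team,
            # Positive means the team sits higher in the real table.
            "places_above_expected": exp["rank"] - act["rank"],
            "points_above_expected": act["points"] - exp["points"],
            "goal_diff_above_expected": act["goal_diff"] - exp["goal_diff"],
        })
    rows.sort(key=lambda r: r["points_above_expected"], reverse=True)
    return rows


comparison = compare_tables(actual, expected)
for row in comparison:
    print(row)
```

Teams at the top of this list are outperforming their xG, and teams at the bottom are the ones on the wrong side of variance.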

As I mentioned before, this is a snapshot of the most used metric in current soccer analytics. Without individual shot-level xG data, the analysis above is not as robust as it could be. For more on Expected Goals, check out this blog post by Michael Caley, one of the leading voices in the amateur soccer analytics scene.

#ODSC ©2016

Gordon Fleetwood

Gordon studied Math before immersing himself in Data Science. Originally a die-hard Python user, R's tidyverse ecosystem gradually subsumed his workflow until only scikit-learn remained untouched. He is fascinated by the elegance of robust data-driven decision making in all areas of life, and is currently involved in applying these techniques to the EdTech space.