The phenomenon of the “yearly sports game release” is a well-established tradition in the videogame industry. The biggest example is, perhaps, the FIFA franchise, which has reigned supreme in its niche, simulated soccer, for most of its twenty-plus-year history. EA Sports released the latest iteration, FIFA 17, a few weeks ago to the usual fanfare. Thousands worldwide adjust their priorities and work out how best to fit the expanded game into their schedules. A usual talking point upon each new release is the player ratings: meta-metrics derived from skills like speed, dribbling ability, and tackling prowess. These meta-metrics determine how well the virtual representations of real players perform in-game. Discussions can get... intense; sometimes even the players themselves get involved. At the root of these debates is the simple question of how well these metrics reflect the ability of the athletes themselves. A more complicated query is: can we use these video game statistics alongside real soccer data?

Well, we can generate a fuzzy answer to this question by looking at a data set released on Kaggle roughly three months ago. Hugo Mathien compiled a database of stats covering over 25,000 unsimulated, flesh-and-blood matches across eight seasons, then supplemented the data by adding player attributes from the FIFA series. Just glancing at the data and its associated Kaggle page revealed many possible paths for exploration. This post will visually explore the videogame statistics. If it’s not in-depth enough for you, look for future posts on FIFA where I and other Data Science Writers will take a deeper dive into this rich data set.

I first sorted the data in reverse chronological order, most recent first. A cursory look at the data showed a number of missing values. Given the nature and size of the data set, I decided to drop the rows that contained them. I then created a separate data set which contained only the most recent statistics for each player.

Dropping missing values left me with 180,380 data points out of the original 183,253. Dropping duplicates, so that only the most recent record per player remained, shrank the data to 10,226 players. I began my exploratory analysis by looking at objective facts including height and weight.
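For readers who want to follow along, the cleaning steps described above can be sketched in pandas roughly as follows. This is a minimal illustration on a toy table, not the code used for this post; the column names (`player_api_id`, `date`, `overall_rating`) are assumptions about how the player attributes table is laid out.

```python
import pandas as pd

# Toy stand-in for the player attributes table.
# Column names are assumptions, not confirmed by the post.
stats = pd.DataFrame({
    "player_api_id": [1, 1, 2, 2, 3],
    "date": ["2016-01-01", "2015-01-01", "2016-03-01", "2014-06-01", "2015-09-01"],
    "overall_rating": [80.0, 78.0, None, 70.0, 65.0],
})
stats["date"] = pd.to_datetime(stats["date"])

# Newest records first, then drop any row with a missing value.
cleaned = stats.sort_values("date", ascending=False).dropna()

# Keep the first (most recent) remaining row for each player.
latest = cleaned.drop_duplicates(subset="player_api_id", keep="first")

print(len(stats), len(cleaned), len(latest))
```

Note the order of operations: because missing rows are dropped before de-duplicating, a player whose newest record is incomplete falls back to their most recent complete record instead of disappearing from the data.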




Unsurprisingly, most players are right-footed. The distributions of height and weight both look roughly normal. The means are 181.9 cm and 168.4 lbs, and the standard deviations are 6.4 cm and 15 lbs respectively. The next step was to look at subjective statistics, starting with speed. The speed ratings are not measured in physical units; like the other metrics, they run from 1 to 100. The distribution is skewed left, with most players having ratings in the high 60s.
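Summary numbers like these are quick to compute, and a negative skewness value is one way to confirm a left skew without relying on the histogram alone. The sketch below uses synthetic data generated purely for illustration; the column names and the distributions they are drawn from are assumptions, not the real data set.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins: a roughly normal height and a left-skewed 0-100 rating.
players = pd.DataFrame({
    "height": rng.normal(181.9, 6.4, size=5000),
    "sprint_speed": 100 * rng.beta(7, 3, size=5000),
})

# Mean, standard deviation, and sample skewness for each column.
summary = players.agg(["mean", "std", "skew"])
print(summary.round(2))
```

A skewness near zero is consistent with a symmetric, bell-shaped distribution, while a clearly negative value indicates a longer tail on the low end.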


Given the nature of soccer, the interplay between metrics is probably the best way to consider the data. (These metrics include jumping, stamina, goalkeeping reflexes, and vision, among others.) The overall rating is the most obvious product of this interplay. It also follows a roughly normal distribution, with a mean of 68.2 and a standard deviation of 6.3.


I built a simple linear regression model to dig deeper into the relationship between these metrics and the overall rating.
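As a rough idea of what such a model involves, here is a minimal ordinary least squares sketch in NumPy that also computes t-statistics for each coefficient. It is an illustration under stated assumptions, not the model behind this post: the data is synthetic, the attribute names are hypothetical, and the |t| > 1.96 cutoff is the usual large-sample approximation to a 5% significance test.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000

# Synthetic attributes (hypothetical names); only the first two drive the rating.
names = ["reactions", "ball_control", "volleys"]
X = rng.normal(60, 10, size=(n, 3))
y = 10 + 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 3, size=n)

# Ordinary least squares with an intercept column.
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

# Standard errors from the residual variance, then t-statistics.
resid = y - A @ beta
sigma2 = resid @ resid / (n - A.shape[1])
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(A.T @ A)))
t_stats = beta / se

for name, b, t in zip(["intercept"] + names, beta, t_stats):
    flag = "significant" if abs(t) > 1.96 else "not significant"
    print(f"{name:>12}: coef={b:6.3f}  t={t:7.2f}  ({flag})")
```

With many correlated attributes, as in the real ratings, individual coefficients can look insignificant even when the attribute matters, so results like these are best read as a first pass.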


This preliminary pass shows that the volley, dribbling, free kick accuracy, balance, vision, penalties, standing tackle, and sliding tackle metrics are not statistically significant. Finally, I tried to visually separate goalkeepers from outfield players. My hypothesis was that the five goalkeeper traits – diving, handling, kicking, positioning, and reflexes – would be enough to provide this separation. I applied Principal Component Analysis to this five-dimensional space to reduce it to two dimensions before plotting.
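The reduction from five dimensions to two can be sketched as below. This is a minimal illustration that computes PCA directly from a singular value decomposition in NumPy, which is the same projection a library routine such as scikit-learn's `PCA` would produce, up to the sign of each axis. The data is synthetic and built so that the two groups separate by design; it says nothing about how the real ratings behave.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic five-trait ratings: two made-up groups with different typical values.
outfield = rng.normal(12, 4, size=(500, 5))
keepers = rng.normal(70, 8, size=(50, 5))
gk_traits = np.vstack([outfield, keepers])
is_keeper = np.array([False] * 500 + [True] * 50)

# PCA via SVD: center the columns, then project onto the top two directions.
centered = gk_traits - gk_traits.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
coords = centered @ Vt[:2].T

# Share of total variance captured by each principal component.
explained = S**2 / np.sum(S**2)

print(coords.shape)
print(explained[:2].round(3))
```

Plotting `coords[:, 0]` against `coords[:, 1]`, colored by a known label such as `is_keeper`, is then a direct way to check whether a visible cluster really corresponds to the group you suspect.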


I suspected that the isolated, leftmost cluster consisted of goalkeepers.


The hypothesis turned out to be completely false. Still, this line of inquiry provides the thread for the next big question to pose to this data set. Can we use these statistics to find the usual player positions or even define new ones? Stay tuned for the next entry.


©ODSC 2016

Gordon Fleetwood

Gordon studied Math before immersing himself in Data Science. Originally a die-hard Python user, R's tidyverse ecosystem gradually subsumed his workflow until only scikit-learn remained untouched. He is fascinated by the elegance of robust data-driven decision making in all areas of life, and is currently involved in applying these techniques to the EdTech space.