The endlessly fun NBA All-Star Weekend just wrapped up, culminating in a 148-145 triumph for Team LeBron over Team Steph. But now that the fun is over, it’s time to turn our attention to a more serious matter this NBA season: the MVP race.
With the season more than halfway done and playoffs only two months away, the MVP conversation is beginning to ramp up. Reigning MVP Russell Westbrook of the Oklahoma City Thunder hasn’t matched his legendary form from last season but nonetheless is still having an MVP-caliber season. He’s currently averaging 25.4 ppg, 9.4 rpg, and a league-leading 10.4 apg.
Westbrook was the obvious choice for MVP last season after averaging a triple-double while playing 81 of 82 games. Given his astronomical performance, any sort of statistical analysis to prove he should be MVP was unnecessary. This season is another story. With Westbrook regressing and younger talent such as Giannis Antetokounmpo and Ben Simmons continuing to emerge, the MVP discussion is more of a debate than an open-and-shut case.
That’s why I am taking it upon myself to determine the true MVP of this season by applying statistics and linear algebra to a rich set of NBA statistics. In this post, I’ll outline where and how I got the data, explain my methods, and announce the league MVP along with the MVP of each team.
Data Wrangling & Cleaning
You are most likely wondering how I obtained the data. You may be surprised to hear that I did not scrape or crawl it using tools like BeautifulSoup, Selenium, or Scrapy. I actually used Pandas to load the data, specifically its read_clipboard function. All I had to do was copy the data from the website and call this function to read it from the clipboard. It’s a simple tool, but in some cases it can get the job done, and it won’t burden the website’s servers the way web scraping can.
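Here is a minimal sketch of that step, not my exact code. In practice the call is simply pd.read_clipboard(), which hands the clipboard text to read_csv with a whitespace separator by default. Because that needs a live clipboard, the sketch parses an equivalent string instead. The sample rows, column names, and the parse_pasted_table helper are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical text standing in for a stats table copied from a web page.
pasted = """PLAYER TEAM GP MIN PTS
Player_A AAA 50 34.1 25.0
Player_B BBB 12 8.5 3.2
"""


def parse_pasted_table(text):
    # pd.read_clipboard() forwards the clipboard text to read_csv with a
    # whitespace separator; this mirrors that behavior on a plain string.
    return pd.read_csv(io.StringIO(text), sep=r"\s+")


df = parse_pasted_table(pasted)
print(df.shape)
```

With a real table on the clipboard, df = pd.read_clipboard() would replace the helper call.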
After acquiring the data, I joined the two datasets (general and advanced stats) while dropping duplicate columns, such as player and team names, and other extraneous features.
This left me with a clean dataset of 502 players and 44 features, but before I could begin my analysis, I had to drop a sizable portion of players who haven’t seen considerable game time. The rule I used was that players had to have played in at least 18 games while also averaging at least 10 minutes per game. This process left me with 351 players.
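The join and the playing-time rule can be sketched roughly as follows. This is an illustration under assumptions: the column names (PLAYER, TEAM, GP, MIN, PTS, NETRTG), the toy rows, and the join_and_filter helper are hypothetical, and joining on player name alone assumes names are unique.

```python
import pandas as pd

# Hypothetical miniature versions of the general and advanced stats tables.
general = pd.DataFrame({
    "PLAYER": ["A", "B", "C"],
    "TEAM": ["AAA", "BBB", "CCC"],
    "GP": [50, 12, 40],
    "MIN": [34.1, 22.0, 8.5],
    "PTS": [25.0, 9.0, 3.2],
})
advanced = pd.DataFrame({
    "PLAYER": ["A", "B", "C"],
    "TEAM": ["AAA", "BBB", "CCC"],
    "GP": [50, 12, 40],
    "NETRTG": [5.1, -2.0, 0.3],
})


def join_and_filter(general, advanced, min_games=18, min_minutes=10):
    # Keep only the columns of `advanced` that `general` lacks, plus the key,
    # so duplicate columns such as TEAM and GP are not carried over twice.
    new_cols = [c for c in advanced.columns if c not in general.columns]
    merged = general.merge(advanced[["PLAYER"] + new_cols], on="PLAYER")
    # Playing-time rule: at least 18 games and at least 10 minutes per game.
    keep = (merged["GP"] >= min_games) & (merged["MIN"] >= min_minutes)
    return merged[keep].reset_index(drop=True)


players = join_and_filter(general, advanced)
print(players)
```

In the toy data only player A survives: B has played too few games and C averages too few minutes.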
And here are glossaries taken from the NBA for the general and advanced stats:
When I talk about determining the MVP, I am more or less talking about the best overall player in the league. Whether the best overall player should be MVP is another discussion, one that has occupied NBA fans for years. To find that player, I calculated the magnitude of each player’s vector, which is simply their stat line. Put simply, the vector magnitude is the distance of the player’s statistics from 0.
If I were only using points and rebounds and a player was averaging 20 ppg and 10 rpg, his magnitude would be calculated by squaring each value, summing them, and then taking the square root of that sum, which comes out to 22.36. The magnitude measures the length of the line from (0, 0) to (20, 10). Now imagine doing this for dozens of features. (For a quick refresher on vectors and magnitude, check out this tutorial.)
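The points-and-rebounds example works out like this in NumPy. It is a small illustration of the arithmetic, and the variable names are my own.

```python
import numpy as np

stat_line = np.array([20.0, 10.0])  # points and rebounds per game

# Square each value, sum the squares, then take the square root.
magnitude = np.sqrt(np.sum(stat_line ** 2))

print(round(float(magnitude), 2))  # 22.36
```

np.linalg.norm(stat_line) gives the same number and extends to any number of features.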
Before calculating the magnitude for each player, I had to standardize the data so as not to give extra influence to variables on larger scales. In my points and rebounds example, points are usually on a bigger scale than rebounds (dozens of players usually average 20 ppg or more in a season, whereas you don’t ever see those kinds of numbers for rebounds). That would give points a larger effect on the magnitude, but it doesn’t necessarily mean they’re more important than rebounds. Again, that is another debate best left alone.
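Standardizing means rescaling every feature to mean 0 and standard deviation 1. A minimal sketch is below, assuming z-score scaling with the population standard deviation (NumPy’s default). The toy numbers and the standardize helper are illustrative, not my real data or code.

```python
import numpy as np

# Hypothetical stat lines: columns are points and rebounds per game.
stats = np.array([
    [20.0, 10.0],
    [30.0, 4.0],
    [10.0, 7.0],
])


def standardize(X):
    # Subtract each column's mean and divide by its standard deviation,
    # so every feature ends up with mean 0 and standard deviation 1.
    return (X - X.mean(axis=0)) / X.std(axis=0)


z = standardize(stats)
print(z.mean(axis=0), z.std(axis=0))
```

After this step a one-standard-deviation edge in rebounds counts exactly as much as a one-standard-deviation edge in points.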
One problem I encountered in my work was multicollinearity. As you may have already guessed, many of the features in my data are strongly correlated with one another, such as offensive and defensive rebounds. Instead of injecting subjectivity into my dataset by removing features I deemed unworthy, I opted for the dimensionality reduction technique PCA (principal component analysis), a proven remedy for this problem.
Using PCA, I compressed my standardized dataset down to 7 principal components, which retained 80% of the variance. Going from 44 features to 7 components while losing only 20% of the data’s essence is a strong indication of multicollinearity.
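To show the idea of keeping just enough components to reach a variance target, here is a sketch that runs PCA through NumPy’s SVD. It is not the code I ran. The pca_reduce helper and the synthetic data (six features built from two hidden factors plus a little noise) are assumptions for illustration.

```python
import numpy as np


def pca_reduce(Z, target=0.80):
    """Project standardized data Z onto the fewest components reaching `target`."""
    # Rows of Vt are the principal directions of the centered data.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    # Squared singular values are proportional to each component's variance.
    ratio = S ** 2 / np.sum(S ** 2)
    # Smallest number of components whose cumulative share reaches the target.
    k = int(np.searchsorted(np.cumsum(ratio), target)) + 1
    k = min(k, len(S))
    return Z @ Vt[:k].T, ratio[:k]


# Synthetic, highly correlated data: two hidden factors drive six features.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]])
X = latent @ mixing + 0.1 * rng.normal(size=(200, 6))

# Standardize first, then reduce.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
scores, kept = pca_reduce(Z)
print(scores.shape, kept.sum())
```

Because the six synthetic features are really two factors in disguise, two components are enough to pass the 80% mark, which is the same effect multicollinearity had on my 44 features.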
Now it was time for the fun part.
I calculated the magnitude of every player’s 7-dimensional PCA vector, and the results showed that Russell Westbrook should retain his crown as MVP this season. According to my results, Westbrook trounced the competition.
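Scoring and ranking players by magnitude can be sketched like this. The player names and PCA scores are invented, and three components are used instead of seven to keep the example short.

```python
import numpy as np

# Hypothetical PCA scores: one row per player, one column per component.
names = np.array(["Player A", "Player B", "Player C"])
scores = np.array([
    [3.0, 4.0, 0.0],
    [1.0, 2.0, 2.0],
    [-6.0, 0.0, 8.0],
])

# Euclidean length of each row, i.e. each player's distance from the origin.
magnitudes = np.linalg.norm(scores, axis=1)

# Indices that sort the players from largest to smallest magnitude.
order = np.argsort(magnitudes)[::-1]
for name, m in zip(names[order], magnitudes[order]):
    print(name, round(float(m), 2))
```

Note that the magnitude ignores sign: a component far below zero adds as much to the score as one far above it.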
It’s important not to put too much faith in the magnitude value, but it is incredible to see that Westbrook has twice the score of 6th-place Joel Embiid.
The reason for Westbrook’s score is simply that he’s a great all-around player. He can impact the game with and without the ball, and in and out of the paint. If you look at the league leaders for various stats on the NBA stats website, you’ll often find Westbrook in the top ten in categories such as assists, steals, and fast break points. It’s a tremendous credit to his abilities as a player, because most top players are great in only a handful of areas.
Now that we’ve awarded the MVP, let’s take a look at each team’s MVP in the following table, which displays the player with the best score for each team.
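Picking the top scorer per team is a one-line groupby in Pandas. The sketch below uses made-up players, teams, and scores, and the column names are assumptions.

```python
import pandas as pd

# Hypothetical magnitude scores for four players on two teams.
results = pd.DataFrame({
    "PLAYER": ["A", "B", "C", "D"],
    "TEAM": ["AAA", "AAA", "BBB", "BBB"],
    "SCORE": [9.1, 4.2, 3.3, 7.8],
})

# For each team, find the row index of the highest score, then select those rows.
best_idx = results.groupby("TEAM")["SCORE"].idxmax()
team_mvps = results.loc[best_idx].reset_index(drop=True)
print(team_mvps)
```

idxmax returns the first row in case of a tie, so an exact tie within a team would need its own rule.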
I'm a journalist turned data scientist/journalist hybrid. Looking for opportunities in data science and/or journalism. Impossibly curious and passionate about learning new things. Before completing the Metis Data Science Bootcamp, I worked as a freelance journalist in San Francisco for Vice, Salon, SF Weekly, San Francisco Magazine, and more. I've referred to myself as a 'Swiss-Army knife' journalist and have written about a variety of topics ranging from tech to music to politics. Before getting into journalism, I graduated from Occidental College with a Bachelor of Arts in Economics. I chose to do the Metis Data Science Bootcamp to pursue my goal of using data science in journalism, which inspired me to focus my final project on being able to better understand the problem of police-related violence in America. Here is the repo with my code and presentation for my final project: https://github.com/GeorgeMcIntire/metis_final_project.