It’s that time of year: the snow has melted, flowers are starting to blossom, and the country is consumed by March Madness fever. Today, March 15th, marks the true kick-off of the 63-game tournament, famous for its thrilling competitive play and heart-stopping upsets.
The tourney has been a focus of the data science community ever since Kaggle created a competition to predict the games with data. Part of what makes this such a compelling machine learning project is the headline-grabbing upsets that occur every year.
We here at ODSC have decided to throw our hat into the ring again (as we did last year) and have submitted a bracket determined by data science instead of gut feeling.
How We Made the ODSC March Madness Bracket Model
The data we used to build our model came directly from the Kaggle competition page. Some of the best models in past years have incorporated outside data, but doing so would have required more time and effort than I had available.
The features comprised team stats such as points per game, three-point shooting percentage, and rebounds per game, along with other variables such as strength of schedule, RPI, tournament seeding, and whether a team has won a national championship in the past.
This year, instead of testing a wide range of machine learning models, we went straight to deep learning. We trained a multilayer perceptron that achieved a validation accuracy of 76.8%, beating the scores I got with Gradient Boosting and Random Forest.
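As a rough sketch of what that training setup might look like (the feature values below are synthetic placeholders standing in for the real matchup stats, and the layer sizes are illustrative, not the actual architecture), scikit-learn's `MLPClassifier` makes a multilayer perceptron straightforward to fit:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real Kaggle data: each row represents one
# matchup's feature differences (points per game, 3P%, rebounds, seed, etc.)
rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 8))  # 8 placeholder features

# Label: 1 if the first team won. This generating rule is made up purely
# so the example has something learnable to fit.
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=2000) > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scale features, then train a small multilayer perceptron
scaler = StandardScaler().fit(X_train)
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
mlp.fit(scaler.transform(X_train), y_train)

val_acc = mlp.score(scaler.transform(X_val), y_val)
print(f"validation accuracy: {val_acc:.3f}")
```

The same `fit`/`score` interface works for Gradient Boosting and Random Forest, which makes it easy to compare validation accuracy across all three model families on identical splits.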
How We Made Our Predictions
Instead of simply choosing the team with the higher predicted probability, I decided to try something different in making my picks. My initial bracket had hardly any upsets: my Final Four picks were all number 1 seeds, and there were no seeds lower than a 3 in my Elite Eight. I noticed that in games where I expected an upset, the favored team would still be predicted to win, but not by as wide a margin as you would expect. Since binary classifiers use 50% as their default decision threshold, I needed a way to adjust the threshold for each matchup.
My solution was to train a simple Logistic Regression model on only the difference in seeding between the two teams. I take the probabilities generated by this model and use them as the thresholds that decide my matchup predictions. For example, if my neural network says a 6-seed has a 60% chance of beating an 11-seed, but the Logistic Regression model says a team with a five-seed advantage should win 68.3% of the time, then I pick the 11-seed to pull off the upset. After making all my predictions this way, I feel a lot better about my bracket.
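A minimal sketch of this thresholding idea looks like the following. The historical seed-difference data here is fabricated for illustration (the real model was fit on actual tournament results), and `pick_winner` is a hypothetical helper name, not code from the original project:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Fabricated historical matchups: seed difference (favorite's seed number
# minus underdog's, so always negative) and whether the favorite won.
rng = np.random.default_rng(0)
seed_diff = rng.integers(-15, 0, size=1000).reshape(-1, 1)

# Synthetic generating rule: bigger seed gaps mean the favorite wins
# more often (e.g. a -15 gap wins ~98% of the time in this toy data).
p_favorite_wins = 1.0 / (1.0 + np.exp(0.25 * seed_diff.ravel()))
favorite_won = (rng.random(1000) < p_favorite_wins).astype(int)

# Logistic Regression on seed difference alone gives a per-matchup
# baseline win probability, used as the decision threshold.
threshold_model = LogisticRegression().fit(seed_diff, favorite_won)

def pick_winner(mlp_prob_favorite, favorite_seed, underdog_seed):
    """Pick the favorite only if the main model's probability clears the
    seed-based threshold; otherwise call the upset."""
    diff = np.array([[favorite_seed - underdog_seed]])
    threshold = threshold_model.predict_proba(diff)[0, 1]
    return "favorite" if mlp_prob_favorite >= threshold else "underdog"

# The 6-vs-11 example from the text: the MLP gives the 6-seed a 60% chance,
# but the seed-based threshold for a five-seed gap is higher, so the
# underdog gets the pick.
print(pick_winner(0.60, favorite_seed=6, underdog_seed=11))
```

The effect is that the favorite must beat not a flat 50% bar but the historical expectation for a matchup of that seed gap, which is exactly what pushes more plausible upsets into the bracket.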
And now presenting the Official Open Data Science March Madness Bracket!!
As you can see from my bracket, there are a fair number of upsets in my picks, but it is still pretty conservative compared to most brackets. Check this space on Monday, when I'll post an update on how my bracket fared over the first two rounds of games this weekend.
I'm a journalist turned data scientist/journalist hybrid. Looking for opportunities in data science and/or journalism. Impossibly curious and passionate about learning new things. Before completing the Metis Data Science Bootcamp, I worked as a freelance journalist in San Francisco for Vice, Salon, SF Weekly, San Francisco Magazine, and more. I've referred to myself as a 'Swiss-Army knife' journalist and have written about a variety of topics ranging from tech to music to politics. Before getting into journalism, I graduated from Occidental College with a Bachelor of Arts in Economics. I chose to do the Metis Data Science Bootcamp to pursue my goal of using data science in journalism, which inspired me to focus my final project on being able to better understand the problem of police-related violence in America. Here is the repo with my code and presentation for my final project: https://github.com/GeorgeMcIntire/metis_final_project.