You might remember that ODSC is participating in Kaggle’s March Madness this year, so we thought we’d give you an update. Here’s Open Data Science’s March Madness bracket breakdown.
After 48 games and millions of busted brackets, the 2018 NCAA Men’s Tourney has been whittled down to the Sweet 16. As expected, the weekend delivered plenty of exciting games. The most significant result was #16 seed UMBC’s triumph over #1 seed Virginia, the first time a #16 seed has beaten a #1 seed in the tournament’s history. Such an upset busted nearly every bracket in the world, and for good reason: almost no one would think to pick a 16 over a 1. That result certainly wreaked havoc on my own bracket, which I created based on predictions made by a deep learning model. So let’s check in to see how well my bracket fared after the first two rounds of the tournament.
Overall, I got 73% of my predictions right, which is only three percentage points lower than the cross-validated accuracy score I got when I tested my model. What’s interesting is the variance of my model’s performance by region. The South region was by far my worst showing: both of my regional finalists there are already eliminated. The West and Midwest regions fared a little better and, unlike the South, still give me the potential to gain more points because Michigan and Kansas remain in the tournament. My best showing occurred in the East, where I was only four West Virginia points away from a perfect sub-bracket. Though I feel I created a decent bracket, the fact that three of my Final Four and one of my finalists are already eliminated is quite disappointing. What’s more frustrating is that I had an incredibly good start, correctly predicting 18 of the first 22 matches and finishing the first round with 78% accuracy.
But if there is one good thing I can take away from my bracket, it is that my technique for determining prediction thresholds gave me a better model.
Here’s how I described that technique in my article where I discussed building my model:
“Instead of just choosing the team with the higher probability, I decided to try something different in making my picks. The reason is that my initial bracket had hardly any upsets, my Final Four picks were number 1 seeds and there were no seeds lower than a 3 in my Elite Eight. I noticed that in my predictions in games where I expected to see an upset, the favored team would still be predicted to win but not as much as you would expect it to. Since machine learning models trained on two outcomes use 50% as their threshold to make predictions, I needed a way to adjust the threshold for each matchup.
My solution to this issue was to train a simple Logistic Regression model on only the difference in seeding between the teams. I take the probabilities generated from this model and use them to determine the thresholds which decide my matchup predictions. For example, if my neural network model says that a 6-seeded team has a 60% chance of beating an 11-seeded team, but the thresholds from my Logistic Regression model say that a team that is five seeds lower has a 68.3% chance of winning, then I choose the 11-seeded team to win. After making all the predictions using this method, I feel a lot better about my bracket.”
In addition to my own bracket, I created a control bracket in which I did not use this threshold technique; every prediction in it used the default threshold of 0.5. My main bracket got two more predictions right than the control bracket.
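To make the difference between the two brackets concrete, here is a minimal Python sketch of the thresholding idea. It is not the code behind either bracket. The function names `seed_threshold` and `pick_winner` are hypothetical, and the logistic curve assumes a zero intercept with a slope back-solved from the 68.3% figure in the quoted example rather than taken from a fitted model.

```python
import math


def seed_threshold(seed_diff, slope, intercept=0.0):
    """Baseline win probability for the better-seeded team.

    Stands in for a logistic regression fitted on seed difference alone.
    slope and intercept are illustrative assumptions, not fitted values.
    """
    return 1.0 / (1.0 + math.exp(-(intercept + slope * seed_diff)))


def pick_winner(favorite, underdog, p_favorite, threshold=0.5):
    """Pick the favorite only if the model's probability clears the threshold."""
    return favorite if p_favorite >= threshold else underdog


# Assumption: choose the slope so that a five-seed gap maps to 68.3%,
# the figure used in the quoted example.
slope = math.log(0.683 / (1 - 0.683)) / 5
threshold = seed_threshold(5, slope)

# The neural network gives the 6 seed a 60% chance against the 11 seed.
main_pick = pick_winner("6 seed", "11 seed", p_favorite=0.60, threshold=threshold)
control_pick = pick_winner("6 seed", "11 seed", p_favorite=0.60)  # default 0.5

print(main_pick)     # 11 seed
print(control_pick)  # 6 seed
```

With the seed-based threshold, a favorite has to beat what its seeding alone would suggest, so the 60% favorite loses the pick. With the default 0.5 threshold used for the control bracket, the same favorite keeps it.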
Check this space next week for another update on the ODSC March Madness bracket.
I'm a journalist turned data scientist/journalist hybrid. Looking for opportunities in data science and/or journalism. Impossibly curious and passionate about learning new things. Before completing the Metis Data Science Bootcamp, I worked as a freelance journalist in San Francisco for Vice, Salon, SF Weekly, San Francisco Magazine, and more. I've referred to myself as a 'Swiss-Army knife' journalist and have written about a variety of topics ranging from tech to music to politics. Before getting into journalism, I graduated from Occidental College with a Bachelor of Arts in Economics. I chose to do the Metis Data Science Bootcamp to pursue my goal of using data science in journalism, which inspired me to focus my final project on being able to better understand the problem of police-related violence in America. Here is the repo with my code and presentation for my final project: https://github.com/GeorgeMcIntire/metis_final_project.