Seven Python Kernels from Kaggle You Need to See Right Now
Posted by George McIntire, ODSC, July 10, 2017
The ability to post and share kernels is probably my favorite thing about Kaggle. Learning from other users’ kernels has often provided inspiration for my own projects. I also appreciate the attention to detail and the descriptions some users provide in their code. This is why we’d like to share with you seven awesome Python kernels we found on Kaggle.
1. Interactive Cricket Plots
I’m not a fan of cricket at all, and I barely understand how the sport works, but I’m a huge fan of this kernel, in which Kaggle user ash316 released a treasure trove of interactive and static graphics of statistics from the Indian Premier League. The kernel is an incredibly diverse and colorful collection of plots built with the Matplotlib and Plotly libraries. As a frequent user of Plotly, I’ll most likely be referencing this kernel in my future plotting.
2. Plotting Model Boundaries
Plotting the decision boundaries of a machine learning model is a great way to demonstrate how a certain algorithm makes its classifications. In this kernel, Arthurok shows how to make plots of the decision boundaries for random forest and logistic regression models in Plotly.
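To make the idea concrete, here is a minimal sketch of how a decision-boundary plot is usually built: fit a classifier, predict a label for every point on a dense grid, and color the grid by the predicted label. This is not Arthurok’s code. The toy data, the hand-rolled logistic regression, and every function name below are assumptions made for illustration, and only NumPy is used so the example stays self-contained.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Fit logistic regression weights with plain gradient descent."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))          # predicted probabilities
        w -= lr * Xb.T @ (p - y) / len(y)          # gradient of the mean log loss
    return w

def predict(w, X):
    """Return a 0/1 label for each row of X."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return (Xb @ w > 0).astype(int)

# Made-up toy data: two Gaussian blobs in 2D, one per class.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

w = fit_logistic(X, y)

# Build a grid covering the data and predict a label at every grid point.
xs = np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 100)
ys = np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 100)
xx, yy = np.meshgrid(xs, ys)
grid = np.c_[xx.ravel(), yy.ravel()]
zz = predict(w, grid).reshape(xx.shape)

# zz is the 2D array of predicted labels. Passing xs, ys and zz to a
# contour or heatmap trace (for example Plotly's go.Contour) draws the
# regions, and the line where the color changes is the decision boundary.
```

The same grid trick works for any model with a predict method, which is why it carries over from logistic regression to random forests without changes to the plotting step.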
3. XGBoost and Quora Questions
One of Kaggle’s most popular competitions, Quora Question Pairs, seeks to solve the problem of duplicate questions, no doubt a persistent issue for a website such as Quora. User Anokas’ submission to the competition uses the optimized gradient boosting library xgboost. He does a great job of walking through his process for preprocessing and doing EDA on the training and testing datasets. Running an xgboost model on his data yielded a score of 0.3546, enough to place him on the competition’s leaderboard. In the three months since Anokas posted his code, it has racked up over 37,000 views and dozens of positive comments. This kernel is a must-see for anyone looking to use the xgboost library.
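For readers wondering how to read a score like 0.3546: assuming it is the logarithmic loss the competition used to rank submissions, lower is better, and a model that always answers 0.5 scores about 0.693. The sketch below computes the metric in plain Python. The labels and probabilities are made up, and the function is an illustration, not code from the kernel.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Made-up example: 1 means "duplicate pair", 0 means "not duplicate".
labels = [1, 0, 0, 1]
confident = [0.9, 0.1, 0.2, 0.8]    # mostly right and fairly sure
uninformed = [0.5, 0.5, 0.5, 0.5]   # no information at all

baseline = log_loss(labels, uninformed)  # equals ln(2), about 0.693
better = log_loss(labels, confident)     # lower than the baseline
```

Because the metric punishes confident wrong answers heavily, well-calibrated probabilities matter as much as getting the label right.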
4. Zillow EDA
The Zillow Prize competition is currently the second most lucrative on Kaggle, with a prize haul of $1,200,000. The goal of the competition is to see if teams of data scientists can come up with a model that better estimates the price of a given house. In this kernel, Sudalairaj Kumar conducts a thorough exploratory data analysis to derive insights on the features of this vast amount of data on houses and their prices.
5. Titanic Tutorial
You really can’t call yourself a data scientist unless you’ve worked on the Titanic dataset, so it’s no surprise to see that one of Kaggle’s most popular kernels is about this data. Helge Bjorland, Senior Data Scientist at Telenor ASA, provides a meticulously organized approach to this famous dataset. He outlines his entire process from “Business Understanding” to “Deployment” with great attention to detail. I highly recommend that data scientists of all levels, not just beginners, peruse his code.
6. Digit Recognition and Deep Learning
Naturally, we have to include a deep learning kernel. Code by Poonam Ligarde, a designated “Kernels Expert,” on using Keras with the MNIST digits dataset has accumulated 13,000 views. She demonstrates a simple method of designing a neural network architecture using a Keras Sequential model.
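The idea behind a sequential architecture is that layers are stacked in order and each one feeds its output to the next. The sketch below mimics that in plain NumPy so it runs without Keras installed. It is not Keras code and not the kernel’s network. The layer sizes (784 inputs for a flattened 28x28 image, a hidden layer of 128 units, 10 outputs for the digits) and all names are assumptions chosen for illustration, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(n_in, n_out):
    """Return randomly initialised (weights, bias) for one dense layer."""
    return rng.normal(0, 0.01, (n_in, n_out)), np.zeros(n_out)

def relu(z):
    return np.maximum(z, 0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# The "sequential" part: an ordered list of (parameters, activation) pairs.
layers = [
    (dense(784, 128), relu),     # hidden layer (size is an assumption)
    (dense(128, 10), softmax),   # one output per digit class
]

def forward(x):
    """Pass a batch through every layer in order."""
    for (W, b), activation in layers:
        x = activation(x @ W + b)
    return x

batch = rng.random((5, 784))  # five fake flattened 28x28 images
probs = forward(batch)        # shape (5, 10); each row sums to 1
```

In Keras the same stack would be declared layer by layer and trained with an optimizer; the sketch covers only the forward pass, which is the part the word “sequential” describes.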
7. Instacart and Word2Vec
Grocery delivery startup Instacart recently made headlines when it released data on 3 million orders for a Kaggle competition in which data scientists try to best predict which products a customer will reorder. Freelance data scientist Omar Essam takes the Word2Vec route. He successfully used a Word2Vec model to improve his submission score in the competition.