Without a doubt, Kaggle is one of the most important hubs in the data science ecosystem. They’ve been making some news recently with their acquisition by Google and the debut of the new “Learn” platform. The best thing about Kaggle, however, beyond the technology, is its community. Kaggle users are known for their avid participation in competitions, but the thing that resonates most with me personally about the community is the constant willingness of developers, students and others to share code and data. All of this makes for an exciting time on Kaggle and for the Kaggle community.
In this post, I’ll highlight some of the most interesting recent datasets and kernels from the Kaggle community.
- This sprawling collection features geographical data on hundreds of thousands of fast food restaurants in the USA. It includes addresses and latitude and longitude coordinates, which makes it a great starting point for a data viz project.
- Mass shootings are an ever-present topic in the media, so if you’re looking to do some kind of data journalism piece on the issue, then this is the dataset for you. It includes a wealth of information such as news articles, detailed data on the shooter and victims, and congressional district.
- This one is for the film and NLP buffs in the community. It is a corpus of word vectors trained on movie reviews. Word vectors are always fun to play with, and the movie angle should make these even more fun.
- If I had to give one bit of criticism about Kaggle datasets, it’s that there aren’t enough machine learning datasets in the mix. So when I come across a dataset that lets you train a supervised learning model, I jump on it. This is what the animal shelter outcomes dataset is for. With this data, you can try to predict whether or not shelter animals end up getting adopted.
- XGBoost is probably the hottest machine learning technique outside of neural networks right now. I highly recommend checking out this incredibly detailed kernel because it explains how to use the algorithm on a housing prices dataset. And don’t let the fact that it’s in R discourage you; Python users can still get something from the presentation of the data and results.
- A robust exploratory data analysis process is key to any machine learning project, so take good notes on this kernel.
- Ensemble methods are what win Kaggle competitions, so if you want to move up into that top 10 percent, this is where you start.
- Deep learning algorithms can be a tough nut to crack. I really appreciate this kernel for just that reason: it provides a simple yet comprehensive introduction to Convolutional Neural Nets, the family of models most commonly used for image processing.
- Time to get super meta. If you’re a true data nerd, then you’ll really appreciate this kernel analyzing kernels on Kaggle.
- An awesome introduction to topic modeling techniques like LDA and NNMF, applied to the scariest dataset you’ll ever come across. It gets bonus points for the solid visuals.
Kaggle is always updating its datasets and kernels, so stay tuned for another version of this article in the future.