10 Tips to Get Started with Kaggle

Kaggle is a well-known community website where data scientists compete in machine learning challenges. Competitive machine learning can be a great way to hone your skills and to demonstrate them. In this article, I provide 10 useful tips for getting started with Kaggle and getting good at competitive machine learning. Let’s dive right in!


  1. Choose a data science programming environment. There are many machine learning programming environments to choose from, and you may end up using a number of them, but to get started with Kaggle you just need to choose one. The two most popular environments are R and Python.
    1. Demand for skills in both open source environments is growing all the time.
    2. R stems from academic use for statistical applications and has a very long history dating back to 1993. Python is a general-purpose programming language also dating back to the early ’90s.
    3. The ecosystems for both environments are quite mature: R has over 13,000 packages, and Python has widely used libraries such as scikit-learn, pandas, and NumPy.
    4. More recently, Python has excelled in deep learning tools such as Theano, TensorFlow, and Keras.
  2. Practice on commonly used test data sets. Once you get up to speed with a language, you need to start practicing on actual data sets. It’s a good idea to set up some realistic exercises to gain experience with simple, well-known data sets. It’s beneficial to work through a series of standard machine learning problems using the UCI Machine Learning Repository. You can look at each exercise as a mini Kaggle competition.
    1. Split the data set into a training set and test set. Then split the test set into a public and private leaderboard set to match Kaggle methodology.
    2. Write code in your language of choice and use a statistical learning algorithm to make predictions for each data set. Keep practicing on as many small data sets as possible.
    3. Search online for machine learning solutions that use a particular test data set so you can get good at interpreting the results. You’ll never believe how many people have used the iris data set as an example.
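
The split described in this tip can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function name `split_like_kaggle`, the 30% test fraction, and the 50/50 public/private split are my own assumptions rather than Kaggle’s actual settings. In practice, scikit-learn’s `train_test_split` does a similar job.

```python
import random


def split_like_kaggle(rows, test_fraction=0.3, public_fraction=0.5, seed=42):
    """Shuffle rows, then return (train, public_test, private_test)."""
    rng = random.Random(seed)   # fixed seed keeps the split reproducible
    shuffled = list(rows)       # copy so the caller's data is untouched
    rng.shuffle(shuffled)

    n_test = round(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]

    # Carve the held-out test rows into a "public" and a "private" part.
    n_public = round(len(test) * public_fraction)
    return train, test[:n_public], test[n_public:]


# Stand-in for a small data set such as iris (150 rows).
rows = list(range(150))
train, public_test, private_test = split_like_kaggle(rows)
print(len(train), len(public_test), len(private_test))
```

Fit your model on `train` only, check it against `public_test` while you iterate, and look at `private_test` once at the very end, just as a real competition would.
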
  3. Explore many facets of data transformation. Data transformation (aka data wrangling, or data munging) involves various forms of data prep including merging data, aggregating data, cleansing data, handling missing data, making data consistent, and so much more. Since data transformation is often estimated to take up to 70% of a data science project’s time and budget, it’s worth gaining plenty of experience with it.
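
To make a few of these operations concrete, here is a small sketch in plain Python. The records, column names, and the choice of median imputation are all invented for illustration; on a real project you would more likely reach for pandas.

```python
from collections import defaultdict
from statistics import median

# Hypothetical raw records: inconsistent city labels and a missing age.
passengers = [
    {"id": 1, "city": " london", "age": 34},
    {"id": 2, "city": "London", "age": None},
    {"id": 3, "city": "PARIS ", "age": 52},
    {"id": 4, "city": "paris", "age": 40},
]
# Second hypothetical table to merge in, keyed by id.
fares = {1: 7.5, 2: 12.0, 3: 30.0, 4: 8.5}

# Make data consistent: trim whitespace and normalise capitalisation.
for row in passengers:
    row["city"] = row["city"].strip().title()

# Handle missing data: fill missing ages with the median of the known ages.
known_ages = [row["age"] for row in passengers if row["age"] is not None]
fill_value = median(known_ages)
for row in passengers:
    if row["age"] is None:
        row["age"] = fill_value

# Merge: attach each passenger's fare from the second table.
for row in passengers:
    row["fare"] = fares[row["id"]]

# Aggregate: mean fare per city.
fares_by_city = defaultdict(list)
for row in passengers:
    fares_by_city[row["city"]].append(row["fare"])
mean_fare = {city: sum(v) / len(v) for city, v in fares_by_city.items()}
print(mean_fare)
```
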
  4. Feature engineering is king. Feature engineering is the process of creating and choosing the predictors with the most predictive power for your problem. It’s been reported many times that Kaggle challenges are won with clever feature engineering, and not with the most sophisticated algorithms. Try to learn something about the problem domain, as this will let you be creative in your selection of feature variables. Couple this with forward selection and backward elimination techniques to automate the feature selection part of the process.
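
Greedy forward selection is simple enough to sketch directly. The version below is only an illustration: `toy_score` is a made-up stand-in for whatever cross-validated score you would really compute, and the feature names and `min_gain` threshold are assumptions.

```python
def forward_selection(features, score, min_gain=1e-6):
    """Greedily add the feature that most improves score(subset)."""
    selected = []
    best_score = score(selected)
    remaining = list(features)
    while remaining:
        # Score every candidate when added to the current subset.
        candidate_scores = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidate_scores)
        if top_score - best_score < min_gain:
            break  # no candidate improves the score enough
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score


def toy_score(subset):
    """Hypothetical stand-in for a cross-validated model score."""
    # "fare_copy" is redundant with "fare"; every feature costs a small penalty.
    gains = {"age": 0.10, "fare": 0.20, "fare_copy": 0.19, "noise": 0.0}
    useful = set(subset)
    if {"fare", "fare_copy"} <= useful:
        useful.discard("fare_copy")
    return 0.5 + sum(gains[f] for f in useful) - 0.01 * len(subset)


selected, best_score = forward_selection(
    ["age", "fare", "fare_copy", "noise"], toy_score
)
print(selected, round(best_score, 2))
```

Backward elimination runs the same loop in reverse: start with every feature and repeatedly drop the one whose removal hurts the score least.
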
  5. Learn how to use ensembles. Ensembles refer to statistical learning algorithms that achieve enhanced predictive performance by constructing a set of classifiers and then classifying new data points by taking a weighted vote of their predictions. Instead of relying on a single model, ensemble methods combine the predictions of multiple models, for example by voting, averaging, or stacking. Many prize-winning Kaggle solutions use ensembles of multiple models.
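
A weighted vote, the simplest of these combinations, might look like the sketch below. The three models, their predictions, and their weights are invented for illustration; in a real competition you might derive the weights from each model’s validation score.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Combine one label per model into a single label by weighted vote."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Hypothetical predictions from three models for four samples.
model_predictions = [
    ["cat", "dog", "dog", "cat"],  # model A
    ["cat", "cat", "dog", "dog"],  # model B
    ["dog", "cat", "dog", "dog"],  # model C
]
model_weights = [0.45, 0.30, 0.25]  # assumed weights, one per model

# zip(*...) regroups the predictions so each tuple holds one sample's votes.
ensemble = [
    weighted_vote(sample_preds, model_weights)
    for sample_preds in zip(*model_predictions)
]
print(ensemble)
```

Note that on the second and fourth samples the two weaker models together outvote the strongest one, which is exactly the behavior a weighted vote is meant to allow.
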
  6. Learn how to beat overfitting. Overfitting refers to models that perform well on the training set, but not as well on the test set. In the Kaggle system, this extends to the scores you see on the leaderboard. The public leaderboard scores are computed on only a random sample (usually about 20%) of the held-out data set used to identify challenge winners, so a model tuned to climb the public leaderboard can still score worse in the final standings.
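
The gap between training and held-out scores is easy to see with a model that simply memorizes its training data. The sketch below uses a 1-nearest-neighbor classifier on synthetic, deliberately noisy data; the data generator, the 20% label noise, and the sample sizes are all assumptions chosen for illustration.

```python
import random

rng = random.Random(0)


def make_rows(n):
    """Hypothetical noisy data: label is x > 0.5, flipped 20% of the time."""
    rows = []
    for _ in range(n):
        x = rng.random()
        label = x > 0.5
        if rng.random() < 0.2:
            label = not label
        rows.append((x, label))
    return rows


def predict_1nn(train, x):
    """1-nearest-neighbor: copy the label of the closest training point."""
    return min(train, key=lambda row: abs(row[0] - x))[1]


def accuracy(train, rows):
    """Fraction of rows whose label the 1-NN model predicts correctly."""
    hits = sum(predict_1nn(train, x) == label for x, label in rows)
    return hits / len(rows)


train, holdout = make_rows(200), make_rows(200)
train_acc = accuracy(train, train)      # the model has memorized these rows
holdout_acc = accuracy(train, holdout)  # unseen rows expose the gap
print(f"train={train_acc:.2f} holdout={holdout_acc:.2f}")
```

A perfect training score paired with a clearly lower held-out score is the signature of overfitting, and it is why you should trust your own held-out validation more than small movements on the public leaderboard.
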
  7. Make use of the forum. The Kaggle user forums represent an excellent learning resource. Just browsing through the conversations can lead to insights. Feel free to ask questions, and you’ll be surprised at all the well-crafted answers you’ll receive. Make sure you utilize competition threads in order to understand winning solutions.
  8. Develop your own Kaggle toolbox. Build a special Kaggle toolbox with a variety of tools consisting of commonly used code sequences. With practice, you’ll become efficient when using these tools. In addition, try your hand at building a data pipeline that loads data, transforms it, and reliably evaluates a model. Design the pipeline to be reusable so you can deploy it in future competitions. A beginner will make the mistake of reinventing the same processes over and over again. Instead, you should work to streamline your Kaggle challenge process with methods of reuse.
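
A reusable pipeline can start out very small. The sketch below is one possible shape, not a prescribed design: every function name is hypothetical, the “model” is just a mean-value baseline, and the in-memory CSV stands in for a competition’s training file.

```python
import csv
import io
from statistics import mean


def load(source):
    """Load CSV rows as dictionaries from any file-like object."""
    return list(csv.DictReader(source))


def transform(rows, feature, target):
    """Turn raw string fields into (feature, target) float pairs."""
    return [(float(r[feature]), float(r[target])) for r in rows]


def fit_mean_model(pairs):
    """Baseline 'model' that always predicts the mean training target."""
    prediction = mean(y for _, y in pairs)
    return lambda x: prediction


def evaluate(model, pairs):
    """Mean absolute error on held-out pairs."""
    return mean(abs(model(x) - y) for x, y in pairs)


def run_pipeline(source, feature, target, fit, holdout_every=4):
    """Load, transform, fit, and score; every Nth row is held out."""
    pairs = transform(load(source), feature, target)
    holdout = pairs[::holdout_every]
    train = [p for i, p in enumerate(pairs) if i % holdout_every != 0]
    return evaluate(fit(train), holdout)


# Hypothetical in-memory CSV standing in for a competition's training file.
demo_csv = io.StringIO(
    "size,price\n1,10\n2,20\n3,30\n4,40\n5,50\n6,60\n7,70\n8,80\n"
)
mae = run_pipeline(demo_csv, "size", "price", fit_mean_model)
print(mae)
```

Because `fit` is passed in as an argument, you can swap the baseline for a real model without touching the loading, transformation, or evaluation code, which is the kind of reuse this tip is about.
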
  9. Practice on past Kaggle challenges. Now that you possess a good level of familiarity with your tools and how to use them, it’s time to practice on past Kaggle challenges. You can also post candidate solutions so they’ll be evaluated on the public and private leaderboards. It pays to work through a number of Kaggle challenges from the last few years. This tip is designed to help you learn how top performers approach competitive machine learning and how to integrate their methods into your own approaches. Try to get into the heads of past competition winners and use their methods and tools. It’s a good idea to pick a variety of different problem types that encourage you to acquire new techniques. Try to achieve a score in the top 10% or better on the public or private leaderboards.


  10. Start competing! If you’ve found success with all the above tips, you’re now ready to compete on Kaggle. Consider working on one challenge at a time until you achieve a good score or hit a roadblock. Remember, they may be competitions, but you’re participating to learn and share knowledge (which will lead to valuable collaborations). Be creative, think outside of the box, and have fun!





Daniel Gutierrez, ODSC

Daniel D. Gutierrez is a practicing data scientist who’s been working with data since long before the field came into vogue. As a technology journalist, he enjoys keeping a pulse on this fast-paced industry. Daniel is also an educator, having taught data science, machine learning, and R classes at the university level. He has authored four computer industry books on database and data science technology, including his most recent title, “Machine Learning and Data Science: An Introduction to Statistical Learning Methods with R.” Daniel holds a BS in Mathematics and Computer Science from UCLA.