I get asked all the time, “How can I become a data scientist?” The query comes from all sorts of people who see the many opportunities presented by this growing and dynamic field. My initial advice is to build the educational foundation for doing data science: computer science, applied statistics, probability theory, and several flavors of mathematics such as calculus, linear algebra, and partial differential equations (PDEs). Then comes the fun part, machine learning!
All data scientists must be able to perform statistical learning (machine learning) with ease. There are two types of machine learning to master: supervised machine learning (e.g. regression and classification) to make predictions, and also unsupervised machine learning for knowledge discovery (e.g. clustering).
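To make the distinction concrete, here is a minimal Python sketch of both kinds of learning. It is only an illustration under stated assumptions: the data are made up, the helper names `fit_line` and `two_means` are hypothetical, and the techniques (a least-squares line for regression, a tiny two-cluster k-means for clustering) are just one simple instance of each category.

```python
from statistics import mean


def fit_line(xs, ys):
    """Supervised learning: least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar


def two_means(points, iterations=10):
    """Unsupervised learning: 1-D k-means with k=2, seeded at the min and max."""
    centers = [min(points), max(points)]
    for _ in range(iterations):
        groups = [[], []]
        for p in points:
            # Assign each point to the nearer of the two current centers.
            nearest = 0 if abs(p - centers[0]) <= abs(p - centers[1]) else 1
            groups[nearest].append(p)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers


# Made-up labeled data: the known answers (scores) guide the fit.
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]
slope, intercept = fit_line(hours, scores)
predicted = slope * 6 + intercept  # prediction for an unseen input

# Made-up unlabeled data: the algorithm has to discover the two groups itself.
heights = [150, 152, 151, 180, 182, 181]
centers = two_means(heights)

print(slope, intercept, predicted, centers)
```

In the first half the labels (`scores`) are given and the goal is to predict a new value; in the second half there are no labels at all, and the structure is discovered from the data alone.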
Fortunately, today you can acquire all the skills you need for data science for free! In this article, I will review how you can take advantage of some pretty amazing free resources to get you to the point where you can enter the profession.
The Rise of the MOOC
Coursera, one of the first major Massive Open Online Course (MOOC) platforms, was founded in 2012 and has become an epicenter for data science learning. Coursera co-founder and Stanford professor Andrew Ng’s famed “Machine Learning” course was one of the first courses offered and is still extremely popular and a great place to start (although it uses the Octave programming language, an open source language largely compatible with the commercial MATLAB product). I went through this course the very first time it was offered and I continue to recommend it to anyone looking for a free educational resource in machine learning.
Another excellent option is the 10-course Data Science Specialization series created by three biostatistics professors at Johns Hopkins University and offered through Coursera. You can audit the courses for free (certificates require a fee). I had the opportunity to beta test all the courses in the series before they were available to the public, and I can attest to their quality.
Coursera offers many other data science learning options produced by leading universities from around the world.
Also founded in 2012, by MIT and Harvard University, edX is another quality free MOOC resource for data science. A lesser-known learning resource for machine learning is “Learning from Data,” taught by Caltech professor Yaser Abu-Mostafa, author of a great book on the subject that bears the same title as the course. For catching up with the mathematics side of data science, you have many wonderful no-cost video presentations on YouTube, Khan Academy, and MIT OpenCourseWare.
So Many Data Science Books, So Little Time
The field of data science is awash with excellent books. In addition to hardcopy versions, many are offered as free downloads. Here are some of my favorites:
– The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman. Many data scientists call this book the “machine learning bible” because its authors are luminary professors from Stanford University. This book is all theory and mathematics.
– An Introduction to Statistical Learning by James, Witten, Hastie, and Tibshirani. A more accessible version of ESL above, with R programming examples and much less mathematics.
– Deep Learning by Goodfellow, Bengio, and Courville. Advanced, graduate-level book for data scientists who wish to move into deep learning. Great book!
– The Art of R Programming by Norman Matloff. A nice book with which to learn R, a popular programming language used by data scientists. I like Matloff’s approach since he is a computer science professor.
– Doing Data Science by Rachel Schutt and Cathy O’Neil. This book is a fun overview by two practitioners who have long-term perspectives on the field.
– Introduction to Linear Algebra by Gilbert Strang. This is my favorite book on the subject by an MIT professor who has devoted his life to the subject.
There are many other educational resources that you can use to learn data science: data science blogs, industry newsletters, technology sites like Stack Overflow and Quora, LinkedIn groups, commercial whitepapers from leading vendors, Google Alerts, and Twitter (many top data scientists have insightful feeds), just to name a few.
In addition, many companies and organizations produce technology webinars that focus on various areas of data science and machine learning. Most webinars last about an hour, and I’ve found them to be of high quality. They tend to be very timely in the technology they highlight and the trends they follow. Webinars are a good way to stay in front of the rapidly changing face of the industry. Be sure to check out the “AI Learning Accelerator” by ODSC for a large collection of webinars.
Even with all the free resources available to help you pursue a career in data science, be forewarned: it’s not easy. You’re basically committing yourself to learning the equivalent of a 4-year university degree as a minimum starting point. But given aptitude and fortitude, you can feasibly become a data scientist for free. The next step is to prove your skills by doing projects. I always tell newbie data scientists to do some public data hacking as resume-building material in order to get that first internship or junior position.