Data science is an increasingly competitive field, and practitioners are working hard to deepen their skills and experience. This trend has given rise to ever more demanding job descriptions. To stay competitive, it makes sense to prepare for new ways of working and a variety of new tools. Many firms still hold a “unicorn” mentality, trying to hire a single individual to fill the roles of data scientist, data engineer, software developer, and more. To help counter that, this article looks at a number of important data-science-specific skills that can help you make a splash in your career. Here are the top data science skills for 2020.
Git (a version control system that lets you manage and keep track of your source code history) and GitHub (a cloud-based hosting service that lets you manage Git repositories) are developer tools that are a great help when managing different versions of software. They track every change made to a code base, and they make collaboration easier when multiple developers change the same project at the same time.
For data scientists, Git is becoming a serious job requirement, and it takes time to get used to its best practices. It is easy to start using Git when you’re working solo, but when you join a team or collaborate with Git experts, you might struggle more than you expect.
Preparing for Production
Historically, the data scientist was the staff member who answered business questions with machine learning. Now, data science projects are increasingly developed for production systems. At the same time, advanced models require more and more compute and storage resources, especially in deep learning.
The accuracy of your model still matters, but job descriptions for data scientists increasingly put equal weight on working directly with the data engineers on the team to place data science solutions in production environments. If you haven’t yet collaborated with data engineers to get your models into production, it’s a great time to start.
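To make the hand-off to production a little more concrete, here is a minimal sketch in Python. It assumes a toy one-feature linear model and a JSON file as the agreed format between the data scientist and the data engineer; the function names, the `version` field, and the file name are all hypothetical, and a real deployment would involve far more (testing, monitoring, a model registry, and so on).

```python
import json
import os
import tempfile


def train(xs, ys):
    """Closed-form least squares fit for a one-feature linear model."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    # The "version" field is a hypothetical convention for this sketch.
    return {"version": 1, "slope": slope, "intercept": mean_y - slope * mean_x}


def save_model(model, path):
    """Data science side: write the fitted parameters to a plain file."""
    with open(path, "w") as f:
        json.dump(model, f)


def load_model(path):
    """Production side: load the parameters and reject unknown formats."""
    with open(path) as f:
        model = json.load(f)
    if model.get("version") != 1:
        raise ValueError("unsupported model version")
    return model


def predict(model, x):
    return model["slope"] * x + model["intercept"]


# Train on toy data, hand the file over, and serve a prediction from it.
model_path = os.path.join(tempfile.mkdtemp(), "model.json")
save_model(train([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]), model_path)
served = load_model(model_path)
prediction = predict(served, 10.0)
print(prediction)
```

The point of the sketch is the separation: training code and serving code only share a small, versioned artifact, which is the kind of contract worth agreeing on with your data engineers.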
Let’s face it: the cloud is king for data science and machine learning in 2020 and beyond. Moving compute and storage resources to cloud vendors like AWS, Microsoft Azure, or Google Cloud makes it easy and fast to set up a machine learning environment that can be accessed remotely. This requires data scientists to have a basic understanding of cloud infrastructure.
Knowledge of the cloud is not yet mandatory, but it’s getting that way. If you have this experience, it is definitely a valuable skill set. Some services to take a look at are: Google Colaboratory, Google ML Kit, Kaggle, IBM Watson, and NVIDIA Cloud.
Deep learning, a class of machine learning best suited to specific problem domains like image recognition and NLP, received a lot of press in 2019. But for more routine data science applications using structured/tabular data, conventional machine learning algorithms like XGBoost are recommended. As a result, most data scientists have come to regard image recognition and NLP as mere specializations of data science that not everyone needs to master.
Moving into 2020, however, the use cases for image classification and NLP are getting more and more frequent even in typical enterprise applications. I can therefore recommend that all data scientists acquire at least basic knowledge of deep learning. Even if you do not have direct applications of deep learning in your current job, experimenting with an appropriate data set will allow you to understand the steps required if the need arises in the future.
Math and Statistics
Knowledge of various machine learning techniques is integral to being a data scientist; machine learning experience is a primary differentiator from a data analyst. A fundamental understanding of the mathematical foundations of machine learning is critical to avoid just guessing at hyperparameter values when tuning algorithms. Knowledge of calculus (e.g. partial derivatives), linear algebra, statistics (including Bayesian theory), and probability theory is important for understanding how machine learning algorithms work.
I always tell my students they should strive to understand the theoretical basis of machine learning found in “The Machine Learning Bible,” The Elements of Statistical Learning, by Hastie, Tibshirani, and Friedman.
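As a small illustration of why the math helps with hyperparameters, the sketch below fits a line by gradient descent in plain Python. The toy data, the `fit` function, and the two learning rates are my own assumptions for the example; the gradients are simply the partial derivatives of the mean squared error with respect to each parameter.

```python
# Toy data generated from y = 2x + 1 (an assumption for this example).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def fit(learning_rate, steps=2000):
    """Fit y = w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of MSE = (1/n) * sum((w*x + b - y)^2)
        # with respect to w and b.
        grad_w = (2.0 / n) * sum((w * x + b - y) * x for x, y in zip(xs, ys))
        grad_b = (2.0 / n) * sum((w * x + b - y) for x, y in zip(xs, ys))
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# A modest learning rate settles near the true slope and intercept.
w, b = fit(learning_rate=0.05)

# A rate that is too large for this data overshoots on every step.
w_bad, b_bad = fit(learning_rate=0.2, steps=50)

print(w, b)
print(w_bad, b_bad)
```

If you understand that each update steps along the negative gradient, it becomes clear why an overly large learning rate makes the parameters swing further away instead of converging, which is exactly the kind of insight that replaces guessing.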
One question I routinely hear in my intro to data science classes is, “Should a data scientist know SQL?” Most emphatically: YES! The data sets for a data science project often come from an enterprise relational database, so SQL is your conduit for acquiring data. You should be well versed in SQL to get the maximum benefit from data acquisition. In addition, R packages like sqldf are a great way to query a data frame using SQL.
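For readers who want to practice, here is a minimal sketch of SQL-based data acquisition using Python’s built-in sqlite3 module. The in-memory `orders` table and its rows are made up for illustration and stand in for an enterprise relational database.

```python
import sqlite3

# Hypothetical in-memory table standing in for an enterprise database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("alice", "east", 120.0),
        ("bob", "west", 80.0),
        ("carol", "east", 200.0),
        ("dave", "west", 40.0),
    ],
)

# Aggregate in SQL so only the summarized rows reach the analysis code.
query = """
    SELECT region, COUNT(*) AS n_orders, SUM(amount) AS total
    FROM orders
    GROUP BY region
    ORDER BY total DESC
"""
rows = conn.execute(query).fetchall()
for region, n_orders, total in rows:
    print(region, n_orders, total)

conn.close()
```

Letting the database do the grouping and summing means you pull back only the rows you need, which matters a great deal once the tables are large.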
I’m guilty of ignoring AutoML, but the technology is coming on strong. The idea behind AutoML tools is to expand the capabilities of a resource that is in short supply: the data scientist. By automating many of the routine tasks a data scientist carries out, such as training and evaluating machine learning models, a smaller team can achieve more work. Nice concept, but I’m still not 100% convinced, and that’s likely why I haven’t dug into AutoML. Nevertheless, the technology is being taken seriously by many companies, so to widen your experience with all available tools, it would be wise to take a closer look.
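To give a rough feel for what is being automated, the sketch below runs a tiny model search in plain Python: it tries several settings of a simple nearest-neighbor model and keeps the one with the lowest validation error. This is not how any particular AutoML product works; the toy data, the candidate list, and the function names are assumptions for illustration only.

```python
def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def validation_error(train, valid, k):
    """Mean squared error of the k-nearest-neighbor model on held-out data."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in valid) / len(valid)


# Hypothetical toy data: y = x^2 on a grid, split into train and validation.
points = [(x / 2, (x / 2) ** 2) for x in range(20)]
train, valid = points[::2], points[1::2]

# The "automated" part: evaluate every candidate and keep the best one.
candidates = [1, 3, 5]
scores = {k: validation_error(train, valid, k) for k in candidates}
best_k = min(scores, key=scores.get)

print(scores)
print("selected k:", best_k)
```

Real AutoML tools search over far larger spaces of models, features, and hyperparameters, but the loop of train, evaluate, compare, and select is the routine work they aim to take off your plate.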
I always tell my newbie data science students to seek out new data sets and experiment, experiment, experiment! Data scientists can never get enough practice working with previously unknown data sources. Fortunately, the world is alive with data. It’s just a matter of matching your passions (environmental, economic, sports, crime stats, whatever) with available data so you can carry out the steps of the “data science process” to better hone your skills. The experience you gain from your own pet data experiments will only help you professionally down the line.
Data visualization is one of the most powerful things you can do with data, and it is the best way to showcase the results of a machine learning algorithm. It’s a primary ingredient of data storytelling (see final top skill below). With a well-crafted visualization, stakeholders will understand key results from only a few non-technical words of description during a project presentation.
I’m always looking for new data visualization techniques (using newly discovered packages to make the process easy) as I read articles, blogs, and books. This skill is a key to success for data science projects.
It’s always important to improve your data storytelling skills. This is probably the most difficult skill for data scientists to develop, since it’s a “soft” skill that requires a lot of creativity.
This skill is all about networking and interpersonal skills. It’s a path toward differentiating yourself from your data science peers (because few do it well). Engage with stakeholders, and they will lift you up when the organization needs it. In addition, good communication with senior management will keep you informed about upcoming projects. You will have to explain highly technical results to them without showing any code. Stay away from crystal ball explanations so people won’t think data science is “magic.” Preparing in advance is the best way to perform well.