The year 2021 flew past in a flash for us in the data science industry. This is perhaps due to the frenetic pace at which many areas of technology have evolved in just a single trip around our star. It’s hard for any one person to keep a finger on the pulse of everything that’s important to know in data science, and for that reason, I’d like to offer this annual skills update based on what I’ve seen and heard throughout the year. Many of the suggestions I included in last year’s skills round-up still apply. This year, I’ll offer a number of additional thoughts about skills that have helped me (and many of my colleagues) stay ahead in the profession. As we move into 2022, here’s my list of the 10 most in-demand data science skills, in no particular order. Enjoy!
1– GBM Algorithms: Gradient boosting is one of the most powerful techniques used in the field of machine learning. The vast majority of my colleagues and I use boosting techniques for work in data science. Gradient boosting can be used for predicting not only a continuous response variable (regression) but also a categorical response variable (classification). If you haven’t already climbed aboard the gradient boosting bandwagon, then 2022 is a good time to do so. To learn more about the various gradient boosting implementations that are out there (XGBoost, H2O, CatBoost, LightGBM), check out this blog that includes extensive benchmark results for the implementations, along with an important video presentation by the main author of XGBoost, Tianqi Chen.
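To make the idea concrete, here is a minimal sketch of gradient boosting for regression under squared error, written in plain Python. It is an illustration only and not how XGBoost, H2O, CatBoost, or LightGBM are implemented: the weak learner is a one-split "stump" on a single feature, and the names `fit_stump`, `fit_gbm`, and `predict` are hypothetical helpers invented for this example. It assumes the feature has at least two distinct values.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Find the one-threshold split on x that minimizes squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    # Excluding the largest value guarantees the right side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value, right_value = mean(left), mean(right)
        error = sum((r - left_value) ** 2 for r in left)
        error += sum((r - right_value) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value


def fit_gbm(xs, ys, n_rounds=50, learning_rate=0.1):
    """Fit an additive model: each stump is trained on the current residuals."""
    base = mean(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
    return base, learning_rate, stumps


def predict(model, x):
    """Sum the base value and the shrunken contribution of every stump."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(
        left if x <= threshold else right for threshold, left, right in stumps
    )


def mse(predictions, targets):
    return mean((p - t) ** 2 for p, t in zip(predictions, targets))


# Toy data (made up for illustration): a quadratic relationship.
xs = [float(x) for x in range(10)]
ys = [x * x for x in xs]

model = fit_gbm(xs, ys)
baseline_mse = mse([mean(ys)] * len(ys), ys)
model_mse = mse([predict(model, x) for x in xs], ys)
print(f"baseline MSE: {baseline_mse:.2f}, boosted MSE: {model_mse:.2f}")
```

The production libraries add much more on top of this loop (regularization, multi-feature trees, classification losses), so treat the sketch as a way to read their documentation with the core mechanism in mind.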
2– ModelOps for enterprise-class model operationalization: There is a lot of buzz around new “ModelOps” platforms designed to help manage the many models floating around enterprises these days. The key to operationalizing models is the ability to address the critical challenges of governance and scale, which must be met to unlock the transformational value of enterprise AI and machine learning investments. ModelOps represents a holistic approach for quickly and iteratively advancing models through the machine learning life cycle so they are deployed more rapidly and deliver the desired business value. If you find yourself employed in a large enterprise environment, it may behoove you to familiarize yourself with ModelOps solutions, and be ready to jump into the pool with this technology. Check out my overview article on ModelOps from earlier this year.
3– Feature Stores: Feature stores represent another important technology attractive to enterprises that intend to build many models based on common entities within an organization. The main benefit of a feature store is that it encapsulates the logic of feature transformations to automatically prep new data and serve up examples for training or inference. If your organization finds itself replicating effort to code feature transformations or copying/pasting feature engineering code from project to project, then a feature store could greatly simplify your data pipeline. Check out my overview article on feature stores from earlier this year.
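As a rough sketch of the encapsulation idea, the toy registry below keeps each feature transformation in one place and applies the same logic to training rows and to a new row at inference time. Everything here is hypothetical and invented for illustration (the `FeatureStore` class, its methods, the feature names, and the order data); real feature store products also handle storage, versioning, and serving, which this example ignores.

```python
from typing import Callable, Dict, List

Row = Dict[str, float]


class FeatureStore:
    """Hypothetical in-memory registry of named feature transformations."""

    def __init__(self) -> None:
        self._features: Dict[str, Callable[[Row], float]] = {}

    def register(self, name: str, transform: Callable[[Row], float]) -> None:
        """Store the transformation once so every project reuses the same logic."""
        self._features[name] = transform

    def get_features(self, row: Row, names: List[str]) -> Row:
        """Apply the registered transformations to a raw row."""
        return {name: self._features[name](row) for name in names}


store = FeatureStore()
store.register("order_value_per_item", lambda r: r["order_total"] / r["item_count"])
store.register("is_large_order", lambda r: float(r["item_count"] >= 10))

feature_names = ["order_value_per_item", "is_large_order"]

# The same definitions serve training examples...
training_rows = [
    {"order_total": 120.0, "item_count": 12},
    {"order_total": 30.0, "item_count": 3},
]
training_features = [store.get_features(r, feature_names) for r in training_rows]

# ...and a new row at inference time, with no copy-pasted feature code.
inference_features = store.get_features(
    {"order_total": 45.0, "item_count": 9}, feature_names
)
print(training_features, inference_features)
```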
4– Synthetic data: The rising importance of synthetic data for deep learning is real, especially for computer vision applications. Synthetic data refers to computer-generated images and simulations used to train computer vision models. The new methods use a radically different approach compared to classic graphics tools and achieve new highs of photorealism. Synthetic data is emerging as an essential element in building accurate and capable AI models, as it provides developers with vast amounts of perfectly labeled data on demand. A recent industry prediction by Gartner projects that by 2025, synthetic data will reduce personal customer data collection, avoiding 70% of privacy violation sanctions. There is a great new book on the subject, Synthetic Data for Deep Learning, if you’d like a learning resource to get started.
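The "perfectly labeled" property is easy to see in a toy setting. The sketch below is a deliberately simple stand-in for the photorealistic pipelines described above: it draws one rectangle on a small binary grid and records the exact bounding box as the label. The function name `make_sample`, the grid size, and the label format are assumptions made for this example.

```python
import random


def make_sample(rng, size=16):
    """Return a size x size binary image with one rectangle and its exact label."""
    width = rng.randint(2, size // 2)
    height = rng.randint(2, size // 2)
    left = rng.randint(0, size - width)
    top = rng.randint(0, size - height)

    image = [[0] * size for _ in range(size)]
    for row in range(top, top + height):
        for col in range(left, left + width):
            image[row][col] = 1

    # The label is known exactly because we generated the scene ourselves.
    label = {"left": left, "top": top, "width": width, "height": height}
    return image, label


rng = random.Random(0)  # fixed seed so the data set is reproducible
dataset = [make_sample(rng) for _ in range(100)]

image, label = dataset[0]
print(label)
print("\n".join("".join("#" if px else "." for px in row) for row in image))
```

Because the generator controls the scene, the labels require no human annotation, and producing more examples is just a matter of calling the function again.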
5– Missing data handling: Handling missing data in data sets is a problem that all data scientists must learn to manage. Imputation is a technique used by many professionals. It is the process of replacing missing data with substituted values so that you won’t have to use the brute-force method of deleting whole observations. Fortunately, this is a fertile area of research in statistical learning, and papers addressing this need routinely appear on the arXiv pre-print server. I enjoy reading these papers to get ideas for how I might gain an advantage in my project work. I recently found a new book titled “Statistical Methods for Handling Incomplete Data, 2nd Edition,” which provides a survey of current research on the subject. One warning: this is a graduate-level text, and it may take a while to work through.
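The contrast between deleting observations and imputing values can be sketched in a few lines. Mean imputation is used here only because it is the simplest possible substitute and not because it is the recommended method; the research mentioned above covers far better approaches. The records and the helper names `drop_incomplete` and `impute_mean` are made up for illustration.

```python
from statistics import mean

# Toy records (invented); None marks a missing value.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 45, "income": None},
    {"age": 29, "income": 48000},
]


def drop_incomplete(rows):
    """Brute-force approach: discard any observation with a missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def impute_mean(rows):
    """Replace each missing value with the mean of the observed values in its column."""
    columns = list(rows[0].keys())
    fill = {
        c: mean([r[c] for r in rows if r[c] is not None]) for c in columns
    }
    return [
        {c: (r[c] if r[c] is not None else fill[c]) for c in columns}
        for r in rows
    ]


complete_cases = drop_incomplete(rows)
imputed = impute_mean(rows)

print(f"rows kept after deletion: {len(complete_cases)} of {len(rows)}")
print(f"rows kept after imputation: {len(imputed)} of {len(rows)}")
```

Deletion throws away half of this tiny data set, while imputation keeps every observation at the cost of understating the variability in the filled-in columns.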
6– Strong grasp of statistics: A good understanding of statistics is a vital skill for data scientists at all levels because they regularly need to apply statistical concepts and techniques. Being familiar with topics like probability distributions, hypothesis testing, variance, correlation, p-values, and other elements of statistics helps data scientists collect, organize, analyze, interpret, transform, and visualize data in preparation for machine learning. One great option for getting up to speed with statistics is the free OpenIntro Statistics book and courseware.
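As one small example of hypothesis testing and p-values in practice, here is a sketch of a two-sided permutation test for a difference in means. It is one of several valid ways to test such a difference, chosen here because it needs nothing beyond the standard library; the function name `permutation_test`, the measurements, and the number of permutations are all assumptions made for this illustration.

```python
import random


def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Estimate a two-sided p-value for the difference in group means.

    Under the null hypothesis the group labels are exchangeable, so we shuffle
    the pooled values and count how often a difference at least as large as
    the observed one appears by chance.
    """
    def avg(values):
        return sum(values) / len(values)

    rng = random.Random(seed)
    observed = abs(avg(group_a) - avg(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if abs(avg(pooled[:n_a]) - avg(pooled[n_a:])) >= observed:
            extreme += 1

    # Adding 1 to both counts keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


# Invented measurements for two groups.
control = [12.1, 11.8, 12.4, 12.0, 11.9, 12.2, 12.3, 11.7]
treatment = [13.0, 12.9, 13.4, 13.1, 12.8, 13.3, 13.2, 12.7]

p_value = permutation_test(control, treatment)
print(f"estimated p-value: {p_value:.4f}")
```

A small p-value says the observed gap would be rare if the labels did not matter; it does not by itself say how large or how important the difference is.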
7– Importance of clean coding style for maintainable code: There are times when this skill might not be strictly required, e.g., when you’re writing quick-and-dirty code for ad hoc analyses, but if you plan to reuse or distribute your code, adopting a clean and professional coding style will result in far greater productivity. If you plan to post code on your GitHub repo, by all means, make sure the code is nice and tidy. Writing clean Python or R code can seem daunting at first, but it’s really just a matter of finding a good model to follow. A while back, I found a nice book called “Machine Learning for Hackers” that included professional-looking R code.
8– Knowledge of data storage options: Data scientists should be aware of the pros and cons associated with different data storage technologies (e.g., SQL, NoSQL, graph) depending on the problem domain in which they are working. An awareness of how data is stored can help in the design of more effective data pipelines. Specifically, it is useful for data scientists to know about databases, data warehouses, and data lake architectures, along with the new data fabric/mesh solutions.
Data scientists should also be familiar with cloud storage options, e.g., AWS S3, Google Cloud Storage, and Azure Blob Storage. Amazon S3 has quickly become the de facto standard for cloud storage and is widely used as a data lake.
9– Domain knowledge: Data scientists should strive to understand the business domain for a given project. Such knowledge helps data scientists comprehend the data and derive context from it, which in turn leads to better decision-making within an organization. When the data scientist lacks the required knowledge of the business domain, it is critical to have access to a subject matter expert (SME) for questions and clarifications. One byproduct of this collaboration is a detailed data dictionary for the data sets involved. It is important for data scientists to participate in business conversations and provide meaningful input regarding projects, the questions being asked, and the potential outcomes. An inability to do so can be an obstacle to building successful data products.
10– Causal inference: Data scientists are often asked questions related to causality: (i) did a recent social media campaign drive sign-ups, (ii) do customer support inquiries increase sales, or (iii) did updating a statistical model drive revenue? Project stakeholders often require data scientists to use techniques that can answer questions like these, which center on issues of causality and are addressed with causal inference.
Causal inference is the study of how actions, interventions, or treatments affect outcomes of interest. It bridges the gap between prediction and decision-making. This matters because prediction models alone, even highly accurate ones, are of no help when reasoning about what might happen if we change a system or take an action. Causal inference has seen increased interest in the data science community and will continue to be a high-priority skill moving into 2022.
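A small simulation shows why a naive comparison can mislead and how one basic adjustment helps. The scenario is invented: engaged customers are both more likely to see a campaign and more likely to sign up anyway, and the campaign's true effect is fixed at 0.10 by construction. Stratifying on the confounder is only one of many causal inference techniques, and it works here because the confounder is observed; the helper names `simulate`, `naive_effect`, and `adjusted_effect` are hypothetical.

```python
import random


def simulate(n, rng):
    """Generate (engaged, treated, signed_up) rows with a known true effect of 0.10."""
    rows = []
    for _ in range(n):
        engaged = rng.random() < 0.5
        # Confounding: engaged customers are far more likely to see the campaign.
        treated = rng.random() < (0.8 if engaged else 0.2)
        p_signup = 0.1 + (0.3 if engaged else 0.0) + (0.1 if treated else 0.0)
        signed_up = rng.random() < p_signup
        rows.append((engaged, treated, signed_up))
    return rows


def signup_rate(rows):
    return sum(signed_up for _, _, signed_up in rows) / len(rows)


def naive_effect(rows):
    """Difference in sign-up rates between treated and untreated rows."""
    treated = [r for r in rows if r[1]]
    control = [r for r in rows if not r[1]]
    return signup_rate(treated) - signup_rate(control)


def adjusted_effect(rows):
    """Compare within each engagement stratum, then weight by stratum size."""
    effect = 0.0
    for level in (True, False):
        stratum = [r for r in rows if r[0] == level]
        effect += naive_effect(stratum) * len(stratum) / len(rows)
    return effect


rng = random.Random(42)
rows = simulate(50_000, rng)

naive = naive_effect(rows)
adjusted = adjusted_effect(rows)
print(f"naive estimate:    {naive:.3f}")
print(f"adjusted estimate: {adjusted:.3f}  (true effect is 0.10 by construction)")
```

The naive estimate mixes the campaign's effect with the effect of engagement, while the stratified estimate lands close to the true value. With real data the confounders are rarely known this cleanly, which is where the more advanced methods come in.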
Editor’s note: How to Learn These Data Science Skills at ODSC East 2022
At ODSC East 2022, coming this April 19th-21st in Boston, MA and virtually, you can learn all of these in-demand data science skills and more, ranging from machine learning fundamentals to how to apply automation to specific industries. Some current sessions include:
- Dealing with Bias in Machine Learning
- Statistics for Data Science
- Mastering Gradient Boosting with CatBoost
- Engineering a Performant Machine Learning Pipeline: From Dask to Kubeflow
- Multi-Task Reinforcement Learning
- End to end Machine Learning with XGBoost
- Next Generation of Distributed Computing with Dask
- AI Observability: How To Fix Issues With Your ML Model
- …and many more to come!
We also just added three new focus areas: Machine Learning Safety and Security, Machine Learning for Biotech and Pharma, and the AI Startup Showcase and Lab.
Register now for 70% off all ticket types before prices go up!