The year 2020 was a significant outlier for just about everything happening on planet Earth, but fortunately, data science continued to roll forward. The pandemic, in particular, reinforced the value of "working smart": advancing your skills to maximize your professional prospects. I reinvented myself in a number of ways this past year, and I made some observations along the way that I'd like to pass on in this article. The unusual circumstances we still face as we transition into the New Year also present unique opportunities; it's just a matter of developing a workable strategy. Here's my list of top data science skills, in no particular order.
1. Get Your Feet Wet with GPUs
Now is the time to understand what the rage is all about with respect to GPUs. The easiest way to get started with GPUs for machine learning (unless you can afford to buy an NVIDIA DGX Station™ A100 to stick under your desk at work), is to start with a cloud GPU service. Here is a short-list of some options that may be suitable for your needs:
- Colab – Google Colaboratory, or “Colab” for short, is a product from Google Research. Colab allows anybody to write and execute arbitrary Python code through the browser, and is especially well suited to machine learning. Specifically, Colab is a hosted Jupyter notebook service that requires no setup to use, while providing free access to computing resources including GPUs.
- Kaggle – Kaggle (owned by Google) provides free access to NVIDIA TESLA P100 GPUs. These GPUs are useful for training deep learning models, though they don't accelerate most other workflows, such as the pandas and scikit-learn Python libraries. GPU usage is subject to a weekly quota, typically 30 hours, sometimes higher depending on demand and available resources.
- NVIDIA NGC – The NGC™ catalog is a hub of GPU-optimized software for deep learning (DL), machine learning (ML), and high-performance computing (HPC). It streamlines development and deployment workflows so data scientists, developers, and researchers can focus on building solutions, gathering insights, and delivering business value.
- Amazon SageMaker – You can use Amazon SageMaker to easily train deep learning models on Amazon EC2 P3 instances, fast GPU instances in the cloud. With up to 8 NVIDIA V100 Tensor Core GPUs and up to 100 Gbps networking bandwidth per instance, you can iterate faster and run more experiments by reducing training times from days to minutes.
- Cloud GPUs on Google Cloud Platform – High-performance GPUs on Google Cloud for machine learning, scientific computing, and 3D visualization.
- Lambda GPU Cloud for Deep Learning – Lambda offers Lambda GPU Cloud, a dedicated GPU cloud service for Deep Learning.
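Before committing to any of these services, it helps to be able to confirm that a GPU is actually visible to your code. Here's a minimal, dependency-free sketch (it assumes only the standard `nvidia-smi` utility that ships with NVIDIA drivers) for probing from Python before deciding where to run a job:

```python
import shutil
import subprocess

def gpu_available() -> bool:
    """Return True if an NVIDIA GPU is visible via the nvidia-smi utility."""
    smi = shutil.which("nvidia-smi")  # locate the driver's CLI tool, if any
    if smi is None:
        return False
    try:
        # nvidia-smi exits with status 0 when it can talk to at least one GPU
        return subprocess.run([smi], capture_output=True).returncode == 0
    except OSError:
        return False

print("GPU available:", gpu_available())
```

On Colab or Kaggle with a GPU runtime enabled this prints `True`; on a plain laptop it simply falls back to `False`, which makes it a safe guard clause in shared notebooks.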
2. Creative Data Visualization & Data Storytelling
Data visualization coupled with data storytelling remains a top skill to be cultivated by all data scientists. This integral step of the data science process is one skill that sets data scientists apart from their data engineering colleagues. Data scientists assume the important and unique role of interacting with project stakeholders when delivering the results of a data science project.
Aside from typical reports and numeric results, a compelling, well-thought-out data visualization is the best way to showcase the results of a machine learning algorithm. It's also a primary ingredient of the final data storytelling stage of the project, where the data scientist strives for a concise, non-technical description of the results in which the key findings are easily understood.
This is one area where I always feel I'm lacking, because I tend toward left-brain thinking. To supply the more creative and visual right-brain elements, I'm always looking for new data visualization techniques, via newly discovered R packages and Python libraries, to make the results more compelling.
Editor’s note: Learn more about data visualization and data storytelling in our upcoming Ai+ Training sessions, “Getting Started with D3.js for Data Visualization” and “From Numbers to Narrative: Data Storytelling & Visualization,” and see why these are important data science skills for 2021.
3. Python
Although I was fully indoctrinated in R during grad school, and although I still use R for most first-cut data experiments, I find myself using Python more and more these days. Python is hard to ignore: most of the cool blog articles and learning materials on Medium use it as the language of choice, most deep learning papers appearing on arXiv point to GitHub repos with Python code built on frameworks like Keras, TensorFlow, and PyTorch, and most everything happening on Kaggle involves Python.
R used to have the edge, with 16,891 packages available to supplement the base language, but Python now claims an order of magnitude more. For many reasons, I've decided to make Python my primary language in 2021. A robust knowledge of Python is an important data science skill to learn.
4. SQL
I don't like how many data science language rankings include SQL alongside R and Python among needed data science skills: SQL is a great data query language, but it's not a general-purpose programming language. In any case, I always tell my "Intro to Data Science" classes that every data scientist should be proficient with SQL. Often the data sets for a data science project come directly from an enterprise relational database, so SQL is your conduit for acquiring data. In addition, you can run SQL directly from R and Python as a convenient way to query a data frame.
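As an illustration of SQL as a conduit for acquiring data in Python, here is a minimal sketch using the built-in sqlite3 module. The in-memory `orders` table and its columns are made up for the example, standing in for a real enterprise database:

```python
import sqlite3

# A small in-memory database standing in for an enterprise source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "east", 120.0), (2, "west", 75.5), (3, "east", 40.0)],
)

# Pull a project-ready, pre-aggregated data set with plain SQL.
rows = conn.execute(
    "SELECT region, SUM(amount) AS total FROM orders "
    "GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # -> [('east', 160.0), ('west', 75.5)]
```

The same query text works against most relational databases; only the connection object changes, which is exactly why SQL proficiency transfers so well across projects.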
5. GBM over Deep Learning
AI and deep learning continue to sit at the top of the industry "hype cycle," and I suspect that 2021 will be no different. Deep learning is the perfect tool for many problem domains, such as image classification, autonomous vehicles, and NLP. But when it comes to tabular data, i.e. typical business data, deep learning may not be the optimal choice. Instead, GBM (gradient boosting machines) is the class of machine learning algorithm that usually achieves the best accuracy on structured/tabular data, beating other algorithms including the much-hyped deep neural networks. Some of the top GBM implementations include XGBoost, LightGBM, H2O, and CatBoost.
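To make the idea concrete, here is a deliberately tiny, pure-Python sketch of the gradient boosting principle for squared loss: each round fits a one-feature decision stump to the residuals of the current model and adds it with a small learning rate. Real GBMs like XGBoost add regularization, deeper trees, and many optimizations; this toy assumes a single feature with at least two distinct values and exists only to show the mechanics:

```python
def fit_stump(x, residuals):
    """Find the single-split threshold minimizing squared error on residuals."""
    best = None
    for t in sorted(set(x))[:-1]:  # the max value would leave the right side empty
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        lval, rval = sum(left) / len(left), sum(right) / len(right)
        err = sum(
            (r - (lval if xi <= t else rval)) ** 2
            for xi, r in zip(x, residuals)
        )
        if best is None or err < best[0]:
            best = (err, t, lval, rval)
    return best[1:]  # (threshold, left prediction, right prediction)

def gbm_fit(x, y, n_rounds=50, lr=0.1):
    """Boosting for squared loss: each stump fits the current residuals."""
    f0 = sum(y) / len(y)          # initial prediction: the mean
    pred = [f0] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        t, lval, rval = fit_stump(x, residuals)
        stumps.append((t, lval, rval))
        pred = [p + lr * (lval if xi <= t else rval) for xi, p in zip(x, pred)]
    return f0, lr, stumps

def gbm_predict(model, x):
    f0, lr, stumps = model
    return [
        f0 + sum(lr * (lval if xi <= t else rval) for t, lval, rval in stumps)
        for xi in x
    ]
```

Fit it on a step function (y = 0 for x <= 5, 1 otherwise) and the ensemble converges toward the true step, which is precisely the additive, residual-chasing behavior that makes GBMs so effective on tabular data.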
6. Data Transformation
It's often mentioned in undertones when data scientists meet: the data munging (aka data wrangling or data transformation) process consumes the majority of the time and cost budget of a given data science project. I've confirmed this estimate many times in my professional experience. Transforming data isn't the sexiest work, but getting it right can mean success or failure down the line with machine learning. For a task that important, a data scientist should make certain to build up a toolbox of code addressing the common needs. If you use R, that means dplyr; if you use Python, pandas is your tool of choice.
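As a small illustration of typical munging steps in pandas (the data and column names below are invented for the example), note how type coercion, missing-value handling, and aggregation chain together into one readable pipeline:

```python
import pandas as pd

# Raw extract: strings where there should be dates and numbers, plus gaps.
raw = pd.DataFrame({
    "date":   ["2020-01-01", "2020-01-02", None, "2020-01-04"],
    "region": ["east", "west", "east", "west"],
    "sales":  ["100", "250", "80", None],
})

clean = (
    raw
    .assign(
        date=lambda d: pd.to_datetime(d["date"]),    # coerce to datetime
        sales=lambda d: pd.to_numeric(d["sales"]),   # coerce to numeric
    )
    .dropna(subset=["date"])       # rows with no date are unusable here
    .fillna({"sales": 0})          # missing sales treated as zero
    .groupby("region", as_index=False)["sales"].sum()
)
print(clean)
```

Each step is a judgment call (is a missing sale really zero? is a dateless row really droppable?), which is why munging eats so much project time: the code is easy, the decisions are not.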
7. Mathematics and Statistics
Maintaining a firm grasp of the underpinnings of machine learning algorithms requires a foundation in mathematics and statistics. This area is often the final learning effort for data scientists, because math and stats may not be on their personal short-list of skills to get up to speed with. But a fundamental understanding of the mathematical foundations of machine learning is critical if you want to do more than guess at hyperparameter values when tuning algorithms. The important areas are differential calculus (partial derivatives, gradients, and optimization), integral calculus (e.g. the area under an ROC curve), linear algebra, statistics, and probability theory.
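As a small worked example of the integral-calculus connection: the AUC of an ROC curve is just the area under a piecewise-linear curve, which the trapezoid rule computes directly from the (FPR, TPR) points:

```python
def auc_trapezoid(fpr, tpr):
    """Area under an ROC curve via the trapezoid rule.

    fpr, tpr: equal-length sequences of ROC points, sorted by fpr ascending.
    Each segment contributes width * average height, i.e. a trapezoid's area.
    """
    return sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
        for i in range(1, len(fpr))
    )

# A classifier that reaches TPR=1 at FPR=0.5: area = 0.25 + 0.5 = 0.75
print(auc_trapezoid([0.0, 0.5, 1.0], [0.0, 1.0, 1.0]))  # -> 0.75
```

A perfect classifier's curve, (0,0) to (0,1) to (1,1), integrates to exactly 1.0, and the chance diagonal to 0.5, which is why those two numbers bracket every AUC you report.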
One goal all data scientists share is to be able to consume "The Machine Learning Bible": The Elements of Statistical Learning, by Hastie, Tibshirani, and Friedman. It's one of those books you never finish reading; I dip into the text from time to time just to refresh my theoretical background.
To brush up on your mathematics, check out MIT Professor Gilbert Strang’s OpenCourseWare content.
8. Performing Data Experiments
I always tell my newbie data science students to seek out new data sets and experiment, experiment, experiment! Data scientists can never get enough practice working with previously unfamiliar data sources. Fortunately, the world is alive with data. It’s just a matter of matching your passions (environmental, economic, sports, crime stats, whatever) with available data so you can carry out the steps of the “data science process” to better hone your skills. The experience you gain from your own pet data experiments will only help you professionally down the line.
9. Domain Knowledge
As an independent data science consultant, I don't specialize in any particular industry. That's one reason I like remaining independent: I can work on all sorts of interesting projects across a wide spectrum of problem domains, including manufacturing, non-profit, education, sports, fashion, and real estate, to mention just a few. When I engage a new client in a new industry, I try to build up my domain knowledge quickly. I speak with SMEs (subject matter experts) from the client organization, review the available data sources, and read everything I can find on the subject, including white papers, blog posts, journals, books, and research papers, all in an attempt to hit the ground running.
Acquiring the domain knowledge required for a given data science project can go a long way even if you don't intend to build a career in that industry. You may be surprised how much different industries have in common, so learning one domain well often helps in others.
10. Ethical Machine Learning
I always save my "data science superpowers" talk for the last lecture of my "Intro to Data Science" classes. I give my students a short but powerful list of cases where data scientists chose to say "no" when asked to use their skills for nefarious purposes. I tell them of the data scientists who work to develop technology designed to create undetectable "deep fake" images and videos. I tell them of the time I witnessed a data science manager at a large public gaming company tell a Meetup crowd that he and his team worked with psychologists to figure out ways to addict children to their games. And I tell them of Rebekah Jones, the State of Florida data scientist who refused to doctor COVID-19 data to make the state's public health status look better. I've faced ethical roadblocks in my own career, when I had to say "no" on moral grounds. I tell my students to think ahead and plan for the day when they, too, may have to take a stand against using their skills to hurt others. Looking ahead to 2021, the political climate may be ripe for such dilemmas.