With the current generation of data scientists working tirelessly to satisfy the accelerating demand for data-driven insights across nearly every industry, it’s natural to take a step back and ask what the “data scientist 2.0” might look like in the next 5-10 years. In this article, we’ll take out our crystal ball and try to characterize this next generation.
Ideally, the next generation of data scientists should evolve well beyond being technically proficient enough to land a cushy job with a life-changing salary (although those things are certainly nice). Instead, next-generation data scientists should be encouraged to become good problem solvers who follow the scientific method, to think deeply about the appropriate use of the data science process, and to use data responsibly and for the common good.
Technical Skills Fully Honed
In the 2012 Harvard Business Review article “Data Scientist: The Sexiest Job of the 21st Century,” DJ Patil claims to have coined the term “data scientist” in 2008 with Jeff Hammerbacher to describe their jobs at LinkedIn and Facebook, respectively. Patil asserts that a data scientist is “a new breed” of computer scientist. I latched onto the term for myself around 2012, a bit late because I was cautious about whether the term was just a fad.
In the past 5 years or so, due in large part to the shortage of data science skill sets, we’ve seen many of the current generation of data scientists migrate from other fields. Some of these fields are a big step away from the progenitors of data science, namely computer science and applied statistics. We now see nouveau data scientists coming from the physical sciences and social sciences, as well as many MBAs who are jumping ship from their chosen area of specialization. This means that the skills brought to the table by people from these disparate fields are potentially inconsistent. Sure, there are many data-driven disciplines outside of data science, but the training and technical knowledge of these practitioners may be on the periphery.
The next generation of data scientists will maintain a breadth of hard technical skills such as mathematics, statistics, probability theory, machine learning, coding, data visualization, and data storytelling. Coding is important, so a solid foundation in writing code, along with good practices like agile software development techniques, code reviews, debugging, and version control, is particularly valuable.
Important steps in the data science process need to be fully cultivated, such as exploratory data analysis (EDA), creative feature engineering, data transformations (polynomial terms, log transforms, binary encoding of categorical variables), and managing the vast number of models to choose from along with their components (hyperparameters, optimization methods, evaluation metrics, etc.). These are critical parts of data science, even though it’s easy to dismiss them as insignificant.
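As a rough illustration of the data transformations mentioned above, here is a minimal Python sketch using only the standard library. The records, field names, and helper functions are hypothetical and exist only for this example; in practice a data frame library would likely do this work.

```python
import math

# Hypothetical toy records; the field names are illustrative only.
rows = [
    {"income": 30000.0, "age": 25, "city": "Boston"},
    {"income": 120000.0, "age": 40, "city": "Austin"},
    {"income": 55000.0, "age": 33, "city": "Boston"},
]

def log_transform(x):
    """Compress a right-skewed non-negative value; log1p also handles zero."""
    return math.log1p(x)

def polynomial_terms(x, degree=2):
    """Return [x, x**2, ..., x**degree]."""
    return [x ** d for d in range(1, degree + 1)]

def one_hot(value, categories):
    """Encode a categorical value as a list of 0/1 indicator variables."""
    return [1 if value == c else 0 for c in categories]

# Fix the category order once so every row gets the same column layout.
categories = sorted({r["city"] for r in rows})

# Each feature row: [log(income), age, age**2, is_Austin, is_Boston]
features = [
    [log_transform(r["income"])]
    + polynomial_terms(r["age"])
    + one_hot(r["city"], categories)
    for r in rows
]
```

Fixing the category order up front is a small design choice that keeps the indicator columns aligned across rows.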
Slow Down and Proceed Methodically
Next-generation data scientists should avoid many of the common traps the current generation falls into, such as moving too quickly from a data set to applying a trendy algorithm and ignoring all the important steps in between. It’s important to slow down and think. It’s all too easy to quickly run a piece of code that makes predictions and then declare success when the algorithm converges. Really, that’s the easy part. The more difficult part is proceeding with careful consideration and making sure the results are correct and interpretable.
The next generation shouldn’t try to impress with complex learning models that don’t work that well and are mismatched with the problem being solved. There should be an emphasis on spending more time getting the data into shape. It’s perfectly acceptable to spend a significant portion of the project time down in the trenches working with the data. Don’t be embarrassed to admit you spend 80% of your time making sure the data is good.
The next generation should leave blind faith at the door when it comes to tools, methods, and academic departments (Stanford, we know you’re out there!). There is a sincere need to be open, accepting, flexible, and interdisciplinary.
Soft Skills are King
Many data scientists are able to perform resampling methods like cross-validation and the bootstrap, but many do them poorly. Most start off doing them poorly. What matters isn’t where you start out; it’s how you proceed from there. It’s important for the next generation to cultivate good habits and remain open to continuous learning.
A few good habits include persistence, thinking flexibly, thinking about thinking, and striving for accuracy. Try not to overestimate or underestimate your abilities. Give yourself reality checks by making sure you can code what you speak and by discussing methods and approaches with other data scientists.
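For readers who want a concrete reference point for the two resampling methods named above, here is a minimal sketch in plain Python. The function names, the toy data, and the choice of a percentile bootstrap are assumptions made for this illustration; they are not taken from any particular library.

```python
import random
import statistics

def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)       # shuffle once, reproducibly
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds covering all indices
    for i in range(k):
        test = folds[i]
        train = [j for f_i, f in enumerate(folds) if f_i != i for j in f]
        yield train, test

def bootstrap_mean_ci(values, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical measurements, used only to exercise the functions.
data = [float(v) for v in range(1, 21)]
splits = list(k_fold_indices(len(data), k=5))
ci_low, ci_high = bootstrap_mean_ci(data)
```

A common way to do cross-validation poorly is to let test rows leak into training. Each split here keeps the two index lists disjoint, which is the property worth checking in any implementation.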
Apply the Scientific Method
Ideally, next-generation data scientists should subscribe to the “scientific method” in the way they test hypotheses and welcome challenges and alternative theories. Sometimes that means finding holes in ideas and devising tests, as true scientists do.
It’s also important to ask a lot of questions. Adopt the perspective of innate curiosity, and don’t worry about appearing stupid. Don’t be afraid to ask for clarification. And maybe most importantly, do not confuse correlation and causation. In fact, it’s a good idea to err on the side of assuming you’re looking at correlation.
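To make the correlation-versus-causation warning concrete, here is a small simulation sketch in plain Python. The variables are entirely made up: a hidden "temperature" factor drives both simulated series, and neither series influences the other. The `pearson` helper and the partial-correlation step are assumptions chosen for this illustration.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

rng = random.Random(42)
n = 5000

# Simulated confounder and two outcomes that both depend on it,
# but not on each other.
temperature = [rng.gauss(0, 1) for _ in range(n)]
ice_cream = [t + rng.gauss(0, 0.5) for t in temperature]
sunburns = [t + rng.gauss(0, 0.5) for t in temperature]

r_xy = pearson(ice_cream, sunburns)
r_xz = pearson(ice_cream, temperature)
r_yz = pearson(sunburns, temperature)

# Partial correlation of the two outcomes after controlling for the confounder.
partial = (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
```

In this setup the raw correlation `r_xy` is strong, while `partial` sits close to zero once the shared driver is accounted for. In real data the confounder is often unobserved, which is one reason to default to assuming correlation.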
Next-generation data scientists should remain skeptical about the statistical models they use, considering how those models may fail as well as the implications and consequences of what they’re building.
Proceed with Ethics
And finally, keep in mind that the data generated by user behavior becomes the building blocks of data-driven products, which are used by users and influence user behavior at the same time. It’s important to realize that algorithms are capable not only of predicting the future, but also of directing it. Next-generation data scientists shouldn’t let their salaries blind them to their models being used for unethical purposes. Instead, they should seek out opportunities to solve problems of social value and consider the impact and consequences of their models.