Working as a Data Scientist — Expectation Versus Reality!
Career Insights | Posted by ODSC Community | February 13, 2023
Working in Data Science and Machine Learning (ML) can be quite different from what you expect. While you are studying, the problems you solve in course projects can differ from those you meet in the real world. I started working in Data Science right after graduating with an MS degree in Electrical and Computer Engineering from the University of California, Los Angeles (UCLA). During my MS, I got the opportunity to work on many types of data and ML projects, including web scraping to collect data, parsing big data, building unsupervised and supervised ML models, creating deep neural networks, working with text data using Natural Language Processing, and working with speech data using audio processing techniques. As I worked on these projects, I knew I wanted to work as a Data Scientist once I graduated. After spending a few years actually working in the space, I realized there was a lot more to it than I had expected, and knowing about it beforehand could have made the transition much smoother.
In this article, I share 11 key differences between working as a Data Scientist at a company and the expectations you may bring from academia.
1. A popular focus of a majority of Data Science courses, degrees, and online competitions is on creating a model that has the highest accuracy or best fit. On the contrary, model accuracy may not be the most or only important factor in the industry.
In most projects, the goal is to create the “best” model. While the “best” model may be the one with the highest accuracy in an academic project, it is not necessarily the best model for the job. What a “best” model might mean varies from project to project in the real world. For instance, for a classification problem, you may end up choosing a model that has lower overall accuracy, but higher accuracy for the classes that are most important to the business. Consider a genre classification problem on content descriptions that needs to classify the text into one of five classes — entertainment, education, news, sports, and gaming.
classes [“entertainment”, “education”, “news”, “sports”, “gaming”]
Model 1: 72% accuracy, per-class [0.75, 0.76, 0.60, 0.69, 0.79]
Model 2: 70% accuracy, per-class [0.60, 0.64, 0.80, 0.79, 0.68]
Let’s assume that accuracy is the relevant evaluation metric in this example. Which model would you pick?
Let me guess…… Model 1?
Now let’s say the business has the most users consuming news and sports content, and it is more important for your company to do better in those categories. With that context, between Model 1 and Model 2, it may be better to select Model 2.
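One way to make this choice explicit is to weight each class's accuracy by how much that class matters to the business. The sketch below is illustrative only: the `traffic_share` values are hypothetical (the example gives no actual usage figures), and `weighted_accuracy` is a helper name introduced here.

```python
classes = ["entertainment", "education", "news", "sports", "gaming"]
model_1 = [0.75, 0.76, 0.60, 0.69, 0.79]
model_2 = [0.60, 0.64, 0.80, 0.79, 0.68]

# Hypothetical share of user traffic per class (an assumption, not real data).
traffic_share = [0.10, 0.10, 0.40, 0.30, 0.10]


def weighted_accuracy(per_class, weights):
    """Average the per-class accuracies, weighted by business importance."""
    return sum(a * w for a, w in zip(per_class, weights)) / sum(weights)


score_1 = weighted_accuracy(model_1, traffic_share)
score_2 = weighted_accuracy(model_2, traffic_share)
print(f"Model 1: {score_1:.3f}, Model 2: {score_2:.3f}")
```

With these assumed weights, Model 2 scores higher (about 0.75 versus 0.68), even though its unweighted accuracy is lower.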
2. In industrial applications of Data Science, model complexity, model explainability, efficiency, and ease of deployment play a large role, even if that means you’re settling for a slightly less accurate model.
This is even more common for first-time baseline models. If a random forest is not giving a big enough win over logistic regression, then the latter might be preferred for its lower complexity.
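A rule like this can be written down before any models are trained. The following is a minimal sketch, assuming you compare one validation score per model; the `pick_model` name, the `min_gain` threshold of 0.02, and the scores are all hypothetical choices for illustration.

```python
def pick_model(simple_score, complex_score, min_gain=0.02):
    """Prefer the simpler model unless the complex one wins by at least min_gain."""
    if complex_score - simple_score >= min_gain:
        return "complex"
    return "simple"


# Hypothetical validation scores: logistic regression vs. random forest.
print(pick_model(simple_score=0.81, complex_score=0.82))  # gain too small
print(pick_model(simple_score=0.81, complex_score=0.88))  # gain is large enough
```

What counts as a big enough win depends on the project; the threshold is a decision for you and the business, not a fixed constant.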
3. In the industry, you may need to work out where and how to get data, and collaborate with Data Engineering to get it in place, before you can even begin to build models.
In courses/projects, it is common to have data available. Processing and modeling are the focus. In industry, the data piece is critical. You may not have all the data you need to build a model. You might even need to write custom data crawling code, find public datasets, and find pragmatic ways to augment data to solve the problem. While dealing with larger quantities of data, you will likely be working with Data Engineers to create ETL (extract, transform, load) pipelines to get data from new sources. You will need to learn to query different databases depending on which ones your company uses.
4. In industrial applications, data is most likely not going to be in a clean state to begin with.
Real-world data is messier than any practice dataset you might have used in the past. Practitioners have reported that understanding, cleaning, and transforming the data takes them the longest. Getting the data right is a critical piece of the puzzle.
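To give a feel for what this cleaning involves, here is a small sketch on made-up records. The `raw_rows` data and the `clean_rows` helper are hypothetical; real pipelines typically need many more rules than these.

```python
# Made-up records showing common problems: stray whitespace, inconsistent
# casing, duplicates, empty fields, and numbers stored as text.
raw_rows = [
    {"title": "  Local Team Wins Final ", "views": "1,204"},
    {"title": "local team wins final", "views": "1204"},
    {"title": "", "views": "87"},
    {"title": "Budget Update", "views": "n/a"},
]


def clean_rows(rows):
    """Normalize titles, drop empty and duplicate rows, and parse view counts."""
    cleaned, seen = [], set()
    for row in rows:
        # Collapse repeated whitespace and lowercase the title.
        title = " ".join(row["title"].split()).lower()
        if not title or title in seen:
            continue
        views_text = row["views"].replace(",", "")
        # Keep unparseable counts as None instead of guessing a value.
        views = int(views_text) if views_text.isdigit() else None
        seen.add(title)
        cleaned.append({"title": title, "views": views})
    return cleaned


print(clean_rows(raw_rows))
```

Four raw rows shrink to two usable ones here, and one of those still has a missing value that someone has to decide how to handle.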
5. In the industry, not every problem that comes your way needs machine learning modeling to solve.
Sometimes a series of data aggregation steps or simple look-ups may suffice, and it can be beneficial for a Data Scientist to consider such options. Remember Occam’s razor, the problem-solving principle that suggests searching for solutions constructed with the smallest possible set of elements.
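As an illustration, suppose the task is to find each user's favorite genre. Counting views may be all that is needed, with no model involved. The `view_log` data and the `favorite_genres` helper below are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical view log of (user_id, genre) pairs.
view_log = [
    ("u1", "news"), ("u1", "sports"), ("u1", "news"),
    ("u2", "gaming"), ("u2", "gaming"), ("u2", "education"),
]


def favorite_genres(log):
    """Return each user's most-viewed genre using plain aggregation."""
    counts = defaultdict(Counter)
    for user_id, genre in log:
        counts[user_id][genre] += 1
    # most_common(1) returns a list holding the top (genre, count) pair.
    return {user: c.most_common(1)[0][0] for user, c in counts.items()}


print(favorite_genres(view_log))
```

If a result like this meets the business need, it is cheaper to build, easier to explain, and simpler to maintain than a trained model.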
6. In the industry, deep learning is not always the preferred approach.
While research is making headway in deep learning, classic/traditional machine learning models can solve a large chunk of tasks fairly well and remain the preferred starting point for many industry applications. You don’t want to go for complex systems when a simpler one solves the need.
7. There’s a lot more to a data science role than building ML models.
A common misconception I have learned about after hiring and speaking with juniors in the field is the expectation that your main job is going to be building ML models. You may not build models for every project. Some projects may not require ML at all. Even if you are working on building ML models, it might end up being only 20–30% of your role. Be sure to understand your role before you accept the job. This will help match your expectations to reality.
8. You could be working entirely on data analytics under a Data Scientist job title.
Data Science is an umbrella term covering roles such as Data Analytics, research, ML model building, ML Ops, and ML engineering. The definition of the Data Scientist role can differ between organizations and usually depends on the expectations of the company’s leadership.
9. Working in the industry, you are going to use existing solutions and tools when available, rather than creating custom models each time.
I have coded k-NN and K-means from scratch for some of my MS courses. You will never need to do that in the real world. You may not even build classification models if one already exists at a reasonable cost or is open-sourced and works well for your data. For instance, many tools can help you with sentiment classification and work well for a variety of text data types, such as TextBlob, VADER, and Transformers. You will explore the available options to assess whether you need to build something custom.
10. Domain knowledge is an important factor for building the right solutions in the industry.
Business goals play a large role in the solutions you build. When you are building a model, what features do you select from the available data? What features should you try to get that you don’t already have? As we saw in the first point, the context of the business and its goals plays an important role in what data you need, how you filter your data, which visualizations and analytics are worth sharing, what models you build, and how you evaluate your models.
11. Model explainability is an important skill for a Data Scientist’s job.
A Data Scientist role may involve conversing with non-technical business stakeholders who do not have a background in statistics and engineering. Being able to explain a model’s output, what it is doing and why, and why it is a good solution for the business is one of the most underrated skills, yet it is critical for success in any company. Often, business and product managers look to you for guidance, as they don’t understand the data and data science as well as you do.
Working in organizations of different sizes yields different experiences as well. Beyond the common differences featured in this article, Data Scientists encounter many more in the industry.
Coursework and projects are a great way to start your data science journey. The skills learned during practice projects and coursework are critical in preparing yourself for the Data Scientist job. However, on the job, many more skills may help you succeed. Keeping these expectations in mind will prepare you for what you need to learn on the job.
Don’t feel like you are expected to know it all when you start a job. Ask questions, seek mentorship, make an effort to learn what you don’t know but need to know to do your job, and take feedback constructively. It can take some time, but don’t forget to acknowledge the little wins and progress you make along the way. All the best and happy reading!
Article originally posted here. Reposted with permission.
About the Author
Jyotika Singh is a researcher, mentor, author, Python programmer, and Data Science practitioner. She currently works as the Director of Data Science at Placemakr, where she leads data intelligence and algorithmic development functions for optimizing operations and revenue. Previously, Jyotika headed the Data Science team at ICX Media (acquired by Salient Global) and developed novel patented solutions in Machine Learning and Artificial Intelligence that led to the business foundation. Jyotika has been working on Natural Language Processing and Social Media data for 8 years. She is a public speaker and has spoken at more than 15 conferences on Python and Data Science. Jyotika has been recognized with several awards in the technology and data space.