Junior data scientists are flooding the field as more and more people transition from other areas, some only loosely related to data-driven professions. As a result, there is often a gap between the skill sets these “newbies” bring to the table and what the job actually demands. After all, there is only so much that can be gleaned from the many online programs that promise to prepare new data scientists for their roles. The reality is that junior data scientists don’t always have the experience required to address all the nuances of real-life projects.
In this article, I’ve put together a short list of common mistakes made by data scientists who are new to the field. I don’t offer this list as any form of derision, but rather as a learning tool: a set of pitfalls to watch out for and make an extra effort to avoid. Let’s dig in and look at these cautionary items, in no particular order.
- Not clearly identifying the problem to solve. The first step in the data science process is to have a clear goal in mind and to understand what steps you’ll need to take to achieve it. Beware of looking for a solution before fully understanding the problem, and avoid communication gaps with other stakeholders.
- Confusing theory with reality. It’s often possible to find a statistically significant correlation between two variables, but that knowledge doesn’t always point to an action you can take. Just because something has statistical significance doesn’t necessarily mean it has practical significance.
- Failing to understand the difference between “what data says” and “what data means.”
- Ignoring sampling bias. Data that doesn’t represent the population you care about can very easily lead to incorrect conclusions and actions that will negatively affect the outcome of a data science project.
- Spending a lot of time and effort on sophisticated approaches that are only slightly better, or even worse, than simple ones, or never checking whether a simple approach might work for the problem at hand.
- Failing to cleanse the data, or doing exploratory data analysis (EDA) before clean data is available. Working with dirty or inconsistent data will affect model selection and everything that follows.
- Committing what various statisticians have called a “Type III error” in statistical testing: solving the wrong problem.
- Assuming that extrapolation is as reliable as interpolation. Even the best-fitted model is only supported by the range of data it was fitted on, so predictions outside that range deserve more caution.
- Performing the wrong analysis for the type of variable. Continuous variables support the widest range of analyses, but many inexperienced data scientists try to apply the same complex analyses to data that is only categorical.
- Over-relying on p-values, believing that they are a panacea for measuring every kind of relationship strength. A p-value is not an effect size.
- Forgetting that causation is not the same as correlation.
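- Failing to understand how scaling affects the distribution of the data. Linear rescaling, such as standardization or min-max scaling, changes the location and spread but not the shape, whereas nonlinear transforms, such as a log transform, change the shape itself.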
- Not checking the impact of confounding and interaction variables.
- Inappropriately or inadequately applying the different metrics used to evaluate machine learning algorithms, e.g. RMSE, MSE, R-squared, AUROC, the confusion matrix, log loss, etc.
- Avoiding the mathematical foundations of machine learning. Knowing how algorithms work under the hood makes you a much better data scientist.
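
To make the points about statistical versus practical significance and p-values a bit more concrete, here is a minimal Python sketch using only the standard library. Everything in it is an assumption made for illustration: the helper name `two_sample_z_test`, the simulated "control" and "treatment" groups, and the chosen means and sample size are hypothetical and not taken from any real project. By construction the true difference is small (0.5 on a scale with a standard deviation of 15, a standardized effect of roughly 0.03), yet with a large enough sample the test can still flag it as significant.

```python
import math
import random
import statistics


def two_sample_z_test(a, b):
    """Return (mean difference, two-sided p-value, Cohen's d) for two samples.

    Uses a large-sample normal approximation, which is reasonable here
    because both simulated groups are very large.
    """
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)

    diff = mean_b - mean_a
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    z = diff / standard_error

    # Two-sided tail probability of a standard normal at |z|.
    p_value = math.erfc(abs(z) / math.sqrt(2))

    # Cohen's d: the difference expressed in pooled standard deviations
    # (simple average of variances, valid because group sizes are equal).
    pooled_sd = math.sqrt((var_a + var_b) / 2)
    cohens_d = diff / pooled_sd
    return diff, p_value, cohens_d


# Hypothetical simulated data: a tiny true shift of 0.5 on a scale with SD 15.
rng = random.Random(0)
n = 100_000
control = [rng.gauss(100.0, 15.0) for _ in range(n)]
treatment = [rng.gauss(100.5, 15.0) for _ in range(n)]

diff, p_value, cohens_d = two_sample_z_test(control, treatment)
print(f"mean difference: {diff:.3f}")
print(f"p-value:         {p_value:.2e}")
print(f"Cohen's d:       {cohens_d:.3f}")
```

The p-value only says the difference is unlikely to be zero. Whether a shift that small is worth acting on is a question the effect size and the business context have to answer, not the p-value.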
Of course, there are many more cautionary tales along the path toward becoming a highly qualified data scientist. As you gain more experience, you should keep a journal of the tips, tricks, and traps you’ve learned. Learning not to repeat the same mistakes is probably the most valuable thing experience can give you.