Ten years ago, in November 2011, University of Virginia psychology professor Brian Nosek started a crowdsourced collaboration of scientists now known as the “Reproducibility Project.” This project took as its first aim the attempt to reproduce the results of 100 papers published in 2008 in top psychology journals. The results, published four years later in Science, showed that 64% of the papers – almost two thirds! – contained non-reproducible claims: attempts to replicate their experiments did not yield statistically significant results. Follow-up work has since shown that the problem is just as prevalent in physics, chemistry, biology, medicine, and other sciences, including data science.
Statistical modeling will always be an error-prone endeavor. Mistakes are easy to make and hard to detect. For a data scientist working in a commercial setting, such mistakes hit the company’s bottom line: expenses rise beyond predictions, revenue drops, and other metrics of interest shift unfavorably. But if two-thirds of results published in top journals are non-reproducible, with these journals adhering to the highest standards of peer review and this research being conducted in optimal academic settings, away from the pressures and constraints of commercial life – if these results are non-reproducible, what does that bode for industrial data science?
Here’s a clue. In 2005, Stanford professor of medicine John Ioannidis published “Why Most Published Research Findings Are False,” a theoretical analysis that predicted Nosek’s results a full six years before the Reproducibility Project was ever founded. According to Ioannidis, there are several factors that lead research to be especially prone to erroneous conclusions.
One of them – the use of small study groups, which may explain some of the findings in the psychology publications – is thankfully absent in the world of Big Data. But that’s no reason for data scientists to rejoice: all of Ioannidis’s other factors are at the heart of data science work. These include
- small effect sizes,
- large numbers of tested relationships,
- flexibility in designs, definitions, outcomes, and analytical modes,
- financial interests,
- prejudices among stakeholders,
- being a hot field,
- solo and siloed investigators,
- no need to pre-register tested hypotheses, and the ability to cherry-pick the best hypothesis after results are known,
- no result replication,
- no data sharing, and
- the only statistical requirement for success being the classical 95% confidence level.
The final criterion on this list is particularly damning. Elsewhere in science, a 95% confidence level is considered the most basic requirement for showing that one’s results are valid. In data science, where most machine learning algorithms don’t naturally lend themselves to the calculation of confidence levels, practitioners have long since stopped treating even this lowest bar as a requirement.
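To see how “large numbers of tested relationships”, cherry-picking, and a bare 95% bar combine, consider the minimal Python sketch below. It is an illustration only, not part of Ioannidis’s analysis. The function names, the 200 hypotheses, the group size of 500, the known-variance z-test, and the Bonferroni correction are all assumptions chosen for the example. Every “relationship” tested is pure noise, so any result that passes the threshold is a false discovery.

```python
import random
from statistics import NormalDist, mean


def two_sided_p_value(sample_a, sample_b, sigma=1.0):
    """Two-sample z-test p-value, assuming a known common std dev `sigma`."""
    n_a, n_b = len(sample_a), len(sample_b)
    standard_error = sigma * (1 / n_a + 1 / n_b) ** 0.5
    z = (mean(sample_a) - mean(sample_b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def count_false_discoveries(num_hypotheses=200, group_size=500, alpha=0.05, seed=0):
    """Test many hypotheses on pure noise and count how many look 'significant'.

    Returns (naive_hits, bonferroni_hits, smallest_p_value).
    """
    rng = random.Random(seed)
    p_values = []
    for _ in range(num_hypotheses):
        # Both groups come from the same distribution: there is no real effect.
        group_a = [rng.gauss(0, 1) for _ in range(group_size)]
        group_b = [rng.gauss(0, 1) for _ in range(group_size)]
        p_values.append(two_sided_p_value(group_a, group_b))

    # Naive approach: every test is held to the classical 5% threshold.
    naive_hits = sum(p < alpha for p in p_values)
    # Bonferroni correction: the threshold shrinks with the number of tests.
    bonferroni_hits = sum(p < alpha / num_hypotheses for p in p_values)
    # The cherry-picked "best" hypothesis is simply the smallest p-value.
    return naive_hits, bonferroni_hits, min(p_values)


if __name__ == "__main__":
    naive, corrected, best_p = count_false_discoveries()
    print(f"'Significant' at 5% with no real effect: {naive}")
    print(f"Still significant after Bonferroni correction: {corrected}")
    print(f"Cherry-picked best p-value: {best_p:.4f}")
```

With 200 tested relationships and a 5% threshold, about ten “significant” findings are expected from noise alone. An analyst who reports only the best of them has a publishable-looking result with no substance behind it. Bonferroni is just one simple, conservative correction; the point is that the number of relationships tested has to enter the analysis somewhere.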
I have been a data scientist for almost 30 years, many of them in chief-data-scientist-level positions, where I was able to observe the work of many data scientists across a large number of organizations. The reality of our profession is even grimmer than what Ioannidis’s results would suggest. A large fraction of the companies I have seen employing data scientists do not think of data science as a science at all. It is considered an engineering discipline, and data scientists in those companies are considered combinations of software developers, dashboard creators, and data pipeline engineers.
Elsewhere, where data scientists are actually allowed to perform the occasional data research, the situation is generally no better. Working in isolation and under constant pressure to declare success and move on to the next project, data scientists – many of whom find themselves in this predicament fresh out of academic studies – do not follow the scientific process. Many don’t even know or understand it.
Ten years ago, I started running weekly “review” sessions for data scientists who wanted them, and three years ago – seeing firsthand the severity of the situation and the price we all pay when data science abandons the scientific method – I founded Otzma Analytics, and made these reviews my main occupation.
These review sessions, or “analytics audits” as I sometimes call them, are the closest to academia’s peer review process a working data scientist can hope for. And after a decade of doing them regularly, I can attest – without this conclusion being based on a small sample – that almost without exception these reviews bring to light serious issues that require major revisions to the analysis.
It’s not rocket science. The review simply asks the basic questions that all of us should ask during the course of all our research activities. If you use data, your first question should be “What is wrong with this data?”; if you reach a conclusion, you should ask “What is wrong with my results?”
As an example of how dire the situation is, and of how rarely we, as a community, ask these questions: during the present Covid crisis, many hundreds of AI tools meant to help triage patients and speed up diagnoses were developed and deployed into hospitals, and yet study after study shows that none of them work and that using them can endanger patients. In every case, the underlying reason is that the basic questions were never asked, neither about the fitness of the data nor about the validity of the results.
The problem is apparent even among the silicon giants that everyone seems to be looking up to when it comes to data use. If more researchers at Google knew to ask “Which populations are over-represented in my data? Which are under-represented?”, perhaps they wouldn’t have had to face the backlash that came in 2015, when Google’s photo labeling service was found to label African Americans as “gorillas”. Nor would their “fix”, three years down the road, at the end of 2018, have been to simply remove the label “gorilla” from the service’s lexicon. In fact, according to a recent NIST study, unless your face recognition system was developed in Asia, error rates for recognizing Asian, Black, and Native American populations are going to be 10 to 100 times higher than the corresponding figures for Caucasians.
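One simple way to start asking the representation question is to break both the data and the errors down by group. The sketch below is a hypothetical illustration. The function name `audit_by_group`, the record layout, and the toy numbers are assumptions made for the example, not data from Google or NIST.

```python
from collections import Counter


def audit_by_group(records):
    """Summarize representation and error rate per group.

    `records` is an iterable of (group, true_label, predicted_label) tuples.
    Returns {group: (share_of_data, error_rate)}.
    """
    counts = Counter()
    errors = Counter()
    for group, truth, predicted in records:
        counts[group] += 1
        errors[group] += int(truth != predicted)

    total = sum(counts.values())
    return {
        group: (counts[group] / total, errors[group] / counts[group])
        for group in counts
    }


if __name__ == "__main__":
    # Toy, made-up evaluation set: group "A" dominates the data.
    toy_records = (
        [("A", 1, 1)] * 81 + [("A", 1, 0)] * 9    # 90 records, 9 errors
        + [("B", 1, 1)] * 5 + [("B", 1, 0)] * 5   # 10 records, 5 errors
    )
    for group, (share, error_rate) in sorted(audit_by_group(toy_records).items()):
        print(f"group {group}: {share:.0%} of data, {error_rate:.0%} error rate")
```

In this toy set, the overall error rate looks modest only because the well-represented group dominates the average. A single aggregate accuracy figure would hide exactly the kind of disparity the questions above are meant to expose.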
The root of the issue isn’t in how clever or talented or knowledgeable the data scientists involved are. The root problem is that we have all been boxed into a “success mindset”. We are measured by how quickly we retrieve insights and how well those insights conform to what project sponsors want to hear.
That is not a scientific mindset. As I demonstrate in my video series “How to Lose Money on Analytics”, it’s a mindset that leads us astray by making us follow buzzwords and rush blindly to new technologies. “Science,” by contrast, to quote Raising Heretics by Linda McIver, “is about doing everything to prove your theory wrong”.
It is high time we acknowledged and addressed this crisis in Data Science, and, to me, the first step is the reinstatement of the most basic of all scientific tools, the one common to all academia: the peer review.
In my upcoming talk at ODSC APAC 2021, I will tell the story of some real-world analytics reviews I’ve conducted over the years and invite the audience to discover on their own where problems crept into the analysis. I will discuss successful techniques for conducting such reviews, and will show the power of these “analytics audits” to pick up on problematic analysis and fix wrong conclusions before they become costly mistakes. As I will demonstrate, none of this is magic. These are tools that anyone can pick up – and that everyone should use – both to monitor their own work and to help peers by reviewing theirs. I look forward to seeing you at the talk!
About the author/ODSC APAC 2021 Speaker on Bad Data Science:
Dr. Michael Brand is the Head and Founder of Data Science consultancy Otzma Analytics. He served as Chief Data Scientist at Telstra, as Senior Principal Data Scientist at Pivotal, as Chief Scientist at Verint Systems, and as CTO Group Algorithm Leader at PrimeSense (where he worked on developing the Xbox Kinect). He is also an Adjunct Associate Professor for Data Science and AI at Monash University and served as Director of the Monash Centre for Data Science.