Game of Thrones, touchscreen devices, and superhero movies: those things have little in common apart from the fact that they went through a meteoric rise in the 2010s. Still, if you believe what AI experts have to say, they pale in comparison to the largest tech phenomenon of the decade: Big Data.
The ‘old’ generation of data scientists still remember the days they were telling their manager (usually, someone with no expertise in machine learning) that gathering data would take months. And, honestly, it really did. For example, in the early days of the digital era, continuously sharing data on social media was still uncommon behavior, and it would take months for companies to collect what can be collected in just days (if not hours) today. It’s 2020 now, and boy, things have changed! Data is everywhere. It’s falling into our laps. And there’s so much of it that many of us don’t even know what to do with it any more. It took less than a decade to go from a world where just gathering data was a chore to one where nearly every data type is hoarded and available for analysis. What was once an onerous, time-consuming process is now right there, at our fingertips.
Machine learning scientists have been waiting for a moment exactly like this. With this massive amount of at-the-ready data, we can finally test some of the most advanced machine learning theories instead of, well, just theorizing. Big data paved the way for tangible progress in ML in the real world, not just with the clean, academic data sets that simply don’t exist in most modern organizations.
Datageddon is Coming (No Joke!)
Now, if you are a fan of the Silicon Valley TV series, you might remember Gavin Belson yelling at his Marketing team that Datageddon was coming, announcing dark times for Tech companies. The thing is, he might be right. And I know: not what you would expect to hear from a data scientist (that is, not the majority of them), but there’s truth here. Let me explain.
There is no question that we are at a tipping point in the history of AI. In fact, there have never been this many everyday applications making use of machine learning. Let’s face it: no matter how annoyed or even creeped out we might feel when we stop and think about what really happens with our data behind the scenes, none of our beloved search engines, our social media feeds, or our favorite picture-altering apps could exist without the rise of Big Data.
But Big Data isn’t all good. As we create more and more data, we’re starting to get to a point where we need to ask ourselves: is it too much? How much of our data is redundant? How much is just harmful? And how can we tell?
Wait a minute: what about Moore’s law?
According to Moore’s law, the number of transistors on a chip doubles roughly every two years, so we should expect the speed and capability of our computers to increase exponentially; hence, hardware should continuously catch up with the ever-increasing demands of data scientists.
However, it’s unclear how quickly such an increase will occur. With the amount of data almost doubling every year, is it really that hard to imagine a world where Big Data would overwhelm hardware capabilities? Especially with the rise of video and other compute-intensive data types?
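To put rough numbers on that question, here is a back-of-the-envelope sketch. It assumes that data volume doubles every year (the figure quoted above) and that hardware capacity doubles every two years, which is the common reading of Moore’s law and an assumption here, not a measurement. The function names are made up for illustration.

```python
def growth_factor(years: float, doubling_period_years: float) -> float:
    """Multiplicative growth after `years`, given a fixed doubling period."""
    return 2.0 ** (years / doubling_period_years)


def data_to_hardware_gap(years: float,
                         data_doubling_years: float = 1.0,      # assumption
                         hardware_doubling_years: float = 2.0   # assumption
                         ) -> float:
    """How many times more data has grown than hardware after `years`."""
    return (growth_factor(years, data_doubling_years)
            / growth_factor(years, hardware_doubling_years))


if __name__ == "__main__":
    for years in (2, 6, 10):
        print(f"after {years} years: "
              f"data x{growth_factor(years, 1.0):.0f}, "
              f"hardware x{growth_factor(years, 2.0):.0f}, "
              f"gap x{data_to_hardware_gap(years):.0f}")
```

Under those two assumptions, a decade leaves data about 32 times further ahead of hardware than it was at the start (a factor of 1,024 against a factor of 32). The exact figures matter less than the shape: two exponentials with different doubling periods drift apart exponentially.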
Now, think for a second about what could happen if the total amount of data worldwide eventually came to exceed the available storage and compute capacity. Even if we were to ignore the environmental consequences, many of us can tell the story of a company whose data scientists kept fighting over GPU time even after it had invested thousands in additional GPUs.
And that’s, of course, only part of the story: we haven’t even touched on the massive amount of time and money spent on data preparation itself yet. In some sense, Datageddon is already upon us. In many ways, Big Data has made the lives of data scientists more complicated, and many of us find ourselves sentimental about the early days when building a model could be done on a laptop and wouldn’t require teaching oneself sophisticated DevOps tools (as if machine learning itself wasn’t already hard enough). It’s exciting having all that data, but it’s increasingly hard to handle (let alone understand) it.
The Big Data Prep Crisis
When you think about it, getting excited about the explosion of the amount of data worldwide is a little bit like getting excited over free internet in a country where access to power is sparse: what is the point of having more of something that you just can’t use? Worse: what if that additional something were to cost you more money and more time?
While technologists are (rightfully) excited by the loads of new data made available to them, they tend to forget that data can’t be used raw for supervised machine learning but instead needs to be cleaned, transformed, and oftentimes labeled. And while data centers and labeling companies are rubbing their hands with glee at the prospect of more money to be made, the rest can start scratching their heads thinking about how to solve a brand new challenge: processing data more quickly, accurately, and ethically than ever before so that it can be used to train quality machine learning models.
This is all to say that, yes, there might be a little too much data to understand and utilize. That’s something most data scientists would’ve laughed at in 2010, but we really might be there now. Processing, storage, and labeling are big bottom-line costs and, when you couple that with the fact that data processing comes at a heavy cost on the environment, the idea of hoarding and analyzing everything seems less like a given.
Luckily, there is an important aspect to Big Data that many have missed: while more data is better, not all data is created equal. So while increasing the size of a training set truly does help improve the accuracy of the model trained with it, there is actually a significant fraction of that data that is either useless (like redundancies), or even harmful (like a mislabeled data point) to the model. And that’s a real problem.
Data vs. Information
Collecting more training data has traditionally been the number one response of data scientists whenever their models were underperforming. To use machine learning jargon, more data is necessary to obtain better generalization power; simply put, a model needs to be exposed to more examples to confidently predict a new, unseen example. This is especially true of odd, edge-case examples.
Yet the ‘value’ of that additional data depends on three main factors: is it actually something ‘new’ at all, how much additional information is in there, and is that information accurate and/or trustworthy? The issue is that data scientists currently don’t have an easy way to determine those factors, so they just rely on ‘stacking up’ on data, so that at least some of it qualifies as valuable. That seems fair enough until the day the sheer volume is just too much, or until some data is corrupted or plain wrong. Fundamentally, we have been making a terrible mistake for years: we have used the words ‘data’ and ‘information’ interchangeably when the two point to entirely different concepts.
Data vs. information: each one of the images above represents one example of a training set, yet there is more information for an Autonomous-Driving Machine Learning algorithm to learn from in the image on the left than in the image on the right.
Data Quality vs. Informational Value
Which brings us to an important point: with ever more data to deal with, it is not sufficient anymore for data scientists to purge their datasets of low-quality data; they also have to pay attention to data redundancy and duplicates. To some extent, they have known that all along whenever they have used stratified sampling to solve their unbalanced class problems (meaning: when their data contained too many representations of the same class). And while data redundancy seems incidental at first, its impact goes way beyond the loss of some precious time and compute power: it might actually result in biased models and incorrect results.
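As a minimal sketch of those two ideas, the snippet below drops exact duplicates and then draws a per-class (stratified) sample with the same cap for every class, which is one simple variant among several. The record layout, a `(features, label)` pair, and the function names are assumptions made for illustration. Near-duplicates, which are usually the harder problem, would need a similarity measure and are not handled here.

```python
import random
from collections import defaultdict


def drop_exact_duplicates(records):
    """Keep the first occurrence of each (features, label) record."""
    seen = set()
    unique = []
    for features, label in records:
        key = (tuple(features), label)  # lists are not hashable, tuples are
        if key not in seen:
            seen.add(key)
            unique.append((features, label))
    return unique


def stratified_sample(records, per_class, seed=0):
    """Draw up to `per_class` records at random from each label."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_label = defaultdict(list)
    for record in records:
        by_label[record[1]].append(record)
    sample = []
    for label in sorted(by_label):
        group = by_label[label]
        sample.extend(rng.sample(group, min(per_class, len(group))))
    return sample


# Toy data: the "car" class is over-represented and contains a duplicate.
records = [([1, 0], "car"), ([1, 0], "car"), ([0, 1], "car"), ([2, 2], "person")]
unique = drop_exact_duplicates(records)
balanced = stratified_sample(unique, per_class=1)
```

Note that both steps operate on the data alone: they remove repetition and rebalance classes, but they say nothing about how informative each remaining record is for a given model.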
Luckily, the industry has long discussed the impact of data quality. That said, a major shortcoming of the concept of data quality is that it pertains, again, to data and not to information.
A low-resolution picture might still be a valuable training example if it contains the right information. Likewise, a picture of Mount Everest might be a mine of information for building a landmark-detection algorithm even though it would be totally useless in the context of facial recognition. Bottom line: processing data for value isn’t a one-off task that can easily be outsourced to a third-party company. And that’s why it is time to shift the conversation from data quality to data value, the challenge being that value is relative to the application that the data will be used for.
To many, the solution to our Big Data predicament seems straightforward, so much so that it has become second nature to data scientists: random sampling.
Whether it is for prototyping a model or engineering a new feature, we all have resorted to random sampling. But after years of longing for more data, random sampling almost feels like going backwards. And then there is random sampling’s dangerous cousin: manually curating the data. Not a bad idea, right? After all, it’s only natural for machine learning professionals to think they have the insight to understand exactly what their models need.
The thing is, it is a bad idea. Letting a human decide what data to feed a machine learning model comes with two major issues. First, this is the best way to inject one’s underlying cognitive biases, such as confirmation bias, into a dataset, and then eventually into the model built with it. The second one, maybe even sneakier, comes from this:
As a human being tasked with hand-picking data to build an object detection algorithm, we would most likely pass on the previous image, feeling that this picture does not represent a good example of what a person is supposed to be: that’s because we’re inclined to cherry-pick. Yet, for a machine learning model, that picture might actually make up a very information-rich, relevant record. Imagine now doing the same thing for each record showing an occluded and truncated person: a biased dataset is born!
It all comes from the fact that humans do not fully understand how models learn and tend to artificially ‘guide’ the model, making assumptions modeled on the way that we function. It’s because they’re conscious of this problem that ML experts always work hard to enforce a certain data variety; in fact, they go as far as to use what they call “data augmentation”, which in layman’s terms is actually a process of data distortion. Regardless, any human-generated rules to curate a dataset would be subject to the same biases associated with manually curating the data.
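To make “data distortion” concrete, here is a small sketch of two common augmentations, a horizontal flip and additive noise, applied to a tiny grayscale image stored as a list of rows with pixel values between 0 and 1. The representation, the noise scale, and the function names are assumptions chosen to keep the example free of dependencies; real pipelines typically rely on an image library.

```python
import random


def horizontal_flip(image):
    """Mirror a 2D image (a list of rows) left to right."""
    return [list(reversed(row)) for row in image]


def add_noise(image, scale=0.1, seed=0):
    """Add uniform noise in [-scale, scale] to each pixel, clipped to [0, 1]."""
    rng = random.Random(seed)
    return [[min(1.0, max(0.0, px + rng.uniform(-scale, scale))) for px in row]
            for row in image]


def augment(image, seed=0):
    """Return the original image plus two distorted copies of it."""
    return [image, horizontal_flip(image), add_noise(image, seed=seed)]


image = [[0.0, 0.5],
         [1.0, 0.25]]
variants = augment(image)
```

Even here, a human chose which distortions to apply, which is exactly the kind of hand-written rule the paragraph above warns about.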
Saved by ML?
So what do we know? We know that modern machine learning applications don’t suffer from a lack of data. In fact, they often suffer from the opposite: a glut of data, much of which is redundant, not particularly useful, or actively harmful (think mislabeled training data that can confuse a model). We know that labeling and relabeling that massive universe of data comes with its own pitfalls, accidental bias and cost being just two. Further, we know that asking machine learning scientists to hand-curate the data a model will learn from has similar issues. People don’t inherently know which data will be most helpful and, frankly, your ML team’s time can be spent much more efficiently than having them comb through data row by row.
In other words, the answer to this crisis isn’t more and more data. That’s actually the cause of the crisis. The answer is smarter data curation.
Now, how exactly can we do that? Because our community hasn’t really grappled with the excesses of Big Data, the market hasn’t yet created a simple, catch-all solution. But if we know the solution boils down to creating new data curation technologies, what approaches are promising? Active learning, for starters, is a solid foundation, but while active learning is fundamentally a process meant to explore datasets and uncover the most useful information, it is also, generally, incremental and does require a sizeable amount of data to learn from in the first place. An ensemble approach, combining the best of active learning, few-shot learning, meta-learning, transfer learning, reinforcement learning, and a few other emerging techniques, is a promising direction to start with. No matter the tactics and tools we use, what we need to do is fundamentally understand that we can’t continue to work with ever-larger data sets that are heavier on data than information.
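For readers unfamiliar with active learning, the sketch below shows uncertainty sampling, one common selection strategy, and not necessarily the approach advocated here. It assumes a model that has already been trained on some initial data and outputs class probabilities for each unlabeled record; the function names and record ids are made up for illustration.

```python
import math


def entropy(probabilities):
    """Shannon entropy (in bits) of a predicted class distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def select_most_uncertain(predictions, budget):
    """Return the ids of the `budget` records the model is least sure about.

    `predictions` maps a record id to the model's predicted class
    probabilities for that record.
    """
    ranked = sorted(predictions,
                    key=lambda record_id: entropy(predictions[record_id]),
                    reverse=True)
    return ranked[:budget]


# Hypothetical model outputs for three unlabeled records.
predictions = {
    "a": [0.99, 0.01],  # model is confident: little to learn from a label
    "b": [0.50, 0.50],  # model is undecided: a label is most informative
    "c": [0.80, 0.20],
}
to_label = select_most_uncertain(predictions, budget=2)  # ["b", "c"]
```

The selected records are labeled, added to the training set, and the model is retrained before the next round. That loop is what makes the process incremental, and the need for a first trained model is why it requires some data up front.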
Because like it or not, big data is going to become a big problem. You’re going to need a strategy soon. Because big data isn’t going away. It’s getting bigger. Thinking less about quantity and more about quality is the smartest thing you can do going forward.
About the author: Jennifer Prendki is the founder and CEO of Alectio, the first startup focused on Data Optimization, and is on a mission to help ML teams build models with less data. Prior to Alectio, Jennifer was the VP of Machine Learning at Figure Eight; she also led ML and Data Science at Atlassian and on the Search team at Walmart Labs. She is an accomplished speaker who enjoys addressing both technical and non-technical audiences.