There are three types of lies: lies, damned lies, and ‘big data.’
That’s the message Amazon machine learning director Neil Lawrence began his ODSC Europe 2016 lecture with before laying out the three largest challenges for open data science and our data-centered society.
As Lawrence sees it, those challenges are the paradoxes that exist within a data society, quantifying the value of data grunt work, and maintaining individuals’ privacy. In many cases, the solution is raising awareness so that data analysts and developers can better consider their impact.
Challenge: Paradoxes of the Data Society
The breadth vs. depth paradox makes understanding data difficult. Modern measurement usually captures either a large breadth of information (many variables) or a large depth (many individuals the data is collected on), but rarely both. Once researchers try to measure both at once, traditional modeling approaches fail.
The breadth vs. depth paradox is like having to choose between seeing an entire forest and examining the details of a single tree. Lawrence said the ideal lies in between: seeing the details of a few trees, or, in data terms, modeling and analyzing a particular group of people, a subset of the population.
Researchers are increasingly able to quantify masses of data about people but seem less able to characterize society as a whole. For example, polls collect a lot of information but can be badly inaccurate, as evidenced by major discrepancies around Brexit and the 2016 election of President Donald J. Trump.
“We measure more and we seem to understand less,” Lawrence said. “In some ways, it feels like statistics is going backwards.”
Training more classical statisticians and emphasizing traditional statistics could help data scientists and the general public understand the data available to them, qualify the conclusions others draw, and see where things can go wrong.
Challenge: Quantifying the Value of Data
Another large industry hurdle is how to quantify the work that goes into making data usable. Some of the most valuable workers in the data economy are those who go into companies, find useful datasets, clean them and make them available.
But the people who see the most recognition and financial gain are those who take cleaned datasets and develop profitable applications — it’s difficult to funnel that money back to the person who made the data available in the first place. This credit allocation issue permeates all aspects of the economy. In data science, the challenge is to incentivize and quantify the value of data workers, curators and managers.
Visualizing data is a big step toward helping people realize that data itself is important, even if they don’t understand it. In addition, the entire industry should create “data readiness levels” to establish how ready a dataset is to be used for analysis.
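To make the idea concrete, here is a minimal sketch of how readiness levels might be tracked in code. The level names and the `Dataset` helper are hypothetical illustrations, not Lawrence’s actual proposal:

```python
from dataclasses import dataclass
from enum import IntEnum


class ReadinessLevel(IntEnum):
    """Hypothetical bands, loosely inspired by the 'data readiness levels' idea."""
    INACCESSIBLE = 0  # data is known to exist but cannot yet be obtained
    ACCESSIBLE = 1    # data can be loaded, but its quality is unverified
    VALIDATED = 2     # formats checked; missing values and errors characterized
    TASK_READY = 3    # appropriateness confirmed for a specific analysis


@dataclass
class Dataset:
    name: str
    level: ReadinessLevel

    def ready_for_analysis(self) -> bool:
        # Only datasets at the top band should feed an analysis.
        return self.level >= ReadinessLevel.TASK_READY


survey = Dataset("household_survey", ReadinessLevel.ACCESSIBLE)
print(survey.ready_for_analysis())  # the survey is accessible but not yet validated
```

Labels like these would let the credit for moving a dataset from one band to the next be recorded, which is exactly the grunt work Lawrence argues goes unrewarded.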
Challenge: Privacy, Loss of Control, and Marginalization
In his lecture, Lawrence referred to a study in which researchers said that, based on models involving Facebook likes, they could predict an individual’s behavior better than that person’s friends could. Soon, he guesses, these models will be able to predict people’s behavior better than the people themselves can. In doing so, models may place a great deal of power in marketers’ hands.
Individuals are becoming very easy to monitor. And while some monitoring algorithms might have good purposes, like tracking hate speech, the same algorithms can be flipped to serve more worrying purposes like tracking political dissent.
What’s more, discrimination and marginalization can be built into algorithms whether or not developers are aware of it: the discrimination can be implicit, and certain groups may be underrepresented in the data that analyses are based on. Technological needs are different for a businessman living in Silicon Valley than for a woman living in a village in Uganda, but developers don’t always seek out solutions for all of those needs.
A woman who lives in a village in Uganda has different technological needs from many of the Silicon Valley developers who build that technology, but she still probably has access to a mobile phone or smartphone. Lawrence said that to be successful, data analysts and software developers must be aware of this and work to meet all of these needs.
Analysts and developers need to work to ensure individuals retain control over their own data. Privacy-aware machine learning that generates noise corruption within the data is a potential solution. Depending on confidentiality, people could apply varying amounts of noise to their personal data to maintain privacy.
While this weakens the data, because the truth can’t be recovered for any individual, strategies like these could build a data society that gives people control over their own data rather than storing it centrally.
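A minimal sketch of the noise-corruption idea, in the spirit of local differential privacy (the `privatize` function and the chosen noise scale are illustrative assumptions, not Lawrence’s method): each person shares only a perturbed value, yet aggregates over many people remain informative.

```python
import random
import statistics


def privatize(value: float, scale: float) -> float:
    """Return value plus Laplace noise; a larger scale means stronger privacy."""
    # A Laplace sample is an exponential sample with a random sign.
    noise = random.choice([-1.0, 1.0]) * random.expovariate(1.0 / scale)
    return value + noise


random.seed(0)

# True incomes of 10,000 hypothetical individuals.
true_values = [random.gauss(50_000, 10_000) for _ in range(10_000)]

# Each person shares only a noise-corrupted value; the scale could vary
# with how confidential each person wants their data to be.
shared = [privatize(v, scale=5_000) for v in true_values]

# No individual record is exact, but the population mean survives.
print(round(statistics.mean(true_values)))
print(round(statistics.mean(shared)))  # close to the true mean
```

The design trade-off is visible here: raising the scale makes any single shared value less trustworthy (more privacy), while the zero-mean noise still averages out across the population.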
The most important solution to these data science challenges is to increase awareness of “big data” pitfalls and ensure developers are aware of the wider set of problems that the rest of the world is facing.
- Traditional statistics will help data analysts navigate the pitfalls of the “big data” explosion.
- To be successful, the data science industry must find a way to allocate credit throughout the process of data analysis.
- Allowing individuals to maintain control over their own data is worth the minor loss of accuracy that strategies like noise corruption introduce.