There is a Golden Rule in life. It’s a maxim that appears in various forms around the world:
One should never do that to another which one regards as injurious to one’s own self.
As a data scientist, I find this principle of reciprocity very appealing!
Treat others’ data as you would have others treat your data.
The recent spurt in fake news, and in incidents of profiling and targeting of swing voters during elections, has put a spotlight on data privacy. Here, we look at the following three facets of data ethics:
- Failures & Checklists
- Diversity & Empowerment
- The 5 Cs of Data Ethics
Failures & Checklists
Catastrophic failures often occur because practitioners believe they are doing no harm.
Until the late 1800s, doctors didn’t believe they needed to clean their hands before surgery, which led to countless deaths from infection. Even after scientists proved the link between hand hygiene and infection, it took decades for doctors to accept hand-washing as part of their standard operating procedures.
A data scientist (or any tech entrepreneur) doesn’t want to do any harm. We sincerely believe that the (data) products that we are building will improve the lives of our users. However, failures do occur.
So how can a data scientist avoid these scenarios? One way is by using a checklist.
A checklist helps ensure that data is used ethically. Data scientists can review the process (using a checklist) at three points: 1) during the conceptualization of the project, 2) during project execution, and 3) once the project is completed.
A sample checklist is given below:
- What kind of user consent is required?
- Have we explained clearly what users are consenting to?
- Have we tested for disparate error rates among different user groups?
- Do we have a plan to protect and secure user data?
- Have we tested our training data to ensure it is fair and representative?
- Does the team reflect a diversity of opinions, backgrounds, and ways of thinking?
- Does the algorithm look at the correct artifacts/features before making a prediction?
This list isn’t exhaustive; it is meant to evolve. It forces us to ask difficult questions while we plan our projects.
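The checklist item on disparate error rates can be made concrete with a simple audit. Below is a minimal, library-free sketch that computes per-group error rates from an evaluation log; the group labels and the log itself are illustrative placeholders, not data from any real system.

```python
from collections import defaultdict

def error_rates_by_group(records):
    """Compute per-group error rates from (group, predicted, actual) triples.

    `records` is a hypothetical evaluation log; the group names below
    are illustrative stand-ins for real user segments.
    """
    errors = defaultdict(int)
    totals = defaultdict(int)
    for group, predicted, actual in records:
        totals[group] += 1
        if predicted != actual:
            errors[group] += 1
    return {g: errors[g] / totals[g] for g in totals}

# Illustrative evaluation log: (group, model prediction, ground truth)
log = [
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 1), ("group_a", 0, 0),
    ("group_b", 1, 0), ("group_b", 0, 1), ("group_b", 1, 1), ("group_b", 0, 0),
]
rates = error_rates_by_group(log)
# group_a has 0/4 errors while group_b has 2/4: a disparity worth investigating
```

A gap like the one above doesn’t by itself prove unfairness, but it is exactly the kind of signal the checklist should force a team to investigate before shipping.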
To verify that algorithms actually do what they’re supposed to do, data science teams can use model-interpretability tools like SHAP and LIME. These tools help identify the features a machine learning algorithm relies on to make a prediction.
This saves us from the embarrassing scenario of deploying an incorrectly trained algorithm. An infamous example is a classifier that predicted the presence of a wolf in an image based on the snow in the background.
Source: “Why Should I Trust You?” (the paper introducing LIME)
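SHAP and LIME are dedicated libraries with their own, more refined methods. To illustrate the underlying intuition without any dependencies, here is a sketch of permutation importance, a simpler, related technique: shuffle one feature’s values and measure how much accuracy drops. The toy “wolf detector” and its data below are invented for illustration only.

```python
import random

def permutation_importance(predict, X, y, feature_idx, n_repeats=10, seed=0):
    """Estimate a feature's importance by shuffling its column and
    measuring the average drop in accuracy. SHAP and LIME use more
    sophisticated attribution methods; the intuition is similar."""
    rng = random.Random(seed)

    def accuracy(rows):
        return sum(predict(r) == t for r, t in zip(rows, y)) / len(y)

    base = accuracy(X)
    drops = []
    for _ in range(n_repeats):
        col = [row[feature_idx] for row in X]
        rng.shuffle(col)
        shuffled = [row[:feature_idx] + [v] + row[feature_idx + 1:]
                    for row, v in zip(X, col)]
        drops.append(base - accuracy(shuffled))
    return sum(drops) / n_repeats

# Toy "wolf detector" that (wrongly) keys only on feature 1, "snow in background"
predict = lambda row: 1 if row[1] > 0.5 else 0
X = [[0.2, 0.9], [0.8, 0.1], [0.5, 0.95], [0.1, 0.05]]
y = [1, 0, 1, 0]  # labels happen to track the snow feature

# Shuffling feature 1 destroys accuracy; shuffling feature 0 changes nothing,
# revealing which feature the model actually relies on.
```

In the wolf example, such a probe would show the model leaning on the background rather than the animal, which is precisely the failure the checklist item about “correct artifacts/features” is meant to catch.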
Diversity & Empowerment
New-age companies live by the maxim of moving fast and breaking things: fail fast and cheap!
We rush to build minimum viable products without understanding the consequences. Young engineers push their builds into production, but if they have reservations about a product, we do not empower them to roll back products or features from the market.
In the book Ethics and Data Science, the authors talk about creating a safe space for dissent within data science teams. A single team member may object to an approach, but if there is no support for ethical thinking within an organization, they can easily be sidelined. Hence it’s important, from an organizational perspective, to empower even the most junior members of a data science team.
Another way to avoid blind spots is to build diversity into a team. Developers from different backgrounds and with different expertise add significant value to a team and help surface assumptions no one else would question.
I highly recommend watching this talk by Joy Buolamwini. As a graduate student, she realized that face-detection algorithms couldn’t recognize her face. It was as if, to the algorithm, she didn’t exist!
She discovered the algorithm couldn’t recognize her face because the training dataset didn’t have any samples with a darker skin tone. This was not due to any deliberate racist behavior, but simply because none of the algorithm developers realized their training data was incomplete — all of them were Caucasian.
This can have major consequences and underscores the fact that diversity and empowerment in data science teams ensure that blind spots are covered. This also allows us to have meaningful dialog within a team.
The 5 Cs of Data Ethics
To ensure that there is a mechanism to foster a dialog, the authors of Ethics and Data Science suggest the following guidelines for building data products:
- Consent
- Clarity
- Consistency (and Trust)
- Control (and Transparency)
- Consequences (and Harm)
Consent doesn’t mean anything unless the user has clarity on the terms and conditions of the contract. Usually, contracts are a series of negotiations. But in our online transactions, it’s always a binary condition: the user either accepts the terms or rejects them. Developers of data products should not only obtain consent from the user — the users should also have clarity on 1) what data they are providing, 2) how their data will be used, and 3) what the downstream consequences of using the data could be.
Remember, “I have read and agreed to the terms and conditions” is one of the biggest lies on the web. Terms of service agreements are often too long and difficult to understand for a layperson. Hence it’s important that users know what they are consenting to, and that developers communicate it to them in the simplest terms.
Consistency is important for gaining users’ trust. Even people with the best intentions can interpret the terms of engagement in strange and unpredictable ways. Companies should surface their controls so that if users change their minds, they can simply delete their data.
Google recently gave users more control over their search history data. Users can now review and delete their search activity within Google Search and disable ad personalization.
Companies should also make users aware of the consequences of sharing their data. A prime example: essentially all users know their Twitter feeds are publicly available; however, few know that researchers and profiling firms can use those tweets, which may have unintended consequences.
The “Unknown Unknowns”
We often hear project managers talking about the “Unknown Unknowns” — the unforeseen consequences, the risks that we cannot eliminate. All too often, however, these risks are unknown because we don’t want to know them. When machine learning models are trained on biased data, there is a danger that they will institutionalize discriminatory behavior. A good example is Amazon’s recruitment algorithm, which discriminated against women.
Similarly, users have stopped trusting news agencies and consumer internet companies because they feel abused.
Data science is an evolving field, built on ideas developed only in the last few decades. Just as humans built buildings and bridges before the underlying engineering principles were codified, we are now building large-scale prediction systems that touch all of society. Early buildings and bridges collapsed in unforeseen ways; in a similar manner, these predictive systems will fail and expose serious conceptual flaws.
And it’s good to fail! Failure gives us the incentive to build more robust systems.
Further Reading
- Ethics and Data Science, by Mike Loukides, Hilary Mason & DJ Patil — this book should be required reading for anyone who is serious about data science
- Amazon Created a Hiring Tool Using A.I. It Immediately Started Discriminating Against Women — by Jordan Weissmann, Slate
- Facebook–Cambridge Analytica Scandal — Wikipedia entry
- Facebook Cambridge Analytica: A timeline of the data hijacking scandal — CNBC
- Artificial Intelligence: The Revolution Hasn’t Happened Yet — Michael I. Jordan (@mijordan3)
- The Dark Secret at the Heart of AI — Will Knight, MIT Technology Review
Disclaimer: These are my own personal views and do not represent the views/strategies of my employer, Edelweiss.