Existing facial recognition software is often trained and tested on non-diverse datasets.
As this software becomes more embedded in daily life in the United States, its racial bias is negatively impacting people of color. Stephanie Kim, a software engineer at Algorithmia, spoke about the causes and impacts of racial model overfitting at ODSC East 2018. She also pinpointed a few steps researchers can take to prevent racial bias from permeating their models.
How facial recognition adopted its racial bias
Oftentimes, researchers who create facial recognition models only have access to open source collections of images, since building their own is time-consuming and costly. But open source collections are often limited in diversity, leaving researchers with few diverse datasets to train and test their models on.
“When the distribution of your training data is dissimilar to your testing or real-world data distribution, then your model won’t give you as accurate results as when training on a more diverse dataset that better represents real-world conditions,” Kim said during the ODSC lecture.
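This distribution mismatch can be checked before any training happens. The sketch below, in plain Python, compares a training set's demographic makeup against an assumed deployment distribution and flags underrepresented groups; the group labels, counts, and the 0.5 threshold are all hypothetical, chosen only for illustration.

```python
from collections import Counter

def group_distribution(labels):
    """Return the fraction of samples belonging to each demographic group."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

# Hypothetical group labels: a skewed training set (roughly LFW's reported
# 83.5 percent split) versus an assumed real-world deployment mix.
train_groups = ["light"] * 835 + ["dark"] * 165
target_groups = ["light"] * 600 + ["dark"] * 400

train_dist = group_distribution(train_groups)
target_dist = group_distribution(target_groups)

# Flag any group whose training share is under half its deployment share.
for group in target_dist:
    ratio = train_dist.get(group, 0) / target_dist[group]
    if ratio < 0.5:
        print(f"{group} is underrepresented: {ratio:.2f}x its target share")
# prints: dark is underrepresented: 0.41x its target share
```

An audit like this will not fix a biased dataset, but it makes the skew visible before the model is trained rather than after it fails in the field.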
One popular open source dataset, Labeled Faces in the Wild, was estimated to be 77.5 percent male and 83.5 percent white, Kim said, citing a 2014 diversity study. OpenFace, a widely adopted open source facial recognition library, uses this dataset in two different parts of its development pipeline before other users add any of their own elements.
This lack of training data diversity has led to facial recognition model overfitting and built-in racial bias.
Caption: OpenFace uses the dlib C++ library for face detection, and that library was trained on the Labeled Faces in the Wild dataset. OpenFace also uses Labeled Faces in the Wild for its own testing.
The real-world impact
In 2015, Google Photos classified black people as gorillas. Google’s initial fix was to remove certain labels, such as the term “gorillas,” rather than fix the model itself.
That instance of racial bias led to public uproar. Today, the stakes are higher: model bias can mean non-offending black citizens are incorrectly identified and possibly arrested.
Police officers increasingly use facial recognition for surveillance and identification to apprehend people with warrants out for their arrest. But police facial recognition systems don’t work as well on black people’s faces, according to an FBI co-authored study cited in The Perpetual Line-Up, a report from the Center on Privacy & Technology at Georgetown Law.
Police may use public surveillance cameras or smartphone software loaded with mugshot data to identify people. In Florida, police departments also have access to DMV images, so their software can attempt matches against non-offending citizens as well.
This is where model overfitting comes into play: Because black people are arrested more often in the United States, there are more black people’s mugshots in the database the facial recognition software searches. But that software fails more easily on their faces because of a lack of representation in training data, so more black people will be incorrectly classified and arrested, further compounding the issue.
Solutions to minimize or erase racial bias in facial recognition
Use diverse training sets: MIT researcher Joy Buolamwini and Microsoft’s Timnit Gebru developed an approach to evaluating gender and racial bias in facial analysis algorithms. They classified faces using the Fitzpatrick skin type system rather than racial or ethnic labels, because those labels span such a wide variety of skin types that a single label cannot account for them.
When they found that even datasets known for being more diverse than Labeled Faces in the Wild were still composed mostly of light-skinned individuals, they created their own dataset, using the Fitzpatrick system to collect images more diverse in gender and skin type. The dataset, called the Pilot Parliaments Benchmark, includes 1,270 individuals and is available upon request. Kim recommended researchers use it to test their own models for racial bias.
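Testing against a balanced benchmark like this comes down to reporting accuracy per group instead of one aggregate number. Below is a minimal sketch of that idea in plain Python; the labels, predictions, and Fitzpatrick type tags are invented for illustration, not taken from any real model.

```python
from collections import defaultdict

def accuracy_by_group(y_true, y_pred, groups):
    """Break classification accuracy down by demographic group.

    A single aggregate accuracy can hide large per-group gaps, which is
    exactly what a balanced benchmark is meant to expose.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        totals[group] += 1
        hits[group] += int(truth == pred)
    return {g: hits[g] / totals[g] for g in totals}

# Invented results: aggregate accuracy is 0.5, but the split tells the story.
y_true = ["A", "B", "A", "B", "A", "B"]
y_pred = ["A", "B", "A", "A", "B", "A"]
skin   = ["I", "I", "I", "VI", "VI", "VI"]  # hypothetical Fitzpatrick types

print(accuracy_by_group(y_true, y_pred, skin))
# prints: {'I': 1.0, 'VI': 0.0}
```

A model evaluated only on its aggregate score here would look mediocre; broken down by group, it is perfect on one skin type and useless on the other.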
Caption: The Pilot Parliaments Benchmark was created using the Fitzpatrick skin type classification system and incorporates faces from parliaments in Europe and Africa. The darkest skin tones came from Parliament members in Senegal and Rwanda, while the lightest came from Parliament members in Iceland and Sweden.
Create their own training sets: Researchers can also minimize bias by building their own training sets, but they must do so carefully: Kim said researchers should think through their extraction methods to minimize the potential for homogeneous datasets.
For instance, if researchers scraped the top 2,000 celebrity images off IMDb, the dataset would be disproportionately white because of the overrepresentation of white celebrities in Hollywood. By critically evaluating how a training dataset is collected, researchers can minimize the risk of creating a racially biased model.
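One blunt but simple way to act on that evaluation is to audit group counts in the scraped set and downsample the overrepresented groups. The sketch below assumes group labels are already available for each image (obtaining those labels is, in practice, the hard part), and the counts are hypothetical.

```python
import random

def balance_by_group(samples, groups, seed=0):
    """Downsample every group to the size of the smallest group."""
    rng = random.Random(seed)
    by_group = {}
    for sample, group in zip(samples, groups):
        by_group.setdefault(group, []).append(sample)
    smallest = min(len(members) for members in by_group.values())
    balanced = []
    for members in by_group.values():
        balanced.extend(rng.sample(members, smallest))
    return balanced

# Hypothetical scrape: 1,600 of the 2,000 images are of white celebrities.
images = [f"img_{i}.jpg" for i in range(2000)]
labels = ["white"] * 1600 + ["other"] * 400

balanced = balance_by_group(images, labels)
print(len(balanced))  # prints: 800 (400 per group)
```

Downsampling throws data away, of course; collecting more images from underrepresented groups, as Buolamwini and Gebru did, is usually the better fix when it is feasible.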
Build diverse teams: Oftentimes research teams and developers test on their own images, so a diverse team can catch racial bias early on. A diverse team also brings more unique perspectives into the room, leading to better-informed applications and models.
Never say a model is “good enough”: Throughout the process of finding training and testing data and evaluating the model for accuracy, never settle for “good enough.” If the model isn’t, the negative impact on society is real.
Caption: Kim concluded her lecture by telling the ODSC audience never to say their facial recognition model is “good enough” and to ensure they use diverse training datasets.
- The racial bias that exists in facial recognition negatively impacts people of color, and could potentially lead to arrests of law-abiding black citizens.
- Racial bias in facial recognition software is largely due to non-diverse training datasets. Researchers can combat this by being intentional about diversity in the training sets they use.
- Existing open source datasets and facial recognition models are usually not diverse, or are built on non-diverse datasets, so evaluate them carefully.