How to make a racist AI without really trying

A cautionary tutorial.

Perhaps you heard about Tay, Microsoft’s experimental Twitter chat-bot, and how within a day it became so offensive that Microsoft had to shut it down and never speak of it again. And you assumed that you would never make such a thing, because you’re not doing anything weird like letting random jerks on Twitter re-train your AI on the fly.

My purpose with this tutorial is to show that you can follow an extremely typical NLP pipeline, using popular data and popular techniques, and end up with a racist classifier that should never be deployed.

There are ways to fix it. Making a non-racist classifier is only a little bit harder than making a racist classifier. The fixed version can even be more accurate at evaluations. But to get there, you have to know about the problem, and you have to be willing to not just use the first thing that works.

This tutorial is a Jupyter Python notebook that was originally hosted on GitHub Gist.

Let’s make a sentiment classifier!

Sentiment analysis is a very frequently-implemented task in NLP, and it’s no surprise. Recognizing whether people are expressing positive or negative opinions about things has obvious business applications. It’s used in social media monitoring, customer feedback, and even automatic stock trading (leading to bots that buy Berkshire Hathaway when Anne Hathaway gets a good movie review).

It’s simplistic, sometimes too simplistic, but it’s one of the easiest ways to get measurable results from NLP. In a few steps, you can put text in one end and get positive and negative scores out the other, and you never have to figure out what you should do with a parse tree or a graph of entities or any difficult representation like that.

So that’s what we’re going to do here, following the path of least resistance at every step, obtaining a classifier that should look very familiar to anyone involved in current NLP. For example, you can find this model described in the Deep Averaging Networks paper (Iyyer et al., 2015). This model is not the point of that paper, so don’t take this as an attack on their results; it was there as an example of a well-known way to use word vectors.

Here’s the outline of what we’re going to do:

  • Acquire some typical word embeddings to represent the meanings of words
  • Acquire training and test data, with gold-standard examples of positive and negative words
  • Train a classifier, using gradient descent, to recognize other positive and negative words based on their word embeddings
  • Compute sentiment scores for sentences of text using this classifier
  • Behold the monstrosity that we have created

And at that point we will have shown “how to make a racist AI without really trying”. Of course that would be a terrible place to leave it, so afterward, we’re going to:

  • Measure the problem statistically, so we can recognize if we’re solving it
  • Improve the data to obtain a semantic model that’s more accurate and less racist

Software dependencies

This tutorial is written in Python, and relies on a typical Python machine-learning stack: numpy and scipy for numerical computing, pandas for managing our data, and scikit-learn for machine learning. Later on we’ll graph some things with matplotlib and seaborn.

You could also replace scikit-learn with TensorFlow or Keras or something like that, as they can also train classifiers using gradient descent. But there’s no need for the deep-learning abstractions they provide, as it only takes a single layer of machine learning to solve this problem.

In [1]:
import numpy as np
import pandas as pd
import matplotlib
import seaborn
import re
import statsmodels.formula.api

from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
In [2]:
# Configure how graphs will show up in this notebook
%matplotlib inline
seaborn.set_context('notebook', rc={'figure.figsize': (10, 6)}, font_scale=1.5)

Step 1: Word embeddings

Word embeddings are frequently used to represent words as inputs to machine learning. The words become vectors in a multi-dimensional space, where nearby vectors represent similar meanings. With word embeddings, you can compare words by (roughly) what they mean, not just exact string matches.
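
To make "nearby vectors" concrete, here is a minimal sketch using cosine similarity, one common way to compare embeddings. The three-dimensional vectors are made-up toy values, not real GloVe vectors, and the helper name cosine_similarity is our own.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1.0 means the same direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional "embeddings", made up for illustration.
# The GloVe vectors used in this tutorial have 300 dimensions.
toy_vectors = {
    'good':  np.array([0.9, 0.1, 0.2]),
    'great': np.array([0.8, 0.2, 0.3]),
    'table': np.array([0.1, 0.9, 0.1]),
}

sim_good_great = cosine_similarity(toy_vectors['good'], toy_vectors['great'])
sim_good_table = cosine_similarity(toy_vectors['good'], toy_vectors['table'])

# In this toy data, words with similar meanings point in similar directions
print(sim_good_great > sim_good_table)
```

With real embeddings the same comparison works on any pair of rows, which is what lets a classifier generalize from words it has seen to words it has not.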

Successfully training word vectors requires starting from hundreds of gigabytes of input text. Fortunately, various machine-learning groups have already done this and provided pre-trained word embeddings that we can download.

Two very well-known datasets of pre-trained English word embeddings are word2vec, pretrained on Google News data, and GloVe, pretrained on the Common Crawl of web pages. We would get similar results for either one, but here we’ll use GloVe because its source of data is more transparent.

GloVe comes in three sizes: 6B, 42B, and 840B. The 840B size is powerful, but requires significant post-processing to use it in a way that’s an improvement over 42B. The 42B version is pretty good and is also neatly trimmed to a vocabulary of about 1.9 million words. Because we’re following the path of least resistance, we’ll just use the 42B version.

Why does it matter that the word embeddings are “well-known”?

I’m glad you asked, hypothetical questioner! We’re trying to do something extremely typical at each step, and for some reason, comparison-shopping for better word embeddings isn’t typical yet. Read on, and I hope you’ll come out of this tutorial with the desire to use modern, high-quality word embeddings, especially those that are aware of algorithmic bias and try to mitigate it. But that’s getting ahead of things.

We download glove.42B.300d.zip from the GloVe web page, and extract it into data/glove.42B.300d.txt. Next we define a function to read the simple format of its word vectors.

In [3]:
def load_embeddings(filename):
    """
    Load a DataFrame from the generalized text format used by word2vec, GloVe,
    fastText, and ConceptNet Numberbatch. The main point where they differ is
    whether there is an initial line with the dimensions of the matrix.
    """
    labels = []
    rows = []
    with open(filename, encoding='utf-8') as infile:
        for i, line in enumerate(infile):
            items = line.rstrip().split(' ')
            if len(items) == 2:
                # This is a header row giving the shape of the matrix
                continue
            # The first item is the word; the rest are its vector components
            labels.append(items[0])
            values = np.array([float(x) for x in items[1:]], 'f')
            rows.append(values)

    arr = np.vstack(rows)
    return pd.DataFrame(arr, index=labels, dtype='f')

embeddings = load_embeddings('data/glove.42B.300d.txt')
embeddings.shape
(1917494, 300)

Step 2: A gold-standard sentiment lexicon

We need some input about which words are positive and which words are negative. There are many sentiment lexicons you could use, but we’re going to go with a very straightforward lexicon (Hu and Liu, 2004), the same one used by the Deep Averaging Networks paper.

We download the lexicon from Bing Liu’s web site (https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon) and extract it into data/positive-words.txt and data/negative-words.txt.

Next we define how to read these files, and read them in as the pos_words and neg_words variables:

In [4]:
def load_lexicon(filename):
    """
    Load a file from Bing Liu's sentiment lexicon
    (https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html), containing
    English words in Latin-1 encoding.

    One file contains a list of positive words, and the other contains
    a list of negative words. The files contain comment lines starting
    with ';' and blank lines, which should be skipped.
    """
    lexicon = []
    with open(filename, encoding='latin-1') as infile:
        for line in infile:
            line = line.rstrip()
            if line and not line.startswith(';'):
                lexicon.append(line)
    return lexicon

pos_words = load_lexicon('data/positive-words.txt')
neg_words = load_lexicon('data/negative-words.txt')

Step 3: Train a model to predict word sentiments

Our data points here are the embeddings of these positive and negative words. We use the Pandas .loc[] operation to look up the embeddings of all the words.

Some of these words are not in the GloVe vocabulary, particularly the misspellings such as “fancinating”. Those words end up with rows full of NaN to indicate their missing embeddings, so we use .dropna() to remove them.

In [5]:
pos_vectors = embeddings.loc[pos_words].dropna()
neg_vectors = embeddings.loc[neg_words].dropna()
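
A note for readers on newer pandas releases: there, .loc raises a KeyError when any requested label is missing instead of filling the row with NaN. If you hit that, .reindex() is one way to get the NaN-filling lookup this step relies on. Here is a minimal sketch with a toy DataFrame. The names toy_embeddings and toy_words are made up for illustration, and it assumes the embedding index has no duplicate labels, which reindex() requires.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real embeddings: 3 words, 2 dimensions
toy_embeddings = pd.DataFrame(
    np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], dtype='f'),
    index=['good', 'great', 'bad'],
)

# 'fancinating' is not in the toy vocabulary
toy_words = ['good', 'fancinating', 'great']

# reindex() produces a row of NaN for each missing label,
# and dropna() then removes those rows
toy_vectors = toy_embeddings.reindex(toy_words).dropna()
print(list(toy_vectors.index))
```

The same substitution would apply anywhere else the notebook looks up a list of words that may not all be in the vocabulary.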

Now we make arrays of the desired inputs and outputs. The inputs are the embeddings, and the outputs are 1 for positive words and -1 for negative words. We also make sure to keep track of the words they’re labeled with, so we can interpret the results.

In [6]:
vectors = pd.concat([pos_vectors, neg_vectors])
targets = np.array([1 for entry in pos_vectors.index] + [-1 for entry in neg_vectors.index])
labels = list(pos_vectors.index) + list(neg_vectors.index)

Hold on. Some words are neither positive nor negative, they’re neutral. Shouldn’t there be a third class for neutral words?

I think that having examples of neutral words would be quite beneficial, especially because the problems we’re going to see come from assigning sentiment to words that shouldn’t have sentiment. If we could reliably identify when words should be neutral, it would be worth the slight extra complexity of a 3-class classifier. It requires finding a source of examples of neutral words, because Liu’s data only lists positive and negative words.

So I tried a version of this notebook where I put in 800 examples of neutral words, and put a strong weight on predicting words to be neutral. But the end results were not much different from what you’re about to see.

How is this list drawing the line between positive and negative anyway? Doesn’t that depend on context?

Good question. Domain-general sentiment analysis isn’t as straightforward as it sounds. The decision boundary we’re trying to find is fairly arbitrary in places. In this list, “audacious” is marked as “bad” while “ambitious” is “good”. “Comical” is bad, “humorous” is good. “Refund” is good, even though it’s typically in bad situations that you have to request one or pay one.

I think everyone knows that sentiment requires context, but when implementing an easy approach to sentiment analysis, you just have to kind of hope that you can ignore context and the sentiments will average out to the right trend.

Using the scikit-learn train_test_split function, we simultaneously separate the input vectors, output values, and labels into training and test data, with 10% of the data used for testing.

In [7]:
train_vectors, test_vectors, train_targets, test_targets, train_labels, test_labels = \
    train_test_split(vectors, targets, labels, test_size=0.1, random_state=0)
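
Passing all three collections to train_test_split in one call is what keeps them aligned: each one is shuffled and split with the same indices. Here is a small sketch with made-up toy data (the toy_ names are ours) that checks the alignment after the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 one-dimensional "vectors", their targets, and matching labels.
# Row i holds the value i, so we can recover which row each sample came from.
toy_vectors = np.arange(10).reshape(10, 1)
toy_targets = np.array([1 if i % 2 == 0 else -1 for i in range(10)])
toy_labels = ['word%d' % i for i in range(10)]

tr_vecs, te_vecs, tr_targets, te_targets, tr_labels, te_labels = \
    train_test_split(toy_vectors, toy_targets, toy_labels,
                     test_size=0.1, random_state=0)

# Each held-out label still lines up with its own vector and target
for vec, target, label in zip(te_vecs, te_targets, te_labels):
    i = int(vec[0])
    print(label, label == toy_labels[i], target == toy_targets[i])
```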

Now we make our classifier, and train it by running the training vectors through it for 100 iterations. We use a logistic function as the loss, so that the resulting classifier can output the probability that a word is positive or negative.

In [8]:
model = SGDClassifier(loss='log', random_state=0, n_iter=100)
model.fit(train_vectors, train_targets)
SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='log', n_iter=100, n_jobs=1,
       penalty='l2', power_t=0.5, random_state=0, shuffle=True, verbose=0,
       warm_start=False)

We evaluate the classifier on the test vectors. It predicts the correct sentiment for sentiment words outside of its training data 95% of the time. Not bad.

In [9]:
accuracy_score(model.predict(test_vectors), test_targets)

Let’s define a function that we can use to see the sentiment that this classifier predicts for particular words, then use it to see some examples of its predictions on the test data.

In [10]:
def vecs_to_sentiment(vecs):
    # predict_log_proba gives the log probability for each class
    predictions = model.predict_log_proba(vecs)

    # To see an overall positive vs. negative classification in one number,
    # we take the log probability of positive sentiment minus the log
    # probability of negative sentiment.
    return predictions[:, 1] - predictions[:, 0]

def words_to_sentiment(words):
    vecs = embeddings.loc[words].dropna()
    log_odds = vecs_to_sentiment(vecs)
    return pd.DataFrame({'sentiment': log_odds}, index=vecs.index)

# Show 20 examples from the test set
words_to_sentiment(test_labels).iloc[:20]

              sentiment
fidget        -9.931679
interrupt     -9.634706
staunchly      1.466919
imaginary     -2.989215
taxing         0.468522
world-famous   6.908561
low-cost       9.237223
disapointment -8.737182
totalitarian -10.851580
bellicose     -8.328674
freezes       -8.456981
sin           -7.839670
fragile       -4.018289
fooled        -4.309344
undecided     -2.816172
handily        2.339609
demonizes     -2.102152
easygoing      8.747150
unpopular     -7.887475
commiserate    1.790899

More than the accuracy number, this convinces us that the classifier is working. We can see that the classifier has learned to generalize sentiment to words outside of its training data.
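
It helps to know how to read these numbers. The score is a log-odds value: the log probability of the positive class minus the log probability of the negative class. For a logistic model this equals the linear score before the sigmoid is applied, so 0 means undecided and large magnitudes mean near-certainty. Here is a numpy-only sketch with made-up scores in the range of the example predictions above (the helper sigmoid and the variable names are ours).

```python
import numpy as np

def sigmoid(z):
    """Map a linear score to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Made-up linear scores, roughly spanning the range of the examples above
raw_scores = np.array([-9.9, -2.8, 0.0, 1.5, 9.2])

p_positive = sigmoid(raw_scores)
p_negative = 1.0 - p_positive

# Same formula as vecs_to_sentiment: log P(positive) - log P(negative)
log_odds = np.log(p_positive) - np.log(p_negative)

print(np.round(p_positive, 4))
# The log-odds recover the original linear scores
print(np.allclose(log_odds, raw_scores))
```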

Step 4: Get a sentiment score for text

There are many ways to combine sentiments for word vectors into an overall sentiment score. Again, because we’re following the path of least resistance, we’re just going to average them.
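
As a rough sketch of what averaging could look like, the function below splits text into tokens, looks up a score for each known word, and takes the mean. Everything here is an assumption for illustration: the tokenizer regex, the name text_to_sentiment, the choice to return 0.0 when no word is known, and the toy_word_scores dictionary, whose values are rounded from the example predictions above. In the notebook, the per-word scores would come from words_to_sentiment instead.

```python
import re

# Assumption: a simple tokenizer that keeps runs of word characters,
# apostrophes, and hyphens
TOKEN_RE = re.compile(r"[\w'-]+")

# Toy per-word scores, rounded from the example predictions above
toy_word_scores = {'easygoing': 8.7, 'fragile': -4.0, 'unpopular': -7.9}

def text_to_sentiment(text, word_scores):
    """Average the scores of the tokens that have one; unknown words are skipped."""
    tokens = [token.casefold() for token in TOKEN_RE.findall(text)]
    known = [word_scores[token] for token in tokens if token in word_scores]
    if not known:
        # Assumption: text with no scored words is treated as neutral
        return 0.0
    return sum(known) / len(known)

score = text_to_sentiment("An easygoing but unpopular mayor", toy_word_scores)
print(round(score, 2))
```

Note that a plain average gives every word the same weight, so a single strongly scored word can dominate a short sentence.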

Rob Speer

Rob Speer is an alumnus of the MIT Media Lab, and now brings his expertise in natural language processing as the Co-Founder and Chief Science Officer at Luminoso.