# On word embeddings – Part 2: Approximating the Softmax

Deep LearningModelingNLP & LLMsposted by Sebastian Ruder April 17, 2017 Sebastian Ruder

### Table of contents:

This is the second post in a series on word embeddings and representation learning. In the previous post, we gave an overview of word embedding models and introduced the classic neural language model by Bengio et al. (2003), the C&W model by Collobert and Weston (2008), and the word2vec model by Mikolov et al. (2013). We observed that mitigating the complexity of computing the final softmax layer has been one of the main challenges in devising better word embedding models, a commonality with machine translation (MT) (Jean et al. [^{10}]) and language modelling (Jozefowicz et al. [^{6}]).

In this post, we will thus focus on giving an overview of various approximations to the softmax layer that have been proposed over the last years, some of which have so far only been employed in the context of language modelling or MT. We will postpone the discussion of additional hyperparameters to the subsequent post.

Let us know partially re-introduce the previous post’s notation both for consistency and to facilitate comparison as well as introduce some new notation: We assume a training corpus containing a sequence of (T) training words (w_1, w_2, w_3, cdots, w_T) that belong to a vocabulary (V) whose size is (|V|). Our models generally consider a context (c) of ( n ) words. We associate every word with an input embedding ( v_w ) (the eponymous word embedding in the Embedding Layer) with (d) dimensions and an output embedding ( v’_w ) (the representation of the word in the weight matrix of the softmax layer). We finally optimize an objective function (J_theta) with regard to our model parameters (theta).

Recall that the softmax calculates the probability of a word (w) given its context (c) and can be computed using the following equation:

(p(w : | : c) = dfrac{text{exp}({h^top v’_w})}{sum_{w_i in V} text{exp}({h^top v’_{w_i}})} )

where (h) is the output vector of the penultimate network layer. Note that we use (c) for the context as mentioned above and drop the index (t) of the target word (w_t) for simplicity. Computing the softmax is expensive as the inner product between (h) and the output embedding of every word (w_i) in the vocabulary (V) needs to be computed as part of the sum in the denominator in order to obtain the normalized probability of the target word (w) given its context (c).

In the following we will discuss different strategies that have been proposed to approximate the softmax. These approaches can be grouped into softmax-based and sampling-based approaches. Softmax-based approaches are methods that keep the softmax layer intact, but modify its architecture to improve its efficiency. Sampling-based approaches on the other hand completely do away with the softmax layer and instead optimise some other loss function that approximates the softmax.

# Softmax-based Approaches

## Hierarchical Softmax

Hierarchical softmax (H-Softmax) is an approximation inspired by binary trees that was proposed by Morin and Bengio [^{3}]. H-Softmax essentially replaces the flat softmax layer with a hierarchical layer that has the words as leaves, as can be seen in Figure 1.

This allows us to decompose calculating the probability of one word into a sequence of probability calculations, which saves us from having to calculate the expensive normalization over all words. Replacing a softmax layer with H-Softmax can yield speedups for word prediction tasks of at least (50 times) and is thus critical for low-latency tasks such as real-time communication in Google’s new messenger app Allo.

We can think of the regular softmax as a tree of depth (1), with each word in (V) as a leaf node. Computing the softmax probability of one word then requires normalizing over the probabilities of all (|V|) leaves. If we instead structure the softmax as a binary tree, with the words as leaf nodes, then we only need to follow the path to the leaf node of that word, without having to consider any of the other nodes.

Since a balanced binary tree has a depth of (text{log}_2 (:|V|:)), we only need to evaluate at most (text{log}_2 (:|V|:)) nodes to obtain the final probability of a word. Note that this probability is already normalized, as the probabilities of all leaves in a binary tree sum to ( 1 ) and thus form a probability distribution. To informally verify this, we can reason that at a tree’s root node (Node 0) in Figure 1), the probabilities of branching decisions must sum to (1). At each subsequent node, the probability mass is then split among its children, until it eventually ends up at the leaf nodes, i.e. the words. Since no probability is lost along the way and since all words are leaves, the probabilities of all words must necessarily sum to (1) and hence the hierarchical softmax defines a normalized probability distribution over all words in (V).

To get a bit more concrete, as we go through the tree, we have to be able to calculate the probability of taking the left or right branch at every junction. For this reason, we assign a representation to every node. In contrast to the regular softmax, we thus no longer have output embeddings (v’_w) for every word (w) — instead, we have embeddings (v’_n) for every node (n). As we have (|V|-1) nodes and each one possesses a unique representation, the number of parameters of H-Softmax is almost the same as for the regular softmax. We can now calculate the probability of going right (or left) at a given node (n) given the context (c) the following way:

( p(: text{right}: | : n, c) = sigma (h^top v’_{n})).

This is almost the same as the computations in the regular softmax; now instead of computing the dot product between (h) and the output word embedding (v’_{w}), we compute the dot product between (h) and the embedding (v’_{w}) of each node in the tree; additionally, instead of computing a probability distribution over the entire vocabulary words, we output just one probability, the probability of going right at node (n) in this case, with the sigmoid function. Conversely, the probability of turning left is simply ( 1 – p(:text{right}: | : n,c)).

The probability of a word (w) given its context (c) is then simply the product of the probabilities of taking right and left turns respectively that lead to its leaf node. To illustrate this, given the context “the”, “dog”, “and”, “the”, the probability of the word “cat” in Figure 2 can be computed as the product of the probability of turning left at node 1, turning right at node 2, and turning right at node 5. Hugo Lachorelle gives a more detailed account in his excellent lecture video. Rong [^{7}] also does a good job of explaining these concepts and also derives the derivatives of H-Softmax.

Obviously, the structure of the tree is of significance. Intuitively, we should be able to achieve better performance, if we make it easier for the model to learn the binary predictors at every node, e.g. by enabling it to assign similar probabilities to similar paths. Based on this idea, Morin and Bengio use the synsets in WordNet as clusters for the tree. However, they still report inferior performance to the regular softmax. Mnih and Hinton [^{8}] learn the tree structure with a clustering algorithm that recursively partitions the words in two clusters and allows them to achieve the same performance as the regular softmax at a fraction of the computation.

Notably, we are only able to obtain this speed-up during training, when we know the word we want to predict (and consequently its path) in advance. During testing, when we need to find the most likely prediction, we still need to calculate the probability of all words, although narrowing down the choices in advance helps here.

In practice, instead of using “right” and “left” in order to designate nodes, we can index every node with a bit vector that corresponds to the path it takes to reach that node. In Figure 2, if we assume a `0`

bit for turning left and a `1`

bit for turning right, we can thus represent the path to “cat” as `011`

.

Recall that the path length in a balanced binary tree is (text{log}_2 |V|). If we set (|V| = 10000), this amounts to an average path length of about ( 13.3 ). Analogously, we can represent every word by the bit vector of its path that is on average (13.3) bits long. In information theory, this is referred to as an information content of (13.3) bits per word.

### A note on the information content of words

Recall that the information content (I(w)) of a word (w) is the negative logarithm of its probability (p(w)):

(I(w) = -log_2 p(w) ).

The entropy ( H ) of all words in a corpus is then the expectation of the information content of all words in the vocabulary:

( H = sum_{iin V} p(w_i) : I(w_i) ).

We can also conceive of the entropy of a data source as the average number of bits needed to encode it. For a fair coin flip, we need (1) bit per flip, whereas we need (0) bits for a data source that always emits the same symbol. For a balanced binary tree, where we treat every word equally, the word entropy (H) equals the information content (I(w)) of every word (w), as each word has the same probability. The average word entropy (H) in a balanced binary tree with (|V| = 10000) thus coincides with its average path length:

( H = – sum_{iin V} dfrac{1}{10000} log_2 dfrac{1}{10000} = 13.3).

We saw before that the structure of the tree is important. Notably, we can leverage the tree structure not only to gain better performance, but also to speed up computation: If we manage to encode more information into the tree, we can get away with taking shorter paths for less informative words. Morin and Bengio point out that leveraging word probabilities should work even better; as some words are more likely to occur than others, they can be encoded using less information. They note that the word entropy of their corpus (with (|V| = 10,000)) is about ( 9.16 ).

Thus, by taking into account frequencies, we can reduce the average number of bits per word in the corpus from ( 13.3 ) to ( 9.16 ) in this case, which amounts to a speed-up of 31%. A Huffman tree, which is used by Mikolov et al. [^{1}] for their hierarchical softmax, generates such a coding by assigning fewer bits to more common symbols. For instance, “the”, the most common word in the English language, would be assigned the shortest bit code in the tree, the second most frequent word would be assigned the second-shortest bit code, and so on. While we still need the same number of codes to designate all words, when we predict the words in a corpus, short codes appear now a lot more often, and we consequently need fewer bits to represent each word on average.

A coding such as Huffman coding is also known as entropy encoding, as the length of each codeword is approximately proportional to the entropy of each symbol as we have observed. Shannon [^{5}] establishes in his experiments that the lower bound on the information rate in English is between (0.6) to (1.3) bits per character; given an average word length of (4.5), this amounts to (2.7) – (5.85) bits per word.

To tie this back to language modelling (which we already talked about in the previous post): perplexity, the evaluation measure of language modelling, is (2^{H}) where (H) is the entropy. A unigram entropy of ( 9.16 ) thus entails a still very high perplexity of ( 2^{9.16} = 572.0). We can render this value more tangible by observing that a model with a perplexity of (572) is as confused by the data as if it had to choose among (572) possibilities for each word uniformly and independently.

To put this into context: The state-of-the-art language model by Jozefowicz et al. (2016) achieves a perplexity of (24.2) per word on the 1B Word Benchmark. Such a model would thus require an average of around (4.60) bits to encode each word, as ( 2^{4.60} = 24.2 ), which is incredibly close to the experimental lower bounds documented by Shannon. If and how we could use such a model to construct a better hierarchical softmax layer is still left to be explored.

## Differentiated Softmax

Chen et al. [^{9}] introduce a variation on the traditional softmax layer, the Differentiated Softmax (D-Softmax). D-Softmax is based on the intuition that not all words require the same number of parameters: Many occurrences of frequent words allow us to fit many parameters to them, while extremely rare words might only allow to fit a few.

In order to do this, instead of the dense matrix of the regular softmax layer of size (d : times : |V| ) containing the output word embeddings ( v’_w in mathbb{R}^d ), they use a sparse matrix. They then arrange ( v’_w) in blocks sorted by frequency, with the embeddings in each block being of a certain dimensionality (d_k ). The number of blocks and their embedding sizes are hyperparameters that can be tuned.

In Figure 3, embeddings in partition (A) are of dimensionality (d_A) (these are embeddings of frequent words, as they are allocated more parameters), while embeddings in partitions (B) and (C) have (d_B) and (d_C) dimensions respectively. Note that all areas not part of any partition, i.e. the non-shaded areas in Figure 1, are set to (0).

The output of the previous hidden layer (h) is treated as a concatenation of features corresponding to each partition of the dimensionality of that partition, e.g. (h) in Figure 3 is made up of partitions of size (d_A), (d_B), and (d_B) respectively. Instead of computing the matrix-vector product between the entire output embedding matrix and (h) as in the regular softmax, D-Softmax then computes the product of each partition and its corresponding section in (h).

As many words will only require comparatively few parameters, the complexity of computing the softmax is reduced, which speeds up training. In contrast to H-Softmax, this speed-up persists during testing. Chen et al. (2015) observe that D-Softmax is the fastest method when testing, while being one of the most accurate. However, as it assigns fewer parameters to rare words, D-Softmax does a worse job at modelling them.

## CNN-Softmax

Another modification to the traditional softmax layer is inspired by recent work by Kim et al. [^{13}] who produce input word embeddings (v_w) via a character-level CNN. Jozefowicz et al. (2016) in turn suggest to do the same thing for the output word embeddings (v’_w) via a character-level CNN — and refer to this as CNN-Softmax. Note that if we have a CNN at the input and at the output as in Figure 4, the CNN generating the output word embeddings (v’_w) is necessarily different from the CNN generating the input word embeddings (v_w), just as the input and output word embedding matrices would be different.

While this still requires computing the regular softmax normalization, this approach drastically reduces the number of parameters of the model: Instead of storing an embedding matrix of (d : times : |V| ), we now only need to keep track of the parameters of the CNN. During testing, the output word embeddings (v’_w) can be pre-computed, so that there is no loss in performance.

However, as characters are represented in a continuous space and as the resulting model tends to learn a smooth function mapping characters to a word embedding, character-based models often find it difficult to differentiate between similarly spelled words with different meanings. To mitigate this, the authors add a correction factor that is learned per word, which significantly reduces the performance gap between regular and CNN-softmax. By adjusting the dimensionality of the correction term, the authors are able to trade-off model size versus performance.

The authors also note that instead of using a CNN-softmax, the output of the previous layer (h) can be fed to a character-level LSTM, which predicts the output word one character at a time. Instead of a softmax over words, a softmax outputting a probability distribution over characters would thus be used at every time step. They, however, fail to achieve competitive performance with this layer. Ling et al. [^{14}] use a similar layer for machine translation and achieve competitive results.

# Sampling-based Approaches

While the approaches discussed so far still maintain the overall structure of the softmax, sampling-based approaches on the other hand completely do away with the softmax layer. They do this by approximating the normalization in the denominator of the softmax with some other loss that is cheap to compute. However, sampling-based approaches are only useful at training time — during inference, the full softmax still needs to be computed to obtain a normalised probability.

In order to gain some intuitions about the softmax denominator’s impact on the loss, we will derive the gradient of our loss function (J_theta) w.r.t. the parameters of our model (theta).

During training, we aim to minimize the cross-entropy loss of our model for every word (w) in the training set. This is simply the negative logarithm of the output of our softmax. If you are unsure of this connection, have a look at Karpathy’s explanation to gain some more intuitions about the connection between softmax and cross-entropy. The loss of our model is then the following:

(J_theta = – : text{log} : dfrac{text{exp}({h^top v’_{w}})}{sum_{w_i in V} text{exp}({h^top v’_{w_i}})} ).

Note that in practice (J_theta) would be the average of all negative log-probabilities over the whole corpus. To facilitate the derivation, we decompose (J_theta ) into a sum as (text{log} : dfrac{x}{y} = text{log} : x – text{log} : y ):

(J_theta = – : h^top v’_{w} + text{log} sum_{w_i in V} text{exp}(h^top v’_{w_i}) )

For brevity and to conform with the notation of Bengio and Senécal [^{4}, ^{15}] (note that in the first paper, they compute the gradient of the *positive* logarithm), we replace the dot product ( h^top v’_{w} ) with ( – mathcal{E}(w) ). Our loss then looks like the following:

(J_theta = : mathcal{E}(w) + text{log} sum_{w_i in V} text{exp}( – mathcal{E}(w_i)) )

For back-propagation, we can now compute the gradient (nabla ) of (J_theta ) w.r.t. our model’s parameters (theta):

(nabla_theta J_theta = : nabla_theta mathcal{E}(w) + nabla_theta text{log} sum_{w_i in V} text{exp}(- mathcal{E}(w_i)) )

As the gradient of ( text{log} : x ) is (dfrac{1}{x} ), an application of the chain rule yields:

(nabla_theta J_theta = : nabla_theta mathcal{E}(w) + dfrac{1}{sum_{w_i in V} text{exp}(- mathcal{E}(w_i))} nabla_theta sum_{w_i in V} text{exp}(- mathcal{E}(w_i) )

We can now move the gradient inside the sum:

(nabla_theta J_theta = : nabla_theta mathcal{E}(w) + dfrac{1}{sum_{w_i in V} text{exp}(- mathcal{E}(w_i))} sum_{w_i in V} nabla_theta : text{exp}(- mathcal{E}(w_i)) )

As the gradient of (text{exp}(x)) is just (text{exp}(x)), another application of the chain rule yields:

(nabla_theta J_theta = : nabla_theta mathcal{E}(w) + dfrac{1}{sum_{w_i in V} text{exp}(- mathcal{E}(w_i))} sum_{w_i in V} text{exp}(- mathcal{E}(w_i)) nabla_theta (- mathcal{E}(w_i)) )

We can rewrite this as:

(nabla_theta J_theta = : nabla_theta mathcal{E}(w) + sum_{w_i in V} dfrac{text{exp}(- mathcal{E}(w_i))}{sum_{w_i in V} text{exp}(- mathcal{E}(w_i))} nabla_theta (- mathcal{E}(w_i)) )

Note that ( dfrac{text{exp}(- mathcal{E}(w_i))}{sum_{w_i in V} text{exp}(- mathcal{E}(w_i))} ) is just the softmax probability (P(w_i) ) of (w_i) (we omit the dependence on the context (c) here for brevity). Replacing it yields:

(nabla_theta J_theta = : nabla_theta mathcal{E}(w) + sum_{w_i in V} P(w_i) nabla_theta (- mathcal{E}(w_i)) )

Finally, repositioning the negative coefficient in front of the sum yields:

(nabla_theta J_theta = : nabla_theta mathcal{E}(w) – sum_{w_i in V} P(w_i) nabla_theta mathcal{E}(w_i) )

Bengio and Senécal (2003) note that the gradient essentially has two parts: a positive reinforcement for the target word (w) (the first term in the above equation) and a negative reinforcement for all other words (w_i), which is weighted by their probability (the second term). As we can see, this negative reinforcement is just the expectation (mathbb{E}_{w_i sim P}) of the gradient of (mathcal{E} ) for all words (w_i) in (V):

(sum_{w_i in V} P(w_i) nabla_theta mathcal{E}(w_i) = mathbb{E}_{w_i sim P}[nabla_theta mathcal{E}(w_i)]).

The crux of most sampling-based approach now is to approximate this negative reinforcement in some way to make it easier to compute, since we don’t want to sum over the probabilities for all words in (V).

## Importance Sampling

We can approximate the expected value (mathbb{E}) of any probability distribution using the Monte Carlo method, i.e. by taking the mean of random samples of the probability distribution. If we knew the network’s distribution, i.e. (P(w)), we could thus directly sample (m) words ( w_1 , cdots , w_m ) from it and approximate the above expectation with:

( mathbb{E}_{w_i sim P}[nabla_theta mathcal{E}(w_i)] approx dfrac{1}{m} sumlimits^m_{i=1} nabla_theta mathcal{E}(w_i) ).

However, in order to sample from the probability distribution ( P ), we need to compute ( P ), which is just what we wanted to avoid in the first place. We therefore have find some other distribution ( Q ) (we call this the proposal distribution), from which it is cheap to sample and which can be used as the basis of Monte-Carlo sampling. Preferably, (Q) should also be similar to (P), since we want our approximated expectation to be as accurate as possible. A straightforward choice in the case of language modelling is to simply use the unigram distribution of the training set for ( Q ).

This is essentially what classical Importance Sampling (IS) does: It uses Monte-Carlo sampling to approximate a target distribution (P) via a proposal distribution (Q). However, this still requires computing (P(w)) for every word (w) that is sampled. To avoid this, Bengio and Senécal (2003) use a biased estimator that was first proposed by Liu [^{16}]. This estimator can be used when ( P(w) ) is computed as a product, which is the case here, since every division can be transformed into a multiplication.

Essentially, instead of weighting the gradient (nabla_theta mathcal{E}(w_i)) with the expensive to compute probability (P_{w_i}), we weight it with a factor that leverages the proposal distribution (Q). For biased IS, this factor is (dfrac{1}{R}r(w_i)) where ( r(w) = dfrac{text{exp}(- mathcal{E}(w))}{Q(w)} ) and ( R = sum^m_{j=1} r(w_j) ).

Note that we use ( r ) and ( R ) instead of ( w) and (W) as in Bengio and Senécal (2003, 2008) to avoid name clashes. As we can see, we still compute the numerator of the softmax, but replace the normalisation in the denominator with the proposal distribution (Q). Our biased estimator that approximates the expectation thus looks like the following:

( mathbb{E}_{w_i sim P}[nabla_theta mathcal{E}(w_i)] approx dfrac{1}{R} sumlimits^m_{i=1} r(w_i) nabla_theta : mathcal{E}(w_i))

Note that the fewer samples we use, the worse is our approximation. We additionally need to adjust our sample size during training, as the network’s distribution (P) might diverge from the unigram distribution (Q ) during training, which leads to divergence of the model, if the sample size that is used is too small. Consequently, Bengio and Senécal introduce a measure to calculate the effective sample size in order to protect against possible divergence. Finally, the authors report a speed-up factor of (19 ) over the regular softmax for this method.

## Adaptive Importance Sampling

Bengio and Senécal (2008) note that for Importance Sampling, substituting more complex distributions, e.g. bigram and trigram distributions, later in training to combat the divergence of the unigram distribution (Q) from the model’s true distribution (P) does not help, as n-gram distributions seem to be quite different from the distribution of trained neural language models. As an alternative, they propose an n-gram distribution that is adapted during training to follow the target distribution (P) more closely. To this end, they interpolate a bigram distribution and a unigram distribution according to some mixture function, whose parameters they train with SGD for different frequency bins to minimize the Kullback-Leibler divergence between the distribution ( Q ) and the target distribution ( P). For experiments, they report a speed-up factor of about (100).

## Target Sampling

Jean et al. (2015) propose to use Adaptive Importance Sampling for machine translation. In order to make the method more suitable for processing on a GPU with limited memory, they limit the number of target words that need to be sampled from. They do this by partitioning the training set and including only a fixed number of sample words in every partition, which form a subset (V’) of the vocabulary.

This essentially means that a separate proposal distribution (Q_i ) can be used for every partition (i) of the training set, which assigns equal probability to all words included in the vocabulary subset (V’_i) and zero probability to all other words.

## Noise Contrastive Estimation

Noise Contrastive Estimation (NCE) (Gutmann and Hyvärinen) [^{17}] is proposed by Mnih and Teh [^{18}] as a more stable sampling method than Importance Sampling (IS), as we have seen that IS poses the risk of having the proposal distribution (Q) diverge from the distribution (P) that should be optimized. In contrast to the former, NCE does not try to estimate the probability of a word directly. Instead, it uses an auxiliary loss that also optimises the goal of maximizing the probability of correct words.

Recall the pairwise-ranking criterion of Collobert and Weston (2008) that ranks positive windows higher than “corrupted” windows, which we discussed in the previous post. NCE does a similar thing: We train a model to differentiate the target word from noise. We can thus reduce the problem of predicting the correct word to a binary classification task, where the model tries to distinguish positive, genuine data from noise samples, as can be seen in Figure 4 below.

For every word (w_i) given its context (c_i ) of (n) previous words (w_{t-1} , cdots , w_{t-n+1}) in the training set, we thus generate (k) noise samples (tilde{w}_{ik}) from