On word embeddings – Part 1
Deep Learning | Modeling | NLP/Text Analytics | posted by Sebastian Ruder, April 2, 2017
Table of contents:
- A brief history of word embeddings
- Word embedding models
Unsupervisedly learned word embeddings have been exceptionally successful in many NLP tasks and are frequently seen as something akin to a silver bullet. In fact, in many NLP architectures, they have almost completely replaced traditional distributional features such as Brown clusters and LSA features.
Proceedings of last year’s ACL and EMNLP conferences have been dominated by word embeddings, with some people musing that Embedding Methods in Natural Language Processing was a more fitting name for EMNLP. This year’s ACL features not one but two workshops on word embeddings.
Semantic relations between word embeddings seem nothing short of magical to the uninitiated and Deep Learning NLP talks frequently prelude with the notorious king − man + woman ≈ queen slide, while a recent article in Communications of the ACM hails word embeddings as the primary reason for NLP’s breakout.
This post will be the first in a series that aims to give an extensive overview of word embeddings, showcasing why this hype may or may not be warranted. In the course of this review, we will try to connect the disparate literature on word embedding models, highlighting many models, applications and interesting features of word embeddings, with a focus on multilingual embedding models and word embedding evaluation tasks in later posts.
This first post lays the foundations by presenting current word embeddings based on language modelling. While many of these models have been discussed at length, we hope that investigating and discussing their merits in the context of past and current research will provide new insights.
A brief note on nomenclature: In the following we will use the currently prevalent term word embeddings to refer to dense representations of words in a low-dimensional vector space. Interchangeable terms are word vectors and distributed representations. We will particularly focus on neural word embeddings, i.e. word embeddings learned by a neural network.
A brief history of word embeddings
Since the 1990s, vector space models have been used in distributional semantics. During this time, many models for estimating continuous representations of words have been developed, including Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). Have a look at this blog post for a more detailed overview of distributional semantics history in the context of word embeddings.
Bengio et al. coined the term word embeddings in 2003 and trained them in a neural language model jointly with the model’s parameters. The first to show the utility of pre-trained word embeddings were arguably Collobert and Weston in 2008. Their landmark paper A unified architecture for natural language processing not only establishes word embeddings as a useful tool for downstream tasks, but also introduces a neural network architecture that forms the foundation for many current approaches. However, the eventual popularization of word embeddings can be attributed to Mikolov et al. in 2013, who created word2vec, a toolkit that allows the seamless training and use of pre-trained embeddings. In 2014, Pennington et al. released GloVe, a competitive set of pre-trained word embeddings, signalling that word embeddings had reached the mainstream.
Word embeddings are one of the few currently successful applications of unsupervised learning. Their main benefit arguably is that they don’t require expensive annotation, but can be derived from large unannotated corpora that are readily available. Pre-trained embeddings can then be used in downstream tasks that use small amounts of labeled data.
Word embedding models
Naturally, every feed-forward neural network that takes words from a vocabulary as input and embeds them as vectors in a lower-dimensional space, which it then fine-tunes through back-propagation, yields word embeddings as the weights of its first layer, which is usually referred to as the Embedding Layer.
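To make the "weights of the first layer" point concrete, here is a minimal sketch in plain Python. The toy vocabulary, the embedding size of 3 and the helper names (`one_hot`, `vector_times_matrix`, `embed`) are assumptions made up for illustration and do not come from any particular model. It shows that feeding a one-hot encoded word through a linear first layer amounts to selecting one row of that layer's weight matrix, and that row is the word's embedding.

```python
import random

def one_hot(index, size):
    """Return a one-hot list of length `size` with 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vector_times_matrix(vector, matrix):
    """Multiply a row vector of length |V| by a |V| x d matrix."""
    dims = len(matrix[0])
    return [sum(vector[i] * matrix[i][j] for i in range(len(matrix)))
            for j in range(dims)]

# Hypothetical toy vocabulary and embedding size, chosen for illustration.
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_index = {word: i for i, word in enumerate(vocab)}
embedding_dim = 3

random.seed(0)
# First-layer weights: one row of `embedding_dim` values per vocabulary word.
# In a real network these would be updated by back-propagation.
embedding_matrix = [[random.uniform(-0.1, 0.1) for _ in range(embedding_dim)]
                    for _ in vocab]

def embed(word):
    """Look up a word's embedding, i.e. its row in the first-layer weights."""
    return embedding_matrix[word_to_index[word]]

# The one-hot product and the direct row lookup give the same vector.
via_matmul = vector_times_matrix(one_hot(word_to_index["cat"], len(vocab)),
                                 embedding_matrix)
via_lookup = embed("cat")
print(via_matmul)
print(via_lookup)
```

This equivalence is why embedding layers are typically implemented as a table lookup rather than an explicit matrix product: the result is the same, but the lookup avoids multiplying by a vector that is almost entirely zeros.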
The main difference between such a network that produces word embeddings as a by-product and a method such as word2vec whose explicit goal is the generation of word embeddings is its computational complexity. Generating word embeddings with a very deep architecture is simply too computationally expensive for a large vocabulary. This is the main reason why it took until 2013 for word embeddings to explode onto the NLP stage; computational complexity is a key trade-off for word embedding models and will be a recurring theme in our review.
Another difference is the training objective: word2vec and GloVe are geared towards producing word embeddings that encode general semantic relationships, which are beneficial to many downstream tasks; notably, word embeddings trained this way won’t be helpful in tasks that do not rely on these kinds of relationships. In contrast, regular neural networks typically produce task-specific embeddings that are only of limited use elsewhere. Note that a task that relies on semantically coherent representations such as language modelling will produce similar embeddings to word embedding models, which we will investigate in the next chapter.
As a side-note, word2vec and GloVe might be said to be to NLP what VGGNet is to vision, i.e. a common weight initialisation that provides generally helpful features without the need for lengthy training.
To facilitate comparison between models, we assume the following notational standards: We assume a training corpus containing a sequence of $T$ training words $w_1, w_2, w_3, \cdots, w_T$ that belong to a vocabulary $V$ whose size is $|V|$. Our models generally consider a context of $n$ words. We associate every word with an input embedding $v_w$ (the eponymous word embedding in the Embedding Layer) with $d$ dimensions and an output embedding $v'_w$ (another word representation whose role will soon become clearer). We finally optimize an objective function $J_\theta$ with regard to our model parameters $\theta$ and our model outputs some score $f_\theta(x)$ for every input $x$.
A note on language modelling
Word embedding models are quite closely intertwined with language models. The quality of language models is measured based on their ability to learn a probability distribution over words in $V$. In fact, many state-of-the-art word embedding models try to predict the next word in a sequence to some extent. Additionally, word embedding models are often evaluated using perplexity, a cross-entropy based measure borrowed from language modelling.
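As a rough illustration of how perplexity relates to cross-entropy, the sketch below computes it as the exponential of the average negative log-probability that a model assigns to each word of a test sequence. The function name `perplexity` and the probability values are hypothetical and chosen only for the example. Natural logarithms are assumed here; any base works as long as the exponentiation uses the same one.

```python
import math

def perplexity(word_probabilities):
    """Perplexity of a sequence, given the model's probability for each word.

    Computed as exp of the average negative log-probability, i.e. the
    exponentiated cross-entropy (in nats) of the model on the sequence.
    """
    if not word_probabilities:
        raise ValueError("need at least one word probability")
    cross_entropy = -sum(math.log(p) for p in word_probabilities) / len(word_probabilities)
    return math.exp(cross_entropy)

# Hypothetical probabilities two models assign to the same four-word sequence.
uniform_guesses = [0.25, 0.25, 0.25, 0.25]   # as unsure as a 4-way coin flip
confident_guesses = [0.9, 0.8, 0.95, 0.85]   # mostly sure of each next word

print(perplexity(uniform_guesses))
print(perplexity(confident_guesses))
```

A model that assigns probability 1/4 to every observed word has a perplexity of about 4, which can be read as being as uncertain as a uniform choice among four words. Lower perplexity means the model found the sequence less surprising.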
Before we get into the gritty details of word embedding models, let us briefly talk about some language modelling fundamentals.
Language models generally try to compute the probability of a word $w_t$ given its $n-1$ previous words, i.e. $p(w_t \mid w_{t-1}, \cdots, w_{t-n+1})$. By applying the chain rule together with the Markov assumption, we can approximate the probability of a whole sentence or document by the product of the probabilities of each word given its $n-1$ previous words: