This article is the second in a two-part series about the evolution of word embedding as told through five research papers. It picks up in the midst of the 1990s. To view the first article, click here.
A Shift Towards Automatic Feature Generation: Latent Dirichlet Allocation
Around the same time that latent semantic analysis came to the forefront, artificial neural networks that relied on conceptual representations were also being deployed. Some of the major neural innovations that arose during this period, as recalled by Swedish language technology company Gavagai, include self-organizing maps and simple recurrent networks. Both techniques come in handy for large datasets, with the former succeeding at identifying categories and the latter at identifying patterns.
A few years down the road, latent Dirichlet allocation (LDA) made its debut, becoming one of the most widespread generative methods underpinning topic models. The 2003 paper Latent Dirichlet Allocation (Blei et al.) breaks down the task of modeling text corpora, describing how LDA clusters documents on the basis of word occurrence. In this case, each document is typically represented by a vector of fixed length. While LDA can be applied to other problems that concern collections of discrete data, it is perhaps most commonly used to derive topics from the documents within a corpus. “The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words” (Blei et al. 2003). Compared to probabilistic LSA, LDA is far less prone to overfitting when generalizing to new documents. All in all, LDA exemplifies another important step forward in computationally representing words and extracting meaning from these representations.
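To make the quoted idea concrete, here is a minimal sketch of LDA's generative story in Python, using only the standard library. The toy topics, the `alpha` values, and the function names are all assumptions made up for illustration, not anything from Blei et al. The sketch only runs the model forward (mixture, then topic, then word); it does not attempt the inference step that actually fits topics to a corpus.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """Generate one document as a random mixture over latent topics."""
    # Per-document topic mixture (the "random mixture over latent topics").
    theta = sample_dirichlet(alpha, rng)
    topic_ids = list(range(len(topics)))
    words = []
    for _ in range(length):
        # Choose a latent topic for this position, then a word from that topic.
        z = rng.choices(topic_ids, weights=theta)[0]
        vocab = list(topics[z].keys())
        probs = list(topics[z].values())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Hypothetical toy topics: each topic is a distribution over words.
TOPICS = [
    {"boat": 0.4, "water": 0.4, "sail": 0.2},
    {"model": 0.5, "vector": 0.3, "corpus": 0.2},
]

rng = random.Random(0)
theta, doc = generate_document(TOPICS, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta)
print(doc)
```

Fitting LDA to real text means running this story in reverse: given only the words, estimate the topic mixtures and the topic-word distributions.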
Modern-day Models of Word Embedding
In the wake of LSA, LDA, and their shared roles in topic modeling, neural language models came into greater focus later in the 2000s. The underlying components of neural language models are fundamentally the same as those of simple recurrent networks. Moving away from the elements of LSA and LDA that are grounded in information retrieval, neural language models return to considering words instead of documents as contexts. According to Gavagai, “the document-based models capture semantic relatedness (e.g. ‘boat’ – ‘water’) while the word-based models capture semantic similarity (e.g. ‘boat’ – ‘ship’).”
The 2008 paper A Unified Architecture for Natural Language Processing details how researchers Ronan Collobert and Jason Weston strove to carve out a single system that would learn a range of relevant features to tackle high-level semantic tasks. To do so, they trained a deep neural network consisting of multiple layers. The first layer captured word-level features, the second incorporated sentence-level features, and all others were standard neural network layers. It was within the first layer that words got mapped into vectors to ultimately be processed by the succeeding layers. Since the tasks being carried out (part-of-speech tagging, chunking, named entity recognition, semantic role labeling, language modeling, and finding semantically related words) were all related, only the final layers of the neural network needed to be task-specific.
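The sharing pattern described above can be sketched in a few lines of Python. This is a loose illustration under stated assumptions, not Collobert and Weston's implementation: the class names, the toy vocabulary, and the label sets are hypothetical, the sentence-level layer and all training are omitted, and weights are simply random. The point is only that one word-vector table feeds several task-specific output layers.

```python
import random

class SharedLookupTable:
    """First layer: maps each word to a vector shared by every task."""

    def __init__(self, vocab, dim, rng):
        self.vectors = {
            word: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for word in vocab
        }

    def embed(self, words):
        return [self.vectors[word] for word in words]

class TaskHead:
    """Final layer: a linear scorer over one task's own label set."""

    def __init__(self, labels, dim, rng):
        self.weights = {
            label: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for label in labels
        }

    def predict(self, vector):
        # Score every label against the word vector and return the best one.
        scores = {
            label: sum(w * x for w, x in zip(ws, vector))
            for label, ws in self.weights.items()
        }
        return max(scores, key=scores.get)

rng = random.Random(0)
table = SharedLookupTable(["the", "boat", "sailed"], dim=4, rng=rng)

# Two hypothetical tasks reuse the same table; only the heads differ.
pos_head = TaskHead(["DET", "NOUN", "VERB"], dim=4, rng=rng)
ner_head = TaskHead(["O", "ENTITY"], dim=4, rng=rng)

for word, vector in zip(["the", "boat", "sailed"], table.embed(["the", "boat", "sailed"])):
    print(word, pos_head.predict(vector), ner_head.predict(vector))
```

Because the weights here are untrained, the predicted labels are meaningless. In a trained system, updates from every task would flow back into the shared table, which is what lets the word vectors benefit from all the tasks at once.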
An exciting outcome of Collobert and Weston’s unified architecture was the achievement of state-of-the-art performance in semantic role labeling without any explicit syntactic features, which was especially thrilling because syntax was often considered crucial for this task. Looking at the big picture, this particular research endeavor shows how word embedding can enhance the performance of the tasks built on top of it. It additionally exemplifies how the rise of deep learning made neural language models the dominant setting for modern word embedding.
The Vector Boom
What really bolstered the prominence of word embedding in the NLP community was the 2013 launch of word2vec. Created by a research team led by Google’s Tomas Mikolov, word2vec is a two-layer neural net toolkit for training word embeddings and making use of pre-trained ones. word2vec relies on two training strategies: continuous bag-of-words (CBOW) and the skipgram. Consider the target word — the word that we are trying to predict. CBOW uses the n words before and after the target word to make the prediction. Skipgram can be thought of as the inverse of this: it attempts to use a given word to predict the surrounding n words before and after it. At the heart of both of these strategies is an algorithmic effort to find the best possible word vector representations for predicting nearby words.
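The difference between the two strategies is easiest to see in the training examples each one builds from the same sentence. The sketch below is a simplified illustration, not word2vec's actual code: the function names and the example sentence are made up, and it only constructs the examples without training any vectors.

```python
def cbow_examples(tokens, n):
    """Return (context_words, target_word) pairs: the context predicts the target."""
    examples = []
    for i, target in enumerate(tokens):
        # Up to n words before and n words after the target.
        context = tokens[max(0, i - n):i] + tokens[i + 1:i + 1 + n]
        if context:
            examples.append((context, target))
    return examples

def skipgram_examples(tokens, n):
    """Return (center_word, context_word) pairs: the center predicts each neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - n), min(len(tokens), i + n + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the boat sailed across the water".split()

print(cbow_examples(sentence, 2)[2])
# (['the', 'boat', 'across', 'the'], 'sailed')

print(skipgram_examples(sentence, 2)[:3])
# [('the', 'boat'), ('the', 'sailed'), ('boat', 'the')]
```

CBOW produces one example per position with the whole window as input, while skipgram produces one example per (center, neighbor) pair, so the same sentence yields many more skipgram examples.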
The paper Distributed Representations of Words and Phrases and their Compositionality takes stock of the newly introduced skipgram model, explaining how Mikolov et al. managed to train it on several orders of magnitude more data than previously published models. Training the skipgram model avoids dense matrix multiplications, and in place of a full softmax over the vocabulary the paper uses either the hierarchical softmax or negative sampling, a simplified variant of Noise Contrastive Estimation. Both reduce the computational cost of training. As a result of the increased amount of data, Mikolov et al. attained significantly improved quality for word and phrase representations, even for uncommon entities.
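To see where the savings come from, compare a full softmax, which has to score every word in the vocabulary, with the negative sampling objective from the paper (a simplified relative of Noise Contrastive Estimation), which scores only the true context word and a handful of noise words. The sketch below is an illustration with made-up toy vectors. As a simplifying assumption, the noise word indices are passed in by hand rather than drawn from a noise distribution as the paper does.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def full_softmax_loss(center_vec, output_vecs, target_index):
    """Negative log-probability of the target; touches every vocabulary word."""
    scores = [dot(center_vec, out) for out in output_vecs]
    log_norm = math.log(sum(math.exp(s) for s in scores))
    return log_norm - scores[target_index]

def negative_sampling_loss(center_vec, output_vecs, target_index, negative_indices):
    """Loss over the true context word plus the given noise words only."""
    # Push the true pair's score up...
    loss = -math.log(sigmoid(dot(center_vec, output_vecs[target_index])))
    # ...and push each noise pair's score down.
    for k in negative_indices:
        loss -= math.log(sigmoid(-dot(center_vec, output_vecs[k])))
    return loss

# Hypothetical toy vectors: one center word and a four-word vocabulary.
center = [0.5, -0.2, 0.1]
outputs = [
    [0.4, -0.1, 0.3],   # index 0: the true context word
    [-0.3, 0.2, 0.0],
    [0.1, 0.1, -0.4],
    [-0.2, -0.5, 0.2],
]

print(full_softmax_loss(center, outputs, 0))
print(negative_sampling_loss(center, outputs, 0, negative_indices=[1, 3]))
```

With a four-word vocabulary the difference is invisible, but with a vocabulary of hundreds of thousands of words, scoring a few noise words per example instead of the whole vocabulary is what makes training on very large corpora practical.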
Although word2vec is not an instance of deep learning (CBOW and skipgram are examples of shallow neural networks), word embedding is essential to deep learning as we know it and as it shall continue to evolve. Deeplearning4j emphasizes that word embeddings “can form the basis of search, sentiment analysis, and recommendations” in a breadth of data-driven fields like “scientific research, legal discovery, e-commerce and customer relationship management.” Originating as a basic theory of meaning and how it can be quantified, word embedding has transformed as our computational resources have expanded in both intelligence and processing speed. As the NLP community plunges ever deeper into deep learning, we can only expect that word embedding will find even greater utility going forward.
Kaylen Sanders, ODSC
I currently study Computational Linguistics as an M.S. candidate at Brandeis University. I received my Bachelor's degree from the University of Pittsburgh where I explored linguistics, computer science, and nonfiction writing. I'm interested in the crossroads where language and technology meet.