If you’re up-to-date with progress in natural language processing research, you’ve probably heard of word2vec and the word vectors it produces.
Word2vec is a neural network configuration that ingests sentences to learn word embeddings, or vectors of continuous numbers representing individual words.
The neural network accepts a word, which is first mapped to a one-hot vector (all 0s, except for a single 1). That vector is used as input to a fully connected hidden layer with linear activation functions (i.e. no nonlinear transformation) called an embedding layer. The embedding layer is followed by a softmax output layer with one neuron for each word in the vocabulary. The objective is to accurately predict, for every word in the vocabulary, the probability that it appears near the input word (in a small window of surrounding words). Put simply, we’re attempting to model the word’s context accurately.
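The forward pass described above can be sketched in a few lines of numpy. Everything here is a toy assumption for illustration: a made-up five-word vocabulary, 3-dimensional embeddings, and randomly initialized (untrained) weights.

```python
import numpy as np

# Hypothetical toy setup: 5-word vocabulary, 3-dimensional embeddings.
vocab = ["husband", "wife", "king", "queen", "royalty"]
V, D = len(vocab), 3
rng = np.random.default_rng(0)

W_embed = rng.normal(size=(V, D))  # input -> embedding weights
W_out = rng.normal(size=(D, V))    # embedding -> output weights

def forward(word_index):
    """One-hot input -> linear embedding layer -> softmax over the vocabulary."""
    one_hot = np.zeros(V)
    one_hot[word_index] = 1.0
    hidden = one_hot @ W_embed           # equivalent to selecting row word_index
    logits = hidden @ W_out
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

probs = forward(vocab.index("king"))
# probs is a probability distribution over the vocabulary (it sums to 1);
# training would push it toward the observed context words for "king".
```

Note that multiplying a one-hot vector by `W_embed` just selects one row of the matrix, which is why the "linear hidden layer" amounts to a table lookup.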
Once performance is deemed satisfactory, the output layer is detached. All that remains is the input and the dense embedding layer. The latter gives us access to what we want: the learned word vectors.
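A minimal sketch of that extraction step, with a hypothetical (here randomly filled, in practice trained) embedding matrix: each word's vector is simply its row of the matrix.

```python
import numpy as np

# Hypothetical embedding matrix: one row per vocabulary word, 3 dimensions.
# In practice these weights would come from training; random values stand in here.
rng = np.random.default_rng(0)
vocab = ["husband", "wife", "king", "queen", "royalty"]
W_embed = rng.normal(size=(len(vocab), 3))

def word_vector(word):
    """The learned word vector is just the embedding-matrix row for that word."""
    return W_embed[vocab.index(word)]

vec = word_vector("queen")  # a dense 3-dimensional vector
```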
Representing a word as a list of numbers can seem baffling, as can the reasons for wanting to. There’s no perfect explanation for why this works. However, some intuition can explain why a word vector is an acceptable and useful way to represent this information.
Why Word Vectors?
The primary use of word vectors is to infer something about what words mean from how their vectors relate to one another. That can be strange to think about, but it helps us solve very sophisticated tasks in natural language processing.
For example, think about the basic setup for an analogy: A is to B as X is to Y. Say, husband is to wife as king is to queen. We need to understand something about the relationship between a husband and wife to recognize that a similar relationship holds between a king and queen. You could give someone a collection of texts on the subject and let them slowly start to figure out what the commonality is.
This is very complex for a machine to understand. How do we encode information about how a husband and wife are related? How do we discover that the same commonality holds for royalty? And how do we determine the difference between regular and royal couples?
Word vectors offer an intuitive (albeit imperfect) solution. Using a vector representation in N dimensions, we can analyze the vectors’ orientations relative to one another and make comparisons.
The husband and wife word vectors will be oriented in a very specific way, with a given angle and distance from each other. We might observe a similar orientation between the king and queen word vectors, within a given tolerance. This is purely heuristic, but it seems to work pretty well. Plus, it’s the closest we have to capturing the concept of an analogy in a statistical way.
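The offset comparison above can be made concrete with cosine similarity. The vectors below are hand-picked toy values, not learned embeddings; they are chosen so the two offsets line up exactly, whereas real word vectors only approximate this.

```python
import numpy as np

# Toy, hand-picked vectors purely for illustration (not learned).
vectors = {
    "husband": np.array([0.9, 0.1, 0.0]),
    "wife":    np.array([0.1, 0.9, 0.0]),
    "king":    np.array([0.9, 0.1, 0.8]),
    "queen":   np.array([0.1, 0.9, 0.8]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The "husband -> wife" offset should point the same way as "king -> queen".
offset_couple = vectors["wife"] - vectors["husband"]
offset_royal = vectors["queen"] - vectors["king"]
similarity = cosine(offset_couple, offset_royal)  # 1.0 for these toy vectors
```

With learned embeddings the similarity would be high but below 1.0, which is why analogy tests are usually scored within a tolerance rather than exactly.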
A similarly elementary operation will allow us to understand how concepts ‘mesh’ to create new ones. For example, we might add the vectors for man and royalty together. You can probably guess that man + royalty = king. Learned word vectors can arrive at the same conclusion.
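A sketch of that addition, again with toy vectors chosen by hand so the arithmetic works out exactly; real learned vectors behave only approximately like this, and the summed vector's nearest neighbor (excluding the summands themselves) is what we read off as the answer.

```python
import numpy as np

# Hand-picked toy vectors for illustration, not learned embeddings.
vectors = {
    "man":     np.array([1.0, 0.0, 0.0]),
    "woman":   np.array([0.0, 1.0, 0.0]),
    "royalty": np.array([0.0, 0.0, 1.0]),
    "king":    np.array([1.0, 0.0, 1.0]),
    "queen":   np.array([0.0, 1.0, 1.0]),
}

def nearest(target, exclude):
    """Word whose vector has the highest cosine similarity to target."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    candidates = {w: v for w, v in vectors.items() if w not in exclude}
    return max(candidates, key=lambda w: cos(candidates[w], target))

combined = vectors["man"] + vectors["royalty"]
answer = nearest(combined, exclude={"man", "royalty"})  # -> "king"
```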
Word Vectors In Practice
Naturally, the quality of our derived word vectors depends on the amount of training data we have, instability during training, and other factors. These are difficult to account for when building a model that is, ultimately, heuristic in nature.
With that said, word vectors are a novel way to represent natural language information that might be lost in other encodings. The extremely active field of natural language processing will undoubtedly find new ways to exploit the approach soon.