Have you ever wondered how a chatbot can learn the meaning of words in a text? Does this sound interesting? Well, in this blog we will describe a very powerful method, Word2Vec, that maps words to numbers (vectors) in order to capture and distinguish their meaning. We will briefly describe how Word2Vec works without going into many technical details. And although it was originally developed for working with text, the algorithm turns out to be very useful in other domains as well, such as music recommendations. Here, we will cover these interesting applications of Word2Vec, so let’s get started!
How it works
Word2Vec comes in two different flavors: the CBOW and skip-gram models. We will explain the skip-gram model, which relies on a very simple idea. It is a neural network trained to do the following: given a specific word in a sentence (the input word), it tells us, for every other word in our vocabulary, the probability of that word being “nearby” the input word. What counts as “nearby” is controlled by a “window size” parameter of the algorithm. A typical window size is 5, meaning 5 words behind and 5 words ahead (10 in total). For example, if the input word is “bear”, the output probability is going to be much higher for the word “animal” than for unrelated words like “lemon” or “boat”.
We train the neural network by feeding it pairs of words that occur “nearby” in sentences. The example below shows some of the training samples (word pairs) we would take from the sentence “The big brown bear is sitting in a chair.” I’ve used a small window size of 2 just for the example, meaning that, for each word, we look at the 2 words before and after it to form training pairs.
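As a rough sketch, the pair extraction for a window size of 2 could look like the following (a simplified illustration, not gensim’s actual implementation):

```python
def skipgram_pairs(sentence, window=2):
    """Extract (input word, context word) training pairs from a sentence."""
    words = sentence.lower().replace(".", "").split()
    pairs = []
    for i, center in enumerate(words):
        # look at up to `window` words behind and ahead of the center word
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                pairs.append((center, words[j]))
    return pairs

pairs = skipgram_pairs("The big brown bear is sitting in a chair.")
# For the input word "bear" this yields the pairs
# ("bear", "big"), ("bear", "brown"), ("bear", "is"), ("bear", "sitting")
```

Note that “chair” is outside the window of “bear”, so the pair (“bear”, “chair”) is never generated with this window size.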
The network is going to learn statistics from the number of times each pairing shows up. For the word “bear”, we expect that words appearing in the context (nearby) of “bear” are words such as “animal”, or “big” and “brown” (as in the example). In the end, the network is going to see many more training samples of (“bear”, “animal”) than of (“bear”, “lemon”).
Every word in our vocabulary is represented as a vector of numbers. The length (dimension) of the vectors is a parameter of the algorithm. If we set it to 200, each word will be a 200-dimensional vector. The neural network learns those vectors, so each time the network sees two words occur together, such as (“bear”, “animal”), the vectors of these words are slightly modified so that they get closer. In a sense, those learned vectors represent the meaning of words, since similar words, or words that often occur together, end up represented by vectors that are close to each other.
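Here is a toy illustration of that “pulling closer” idea. The real training objective is more involved (softmax or negative sampling over the whole vocabulary), but the net effect on co-occurring words is the same: their vectors drift together.

```python
import math

def nudge_closer(v, w, lr=0.1):
    """Move each vector a small step toward the other (toy update rule)."""
    return ([a + lr * (b - a) for a, b in zip(v, w)],
            [b + lr * (a - b) for a, b in zip(v, w)])

def cosine(v, w):
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    return dot / (norm_v * norm_w)

# two made-up 3-dimensional word vectors
bear = [1.0, 0.0, 0.2]
animal = [0.1, 0.9, 0.4]

before = cosine(bear, animal)
for _ in range(20):  # the pair is seen 20 times in the corpus
    bear, animal = nudge_closer(bear, animal)
after = cosine(bear, animal)
# `after` is much larger than `before`: the words became "similar"
```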
Note that the algorithm captures the semantics of words only through how close they appear in a text. It does not learn any other kind of semantics, but the learned vectors usually capture a lot of information and can be very useful in many applications, as you will see in the following examples. The quality of the vectors depends mostly on the size of the corpus (number of sentences) used to train the model. If we have a small corpus of text documents, the results might be modest. In case you don’t have a large corpus, there are many pre-trained models that you can use off the shelf. For example, Google has released a pre-trained Word2Vec model, which includes word vectors for a vocabulary of 3 million words and phrases, trained on roughly 100 billion words from the Google News dataset.
Once the neural network has learned word vectors, we can apply standard vector operations on them, and the results are really interesting! Here are a couple of results you get using the Google pre-trained model. First, this would not be a real Word2Vec blog without the famous example of “King – Man + Woman = Queen”. Indeed, simple algebraic operations were performed on word vectors, and it was shown that the result of vector(“King”) – vector(“Man”) + vector(“Woman”) was closest to the vector representation of the word “Queen”. If we imagine the vectors in 3-dimensional space, this would look something like the next picture.
We see that the learned vectors capture relations like these:
1. Man is to woman as king is to queen (King – Man + Woman = Queen)
2. Building is to architect as software is to programmer (Software – Building + Architect = Programmer)
It was shown that, in a similar manner, this model learned the relations between countries and capitals, comparative and superlative forms of adjectives, verb tenses, and much more. Here is how we get these and some other results using Google’s word vectors and the Python gensim library:
```python
model.wv.most_similar(positive=['king', 'woman'], negative=['man'])
# [('queen', 0.711), ...]
```

The example from above. It works as follows: find words similar to “woman” and “king”, and dissimilar to “man”. The first word in the returned list is “queen”.
```python
model.wv.most_similar(positive=['software', 'architect'], negative=['building'])
# [('Software', 0.525), ('programmer', 0.517), ...]
```

The example from above: software – building + architect ~ programmer. This can be interpreted as “software is to programmer as building is to architect”. Makes sense.
```python
model.wv.most_similar(positive=['France', 'Rome'], negative=['Italy'])
# [('Paris', 0.719), ...]
```

Captures the relations between countries and capitals: France – Italy + Rome ~ Paris. This can be interpreted as “France is to Paris as Italy is to Rome”.
```python
model.wv.most_similar(positive=['dinosaur', 'human'], negative=['monkey'])
# [('dinosaurs', 0.508), ('fossil', 0.502), ...]
```

An interesting relation: dinosaur – monkey + human ~ fossil. This can be interpreted as “dinosaur is to fossil as monkey is to human”. Hmm… What does this mean? Humans are fossilized monkeys? Humans are what’s left over from monkeys? Interesting point 🙂
```python
model.wv.doesnt_match("breakfast cereal dinner lunch".split())
# 'cereal'
```

An example of how to find the word that does not belong in the given sequence.
```python
model.wv.most_similar(positive=['Italy', 'footballer'])
# [('legend_Roberto_Baggio', 0.661), ('midfielder_Rino_Gattuso', 0.635), ('skipper_Fabio_Cannavaro', 0.617), ...]
```

Here are some of the top results for words or phrases similar to both “Italy” and “footballer”.
```python
model.wv.similarity('woman', 'man')
# 0.73723527
```

Here is the cosine similarity between the vectors for “woman” and “man”. They are pretty close.
To wrap up, we saw a couple of examples where Word2Vec produces great results. Of course, it is not perfect and will sometimes make mistakes. Once again, the similarity between words is learned only from how often words occur near each other in the text, and the accuracy of the results depends mostly on the size of the corpus (number of sentences) on which the model was trained.
OK, after playing with vectors for a bit, let’s see some typical applications where Word2Vec can be used:
1. Text classification & Chatbots
Based on the previous examples, it becomes obvious that we can use word vectors to extract similar words, synonyms, the overall “meaning” of a text, etc. For example, vectors are really useful for text classification, where we want to know which topic or context a text refers to. Let’s say that we have predefined topics, each described by a set of keywords. In the simplest approach, for each text that we need to classify, we calculate the average vector of all keywords from the text and the average vector of each topic’s keywords. Then, simply by comparing those vectors, we can determine the “similarity” of the text and the topic.
This technique can be applied to chatbots. If you are building a chatbot that needs to answer a question, it needs to understand the meaning (topic) of the question. We can extract keywords from the question and calculate the average vector of all keywords to determine the meaning (topic), as described above.
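Here is a minimal sketch of that averaging approach, using tiny hand-made 3-dimensional vectors in place of real learned 200-dimensional ones (the words, vectors, and topics are made up for illustration):

```python
import math

# stand-in word vectors; in practice these come from a trained Word2Vec model
VECS = {
    "bear":  [0.9, 0.1, 0.0], "animal": [0.8, 0.2, 0.1],
    "lemon": [0.1, 0.9, 0.1], "fruit":  [0.2, 0.8, 0.0],
}

def average(words):
    """Element-wise average of the vectors of the given words."""
    vs = [VECS[w] for w in words]
    return [sum(col) / len(vs) for col in zip(*vs)]

def cosine(v, w):
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (math.sqrt(sum(a * a for a in v)) *
                  math.sqrt(sum(b * b for b in w)))

def best_topic(keywords, topics):
    """Pick the topic whose keyword average is closest to the text's average."""
    text_vec = average(keywords)
    return max(topics, key=lambda t: cosine(text_vec, average(topics[t])))

topics = {"animals": ["bear", "animal"], "food": ["lemon", "fruit"]}
print(best_topic(["bear"], topics))  # → animals
```

The same `best_topic` call would serve a chatbot deciding which topic a user’s question belongs to, given the keywords extracted from the question.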
2. Recommend items that occur together
We have already explained how Word2Vec learns vectors using a neural network. Basically, it looks at sentences, which are ordered sequences of words. If two words often occur close to each other within sentences, their vectors will be close and the words are considered similar. The same principle can be applied elsewhere, not only to sentences, whenever we have sequences of items. Here are a few examples:
A direct application of Word2Vec to a classical recommendation task was recently presented by Spotify. They abstracted the ideas behind Word2Vec to apply them not simply to words in sentences but to any object in any sequence, in this case to songs in a playlist. Songs are treated as words, and the other songs in a playlist as their surrounding context (nearby words). Now, in order to recommend songs to a user, one merely has to look at songs (vectors) similar to the songs the user already likes.
The same principle can be applied to an orders dataset, such as this one with 3 million Instacart orders. For each user, the dataset provides between 4 and 100 of their orders, with the sequence of products purchased in each order. By using Word2Vec we can infer which products are “similar”, where “similar” means that they often occur together within an order (they are purchased together). That way, we can recommend products while the user shops.
The same goes for recommendations in sports betting, where we use vector representations of users and bet types. We can use the betting history, so that the vectors of users and the types of bets they played are pushed closer to each other. This way, we can recommend bets matching a user’s profile based on the cosine similarity between the bet vectors and the user vector. This is something we tried out and implemented in one of our projects at SmartCat. Details of this approach will be described in one of our next blogs!
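In all of these examples the recipe is the same: turn each playlist, order, or betting history into a “sentence” of item IDs and feed those to Word2Vec. A minimal sketch of the data preparation (the product IDs and training parameters below are hypothetical):

```python
# each inner list is one order: a sequence of product IDs purchased together
orders = [
    [101, 102, 103],
    [101, 103, 104],
    [102, 105],
]

# Word2Vec expects "sentences" of string tokens, so stringify the IDs
sentences = [[str(product_id) for product_id in order] for order in orders]

# With gensim installed, training and querying take two more lines
# (hypothetical parameter values):
#   from gensim.models import Word2Vec
#   model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
#   model.wv.most_similar('101')  # products often purchased with product 101
```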
In this blog, we presented a very powerful algorithm that can infer vectors representing words in sentences, or any items that occur in sequences. These vectors can be used to calculate similarities, or to create numerical features for various machine learning models. There are many applications where Word2Vec is very useful, from chatbots to recommending music and bet tickets. We will write about some of these applications in the future as well, so stay tuned!