How to Compute Sentence Similarity Using BERT and Word2Vec

We often need to encode text data, including words, sentences, or documents, into high-dimensional vectors. Sentence embedding is an important step in various NLP tasks such as sentiment analysis and extractive summarization. A flexible sentence embedding library is needed to prototype fast and to tune for various contexts.

In the past, we mostly used encoders such as one-hot, term-frequency, or TF-IDF (term frequency weighted by inverse document frequency). However, these techniques do not capture the semantic and syntactic information of words. Recent advancements allow us to encode sentences or words in more meaningful forms. The word2vec technique and the BERT language model are two important ones. Note that, in this context, we use embedding, encoding, and vectorization interchangeably.
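To make the limitation of one-hot encoding concrete, here is a minimal sketch. The three-word vocabulary and the two-dimensional "dense" vectors are made-up values for illustration only; they do not come from any trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vocabulary
vocabulary = ["book", "novel", "car"]

def one_hot(word):
    """One dimension per vocabulary word; 1.0 marks the word itself."""
    return [1.0 if word == w else 0.0 for w in vocabulary]

# Any two distinct one-hot vectors are orthogonal, so related and
# unrelated words look equally dissimilar.
one_hot_sim = cosine_similarity(one_hot("book"), one_hot("novel"))  # 0.0

# Hypothetical dense vectors (illustrative values, not from a real model)
dense = {"book": [0.9, 0.1], "novel": [0.8, 0.3], "car": [0.1, 0.9]}
sim_book_novel = cosine_similarity(dense["book"], dense["novel"])
sim_book_car = cosine_similarity(dense["book"], dense["car"])
print(one_hot_sim, sim_book_novel > sim_book_car)
```

With dense vectors, related words can sit closer together than unrelated ones, which is the property that word2vec-style embeddings aim for.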

The open-source sent2vec Python library allows you to encode sentences with high flexibility. You currently have access to the standard encoders in the library. More advanced techniques will be added in later releases. In this article, I want to introduce this library and share lessons that I learned in this context.

If you are not familiar with the Word2Vec models, I recommend reading the article below, first. You will find out why Word2Vec models are simple yet revolutionary in machine learning.

— How to Use the “Sent2Vec” Python package

How to Install

Since sent2vec is a high-level library, it depends on spaCy (for text cleaning), Gensim (for word2vec models), and Transformers (for various forms of the BERT model). So make sure to install these libraries before installing sent2vec using the command below.

pip3 install sent2vec

How to Use BERT Method

If you want to use the BERT language model (more specifically, distilbert-base-uncased) to encode sentences for downstream applications, you must use the code below. Currently, the sent2vec library only supports the DistilBERT model. More models will be supported in the future. Since this is an open-source project, you can also dig into the source code and find more details of implementation.

from scipy import spatial
from sent2vec.vectorizer import Vectorizer

sentences = [
    "This is an awesome book to learn NLP.",
    "DistilBERT is an amazing NLP model.",
    "We can interchangeably use embedding, encoding, or vectorizing.",
]

# Encode the sentences with the default DistilBERT model
vectorizer = Vectorizer()
vectorizer.run(sentences)
vectors_bert = vectorizer.vectors

dist_1 = spatial.distance.cosine(vectors_bert[0], vectors_bert[1])
dist_2 = spatial.distance.cosine(vectors_bert[0], vectors_bert[2])
print('dist_1: {0}, dist_2: {1}'.format(dist_1, dist_2))
# dist_1: 0.043, dist_2: 0.192

You can compute the distance between sentences by using their vectors. In the example, as expected, the distance between vectors_bert[0] and vectors_bert[1] is less than the distance between vectors_bert[0] and vectors_bert[2].
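For readers who want to see what the distance measures, here is a small pure-Python sketch of cosine distance, defined as one minus the cosine similarity (the same definition that scipy.spatial.distance.cosine follows). The two-dimensional vectors are made-up values, not real sentence embeddings.

```python
import math

def cosine_distance(u, v):
    """1 - cos(angle between u and v); 0 means same direction, 1 means perpendicular."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Toy vectors for illustration only
a = [1.0, 0.0]
b = [2.0, 0.0]   # same direction as a, different length
c = [0.0, 3.0]   # perpendicular to a
d = [-1.0, 0.0]  # opposite direction to a

print(cosine_distance(a, b))  # 0.0
print(cosine_distance(a, c))  # 1.0
print(cosine_distance(a, d))  # 2.0
```

Note that the measure ignores vector length and only compares direction, which is why a and b are at distance zero.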

Note that the default vectorizer is distilbert-base-uncased, but it’s possible to pass the argument pretrained_weights to choose another BERT model. For example, you can use the code below to load the base multilingual model.

vectorizer = Vectorizer(pretrained_weights='distilbert-base-multilingual-cased')

How to Use Word2Vec Method

If you want to use a Word2Vec approach instead, you must pass a valid path to the model weights. Under the hood, the sentences will be split into lists of words using the sent2words method from the Splitter class. The library, first, extracts the most important words in sentences. Then, it computes the sentence embeddings using the average of vectors corresponding to those words. You can use the code below.

from scipy import spatial
from sent2vec.vectorizer import Vectorizer

sentences = [
    "Alice is in the Wonderland.",
    "Alice is not in the Wonderland.",
]

# PRETRAINED_VECTORS_PATH must be set to the path of your Word2Vec model weights
vectorizer = Vectorizer(pretrained_weights=PRETRAINED_VECTORS_PATH)
vectorizer.run(sentences, remove_stop_words=['not'], add_stop_words=[])
vectors_w2v = vectorizer.vectors

dist_w2v = spatial.distance.cosine(vectors_w2v[0], vectors_w2v[1])
print('dist_w2v: {}'.format(dist_w2v))
# dist_w2v: 0.11

It is possible to customize the list of stop-words by adding words to or removing words from the default list. Two additional arguments (both lists) must be passed when the vectorizer’s .run method is called: remove_stop_words and add_stop_words. Prior to any computation, it is crucial to investigate the stop-word list. The final results can be easily skewed by a small change in this step.

Note that you can use a pre-trained model or a customized one. This is crucial to obtain meaningful outcomes. You need a contextualized vectorization, and the Word2Vec model takes care of that. You just need to send the path to the Word2Vec model (i.e., PRETRAINED_VECTORS_PATH) when you initialize the vectorizer class.
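The idea of filtering stop-words and then averaging word vectors can be sketched as follows. This is not the sent2vec implementation: the function sentence_vector, the stop-word sets, and the two-dimensional word vectors are all hypothetical and chosen only to show how much the stop-word list matters.

```python
def sentence_vector(sentence, word_vectors, stop_words):
    """Average the vectors of the words that survive stop-word filtering."""
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    kept = [w for w in words if w not in stop_words and w in word_vectors]
    dim = len(next(iter(word_vectors.values())))
    if not kept:
        return [0.0] * dim
    return [sum(word_vectors[w][i] for w in kept) / len(kept) for i in range(dim)]

# Hypothetical toy word vectors (not from a trained model)
toy_vectors = {
    "alice": [1.0, 0.0],
    "wonderland": [0.0, 1.0],
    "not": [-1.0, 0.5],
}

# Assumed stop-word lists: the first one treats "not" as a stop-word
default_stop_words = {"is", "in", "the", "not"}
custom_stop_words = default_stop_words - {"not"}

s1 = "Alice is in the Wonderland."
s2 = "Alice is not in the Wonderland."

# With "not" filtered out, both sentences collapse to the same vector.
same_without_not = (
    sentence_vector(s1, toy_vectors, default_stop_words)
    == sentence_vector(s2, toy_vectors, default_stop_words)
)
print(same_without_not)  # True

# Keeping "not" separates the two sentences.
v1 = sentence_vector(s1, toy_vectors, custom_stop_words)
v2 = sentence_vector(s2, toy_vectors, custom_stop_words)
print(v1, v2)  # [0.5, 0.5] [0.0, 0.5]
```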

— What are the Best Sentence Encoders

The final outcomes of sentence encoding or embedding techniques are rooted in various factors such as a relevant stop-word list or a contextualized pre-trained model. You can find more explanations below.

  • Text Cleaning: Let’s say you use spaCy for the text cleaning step, as I did in the sent2vec library. If you forget to remove “not” from the default stop-word list, the sentence embedding results can be totally misleading. A simple word such as “not” can completely change the sentiment of a sentence. The default stop-word list differs in each environment. So, you must curate this list to your needs before any computation.
  • Contextualized Models: You must use contextualized models. For example, if the target data is in finance, you must use a model trained on a finance corpus. Otherwise, the outcomes of sentence embedding can be inaccurate. So, if you use the word2vec method with a general English model on such data, the sentence embedding results may be inaccurate.
  • Aggregation Strategy: When you compute sentence embeddings using the word2vec method, you may need a more advanced technique to aggregate word vectors than taking their average. Currently, the sent2vec library only supports the “average” technique. Using a weighted average to compute the sentence embedding is a simple enhancement that can improve the final outcomes. More advanced techniques will be supported in future releases.
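As a rough illustration of the weighted-average idea from the last bullet, here is a sketch under stated assumptions. It is not part of sent2vec; the function weighted_average, the toy vectors, and the weights are hypothetical. In practice, such weights might come from a scheme like inverse document frequency.

```python
def weighted_average(words, word_vectors, weights):
    """Weighted mean of word vectors; words without a weight default to 1.0."""
    dim = len(next(iter(word_vectors.values())))
    total = sum(weights.get(w, 1.0) for w in words)
    return [
        sum(weights.get(w, 1.0) * word_vectors[w][i] for w in words) / total
        for i in range(dim)
    ]

# Hypothetical toy word vectors (not from a trained model)
toy_vectors = {
    "alice": [1.0, 0.0],
    "not": [-1.0, 0.5],
    "wonderland": [0.0, 1.0],
}
words = ["alice", "not", "wonderland"]

# An empty weight table reduces to the plain average.
plain = weighted_average(words, toy_vectors, {})
# Assumed weight: count the negation twice as much as the other words.
weighted = weighted_average(words, toy_vectors, {"not": 2.0})

print(plain)     # [0.0, 0.5]
print(weighted)  # [-0.25, 0.5]
```

Giving more weight to informative words pulls the sentence vector toward them, which a plain average cannot do.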

To emphasize the significance of the word2vec model, I encoded a sentence using two different word2vec models (i.e., glove-wiki-gigaword-300 and fasttext-wiki-news-subwords-300). Then, I computed the cosine similarity between the two vectors: 0.005, which could be interpreted as “the two sentences are very different”. Wrong! Both vectors encode the same sentence. With this example, I want to demonstrate that the vector representations of a sentence can be nearly perpendicular if we use two different word2vec models. In other words, if you blindly compute the sentence embedding with a random word2vec model, you may be surprised in the process.
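One way to build intuition for this result is the toy sketch below. It assumes two embedding "spaces" that differ only by a 90-degree rotation, which is a deliberate simplification: real models such as GloVe and fastText are not exact rotations of each other. All vectors are made-up values.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rotate_90(v):
    """Rotate a 2-D vector by 90 degrees counter-clockwise."""
    x, y = v
    return [-y, x]

# Hypothetical sentence vectors in "model A"
space_a = {"s1": [1.0, 0.2], "s2": [0.9, 0.4]}
# "Model B" is assumed to be the same geometry, rotated
space_b = {name: rotate_90(vec) for name, vec in space_a.items()}

# Similarities within one space are preserved by the rotation.
within_a = cosine_similarity(space_a["s1"], space_a["s2"])
within_b = cosine_similarity(space_b["s1"], space_b["s2"])

# The same sentence compared across the two spaces looks perpendicular.
across = cosine_similarity(space_a["s1"], space_b["s1"])
print(abs(within_a - within_b) < 1e-12, across)  # True 0.0
```

The coordinates of an embedding are only meaningful relative to other vectors from the same model, so similarities should never be computed across models.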

— Sent2Vec is an Open-Source Library So …

sent2vec is an open-source library. The main goal of this project is to expedite building proofs of concept in NLP projects. A large number of NLP tasks, including summarization and sentiment analysis, need sentence vectorization. So, please consider contributing and pushing this project forward. I also hope you can use this library in your exciting NLP projects.

Article originally posted here by Pedram Ataee. Reposted with permission.
