How to Compute Sentence Similarity Using BERT and Word2Vec

We often need to encode text data, such as words, sentences, or documents, into high-dimensional vectors. Sentence embedding is an important step in many NLP tasks such as sentiment analysis and extractive summarization. A flexible sentence embedding library is needed to prototype quickly and to tune for various contexts.

In the past, we mostly used encoders such as one-hot, term-frequency, or TF-IDF (a.k.a. normalized term-frequency). However, these techniques did not capture the semantic and syntactic information of words. Recent advancements allow us to encode sentences and words in more meaningful forms. The word2vec technique and the BERT language model are two important ones. Note that, in this context, we use embedding, encoding, and vectorization interchangeably.
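To see what the classic encoders miss, here is a minimal sketch of a term-frequency (count) encoder in plain Python. The vocabulary and sentences are made up for illustration; no library is involved.

```python
# A minimal term-frequency encoder, to illustrate what the classic
# techniques capture -- and what they miss.

def term_frequency(sentence, vocabulary):
    """Encode a sentence as raw term counts over a fixed vocabulary."""
    words = sentence.lower().split()
    return [words.count(term) for term in vocabulary]

vocab = ["this", "movie", "film", "is", "great", "terrible"]
v1 = term_frequency("This movie is great", vocab)
v2 = term_frequency("This film is great", vocab)

# "movie" and "film" are near-synonyms, yet their count vectors differ:
# the encoding has no notion of word meaning.
print(v1)  # [1, 1, 0, 1, 1, 0]
print(v2)  # [1, 0, 1, 1, 1, 0]
```

The two sentences mean almost the same thing, yet their vectors disagree in two positions; word2vec and BERT were designed to close exactly this gap.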

The open-source sent2vec Python library lets you encode sentences with high flexibility. It currently provides the standard encoders; more advanced techniques will be added in later releases. In this article, I want to introduce this library and share the lessons that I learned in this context.

How to Use the “Sent2Vec” Python Package

How to Install

Since sent2vec is a high-level library, it depends on spaCy (for text cleaning), Gensim (for word2vec models), and Transformers (for various forms of the BERT model). Make sure to install these libraries before installing sent2vec with the command below.

pip3 install sent2vec

How to Use BERT Method

If you want to use the BERT language model (more specifically, distilbert-base-uncased) to encode sentences for downstream applications, you can use the code below. Currently, the sent2vec library supports only the DistilBERT model; more models will be supported in the future. Since this is an open-source project, you can also dig into the source code for more implementation details.

from scipy import spatial
from sent2vec.vectorizer import Vectorizer
sentences = [
    "This is an awesome book to learn NLP.",
    "DistilBERT is an amazing NLP model.",
    "We can interchangeably use embedding, encoding, or vectorizing.",
]
vectorizer = Vectorizer()
vectorizer.bert(sentences)
vectors_bert = vectorizer.vectors
dist_1 = spatial.distance.cosine(vectors_bert[0], vectors_bert[1])
dist_2 = spatial.distance.cosine(vectors_bert[0], vectors_bert[2])
print('dist_1: {0}, dist_2: {1}'.format(dist_1, dist_2))
# dist_1: 0.043, dist_2: 0.192

You can compute the distance between sentences using their vectors. In the example, as expected, the distance between vectors_bert[0] and vectors_bert[1] is less than the distance between vectors_bert[0] and vectors_bert[2].
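For reference, the cosine distance used above is one minus the cosine of the angle between the two vectors. A small self-contained equivalent of scipy.spatial.distance.cosine, written here for illustration:

```python
import math

def cosine_distance(u, v):
    """Cosine distance: 1 - cos(angle between u and v).
    0 means the vectors point the same way; 1 means they are orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Parallel vectors give distance 0; orthogonal vectors give distance 1.
print(cosine_distance([1.0, 0.0], [2.0, 0.0]))  # 0.0
print(cosine_distance([1.0, 0.0], [0.0, 3.0]))  # 1.0
```

Because the measure depends only on direction, not magnitude, it is a natural fit for comparing embeddings.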

How to Use Word2Vec Method

If you want to use a word2vec approach instead, you must first split sentences into lists of words using the sent2words method of the Splitter class in this library. You can customize the stop-word list by adding to or removing from the default list. It is crucial to inspect the stop-word list before any computation, since a small change in this step can easily skew the final results.

Once you have extracted the most important words of each sentence, you can compute the sentence embeddings using the word2vec method of the Vectorizer class. This method computes the average of the vectors corresponding to the remaining (i.e., the most important) words, as in the code below.

from scipy import spatial
from sent2vec.vectorizer import Vectorizer
from sent2vec.splitter import Splitter
sentences = [
    "Alice is in the Wonderland.",
    "Alice is not in the Wonderland.",
]
splitter = Splitter()
splitter.sent2words(sentences=sentences, remove_stop_words=['not'], add_stop_words=[])
print(splitter.words)
# [['alice', 'wonderland'], ['alice', 'not', 'wonderland']]
vectorizer = Vectorizer()
# PRETRAINED_VECTORS_PATH points to a pre-trained word2vec model on disk.
vectorizer.word2vec(splitter.words, pretrained_vectors_path=PRETRAINED_VECTORS_PATH)
vectors_w2v = vectorizer.vectors
dist_w2v = spatial.distance.cosine(vectors_w2v[0], vectors_w2v[1])
print('dist_w2v: {}'.format(dist_w2v))
# dist_w2v: 0.11

As seen above, you can use different word2vec models by passing a model path (i.e., PRETRAINED_VECTORS_PATH) to the word2vec method. You can use a pre-trained model or a customized one. This configuration is crucial for obtaining meaningful outcomes: you need a contextualized vectorization, and the word2vec model takes care of that.
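One way to obtain such a path is Gensim's downloader API, which can fetch a pre-trained model and return its local file location. This is a sketch, not part of sent2vec itself; the helper name is mine, and the download is several hundred megabytes on first use.

```python
def pretrained_vectors_path(name="glove-wiki-gigaword-300"):
    """Download (on first use) a pre-trained model via Gensim's downloader
    and return the local file path. Hypothetical helper for illustration."""
    import gensim.downloader as api
    # return_path=True returns the file path instead of loading into memory.
    return api.load(name, return_path=True)

# PRETRAINED_VECTORS_PATH = pretrained_vectors_path()  # large download
```

Any other word2vec-format file on disk works the same way; what matters, as discussed below, is that the model fits your domain.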

What Is the Best Sentence Encoder?

The final outcomes of sentence encoding or embedding techniques depend on various factors, such as a relevant stop-word list or a contextualized pre-trained model. You can find more explanations below.

  • Text Cleaning — Say you use spaCy for the text-cleaning step, as I did in the sent2vec library. If you mistakenly forget to remove “not” from the default stop-word list, the sentence embedding results can be totally misleading, because a simple “not” can completely change the sentiment of a sentence. The default stop-word list differs across environments, so you must curate this list to your needs before any computation.
  • Contextualized Models — You must use contextualized models. For example, if the target data is in finance, you should use a model trained on a finance corpus; if you instead apply the word2vec method with a general English model to such domain-specific text, the sentence embedding results may be inaccurate.
  • Aggregation Strategy — When you compute sentence embeddings with the word2vec method, you may need a more advanced technique to aggregate word vectors than taking their plain average. Currently, the sent2vec library supports only the “average” technique. Using a weighted average to compute the sentence embedding is a simple enhancement that can improve the final outcomes. More advanced techniques will be supported in future releases.
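The weighted-average idea from the last bullet can be sketched in a few lines of NumPy. This is not sent2vec code; the function name is mine, and the weights (which could come from, e.g., per-word TF-IDF scores) are made up for illustration.

```python
import numpy as np

def weighted_sentence_vector(word_vectors, weights):
    """Aggregate word vectors into one sentence vector by a weighted average,
    so more informative words pull the sentence vector toward themselves."""
    weights = np.asarray(weights, dtype=float)
    vectors = np.asarray(word_vectors, dtype=float)
    return (weights[:, None] * vectors).sum(axis=0) / weights.sum()

# Two 3-dimensional "word vectors"; the second word is weighted 3x the first.
sentence_vec = weighted_sentence_vector([[1.0, 0.0, 2.0], [3.0, 4.0, 0.0]],
                                        weights=[1.0, 3.0])
print(sentence_vec)  # [2.5 3.  0.5]
```

With uniform weights this reduces to the plain average that sent2vec currently uses.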

To emphasize the significance of the word2vec model, I encoded the same sentence using two different word2vec models (i.e., glove-wiki-gigaword-300 and fasttext-wiki-news-subwords-300) and computed the cosine similarity between the two resulting vectors: 0.005, which one might interpret as “these are two very different sentences.” Wrong! This example demonstrates that the vector representations of a single sentence can be nearly perpendicular when produced by two different word2vec models. In other words, if you blindly compute sentence embeddings with a random word2vec model, you may be surprised by the results.

Sent2Vec is an Open-Source Library So …

sent2vec is an open-source library whose main goal is to expedite building proofs of concept in NLP projects. A large number of NLP tasks need sentence vectorization, including summarization and sentiment analysis. So, please consider contributing and pushing this project forward. I also hope you can use this library in your exciting NLP projects.

Article originally posted here by Pedram Ataee, PhD. Reposted with permission.
