4 Easy Methods to Tokenize Your Data

Recently, I have been exploring the world of Natural Language Processing (NLP). This field sits at the intersection of Machine Learning, Linguistics, and Computer Science and deals with how computers interpret and use human language. It is one of the most exciting parts of Data Science, as it can help businesses extract meaningful information from, for example, customer reviews.

In this article, I want to explain Tokenization, a crucial step in any NLP project, and show how to implement it.

What is it?

Tokenization is the first data pre-processing step in almost any NLP project. It involves breaking your input text down into smaller segments, such as words or sentences.

This is useful because it allows Machine Learning algorithms to operate effectively as your data is now in discrete packets instead of an unstructured heap.

Implementations

We will now go through some ways to implement Word Tokenization.

1. Python’s .split() Method

Python has a built-in method on string objects called .split() that works as follows:

# define a string
string = "Hello, I'm Egor. What's your name?"
# split on whitespace (the default behaviour)
string.split()

Output:

['Hello,', "I'm", 'Egor.', "What's", 'your', 'name?']

This works pretty well! The only issue is that it includes the punctuation as part of the word token. Ideally, the punctuation should be considered as its own token.
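If you want punctuation as separate tokens without leaving the standard library, a regular expression can help. Here is a minimal sketch using Python's built-in re module (note that it splits contractions such as "I'm" more aggressively than the libraries below):

# import the regular expressions module
import re
# match either a run of word characters, or a single
# non-word, non-whitespace character (i.e. punctuation)
re.findall(r"\w+|[^\w\s]", string)

Output:

['Hello', ',', 'I', "'", 'm', 'Egor', '.', 'What', "'", 's', 'your', 'name', '?']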

2. Natural Language Toolkit (NLTK)

NLTK is a Python package that contains many tools and models for NLP and is aimed at learning and research. The NLTK package provides a word tokenizer function conveniently named word_tokenize. This can be implemented as follows:

# import the package
import nltk
# download the tokenizer models (only needed once)
nltk.download('punkt')
# import and call the word tokenizer
from nltk.tokenize import word_tokenize
word_tokenize(string)

Output:

['Hello', ',', 'I', "'m", 'Egor', '.', 'What', "'s", 'your', 'name', '?']

Ah, this is much better! The punctuation has been tokenized.
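As a side note, NLTK can also tokenize at the sentence level, which covers the "sentences" case mentioned earlier. A quick sketch using its sent_tokenize function:

# import and call the sentence tokenizer
from nltk.tokenize import sent_tokenize
sent_tokenize(string)

Output:

["Hello, I'm Egor.", "What's your name?"]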

3. spaCy

spaCy is another NLP library but is considered to be more ‘up-to-date’ than NLTK and provides production-ready software.

# download the spaCy English model (run once, in a Jupyter notebook)
!python -m spacy download en_core_web_sm
# load the model and run it over the string
import spacy
nlp = spacy.load("en_core_web_sm")
[token.text for token in nlp(string)]

Output:

['Hello', ',', 'I', "'m", 'Egor', '.', 'What', "'s", 'your', 'name', '?']

We can see that spaCy's output is the same as NLTK's.
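A bonus of spaCy is that each token object carries linguistic attributes. As a small sketch (reusing the nlp model loaded above), you can drop punctuation tokens with the is_punct flag:

# keep only the tokens that are not punctuation
[token.text for token in nlp(string) if not token.is_punct]

Output:

['Hello', 'I', "'m", 'Egor', 'What', "'s", 'your', 'name']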

4. Gensim

Gensim is primarily a topic-modelling library, but it also contains a tokenize function, which can be used as follows:

# import the tokenize function
from gensim.utils import tokenize
# tokenize returns a generator, so wrap it in a list
list(tokenize(string))

Output:

['Hello', 'I', 'm', 'Egor', 'What', 's', 'your', 'name']

A useful feature of Gensim's tokenizer is that it splits on punctuation, so punctuation marks are not included in the token list at all!
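If you also want normalised tokens, gensim's tokenize accepts a lowercase argument; a quick sketch:

# lowercase everything while tokenizing
list(tokenize(string, lowercase=True))

Output:

['hello', 'i', 'm', 'egor', 'what', 's', 'your', 'name']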

Conclusion

In this article, we have shown how to tokenize your data for an NLP project using four different libraries. I have attached links to the libraries below so that you can explore their functionality further!

Originally posted here by Egor Howell. Reposted with permission.
