

4 Easy Methods to Tokenize Your Data
Posted by ODSC Community, June 8, 2022

Recently, I have been exploring the world of Natural Language Processing (NLP). This field sits at the intersection of Machine Learning, Linguistics, and Computer Science and deals with how computers interpret and use human language. It is one of the most exciting parts of Data Science, as it can help businesses, for example, extract meaningful information from customer reviews.
In this article, I want to explain a crucial step in any NLP project, Tokenization, and how to implement it.
What is it?
Tokenization is the first data pre-processing method for almost any NLP project. It involves breaking down your input text into smaller segments such as words or sentences.
This is useful because it allows Machine Learning algorithms to operate effectively as your data is now in discrete packets instead of an unstructured heap.
Implementations
We will now go through some ways to implement Word Tokenization.
1. Python’s .split() Method
Python has a built-in method on string objects called .split(), which works as follows:
# name a string
string = "Hello, I'm Egor. What's your name?"
# call the split method
string.split()
Output:
['Hello,', "I'm", 'Egor.', "What's", 'your', 'name?']
This works pretty well! The only issue is that punctuation is kept as part of the word tokens (for example 'Hello,' and 'Egor.'). Ideally, each punctuation mark should be treated as its own token.
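If you want to stay within the standard library, one workaround (just a sketch using the re module, not part of .split() itself) is to match runs of word characters and individual punctuation marks as separate tokens:
# match runs of word characters (keeping apostrophes) or single punctuation marks
import re
re.findall(r"[\w']+|[^\w\s]", string)
Output:
['Hello', ',', "I'm", 'Egor', '.', "What's", 'your', 'name', '?']
Note that, unlike the NLTK output below, this keeps contractions such as "I'm" as a single token.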
2. Natural Language Toolkit (NLTK)
NLTK is a Python package that contains many tools and models for NLP and is aimed at teaching and research. The package provides a word tokenizer function conveniently named word_tokenize, which can be used as follows:
# import the package
import nltk
# download the language models ('punkt' is what word_tokenize needs)
nltk.download('punkt')
# define sentence
string = "Hello, I'm Egor. What's your name?"
# use tokenizer
from nltk.tokenize import word_tokenize
word_tokenize(string)
Output:
['Hello', ',', 'I', "'m", 'Egor', '.', 'What', "'s", 'your', 'name', '?']
Ah, this is much better! Each punctuation mark is now its own token.
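NLTK can also tokenize at the sentence level, the other granularity mentioned earlier, via sent_tokenize. A small sketch, reusing the same string and the models downloaded above:
# split the text into sentence tokens instead of word tokens
from nltk.tokenize import sent_tokenize
sent_tokenize(string)
Output:
["Hello, I'm Egor.", "What's your name?"]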
3. spaCy
spaCy is another NLP library; it is generally considered more modern than NLTK and provides production-ready software.
# download the spaCy English model (run once, in a Jupyter notebook)
!python -m spacy download en_core_web_sm
# import and tokenize (string is the same sample sentence defined above)
import spacy
model = spacy.load("en_core_web_sm")
doc = model(string)
tokens = []
for token in doc:
    tokens.append(token.text)
tokens
Output:
['Hello', ',', 'I', "'m", 'Egor', '.', 'What', "'s", 'your', 'name', '?']
We can see that spaCy's output is the same as NLTK's.
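An advantage of spaCy is that each token is a rich object with attributes rather than a plain string. As a small sketch (reusing the doc object above, not something the tokenizer does by default), you could drop the punctuation tokens using the is_punct attribute:
# keep only the tokens spaCy does not flag as punctuation
[token.text for token in doc if not token.is_punct]
Output:
['Hello', 'I', "'m", 'Egor', 'What', "'s", 'your', 'name']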
4. Gensim
Gensim is primarily a topic modelling library, but it also contains a tokenize function, which can be used as follows:
# import package
from gensim.utils import tokenize
# define sentence
string = "Hello, I'm Egor. What's your name?"
# get tokens (tokenize returns a generator, so wrap it in list())
list(tokenize(string))
Output:
['Hello', 'I', 'm', 'Egor', 'What', 's', 'your', 'name']
The useful part of Gensim is that it splits on punctuation, so punctuation marks are not included in the token list at all! Note that it also splits contractions such as "I'm" into 'I' and 'm'.
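Gensim's tokenize also accepts a few optional flags. For example (a small sketch; the lowercase flag is assumed available in recent Gensim versions), you can lower-case the tokens as they are produced, which is handy before building a bag-of-words model:
# lower-case tokens while tokenizing (lowercase flag assumed available in your Gensim version)
list(tokenize(string, lowercase=True))
Output:
['hello', 'i', 'm', 'egor', 'what', 's', 'your', 'name']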
Conclusion
In this article, we have shown how to tokenize your data for your NLP project using four different approaches. I have attached links to the libraries below so that you can explore their functionality further!
- NLTK: https://www.nltk.org
- spaCy: https://spacy.io
- Gensim: https://radimrehurek.com/gensim/