24 Useful Open Datasets for Natural Language Processing

Natural language processing forms the foundation of innovation in artificial intelligence. We want machines that sound like us, understand us, and take on tasks previously only possible through human interaction. The company or developer that finally cracks the language code will usher in a new era of human-machine collaboration, and getting there takes data, which is why so many unique NLP datasets keep appearing.

We’re getting close to that reality. Until then, developers can build and train models with these open datasets for natural language processing.

General NLP Datasets

Wikipedia Links Data: With around 13 million documents and corresponding hyperlinks, this massive NLP dataset treats each page as an entity. It’s available through the Google Code archive.

Penn Treebank: Drawn from Wall Street Journal articles, this corpus remains one of the most popular benchmarks for evaluating sequence labeling models.
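Penn Treebank parses are distributed as bracketed, S-expression-style strings. As a rough illustration of working with that style of annotation, a minimal parser might look like this (the example sentence and tags are made up for illustration, not drawn from the corpus):

```python
# Minimal sketch: parsing a Penn Treebank-style bracketed parse string
# into nested (label, children) tuples.

def parse_tree(s):
    """Parse '(S (NP (DT The) (NN dog)) (VP (VBZ barks)))' into nested tuples."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()

    def read(i):
        assert tokens[i] == "("
        label = tokens[i + 1]
        i += 2
        children = []
        while tokens[i] != ")":
            if tokens[i] == "(":
                child, i = read(i)
                children.append(child)
            else:
                children.append(tokens[i])  # terminal word
                i += 1
        return (label, children), i + 1

    tree, _ = read(0)
    return tree

tree = parse_tree("(S (NP (DT The) (NN dog)) (VP (VBZ barks)))")
# tree[0] is the root label "S"; leaves are recovered by walking the tuples
```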

NLTK: While not a specific dataset per se, this Python library offers over 100 corpora and related lexical resources for computational linguistics and other NLP fields. Plus, users can take advantage of the NLTK Book, a free guide to working with the library.

Universal Dependencies: UD offers a framework for consistent annotation of grammar across languages. It covers over 100 languages, with roughly 200 treebanks and over 300 community contributors.
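UD treebanks are distributed in the ten-column CoNLL-U format. As a sketch, a tiny hand-rolled reader for that format could look like this (the two-token sample sentence is invented):

```python
# Minimal sketch of reading CoNLL-U, the ten-column, tab-separated format
# used by Universal Dependencies treebanks.

SAMPLE = """\
1\tDogs\tdog\tNOUN\t_\t_\t2\tnsubj\t_\t_
2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_
"""

FIELDS = ["id", "form", "lemma", "upos", "xpos", "feats",
          "head", "deprel", "deps", "misc"]

def parse_conllu(text):
    tokens = []
    for line in text.strip().split("\n"):
        if not line or line.startswith("#"):
            continue  # skip comment lines and blank sentence separators
        tokens.append(dict(zip(FIELDS, line.split("\t"))))
    return tokens

tokens = parse_conllu(SAMPLE)
# tokens[0]["form"] == "Dogs"; tokens[1]["deprel"] == "root"
```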


Sentiment Analysis

IMDB Reviews: A (relatively) small dataset of 25,000 labeled movie reviews for training (with another 25,000 for testing) takes advantage of how honest people are with their movie opinions. This NLP dataset is a good choice for those getting their feet wet in sentiment analysis.
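A common first baseline on review data like this is simple lexicon-based scoring: count positive words, subtract negative ones. A toy sketch (the word lists below are illustrative assumptions, not part of the dataset):

```python
# Toy sketch of lexicon-based sentiment scoring, the kind of baseline often
# tried first on movie reviews. The word lists are invented for illustration.

POSITIVE = {"great", "brilliant", "moving", "funny"}
NEGATIVE = {"boring", "awful", "predictable", "flat"}

def sentiment_score(review: str) -> int:
    """Positive words add one, negative words subtract one."""
    words = review.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment_score("A brilliant and moving film"))   # positive score
print(sentiment_score("Boring plot and awful acting"))  # negative score
```

Real systems quickly outgrow this approach, but it makes the labeled-review setup concrete.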

Stanford Sentiment Treebank: An NLP dataset built from Rotten Tomatoes movie reviews, this option offers longer phrases and more nuanced examples of text-based data.

The Blog Authorship Corpus: This collection of blog posts comprises over 680,000 posts and some 140 million words, with each blog provided as a separate file.

Amazon Product Dataset: With over 140 million product reviews and their metadata, this dataset provides a large collection of tagged reviews, associated links, and relevant information gleaned from Amazon between 1996 and 2014.

Text Datasets

WordNet: This lexical database loosely resembles a thesaurus, connecting words via synonym clusters (synsets). It includes well over 100,000 synsets linked to one another through conceptual relationships.

20 Newsgroups: A collection of roughly 20,000 documents spanning 20 different newsgroups on a range of subjects, this set remains popular for a variety of text projects, such as classification and clustering.
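The classic experiment on 20 Newsgroups is bag-of-words classification. The pipeline can be sketched on tiny in-memory documents instead of the real corpus (assumes scikit-learn is installed; the documents and labels below are invented):

```python
# Minimal sketch of a bag-of-words text classification pipeline of the kind
# commonly benchmarked on 20 Newsgroups, run here on toy in-memory documents.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = [
    "the goalie saved the penalty kick",      # sports-flavored
    "the striker scored in the final match",  # sports-flavored
    "install the driver and reboot the pc",   # computer-flavored
    "my graphics card needs a new driver",    # computer-flavored
]
labels = ["sport", "sport", "computer", "computer"]

# Count word occurrences, then fit a multinomial Naive Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, labels)

print(model.predict(["the match ended after a late goal"]))
print(model.predict(["reboot after installing the new driver"]))
```

On the real corpus, the only change is swapping the toy lists for the fetched newsgroup posts and their group labels.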

UCI’s Spambase: Hewlett-Packard originally created this dataset to help train spam filters. It contains emails labeled spam or non-spam, with the legitimate messages drawn from work and personal accounts.

Billion Word Benchmark: This language modeling dataset comes from the WMT 2011 News Crawl and contains close to one billion words for evaluating novel language modeling techniques.
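Language modeling benchmarks like this one are typically scored with perplexity, the exponentiated average negative log-probability the model assigns to held-out text. A toy unigram version on made-up text shows the computation:

```python
# Toy sketch of computing perplexity, the standard metric for language
# modeling benchmarks, using a Laplace-smoothed unigram model on made-up text.
import math
from collections import Counter

train = "the cat sat on the mat".split()
test = "the cat sat".split()

counts = Counter(train)
total = len(train)
vocab_size = len(counts)

def unigram_prob(w, alpha=1.0):
    # Laplace smoothing so unseen words still get nonzero probability
    return (counts[w] + alpha) / (total + alpha * vocab_size)

log_prob = sum(math.log(unigram_prob(w)) for w in test)
perplexity = math.exp(-log_prob / len(test))
print(perplexity)  # lower means the model finds the test text less surprising
```

Real benchmarks use neural models with far larger vocabularies, but the metric is the same.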

Speech and Audio

MuST-C: A multilingual speech translation corpus, this set includes several hundred hours of audio drawn from TED Talks and covers translation from English into multiple target languages. It falls under a Creative Commons license.

VOICES (Voices Obscured In Complex Environmental Settings): A dataset designed for speech and signal processing, this set was recorded using far-field microphones in noisy conditions. The recordings include multiple sessions using 12 microphones placed around the room.

TIMIT: A dataset designed for evaluating automatic speech recognition systems, this collection includes recordings of 630 speakers across eight dialects of American English, along with time-aligned transcripts.

MaSS (Multilingual corpus of Sentence-aligned Spoken utterances): This extension of the CMU Wilderness Multilingual Speech Dataset offers over 8,000 clean, parallel spoken utterances across eight languages, recorded as readings from the New Testament.

Question and Answer 

Stanford Question Answering Dataset (SQuAD): This reading comprehension dataset consists of questions posed by crowdworkers on Wikipedia articles. Version 2.0 combines 100,000 answerable questions with over 50,000 unanswerable ones written adversarially to resemble answerable ones.
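SQuAD ships as nested JSON: articles contain paragraphs, paragraphs contain question-answer pairs, and v2.0 marks unanswerable questions with an is_impossible flag. A minimal walk over a made-up record of the same shape:

```python
# Sketch of the SQuAD 2.0 JSON layout using a tiny invented record; the real
# files follow the same data -> paragraphs -> qas nesting.
import json

record = json.loads("""
{
  "data": [{
    "title": "Sample_Article",
    "paragraphs": [{
      "context": "Paris is the capital of France.",
      "qas": [
        {"id": "q1", "question": "What is the capital of France?",
         "is_impossible": false,
         "answers": [{"text": "Paris", "answer_start": 0}]},
        {"id": "q2", "question": "What is the capital of Spain?",
         "is_impossible": true, "answers": []}
      ]
    }]
  }]
}
""")

for para in record["data"][0]["paragraphs"]:
    for qa in para["qas"]:
        status = "unanswerable" if qa["is_impossible"] else "answerable"
        print(qa["id"], status)
```

The answer_start field gives the character offset of each answer span within the paragraph's context string.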

Natural Questions: A corpus with over 300,000 training examples, roughly 7,800 development examples, and roughly 7,800 test examples. Each one pairs a real Google query with a corresponding Wikipedia page.

TriviaQA: This realistic question set is more challenging than typical benchmark datasets. It includes 95,000 question-answer pairs backed by evidence documents, with both human-verified and machine-generated subsets.

CLEVR (Compositional Language and Elementary Visual Reasoning): A synthetic visual question answering dataset built from 3D-rendered objects, it pairs each scene with questions in several categories, along with full attribute annotations for the visual scene.

Chatbot Training

Ubuntu Dialogue Corpus: Almost one million two-person conversations, these dialogs are drawn from Ubuntu technical support chat logs.

ConvAI3 Dataset: This contains more than 2000 dialogs from a PersonaChat competition. Plus, human evaluators chatted with bots submitted by different teams. 

MultiWOZ: This dataset is substantially larger than previous task-oriented corpora. It provides a collection of fully labeled, multi-turn conversations spanning multiple domains.

Relational Strategies in Customer Service Dataset: RSiCS offers a collection of customer service dialogs spanning travel-related topics. It can improve the relational abilities of intelligent virtual agents.

Transforming the Way AI Relates to Humans with NLP Datasets

These open NLP datasets could be just the thing developers need to build the next great AI language product, offering excellent resources for building better language capabilities.

Let us know if we’ve missed any of your favorite NLP datasets. Alternatively, be the first to tell us about up-and-coming sets to watch for in the comments.

Learn More About NLP and NLP Datasets at ODSC West 2021

ODSC West 2021, coming to San Francisco this November 16th-18th, will feature a plethora of talks, workshops, and training sessions on NLP and NLP datasets. You can register now for 30% off all ticket types before the discount drops to 20% in a few weeks. Some highlighted sessions on NLP and NLP datasets include:

  • Transferable Representation in Natural Language Processing: Kai-Wei Chang, PhD | Director/Assistant Professor | UCLA NLP/UCLA CS
  • Build a Question Answering System using DistilBERT in Python: Jayeeta Putatunda | Data Scientist | MediaMath
  • Introduction to NLP and Topic Modeling: Zhenya Antić, PhD | NLP Consultant/Founder | Practical Linguistics Inc
  • NLP Fundamentals: Leonardo De Marchi | Lead Instructor | ideai.io

Elizabeth Wallace, ODSC

Elizabeth is a Nashville-based freelance writer with a soft spot for startups. She spent 13 years teaching language in higher ed and now helps startups and other organizations explain - clearly - what it is they do. Connect with her on LinkedIn here: https://www.linkedin.com/in/elizabethawallace/