

Analysis of Emotion Data: A Dataset for Emotion Recognition Tasks
Posted by Parul Pandey, January 12, 2022

Emotion recognition is a common classification task. For instance, given a tweet, you create a model to classify the tweet as either positive or negative. However, human emotions are far more varied and cannot be constrained to just these two categories. Even so, most of the datasets available for this purpose cover only two polarities, positive and negative, and at times a third, neutral.
However, I recently came across a dataset constructed from Twitter data that seems to fill this void. The dataset, known as the emotion dataset, contains English-language Twitter messages labeled with six emotions: sadness, joy, love, anger, fear, and surprise. In this article, we’ll get to know the background of the data collection and explore the data a bit.
Emotion Classification Dataset
The emotion dataset comes from the paper CARER: Contextualized Affect Representations for Emotion Recognition by Saravia et al. The authors constructed a set of hashtags to collect a dataset of English tweets from the Twitter API covering eight basic emotions: anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. The data has already been preprocessed based on the approach described in their paper. The dataset is stored as a pandas dataframe and is ready to be used in an NLP pipeline. The distribution of the data and the list of hashtag examples for each emotion are provided below.

Accessing the dataset
We have seen the dataset. Let’s now see how to access it. I’ll demonstrate two ways to load the dataset and use it.
1. Loading data directly from the source
The dataset is already available as a pandas dataframe. There is an accompanying notebook showing how to use it for fine-tuning a pre-trained language model for emotion classification. It can be easily accessed as follows:
!wget https://www.dropbox.com/s/607ptdakxuh5i4s/merged_training.pkl

# Defining a helper function to load the data
import pickle

def load_from_pickle(directory):
    # Open the file in binary mode and close it once loading is done
    with open(directory, "rb") as f:
        return pickle.load(f)

# Loading the data
data = load_from_pickle(directory="merged_training.pkl")
Let’s look at the first few rows of the data.
data.head()

First few rows of the dataset | Image by Author
We can see the first five rows of the dataset containing tweets and their corresponding labels. From here, we can split the dataset into training, validation, and test sets and train an emotion classifier. It is also a good idea to do some exploratory text analysis to understand the data. We’ll get to that in a bit, but before that, let me show you another way to load the dataset.
2. Downloading data from the Hugging Face 🤗 Dataset Hub.
The Hugging Face datasets library provides an API to download public datasets and preprocess them easily. You can refer to this video on Hugging Face datasets to get started.

The main page of HuggingFace datasets Hub | source: Huggingface website
We’ll first install the datasets library and import the necessary modules.
!pip install datasets
from datasets import load_dataset
emotion_dataset = load_dataset("emotion")
This creates an emotion_dataset object.
emotion_dataset

emotion_dataset object | Image by Author
What we get is a dictionary with each key corresponding to a different split. Let’s access the training data to see its contents.
emotion_train = emotion_dataset['train']
print(emotion_train[0])
print(emotion_train.column_names)
print(emotion_train.features)

Training dataset | Image by Author
The training dataset comprises six different classes — sadness, joy, love, anger, fear, and surprise.
Converting emotion datasets into a Pandas DataFrame
We can now easily convert the above dataset into a pandas dataframe and analyze it further.
import pandas as pd

emotion_dataset.set_format(type="pandas")
train = emotion_dataset["train"][:]
test = emotion_dataset["test"][:]
val = emotion_dataset["validation"][:]
Let’s quickly check if the conversion was successful.
train.head()

First few rows of the training dataset | Image by Author
Yes, indeed! Now you know how to access datasets from the Hugging Face Dataset Hub, which is another addition to your data science skill set. So what do you do when you have the data? You explore it, and that is precisely what we are going to do in the next section.
Exploratory Data Analysis of the emotion dataset
We’ll start by importing the necessary libraries and visualizing the data. As we already know, the data has been preprocessed, so that is a bonus. We’ll typically look for imbalance in the dataset and length of the tweets to start with. Beyond that, feel free to dive in further.
Notebooks to follow along: Exploratory Data Analysis of the emotion dataset
import numpy as np
import pandas as pd
import string
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style='whitegrid', palette='muted', font_scale=1.2)
colors = ["#01BEFE", "#FFDD00", "#FF7D00", "#FF006D", "#ADFF02", "#8F00FF"]
sns.set_palette(sns.color_palette(colors))
matplotlib.rcParams['figure.figsize'] = 12, 8
Creating a column with label names.
The label column currently has integers. To make it more understandable, we’ll create a new column called description containing the emotion name for each integer in the label column.
labels_dict = {0: 'sadness', 1: 'joy', 2: 'love', 3: 'anger', 4: 'fear', 5: 'surprise'}
train['description'] = train['label'].map(labels_dict)
train.head()
Image by Author
Analysis of the Description Column
Now, let’s analyze the description column and see what it looks like. I have only used the training dataset, but the process remains the same if we wish to do it for the test dataset, too.
Examples of each emotion
Let’s look at an example of each of the emotions.

Examples of various tweets by emotions | Image by Author
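The code that produced the table above is not shown here, so the following is only a minimal sketch of one way to pull a single example per emotion. It assumes a dataframe with the same text and description columns as train; the toy_train rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the `train` dataframe (rows invented for illustration).
toy_train = pd.DataFrame({
    "text": [
        "i feel so alone tonight",
        "i am thrilled about the trip",
        "i feel furious at the delay",
        "i feel miserable and tired",
        "i am delighted with the result",
    ],
    "description": ["sadness", "joy", "anger", "sadness", "joy"],
})

# Keep the first row encountered for each emotion.
examples = toy_train.groupby("description").head(1)

for _, row in examples.iterrows():
    print(f"{row['description']:>8}: {row['text']}")
```

On the real data, replacing toy_train with train should work the same way. Swapping head(1) for sample(n=1, random_state=0) would give a random example per emotion instead of the first one.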
Distribution of the labels in the training set
It’ll be informative to look at the distribution of the labels. This will also give us an idea of the imbalance in the dataset, if any.
train['description'].value_counts(normalize=True)
sns.countplot(x=train['description'], order=train['description'].value_counts(normalize=True).index)

Distribution of target column in the dataset | Image by Author
About 33 percent of the tweets are joyful, followed by sad and angry tweets.
Analyzing Text Statistics
We can now do some statistical analysis to explore the fundamental characteristics of the text data. Some of the analyses which can be helpful are:
- Text length analysis: calculating the length of the text, and
- Word frequency analysis: calculating the word count in the form of unigrams, bigrams, and trigrams.
train['text_length'] = train['text'].astype(str).apply(len)
train['text_word_count'] = train['text'].apply(lambda x: len(str(x).split()))
Tweet length analysis
sns.histplot(x=train['text_length'], kde=True, stat='density')  # distplot is deprecated in recent seaborn releases
plt.xlim([0, 512]);
plt.xlabel('Text Length');

Tweet length analysis | Image by Author
The histogram above shows that the length of the tweet ranges from around 2 to 300 characters.
Tweet word count analysis
Now let’s analyze the number of words per tweet for each class.
sns.boxplot(x="description", y="text_word_count", data=train)

Tweet word count analysis | Image by Author
Most of the tweets are around 15 words long. Also, the tweets appear to have more or less the same length across all classes. Hence, the length of a tweet isn’t a powerful indicator of its emotion.
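The comparison in the box plot can also be checked numerically. The snippet below is a small sketch that summarizes word counts per emotion with a groupby. The toy_train rows are invented for illustration; on the real data, the same aggregation could be run on train, which already has the text_word_count column.

```python
import pandas as pd

# Hypothetical stand-in for the `train` dataframe (rows invented for illustration).
toy_train = pd.DataFrame({
    "text": [
        "i feel so alone tonight",
        "i feel miserable and tired today",
        "i am thrilled about the trip",
        "i am delighted",
    ],
    "description": ["sadness", "sadness", "joy", "joy"],
})

# Number of whitespace-separated words in each tweet.
toy_train["text_word_count"] = toy_train["text"].str.split().str.len()

# Median, mean, and maximum word count for each emotion.
summary = (
    toy_train.groupby("description")["text_word_count"]
    .agg(["median", "mean", "max"])
)
print(summary)
```

If the medians are close across emotions, as the box plot suggests, tweet length is unlikely to help much as a feature.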
Distribution of top n-grams
An n-gram is a contiguous sequence of n items from a given sample of text or speech. It is a good idea to look at various n-grams to understand which words mainly occur together. For instance, we look at the distribution of unigrams, bigrams, and trigrams for three emotions: sadness, anger, and love.
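The plotting code for the n-gram charts is not included in this article, so here is a minimal sketch of how top n-grams could be counted using only the standard library. The helper name top_ngrams and the sample tweets are hypothetical. The sketch splits on whitespace and does not remove stop words, which a fuller analysis would probably do.

```python
from collections import Counter

def top_ngrams(texts, n=2, k=3):
    """Return the k most common word n-grams across an iterable of strings."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        # Zipping n shifted copies of the token list yields contiguous n-grams.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return [(" ".join(gram), count) for gram, count in counts.most_common(k)]

# Hypothetical tweets standing in for one emotion class.
sad_tweets = [
    "i feel so alone",
    "i feel so tired of this",
    "i feel like crying",
]

print(top_ngrams(sad_tweets, n=1, k=3))  # unigrams
print(top_ngrams(sad_tweets, n=2, k=2))  # bigrams
print(top_ngrams(sad_tweets, n=3, k=1))  # trigrams
```

To apply this per emotion on the real data, one option is to pass train.loc[train['description'] == 'sadness', 'text'] as the texts argument and plot the returned pairs as a bar chart.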

Tweets from Sad category | Image by Author

Tweets from angry category | Image by Author

Tweets from ‘Love’ category | Image by Author
Now that we have done some preliminary exploration of the dataset, the next step is to use it to create an emotion classifier. This could be a great project to add to your resume, and you can also share your trained model with the community. If you want to quickly spin up a notebook to explore the data, I have also made it available on Kaggle.
Emotion Dataset for Emotion Recognition Tasks
Originally posted here. Reposted with permission.