Analysis of Emotion Data: A Dataset for Emotion Recognition Tasks

Emotion recognition is a common classification task. For instance, given a tweet, you build a model to classify it as either positive or negative. However, human emotions are far more varied and cannot be reduced to such coarse categories. Yet most of the datasets available for this purpose cover only polarity: positive, negative, and at times neutral.

Recently, however, I came across a dataset constructed from Twitter data that seems to fill this void. The dataset, known as the emotion dataset, contains English-language Twitter messages labeled with six emotions: sadness, joy, love, anger, fear, and surprise. In this article, we’ll get to know the background of the data collection and explore the data a bit.

Emotion Classification Dataset

The emotion dataset comes from the paper CARER: Contextualized Affect Representations for Emotion Recognition by Saravia et al. The authors constructed a set of hashtags to collect a dataset of English tweets from the Twitter API covering eight basic emotions: anger, anticipation, disgust, fear, joy, sadness, surprise, and trust (the version loaded later in this article carries six labels). The data has already been preprocessed following the approach described in the paper. The dataset is stored as a pandas dataframe and is ready to be used in an NLP pipeline. The distribution of the data and the list of hashtag examples for each emotion are provided below.

Source: https://aclanthology.org/D18-1404/

Accessing the dataset

We have seen the dataset. Let’s now see how to access it. I’ll demonstrate two ways to load the dataset and use it.

1. Loading data directly from the source

The dataset is already available as a pandas dataframe. There is an accompanying notebook showing how to use it for fine-tuning a pre-trained language model for emotion classification. It can be easily accessed as follows:

!wget https://www.dropbox.com/s/607ptdakxuh5i4s/merged_training.pkl

# Defining a helper function to load the data
import pickle

def load_from_pickle(directory):
    # Only unpickle files that come from a source you trust
    with open(directory, "rb") as f:
        return pickle.load(f)

# Loading the data
data = load_from_pickle(directory="merged_training.pkl")

Let’s look at the first few rows of the data.


First few rows of the dataset | Image by Author

We can see the first five rows of the dataset, containing tweets and their corresponding labels. From here, we can split the dataset into training, validation, and test sets and train an emotion classifier. It is also a good idea to do some exploratory text analysis to understand the data. We’ll get to that in a bit, but before that, let me show you another way to load the dataset.
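As a brief aside, the split mentioned above could look something like the sketch below. It is only an illustration: the toy DataFrame, the column names, the `stratified_split` helper, and the 80/10/10 proportions are assumptions of mine, not part of the original notebook. The idea is to sample within each emotion so that the class proportions stay roughly the same in every split.

```python
import pandas as pd

def stratified_split(df, label_col, train_frac=0.8, val_frac=0.1, seed=42):
    """Split df into train/validation/test, sampling within each label."""
    # Draw the training portion from every label group
    train = df.groupby(label_col).sample(frac=train_frac, random_state=seed)
    rest = df.drop(train.index)
    # Share of the remaining rows that should go to validation
    val_share = val_frac / (1.0 - train_frac)
    val = rest.groupby(label_col).sample(frac=val_share, random_state=seed)
    test = rest.drop(val.index)
    return train, val, test

# Toy stand-in for the real data (hypothetical column names)
toy = pd.DataFrame({
    "text": [f"tweet {i}" for i in range(40)],
    "label": ["joy"] * 20 + ["sadness"] * 20,
})
train_df, val_df, test_df = stratified_split(toy, label_col="label")
print(len(train_df), len(val_df), len(test_df))
```

The helper relies on the DataFrame having a unique index, since the held-out rows are found by dropping the sampled index values.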

2. Downloading data from the Hugging Face 🤗 Dataset Hub

The Hugging Face datasets library provides an API to download public datasets and preprocess them easily. You can refer to this video on Hugging Face datasets to get started.

The main page of HuggingFace datasets Hub | source: Huggingface website

We’ll first install the datasets library and import the necessary modules:

!pip install datasets
from datasets import load_dataset

# If the short name is not found, try the namespaced id "dair-ai/emotion"
emotion_dataset = load_dataset("emotion")

This creates an emotion_dataset object.


emotion_dataset object | Image by Author

What we get is a dictionary with each key corresponding to a different split. Let’s access the training data to see its contents.

emotion_train = emotion_dataset['train']

Training dataset | Image by Author

The training dataset comprises six different classes — sadness, joy, love, anger, fear, and surprise.

Converting emotion datasets into a Pandas DataFrame

We can now easily convert the above dataset into a pandas dataframe and analyze it further.

import pandas as pd

emotion_dataset.set_format(type="pandas")
train = emotion_dataset["train"][:]
test = emotion_dataset["test"][:]
val = emotion_dataset["validation"][:]

Let’s quickly check if the conversion was successful.


First few rows of the training dataset | Image by Author

Yes, indeed! Now you know how to access datasets from the Hugging Face dataset hub, another addition to your data science skill set. So what do you do when you have the data? You explore it, and that is precisely what we are going to do in the next section.

Exploratory Data Analysis of the emotion dataset

We’ll start by importing the necessary libraries and visualizing the data. As we already know, the data has been preprocessed, so that is a bonus. We’ll typically look for imbalance in the dataset and length of the tweets to start with. Beyond that, feel free to dive in further.

Notebooks to follow along: Exploratory Data Analysis of the emotion dataset

import numpy as np
import pandas as pd
import string

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='whitegrid', palette='muted', font_scale=1.2)
colors = ["#01BEFE", "#FFDD00", "#FF7D00", "#FF006D", "#ADFF02", "#8F00FF"]
matplotlib.rcParams['figure.figsize'] = 12, 8

Creating a column with label names.

The label column currently has integers. To make it more understandable, we’ll create a new column called description containing the description of each integer in the label column.

labels_dict = {0:'sadness', 1:'joy', 2:'love', 3:'anger', 4:'fear', 5:'surprise'}
train['description'] = train['label'].map(labels_dict)

Image by Author

Analysis of the Description Column

Now, let’s analyze the description column and see what it looks like. I have only used the training dataset, but the process remains the same if we wish to do it for the test dataset, too.

Examples of each emotion

Let’s look at an example of each of the emotions.

Examples of various tweets by emotions | Image by Author
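One way to pull out a single example per emotion is to group by the description column and take the first row of each group. The snippet below is a sketch on a few made-up rows, not tweets from the dataset; with the real data you would run the same `groupby` call on `train`.

```python
import pandas as pd

# Made-up rows standing in for the training DataFrame
sample = pd.DataFrame({
    "text": [
        "i feel so alone tonight",
        "i am feeling really cheerful today",
        "i feel irritated by all the noise",
        "i feel gloomy again",
    ],
    "description": ["sadness", "joy", "anger", "sadness"],
})

# Keep the first tweet seen for each emotion
examples = sample.groupby("description", sort=False).head(1)

for _, row in examples.iterrows():
    print(f"{row['description']}: {row['text']}")
```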

Distribution of the labels in the training set

It’ll be informative to look at the distribution of the labels. This will also give us an idea of the imbalance in the dataset, if any.


sns.countplot(x='description', data=train, order=train['description'].value_counts().index)

Distribution of target column in the dataset | Image by Author

About 33 percent of the tweets are joyful, followed by sad and angry tweets.
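To read the exact shares rather than eyeballing the bars, `value_counts(normalize=True)` returns the fraction of rows per label. Here is a small sketch on invented labels; the counts are purely illustrative and do not reproduce the dataset’s real distribution.

```python
import pandas as pd

# Invented labels, only to demonstrate the call
labels = pd.Series(
    ["joy", "joy", "joy", "sadness", "sadness", "anger"],
    name="description",
)

# Fraction of rows per label, converted to percentages
share = (labels.value_counts(normalize=True) * 100).round(1)
print(share)
```

On the real data, the same expression applied to `train['description']` would give the percentages behind the plot.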

Analyzing Text Statistics

We can now do some statistical analysis to explore the fundamental characteristics of the text data. Some of the analyses which can be helpful are:

  • Text length analysis: calculating the length of the text, and
  • Word frequency analysis: calculating the word count in the form of unigrams, bigrams, and trigrams.

train['text_length'] = train['text'].astype(str).apply(len)
train['text_word_count'] = train['text'].apply(lambda x: len(str(x).split()))

Tweet length analysis

sns.histplot(train['text_length'])  # histogram of tweet lengths in characters
plt.xlim([0, 512]);
plt.xlabel('Text Length');

Tweet length analysis | Image by Author

The histogram above shows that tweet lengths range from around 2 to 300 characters.

Tweet word count analysis

Now let’s analyze the number of words per tweet for each class.

sns.boxplot(x="description", y="text_word_count", data=train)

Tweet word count analysis | Image by Author

Most of the tweets have around 15 words on average. Also, tweets in all classes appear to have more or less the same length. Hence, the length of a tweet isn’t a powerful indicator of its emotion.
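The boxplot comparison can also be checked numerically by aggregating the word counts per class. A minimal sketch, again on made-up rows rather than the real tweets:

```python
import pandas as pd

# Made-up rows; with the real data, use the `train` DataFrame instead
df = pd.DataFrame({
    "text": [
        "i feel happy",
        "i am feeling very very happy today",
        "i feel low",
        "i am feeling down and tired now",
    ],
    "description": ["joy", "joy", "sadness", "sadness"],
})

# Words per tweet, split on whitespace
df["text_word_count"] = df["text"].str.split().str.len()

# Median and mean word count for each emotion
summary = df.groupby("description")["text_word_count"].agg(["median", "mean"])
print(summary)
```

If the medians come out close across classes, that supports the reading that tweet length carries little signal about the label.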

Distribution of top n-grams

An n-gram is a contiguous sequence of n items from a given sample of text or speech. It is also a good idea to look at various n-grams to understand which words mainly occur together. For instance, we look at the distribution of unigrams, bigrams, and trigrams across three emotions: sadness, anger, and love.
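As an illustration of how such counts can be produced, the sketch below builds n-grams with plain Python and tallies them with `collections.Counter`. The `top_ngrams` helper and the example sentences are my own placeholders; the accompanying notebook may well use a different approach, and no stop-word removal is done here.

```python
from collections import Counter

def top_ngrams(texts, n=2, k=3):
    """Return the k most common word n-grams across texts."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        # zip over n shifted views of the token list to form n-grams
        grams = zip(*(tokens[i:] for i in range(n)))
        counts.update(" ".join(g) for g in grams)
    return counts.most_common(k)

# Placeholder sentences, not taken from the dataset
texts = [
    "i feel so sad today",
    "i feel so lonely",
    "i feel fine",
]
print(top_ngrams(texts, n=1, k=2))
print(top_ngrams(texts, n=2, k=2))
```

To compare emotions, you would call the helper once per class, for example on the tweets whose description is sadness, and plot the returned counts.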

Tweets from Sad category | Image by Author

Tweets from angry category | Image by Author

Tweets from ‘Love’ category | Image by Author

Now that we have done some preliminary exploration of the dataset, the next step is to use it to create an emotion classifier. This could be a great project to add to your resume, and you can also share your trained model with the community. If you want to quickly spin up a notebook to explore the data, I have also made it available on Kaggle.

Emotion Dataset for Emotion Recognition Tasks

Originally posted here. Reposted with permission.

Parul Pandey

Parul is a Data Science Evangelist at H2O.ai. She combines data science, evangelism, and community in her work. Her emphasis is on breaking down data science jargon for people. Prior to H2O.ai, she worked with Tata Power India, applying machine learning and analytics to solve the pressing problem of load shedding in India. She is also an active writer and speaker and has contributed to various national and international publications, including TDS, Analytics Vidhya, KDnuggets, and DataCamp.