This blog post is about the sentiment of song lyrics. Feel free to fork the code from GitHub.

Sentiment analysis is a Natural Language Processing (NLP) technique and, more specifically, part of Natural Language Understanding (NLU). It lets us classify the sentiment of a text as positive or negative according to the words it contains.
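
To make the idea of classifying a text by the words it contains concrete, here is a minimal sketch of a lexicon-based scorer. It is a toy, not VADER: the lexicon, its weights, and the function name `toy_sentiment` are all made up for illustration.

```python
import re

# Toy lexicon: hypothetical words and weights, chosen only for illustration.
TOY_LEXICON = {"love": 2.0, "happy": 1.5, "good": 1.0,
               "bad": -1.0, "pain": -1.5, "hate": -2.0}


def toy_sentiment(text, lexicon=TOY_LEXICON):
    """Sum the weights of the known words and label the text by the sign of the total."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(lexicon.get(word, 0.0) for word in words)
    if score > 0:
        label = "positive"
    elif score < 0:
        label = "negative"
    else:
        label = "neutral"
    return score, label


print(toy_sentiment("I love this good song"))
print(toy_sentiment("Nothing but pain and hate"))
```

A real lexicon-based tool adds much more on top of this (negation, intensifiers, punctuation), so treat the sketch only as an intuition for the word-level approach.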

This blog post has three parts. The first part is data collection: web scraping with the BeautifulSoup and urllib2 libraries. The second part is text analysis: we use the NLTK Python library to compute some statistics on the lyrics of the selected artist. The third part is sentiment analysis: we use the VADER library (yes, as in Star Wars) and plot the number of positive and negative songs per album.
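
As a rough sketch of the last step, counting positive and negative songs per album, the function below tallies songs from a list of `(album_title, [lyrics])` pairs. The names `count_by_album` and `stand_in_score` are hypothetical, and the scorer is a placeholder: with VADER you would pass a function that returns the compound score instead. The 0.05 cut-off is an assumption based on the threshold commonly used with VADER compound scores.

```python
def count_by_album(discography, score_song, threshold=0.05):
    """Return [(album_title, n_positive, n_negative)] for a list of (title, lyrics) pairs."""
    summary = []
    for title, songs in discography:
        scores = [score_song(song) for song in songs]
        positives = sum(1 for s in scores if s >= threshold)
        negatives = sum(1 for s in scores if s <= -threshold)
        summary.append((title, positives, negatives))
    return summary


def stand_in_score(text):
    """Placeholder scorer (NOT VADER): share of 'good' minus share of 'bad' words."""
    words = text.split()
    if not words:
        return 0.0
    return (words.count("good") - words.count("bad")) / float(len(words))


# Made-up data with the same shape as a discography: (album title, list of lyrics).
demo = [("Album A", ["good good", "bad", "meh"]),
        ("Album B", ["bad bad"])]
print(count_by_album(demo, stand_in_score))
```

Songs whose score falls between the two thresholds are counted as neither positive nor negative, which is why an album's two counts may not add up to its number of songs.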

Once we have collected the artist's discography we can plot it in Wordle. There is a button to do it, but you need Java installed.

You can learn more about NLP in the articles NLP with NLTK Part 1 and Part 2. I can also recommend this article if you want to hack your music preferences using song features. In this blog post we will only use the lyrics (text).

You can reuse the functions from this code to take this article further or to build your own projects.

In [1]:
from bs4 import BeautifulSoup
import urllib2 
import re 
import pandas as pd 
from IPython.core.display import display, HTML 
from wordcloud import WordCloud # to plot wordclouds 
import matplotlib.pyplot as plt 

# this line indicates the graphs are displayed in the notebook and not in a new window
%matplotlib inline

1. Data Collection

We will use http://lyrics.wikia.com because its HTML is simple to parse and its database holds a vast number of lyrics. It also places no limits on requests.

The main function for data collection is get_lyrics(). It takes an artist name and downloads the artist's discography. I recommend playing with its parameters: by changing the boolean flags you can see the album covers and wordclouds while the lyrics are being acquired.

The function plot_word_cloud() plots a wordcloud of the text passed to it.

In [2]:
prefix = 'http://lyrics.wikia.com'

def plot_word_cloud(corpus, max_words=42, width=600, height=400, fig_size=(8, 6)):
    if len(corpus) == 0:
        corpus = 'no words'
    wordcloud = WordCloud(max_words=max_words, width=width, height=height, background_color="black").generate(corpus)
    plt.figure(figsize=fig_size, dpi=80)
    plt.imshow(wordcloud, interpolation='bilinear')

def get_lyrics(band_name = None
               , display_cover=False
               , show_song_word_cloud=False
               , show_album_word_cloud=False
               , verbose=False):
    """Asks for the artist's name and downloads all of their lyrics."""
    def get_artist_link():
        """Asks for a term to search and returns the first result."""
        url_search = 'http://lyrics.wikia.com/wiki/Special:Search?query='
        if band_name is None:
            site = urllib2.urlopen(url_search + raw_input("Artist's name ?: ").replace(' ', '+')).read()
        else:
            site = urllib2.urlopen(url_search + band_name.replace(' ', '+')).read()
        soup = BeautifulSoup(site)

        links = []
        for link in soup.find_all("a", class_='result-link'):
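            if link.get('href') is not None:
                # assumed body (lost in the original): keep every search result link
                links.append(link.get('href'))
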
        print 'Getting lyrics from...', links[0]
        return links[0]
    def display_thumbnail(soup):
        images = soup.find_all("img", class_='thumbborder ')
        for image in images:
            # assumed body (lost in the original): render the cover inline
            display(HTML(str(image)))

    def get_album_links(artist_link):
        link_discs = []
        site = urllib2.urlopen(artist_link).read()
        soup = BeautifulSoup(site)

        discs = soup.find_all("span", class_="mw-headline")
        for d in discs:
            for element in d.find_all('a'):
                link_discs.append(prefix + element.get('href'))

        return link_discs
    def get_text(lyric):
        # Unused helper: joins the pieces of a lyric and splits fused words
        # with the same regex the inner get_lyrics() uses.
        text = ''.join(lyric)
        return re.sub("([a-z])([A-Z])", r"\g<1> \g<2>", text)

    def get_lyrics(url):
            site = urllib2.urlopen(url).read()
            soup = BeautifulSoup(site)
            lyric = soup.find_all("div", class_="lyricbox")

            if len(lyric) > 0:
                for element in lyric:
                    return re.sub("([a-z])([A-Z])", r"\g<1> \g<2>", BeautifulSoup(str(element).replace('<br/>',' ')).get_text())

    def get_list_of_links(url, link_filter):
        links = []
        site = urllib2.urlopen(url).read()
        soup = BeautifulSoup(site)
        if (display_cover):
            display_thumbnail(soup)   # displays the albums image

        for link in soup.find_all("a"):
            if link.get('href') is not None and '/wiki/' + link_filter + ':' in link.get('href') and '?' not in link.get('href'):
                links.append(prefix + link.get('href'))

        return links
    def download_lyrics(album_links):
        lyrics = []  # list with all the lyrics
        songs = []   # list with scanned links
        discography = []
        i = 1
        for album_link in album_links:
            album = []
            print 'Downloading:', i, 'out of', len(album_links), 'albums -', album_link.split(':')[-1].replace('_', ' ')
            for link in get_list_of_links(album_link, link_filter):
                if link in songs:
                    continue  # this song was already downloaded
                lyric = get_lyrics(link)  # fetch each page only once
                if lyric is not None:
                    # bookkeeping assumed: these appends were lost in the original
                    songs.append(link)
                    lyrics.append(lyric)
                    album.append(lyric)
                    if verbose:
                        print link.split(':')[-1].replace('_',' ') #print song title
                    if (show_song_word_cloud):
                        plot_word_cloud(lyric.lower(), max_words=50, width=400, height=200)

            if show_album_word_cloud:
                plot_word_cloud(str(album[:]).lower(), max_words=50, width=800, height=500)
            discography.append((album_link.split(':')[-1].replace('_', ' '), album))
            i += 1  # album counter used in the progress message

        print '\nDone!', len(songs), 'lyrics aquired from', len(album_links), 'albums.'
        return discography
    artist_link = get_artist_link()
    link_filter = artist_link.split('/')[-1]
    album_links = get_album_links(artist_link)
    lyrics = download_lyrics(album_links)
    return lyrics
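
One detail of the code above deserves a short illustration. After the `<br/>` tags are stripped, the end of one lyric line can end up glued to the start of the next, and the inner get_lyrics() separates them with a regular expression. The sketch below isolates that step; the helper name `split_fused_words` and the sample strings are made up for the example.

```python
import re


def split_fused_words(text):
    """Insert a space wherever a lowercase letter is directly followed by an uppercase one."""
    return re.sub(r"([a-z])([A-Z])", r"\g<1> \g<2>", text)


print(split_fused_words("first line of the verseSecond line of the verse"))
print(split_fused_words("ReLoad"))
```

The second call shows the trade-off of this heuristic: legitimate mixed-case words such as "ReLoad" are split as well, which is usually acceptable for word counts but worth keeping in mind.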


I’ll use Metallica’s lyrics for the demonstration because I love MetallicA; you can try it with your favorite band.


Arguments of get_lyrics()

The arguments of the function are as follows:

  • band_name = 'metallica' : String with the artist's name; pass it to avoid typing the name manually.
  • display_cover = False : Boolean variable to display the album cover while the album is being processed.
  • verbose = False : Boolean variable to show the name of the song that is being processed.
  • show_album_word_cloud = False : Boolean variable to show a word-cloud with the tokens of each album.
  • show_song_word_cloud = False : Boolean variable to show a word-cloud with the tokens of each song.

In [3]:
corpus = get_lyrics(display_cover = False # displays the album cover while it is being processed
                    #, band_name='metallica' # name of the artist
                    , verbose = False # prints the song titles
                    , show_album_word_cloud = False # shows a word-cloud per album
                    , show_song_word_cloud = False) # shows a word-cloud per song

# raw will contain all the text of the lyrics.
raw = ''
for title, songs in corpus:
    for song in songs:
        raw += ' ' + song  # assumed body (lost in the original): concatenate every lyric

Artist's name ?: metallica
Getting lyrics from... http://lyrics.wikia.com/wiki/Metallica
Downloading: 1 out of 14 albums - Kill %27Em All (1983)
Downloading: 2 out of 14 albums - Ride The Lightning (1984)
Downloading: 3 out of 14 albums - Master Of Puppets (1986)
Downloading: 4 out of 14 albums - ...And Justice For All (1988)
Downloading: 5 out of 14 albums - Metallica (1991)
Downloading: 6 out of 14 albums -  Binge %26 Purge (1993)
Downloading: 7 out of 14 albums - Load (1996)
Downloading: 8 out of 14 albums - ReLoad (1997)
Downloading: 9 out of 14 albums - Garage Inc. (1998)
Downloading: 10 out of 14 albums - S%26M (1999)
Downloading: 11 out of 14 albums - St. Anger (2003)
Downloading: 12 out of 14 albums - Death Magnetic (2008)
Downloading: 13 out of 14 albums - Lulu (2011)
Downloading: 14 out of 14 albums - Hardwired...To Self-Destruct (2016)

Done! 156 lyrics aquired from 14 albums.
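
The album titles in the log above are taken straight from the URLs, so characters such as the apostrophe and the ampersand show up percent-encoded (%27, %26). Below is a small sketch of how the titles could be cleaned and the songs per album counted, assuming the corpus is the list of `(album_title, [lyrics])` pairs returned by get_lyrics(). The helper names `clean_title` and `songs_per_album` and the sample data are hypothetical.

```python
try:
    from urllib.parse import unquote  # Python 3
except ImportError:
    from urllib import unquote        # Python 2


def clean_title(title):
    """Decode percent-encoded characters and trim surrounding whitespace."""
    return unquote(title).strip()


def songs_per_album(corpus):
    """Return [(clean album title, number of songs)] for (title, lyrics) pairs."""
    return [(clean_title(title), len(songs)) for title, songs in corpus]


# Made-up corpus with the same shape as the value returned by get_lyrics().
sample_corpus = [("Kill %27Em All (1983)", ["lyric one", "lyric two"]),
                 (" Binge %26 Purge (1993)", ["lyric three"])]
print(songs_per_album(sample_corpus))
```

Cleaning the titles once, right after the download, keeps the labels readable in any later plot that is grouped by album.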

Wordle is a widely known service for plotting wordclouds. We can send all the words in the discography to Wordle and get a wordcloud of our artist. Instructions to do that are here:

Diego Arenas, ODSC

I've worked in BI, DWH, and Data Mining. MSc in Data Science. Experience in multiple BI and Data Science tools, always thinking about how to solve information needs and add value to organisations from the data available. Experience with Business Objects, Pentaho, Informatica Power Center, SSAS, SSIS, SSRS, MS SQL Server from 2000 to 2017, and other DBMS, Tableau, Hadoop, Python, R, SQL. Predictive modelling. My interests are in Information Systems, Data Modeling, Predictive and Descriptive Analysis, Machine Learning, Data Visualization, Open Data. Specialties: Data modeling, data warehousing, data mining, performance management, business intelligence.