Sentiment analysis is a Natural Language Processing (NLP) technique and part of Natural Language Understanding (NLU). It lets us classify the sentiment of a text as positive or negative according to the words it contains.
This blog post has three parts. The first part is about data collection: web scraping using the BeautifulSoup and urllib2 libraries. The second part is text analysis: we use the NLTK Python library to compute some statistics on the lyrics of the selected artist. The third part is sentiment analysis: we use the VADER library (yes, as in Star Wars) and plot the number of positive and negative songs per album.
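To make the idea of counting positive and negative songs per album concrete, here is a minimal sketch. It does not use VADER, which is a much richer lexicon and rule-based tool; instead it scores songs with a tiny made-up word list. The names `POSITIVE`, `NEGATIVE`, `song_polarity` and `polarity_per_album` are hypothetical, and the sketch is written for Python 3 with only the standard library, unlike the Python 2 code in the rest of this post.

```python
import re
from collections import Counter

# Toy lexicon: these word lists are invented for illustration only.
POSITIVE = {"love", "light", "hero", "free", "alive"}
NEGATIVE = {"death", "pain", "fear", "dark", "hate"}


def song_polarity(lyrics):
    """Label one song by comparing counts of positive and negative words."""
    words = re.findall(r"[a-z']+", lyrics.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def polarity_per_album(discography):
    """Count song labels per album.

    discography is assumed to be a list of (album_title, [lyrics, ...]) tuples.
    """
    return {title: Counter(song_polarity(song) for song in songs)
            for title, songs in discography}


# Made-up data, not real lyrics.
demo = [
    ("Album A", ["I love the light", "pain and fear in the dark"]),
    ("Album B", ["nothing here"]),
]
print(polarity_per_album(demo))
```

The per-album counters are the kind of data a bar chart of positive versus negative songs could be drawn from.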
Once we have collected the artist's discography, we can plot it in Wordle. There is a button to do it, but you need Java installed.
You can learn more about NLP in these articles: NLP with NLTK Part 1 and Part 2. I can also recommend this article if you want to hack your music preferences using song features. In this blog post we will only use lyrics (text).
You can reuse the functions from this code to extend this article or to create your own projects.
from bs4 import BeautifulSoup
import urllib2
import re
import pandas as pd
from IPython.core.display import display, HTML
from wordcloud import WordCloud # to plot wordclouds
import matplotlib.pyplot as plt

# this line indicates the graphs are displayed in the notebook and not in a new window
%matplotlib inline
1. Data Collection
We will use http://lyrics.wikia.com because its HTML is simple to parse and it has a vast number of lyrics in its database. Also, it has no limits on requests.
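To give a feel for how simple the parsing can be, here is a small sketch of picking song links out of a page. It mirrors the link filter used by the code in this post (keep links containing `/wiki/<artist>:` and skip links with a `?`), but it swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra packages on Python 3. The `LinkCollector` class, the `song_links` function and the `SAMPLE` markup are hypothetical and are not taken from the real site.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def song_links(html, link_filter, prefix="http://lyrics.wikia.com"):
    """Return absolute links that look like song pages of one artist."""
    parser = LinkCollector()
    parser.feed(html)
    marker = "/wiki/" + link_filter + ":"
    return [prefix + h for h in parser.hrefs if marker in h and "?" not in h]


# Invented markup for illustration only.
SAMPLE = """
<a href="/wiki/Metallica:One">One</a>
<a href="/wiki/Metallica:One?action=edit">edit</a>
<a href="/wiki/Megadeth:Peace_Sells">other artist</a>
<a name="top">no href</a>
"""

print(song_links(SAMPLE, "Metallica"))
# prints ['http://lyrics.wikia.com/wiki/Metallica:One']
```

Filtering on the artist name and rejecting query strings is a cheap way to drop edit links and links to other artists without inspecting each page.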
The main function for data collection is get_lyrics(). It takes an artist name and downloads the artist's discography. I recommend playing with its parameters: by changing the Boolean parameters you can see the album covers and word clouds while the lyrics are being acquired.
plot_word_cloud() plots a word cloud of the text passed to it.
prefix = 'http://lyrics.wikia.com'

def plot_word_cloud(corpus, max_words = 42, width=600, height=400, fig_size=(8,6)):
    # Plots a word cloud of the given text; plotting errors are silently ignored.
    try:
        if len(corpus) == 0:
            corpus = 'no words'
        wordcloud = WordCloud(max_words = max_words, width=width, height=height, background_color="black").generate(corpus)
        plt.figure(figsize=fig_size, dpi=80)
        plt.imshow(wordcloud, interpolation='bilinear')
        plt.axis("off")
        plt.show()
    except Exception:
        pass
    return

def get_lyrics(band_name = None
               , display_cover=False
               , show_song_word_cloud=False
               , show_album_word_cloud=False
               , verbose=False):
    """
    Asks for the artist's name and downloads all their lyrics.
    """
    def get_artist_link():
        """
        Asks for a term to search and returns the first result.
        """
        url_search = 'http://lyrics.wikia.com/wiki/Special:Search?query='
        if band_name is None:
            site = urllib2.urlopen(url_search + raw_input("Artist's name ?: ").replace(' ', '+')).read()
        else:
            site = urllib2.urlopen(url_search + band_name.replace(' ', '+')).read()
        soup = BeautifulSoup(site)
        links = []
        for link in soup.find_all("a", class_='result-link'):
            if link.get('href') != None:
                links.append(link.get('href'))
        # The first search result is taken as the artist page.
        print 'Getting lyrics from...', links[0]
        return links[0]

    def display_thumbnail(soup):
        images = soup.find_all("img", class_='thumbborder ')
        for image in images:
            display(HTML(str(image)))
        return

    def get_album_links(artist_link):
        link_discs = []
        site = urllib2.urlopen(artist_link).read()
        soup = BeautifulSoup(site)
        discs = soup.find_all("span", class_="mw-headline")
        for d in discs:
            for element in d.find_all('a'):
                link_discs.append(prefix + element.get('href'))
        return link_discs

    def get_text(lyric):
        # Not called below; camel_case_split is not defined in this listing.
        text = ''
        for line in lyric:
            text += line
        print camel_case_split(text)
        return camel_case_split(text)

    def get_lyrics(url):
        # Returns the text of the first lyric box on the page, or None.
        try:
            site = urllib2.urlopen(url).read()
            soup = BeautifulSoup(site)
            lyric = soup.find_all("div", class_="lyricbox")
            if len(lyric) > 0:
                for element in lyric:
                    # Insert a space where a lowercase letter is glued to an uppercase one.
                    return re.sub("([a-z])([A-Z])", r"\g<1> \g<2>", BeautifulSoup(str(element).replace('<br/>',' ')).get_text())
        except Exception:
            pass

    def get_list_of_links(url, link_filter):
        links = []
        site = urllib2.urlopen(url).read()
        soup = BeautifulSoup(site)
        if (display_cover):
            display_thumbnail(soup) # displays the album's image
        for link in soup.find_all("a"):
            if link.get('href') != None and '/wiki/' + link_filter + ':' in link.get('href') and not '?' in link.get('href'):
                links.append(prefix + link.get('href'))
        return links

    def download_lyrics(album_links):
        lyrics = [] # list with all the lyrics
        songs = [] # list with scanned links
        discography = []
        i = 1
        for album_link in album_links:
            album = []
            print 'Downloading:', i, 'out of', len(album_links), 'albums -', album_link.split(':')[-1].replace('_', ' ')
            i+=1
            for link in get_list_of_links(album_link, link_filter):
                if link in songs:
                    continue
                lyric = get_lyrics(link) # download each song only once
                if lyric != None:
                    lyrics.append(lyric)
                    album.append(lyric)
                    if verbose:
                        print link.split(':')[-1].replace('_',' ') # print song title
                    if (show_song_word_cloud):
                        plot_word_cloud(lyric.lower(), max_words=50, width=400, height=200)
                    songs.append(link)
            if show_album_word_cloud:
                plot_word_cloud(str(album[:]).lower(), max_words=50, width=800, height=500)
            discography.append((album_link.split(':')[-1].replace('_', ' '), album))
        print '\nDone!', len(songs), 'lyrics aquired from', len(album_links), 'albums.'
        return discography

    artist_link = get_artist_link()
    link_filter = artist_link.split('/')[-1]
    album_links = get_album_links(artist_link)
    lyrics = download_lyrics(album_links)
    return lyrics
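The regular expression used when extracting each lyric is worth a closer look. When line breaks are dropped from the HTML, the last word of one line can end up glued to the first word of the next. The sketch below (Python 3, standard library only) shows the same substitution on an invented string; the helper name `split_glued_words` is hypothetical.

```python
import re


def split_glued_words(text):
    """Insert a space wherever a lowercase letter is directly followed by an uppercase one."""
    return re.sub(r"([a-z])([A-Z])", r"\g<1> \g<2>", text)


# Invented text, not real lyrics.
print(split_glued_words("first line endsSecond line starts"))
# prints: first line ends Second line starts
```

This is only a heuristic: it will not separate lines that end with punctuation (for example `end.Start` is left unchanged), and it will also split legitimate mixed-case words.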
I’ll use Metallica’s lyrics for the demonstration because I love Metallica; you can try it with your favorite band.
The arguments of the function are as follows:
band_name = None: string with the artist's name, to avoid typing it in manually.
display_cover = False: Boolean variable to display the cover of each album while it is being processed.
verbose = False: Boolean variable to show the name of the song that is being processed.
show_album_word_cloud = False: Boolean variable to show a word cloud with the tokens of each album.
show_song_word_cloud = False: Boolean variable to show a word cloud with the tokens of each song.
corpus = get_lyrics(display_cover = False # displays the cover of the album while it is being processed
                    #, band_name='metallica' # name of the artist
                    , verbose = False # prints the song titles
                    , show_album_word_cloud = False # shows a word cloud per album
                    , show_song_word_cloud = False) # shows a word cloud per song

# raw will contain all the text of the lyrics.
raw = ''
for title, songs in corpus:
    for song in songs:
        raw += song + ' ' # the space keeps the last word of a song from merging with the next song
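The value returned by get_lyrics() is a list of (album title, list of song lyrics) tuples. The sketch below shows that shape with invented data and how it could be flattened and counted; `flatten` and `top_words` are hypothetical helpers, and the code is Python 3 with only the standard library rather than the Python 2 used above.

```python
import re
from collections import Counter

# Invented stand-in for the structure returned by get_lyrics().
corpus = [
    ("Album A (2001)", ["fire fire fire water", "water fire"]),
    ("Album B (2005)", ["earth"]),
]


def flatten(corpus):
    """Join every song of every album into one string, separated by spaces."""
    return " ".join(song for _, songs in corpus for song in songs)


def top_words(text, n=2):
    """Return the n most frequent lowercase word tokens with their counts."""
    return Counter(re.findall(r"[a-z']+", text.lower())).most_common(n)


raw = flatten(corpus)
print(top_words(raw))
# prints [('fire', 4), ('water', 2)]

# Number of songs per album, in album order.
print([(title, len(songs)) for title, songs in corpus])
# prints [('Album A (2001)', 2), ('Album B (2005)', 1)]
```

Keeping the album boundaries in the structure, and flattening only when needed, makes it possible to compute statistics either per album or for the whole discography.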
Artist's name ?: metallica
Getting lyrics from... http://lyrics.wikia.com/wiki/Metallica
Downloading: 1 out of 14 albums - Kill %27Em All (1983)
Downloading: 2 out of 14 albums - Ride The Lightning (1984)
Downloading: 3 out of 14 albums - Master Of Puppets (1986)
Downloading: 4 out of 14 albums - ...And Justice For All (1988)
Downloading: 5 out of 14 albums - Metallica (1991)
Downloading: 6 out of 14 albums - Binge %26 Purge (1993)
Downloading: 7 out of 14 albums - Load (1996)
Downloading: 8 out of 14 albums - ReLoad (1997)
Downloading: 9 out of 14 albums - Garage Inc. (1998)
Downloading: 10 out of 14 albums - S%26M (1999)
Downloading: 11 out of 14 albums - St. Anger (2003)
Downloading: 12 out of 14 albums - Death Magnetic (2008)
Downloading: 13 out of 14 albums - Lulu (2011)
Downloading: 14 out of 14 albums - Hardwired...To Self-Destruct (2016)

Done! 156 lyrics aquired from 14 albums.
Wordle is a widely known service for plotting word clouds. We can send all the words in the discography to Wordle and get a word cloud of our artist. Instructions to do that are here:
I've worked in BI, DWH, and Data Mining. MSc in Data Science. Experience in multiple BI and Data Science tools, always thinking about how to solve information needs and add value to organisations from the data available. Experience with Business Objects, Pentaho, Informatica Power Center, SSAS, SSIS, SSRS, MS SQL Server from 2000 to 2017, and other DBMS, Tableau, Hadoop, Python, R, SQL. Predictive modelling. My interests are in Information Systems, Data Modeling, Predictive and Descriptive Analysis, Machine Learning, Data Visualization, Open Data. Specialties: Data modeling, data warehousing, data mining, performance management, business intelligence.