This is a two-part series about using machine learning to hack my taste in music. In this first piece, I apply unsupervised learning techniques and tools to Pandora data to analyze songs that I like. The second part, which will be published soon, is about using supervised learning on Spotify data to predict whether or not I will like a song.

Introduction

I’ve always prided myself on having an eclectic music taste that spans more than a dozen genres. If you take a look at my top tracks on Last.FM, you’ll notice a smorgasbord of tracks from artists like LCD Soundsystem, Jimi Hendrix, and Kanye West. When I make a playlist, it’s not uncommon for me to include some ’80s post-disco, 2000s indie rock, and Nigerian or Turkish funk.

The inspiration for this project came from years of wondering why I had such a diverse taste in music. When I recently discovered a way to download Pandora tags for songs, I knew right away I wanted to analyze this rich data. On each song’s Pandora page is a collection of features describing that song. These attributes were determined by Pandora’s famous Music Genome Project, the self-described “most comprehensive music analysis ever undertaken.” Here’s an example of a song and its Pandora-assigned labels.

[Image: a song and its Pandora-assigned labels]

The precise labels provided by Pandora gave me an excellent opportunity to undertake an unsupervised learning project: clustering the songs I’ve liked on my Pandora account by their attributes in order to find patterns and structure in the data.

In this article, you’ll see how I applied clustering and dimensionality reduction techniques to analyze the different types of music I like.

Data Acquisition and Wrangling

Getting the data occurred in two relatively simple parts, both of which involved web scraping. To acquire the songs that I’ve “liked” on Pandora, I used the website http://pandorasongs.oliverzheng.com/, which neatly collects every song that you clicked the thumbs-up button for on your Pandora account. After scraping the JSON data for that list of songs, I scraped the attributes of each song from its Pandora webpage into a pandas DataFrame. The code for this step comes from Sinan Ozdemir, so give it a try if you’d like to replicate this project on yourself.

My initial dataset had 574 songs and 406 features, but before I could begin my analysis I needed to prune the data. I filtered out songs with fewer than five features and features with fewer than four songs, which reduced my data to 439×258, certainly enough for this project.
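A minimal sketch of that pruning step, assuming the scraped data sits in a pandas DataFrame with one row per song, one column per feature, and 1/0 entries. The `prune` helper, the synthetic data, and the choice to filter songs before features are illustrative assumptions, not the original code.

```python
import numpy as np
import pandas as pd


def prune(df, min_features_per_song=5, min_songs_per_feature=4):
    """Drop sparsely tagged songs (rows), then rarely used features (columns)."""
    # Keep songs tagged with at least `min_features_per_song` features.
    df = df.loc[df.sum(axis=1) >= min_features_per_song]
    # Keep features that appear on at least `min_songs_per_feature` remaining songs.
    df = df.loc[:, df.sum(axis=0) >= min_songs_per_feature]
    return df


# Synthetic stand-in for the scraped data: 1 means the song has the feature.
rng = np.random.default_rng(0)
songs = pd.DataFrame(
    (rng.random((60, 30)) < 0.2).astype(int),
    index=[f"song_{i}" for i in range(60)],
    columns=[f"feature_{j}" for j in range(30)],
)

pruned = prune(songs)
print(songs.shape, "->", pruned.shape)
```

Because the column filter runs second, a song can end up with fewer than five of the surviving features. Repeating the two filters until the shape stops changing would close that gap if it matters.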

Analysis

First let’s take a look at the most common features in the data.

[Image: the most common features in the data]

These features are fairly generic and could appear in most genres, so it makes sense that they would appear in this graphic.

Before I applied any dimensionality reduction techniques, I used KMeans clustering to see if I could extract some structure from the data.

The following plot displays the silhouette scores for every number of clusters between 2 and 40.

https://plot.ly/~GeorgeMcIntire/268/

These results are somewhat disappointing. As a rule of thumb, a clustering project should aim for silhouette scores above 0.3 at the minimum. However, this certainly doesn’t mean a dead end for my project. It most likely indicates that the high number of features adds a significant amount of noise to the data. This is where dimensionality reduction can be very useful.
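The silhouette sweep described above can be sketched roughly as follows with scikit-learn. The `silhouette_sweep` helper and the synthetic binary matrix are assumptions made for illustration, and the range of k is kept smaller than the article’s 2 to 40 so the example runs quickly.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def silhouette_sweep(X, k_values, seed=0):
    """Fit KMeans for each k and return {k: mean silhouette score}."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores


# Synthetic binary song-by-feature matrix standing in for the real data.
rng = np.random.default_rng(0)
X = (rng.random((120, 40)) < 0.15).astype(float)

scores = silhouette_sweep(X, range(2, 11))
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(k, round(score, 3))
print("best k:", best_k)
```

On random data like this the scores should sit close to zero, which is roughly what a noisy, high-dimensional feature matrix looks like to KMeans.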

Here is a similar chart with multiple plots. Each plot represents the silhouette scores for data that has been transformed by Principal Component Analysis with 50, 10, 5, and 2 components.

https://plot.ly/~GeorgeMcIntire/270/_50-components-10-components-5-components-2-components/

There’s a strong negative correlation between the number of PCA components and the silhouette scores. Each time we reduce the number of components, the silhouette scores go up, which suggests that there are clusters in the data; they’re just hiding amongst a bunch of noise.
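Here is a hedged sketch of that comparison: reduce the data with PCA, cluster the reduced data, and score the result. It assumes the clustering and the silhouette score were both computed in the reduced space, which the article does not state outright, and it uses synthetic data with three noisy groups rather than the Pandora features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Synthetic data: three noisy groups in 60 dimensions.
rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 60))
X = np.vstack([c + rng.normal(scale=2.0, size=(50, 60)) for c in centers])

scores_by_components = {}
for n_components in (50, 10, 5, 2):
    # Project onto the leading principal components.
    reduced = PCA(n_components=n_components, random_state=0).fit_transform(X)
    # Cluster and score in the reduced space.
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(reduced)
    scores_by_components[n_components] = silhouette_score(reduced, labels)

for n_components, score in scores_by_components.items():
    print(n_components, round(score, 3))
```

Since each score is measured in a different space, the numbers are best read as a guide to where structure becomes visible rather than as a strict like-for-like comparison.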

Out of the five silhouette score curves plotted so far, it appears that three clusters give me the best clustering model. The scores drop significantly after the third cluster and level off after the 14th or so cluster. So for most of the clustering I’ll be using three clusters.

The following chart is a two-dimensional t-SNE plot of every song, colored by the labels from a two-cluster model. Keep in mind that the axes do not matter in a t-SNE plot; what’s important is which points end up close to one another. Hover over a point to see the name of the song.

https://plot.ly/~GeorgeMcIntire/244/

The t-SNE-transformed data clearly shows two distinct blobs, which align almost perfectly with the cluster labels. Keep in mind that the labels were fit on the original data, not the t-SNE-transformed data.
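The separation of roles described above, with cluster labels fit on the original features and t-SNE used only for plotting coordinates, can be sketched like this. The synthetic two-group data and the parameter values (perplexity, initialization) are assumptions, not the settings used for the plot.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# Synthetic data: two well-separated groups in 20 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, size=(40, 20)),
    rng.normal(loc=3.0, size=(40, 20)),
])

# Labels come from KMeans fit on the original features...
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# ...while t-SNE only supplies 2-D coordinates for the scatter plot.
embedding = TSNE(
    n_components=2, perplexity=30, init="pca", random_state=0
).fit_transform(X)

print(embedding.shape, np.bincount(labels))
```

Plotting `embedding[:, 0]` against `embedding[:, 1]` and coloring each point by `labels` gives a chart of the same kind as the one above.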

The two clusters differ significantly from one another. The left, lighter blob features mainly rock, R&B, and softer forms of funk and electronic music; it’s certainly music you wouldn’t dance to. The right, darker blob is mostly danceable music, which is why it’s all hip-hop or electronic. The KMeans algorithm has sorted the songs into “melody” music or “beats” music.

Here is the three-dimensional version of that t-SNE plot with three labels instead of two.

With three clusters, it appears the algorithm just divided the “beats” cluster from the previous graph, while for the most part leaving the “melody” cluster untouched. The “beats” cluster was divided along genre lines. The turquoise cluster is staunchly hip-hop and the goldish cluster is solidly electronic music. I encourage you to really dive into the plot and examine it from all sorts of angles. Zoom in and out and twist and turn that graph to observe the data in as many ways as possible.

https://plot.ly/~GeorgeMcIntire/283/

t-SNE and PCA are great for visualizing and transforming data, but what they lack is interpretability. What I mean by this is that the axes on a chart plotting t-SNE or PCA data don’t mean anything on their own; they don’t map back to the original features in a readable way. Fortunately, there’s a way to reduce the dimensionality of the data without losing the meaning of the features, and the technique that allows me to do that is called Non-negative Matrix Factorization (NMF).

An NMF model allows us to know which features are represented by each NMF component. The following table shows the ten most significant song attributes in each component of a three-component NMF model.

[table id=36 /]
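A table like this one can be read off the fitted model’s component matrix. The sketch below is an assumption about how that might be done with scikit-learn’s `NMF`; the feature names and data are synthetic placeholders, and the `init` and `max_iter` values are illustrative choices.

```python
import numpy as np
from sklearn.decomposition import NMF

# Synthetic non-negative song-by-feature matrix with placeholder names.
rng = np.random.default_rng(0)
feature_names = np.array([f"feature_{j}" for j in range(25)])
X = (rng.random((100, 25)) < 0.2).astype(float)

nmf = NMF(n_components=3, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(X)  # song-by-component weights (the 3-D coordinates)
H = nmf.components_       # component-by-feature weights

# For each component, list the features with the largest weights.
top_n = 10
top_features = {
    i: feature_names[np.argsort(row)[::-1][:top_n]].tolist()
    for i, row in enumerate(H)
}

for component, names in top_features.items():
    print(component, names)
```

Because every entry of `H` is non-negative, a large weight can be read directly as “this feature contributes strongly to this component,” which is what makes the axes nameable.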

The three columns clearly represent three distinct genres: hip-hop, rock, and electronic music. A song’s position on an axis is essentially its score for that genre. Now let’s observe this data on a 3-D plot.

https://plot.ly/~GeorgeMcIntire/262/

Electronic and rap display a considerable amount of overlap. The majority of songs in either label possess positive values on both the hip-hop and electronic music axes. However, the rock songs are for the most part one-dimensional, with very few songs having non-zero values on either the hip-hop or electronic music axes. This makes a fair amount of sense: hip-hop and electronic together formed a single cluster when we ran our earlier clustering algorithm with two clusters. They’re both genres that are very “beats” focused. What I find baffling about this result is the diversity of the rock songs. Some songs by artists who are clearly hip-hop or electronic, such as Kanye West and Daft Punk, fall under this category. The so-called rock cluster includes a mix of genres such as reggae, funk, and R&B.

So try some of this analysis on yourself, uncover the clusters that lie in your music taste, and let us know what you find in the comments below. And stay tuned for the next part, where I’ll be using Spotify data to build a classification model that will try to predict whether or not I will like a song.

 


©ODSC2017

George McIntire, ODSC

I'm a journalist turned data scientist/journalist hybrid. Looking for opportunities in data science and/or journalism. Impossibly curious and passionate about learning new things. Before completing the Metis Data Science Bootcamp, I worked as a freelance journalist in San Francisco for Vice, Salon, SF Weekly, San Francisco Magazine, and more. I've referred to myself as a 'Swiss-Army knife' journalist and have written about a variety of topics ranging from tech to music to politics. Before getting into journalism, I graduated from Occidental College with a Bachelor of Arts in Economics. I chose to do the Metis Data Science Bootcamp to pursue my goal of using data science in journalism, which inspired me to focus my final project on being able to better understand the problem of police-related violence in America. Here is the repo with my code and presentation for my final project: https://github.com/GeorgeMcIntire/metis_final_project.