This is the second article in our two-part series on using unsupervised and supervised machine learning techniques to analyze music data from Pandora and Spotify.
As you may recall from my previous post, where I applied dimensionality reduction and clustering techniques to a set of songs I liked on the internet radio service Pandora, I’m a massive music fan. Since I was a teenager, there has hardly been a moment when I wasn’t listening to music. I’ve always prided myself on having diverse taste as well; playlists that jump from 60’s psych rock to Golden Age hip hop to Nigerian funk are quite common for me. Part of the inspiration for these projects is the lingering question of why my taste in music varies so much. How is it possible for me to like 80’s Chicago house, 00’s indie, and cheery oldies hits? Though I have yet to vanquish this persistent white whale, this project does help me move in the right direction.
When I discovered that Spotify provides an API that lets you access data from their archive of millions of songs, I knew I had to get my hands on it. Their API gives users the ability to download a song’s audio features. These features include attributes such as a song’s tempo, level of acousticness, and how danceable it is. Below are detailed descriptions of what those attributes mean.
Now that I had a viable and attainable source of data, the next step was to devise an actual project. Since I had already done an unsupervised learning project with the Pandora data, I knew it was time to go the supervised route. To turn this data source into a supervised classification project, I decided to use these features to predict whether or not I would like a certain song. I therefore created two separate playlists, one of songs I like and one of songs I don’t. The goal of this project is to see how good a classifier I can build to predict whether or not I like a given song, and to see which features are the most informative.
Data preparation and acquisition
The task of creating two different playlists was probably the most fun aspect of the project. I decided that 1000 songs in each playlist would be a sufficient amount of data. Since I’ve been an avid Spotify user for the past six years, I had already accrued about 800 songs in my Starred playlist. I was 40% of the way to the 2000-song total before I had even had the idea of doing this project. However, I did cut out a dozen or so songs that, for some reason, I can’t remember liking. After simply copying and pasting the remaining songs into the “GOOD” playlist, I needed to find about 200 more to fill it out. That was a pretty simple task, which mainly involved searching for my favorite bands and artists and adding their songs.
Crafting the “BAD” playlist was where the real challenge lay. It would have been very easy for me to simply load in the entire Toby Keith or Skrillex catalogs and call it a day. However, to minimize bias, “BAD” had to be as diverse as “GOOD”.
I started by brainstorming artists I didn’t like and adding 5-8 of their songs to “BAD”. This gave me about 200 songs, a decent start, but clearly I needed to dig deeper. If you click on the “Browse” tab in the Spotify application, you’ll be taken to a treasure trove of Spotify-crafted playlists that cover nearly every genre of music created by mankind. I’d hit the jackpot. From there, I swiped songs from playlists of jazz, country, and dubstep songs. With the rock, electronic, and hip-hop playlists, I had to be careful not to include songs I did like. However, I was still about 130 songs short of my 1000-song goal. Where could I find 130 songs I didn’t like? Then it hit me: I should copy the “Starred” playlists of friends who have god-awful taste in music. With the help of three friends, I finally hit the 1000-song mark. I owe a great deal of credit to those three people; they listen to some appalling music.
After the playlists were finalized, it was time to turn them into data. Fortunately for me, there exists a lightweight Python library called spotipy, which gave me easy access to the data I wanted instead of my having to make 2000 individual API calls. The following code shows how to download a set of songs from a playlist, extract their audio features, and load them into a pandas dataframe.
```python
import pandas as pd
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Credentials issued for your Spotify developer application
cid = "CLIENT ID"
secret = "CLIENT SECRET"

client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
sp.trace = False

# Fetch the playlist and pull out its track entries.
# The API returns tracks in pages, so a long playlist needs additional requests.
playlist = sp.user_playlist("geomcintire", "PLAYLIST ID")
songs = playlist["tracks"]["items"]

# Collect the Spotify ID of every track
ids = [song["track"]["id"] for song in songs]

# Request the audio features for those IDs and load them into a dataframe
features = sp.audio_features(ids)
df = pd.DataFrame(features)
```
The best part of the data wrangling was how little cleaning it required. When I passed the features list into a pandas dataframe, it returned a clean and neatly ordered dataset with no null values. In addition, the only feature engineering to be done was converting the duration column from milliseconds to minutes. The only thing left for me to do was label the two datasets (1 for good, 0 for bad) and concatenate them.
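As a rough illustration of that last step, here is a minimal sketch of the conversion, labelling, and concatenation. The two tiny dataframes are made-up stand-ins for the real feature tables, and the column names `duration_min` and `target` are my own assumptions; `duration_ms` is the field name Spotify uses for track length.

```python
import pandas as pd

# Hypothetical stand-ins for the "GOOD" and "BAD" feature tables;
# the real ones would hold roughly 1000 rows each.
good = pd.DataFrame({"tempo": [120.0, 98.5], "duration_ms": [210000, 185000]})
bad = pd.DataFrame({"tempo": [140.0, 76.0], "duration_ms": [240000, 300000]})


def prepare(df: pd.DataFrame, label: int) -> pd.DataFrame:
    """Convert duration to minutes and attach the class label."""
    out = df.copy()
    out["duration_min"] = out["duration_ms"] / 60_000  # 60,000 ms per minute
    out = out.drop(columns="duration_ms")
    out["target"] = label  # 1 for good, 0 for bad
    return out


# Stack the two labelled tables into a single dataset
songs = pd.concat([prepare(good, 1), prepare(bad, 0)], ignore_index=True)
print(songs)
```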
Exploratory Data Analysis and Dimensionality Reduction
Given that this was my first time working with this kind of data, I knew I needed a thorough EDA to become properly familiar with it. I refrained from making any hypotheses before diving in; I was pretty much going in blind.
For my EDA, my two goals were to understand the variance of the features and to see which features correlate best with the target variable (liked/disliked song).
The following plot displays nine sub-scatterplots of selected features. The yellow dots indicate good songs and the purple dots indicate bad ones.
As soon as I saw this scatter plot matrix, I was thrown for a loop. The plot revealed a chaotic and tangled web of data that was almost devoid of discernible insights. These nine subplots are very representative of the scatter plots for every combination of features, which is great because throwing in 80+ miniplots would be overwhelming.
Though there is evidence of slight correlations between certain features (acousticness/energy and acousticness/loudness), I observed just an iota of correlation between the selected features and the target variable. This meant that devising a competent classifier would be difficult and that a linear classifier would most likely not be sufficient.
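One way to check those relationships numerically is a correlation matrix. The sketch below uses synthetic data, not my actual songs: the column names mirror Spotify's feature names, the values are random, and `energy` is deliberately constructed to move against `acousticness` so there is something to see. The `target` column name is an assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in data: feature names mirror Spotify's, values are made up
df = pd.DataFrame({
    "acousticness": rng.random(n),
    "danceability": rng.random(n),
    "target": rng.integers(0, 2, n),  # random labels, unrelated to the features
})
# Build energy so that it falls as acousticness rises, plus a little noise
df["energy"] = 1 - df["acousticness"] + rng.normal(0, 0.1, n)

corr = df.corr()  # pairwise Pearson correlations

# How strongly does each feature track the label?
target_corr = corr["target"].drop("target").sort_values()
print(target_corr)

# How strongly do two features track each other?
print(corr.loc["acousticness", "energy"])
```

Correlations near zero in the `target` column are what "an iota of correlation" looks like in numbers.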
Before moving on to the modeling section, I wanted to view the data in two dimensions.
This TSNE plot maps out the 13-dimensional data onto a 2D scatter plot with points labelled as “Good” or “Bad” songs. Hover over a dot to see the song name.
The TSNE plot, like the scatter matrix, is a messy spattering of dots with no discernible pattern. TSNE plots are useful for visualizing high-dimensional data, but their axis values don’t hold meaning, which is why I decided to try other dimensionality reduction techniques such as PCA, Truncated SVD, and Non-negative Matrix Factorization. Unfortunately, those methods produced essentially the same results as my TSNE graph. The exploratory data analysis and unsupervised learning processes demonstrated that a machine learning algorithm would face a significant challenge in attempting to classify song data as “Good” or “Bad.”
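For readers who want to try a projection like this, here is a minimal sketch with scikit-learn. It runs on random synthetic data with 13 columns rather than my song features, and the standardization step and the perplexity value are assumptions on my part, not settings taken from the project.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 13))    # synthetic stand-in for 13 audio features
y = rng.integers(0, 2, size=120)  # made-up "Good"/"Bad" labels for coloring

# Put every feature on a common scale before projecting (an assumption here)
X_scaled = StandardScaler().fit_transform(X)

# Linear projection onto the two directions of greatest variance
pca_2d = PCA(n_components=2, random_state=0).fit_transform(X_scaled)

# Nonlinear embedding; perplexity must be smaller than the number of samples
tsne_2d = TSNE(n_components=2, perplexity=30, init="pca",
               random_state=0).fit_transform(X_scaled)

print(pca_2d.shape, tsne_2d.shape)
```

Each result is an array with one 2D point per song, which can be handed to any scatter plot function and colored by `y`.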
In this project, I applied a minimal amount of feature engineering (converting milliseconds to minutes, as I mentioned earlier), which is very uncommon in my experience. I also didn’t drop any of the features from my original dataset. The EDA portion of the project did not tell me which features to drop or focus on, so I decided it would be best to keep them all.
Going into the modeling portion of the project, I was very low on confidence because of the results from the EDA and dimensionality reduction work.
My suspicions turned out to be correct. A logistic regression model could only churn out a cross-validated accuracy score of 60%, which is 10 percentage points (20% in relative terms) above my null accuracy of 50%. K-nearest neighbors produced a similar result across a variety of neighbor counts. The tree-based algorithms, decision trees and random forests, yielded significantly better results, hovering in the upper 60s. A decent improvement, but not enough for my satisfaction.
Here are the cross-validated accuracies for my KNN and decision tree models at varying degrees of complexity.
The decision tree is the clear winner in this graphic; its lowest score is greater than the highest K-nearest neighbors score. For both KNN and DT, increasing the model’s complexity (number of neighbors or depth level) produces diminishing marginal returns after a certain point.
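A comparison like the one above can be sketched as follows. The data comes from scikit-learn's `make_classification`, so the scores say nothing about my songs, and the specific neighbor counts, depths, and fold count are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem with 13 features
X, y = make_classification(n_samples=400, n_features=13, n_informative=5,
                           random_state=0)

# Mean 5-fold cross-validated accuracy for several neighbor counts
knn_scores = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y,
                       cv=5, scoring="accuracy").mean()
    for k in (1, 5, 15, 25)
}

# The same measurement for decision trees of increasing depth
tree_scores = {
    d: cross_val_score(DecisionTreeClassifier(max_depth=d, random_state=0),
                       X, y, cv=5, scoring="accuracy").mean()
    for d in (1, 3, 5, 7)
}

print("KNN: ", knn_scores)
print("Tree:", tree_scores)
```

Plotting each dictionary as a line, with the complexity setting on the x-axis, gives a chart of the same kind as the graphic.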
To see if I could improve my scores, I opted for a more high-powered algorithm: gradient boosting. However, much to my surprise, the best cross-validated score I could come up with was 0.693, from a model with a max depth of 7. I used grid searching to find the best parameters for my model, but I still was not able to pass the 0.7 threshold. I spent hours upon hours making lateral movements in my model tweaking.
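For reference, a grid search over a gradient boosting model can be sketched as below. The data is synthetic and the parameter grid, tree count, and fold count are small illustrative assumptions chosen to run quickly; they are not the grid used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the song feature matrix and labels
X, y = make_classification(n_samples=300, n_features=13, n_informative=5,
                           random_state=0)

# Candidate settings to try in every combination (illustrative values)
param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    param_grid,
    cv=3,                # 3-fold cross-validation per combination
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

`best_score_` is the mean cross-validated accuracy of the winning combination, which is the kind of number the 0.693 above refers to.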
I retried feature engineering, this time with more verve. I dropped and combined features, I tried thousands of different combinations of features, and yet I had no luck in budging my model’s evaluation metrics forward.
My next move was to use the feature importances computed by my tree-based algorithms to see which features matter most when making a classification.
My gradient boosting model decided that loudness, danceability, instrumentalness, speechiness, and duration were the five most important features in my dataset. Unfortunately, when I passed just those five features into my model, it yielded lower scores than when I used all the features.
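The experiment just described, ranking features by importance and then retraining on the top five, might look like the sketch below. It uses synthetic data with placeholder feature names (`feature_0` and so on), so the ranking it prints has no bearing on the audio features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data with placeholder column names
feature_names = [f"feature_{i}" for i in range(13)]
X, y = make_classification(n_samples=300, n_features=13, n_informative=5,
                           random_state=0)
X = pd.DataFrame(X, columns=feature_names)

# Fit once on everything and rank the features by importance
model = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)
importances = pd.Series(model.feature_importances_,
                        index=feature_names).sort_values(ascending=False)
top5 = importances.head(5).index.tolist()


def cv_accuracy(features: pd.DataFrame) -> float:
    """Mean 3-fold cross-validated accuracy for a fresh model."""
    clf = GradientBoostingClassifier(n_estimators=50, random_state=0)
    return cross_val_score(clf, features, y, cv=3).mean()


full_score = cv_accuracy(X)        # all 13 features
top5_score = cv_accuracy(X[top5])  # only the five most important
print(top5, round(full_score, 3), round(top5_score, 3))
```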
I persisted with the feature engineering until I had to accept that I could not extract a statistically significant improvement from the model. The best-performing model I used was a Gradient Boosting Classifier with a learning rate of 0.1 and a max depth of 7.
Even though I did not create a world-class model, I felt that my modeling produced respectable results. I was pleased to see that my model achieved an area under the ROC curve of about 0.75 and that my final accuracy score was roughly 40% higher, in relative terms, than my null accuracy. With my modeling concluded, it was time to make predictions and see what I could learn about myself and my taste in music.
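Because "percent better than the null accuracy" can be read two ways, here is the arithmetic spelled out using the figures reported above (a 50% null accuracy, 60% for logistic regression, and 0.693 for gradient boosting). The helper name `relative_gain` is my own.

```python
def relative_gain(accuracy: float, baseline: float) -> float:
    """Relative improvement of an accuracy over a baseline, as a fraction."""
    return accuracy / baseline - 1


# Two balanced classes of 1000 songs each, so always guessing one class scores 50%
null_accuracy = 0.50

for name, acc in [("logistic regression", 0.60), ("gradient boosting", 0.693)]:
    points = (acc - null_accuracy) * 100        # absolute gain in percentage points
    relative = relative_gain(acc, null_accuracy)  # gain relative to the baseline
    print(f"{name}: +{points:.1f} points, {relative:.1%} relative")
```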
Testing the Model
To measure the true worth of the model, I decided to apply it to a testing set. For this new testing dataset, I opted to use songs from my Discover Weekly playlist, a playlist of 30 songs recommended to me each week by Spotify. I selected 100 songs from my most recent Discover Weekly playlists and passed them through my model. Next, I listened to all 100 songs and labelled them as “Good” or “Bad”.
Then came the big moment. How well could my model predict whether or not I liked songs from my Discover Weekly playlist?
The results show that the gradient boosting model correctly classified only 49% of the songs in the testing set (43 true positives + 6 true negatives). Of the 61 songs I labelled as “Good”, the model correctly identified 70.5%, and of the 76 songs my model predicted I would like, I actually liked 57%.
The model fared much worse at identifying the “Bad” songs than the “Good” ones. Of the 39 songs I didn’t like, the model correctly identified just 15% (6 out of 39).
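All of those percentages follow from the four cells of a confusion matrix, which can be reconstructed from the counts given above. Here is that arithmetic as a short sketch; the variable names are mine.

```python
# Counts stated in the text
tp, tn = 43, 6        # true positives and true negatives
liked = 61            # songs I labelled "Good"
predicted_liked = 76  # songs the model predicted as "Good"
total = 100

# The remaining two cells follow by subtraction
fn = liked - tp            # liked songs the model rejected
fp = predicted_liked - tp  # disliked songs the model recommended

accuracy = (tp + tn) / total
recall = tp / (tp + fn)       # share of liked songs the model found
precision = tp / (tp + fp)    # share of recommendations I actually liked
specificity = tn / (tn + fp)  # share of disliked songs the model caught

print(f"fn={fn}, fp={fp}")
print(f"accuracy={accuracy:.3f} recall={recall:.3f} "
      f"precision={precision:.3f} specificity={specificity:.3f}")
```

The 33 false positives against only 6 true negatives are what drag the overall accuracy below 50%.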
In conclusion, these results do not indicate an exceptional, let alone a satisfactory, model. The model’s problem is that it frequently overestimates the likeability of songs and fails to take into account my somewhat conservative judgement.
What Do I Like in a Song? Or How to Find the Perfect Song.
So which features do I like, and which do I dislike?
To find this out, I trained a logistic regression model on a scaled version of the data. I did this because I wanted to see which features had the highest and lowest coefficient values, thus showing which features have the biggest impact on the probability that I will like a song.
I found this graphic to be quite remarkable. The two features with the lowest coefficients are “loudness” and “acousticness”, two traits that are at odds with one another. We see the same pattern at the other end of the graph with “instrumentalness” and “speechiness”.
The model says that an increase of one standard deviation in the loudness feature (holding all other features constant) leads to a 10 percentage point decrease in the probability that I will like a song, and an increase of one standard deviation in the instrumentalness feature leads to a 7.5 percentage point increase in that probability.
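To make the coefficient reading concrete, here is a minimal sketch of fitting a logistic regression on standardized features and turning a coefficient into a change in probability. The data is synthetic with placeholder feature names, the 0.5 baseline probability and the example coefficient of -0.4 are hypothetical values, and `probability_shift` is one common way to do the conversion, not necessarily how the figures above were computed.

```python
import math

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with placeholder feature names
names = [f"feature_{i}" for i in range(6)]
X, y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           n_redundant=0, random_state=0)

# Standardize first so each coefficient is "per one standard deviation"
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X, y)

# Sort the coefficients from most negative to most positive
coefs = pd.Series(pipe.named_steps["logisticregression"].coef_[0],
                  index=names).sort_values()
print(coefs)


def probability_shift(coef: float, baseline: float = 0.5) -> float:
    """Change in predicted probability when one standardized feature rises by
    one standard deviation, starting from `baseline`, all else held fixed."""
    log_odds = math.log(baseline / (1 - baseline)) + coef
    return 1 / (1 + math.exp(-log_odds)) - baseline


# Hypothetical coefficient: about a 10 percentage point drop from a 50% baseline
print(round(probability_shift(-0.4), 3))
```

Coefficients act on the log-odds, so the shift in probability depends on the starting point; the same coefficient moves a song sitting at 50% more than one sitting at 90%.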
It appears that for me to like a song, it can’t be loud or acoustic, and it should be danceable, with high levels of speechiness and instrumentalness.
George McIntire, ODSC
I'm a journalist turned data scientist/journalist hybrid. Looking for opportunities in data science and/or journalism. Impossibly curious and passionate about learning new things. Before completing the Metis Data Science Bootcamp, I worked as a freelance journalist in San Francisco for Vice, Salon, SF Weekly, San Francisco Magazine, and more. I've referred to myself as a 'Swiss-Army knife' journalist and have written about a variety of topics ranging from tech to music to politics. Before getting into journalism, I graduated from Occidental College with a Bachelor of Arts in Economics. I chose to do the Metis Data Science Bootcamp to pursue my goal of using data science in journalism, which inspired me to focus my final project on being able to better understand the problem of police-related violence in America. Here is the repo with my code and presentation for my final project: https://github.com/GeorgeMcIntire/metis_final_project.