

Visual Analysis of Music Taste Using Spotify’s API & Seaborn in Python
Data Visualization | Modeling | Posted by ODSC Community, June 24, 2020

I recently started using Spotify and was amazed by the sophisticated technology that drives Spotify’s recommendation system based on collaborative filtering and NLP.
In this project, I investigated country-specific music preferences.
Data Acquisition:
I scraped the data from Spotify’s weekly regional chart. It is a weekly list of the top 200 most-streamed songs in each of 64 countries, including ‘global’. The data in the chart runs from Dec’16 onwards.
I scraped the list from 11286 ‘page_urls’, resulting in a dataset with 2.1 million rows and the following attributes: song name, song URL, artist, streams, song position (within the top 200), country and week. There are 48261 unique songs, 9805 artists, and 1579 genres in the dataset. Sharing with you a snapshot of the sample data:
“One good thing about music, when it hits you, you feel no pain.” -Bob Marley
I used Spotipy to get other relevant music data for the scraped songs. Spotipy is a lightweight Python library for the Spotify Web API, and it allows full access to all of the music data provided by the Spotify platform.
I fetched details like track popularity (score between 0–100), artist URL, artist popularity (0–100), artist followers, and artist genre from the track URL.
For each song, I also extracted audio features using Spotipy.
Brief description of audio features:
Values of all the features except loudness and tempo lie in the range 0–1.
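A minimal sketch of how these look-ups could be done with Spotipy. The helper names `fetch_track_details` and `make_client`, the returned dictionary keys, and the choice to look up only the first listed artist are my assumptions for illustration. `track`, `artist` and `audio_features` are Spotipy client methods, and the client is assumed to read its credentials from the `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` environment variables.

```python
def fetch_track_details(sp, track_url):
    """Collect track, artist and audio-feature fields for one track.

    `sp` is expected to be a spotipy.Spotify client. Hypothetical helper:
    only the first listed artist is looked up, which is an assumption.
    """
    track = sp.track(track_url)
    primary_artist = track["artists"][0]
    artist = sp.artist(primary_artist["id"])
    # audio_features takes a list of tracks and may return None for a track
    features = sp.audio_features([track_url])[0] or {}
    return {
        "track_popularity": track["popularity"],        # 0-100
        "artist_url": primary_artist["external_urls"]["spotify"],
        "artist_popularity": artist["popularity"],      # 0-100
        "artist_followers": artist["followers"]["total"],
        "artist_genres": artist["genres"],
        "audio_features": features,
    }


def make_client():
    """Build a client using credentials from the environment (assumed setup)."""
    import spotipy
    from spotipy.oauth2 import SpotifyClientCredentials

    return spotipy.Spotify(auth_manager=SpotifyClientCredentials())
```

With valid credentials, `fetch_track_details(make_client(), track_url)` would return one dictionary per scraped song URL, which can then be joined back to the chart data.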
Top 25 most featured artists:
- The plot shows the number of times an artist has been featured in the top 200 list across weeks in all the countries combined.
- The featured count on the y-axis is in hundreds, e.g. Ed Sheeran has been featured 48,000 times.
- The number of unique songs by a particular artist featured in the top 200 list and in the top 30 (calculated from song position) is also plotted on the y-axis.
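The counts behind a plot like this can be sketched with a pandas group-by. The toy data and the column names (`artist`, `song`, `country`, `week`, `position`) are assumptions standing in for the scraped dataset.

```python
import pandas as pd

# Toy stand-in for the scraped chart rows (one row per song, country and week).
charts = pd.DataFrame({
    "artist":   ["A", "A", "A", "B", "B", "C"],
    "song":     ["a1", "a1", "a2", "b1", "b1", "c1"],
    "country":  ["us", "gb", "us", "us", "us", "in"],
    "week":     ["w1", "w1", "w1", "w1", "w2", "w1"],
    "position": [5, 40, 120, 10, 31, 200],
})

top30 = charts[charts["position"] <= 30]

summary = pd.DataFrame({
    # every chart row counts as one "feature" for the artist
    "times_featured": charts.groupby("artist").size(),
    "unique_songs": charts.groupby("artist")["song"].nunique(),
    "unique_songs_top30": top30.groupby("artist")["song"].nunique(),
}).fillna(0).astype(int)  # artists with no top-30 song get 0

top_artists = summary.sort_values("times_featured", ascending=False).head(25)
print(top_artists)
```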
Here is an exclusive analysis for India since I’m Indian.
Top 25 genres plot:
- Each song is associated with an artist and each artist may have multiple genres.
- Top 25 genres in the 48.2k unique songs are plotted as shown below.
- It shows the number of songs and artists associated with a particular genre.
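Because one artist can carry several genres, the genre list has to be unpacked before counting. A small sketch of that step with `DataFrame.explode`, on toy data with assumed column names:

```python
import pandas as pd

# Toy data: each song inherits the genre list of its artist.
songs = pd.DataFrame({
    "song":   ["s1", "s2", "s3", "s4"],
    "artist": ["A", "A", "B", "C"],
    "artist_genres": [
        ["pop", "dance pop"], ["pop", "dance pop"], ["pop", "rap"], ["rap"],
    ],
})

# One row per (song, genre) pair.
per_genre = songs.explode("artist_genres").rename(columns={"artist_genres": "genre"})

genre_summary = (
    per_genre.groupby("genre")
    .agg(n_songs=("song", "nunique"), n_artists=("artist", "nunique"))
    .sort_values("n_songs", ascending=False)
    .head(25)  # top 25 genres
)
print(genre_summary)
```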
Artist Popularity Boxen Plot:
Boxen plot (an enhanced version of classic box-plot) of the distribution of artist popularity score in each genre is shown below:
- Median artist’s popularity (indicated by dark line in the middle) in the rap genre is the highest.
- Pop artists start with a good popularity score as compared with other genres.
- Latin artists have the widest range of popularity scores.
- Rock and EDM artists fail to attain a popularity score above 90.
- Trap artists’ popularity ranges from 52–93, and most of them lie on the higher side of this range, while in other genres the distribution is more balanced.
Let me share some interesting insights into some of the most popular genres.
Genre-wise Artist Followers Swarmplot:
In the below swarm plot, each point indicates an artist. I have labeled the most followed artists in each genre.
Most Featured Songs in Top 200 List:
Genre-wise no. of times a song is featured in top 30:
- Here each point indicates a song.
- Pop songs rank on top of the list.
- Only a few rock songs managed to stay in the top 30 for a long time.
My Favourite Genres:
The pie chart of the top 12 genres of my saved songs in Spotify is shown below. Check this link to analyze yours. Most of the songs that I listen to are from the dance-pop genre.
Let’s explore the top artists in some major genres, including my favorite ones. These are word clouds of the most featured artists in the mentioned genres.
In the process, I can explore the most popular tracks from the top artists in my favorite genres and expand my playlist.
Audio Feature Correlation Analysis:
Now, let’s talk about the audio features that I extracted.
The Pearson correlation coefficient (r) is a measure of the strength of a linear association between two variables. It can take a range of values from +1 to -1. A value of 0 indicates no linear association. A value greater than 0 indicates a positive association; that is, as the value of one variable increases, so does the value of the other variable. A value less than 0 indicates a negative association.
This is the correlation heatmap among audio features.
- A positive correlation is indicated by a blue shade and a negative correlation by a red shade.
- The stronger the association, the deeper the shade.
- Loudness and energy have a strong positive association which makes sense.
- Acousticness and energy have a stronger negative association in comparison with other pairs.
- Valence has a slight positive correlation with danceability, energy, and loudness.
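A heatmap like the one described above can be sketched as follows. The feature values are synthetic (generated so that loudness rises and acousticness falls with energy), so the numbers are illustrative only. The `RdBu` colormap maps negative values to red and positive values to blue.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(42)
n = 500
energy = rng.uniform(0, 1, n)

# Synthetic audio features, not real Spotify data.
features = pd.DataFrame({
    "energy": energy,
    "loudness": -20 + 15 * energy + rng.normal(0, 2, n),
    "acousticness": (1 - energy + rng.normal(0, 0.2, n)).clip(0, 1),
    "valence": rng.uniform(0, 1, n),
})

# Pairwise Pearson r between all feature columns.
corr = features.corr(method="pearson")

# Fix the colour scale to [-1, 1] so that white sits at r = 0.
ax = sns.heatmap(corr, cmap="RdBu", vmin=-1, vmax=1, annot=True, fmt=".2f", square=True)
```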
Audio Feature Joint KDE Plot:
We can further analyze the pairs, say energy and loudness with the Joint kernel density estimate plot as shown below.
- Both have a positive association with the Pearson r-value of 0.73.
- Distribution of both these features is left-skewed, which means the majority of the songs have a higher value of these features. Please note that axis labels do not start from 0.
- The majority of songs have a loudness value close to -6 dB and an energy value close to 0.7, as shown by the darker contour shade.
Audio Feature vs Genre Bubble Plot:
Here is another view to look at the data. In the below bubble plot, I have plotted the artists from the 3 mentioned genres with their average tempo on the x-axis and average speechiness on the y-axis. The size of the bubble is defined as their popularity score.
- Speechiness is a good criterion to separate latin songs from other genres.
- Popularity of top rap artists is more than that of others.
- Most popular rap artists have a very defined range of avg tempo in their songs (around 120 bpm).
Audio Feature Distribution Analysis
Analysis of the distribution of a few audio features in various genres.
- Hip hop, rap, and trap rank high in speechiness.
- Rock songs have the lowest speechiness.
- EDM songs have a very well defined tempo range.
- Usually, tempo across all the genres ranges between 90–160 bpm.
Audio Feature vs Genre Radial Heatmap:
In the below genre-feature radial heatmap, I have taken the average feature value of all the songs in each genre after standardization.
The greener the shade, the higher the average feature value, e.g. rock ranks highest in instrumentalness.
To make better sense of the data, I have highlighted the top 7 genres (out of 21 genres) in each concentric circle, i.e. in each feature. I did that by labeling the values above the 66th percentile in each feature as 1 and the others as 0, and then created the heatmap.
If we look at the concentric circle labeled as valence, the top 7 genres highest in average valence score are dance-pop, dutch hip hop, reggaeton, dutch urban, latin pop, tropical and latin.
My fav genre, dance-pop, is not in the top 7 genres in danceability, despite what its name suggests. 😅
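The standardize, average and threshold steps described above can be sketched in pandas. The data is random and the genre and column names are placeholders. With 21 genres, the 66th percentile leaves exactly 7 genres flagged per feature.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
feature_cols = ["danceability", "energy", "valence"]
genres = [f"genre_{i:02d}" for i in range(21)]  # placeholder genre names

# Random stand-in for per-song audio features, 100 songs per genre.
songs = pd.DataFrame(rng.uniform(0, 1, size=(2100, 3)), columns=feature_cols)
songs["genre"] = np.repeat(genres, 100)

# Standardize each feature (z-score), then average within each genre.
standardized = (songs[feature_cols] - songs[feature_cols].mean()) / songs[feature_cols].std()
genre_means = standardized.groupby(songs["genre"]).mean()

# 1 where a genre sits above the 66th percentile of that feature, else 0.
threshold = genre_means.quantile(0.66)
top_flags = (genre_means > threshold).astype(int)

print(top_flags.sum())  # number of highlighted genres per feature
```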
Country Level Analysis:
Let’s talk about countries. I took a sample of some top musical countries.
List of top 10 genres in each country:
If we look at the chart below (the whole chart contains 64 bars, one for each possible number of countries), out of 48.2k songs, 34.7k have been featured in a single country and 4.3k in 2 countries.
There are 32 songs that have been featured in all 64 countries to date. In the top 30, there is only 1 song that has been featured in all the countries.
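These counts come down to asking, for each song, in how many distinct countries it appeared. A sketch on toy data with assumed column names:

```python
import pandas as pd

# Toy chart rows; a song can appear several times in the same country.
charts = pd.DataFrame({
    "song_url": ["u1", "u1", "u2", "u2", "u2", "u3", "u4"],
    "country":  ["us", "gb", "in", "in", "in", "us", "fr"],
    "position": [3, 50, 12, 8, 40, 150, 20],
})

# Number of distinct countries per song, over the full top 200.
countries_per_song = charts.groupby("song_url")["country"].nunique()

# How many songs reached exactly 1 country, 2 countries, and so on.
songs_by_reach = countries_per_song.value_counts().sort_index()

# Same count restricted to top-30 placements.
top30_reach = charts[charts["position"] <= 30].groupby("song_url")["country"].nunique()

print(songs_by_reach)
```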
List of songs featured in top 30 in maximum countries:
Regional Songs Proportion in a Country:
If we go a bit deeper and analyze these 34.7k songs featured exclusively in a country, we can get the sense of which countries’ regional songs dominate the top 200 list in that country.
Italy, India and France rank high in the list of countries where regional songs dominate. If we look at the top genres in these countries most of them are regional.
In Italy, the top 6 genres include regional ones like Italian hip hop and Italian pop. It’s surprising to see a significantly smaller number of songs in India, probably because when we get obsessed with a song we do not allow other songs to get featured in the list for a long time. 😜
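One way to quantify this is the share of a country's charted songs that never charted anywhere else. This definition of "regional share" is my assumption for illustration and may differ from the measure behind the chart. The data and column names are toy values.

```python
import pandas as pd

charts = pd.DataFrame({
    "song_url": ["u1", "u1", "u2", "u3", "u4", "u5", "u5"],
    "country":  ["it", "fr", "it", "it", "fr", "in", "in"],
})

# One row per distinct (song, country) pair.
pairs = charts.drop_duplicates(["song_url", "country"])

# A song is "exclusive" if it charted in exactly one country.
n_countries = pairs.groupby("song_url")["country"].transform("nunique")
pairs = pairs.assign(exclusive=n_countries.eq(1))

# Fraction of each country's charted songs that are exclusive to it.
regional_share = pairs.groupby("country")["exclusive"].mean().sort_values(ascending=False)
print(regional_share)
```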
Country-Specific Music Preferences:
- Spain ranks high in valence. Latin and reggaeton are the top 2 genres in Spain (see the sheet above) and both these genres rank high in valence as shown in the genre-feature heatmap.
- Similarly for France, french hip hop and pop urbaine are the 2 most prominent genres and both rank high in danceability, hence France ranks highest in danceability. Same case with acousticness and speechiness for France.
- India ranks high in acousticness. Acoustic music primarily uses instruments that produce sound through acoustic means, as opposed to electronic means.
- Japanese listeners prefer energetic and loud music because j-pop and j-rock are prominent genres. But do they prefer energetic songs in other genres?
The plot shows the distribution of audio feature ‘energy’ in each mentioned genre. On the left side, we have the distribution with the country selected as ‘global’ and on the right we have Japan.
In the above ‘split violin plot’, Japanese listeners’ taste for energetic music is evident in genres like latin, reggaeton and tropical when compared with the global distribution.
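A split violin plot of this kind can be drawn with seaborn's `violinplot` using `hue` and `split=True`. The energy values below are synthetic, generated with a higher centre for the `jp` group purely to mimic the pattern; the column names are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(3)

# Synthetic energy values per genre and region (not real chart data).
rows = []
for genre in ["latin", "reggaeton", "tropical"]:
    for region, centre in [("global", 0.65), ("jp", 0.80)]:
        energy = rng.normal(centre, 0.08, 150).clip(0, 1)
        rows.append(pd.DataFrame({"genre": genre, "region": region, "energy": energy}))
songs = pd.concat(rows, ignore_index=True)

# split=True draws the two hue levels as the left and right halves of one violin.
ax = sns.violinplot(data=songs, x="genre", y="energy", hue="region", split=True)

region_means = songs.groupby("region")["energy"].mean()
print(region_means)
```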
In the sheet below, I have taken 200 most followed artists on Spotify and ranked them in each audio feature on the basis of their average score of audio features across their songs. As we can see in acousticness, Arijit Singh, Armaan Malik and AR Rahman are in the top 15. Also, Neha Kakkar from India ranked no.1 in loudness. (I’m so proud 🤩)
So if you want to make a playlist to dance on, you can check out the above list of top artists in danceability.
Top artists in other features:
Pink Floyd, Coldplay and Khalid are top artists in instrumentalness. If you are looking for songs with positive vibes, you can check the list of artists in the valence column.
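A sketch of how such a ranking sheet could be built: keep the most followed artists, average each audio feature over their songs, and rank the artists within each feature. The data is a toy example and the column names are assumptions.

```python
import pandas as pd

tracks = pd.DataFrame({
    "artist":       ["A", "A", "B", "B", "C"],
    "followers":    [900, 900, 500, 500, 100],
    "acousticness": [0.2, 0.4, 0.9, 0.7, 0.5],
    "danceability": [0.8, 0.6, 0.3, 0.5, 0.9],
})

feature_cols = ["acousticness", "danceability"]
top_n = 2  # the article uses the 200 most followed artists

# Follower count per artist, then the top_n most followed.
followers = tracks.groupby("artist")["followers"].first()
most_followed = followers.nlargest(top_n).index

# Average feature value across each artist's songs.
artist_means = (
    tracks[tracks["artist"].isin(most_followed)]
    .groupby("artist")[feature_cols]
    .mean()
)

# Rank 1 = highest average value in that feature.
artist_ranks = artist_means.rank(ascending=False, method="min").astype(int)
print(artist_ranks)
```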
Country-wise list of top 10 most featured artists:
Most Followed Artists on Spotify:
Visualizing Data using t-SNE:
Let’s check if there is a similarity in the songs of a particular genre. Here is the t-SNE projection of 9-dimensional data (9 audio features) in 2D space. t-Distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear technique that maps the multi-dimensional data to a lower-dimensional space and attempts to find patterns in the data by identifying observed clusters based on the similarity of data points with multiple features.
I projected all the songs that belong to the genres: french hip hop, dance-pop and reggaeton. I ran the model for 10000 iterations with a perplexity value of 75.
As you can notice, t-SNE has tried to separate the different points and form clustered groups of similar points.
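A minimal sketch of such a projection with scikit-learn's `TSNE`, using the perplexity of 75 mentioned above. The 9-dimensional input is synthetic (three Gaussian blobs standing in for the three genres). The iteration count is left at the library default because the name of that argument has changed between scikit-learn releases; the article's run used 10000 iterations.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_per_genre, n_features = 100, 9

# Synthetic stand-in for 9 audio features: one Gaussian blob per genre.
centres = rng.normal(0, 3, size=(3, n_features))
X = np.vstack([c + rng.normal(0, 1, size=(n_per_genre, n_features)) for c in centres])
genre_labels = np.repeat(["french hip hop", "dance-pop", "reggaeton"], n_per_genre)

# Standardize so that no single feature dominates the distances.
X_std = StandardScaler().fit_transform(X)

# Perplexity must be smaller than the number of samples (300 here).
tsne = TSNE(n_components=2, perplexity=75, init="pca", random_state=0)
embedding = tsne.fit_transform(X_std)

print(embedding.shape)
```

The two columns of `embedding` can then be scatter-plotted and coloured by `genre_labels` to see whether the genres separate.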
Audio Feature Radar Chart:
Here is the radar chart for the 3 mentioned genres. We can compare their audio features e.g. french hip hop has the highest speechiness value among them. The audio features were standardized before plotting.
Genre Classification with Logistic Regression and Linear SVM:
Let’s look at how well a machine learning algorithm performs genre classification using audio features. I took all the songs from 3 genres (dance-pop, french hip hop and reggaeton) and applied Logistic Regression with a OneVsRest classifier as well as a linear SVM.
I trained each model on a randomly selected 70% of the data (training data) and tested its accuracy on the remaining 30% (test data). The hyperparameter alpha was tuned with 5-fold cross-validation (cv=5) on the training data.
The logistic regression model worked slightly better with an accuracy of 70.1%. Micro-averaged F1-score is taken as the performance metric. I am going to cover the details of this topic in the next blog. The conclusion that we can draw here is that audio features of a track are important attributes to classify a song in its respective genre.
Code snippet:
# Assumes `standardized_data` (scaled audio features) and `labels` (genres) are already defined.
from sklearn import metrics
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.multiclass import OneVsRestClassifier

# 70/30 train/test split
x_train, x_test, y_train, y_test = train_test_split(standardized_data, labels, test_size=0.3, random_state=0)

# Applying Logistic Regression with OneVsRest Classifier
# Hyperparameter tuning on alpha with 5-fold cross-validation on the training data
# Note: newer scikit-learn releases name this loss 'log_loss'; older ones use 'log'.
param_grid = {"estimator__alpha": [10**-5, 10**-3, 10**-1, 10**1, 10**2]}
classifier = OneVsRestClassifier(SGDClassifier(loss='log_loss', penalty='l1', max_iter=10000), n_jobs=-1)
model = GridSearchCV(classifier, param_grid, scoring='f1_micro', cv=5, n_jobs=-1)
model.fit(x_train, y_train)
predictions = model.predict(x_test)
print(model.best_params_)
print('accuracy:', round(metrics.accuracy_score(y_test, predictions), 3))
print('precision recall report:', metrics.classification_report(y_test, predictions))

# Applying Linear SVM (hinge loss) with hyperparameter tuning on alpha
param_grid = {"estimator__alpha": [10**-5, 10**-3, 10**-1, 10**1, 10**2]}
classifier = OneVsRestClassifier(SGDClassifier(loss='hinge', penalty='l1', max_iter=10000), n_jobs=-1)
model = GridSearchCV(classifier, param_grid, scoring='f1_micro', cv=5, n_jobs=-1)
model.fit(x_train, y_train)
predictions = model.predict(x_test)
print(model.best_params_)
print('accuracy:', round(metrics.accuracy_score(y_test, predictions), 3))
print('precision recall report:', metrics.classification_report(y_test, predictions))
Thank you for reading! I hope you enjoyed the article. If you want to keep up to date with my articles then follow me.
Happy Listening! 🎧
About the author, Apratim Sahu:
Growth Hacker, B.Tech M.Tech IIT Kharagpur, Photographer
Passionate about Computer Vision and AI
LinkedIn: linkedin.com/in/apratim24