We all believe that data science is a strong asset for gaining crucial insights from business data.
However, I still find that many people (including some who already do data science professionally) lack a clear picture of how data science provides these insights.
This is because there is still a great barrier to understanding, or even believing, how these insights are derived.
Data science is not machine learning.
Data science is the ability to generate data-driven business value.
Data science is a marriage of art with science in the sense that data scientists must be able to express their understanding through a visualized data story.
As we learn to tell this story well, we see that the scientific and business benefits follow.
Data visualizations are a great way to present your data so that it is easy to understand.
There is a plethora of data visualization techniques, such as graphs, box plots, bar plots, and histograms, to name a few.
A key skill that all great data scientists possess is knowing which visualization technique best suits what they are trying to show.
In this blog post, we will shift our attention to time-series data.
We will answer the following questions:
1. How can we best illustrate time-series data?
2. How can we easily develop it in Python?
Visualizing Time-Series Data
There are multiple ways to visually present time-series data.
Traders (and anyone who has dabbled with crypto) will most definitely be familiar with the scatter plot, or its candlestick chart variation, as a means to depict time series.
Despite its ability to show how our target value changes over time, this type of plot can become really complex really fast.
Firstly, the longer the data range, the more difficult it becomes to show the entire data.
Secondly, comparing multiple target values over the same time also becomes tedious.
Sure, we can get away with comparing 2, maybe 3, targets. Anything beyond that, and the visualization becomes too cluttered.
A cleaner way to achieve this is to bring our static plots to life through animation.
Animation is a very powerful tool for presenting your data, especially when time series are involved.
Animated visualizations let you clearly see patterns in your data that you might otherwise have overlooked in a static chart.
These types of plots, most commonly referred to as data-races, have increased in popularity over the years.
There is even a YouTube channel dedicated to uploading only data-races.
In this blog post, you will learn how to create your first animated data visualization in Python, which will give you a better understanding of how to work with time-series data.
A mixture of pandas, matplotlib, and bar_chart_race is used to create this visualization.
Creating a Data Race Visualisation using Python
As an example, we will be using the Earth Surface Temperature Data dataset from Kaggle. This dataset provides an Earth surface temperature reading by country per month.
The dataset contains 3239 monthly readings for 243 different countries.
The raw dataset has a shape of 577462 rows by 4 columns.
Before we can generate the visualization, we need to transform our dataset to a wide format.
Wide format means that every row represents one time element, while every column holds the values of a specific target.
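To make the wide format concrete, here is a minimal sketch on a toy table. The country names and temperature values are made up for illustration; only the column names (`dt`, `Country`, `AverageTemperature`) mirror the ones used later in this post.

```python
import pandas as pd

# Toy long-format data: one row per (date, country) reading.
# The values are invented for illustration only.
long_df = pd.DataFrame({
    "dt": ["2000-01-01", "2000-01-01", "2000-02-01", "2000-02-01"],
    "Country": ["Malta", "Norway", "Malta", "Norway"],
    "AverageTemperature": [12.5, -4.0, 12.9, -3.2],
})

# Wide format: one row per date, one column per country.
wide_df = long_df.pivot(index="dt", columns="Country", values="AverageTemperature")
print(wide_df)
```

Each row of `wide_df` is now a single point in time, and each column is one target whose value we can follow from row to row.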
Our dataset has the following structure:
Earth Surface Temperature Data dataset structure. Image by author.
Step 1: Decide which metric to focus on.
In this case, we will visualize the AverageTemperature column.
Step 2: Transform the dataset to a wide format.
Our columns should be the names of the different countries, and their values should reflect their average temperature reading for that time element.
This step can easily be done using a pivot table.
```python
import pandas as pd

df = pd.read_csv('GlobalLandTemperaturesByCountry.csv')
df = pd.pivot(df, index='dt', columns='Country', values='AverageTemperature')
```
This creates a table with 3239 rows and 243 columns (belonging to the 243 countries available in the dataset).
The next step is to fill in any missing values. When creating our animation, it is important that we always have some value to show. Any NULL values will break the aesthetics of our animation.
In our case, the dataset does suffer from missing values, especially for the early years. Our filling strategy will be to forward-fill our missing recordings (use the last valid reading for the missing one).
We also want to drop all countries from our dataset which do not have any recordings.
For the sake of simplicity and performance, I will also limit our data to start from the year 2000 onwards. If you are following along, you absolutely do not have to do this step.
We can easily do this in Python using Pandas.
```python
# forward-fill missing readings with the last valid one
# (fillna(method='ffill') is deprecated in recent pandas versions)
df.ffill(axis=0, inplace=True)

# get the row index of January 2000:
# 256 years of monthly rows from January 1744 onwards,
# plus 2 rows because our data starts from 1743-11
total_month_difference: int = ((2000 - 1744) * 12) + 2

# get all rows from the year 2000 onwards
filtered_df: pd.DataFrame = df.iloc[total_month_difference:].copy()

# drop all countries without any data
filtered_df.dropna(axis=1, how='all', inplace=True)

print(filtered_df.shape)  # (165, 242)
```
Our dataset now has 165 rows and 242 columns, and it is ready to be animated!
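The row offset used to filter from the year 2000 depends on the table starting in November 1743 with no missing months. As a hedged alternative, the sketch below runs the same three steps (forward-fill, filter, drop empty columns) on a toy table, but slices by date label instead of by position. The values and the `Empty` column are invented for illustration, and the slice assumes the index holds ISO-formatted date strings in chronological order, as the `dt` column does.

```python
import numpy as np
import pandas as pd

# Toy wide-format table with made-up values; "Empty" never has a reading.
wide_df = pd.DataFrame(
    {
        "Malta": [12.0, np.nan, 13.0, np.nan],
        "Norway": [np.nan, -4.0, np.nan, -3.0],
        "Empty": [np.nan, np.nan, np.nan, np.nan],
    },
    index=["1999-11-01", "1999-12-01", "2000-01-01", "2000-02-01"],
)

# Carry the last valid reading forward; leading gaps stay missing.
filled = wide_df.ffill()

# Label-based slice: keep every row from January 2000 onwards.
recent = filled.loc["2000-01-01":]

# Drop the countries that have no readings at all.
cleaned = recent.dropna(axis=1, how="all")
print(cleaned)
```

Note that forward-filling only helps once a country has at least one earlier reading, which is why filling before filtering keeps more usable values than filtering first.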
Step 3: Visualise it!
One of the main reasons why I love Python is that someone, somewhere, has already created a package to solve our task using a few lines of code.
This is the case for data race visualizations as well.
We can install the package either using pip or conda, as follows:
pip install bar_chart_race
conda install -c conda-forge bar_chart_race
After installation, we can import the package into our Python module and initialize the visualization.
```python
import random

import bar_chart_race as bcr
from matplotlib import rcParams

# update font family for use with non-Latin country names
rcParams['font.sans-serif'] = ['Source Han Sans TW', 'sans-serif']

# pick 20 random countries to keep the chart readable
available_countries: list[str] = filtered_df.columns.tolist()
random.shuffle(available_countries)
random_20_countries: list[str] = available_countries[:20]

bcr.bar_chart_race(
    df=filtered_df[random_20_countries],
    filename=None,  # None returns the animation as HTML; pass a file name to save the video locally
    figsize=(5, 3),
    dpi=144,
    title='Global Land Temperatures By Country')
```
And the result? (Note: I had to convert the result to a GIF so that Medium would support it, which left some of the scales looking odd.)
And there you have it: your first animated plot using Python!
I encourage you to go over the documentation of this package to familiarise yourself with all the extra functionality.
Full source code:
```python
import random

import bar_chart_race as bcr
import pandas as pd
from matplotlib import rcParams

# update font family for use with non-Latin country names
rcParams['font.sans-serif'] = ['Source Han Sans TW', 'sans-serif']

df = pd.read_csv('GlobalLandTemperaturesByCountry.csv')
df = pd.pivot(df, index='dt', columns='Country', values='AverageTemperature')

# forward-fill missing readings with the last valid one
df.ffill(axis=0, inplace=True)

# get the row index of January 2000:
# 256 years of monthly rows from January 1744 onwards,
# plus 2 rows because our data starts from 1743-11
total_month_difference: int = ((2000 - 1744) * 12) + 2

# get all rows from 2000 onwards
filtered_df: pd.DataFrame = df.iloc[total_month_difference:].copy()

# drop all countries without any data
filtered_df.dropna(axis=1, how='all', inplace=True)

available_countries: list[str] = filtered_df.columns.tolist()
# in-place shuffle of the countries
random.shuffle(available_countries)
random_20_countries: list[str] = available_countries[:20]

bcr.bar_chart_race(
    df=filtered_df[random_20_countries],
    filename='bcr_land_temp.mp4',
    figsize=(5, 3),
    dpi=144,
    title='Global Land Temperatures By Country')
```
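The full script picks its 20 countries with an unseeded shuffle, so every run animates a different set. If you would rather get the same chart each time, one possible approach is sketched below. The helper `pick_countries` and its `seed` parameter are hypothetical names introduced for this example and are not part of bar_chart_race or the original script.

```python
import random

import pandas as pd


def pick_countries(df: pd.DataFrame, n: int = 20, seed: int = 42) -> list[str]:
    """Return a reproducible random sample of up to n column names."""
    # A local generator keeps the global random state untouched.
    rng = random.Random(seed)
    columns = df.columns.tolist()
    return rng.sample(columns, k=min(n, len(columns)))


# Toy wide table with eight placeholder "countries".
toy = pd.DataFrame({name: [0.0] for name in "ABCDEFGH"})

first = pick_countries(toy, n=3)
second = pick_countries(toy, n=3)
print(first == second)  # the same seed gives the same selection
```

The returned list can be used in place of `random_20_countries`, for example `filtered_df[pick_countries(filtered_df)]`.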
Article originally posted here by David Farrugia. Reposted with permission.