Parallel plots are useful for understanding the connections in a data set. In this post, we will demonstrate how the ggplot2 and ggforce packages can be combined to create parallel set plots, an extension of parallel plots.
Parallel set plots depict the proportional flow of information through a system. They can also be thought of as a guide to how the features in our data set are connected. The key advantage of parallel set plots is the “proportional” component: the width of each flow reflects the number of observations it represents.
You may have seen a parallel coordinate plot, which can be thought of as the foundation on which the parallel set plot is built. Parallel coordinate plots depict the relationship of the features in the data without accounting for the proportion each feature represents. Parallel coordinate plots are great for comparing continuous data, while parallel set plots are better suited to categorical data.
Creating a Parallel Coordinate Plot with ggplot2 and ggforce
Using the mtcars data set that ships with R, we will create a parallel coordinate plot. This plot is useful for visualizing how groups within the data compare across the features. We first load the data and reshape it into long form (one row per vehicle and feature) using the tidyverse package:
> library(tidyverse)
> #call the mtcars data
> mydata <- mtcars
> mydata <- rownames_to_column(mydata, var = 'vehicle')
> #reshape to long form: one row per vehicle and feature
> mydata <-
+ mydata %>% gather(key = 'key', value = 'value', -vehicle)
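To make the reshaping step concrete, here is a minimal sketch of the same two calls applied to only the first two rows and two columns of mtcars. The names toy and toy_long are illustrative and not part of the original code.

```r
library(tibble)
library(tidyr)

# First two vehicles and two features of mtcars, to keep the output small
toy <- mtcars[1:2, c('mpg', 'cyl')]
toy <- rownames_to_column(toy, var = 'vehicle')

# Every column except vehicle is stacked into key/value pairs
toy_long <- gather(toy, key = 'key', value = 'value', -vehicle)
print(toy_long)
```

Each vehicle now appears once per feature, which is the shape ggplot2 needs in order to draw one line per vehicle across the features.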
We then filter on a subset of the vehicles to make the parallel coordinate plot easier to read. We also log-transform the values so that features measured on very different scales can be compared on a single axis.
> #parallel coord
> mydata %>% filter(vehicle %in% c('Maserati Bora', 'Pontiac Firebird', 'Camaro Z28',
+ 'Toyota Corolla', 'Honda Civic')) %>%
+ arrange(key) %>% mutate(value = log(value)) %>%
+ ggplot(aes(x = key, y = value, colour = vehicle, group = factor(vehicle))) +
+ geom_path(position = 'identity')
This parallel coordinate plot allows us to visualize how the vehicles compare across the features. We can easily see that the Honda Civic and Toyota Corolla seem to be in a different group than the other vehicles.
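One detail of the log transform is worth checking: the 0/1 features in mtcars (vs and am) contain zeros, and the log of zero is -Inf. As a possible alternative, not the method used in this post, the sketch below rescales each feature to the range [0, 1]. The helper rescale01 is a hypothetical name introduced here for illustration.

```r
# log() of a zero is -Inf, which affects the 0/1 features vs and am
log(0)

# rescale01 is a hypothetical helper: min-max rescaling to the range [0, 1]
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))

# Rescale every mtcars feature column independently, keeping the row names
scaled <- mtcars
scaled[] <- lapply(mtcars, rescale01)
range(scaled$mpg)
```

In the long-form pipeline, the same idea could be applied by grouping on key before the mutate step, for example group_by(key) %>% mutate(value = rescale01(value)), so that each feature is rescaled on its own.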
The Parallel Set Plot
If we want to visualize categorical data instead, we can build a parallel set plot. Parallel set plots, alluvial plots and Sankey diagrams are all very similar in their presentation. We will focus on parallel set plots. In this case we use the Adult data set from the UCI Machine Learning Repository. We load our data and define a negated version of the %in% operator to help with filtering.
> #parallel set plot
> `%not_in%` <- Negate(`%in%`) #not in function
> adults <- read.csv('adult_csv.csv')
> adults <- adults[c(2,4,6,7,9,10,15)] #select sub set of features
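As a quick check of how the negated operator behaves, the following sketch applies both operators to a small character vector. The vector x is made up for illustration.

```r
# Define the negated operator, as in the post
`%not_in%` <- Negate(`%in%`)

x <- c('a', 'b', 'c')
x %in% c('a', 'c')      # TRUE FALSE TRUE
x %not_in% c('a', 'c')  # FALSE TRUE FALSE
```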
We then clean our data by collapsing every education level that is not in our list into a single less_than_HS category and by removing rows with a blank workclass.
> #clean up the education levels
> education_levels <-
> adults$education <- as.character(adults$education)
> adults$education[which(adults$education %not_in% education_levels)] <- 'less_than_HS'
> adults <- adults[which(adults$workclass != ''),]
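The cleaning step can be sketched on a toy data frame. Both the rows and the keep_levels vector below are hypothetical and stand in for the real data and the list of education levels to keep.

```r
`%not_in%` <- Negate(`%in%`)

# Hypothetical rows and a hypothetical set of levels to keep (assumptions)
toy <- data.frame(
  workclass = c('Private', '', 'State-gov', 'Private'),
  education = c('Bachelors', 'HS-grad', '9th', '11th'),
  stringsAsFactors = FALSE
)
keep_levels <- c('HS-grad', 'Bachelors')

# Collapse every education level outside keep_levels into one bucket
toy$education[toy$education %not_in% keep_levels] <- 'less_than_HS'

# Drop rows whose workclass is blank
toy <- toy[toy$workclass != '', ]
toy$education
```

Collapsing rare levels before plotting keeps the number of boxes on the education axis small, which makes the final plot easier to read.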
The parallel set plot depicts the aggregated totals of unique sequences through the data. To calculate these, we group by all features in the data and count how often each combination occurs. We also keep only the combinations that occur more than 10 times to make the plot more readable.
> #requires data.frame containing the frequency of some sequence
> adults <-
+ adults %>% group_by(workclass,education,marital.status,occupation,race,sex,class) %>%
+ summarise(freq = n()) %>% ungroup() %>% filter(freq > 10)
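The counting step can be illustrated on a toy data frame with two features. The rows are invented for illustration, and the threshold is lowered to 1 so that the filter has a visible effect.

```r
library(dplyr)

# Hypothetical observations, for illustration only
toy <- data.frame(
  sex   = c('Male', 'Male', 'Female', 'Male', 'Female'),
  class = c('>50K', '>50K', '<=50K', '<=50K', '<=50K')
)

# Count each unique combination, then drop the rare ones
toy_freq <- toy %>%
  group_by(sex, class) %>%
  summarise(freq = n()) %>%
  ungroup() %>%
  filter(freq > 1)
print(toy_freq)
```

Each remaining row is one path through the plot, and freq determines how wide that path is drawn.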
The parallel set functions in ggforce require the data to be in long form. We transform our data set to long form with gather_set_data and then use the geom_parallel_sets functions to generate the visual.
> library(ggforce)
> #gather the data.frame into long form
> adults <- gather_set_data(adults, 1:7)
> #plot parallel set
> ggplot(adults, aes(x, id = id, split = y, value = freq)) +
+ geom_parallel_sets(aes(fill = class), alpha = 0.3, axis.width = 0.2) +
+ geom_parallel_sets_axes(axis.width = 0.2) +
+ geom_parallel_sets_labels(colour = 'black', angle = 360, size = 3)
The form of the plot shares characteristics with the parallel coordinate plot. Immediately we see how the width of the stream connecting each feature provides a natural path for the eye to follow, which makes it easy to draw conclusions from the data. For example, we can see that a small number of male high school grads working in the private sector earn more than 50k per year.
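For readers who do not have the adult CSV file at hand, the same steps can be tried on the Titanic table that ships with R. This is a sketch under that substitution, not the data used in this post. The names titanic, titanic_long and p are illustrative.

```r
library(ggplot2)
library(ggforce)

# Titanic is a built-in contingency table; as.data.frame() yields
# one row per combination of Class, Sex, Age and Survived, plus Freq
titanic <- as.data.frame(Titanic)

# Long form: one copy of every row per feature, with id, x and y added
titanic_long <- gather_set_data(titanic, 1:4)

# Same layers as the adult plot, with Freq as the flow width
p <- ggplot(titanic_long, aes(x, id = id, split = y, value = Freq)) +
  geom_parallel_sets(aes(fill = Sex), alpha = 0.3, axis.width = 0.2) +
  geom_parallel_sets_axes(axis.width = 0.2) +
  geom_parallel_sets_labels(colour = 'black', angle = 0, size = 3)
print(p)
```

Because the table already holds counts, no group_by and summarise step is needed here. The Freq column plays the role of freq in the adult example.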
Parallel plots are great for visualizing the relationships of features in a data set. Another application could be in visualizing model performance across a series of models. This could aid in the selection of the best model for a given data set.