Data Visualization – Part 3 Data Visualization – Part 3
What Type of Data Visualization Do You Choose (if any)? Determining whether or not you need a visualization is step one. While... Data Visualization – Part 3

What Type of Data Visualization Do You Choose (if any)?

Determining whether or not you need a visualization is step one. While it seems silly, this is probably something everyone (including myself) should be doing more often. A lot of times, it seems like a great way to showcase the amount of work you have been doing, but winds up being completely ineffective and could potentially harm what you’re doing. Once you determine that you actually need to visualize your data, you should have a rough idea of the options to look at. This post will explain and demonstrate some of the common types of charts and plots.

This is Part 3 in a series about Data visualization:


Determine whether or not you actually need a visualizatoin in the first place.

Like the best practices I listed in Data Visualization – Part 1, make sure your visualizations:

  • Are clearly illustrating a relevant point
  • Are tailored to the appropriate audience
  • Are tailored to the presentation medium
  • Are memorable to those who care about the material
  • Are increasing the understanding of the subject matter

If these don’t seem possible, you probably don’t need a data visualization.

If you do need one, what’s a good first step to take?

Take a look at the forum in which you’re presenting, it matters! If you are writing for a scientific journal, it will be different than presenting live to a thousand person audience. Think about a Ted Talk compared to the Journal of Physics.

Point being: consider your audience!

Let’s talk about a high-level presentation. Everyone has seen a slideshow with fancy charts that add zero value. Do not be the person presenting something that way! Providing useless content will confuse the audience and/or lead to boredom.

If your point is to show year-over-year change of a single metric – show it as a simple number on the page in big bold font rather than a chart.

In this made up example, I am displaying revenue over the last few years (note: be more specific when it comes to what type of revenue you’re talking about).

Which of the following makes more sense to put on a slide?

plot of chunk unnamed-chunk-3

If you agree with me, the one on the right will be much easier for people to understand in a presentation. It gets the point across without requiring processing which will allow people to focus on what is important. Any additional nuggets you would like to point out can be spoken to.

Now, let’s talk about publishing content that isn’t for academic use but will reach the public (i.e. newspapers, magazines, blogs, etc.). These types of charts can cover a wide range of topics so we’ll have to stick to the basics. We’re going to look at displaying information which is interesting and adds value.

Here is a great example from Junk Charts in which the author of the original Daily Kos Article is showing a type of “lie detector” chart. The chart does a number of things well: it illustrates a relevant point, it is appropriate to the audience and medium, and really helps to understand the subject matter better. However, the original chart is too colorful which takes away from its effectiveness. Junk Charts took it to the next level by simplifying the colors and axes.

Original Version (Daily Kos)

plot of chunk unnamed-chunk-4

Modified Version (Junk Charts)

plot of chunk unnamed-chunk-5

By merely looking at this chart you can see how it is ranked, a sense of scale, the comparison between people, and clearly labeled names. Fantastic work!

Rather than going over more examples of work others are doing, please visit Chart Porn (don’t worry about the name, it’s a great data visualization site) and Junk Charts. They have phenomenal examples of what to do (and what not to do) when publishing to the public.

You have a point, now what?

There is no rulebook as to how to display your data. However, as you have seen, there are both great and poor options. The choice is up to you – so think long and hard before making a decision (and you can always try a number of them out on people before publishing).

Ask yourself the following questions to help drive your decision:

  • Are you making a comparison?
  • Are you finding a relationship?
  • Are you showing a distribution?
  • Are you finding a trend over time?
  • Are you showing composition?

Once you know which question you are asking, it will keep your mind focused on the outcome and will quickly narrow down your charting options.

Rule of Thumb

  • Trend: Column, Line
  • Comparison: Area, Bar, Bullet, Column, Line, Scatter
  • Relationship: Line, Scatter
  • Distribution: Bar, Boxplot, Column
  • Composition: Donut, Pie, Stacked Bar, Stacked Column

Obviously, there are plenty of choices beyond these, so don’t hesitate to use what works best. I will go over some of these basics and show some comparisons of poor charting techniques vs. slightly better ones.

For this project, I’ll use some oil production data that I found while digging through http://data.world (pretty great site). The data can be found here

Let’s load up some libraries and get started.

#Custom data preparation
#GitHub (linked to at bottom of this post)
data = getData()
##   Location Month Year ThousandBarrel       Date
## 1  Alabama   Mar 2013            883 2013-03-01
## 2  Alabama   Apr 2013            844 2013-04-01
## 3  Alabama   May 2013            878 2013-05-01
## 4  Alabama   Feb 2013            809 2013-02-01
## 5  Alabama   Mar 1982           1687 1982-03-01
## 6  Alabama   Apr 1982           1567 1982-04-01

Trend – Line Chart

Objective: Visualize a trend in oil production in the US from 1981 – 2016 by year. I want to illustrate the changes over the time period. This is a very high-level view and only shows us a decline followed by a ramp up at the end of the period.

Poor Version

The x-axis is a disaster and the y-axis isn’t formatted well. While it gets the point across, it’s still worthless.

df = data %>% 
  group_by(Year) %>%
  summarise(ThousandBarrel = sum(ThousandBarrel))
p = ggplot(df,aes(x=Year,y=ThousandBarrel,group=1)) 
p + geom_line(stat='identity') + 
  ggtitle('Oil Production Over Time') + 
  theme(plot.title = element_text(hjust = 0.5),plot.subtitle = element_text(hjust = 0.5)) + 
  xlab('') + ylab('')

plot of chunk unnamed-chunk-7

Better Version

The title gives us a much better understanding of what we’re looking at. The chart is slightly wider and the axes are formatted to be legible.

p = ggplot(df,aes(x=Year,y=ThousandBarrel,group=1)) 
p + geom_line(stat='identity') + 
  ggtitle('Thousand Barrel Oil Production By Year in the U.S.') +
  theme(plot.title = element_text(hjust = 0.5),plot.subtitle = element_text(hjust = 0.5)) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
  scale_y_continuous(labels = comma)

plot of chunk unnamed-chunk-8

Comparison – Line Chart

Objective: Identify which states affected the trend the most. Evaluate them simultaneously in order to paint the picture and compare their trends over the time period. From this visual you can see the top states are Alaska, California, Louisiana, Oklahoma, Texas and Wyoming. Texas seems to break the mold quite drastically and drove the spike which occurred after 2010.

Poor Version

There are far too many colors going on here. Everything at the bottom of the chart is relatively useless and takes our focus away from the big players.

df = data %>%
  group_by(Location, Year) %>%
  summarise(ThousandBarrel = sum(ThousandBarrel))
dfYear = as.numeric(dfYear)
p = ggplot(df,aes(x=Year,y=ThousandBarrel,col=Location))
p + geom_line(stat='identity') + 
  ggtitle(paste('Oil Production By Year By State in the U.S.')) + 
  theme(plot.title = element_text(hjust = 0.5)) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

plot of chunk unnamed-chunk-9

Better Version

This focuses attention on the top producing states. It compares them to each other and shows the trend per state as well. Using facet_wrap() tends to be used in what’s known as “small multiples” – this is a technique which helps to break up the visual components of the data into easy-to-understand pieces which make intuitive sense.

n=6 #Arbitrary at first, after trying a few, this made the most sense
topN = data %>%
  group_by(Location) %>%
  summarise(ThousandBarrel = sum(ThousandBarrel)) %>%
  arrange(-ThousandBarrel) %>%
df = data %>%
  filter(Location %in% topNLocation) %>%   group_by(Year,Location) %>%   summarise(ThousandBarrel = sum(ThousandBarrel))  dfYear = as.numeric(dfYear) dfLocation = as.factor(dfLocation)  p = ggplot(df,aes(x=Year,y=ThousandBarrel,group=1)) p + geom_line(stat='identity') +    ggtitle(paste('Top',as.character(n),'States - Oil Production By Year in the U.S.')) +    theme(plot.title = element_text(hjust = 0.5)) +    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +    facet_wrap(~Location) +    scale_y_continuous(labels = comma)</pre> <img src="https://i2.wp.com/www.stoltzmaniac.com/wp-content/uploads/2017/04/unnamed-chunk-10-1-4.png?zoom=2&w=456&ssl=1" alt="plot of chunk unnamed-chunk-10" width="456" height="319" data-lazy-loaded="true" />  <hr />  <h2>Relationship - Scatter Plot</h2> <strong>Objective</strong>: Check to see if data from Alaska and California is correlated. While this isn't extremely interesting, it does allow us to use this same data set (sorry). The charts indicate that there appears to be a strong positive correlation between the two states. <h4>Poor Version</h4> Lots of completely irrelevant data! The size of the point should have nothing to do with the year. <pre class="crayon-plain-tag">statesList = c('Alaska','California') df = data %>%   filter(Location %in% statesList) %>%   spread(Location,ThousandBarrel) %>%   select(Alaska,California,Month,Year)  p = ggplot(df,aes(x=Alaska,y=California,col=Month,size=Year)) p + geom_point() +    scale_y_continuous(labels = comma) +   scale_x_continuous(labels = comma) +   ggtitle('Oil Production - CA vs. AK') +    theme(plot.title = element_text(hjust = 0.5))</pre> <img src="https://i1.wp.com/www.stoltzmaniac.com/wp-content/uploads/2017/04/unnamed-chunk-11-1-4.png?zoom=2&w=456&ssl=1" alt="plot of chunk unnamed-chunk-11" width="456" height="319" data-lazy-loaded="true" /> <h4>Better Version</h4> The points are all the same size and a trend line helps to visualize the relationship. While it can sometimes be misleading, it makes sense with our current data. <pre class="crayon-plain-tag">df = data %>%   filter(Location %in% statesList) %>%   spread(Location,ThousandBarrel) %>%   select(Alaska,California,Year)  p = ggplot(df,aes(x=Alaska,y=California)) p + geom_point() +    scale_y_continuous(labels = comma) +   scale_x_continuous(labels = comma) +   ggtitle('Monthly Thousand Barrel Oil Production 1981-2016 CA vs. AK') +    theme(plot.title = element_text(hjust = 0.5)) +    geom_smooth(method='lm')</pre> <img src="https://i1.wp.com/www.stoltzmaniac.com/wp-content/uploads/2017/04/unnamed-chunk-12-1-4.png?zoom=2&w=456&ssl=1" alt="plot of chunk unnamed-chunk-12" width="456" height="319" data-lazy-loaded="true" /> <h2>Distribution - Boxplot</h2> <strong>Objective</strong>: Examine the range of production by state (per year) to give us an idea of the variance. While the sums and means are nice, it's quite important to have an idea of distributions. While it was semi-apparent in the line charts, the variance of Texas is huge compared to the others! <h4>Poor Version</h4> Alphabetical order doesn't add any value, names are overlapping on top of each other. While you can tell who the big players are, this visual does not add the value it should. <pre class="crayon-plain-tag">df = data %>%   group_by(Year,Location) %>%   summarise(ThousandBarrel = sum(ThousandBarrel))  p = ggplot(df,aes(x=Location,y=ThousandBarrel)) p + geom_boxplot() +    ggtitle('Distribution of Oil Production by State')</pre> <img src="https://i0.wp.com/www.stoltzmaniac.com/wp-content/uploads/2017/04/unnamed-chunk-13-1-4.png?zoom=2&w=456&ssl=1" alt="plot of chunk unnamed-chunk-13" width="456" height="319" data-lazy-loaded="true" /> <h4>Better Version</h4> This gives a nice ranking to the plot while still showing their distributions. We could take this a step further and separate out the big players from the small players (I'll leave that up to you). <pre class="crayon-plain-tag">p = ggplot(df,aes(x=reorder(Location,ThousandBarrel),y=ThousandBarrel)) p + geom_boxplot() +    scale_y_continuous(labels = comma) +   ggtitle('Distribution of Annual Oil Production By State (1981 - 2016)') +    coord_flip()</pre> <img src="https://i2.wp.com/www.stoltzmaniac.com/wp-content/uploads/2017/04/unnamed-chunk-14-1-4.png?zoom=2&w=456&ssl=1" alt="plot of chunk unnamed-chunk-14" width="456" height="319" data-lazy-loaded="true" /> <h2>Composition - Stacked Bar</h2> <strong>Objective</strong>: Check out the composition of total production by state. It's interesting to see how the composition was relatively similar across decades until the 2010's. Texas was 50% of the output! <h4>Poor Version</h4> My favorite, the beautiful pie chart! There's nothing better than this… (no need for further commentary). <pre class="crayon-plain-tag">df = data %>%   group_by(Location) %>%   summarise(ThousandBarrel = sum(ThousandBarrel)) %>%   mutate(ThousandBarrel = ThousandBarrel/sum(ThousandBarrel))  dfThousandBarrel = round(100*dfThousandBarrel,0)  library(plotrix) pie(x=dfThousandBarrel,labels=dfLocation,explode=0.1,col=rainbow(nrow(df)),main='Percentage of Oil Production by State')</pre> <img src="https://i1.wp.com/www.stoltzmaniac.com/wp-content/uploads/2017/04/unnamed-chunk-15-1-4.png?zoom=2&w=456&ssl=1" alt="plot of chunk unnamed-chunk-15" width="456" height="319" data-lazy-loaded="true" /> <h4>Better Version</h4> The 1980's and 2010's will be missing years in terms of a ``decade'' due to the data provided (and it's only 2017). While the percentage labels are slightly off center, it's certainly much better than the pie chart. It's not quite ``apples-to-apples'' for a comparison because I created different decades, but you get the idea.  I also created an ``Other'' category in order to simplify the output. When you are doing comparisons, it's typically a good idea to find a way to reduce the number of variables in the output while not removing data by dropping it completely - <strong>do this carefully and transparently!</strong> <pre class="crayon-plain-tag">dataDecade = '1980s'
dataDecade[dataYear >= 1990] = '1990s'
dataDecade[dataYear >= 2000] = '2000s'
dataDecade[dataYear >= 2010] = '2010s'
dataDecade = as.factor(dataDecade)
top5 = data %>%
  group_by(Location) %>%
  summarise(ThousandBarrel = sum(ThousandBarrel)) %>%
  arrange(-ThousandBarrel) %>%
  top_n(5) %>%
top5List = top5Location  dataState = "Other"
for(i in 1:length(top5List)){
  dataState[dataLocation == top5List[i]] = top5List[i]
df = data %>%
  group_by(Decade,State) %>%
  summarise(ThousandBarrel = sum(ThousandBarrel)) %>%
  mutate(ThousandBarrel = ThousandBarrel/sum(ThousandBarrel))
dfThousandBarrel = round(dfThousandBarrel,3)
dftext = paste(round(100*dfThousandBarrel,0),'%', sep='')
p = ggplot(df,aes(x=Decade,y=ThousandBarrel,col=reorder(State,ThousandBarrel),fill=reorder(State,ThousandBarrel)))
p + geom_bar(stat='identity') + 
  geom_text(aes(label=text),col='Black',size = 4, hjust = 0.5, vjust = 3, position = "stack") + 
  scale_y_continuous(labels = percent) +
  ggtitle('Percentage of Top Oil Producing States by Decade') + 
  guides(fill=guide_legend(title='State'),col=guide_legend(title='State')) + 
  theme(plot.title = element_text(hjust = 0.5))

plot of chunk unnamed-chunk-16

Some other fun concepts are below!

Some of them are nice, others are terrible! I won’t comment on any of them, but I felt it was necessary to include some other ideas I toyed around with.

Have fun with your data visualizations: be creative, think outside the box, use tools other than computers if it makes sense, fail often but learn quickly. I’m sure I’ll think of a thousand better ways to have illustrated the concepts in this post by tomorrow, so I’ll make updates as I think of them!

Now it’s your turn!

As always, the code used in this post is on my GitHub

df = data %>% 
  group_by(Location) %>%
  summarise(ThousandBarrel = sum(ThousandBarrel)) %>%
p = ggplot(df,aes(x=reorder(Location,ThousandBarrel),y=ThousandBarrel))
p + geom_bar(stat='identity') + 
  ggtitle('Oil Production 1981 - 2016 By Location') + 
  theme(plot.title = element_text(hjust = 0.5)) + 

plot of chunk unnamed-chunk-17

top10 = data %>%
  group_by(Location) %>%
  summarise(ThousandBarrel = sum(ThousandBarrel)) %>%
  arrange(-ThousandBarrel) %>%
## Selecting by ThousandBarrel
## # A tibble: 10 × 2
##       Location ThousandBarrel
##          <chr>          <dbl>
## 1        Texas       23447172
## 2       Alaska       15775279
## 3   California        9988225
## 4    Louisiana        4267246
## 5     Oklahoma        3701224
## 6      Wyoming        2894624
## 7       Kansas        1708873
## 8     Colorado        1288643
## 9         Utah         894657
## 10 Mississippi         861999
df = data %>% 
  group_by(Location,Year) %>%
  filter(Location %in% top10Location) %>%   summarise(ThousandBarrel = sum(ThousandBarrel))  p = ggplot(df,aes(x=Year,y=ThousandBarrel,col=Location,fill=Location)) p + geom_bar(stat='identity') +    ggtitle('Oil Production - Top 10 States') +    theme(plot.title = element_text(hjust = 0.5)) +    theme(axis.text.x = element_text(angle = 90, hjust = 1))</pre> <img src="https://i0.wp.com/www.stoltzmaniac.com/wp-content/uploads/2017/04/unnamed-chunk-18-1-4.png?zoom=2&w=456&ssl=1" alt="plot of chunk unnamed-chunk-18" width="455" height="260" data-lazy-loaded="true" /> <pre class="crayon-plain-tag">df = data %>%   filter(Year == 1990)%>%   group_by(Location) %>%   summarise(ThousandBarrel = sum(ThousandBarrel)) dfLocation = tolower(dfLocation)  #Add States without data States = data.frame(Location = tolower(as.character(state.name))) missingStates = StatesLocation[!(StatesLocation %in% dfLocation)]
appendData = data.frame(Location=missingStates,ThousandBarrel=0)
df = rbind(df,appendData)
states_map <- map_data("state")
ggplot(df, aes(map_id = Location)) + 
    geom_map(aes(fill=ThousandBarrel), map = states_map) +
    expand_limits(x = states_maplong, y = states_maplat)

plot of chunk unnamed-chunk-19

df = data %>% 
  filter(Location == 'Texas') %>%
  group_by(Year,Month) %>%
  summarise(ThousandBarrel = sum(ThousandBarrel))
p = ggplot(df,aes(x=Month,y=ThousandBarrel))
p + geom_line(stat='identity',aes(group=Year,col=Year)) + 
  ggtitle('Oil Production By Year in the U.S.') + 
  theme(plot.title = element_text(hjust = 0.5)) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

plot of chunk unnamed-chunk-20


Original Source.

Scott Stoltzman

Scott Stoltzman

My name is Scott Stoltzman, I’m a Data Scientist in Fort Collins, CO. My life has taken a lot of twists and turns to get me to where I am today. All kidding aside, I have an insatiable curiosity, desire to learn, and strong work ethic. I like to help others find what they’re looking for and show them what they might be missing. If there’s one thing I’ve learned in life, it is simplicity is the key to achieving greatness. I have traveled the world, been captain of multiple sports teams, earned two graduate degrees, started a company, and fell in love with a woman who agreed to marry me! I appreciate you taking the time to read this. Please email me if you would like to get in touch.