Article by Martin Frigaard, Senior Clinical Programmer at BioMarin, and a speaker for ODSC East 2022. Be sure to check out his talk, Data Visualization with ggplot2, there!
Today, graphs and visualizations saturate our information landscape. Whether we’re trying to understand our finances, levels of physical activity, investment portfolios, or medical information, we’re encountering charts and figures daily. Government agencies create graphs and tables to share health and safety data. And it doesn’t stop there–media outlets use charts to communicate the results of political polls. Even our entertainment and sports information is regularly sorted, ranked, and tabulated.
We can attribute the recent increase in graphs and data visualizations to the rise of the internet, widespread smartphone use, and the inevitable consequence of living in a digital age. Or, for a more pessimistic view, we can attribute it to the growing amount of quantitative information available on everyone’s likes, dislikes, spending habits, location, etc. Regardless of the reason for the current explosion of graphics, charts, tables, and figures we’re encountering, it’s clear that visualization skills are among the most sought-after job skills of the 21st century (and most of these jobs pay well, too!).
*Throughout this post, I use the terms ‘visualization,’ ‘graph,’ ‘chart,’ and ‘figure’ interchangeably.
Creating high-quality visualizations takes time and practice, and the best source of both is performing Exploratory Data Analysis (EDA). American mathematician John Tukey coined “Exploratory Data Analysis” in his seminal 1977 text of the same name. In it, Tukey compares creating graphs and visualizations to doing detective work:
“A detective investigating a crime needs both tools and understanding. If he has no fingerprint powder, he will fail to find fingerprints on most surfaces. If he does not understand where the criminal is likely to have put his fingers, he will not look in the right places. Equally, the analyst of data needs both tool and understanding.”
The EDA process isn’t a standardized set of procedures. Instead, our approach and actions are driven by having a “data detective mindset.” Armed with our visualization skills, background knowledge, and curiosity, we use EDA to help us understand and verify (or change) our expectations about the data.
Good detectives don’t blindly follow a set of steps–they apply careful thought to each new piece of evidence they discover. Similarly, EDA involves making judgments about what to pay attention to, investigate further, or ignore. Consider Sherlock Holmes, the detective created by British author Sir Arthur Conan Doyle. In Mastermind: How To Think Like Sherlock Holmes, author Maria Konnikova describes Holmes’ approach to observation as not just a
“passive process of letting objects enter into your visual field. It is about knowing what and how to observe and directing your attention accordingly: what details do you focus on? What details do you omit? And how do you take in and capture those details that you do choose to zoom in on?”
As data detectives, we’re not robotically creating plots and dropping them into slides when exploring a data source. We’re making decisions about what to pay attention to, and these decisions help us understand our data and improve our ability to describe their contents to others. It’s not enough to create a graph that’s nice to look at; instead, we should be
“looking properly, looking with real thought. It means looking with the full knowledge that what you note — and how you note it — will form the basis of any future deductions you might make. It’s about seeing the full picture, noting the details that matter, and understanding how to contextualize those details within a broader framework of thought.”
While performing EDA, we want to keep our eyes on the forest and the trees. This view gives us the ability to “contextualize those details within a broader framework of thought.”
EDA: A method for learning visualization
EDA is also a great way to hone your visualization skills. The graphs we create will require us to combine our technical skills for creating accurate charts with critical thinking and reasoning. The process starts with data in a spreadsheet, and then we calculate some basic counts and summaries. Next, we build visualizations for each column (univariate graphs) and then compare the columns to each other (bivariate graphs). Finally, if necessary, we create charts for three or more columns (multivariate graphs). All along the way, we use these visualizations to answer (and ask) questions about the data. Throughout this process, it’s also possible we make discoveries that require us to restructure and reformat (or ‘wrangle’) the data before we can create a visualization that communicates what the data contains.
The goal of EDA is to produce visualizations that give us a better understanding of our data. Why? Think about it–a good detective is investigating and gathering evidence to try and decide if a crime has been committed (and if so, who committed it). While you’re performing EDA, you’re trying to understand and describe the data to facilitate better decision-making.
What are ‘data’?
Graphs are illustrations drawn from data, often to reduce their complexity into a display we can process visually. We’ll consider data to be any rectangular arrangement of information, with rows representing different observations (e.g., participants in a survey, movies, US cities, etc.), and columns representing variable characteristics (e.g., answers to survey questions, movie critic scores, city population, etc.). The values representing a single measurement unit are at the intersection of the rows and columns (see the example below).
The most straightforward display of any data is in its raw form (as a table), but it’s rarely sufficient to derive any insight. Data are usually too big (too many columns or rows) to understand in their raw form, or we’ve collected them in a way that isn’t suitable for detecting patterns visually. We’ll often start the EDA process by creating numerical table summaries of the data to help understand their structure and contents. Still, we are much better at seeing patterns, relationships, extreme values, and unexpected findings in visualizations than in a table of numbers.
A grammar for graphics
I highly recommend using the ggplot2 package (built with the statistical programming language R) to create data visualizations. The underlying system for constructing graphs with ggplot2 is a comprehensive vocabulary and grammar of graphics (from the book with the same title by Leland Wilkinson). Grammar exists for a reason: to have precisely and unambiguously defined concepts. Dedicating an entire language to building graphs might seem excessive, but like all technical endeavors, designing visualizations benefits from having a shared vocabulary for describing their attributes. A shared language can also provide a framework for building a mental model for graphs (mental models are mental representations of how some aspect of the world works).
We’ll use the diagram below to define some standard graph components:
When creating data visualizations, I’ve discovered it’s best to start with the labels. At a minimum, making the Graph title, Subtitle, and X/Y axes allows me to begin with an end in sight and sets an expectation for what I should see when I add the data to the plot.
We won’t share most of the graphs we make during EDA with an audience other than ourselves, so it’s crucial when we revisit these graphs in a week or two that we’ll be able to understand what we created. In most cases, a Graph title should be short, clear, and tell the audience what they’re seeing. You can also create a simple title (“Column X vs Column Y”) and include additional technical details in the Subtitle (“The linear relationship between Column X vs Column Y”). The X column axis and Y column axis should contain the column name (in plain language) and any units. The legend documents the colors or shapes in the graph and the values they represent. The Caption is where you can list the source of the data (preferably as a URL).
The Data point, Data series 1, and Data series 2 are the numerical values in our dataset. We represent these in the plot with different graph types, or ‘geoms’ (short for ‘geometric elements’). In ggplot2, a geom is a fundamental building block for data visualizations. When building graphs, we map columns and values to geom aesthetics.
An example: Palmer penguins
These terms and definitions can seem a little abstract, so we’ll work through an example. Consider the data below, which contains ten measurements of penguin bill length from the Palmer Archipelago in Antarctica. We’ve stored these data in the penguins dataset.
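As a minimal sketch of how we might view these values in R, the code below prints the first ten rows of two columns and a numerical summary of bill length. It assumes the palmerpenguins package is installed and that its penguins data frame is the dataset used here; that is an assumption based on the URL in the graph captions later in this post.

```r
# Assumption: the palmerpenguins package is installed and provides `penguins`
penguins <- palmerpenguins::penguins

# First ten rows of the species and bill length columns
head(penguins[, c("species", "bill_length_mm")], n = 10)

# Numerical summary of bill length (missing values are counted separately)
summary(penguins$bill_length_mm)
```

Looking at a handful of rows and a summary like this is usually enough to confirm the column names and types before we start plotting.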
Start with the labels
We’ll start by looking at the distribution of the bill length column (bill_length_mm) using a histogram. As I mentioned above, we’ll begin by making the labels for this graph with ggplot2’s labs() function:
    bill_labels <- labs(
      title = "Distribution of Palmer penguins bill length",
      subtitle = "Histogram of bill_length_mm",
      caption = "https://allisonhorst.github.io/palmerpenguins/",
      x = "Bill Length (mm)"
    )
Build a canvas
Now that we have the labels for our plot, we can build the first layer of our graph. In ggplot2, layers are “a collection of geometric elements and statistical transformations.” We’ll use the ggplot() function to initialize the graph with the penguins data:
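A minimal sketch of this initialization step might look like the following. The object name p_canvas is hypothetical, and the code assumes the ggplot2 and palmerpenguins packages are installed.

```r
library(ggplot2)

# Assumption: the palmerpenguins package is installed and provides `penguins`
penguins <- palmerpenguins::penguins

# Initialize the plot with the data only. No columns are mapped yet,
# so printing it draws an empty canvas.
p_canvas <- ggplot(data = penguins)
p_canvas
```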
Map data to aesthetics
The code above is the beginning of our plot’s first ‘layer.’ The display we’ve created is the canvas we’ll add a geom function to; in this case, it’s a histogram. Inside geom_histogram(), we’ll ‘map’ the bill length (bill_length_mm) to the x aesthetic:
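The mapping step can be sketched as below. The object name p_hist is hypothetical, and the code again assumes ggplot2 and palmerpenguins are installed.

```r
library(ggplot2)

# Assumption: the palmerpenguins package is installed and provides `penguins`
penguins <- palmerpenguins::penguins

# Add a histogram layer and map bill length to the x position.
# ggplot2 falls back to 30 bins and says so in a message; rows with a
# missing bill length are dropped with a warning.
p_hist <- ggplot(data = penguins) +
  geom_histogram(mapping = aes(x = bill_length_mm))
p_hist
```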
We can map the columns in our data to the x and y positions on the graph, but other aesthetics include color, shape, and alpha (which controls transparency).
When we map bill_length_mm to the x position aesthetic, ggplot2 places the column’s values along the horizontal axis and draws a histogram. The y axis is automatically labeled count, because a histogram uses bars to represent counts for the values in bill_length_mm (read more about the geom_histogram() function on the ggplot2 website).
Recall that we started by creating the labels for this graph (stored in bill_labels), and we can add these to the plot to give it a polished, finished look.
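Putting the pieces together, a sketch of the complete histogram might look like this. The labels are the ones defined earlier in this post; the object name p_labelled is hypothetical, and the code assumes ggplot2 and palmerpenguins are installed.

```r
library(ggplot2)

# Assumption: the palmerpenguins package is installed and provides `penguins`
penguins <- palmerpenguins::penguins

# Labels built first, as described above
bill_labels <- labs(
  title = "Distribution of Palmer penguins bill length",
  subtitle = "Histogram of bill_length_mm",
  caption = "https://allisonhorst.github.io/palmerpenguins/",
  x = "Bill Length (mm)"
)

# Canvas + histogram layer + labels
p_labelled <- ggplot(data = penguins) +
  geom_histogram(mapping = aes(x = bill_length_mm)) +
  bill_labels
p_labelled
```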
Below is a diagram of the initialized plot, the mapped aesthetic, the histogram function, and the labels for our visualization:
Let’s recap how we’ve created the graph above:
- We identified a dataset (penguins) and a variable we wanted to investigate (bill_length_mm)
- We built the labels for our plot with the labs() function and stored them in bill_labels
- We initialized the plot with ggplot(data = ...)
- We added a geom function for the type of graph we wanted to build and mapped the aesthetics (geom_histogram(mapping = aes(x = ...)))
- We included our graph labels so we would know what we were looking at if we revisited this graph in the future
A visualization template
The great thing about creating visualizations with ggplot2 is that once we start thinking about graphs in terms of data, columns, and layers, we can build visualizations using any of ggplot2’s many geom functions. We can put the steps above into a template for creating graphs with ggplot2:
    ggplot(data = <DATA>) +
      geom_function(mapping = aes(<AESTHETIC MAPPINGS>)) +
      <LABELS>
Recreate an existing graph
Having a grammar of graphics also gives us terms and definitions that serve as a mental model for thinking about graphs. We can look at an existing visualization and break it down into the data, columns, aesthetics, and geoms. Consider the plot below of body mass and flipper length for the penguins in the penguins data:
The graph above is a typical visualization we’d create during EDA. It illustrates the relationship between two measurements, which is generally positive (as body mass increases, so do flipper lengths).
If we’re attempting to re-create the graph above, the first thing we can build is the labels using labs():
    penguin_labels <- labs(
      title = "Penguins body mass vs. flipper length",
      subtitle = "Penguins from the Palmer Archipelago, Antarctica",
      caption = "https://allisonhorst.github.io/palmerpenguins/",
      x = "Body mass (g)",
      y = "Flipper Length (mm)"
    )
Next, we can look at the penguins data to see which columns we’ll need to re-create this graph:
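One way to inspect the columns is sketched below using base R functions; the original post may have used a different function for this step. The code assumes the palmerpenguins package is installed.

```r
# Assumption: the palmerpenguins package is installed and provides `penguins`
penguins <- palmerpenguins::penguins

# Compact overview: one line per column with its type and first few values
str(penguins)

# Just the column names
names(penguins)
```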
The penguins data contains the flipper_length_mm and body_mass_g columns, which look like ideal candidates for our visualization. We’ve created the labels and identified the data and columns, so we need to find the geom that matches what we see in the graph. The geom_point() function will create a scatter plot when given x and y aesthetics.
We can adapt our template to re-create the plot:
    # TEMPLATE
    ggplot(data = <DATA>) +
      geom_function(mapping = aes(<AESTHETIC MAPPINGS>)) +
      <LABELS>
The penguins data will replace <DATA>, and we will map body_mass_g and flipper_length_mm to the x and y positions in the geom_point() function. The geom_point() layer displays points at each intersection of the values for flipper_length_mm and body_mass_g. We can finalize the graph by adding our labels (penguin_labels):
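Filling in the template gives something like the sketch below. The axis assignment follows the labels above (body mass on x, flipper length on y); the object name p_scatter is hypothetical, and the code assumes ggplot2 and palmerpenguins are installed.

```r
library(ggplot2)

# Assumption: the palmerpenguins package is installed and provides `penguins`
penguins <- palmerpenguins::penguins

penguin_labels <- labs(
  title = "Penguins body mass vs. flipper length",
  subtitle = "Penguins from the Palmer Archipelago, Antarctica",
  caption = "https://allisonhorst.github.io/palmerpenguins/",
  x = "Body mass (g)",
  y = "Flipper Length (mm)"
)

# <DATA> -> penguins, geom_function -> geom_point, <LABELS> -> penguin_labels
p_scatter <- ggplot(data = penguins) +
  geom_point(mapping = aes(x = body_mass_g, y = flipper_length_mm)) +
  penguin_labels
p_scatter
```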
“language is a system for making infinite use of finite means.” – Wilhelm von Humboldt
As we’ve just demonstrated, we can use the grammar of graphics to create (and re-create) visualizations. The structure of ggplot2’s grammar makes it possible for us to build up graphs incrementally, using the aesthetics and geoms to make adjustments to the design as we visually inspect the results. We’ve also stated that EDA is a highly iterative process. The grammar of graphics also makes building visualizations ‘infinitely extensible’ by adding new data, aesthetics, and layers.
When we initialized the plot with ggplot(data = penguins), it gave us access to all the columns in the data. For example, we can use the species column with the color aesthetic to change the hue of the points:
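A sketch of this change is shown below. Only the aes() call differs from the scatter plot; the labels are left out to keep the example short. The object name p_color is hypothetical, and the code assumes ggplot2 and palmerpenguins are installed.

```r
library(ggplot2)

# Assumption: the palmerpenguins package is installed and provides `penguins`
penguins <- palmerpenguins::penguins

# Map species to the color aesthetic; ggplot2 adds a legend automatically
p_color <- ggplot(data = penguins) +
  geom_point(mapping = aes(x = body_mass_g,
                           y = flipper_length_mm,
                           color = species))
p_color
```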
The species column contained three levels of penguin species (Adelie, Chinstrap, and Gentoo), which are all automatically listed in the legend. We can also map the same column to multiple aesthetics. Below we map both the color and shape aesthetics to the species column:
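Mapping one column to two aesthetics might look like the sketch below. The object name p_shape is hypothetical, and the code assumes ggplot2 and palmerpenguins are installed.

```r
library(ggplot2)

# Assumption: the palmerpenguins package is installed and provides `penguins`
penguins <- palmerpenguins::penguins

# The same column (species) drives both color and shape,
# so ggplot2 merges them into a single legend
p_shape <- ggplot(data = penguins) +
  geom_point(mapping = aes(x = body_mass_g,
                           y = flipper_length_mm,
                           color = species,
                           shape = species))
p_shape
```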
The plot above uses colors and shapes to differentiate the points for the three species in the geom_point() layer (we can also see that ggplot2 updates the legend to reflect both aesthetics).
So far, we’ve just been changing the aesthetics in the geom_point() layer of the plot, but ggplot2 also allows us to add new layers to the same graph. Suppose we wanted to add the best fit line to see the dominant pattern among the points. We can do this with geom_smooth():
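A sketch of adding this second layer follows. Note one assumption: method = "lm" is set here to draw straight best fit lines, because geom_smooth() otherwise chooses a smoothing method on its own; the original post may have used different settings. The object name p_smooth is hypothetical, and the code assumes ggplot2 and palmerpenguins are installed.

```r
library(ggplot2)

# Assumption: the palmerpenguins package is installed and provides `penguins`
penguins <- palmerpenguins::penguins

p_smooth <- ggplot(data = penguins) +
  # Layer 1: points, colored and shaped by species
  geom_point(mapping = aes(x = body_mass_g,
                           y = flipper_length_mm,
                           color = species,
                           shape = species)) +
  # Layer 2: one fitted line per species, with the default confidence band.
  # method = "lm" is an assumption made for this sketch.
  geom_smooth(mapping = aes(x = body_mass_g,
                            y = flipper_length_mm,
                            color = species),
              method = "lm")
p_smooth
```

Because each layer carries its own mapping, we can change the aesthetics of the lines without touching the points, and vice versa.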
The geom_smooth() layer includes the x and y aesthetics (just like the geom_point() layer), but we also mapped the species column to the color aesthetic so we can differentiate the lines. The gray areas around the lines are confidence intervals (a measure of uncertainty). Read more about the geom_smooth() function on the ggplot2 website.
EDA is an excellent opportunity to practice building graphs because it’s an iterative, creative process. ggplot2’s grammar is an ideal tool for EDA because of its ability to provide rapid prototyping and feedback.
We can also use the template below to continuously add new aesthetics and geom functions to any ggplot2 graph. Armed with the grammar of graphics (and with some trial and error), we can arrive at the visualization that best fits our needs.
    ggplot(data = <DATA>) +
      geom_function(mapping = aes(<AESTHETIC MAPPINGS>)) +
      geom_function(mapping = aes(<AESTHETIC MAPPINGS>)) +
      ...
      <LABELS>
About the author: Martin is a Senior Clinical Programmer at BioMarin, where he builds dashboards and tools for making data-informed decisions. Previously, Martin built statistical tools and dashboards for the Diabetes Technology Society and other volunteer and non-profit organizations, and he was a contributing author for Data Journalism in R on the Northeastern University School of Journalism blog/website. He’s a data journalism instructor for California State University, Chico. Martin holds a graduate degree in Clinical Research and is passionate about data literacy and open source technologies.