A Beginner’s Guide to ggplot2

ggplot2: The gg stands for ‘get graphing’

“Hands down the best way to start learning data science is to focus on data visualization. Pick R or Python and practice building plots that tell a story. Everything else will follow.” – Isaac Faber Ph.D., Director of AI Development | Stanford & CMU Instructor

If you’re attending ODSC West and would like to learn about (or extend your knowledge of) data visualization, please attend our workshop on ggplot2.

Below are some questions we received from attendees about our ODSC workshop.

What is ggplot2?

ggplot2 is a graphing syntax that accurately “describes the properties of a plotting system.” These properties include:

  1. A dataset
  2. Mappings from variables to visual aesthetics
  3. A geometric object (visual elements or graph types)
  4. A scale for each aesthetic mapping and a coordinate system
  5. An optional faceting specification
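As a rough sketch of how the five components above map onto code (assuming the ggplot2 and palmerpenguins packages, both introduced later in this guide, are installed), each piece can be spelled out explicitly. The scale and coordinate calls are the defaults ggplot2 would otherwise add for you, and faceting by species is only an illustrative choice; the object name p is hypothetical.

```r
library(ggplot2)

# 1. A dataset (the penguins data used throughout this guide)
penguins <- palmerpenguins::penguins

p <- ggplot(data = penguins,
  # 2. Mappings from variables to visual aesthetics
  mapping = aes(x = bill_length_mm, y = flipper_length_mm)) +
  # 3. A geometric object (na.rm = TRUE silently drops rows with missing values)
  geom_point(na.rm = TRUE) +
  # 4. A scale for each aesthetic mapping and a coordinate system
  #    (written out here; these are the defaults for continuous x and y)
  scale_x_continuous() +
  scale_y_continuous() +
  coord_cartesian() +
  # 5. An optional faceting specification (illustrative: one panel per species)
  facet_wrap(~ species)

p
```

Dropping steps 4 and 5 should give the same basic scatter-plot in a single panel, which is why most ggplot2 code only states the data, the mappings, and the geom.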

Why use ggplot2?

  • If you’re using ggplot2:

    • You’ll have a clear understanding of the data behind the visualizations you build
    • You’ll be able to iterate quickly through graph enhancements/revisions
    • The consistent syntax allows you to reproduce your graphs using ‘templates’

How can I get started with ggplot2?

Install ggplot2 from CRAN, or install the development version from GitHub (linked from the package website):

install.packages("ggplot2")
# or 
install.packages("remotes")
remotes::install_github("tidyverse/ggplot2")
library(ggplot2)

Where can I get some data?

We’ll use the penguins dataset provided by the palmerpenguins package by Allison Horst, Alison Hill, and Kristen Gorman.

# from CRAN
install.packages("palmerpenguins")
# from GitHub
remotes::install_github("allisonhorst/palmerpenguins")
library(palmerpenguins)
penguins <- palmerpenguins::penguins
head(penguins)
## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex    year
##   <chr>   <chr>              <dbl>         <dbl>       <dbl>   <dbl> <chr> <dbl>
## 1 Adelie  Torgersen           39.1          18.7         181    3750 male   2007
## 2 Adelie  Torgersen           39.5          17.4         186    3800 fema…  2007
## 3 Adelie  Torgersen           40.3          18           195    3250 fema…  2007
## 4 Adelie  Torgersen           NA            NA            NA      NA <NA>   2007
## 5 Adelie  Torgersen           36.7          19.3         193    3450 fema…  2007
## 6 Adelie  Torgersen           39.3          20.6         190    3650 male   2007
## # … with abbreviated variable names ¹​flipper_length_mm, ²​body_mass_g
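Row 4 in the output above contains missing values (NA). Geoms such as geom_point() drop rows with a missing x or y value and emit a warning, so it can help to check for missing values before plotting. A minimal sketch using base R only; the name penguins_complete is hypothetical and is not used elsewhere in this guide.

```r
penguins <- palmerpenguins::penguins

# The two columns plotted in the scatter-plots later in this guide
plot_cols <- c("bill_length_mm", "flipper_length_mm")

# Count the missing values in each of those columns
colSums(is.na(penguins[, plot_cols]))

# Keep only the rows where both measurements are present
penguins_complete <- penguins[complete.cases(penguins[, plot_cols]), ]

# Compare row counts before and after filtering
nrow(penguins)
nrow(penguins_complete)
```

Filtering first is optional; passing na.rm = TRUE to a geom also suppresses the warning without changing the data.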

How do I (quickly) build a graph with ggplot2?

ggplot2 graphs are built in layers, and they all start with a data argument (in this case, it’s penguins).

ggplot(data = penguins)

Once we have an initialized plot, we’re ready to start mapping our graph aesthetics with mapping = aes() (i.e. providing variables and their locations). Let’s put bill_length_mm on the x and flipper_length_mm on the y.

ggplot(data = penguins, 
  mapping = aes(x = bill_length_mm, y = flipper_length_mm))

The graph above has 1) data, and 2) variables (x and y). The third step is to add a geom (or geometric object), which is the type of plot that we want to create. In this case, we’ll add geom_point() (for points, or a scatter-plot).

ggplot(data = penguins, 
  mapping = aes(x = bill_length_mm, y = flipper_length_mm)) + 
  geom_point()

In three lines of code, we’ve created a scatter-plot. We’ve also used the basic template for creating graphs with ggplot2:

ggplot(data = <DATA>, 
  mapping = aes(x = <X VARIABLE>, y = <Y VARIABLE>)) + 
  geom_*()
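To illustrate how the template generalizes, here is a sketch that fills the same three slots with different variables and a different geom. The choice of species, body_mass_g, and geom_boxplot() is only an example, and p_box is a hypothetical object name.

```r
library(ggplot2)

penguins <- palmerpenguins::penguins

# Same template: <DATA>, <X VARIABLE>, <Y VARIABLE>, then a geom_*() function
p_box <- ggplot(data = penguins,
  mapping = aes(x = species, y = body_mass_g)) +
  # geom_boxplot() draws one box per species; na.rm = TRUE drops missing values quietly
  geom_boxplot(na.rm = TRUE)

p_box
```

Only the variables and the geom changed; the structure of the call is identical to the scatter-plot above.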

While this graph might not be ready for publication, it is infinitely extensible because it was built using ggplot2’s grammar.

How do I change a ggplot2 graph?

A language is considered functional when it’s capable of making infinite use of finite means. ggplot2 does this by providing an infinite number of potential graphs from a finite number of functions. Consider the graph we created above with three lines of code. We can add more aesthetics (with aes()) to highlight the differences between groups for the x and y variables.

ggplot(data = penguins, 
  mapping = aes(x = bill_length_mm, y = flipper_length_mm)) + 
  geom_point(aes(color = species)) 

We can also include more geoms to further illustrate the group differences (with geom_smooth()).

ggplot(data = penguins, 
  mapping = aes(x = bill_length_mm, y = flipper_length_mm)) + 
  geom_point(aes(color = species)) + 
  geom_smooth(aes(color = species))
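Because layers are combined with +, a partially built plot can also be stored in an object and extended later. A minimal sketch of that workflow; the object names (base_plot, points_plot, smooth_plot) are hypothetical, and method = "lm" is an assumption made here to fit straight lines rather than the default smoother used above.

```r
library(ggplot2)

penguins <- palmerpenguins::penguins

# Data and mappings only: this draws an empty panel with axes
base_plot <- ggplot(data = penguins,
  mapping = aes(x = bill_length_mm, y = flipper_length_mm))

# Add the points layer to the stored plot
points_plot <- base_plot +
  geom_point(aes(color = species), na.rm = TRUE)

# Add a per-species linear fit on top of the points
smooth_plot <- points_plot +
  geom_smooth(aes(color = species), method = "lm", formula = y ~ x,
    na.rm = TRUE)

smooth_plot
```

Each intermediate object is a complete plot in its own right, so you can print any of them to compare versions side by side.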

As you can see, with relatively few lines of code, we’re able to quickly iterate through versions of a graph. ggplot2 also gives us incredible levels of control over how graphs are displayed. For example, we can leave the smoothing layer out of the legend (with show.legend = FALSE) and use facets to split the data by island into small multiples.

ggplot(data = penguins, 
  mapping = aes(x = bill_length_mm, y = flipper_length_mm)) + 
  geom_point(aes(color = species)) + 
  geom_smooth(aes(color = species), show.legend = FALSE) + 
  facet_wrap(~ island, nrow = 3)
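Note that show.legend = FALSE applies only to the layer it is set on, so the color legend for the points is still drawn. If the goal is to drop the legend altogether, one option is the legend.position theme setting. A sketch under that assumption (p_no_legend is a hypothetical name):

```r
library(ggplot2)

penguins <- palmerpenguins::penguins

p_no_legend <- ggplot(data = penguins,
  mapping = aes(x = bill_length_mm, y = flipper_length_mm)) +
  geom_point(aes(color = species), na.rm = TRUE) +
  facet_wrap(~ island, nrow = 3) +
  # Hide every legend on the plot, regardless of which layer produced it
  theme(legend.position = "none")

p_no_legend
```

Hiding the legend makes the colors harder to decode, so this tends to work best when the groups are identified some other way, such as in a subtitle.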

We can add finishing touches with labels and themes.

ggplot(data = penguins, 
  mapping = aes(x = bill_length_mm, y = flipper_length_mm)) + 
  geom_point(aes(color = species)) + 
  geom_smooth(aes(color = species), show.legend = FALSE) + 
  facet_wrap(~ island, nrow = 3) + 
  labs(title = "Bill Length vs. Flipper Length",
    subtitle = "Adelie, Chinstrap, and Gentoo Penguins",
    caption = "source: palmerpenguins data",
    x = "Bill length (mm)", y = "Flipper length (mm)") + 
  theme_minimal()

The consistent syntax and underlying philosophy of ggplot2’s grammar allow us to quickly generate new graphs (and make adjustments to existing graphs).

Hadley Wickham, the package’s original author: “My general thesis of visualization is that the quality of the best visualization has maybe improved 10% in the last 150 years. The best visualization you can make today is only slightly better than the best visualization someone could make 150 years ago. But the time it takes you to make them has probably decreased by three orders of magnitude.” – source

How big is ggplot2?

Beyond the core package, there are 100+ ggplot2 extensions, and this number is still growing. Extensions include additional geoms (like ggbeeswarm) and themes (like ggthemes).

# install from GitHub with remotes (installed earlier in this guide)
remotes::install_github("eclarke/ggbeeswarm")
remotes::install_github("jrnold/ggthemes")
library(ggbeeswarm)
library(ggthemes)

If you understand the ggplot2 grammar, extensions are like plug-and-play features for graphs. We simply adapt our template for the new geom and theme layers…

ggplot(data = penguins, 
  mapping = aes(x = island, y = body_mass_g)) + 
  ggbeeswarm::geom_beeswarm(aes(color = species)) + 
  ggthemes::theme_fivethirtyeight()

…and we have a new graph!

We hope you’ll come join us for the workshop! You’ll walk away with a solid introduction and lots of code examples to take home and tinker with.

Additional Questions

  1. Why does ggplot2 use the + instead of the pipe (%>%)?

This can be confusing to new R users, especially if they’ve been using the pipe (%>%) from the magrittr package. The pipe allows us to easily pass the output from a function on the left as an input to the function on the right (in a ‘pipeline’). However, graph layers are added using the plus symbol (+). Hadley Wickham touches on why it was implemented this way in this interview: “I think I was reading about operator overloading and I thought “Oh maybe I could do this with ‘+’ instead”, and it kind of makes sense, you know, because you’re adding layers to the plot.”
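The two operators can appear in the same expression: a pipe passes the data into ggplot(), and + then adds layers to the resulting plot. The sketch below uses R’s native pipe (|>), which assumes R 4.1 or later; the magrittr pipe (%>%) could be swapped in if that package is loaded. The name p_piped is hypothetical.

```r
library(ggplot2)

penguins <- palmerpenguins::penguins

# The pipe hands a data frame from one function to the next...
p_piped <- penguins |>
  subset(!is.na(bill_length_mm) & !is.na(flipper_length_mm)) |>
  ggplot(mapping = aes(x = bill_length_mm, y = flipper_length_mm)) +
  # ...while + adds a layer to the plot object that ggplot() returned
  geom_point(aes(color = species))

p_piped
```

Pipes bind more tightly than + in R, so the piped data reaches ggplot() before any layers are added.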

  2. Where can I find ggplot2 extensions?

This website contains a gallery of extensions for ggplot2. It’s always a good idea to check #ggplot2 on Twitter, too.

  3. Where can I learn more?

ggplot2 has a free online book and package website with loads of examples.

About the authors/ODSC West 2022 speakers:

Martin Frigaard is a Senior Clinical Programmer at BioMarin, where he builds dashboards and tools for making data-informed decisions. Previously, Martin built statistical tools and dashboards for the Diabetes Technology Society and for other volunteer and non-profit organizations, and he was a contributing author for Data Journalism in R on the Northeastern University School of Journalism blog/website. He’s a data journalism instructor for California State University, Chico. Martin holds a graduate degree in Clinical Research and is passionate about data literacy and open source technologies.

Peter Spangler is a hands-on data science leader with a business-focused approach to building data science solutions and telling stories with data. He is experienced in translating business problems into data products using advanced statistical techniques and ML to support decision-making in a variety of rapid growth environments. He has scaled data science solutions for user acquisition, retention, channel optimization, revenue, and fraud at Lyft, Alibaba, and Citrix, and currently leads Marketing Science for Growth at Nextdoor.

More on their ODSC West 2022 session, “Data Visualization with ggplot2”

Data visualization is a powerful tool for facilitating confident, informed decision-making. ggplot2 is one of the most popular data visualization packages in use today. Based on a comprehensive grammar and syntax, ggplot2 gives you the ability to create data visualizations quickly and iteratively, whether it’s a simple bar chart or a complicated network analysis.

This workshop will teach you how to manipulate and structure your data for visualizations, graph elements, and their associated terminology, how to select the appropriate graph based on your data, and how to avoid common graphing mistakes. You will also learn how to customize data visualizations and give them the ‘personal touches’ that make them memorable to your audience.

ODSC Community

The Open Data Science community is passionate and diverse, and we always welcome contributions from data science professionals! All of the articles under this profile are from our community, with individual authors mentioned in the text itself.
