Whenever someone asks me how to get into data science using R, I invariably recommend checking out the tidyverse package. Tidyverse is a great launch pad for a language like R because it offers order and consistency.
I studied programming language design as a CS undergrad. At the time, I read ACM SIGPLAN Notices every month to learn about cool (and sometimes not so cool) research in new programming languages. Designing languages from the ground up in this manner always seemed like a luxury rather than the path most languages actually take.
Languages like Java (and more recently Julia) benefit from ground-up design. Their designers could reflect on the mistakes of the past, and as a result their languages offer a lot of consistency. On the other hand, current-day R is the result of a long, sometimes twisted evolutionary path. After teaching R classes for years, I find that the R quilt-work mystifies new and experienced programmers alike. The tidyverse is a serious effort to add consistency and leverage R for data science work.
The tidyverse is a coherent collection of R packages that share a common design philosophy and offer data science solutions for data manipulation, exploration, and visualization. It was created by R industry luminary Hadley Wickham, chief scientist at RStudio. The packages in the tidyverse are intended to make statisticians and data scientists more productive by guiding them through workflows that facilitate communication and result in reproducible work products. In essence, the tidyverse focuses on the interconnections between the tools that make the workflow possible.
There are many advantages to adopting the tidyverse for your data science projects. It offers consistent functions, workflow coverage, data science education, a streamlined path for the development of data science tools, and potential for greater productivity. Wickham says one of its major goals is to help anyone who needs to analyze data work productively.
Tidyverse in the Data Science Workflow
The following figure illustrates how individual tidyverse packages fit into the accepted data science workflow.
Tidyverse packages are associated with every stage of the workflow, which indicates that the tidyverse contains the fundamental building blocks needed to support the entire end-to-end workflow for a broad range of data science goals.
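To make the workflow coverage a little more concrete, here is a minimal sketch of a single pass from import through transformation to visualization. It assumes the tidyverse meta-package is installed, and it uses the mtcars data set that ships with R purely as a stand-in for real data. The names cars, summary_by_cyl, and p are made up for this illustration.

```r
# A minimal sketch, assuming install.packages("tidyverse") has been run.
library(tidyverse)

# Import and tidy: mtcars is a built-in data frame used here as stand-in data.
# tibble::rownames_to_column() moves the car names into a regular column.
cars <- mtcars %>%
  rownames_to_column("model") %>%
  as_tibble()

# Transform: dplyr verbs chained with the pipe.
summary_by_cyl <- cars %>%
  filter(hp > 100) %>%                         # keep the more powerful cars
  group_by(cyl) %>%                            # one group per cylinder count
  summarise(mean_mpg = mean(mpg), n = n()) %>% # mean fuel economy and group size
  arrange(desc(mean_mpg))                      # most economical group first

# Visualize: ggplot2 builds the chart from the summarised tibble.
p <- ggplot(summary_by_cyl, aes(x = factor(cyl), y = mean_mpg)) +
  geom_col() +
  labs(x = "Cylinders", y = "Mean miles per gallon")

print(p)
```

Each step takes a data frame and returns one, which is what lets the packages chain together. A real project would likely start with a readr function such as read_csv() instead of a built-in data set.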
The abbreviated figure below motivated the development of the tidyverse. It is an abstraction of the data analysis workflow that has always guided statisticians, and it preceded the workflow above. Now it also guides data scientists as a map to organize, streamline, automate, and optimize the various processes involved.
Hadley Wickham uses both diagrams in many of his industry presentations.
More broadly, many consider the tidyverse a “sub-dialect” of the R language that continually evolves to express ideas and tasks inherently common in data science workflows. This dialect may not be everyone’s cup of tea, but it does seem to help many R-centric data scientists address everyday needs.
If you have some experience with R as a data scientist or data analyst, you should be able to dive right into the online documentation and get your bearings. If you’re an R newbie, or new to data science in general, you’ll do well to read the free book R for Data Science by Hadley Wickham and Garrett Grolemund.
Learn More About Tidyverse
To find the current state of development, visit tidyverse.org. Detailed documentation for each package is available on the website. To learn more, I highly recommend watching Hadley Wickham’s keynote address at rstudio::conf 2017, “Data Science in the Tidyverse.” The slides for the talk are also available. You can also check out the tidyverse source code on GitHub.