fbpx
The Tidyverse Curse The Tidyverse Curse
I’ve just finished a major overhaul to my widely read article, Why R is Hard to Learn. It describes the main complaints I’ve... The Tidyverse Curse

I’ve just finished a major overhaul to my widely read article, Why R is Hard to Learn. It describes the main complaints I’ve heard from the participants to my workshops, and how those complaints can often be mitigated. Here’s the only new section:

The Tidyverse Curse

There’s a common theme in many of the sections above: a task that is hard to perform using base a R function is made much easier by a function in the dplyr package. That package, and its relatives, are collectively known as the tidyverse. Its functions help with many tasks, such as selecting, renaming, or transforming variables, filtering or sorting observations, combining data frames, and doing by-group analyses. dplyr is such a helpful package that Rdocumentation.org shows that it is the single most popular R package (as of 3/23/2017.) As much of a blessing as these commands are, they’re also a curse to beginners as they’re more to learn. The main packages of dplyr, tibble, tidyr, and purrr contain a few hundred functions, though I use “only” around 60 of them regularly. As people learn R, they often comment that base R functions and tidyverse ones feel like two separate languages. The tidyverse functions are often the easiest to use, but not always; its pipe operator is usually simpler to use, but not always; tibbles are usually accepted by non-tidyverse functions, but not always; grouped tibbles may help do what you want automatically, but not always (i.e. you may need to ungroup or group_by higher levels). Navigating the balance between base R and the tidyverse is a challenge to learn.

A demonstration of the mental overhead required to use tidyverse function involves the usually simple process of printing data. I mentioned this briefly in the Identity Crisis section above. Let’s look at an example using the built-in mtcars data set using R’s built-in print function:

> print(mtcars)
                  mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0 6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0 6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8 4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4 6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1 6 225.0 105 2.76 3.460 20.22  1  0    3    1
...

We see the data, but the variable names actually ran off the top of my screen when viewing the entire data set, so I had to scroll backwards to see what they were. The dplyr package adds several nice new features to the print function. Below, I’m taking mtcars and sending it using the pipe operator “%>%” into dplyr’s as_data_frame function to convert it to a special type of tidyverse data frame called a “tibble” which prints better. From there I send it to the print function (that’s R’s default function, so I could have skipped that step). The output all fits on one screen since it stopped at a default of 10 observations. That allowed me to easily see the variable names that had scrolled off the screen using R’s default print method.  It also notes helpfully that there are 22 more rows in the data that are not shown. Additional information includes the row and column counts at the top (32 x 11), and the fact that the variables are stored in double precision ().

> library("dplyr")
> mtcars %>%
+   as_data_frame() %>%
+   print()
# A tibble: 32 × 11
   mpg   cyl  disp    hp  drat    wt  qsec    vs   am   gear  carb
*           
 1 21.0   6   160.0  110  3.90 2.620 16.46    0     1     4     4
 2 21.0   6   160.0  110  3.90 2.875 17.02    0     1     4     4
 3 22.8   4   108.0   93  3.85 2.320 18.61    1     1     4     1
 4 21.4   6   258.0  110  3.08 3.215 19.44    1     0     3     1
 5 18.7   8   360.0  175  3.15 3.440 17.02    0     0     3     2
 6 18.1   6   225.0  105  2.76 3.460 20.22    1     0     3     1
 7 14.3   8   360.0  245  3.21 3.570 15.84    0     0     3     4
 8 24.4   4   146.7   62  3.69 3.190 20.00    1     0     4     2
 9 22.8   4   140.8   95  3.92 3.150 22.90    1     0     4     2
10 19.2   6   167.6  123  3.92 3.440 18.30    1     0     4     4
# ... with 22 more rows

The new print format is helpful, but we also lost something important: the names of the cars! It turns out that row names get in the way of the data wrangling that dplyr is so good at, so tidyverse functions replace row names with 1, 2, 3…. However, the names are still available if you use the rownames_to_columns() function:

> library("dplyr")
> mtcars %>%
+   as_data_frame() %>%
+   rownames_to_column() %>%
+   print()
Error in function_list[[i]](value) : 
 could not find function "rownames_to_column"

Oops, I got an error message; the function wasn’t found. I remembered the right command, and using the dplyr package did cause the car names to vanish, but the solution is in the tibble package that I “forgot” to load. So let’s load that too (dplyr is already loaded, but I’m listing it again here just to make each example stand alone.)

> library("dplyr")
> library("tibble")
> mtcars %>%
+   as_data_frame() %>%
+   rownames_to_column() %>%
+   print()
# A tibble: 32 × 12
 rowname            mpg   cyl disp    hp   drat   wt   qsec   vs    am   gear carb
                        
1 Mazda RX4         21.0   6   160.0  110  3.90  2.620 16.46   0     1     4     4
2 Mazda RX4 Wag     21.0   6   160.0  110  3.90  2.875 17.02   0     1     4     4
3 Datsun 710        22.8   4   108.0   93  3.85  2.320 18.61   1     1     4     1
4 Hornet 4 Drive    21.4   6   258.0  110  3.08  3.215 19.44   1     0     3     1
5 Hornet Sportabout 18.7   8   360.0  175  3.15  3.440 17.02   0     0     3     2
6 Valiant           18.1   6   225.0  105  2.76  3.460 20.22   1     0     3     1
7 Duster 360        14.3   8   360.0  245  3.21  3.570 15.84   0     0     3     4
8 Merc 240D         24.4   4   146.7   62  3.69  3.190 20.00   1     0     4     2
9 Merc 230          22.8   4   140.8   95  3.92  3.150 22.90   1     0     4     2
10 Merc 280         19.2   6   167.6  123  3.92  3.440 18.30   1     0     4     4
# ... with 22 more rows

Another way I could have avoided that problem is by loading the package named tidyverse, which includes both dplyr and tibble, but that’s another detail to learn.

In the above output, the row names are back! What if we now decided to save the data for use with a function that would automatically display row names? It would not find them because now they’re now stored in a variable called rowname, not in the row names position! Therefore, we would need to use either the built-in names function or the tibble package’s column_to_rownames function to restore the names to their previous position.

Most other data science software requires row names to be stored in a standard variable e.g. rowname. You then supply its name to procedures with something like SAS’
“ID rowname;” statement. That’s less to learn.

This isn’t a defect of the tidyverse, it’s the result of an architectural decision on the part of the original language designers; it probably seemed like a good idea at the time. The tidyverse functions are just doing the best they can with the existing architecture.

Another example of the difference between base R and the tidyverse can be seen when dealing with long text strings. Here I have a data frame in tidyverse format (a tibble). I’m asking it to print the lyrics for the song American Pie. Tibbles normally print in a nicer format than standard R data frames, but for long strings, they only display what fits on a single line:

> songs_df %>%
+   filter(song == "american pie") %>%
+   select(lyrics) %>%
+   print()
# A tibble: 1 × 1
 lyrics
 
1 a long long time ago i can still remember how that music used

The whole song can be displayed by converting the tibble to a standard R data frame by routing it through the as.data.frame function:

> songs_df %>%
+   filter(song == "american pie") %>%
+   select(lyrics) %>%
+   as.data.frame() %>%
+   print()
 ... 
1 a long long time ago i can still remember how that music used 
to make me smile and i knew if i had my chance that i could make 
those people dance and maybe theyd be happy for a while but 
february made me shiver with every paper id deliver bad news on 
the doorstep i couldnt take one more step i cant remember if i cried 
...

These examples demonstrate a small slice of the mental overhead you’ll need to deal with as you learn base R and the tidyverse packages, such as dplyr. Since this section has focused on what makes R hard to learn, it may make you wonder why dplyr is the most popular R package. You can get a feel for that by reading the Introduction to dplyr. Putting in the time to learn it is well worth the effort.

 

Original Source.

Bob Muenchen

Bob Muenchen

Robert A. Muenchen (muenchen.bob@gmail.com) is the author of R for SAS and SPSS Users and, with Joseph M. Hilbe, R for Stata Users. He is also the creator of http://r4stats.com, a popular web site devoted to analyzing trends in data science software and helping people learn the R language. Detailed outlines of Bob's R workshops are available there. Bob is an ASA Accredited Professional Statistician™ with 35 years of experience and is currently the manager of OIT Research Computing Support (formerly the Statistical Consulting Center) at the University of Tennessee. He has taught workshops on research computing topics for more than 500 organizations and has offered training in partnership with the American Statistical Association, DataCamp.com, New Horizons Computer Learning Centers, Revolution Analtyics, RStudio.com, Rexer Analytics, and Xerox Learning Services. Bob has written or coauthored over 70 articles published in scientific journals and conference proceedings, and has provided guidance on more than 1,000 graduate theses and dissertations. Bob has served on the advisory boards of SAS Institute, SPSS Inc., Intuitics OOD, StatAce OOD, the Statistical Graphics Corporation and PC Week Magazine. His suggested improvements have been incorporated into SAS, SPSS, JMP, STATGRAPHICS and several R packages. His research interests include statistical computing, data graphics and visualization, text analytics, and data mining.

1