Where is Data Science Heading? Watching R’s Most Popular Packages May Have the Answer
Posted by Daniel Gutierrez, ODSC, September 9, 2019
Working as both a journalist and data scientist, I’m in a unique position to report on new tools of the profession as well as use them. I’m always seeking out trends surrounding the arrival of these tools because I feel they speak closely to the evolution of the field. Vendors respond to market forces by providing tools on a timely basis to align with new demands. In the open source arena supporting data science, detecting trends is really a matter of monitoring the types of packages being released, especially for data scientists using the R language. New and updated R packages arrive at a fast pace. At the time of this writing there were 14,883 R packages. Whenever I have a new data science project or a new application domain to explore, I use Google to find R packages that I can use. I’m never disappointed; I always find something out there to help me out. Here are a few ways to keep an eye on R’s most popular packages, which might help you out.
One great way to get the pulse of the data science industry is to review R package download stats using the cranlogs R package. The data visualization below shows download activity through the middle of 2019. I believe that the prominence of packages like dplyr, tibble, ggplot2, magrittr, and glue (which are part of the tidyverse) means that data scientists using R are seeking a more orderly, professional way to approach the process. That’s a good thing!
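For readers who want to try this themselves, here is a minimal sketch of pulling download counts with cranlogs and ranking packages by total downloads. It assumes the cranlogs package is installed and that a network connection is available. The date range and the helper name `rank_packages` are illustrative choices of mine, not the exact query behind the visualization discussed above.

```r
# Sketch: rank a handful of tidyverse packages by CRAN downloads.
# Assumption: cranlogs is installed and the CRAN logs service is reachable.

# Hypothetical helper: total downloads per package, most downloaded first.
# `downloads` needs the columns returned by cranlogs::cran_downloads():
# date, count, package.
rank_packages <- function(downloads) {
  totals <- aggregate(count ~ package, data = downloads, FUN = sum)
  totals[order(-totals$count), , drop = FALSE]
}

if (requireNamespace("cranlogs", quietly = TRUE)) {
  pkgs <- c("dplyr", "tibble", "ggplot2", "magrittr", "glue")

  # Daily download counts for the first half of 2019 (illustrative range).
  result <- tryCatch(
    {
      downloads <- cranlogs::cran_downloads(
        packages = pkgs,
        from = "2019-01-01",
        to = "2019-06-30"
      )
      rank_packages(downloads)
    },
    error = function(e) {
      message("Could not fetch download stats: ", conditionMessage(e))
      NULL
    }
  )

  if (!is.null(result)) print(result)
}
```

Keeping the ranking step separate from the network call makes it easy to reuse on any data frame with the same columns, for example one you have cached locally.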
Another great way of monitoring the data science community through the ever-expanding R universe is by subscribing to R-Bloggers, the R blog aggregation site containing content contributed by hundreds of R bloggers. I receive a daily digest of R articles that helps me learn new techniques using new packages. This is the primary way I learn about packages I’ve never used before. For example, I recently learned how to perform, validate, and interpret spatial regression models fitted in R on point-referenced data using maximum likelihood with two different packages: spaMM and glmmTMB. This only reinforced my recognition of the importance of all things spatial in data science these days.
Of course, another useful way to get a handle on how data science is progressing through the use of R is to monitor the activities of R guru Hadley Wickham, author of the tidyverse collection of packages. I’ve found that the tidyverse is the most professional, well thought-out, and forward-thinking collection of R packages. I’ve learned that new tools created by Wickham are worth a look because they tend to reflect an important need or trend in data science. Most of Wickham’s work is intended to advance the field of data science by leaps and bounds.
I live and work in Los Angeles, and I’ve found that my local Meetup groups serve as very accurate indicators of where the industry is headed. I recall attending a meeting of the LA Machine Learning group in June 2016 when the main author of XGBoost, Tianqi Chen, came to explain the theory behind the algorithm. At the time, I recognized this evolution of gradient boosting as an important new trend. I downloaded the R package for the algorithm and never looked back. XGBoost remains my favorite machine learning algorithm, although I’ve also looked at LightGBM and CatBoost.
Another time, we had Anthony Goldbloom, founder and CEO of Kaggle, come to tell the group about important trends his company had learned after years of operating the machine learning challenge site (Kaggle was acquired by Google in 2017). Many Kaggle competitors use R, and I learned a lot about the direction of the data science industry from learning what Kagglers are doing.
And then back in 2014, I attended my first useR! conference at my alma mater, UCLA. It was an eye-opening experience to learn about so many previously unknown (to me) R packages. Probably the biggest discovery was the caret package by Max Kuhn, which told me that data science was seeking a more organized toolset, packed with tools that made the lives of data scientists easier.
I believe that the methods suggested above for assessing the field of data science by monitoring the advancement of R’s most popular packages will yield an accurate picture. I make the same recommendations to the students in the university-level “Introduction to Data Science” classes I teach, and I’m told it really helps them get a handle on this rapidly advancing field.