I am one of the organizers for a session at useR 2017 this coming July that will focus on discovering and learning about R packages. How do R users find packages that meet their needs? Can we make this process easier? As somebody who is relatively new to the R world compared to many, this is a topic that resonates with me and I am happy to be part of the discussion. I am working on this session with John Nash and Spencer Graves, and we hope that some useful discussion and results come out of the session.
In preparation for this session, I wanted to look at the distribution of R packages by date, number of versions, etc. Some great plots came out around the time CRAN passed the 10,000 package mark, but most of the code behind those plots involves packages and idioms I am less familiar with, so here is an rvest- and tidyverse-centered version of those analyses!
The first thing we need to do is get all the packages that are currently available on CRAN. Let’s use rvest to scrape the page that lists all the packages currently on CRAN. That page also lists a few other files and directories besides packages, so we can use filter to remove the things that don’t look like R packages.
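As a rough sketch of that scrape-then-filter step, the block below parses a small hand-written HTML table that stands in for the CRAN listing page, so it runs without a network connection. The sample rows, the date format, and the column names are assumptions for illustration only; with the real page you would pass its URL to read_html() and check what the table actually looks like. It also assumes a version of rvest that provides html_element().

```r
library(rvest)      # read_html(), html_element(), html_table()
library(dplyr)
library(stringr)
library(lubridate)

# Hypothetical stand-in for the CRAN directory listing (made-up rows).
listing_html <- '<table>
  <tr><th>Name</th><th>Last modified</th><th>Size</th></tr>
  <tr><td>Archive/</td><td>2017-03-01 10:00</td><td>-</td></tr>
  <tr><td>PACKAGES.gz</td><td>2017-03-01 10:00</td><td>310K</td></tr>
  <tr><td>A3_1.0.0.tar.gz</td><td>2015-08-16 23:05</td><td>42K</td></tr>
  <tr><td>abc_2.1.tar.gz</td><td>2015-05-05 11:34</td><td>4.9M</td></tr>
</table>'

# Parse the HTML and pull the first table into a data frame.
pkgs_raw <- read_html(listing_html) %>%
  html_element("table") %>%
  html_table()

# Keep only rows that look like package tarballs, then tidy up the columns.
pkgs <- pkgs_raw %>%
  filter(Size != "-", str_detect(Name, "\\.tar\\.gz$")) %>%
  mutate(Date = ymd_hm(`Last modified`),              # assumed date format
         Name = str_extract(Name, "^[^_]+(?=_)")) %>% # text before the "_"
  select(Name, Date)
```

The directory row and the index file are dropped by the filter, leaving one row per package tarball.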
So that’s currently available packages!
Now let’s turn to the archive. Let’s do a similar operation.
That is good, but for packages that have been archived at least once we now need more detail: the date each was originally released and how many versions it has had.
Visiting every page in the archive
Let’s set up a function for scraping an individual page for a package and apply that to every page in the archive. This step takes A WHILE because it queries a web page for every package in the CRAN archive. I’ve set this up with map from purrr; it is one of my favorite ways to organize tasks these days.
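Here is a minimal sketch of that pattern: one small function that reads a single page, applied to every page with map. The helper name read_archive_page, the column names, and the two made-up pages are all hypothetical; in real use the page column would hold the URL of each package's archive directory instead of literal HTML.

```r
library(rvest)
library(dplyr)
library(purrr)

# Hypothetical helper: pull the table of tarballs out of one archive page.
# `page` is anything read_html() accepts (a URL in real use, literal HTML here).
read_archive_page <- function(page) {
  read_html(page) %>%
    html_element("table") %>%
    html_table() %>%
    filter(Size != "-")   # drop directory rows such as "Parent Directory"
}

# Two made-up archive pages standing in for real ones.
fake_pages <- c(
  abc = '<table>
    <tr><th>Name</th><th>Last modified</th><th>Size</th></tr>
    <tr><td>Parent Directory</td><td></td><td>-</td></tr>
    <tr><td>abc_1.0.tar.gz</td><td>2010-02-01 09:00</td><td>12K</td></tr>
    <tr><td>abc_1.1.tar.gz</td><td>2012-07-15 14:30</td><td>13K</td></tr>
  </table>',
  xyz = '<table>
    <tr><th>Name</th><th>Last modified</th><th>Size</th></tr>
    <tr><td>Parent Directory</td><td></td><td>-</td></tr>
    <tr><td>xyz_0.1.tar.gz</td><td>2016-11-30 08:15</td><td>7K</td></tr>
  </table>'
)

# One row per package, with the scraped table stored in a list-column.
archives <- tibble(Name = names(fake_pages), page = unname(fake_pages)) %>%
  mutate(versions = map(page, read_archive_page))
```

Against the live archive, it would probably be wise to wrap the helper in something like purrr's possibly() so that one failed request does not stop the whole run.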
What do these pages look like?
This is exactly what we need: the dates that the packages were released and how many times they have been released. Let’s use mutate and map again to extract these values.
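A sketch of what that extraction could look like, assuming each package's scraped page is stored as a data frame in a list-column called versions with an already-parsed Date column. The data and the column names ArchivedVersions and FirstRelease are invented for illustration.

```r
library(dplyr)
library(purrr)

# Made-up example: one row per package, scraped pages in a list-column.
archives <- tibble(
  Name = c("abc", "xyz"),
  versions = list(
    tibble(Name = c("abc_1.0.tar.gz", "abc_1.1.tar.gz"),
           Date = as.Date(c("2010-02-01", "2012-07-15"))),
    tibble(Name = "xyz_0.1.tar.gz",
           Date = as.Date("2016-11-30"))
  )
)

archives <- archives %>%
  mutate(
    # number of archived tarballs for each package
    ArchivedVersions = map_int(versions, nrow),
    # earliest date on each page; map_dbl() returns plain numbers,
    # so convert them back to Date afterwards
    FirstRelease = as.Date(map_dbl(versions, ~ as.numeric(min(.x$Date))),
                           origin = "1970-01-01")
  ) %>%
  select(-versions)
```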
Putting it together
Now it’s time to join the data from the currently available packages and the archives.
Packages that are in archives but not pkgs are no longer on CRAN.
Packages that are in pkgs but not archives only have one CRAN release.
Packages that are in both dataframes have had more than one CRAN release.
Sounds like a good time to use anti_join and inner_join.
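The three cases above can be sketched with toy data like this. The package names, dates, and the column names FirstRelease, ArchivedVersions, Releases, and Status are made up for illustration; only the join logic is the point.

```r
library(dplyr)

# Made-up "currently available" packages and their latest release date.
pkgs <- tibble(
  Name = c("alpha", "beta", "gamma"),
  Date = as.Date(c("2017-01-10", "2016-06-01", "2015-03-15"))
)

# Made-up archive summary: first release and number of archived versions.
archives <- tibble(
  Name = c("beta", "gamma", "delta"),
  FirstRelease = as.Date(c("2014-02-02", "2011-09-09", "2012-12-12")),
  ArchivedVersions = c(2L, 5L, 1L)
)

# In archives but not pkgs: no longer on CRAN.
removed <- anti_join(archives, pkgs, by = "Name")

# In pkgs but not archives: only one CRAN release.
single_release <- anti_join(pkgs, archives, by = "Name")

# In both: more than one CRAN release.
multi_release <- inner_join(pkgs, archives, by = "Name")

# Stack the three groups into one data frame with a common set of columns.
all_pkgs <- bind_rows(
  removed %>%
    transmute(Name, FirstRelease, Releases = ArchivedVersions,
              Status = "archived"),
  single_release %>%
    transmute(Name, FirstRelease = Date, Releases = 1L,
              Status = "available"),
  multi_release %>%
    transmute(Name, FirstRelease, Releases = ArchivedVersions + 1L,
              Status = "available")
)
```

For packages in both data frames, the current release is added to the archived ones, hence the + 1L.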
Let’s look at some results now.
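One way to get at the growth of CRAN over time is a cumulative count of packages by the year of their first release. This is only a sketch with invented dates, and the column names are assumptions:

```r
library(dplyr)

# Made-up first-release dates for six packages.
all_pkgs <- tibble(
  Name = c("a", "b", "c", "d", "e", "f"),
  FirstRelease = as.Date(c("2008-05-01", "2008-09-12", "2012-01-20",
                           "2016-03-03", "2016-07-07", "2016-12-25"))
)

# Count new packages per year, then accumulate the counts.
growth <- all_pkgs %>%
  count(Year = as.integer(format(FirstRelease, "%Y"))) %>%
  mutate(Cumulative = cumsum(n))

# With ggplot2 loaded, a line plot could then be drawn along the lines of:
# ggplot(growth, aes(Year, Cumulative)) + geom_line()
```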
There we go! That is similar to the results we all saw going around when CRAN passed 10,000 packages, which is good.
What about the number of archived vs. available packages?
And lastly, let’s look at the distribution of number of releases for each package.
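A quick numeric look at that distribution, before or instead of plotting it, could be sketched as follows. The release counts here are invented, and Releases is an assumed column name:

```r
library(dplyr)

# Made-up release counts for six packages.
all_pkgs <- tibble(
  Name = c("a", "b", "c", "d", "e", "f"),
  Releases = c(1L, 1L, 1L, 2L, 5L, 2L)
)

# How many packages have had 1 release, 2 releases, and so on,
# plus each group's share of all packages.
release_dist <- all_pkgs %>%
  count(Releases) %>%
  mutate(Share = n / sum(n))
```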
It is pretty ironic that I worked on this code and wrote this post because I wanted to do an analysis using different packages than the ones used in the original scripts shared. That is exactly part of the challenge facing all of us as R users now that there is such a diversity of tools out there! I hope that our session at useR this summer provides some clarity and perspective for attendees on these types of issues. The R Markdown file used to make this blog post is available here. Bob Rudis has let me know that there are easier ways to get the data that I used for these plots, and I am very happy to hear about that or other feedback and questions!