When I first started working as a data scientist (or something like it) I was told to program in C++ and Java. Then R came along and it was liberating; my ability to do data analysis increased substantially. As my applications grew in size and complexity, I started to miss the structure of Java/C++. At the time, Python felt like a good compromise so I switched again. After joining Mango Solutions I noticed I was not an anomaly: most data scientists here know both Python and R.
Nowadays whenever I do my work in R there is a constant nagging voice in the back of my head telling me “you should do this in Python.” And when I do my work in Python, it’s telling me “you can do this faster in R.” So when the reticulate package came out I was overjoyed, and in this blog post I will explain why.
re-tic-u-late (rĭ-tĭkˈyə-lĭt, -lātˌ)
So what exactly does reticulate do? Its goal is to facilitate interoperability between Python and R. It does this by embedding a Python session within the R session which enables you to call Python functionality from within R. I’m not going to go into the nitty-gritty of how the package works here; RStudio has done a great job in providing some excellent documentation and a webinar. Instead, I’ll show a few examples of the main functionality.
Just like R, the House of Python was built upon packages. Except in Python, you don’t load functionality from a package through a call to library but instead, you import a module. reticulate mimics this behavior and opens up all the goodness from the module that is imported.
library(reticulate)
np <- import("numpy")
# the Kronecker product is my favourite matrix operation
np$kron(c(1,2,3), c(4,5,6))
## [1]  4  5  6  8 10 12 12 15 18
In the above code, I import the numpy module, which is a powerful package for all sorts of numerical computations. reticulate then gives us an interface to all the functions (and objects) from the numpy module. I can call these functions just like any other R function and pass in R objects; reticulate will make sure the R objects are converted to the appropriate Python objects.
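To make the result above concrete, here is what the Kronecker product computes for two vectors, written in plain Python rather than through numpy. The helper name `kron_1d` is made up for this sketch, and it only covers the one-dimensional case; numpy's own `kron` also handles matrices and higher-dimensional arrays.

```python
def kron_1d(x, y):
    """Kronecker product of two 1-D sequences.

    Hypothetical helper for illustration only: each element of x is
    multiplied by every element of y, in order.
    """
    return [a * b for a in x for b in y]


result = kron_1d([1, 2, 3], [4, 5, 6])
print(result)  # [4, 5, 6, 8, 10, 12, 12, 15, 18]
```

The first three values are 1 times (4, 5, 6), the next three are 2 times (4, 5, 6), and so on, which matches the vector returned in the R session.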
You can also run Python code through source_python if it’s an entire script or py_run_string if it’s a single line of code. Any objects (functions or data) created by the script are loaded into your R environment. Below is an example of using
## mpg      642.900
## cyl      198.000
## disp    7383.100
## hp      4694.000
## drat     115.090
## wt       102.952
## qsec     571.160
## vs        14.000
## am        13.000
## gear     118.000
## carb      90.000
## dtype: float64
Notice the use of the r. prefix in front of the mtcars object in the Python code. The r object exposes the R environment to the Python session; its equivalent in the R session is the py object. The mtcars data.frame is converted to a pandas DataFrame, to which I then applied the sum function on each column.
Clearly, RStudio has put in a lot of effort to ensure a smooth interface to Python, from the easy conversion of objects to the IDE integration. Not only will reticulate enable R users to benefit from the wealth of functionality from Python, I believe it will also enable more collaboration and increased sharing of knowledge.
So what is it exactly that you can do with Python that you can’t with R? I asked myself the same question until I came across the following use case.
While helping a colleague out with a blog post it was suggested that I should publish it on a Tuesday. No rationale was given so naturally I wondered if I could provide one using data. The data would have to come from R-bloggers. This is a great resource for reading blog posts about R (and related topics) and they also provide a daily newsletter with a link to the blog posts from that day. At the time the newsletter seemed the easiest way to collect data. All I needed to do now was extract the data from my Gmail account.
Therein lay the problem, as I wanted to avoid querying the Gmail server (doing so would make the analysis hard to reproduce). Fortunately, Google have made it easy to download your data (thanks to the Google Data Liberation Front) through Google Takeout. Unfortunately, all the e-mails are exported in the mbox format. Although this is a plain text-based format it would take some effort to write a parser in R, something I wasn’t willing to do. And then came along Python, which has a built-in mbox parser in the mailbox module.
Using reticulate I extracted the necessary information from each e-mail.
# import the module
mailbox <- import("mailbox")
# use the mbox function to open a file connection
cnx <- mailbox$mbox("rblogs_box.mbox")
# the messages are stored as key/value pairs
# in this case they are indexed by an integer id
message <- cnx$get_message(10L)
# each message has a number of fields with meta-data
message$get("Date")
## [1] "Mon, 12 Dec 2016 23:56:19 +0000"
message$get("Subject")
## [1] "[R-bloggers] Building Shiny App exercises part 1 (and 7 more aRticles)"
And there we have it! I just read an e-mail from an mbox file with very little effort. Of course, I will need to do this for all messages, so I wrote a function to help me. And because we’re living in the Age of R I placed this function in an R package. You can find it in the MangoTheCat GitHub repo; it is called mailman.
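For readers curious what "doing this for all messages" looks like on the Python side, here is a minimal sketch using only the standard library. It is not the code inside mailman; the function name `read_headers`, the file name `demo.mbox` and the demo subject line are all made up for illustration. The sketch first writes a one-message mbox file so that it runs without a real Takeout export.

```python
import mailbox
import os
import tempfile


def read_headers(path, fields=("Date", "Subject")):
    """Return one dict per message holding the requested header fields.

    Hypothetical helper, not the mailman implementation.
    """
    box = mailbox.mbox(path)
    try:
        return [{field: message.get(field) for field in fields}
                for message in box]
    finally:
        box.close()


# Build a tiny mbox file with one made-up message.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.mbox")
demo_box = mailbox.mbox(demo_path)  # creates the file if it does not exist
demo_box.add(
    "Date: Mon, 12 Dec 2016 23:56:19 +0000\n"
    "Subject: [R-bloggers] Demo post (and 7 more aRticles)\n"
    "\n"
    "Body of the newsletter.\n"
)
demo_box.flush()
demo_box.close()

rows = read_headers(demo_path)
print(rows[0]["Date"])
print(rows[0]["Subject"])
```

One thing to watch for with real exports: long Subject headers are often folded over several lines, so the value you get back may contain a line break before the final words.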
To publish or not to publish?
I have yet to provide a rationale for publishing a blog post on a particular day so let’s quickly get to it. With the package all sorted I can now call the function mailman::read_messages to get a tibble with everything I need.
We can extract the number of blog posts on a particular date from the subject of each e-mail. Aggregating that by day of the week will then give us a good overview of which day is popular.
library(dplyr)
library(mailman)
library(lubridate)
library(stringr)

messages <- read_messages("rblogs_box.mbox", type="mbox") %>%
  mutate(Date = as.POSIXct(Date, format="%a, %d %b %Y %H:%M:%S %z"),
         Day_of_Week = wday(Date, label=TRUE, abbr=TRUE),
         Number_Articles = str_extract(Subject, "[0-9](?=[\\n]* more aRticles)"),
         # Whenever a regex works you feel like a superhero!
         Number_Articles = as.numeric(Number_Articles) + 1,
         # Ok, sometimes it doesn’t work but you’re still a hero for trying!
         Number_Articles = ifelse(is.na(Number_Articles), 1, Number_Articles)) %>%
  select(Date, Day_of_Week, Number_Articles)
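The same idea (parse the date, pull the article count out of the subject, average per weekday) can be sketched in plain Python. This is an illustration on four made-up subject lines, not the analysis behind this post. The names `count_articles` and `average_by_day` are hypothetical, and the pattern uses `\d+` so that counts of ten or more are captured, which is my choice for the sketch rather than the pattern used in the R code.

```python
import re
from collections import defaultdict
from email.utils import parsedate_to_datetime

# Made-up headers in the newsletter's format; real data would come from the mbox file.
messages = [
    {"Date": "Mon, 12 Dec 2016 23:56:19 +0000",
     "Subject": "[R-bloggers] Post A (and 7 more aRticles)"},
    {"Date": "Tue, 13 Dec 2016 23:55:02 +0000",
     "Subject": "[R-bloggers] Post B (and 11 more aRticles)"},
    {"Date": "Sat, 17 Dec 2016 23:57:40 +0000",
     "Subject": "[R-bloggers] Post C"},
    {"Date": "Mon, 19 Dec 2016 23:56:00 +0000",
     "Subject": "[R-bloggers] Post D (and 3 more aRticles)"},
]

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
MORE_ARTICLES = re.compile(r"(\d+)\s+more aRticles")


def count_articles(subject):
    """Headline post plus the 'N more aRticles' in the subject, if present."""
    match = MORE_ARTICLES.search(subject)
    return int(match.group(1)) + 1 if match else 1


# Group the article counts by day of the week.
counts_by_day = defaultdict(list)
for m in messages:
    day = DAYS[parsedate_to_datetime(m["Date"]).weekday()]
    counts_by_day[day].append(count_articles(m["Subject"]))

# Average number of articles per newsletter for each weekday seen.
average_by_day = {day: sum(c) / len(c) for day, c in counts_by_day.items()}
print(average_by_day)
```

A subject without the "more aRticles" suffix counts as a single post, which mirrors the NA handling in the R code.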
Judging by the graph, weekends would be a good time to publish a blog post as there is less competition. Then again, not many people might read blog posts on the weekend. The next candidate would then be Monday which has the lowest average among the weekdays. Coming back to my original quest, I can conclude that publishing on a Tuesday is not the best option.
In my opinion, the reticulate package is a ground-breaking development. It allows me to combine the good parts of R with the good parts of Python (it’s already in use by the tensorflow and keras packages). Also, it allows the data science community to collaborate more easily and focus our energy on getting things done. This is the future, this is R and Python (Rython? PRython? PyR?).