Integrating Python and R
Over time, Python and R have established themselves as the leading languages for Data Science. The rise of both has not been frictionless, though, as the two communities have ‘clashed’ over philosophical differences as each side recruits Data Science newcomers. R users argue that R is the better language to learn, citing its well-developed and easier-to-use ecosystem for building data-driven web apps and data visualization and, most importantly, the fact that its foundations are statistical to the core. The Python community, on the other hand, would say that as a language made for general computing and then ‘re-purposed’ for Data Science, Python integrates seamlessly into systems at any scale, is easier to learn than R, and adequately takes care of the mathematical needs of Data Science.
For a while this adversarial situation persisted, but recently the direction of the discussion has switched to, in the words of the well-known Internet meme, “Why not both?” Data Scientists are now being encouraged to learn both in order to not limit themselves to one particular workflow, and to be able to leverage what each language does best. This is why ODSC is careful to plan Data Science Conferences that address both languages equally. ODSC East will feature an eclectic lineup of speakers that represent both communities.
However, this new synergy is mostly superficial. A user of both languages will often have to make an explicit choice of which to use, and that choice completely changes their working environment. If the choice is Python, then the analysis will most often be done in the ubiquitous Jupyter notebook. If it’s R, then RStudio’s hugely popular eponymous IDE will most likely be the tool of choice. Consider a workflow that uses Python’s mature web scraping capabilities to gather data, switches seamlessly to R to leverage its breadth of statistical tools, and then models in both languages to explore both descriptive and predictive models. That workflow is not yet a reality.
There are tools for pairing the languages more closely. One such method involves passing data (such as a CSV file or the new feather format) and code between them through command-line subprocesses. A higher-level interface for R that fulfils a similar goal is the rPython library, which allows R users to write Python scripts, outside or inside RStudio, and load them into R scripts. For Python users there is the rpy2 library, which can be used on its own or integrated into the Jupyter notebook by utilizing the R magic function. However, the interactivity that is common in a Data Scientist’s analyses is lost with these methods.
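The file-and-subprocess approach described above can be sketched roughly as follows. This is a minimal illustration, not a recommended pipeline: it assumes the Rscript command-line tool is installed and on the PATH, and the embedded R script, the file names, and the helper functions (write_rows, read_rows, rscript_command, column_means_via_r) are all hypothetical names invented for this example.

```python
import csv
import shutil
import subprocess
import tempfile
from pathlib import Path

# Hypothetical R script for illustration: it reads the CSV named by its first
# argument and writes the mean of each column to the CSV named by its second.
R_SCRIPT = """
args <- commandArgs(trailingOnly = TRUE)
df <- read.csv(args[1])
means <- data.frame(column = names(df), mean = colMeans(df))
write.csv(means, args[2], row.names = FALSE)
"""


def write_rows(path, header, rows):
    """Write a header and data rows to a CSV file that R can read."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def read_rows(path):
    """Read a CSV file back as a list of rows (every value is a string)."""
    with open(path, newline="") as f:
        return list(csv.reader(f))


def rscript_command(script, csv_in, csv_out):
    """Build the command line that hands the two file paths to the R script."""
    return ["Rscript", str(script), str(csv_in), str(csv_out)]


def column_means_via_r(header, rows):
    """Send data to R through a CSV file and read R's answer from another.

    Returns None when Rscript is not available (an assumption of this sketch).
    """
    if shutil.which("Rscript") is None:
        return None
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        script = tmp / "means.R"
        csv_in = tmp / "in.csv"
        csv_out = tmp / "out.csv"
        script.write_text(R_SCRIPT)
        write_rows(csv_in, header, rows)
        # check=True raises CalledProcessError if the R script fails.
        subprocess.run(rscript_command(script, csv_in, csv_out), check=True)
        return read_rows(csv_out)


if __name__ == "__main__":
    result = column_means_via_r(["x", "y"], [[1, 2], [3, 4]])
    print(result if result is not None else "Rscript not found on PATH")
```

Every exchange here is a round trip through the file system and a fresh R process, which is one way to see why this style of integration loses the interactivity mentioned above.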
The language-agnostic nature of the Jupyter notebook also allows users to install an R kernel so that R-specific notebooks can be produced in addition to those in Python. Like rpy2 and rPython, this is merely an artificial joining of the two languages instead of true integration. However, there is a notebook that gets closer to this ultimate goal. This is the Beaker notebook, a platform profiled in an earlier blog post. Developed by the team at financial company Two Sigma, its language-agnostic nature goes a level beyond Jupyter’s by allowing languages to interact within the same notebook. Thus, one can do analysis in R and then seamlessly pass it off to Python or vice versa.
Beaker’s popularity is low at the moment and there is presently no sign that this will change anytime soon. If it doesn’t, perhaps the Jupyter notebook or another platform will come along to allow Data Scientists to achieve true linguistic synergy when it comes to working with Python and R.