R or Python for data science?
By: Jason O’Rawe – ODSC data science team contributor
R and Python are some of the most popular topics at ODSC conferences. A 2014 KDnuggets poll1 suggests that R, Python, SAS and SQL are among the top languages and tools used by data scientists, and almost every posting on the topic mentions at least three of them. There even seems to be a not-so-silent2 competition between the R and Python communities over crowning one or the other as ‘best’ for data science tasks. A nice post by Martijn Theuwissen3 in 2015 summarizes the state of R and Python comparisons. So, is R or Python better for data science? Well, that depends, Martijn suggests, and this is exactly right. R and Python are defined by their differences, and both have unique advantages.
R started in many academic circles as a free alternative to Matlab. Because academics could freely create, distribute and use packages developed by the community, R took off as the lingua franca of data analysis. The R community has grown rapidly, and CRAN, the repository for R packages, has swelled along with it. Almost every analysis task has an R package devoted to it. The native plotting libraries are easy to use, which enables quick data visualization and exploration, and packages such as ggplot2 give R users powerful tools for presenting their results more effectively and attractively.
Some notable recent additions to the R ecosystem are RStudio and the Shiny web-app framework. RStudio enables quick, easy and interactive data analysis and exploration, with reproducible results that can be saved and shared among colleagues and the community. Shiny makes it easy to build R-powered web apps, such as data-analysis dashboards or reporting tools for quantitative managers.
However, if you are starting off as a programmer, R is relatively difficult to learn. Quirks that many come to love in R are the bane of new users. R is considered slow by some standards, but that view is changing thanks to developments like (now) Microsoft's Revolution R and APIs for distributed computing libraries, such as SparkR.
R is an analysis language, and as such it is rarely used for anything else. Indeed, Stack Overflow questions about R are highly interlinked and point to a tight-knit community of users and developers.
Python is a quick, multi-use language that is relatively easy to learn. Its data-crunching ecosystem has seen rapid growth. Notable is the rich suite of user-facing machine learning packages such as Scikit-learn, XGBoost and others. Pandas has brought data analysis in Python to a new high, making it as natural and easy as it is in R. Python is also host to a number of rich web-development frameworks that are used not only for building data science dashboards, but also for full-scale web apps. The pluggability of these frameworks has made it possible to build expressive, integrative and powerful web apps. Flask and Django lead the Python web-app development landscape, but Bottle and Pyramid are also quite popular.
Like R, Python is considered slow in comparison to the likes of C or C++. Unlike R, the Python data science ecosystem lacks a star IDE like RStudio. Neither criticism has gone unanswered, however. Several easy-to-use tutorials describe development with Cython, which taps into the speed of C using a slightly different, C-like syntax. In addition, Yhat has released a very nice Python IDE, Rodeo, that mimics the look and feel of RStudio, although it is still in the early release stages. The Beaker notebook, although not specific to Python, is also a nice IDE for data science. It has labeled itself the “data scientist's laboratory”, as it allows for development in a number of data science related languages.
Unlike R, Python is used for a myriad of general-purpose programming and piping tasks. Although it is a powerful language for data analysis, Python is a general programming language that is not limited to its number-crunching abilities. Indeed, Stack Overflow questions about Python, and the communities around them, are focused on general development or language-specific usage topics.
Do companies hire R or Python experts?
Despite their differences, R and Python have similar use cases and are both fully featured ‘data science’ programming languages. Both are powerful, and both give the user an incredible toolset for performing impactful data analyses. But will knowing R rather than Python, or the reverse, give you a leg up on the job market? We came across a very nice posting by Jesse Steinweg-Woods5, which demonstrates how to use Beautiful Soup and other Python tools to scrape Indeed.com for data science job postings. Are companies looking for R or Python experts? We looked at the top 5 cities for data science jobs and asked which data science languages and tools are most popular.
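To make the counting step concrete, here is a minimal sketch of how tool mentions might be tallied once posting text has been collected. It is not the code from the posting cited above: the function name, the keyword patterns and the three sample postings are all made up for illustration, and the scraping itself is left out.

```python
import re
from collections import Counter

# Hypothetical keyword patterns, chosen for illustration only.
# "R" is matched case-sensitively and only as a standalone letter,
# so words like "required" or terms like "R&D" are not counted.
TOOL_PATTERNS = {
    "R": re.compile(r"(?<![\w&])R(?![\w&])"),
    "Python": re.compile(r"\bpython\b", re.IGNORECASE),
    "SQL": re.compile(r"\bsql\b", re.IGNORECASE),
    "SAS": re.compile(r"\bsas\b", re.IGNORECASE),
}


def count_tool_mentions(postings):
    """Count how many postings mention each tool at least once."""
    counts = Counter()
    for text in postings:
        for tool, pattern in TOOL_PATTERNS.items():
            if pattern.search(text):
                counts[tool] += 1
    return counts


# Invented sample text, not real job postings.
postings = [
    "Seeking a data scientist fluent in Python and SQL.",
    "Experience with R, Python, or SAS required.",
    "Strong SQL skills; R is a plus.",
]

counts = count_tool_mentions(postings)
for tool, n in counts.most_common():
    print(f"{tool}: {n} of {len(postings)} postings")
```

Each posting is counted at most once per tool, so the result reads as "how many postings ask for this skill" rather than how often the word appears. Simple keyword matching like this will still miss some phrasings and catch some false positives, so it is best treated as a rough estimate.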
How do R and Python stack up? Unsurprisingly, R and Python are reliably the top two programming languages that companies want their data scientists to know. Python takes a slight lead over R, although the difference is almost negligible. Both are among the most sought-after skills in industry, with SQL a clear third. Hadoop, Java, Spark, Pig and Hive trail behind. More surprising to us was the prevalence of Excel, Matlab, SAS, SPSS and Tableau, which are not always thought of as the most popular toolsets among data scientists. Julia also makes a surprise showing. It is still young in its development cycle, but it seems to be increasingly popular for data science tasks.
R and Python are distinct languages with their own benefits and pitfalls. Both are powerful and popular, and experts in both languages are highly sought after on the data science job market. Despite the surge of Python in data science, skills in R are just as valuable, and as others suggest, it might not be a bad idea to learn and become an expert in both. A recent blog post from Martijn Theuwissen suggests exactly this. Indeed, being able to communicate with both the R and Python communities, and to draw on their knowledge and experience, could significantly boost your own knowledge base and career prospects.
Check out R and Python for data science at ODSC East 2016.