R
Snakes in a Package: Combining Python and R with Reticulate
When I first started working as a data scientist (or something like it) I was told to program in C++ and Java. Then R came along and it was liberating; my ability to do data analysis increased substantially. As my applications grew in size and complexity, I started to...
SQL Equivalents in R
Whenever I’m teaching introductory courses in data science using the R language, I often encounter students who use a different language like Python or Julia, and still others who are transitioning into data science from other fields and don’t know any data science language at all. The common thread...
Monthly Summary of Selected Trends, Activities, and Insights for R – July 2018
R is a leading language in the data science domain. In the following article, a summary of selected trends, activities, and insights around the R language from July 2018 is presented. Data for the trends and activities summarized here were obtained from popular websites used by the R community such...
The Tidyverse Curse
I’ve just finished a major overhaul of my widely read article, Why R is Hard to Learn. It describes the main complaints I’ve heard from the participants in my workshops, and how those complaints can often be mitigated. Here’s the only new section: The Tidyverse Curse. There’s a common theme in many...
rquery: Fast Data Manipulation in R
Win-Vector LLC recently announced the rquery R package, an operator-based query generator. In this note I want to share some exciting and favorable initial rquery benchmark timings. Note we have now (1-16-2018) re-run this benchmark with a faster, better-tuned version of the data.table solution (same package, just better use of it). Let’s take a look at...
Group-By Modeling in R Made Easy
There are several aspects of the R language that make it hard to learn, and repeating a model for groups in a data set used to be one of them. Here I briefly describe R’s built-in approach, show a much easier one, then refer you to a new approach described...
Seeking Guidance in Choosing and Evaluating R Packages
At useR! 2017 in Brussels last month, I contributed to an organized session focused on navigating the 11,000+ packages on CRAN. My collaborators on this session and I recently put together an overall summary of the session and our goals, and now I’d like to talk more about the specific issue of learning...
Tutorial: Using seplyr to Program Over dplyr
seplyr is an R package that makes it easy to program over dplyr 0.7.*. To illustrate this we will work an example. Suppose you had worked out a dplyr pipeline that performed an analysis you were interested in. For an example we could take something similar to one of the examples from the dplyr 0.7.0 announcement. suppressPackageStartupMessages(library("dplyr")) packageVersion("dplyr") ##...
Let’s Have Some Sympathy For The Part-time R User
When I started writing about methods for better “parametric programming” interfaces to dplyr for R users in December of 2016, I encountered three divisions in the audience: dplyr users who had such a need, and wanted such extensions; dplyr users who did not have such a need (“we always know the column names”); and dplyr users who found...
Feature Engineering with Tidyverse
In this blog post, I will discuss feature engineering using the Tidyverse collection of libraries. Feature engineering is crucial for a variety of reasons, and it requires some care to produce any useful outcome. In this post, I will consider a dataset that contains descriptions of crimes in San Francisco between...
Open Data Science - Your News Source for AI, Machine Learning & more