R
The Tidyverse Curse
I’ve just finished a major overhaul to my widely read article, Why R is Hard to Learn. It describes the main complaints I’ve heard from the participants to my workshops, and how those complaints can often be mitigated. Here’s the only new section: The Tidyverse Curse There’s a common theme in many... Read more
rquery: Fast Data Manipulation in R
Win-Vector LLC recently announced the rquery R package, an operator based query generator. In this note I want to share some exciting and favorable initial rquery benchmark timings. Note we have now (1-16-2018) re-run this benchmark with a faster, better tuned, version of the data.table solution (same package, just better use of it). Let’s take a look at... Read more
Group-By Modeling in R Made Easy
There are several aspects of the R language that make it hard to learn, and repeating a model for groups in a data set used to be one of them. Here I briefly describe R’s built-in approach, show a much easier one, then refer you to a new approach described... Read more
Seeking Guidance in Choosing and Evaluating R Packages
At useR!2017 in Brussels last month, I contributed to an organized sessionfocused on navigating the 11,000+ packages on CRAN. My collaborators on this session and I recently put together an overall summary of the session and our goals, and now I’d like to talk more about the specific issue of learning... Read more
Tutorial: Using seplyr to Program Over dplyr
seplyr is an R package that makes it easy to program over dplyr0.7.*. To illustrate this we will work an example. Suppose you had worked out a dplyr pipeline that performed an analysis you were interested in. For an example we could take something similar to one of the examples from the dplyr 0.7.0 announcement. suppressPackageStartupMessages(library("dplyr")) packageVersion("dplyr") ##... Read more
Let’s Have Some Sympathy For The Part-time R User
When I started writing about methods for better “parametric programming” interfaces for dplyr for R dplyr users in December of 2016 I encountered three divisions in the audience: dplyr users who had such a need, and wanted such extensions. dplyr users who did not have such a need (“we always know the column names”). dplyr users who found... Read more
Feature Engineering with Tidyverse
In this blog post, I will discuss feature engineering using the Tidyverse collection of libraries. Feature engineering is crucial for a variety of reasons, and it requires some care to produce any useful outcome. In this post, I will consider a dataset that contains description of crimes in San Francisco between... Read more
How Do You Discover R Packages?
Like I mentioned in my last blog post, I am contributing to a session at userR 2017 this coming July that will focus on discovering and learning about R packages. This is an increasingly important issue for R users as we all decide which of the 10,000+ packages to... Read more
On indexing operators and composition
In this article I will discuss array indexing, operators, and composition in depth. If you work through this article you should end up with a very deep understanding of array indexing and the deep interpretation available when we realize indexing is an instance of function composition (or an example... Read more
Fixing an infelicity in ‘leaps’
The ‘leaps’ package for R is ancient – this is its tenth twentieth year on CRAN.  It uses old Fortran code by the Australian computational statistician Alan Miller. The Fortran 90 versions are on the web, but Fortran 90 compilation with R wasn’t portable back then, so I used the older... Read more
Open Data Science - Your News Source for AI, Machine Learning & more