Spark and The Art of Data Science
Apache Spark, or simply “Spark,” is a distributed, fault-tolerant, scalable framework that processes massive amounts of data. As it processes data, Spark abstracts the distribution of the computations across a machine cluster, thus enabling you to create applications using Java, Scala, Python, R, and... Read more
Quoting and Macros in R
Miles McBain has a nice post about quoting in R and the tidyeval procedure. In it, there’s this footnote: “In truth there are other types of calls, and the ones Lisp nuts really bang on about are macro calls.” In this post I want to talk... Read more
Testing Probability Distribution Generators
In the ‘regression tests’ that are part of any change to the base-R source code, there’s a file called p-r-random-tests.R. People notice it from time to time because the tests sometimes fail. That’s what is supposed to happen. Testing random number generators is hard, because it’s hard... Read more
Exploratory Data Analysis in R
Hi there! tl;dr: Exploratory data analysis (EDA) is the very first step in a data project. We will create a code template to achieve this with one function. Introduction: EDA consists of univariate (1-variable) and bivariate (2-variable) analysis. In this post we will review some functions that lead... Read more
R Tip: Be Wary of “…”
The following code example contains an easy error in using the R function unique():
    vec1 <- c("a", "b", "c")
    vec2 <- c("c", "d")
    unique(vec1, vec2)
    # "a" "b" "c"
Notice none of the novel values from vec2 are present in the result. Our mistake was: we (improperly) tried to use unique() with... Read more
R Tip: Use isTRUE()
R Tip: use isTRUE(). A lot of R functions are type unstable, which means they return different types or classes depending on details of their values. For example, consider all.equal(): it returns the logical value TRUE when the items being compared are equal:
    all.equal(1:3, c(1, 2, 3))
    # TRUE
However, when the... Read more
New Version of ggplot2
I just received a note from Hadley Wickham that a new version of ggplot2 is scheduled to be submitted to CRAN on June 25. Here’s what choroplethr users need to know about this new version of ggplot2. Choroplethr Update Required: The new version of ggplot2 introduces... Read more
rqdatatable: rquery Powered by data.table
rquery is an R package for specifying data transforms using piped Codd-style operators. It has already shown great performance on PostgreSQL and Apache Spark. rqdatatable is a new package that supplies a screaming fast implementation of the rquery system in-memory using the data.table package. rquery is already one of the fastest and most teachable (due to deliberate conformity to Codd’s influential work) tools to wrangle data... Read more
WVPlots now at version 1.0.0 on CRAN!
Nina Zumel and I have been working on packaging our favorite graphing techniques in a more reusable way that emphasizes the analysis task at hand over the steps needed to produce a good visualization. We are excited to announce that WVPlots is now at version 1.0.0 on CRAN! The idea is: we... Read more
wrapr 1.4.1 now up on CRAN
wrapr 1.4.1 is now available on CRAN. wrapr is a really neat R package for organizing, meta-programming, and debugging R code. This update generalizes the dot-pipe feature’s dot S3 features. Please give it a try! wrapr is an R package that supplies powerful tools for writing and debugging R code. Introduction: Primary wrapr services include: let() (let block), %.>% (dot... Read more