rqdatatable: rquery Powered by data.table rqdatatable: rquery Powered by data.table
rquery is an R package for specifying data transforms using piped Codd-style operators. It has already shown great performance on PostgreSQL and Apache Spark. rqdatatable is a new package that supplies a screaming... rqdatatable: rquery Powered by data.table

rquery is an R package for specifying data transforms using piped Codd-style operators. It has already shown great performance on PostgreSQL and Apache Sparkrqdatatable is a new package that supplies a screaming fast implementation of the rquery system in-memory using the data.table package.

rquery is already one of the fastest and most teachable (due to deliberate conformity to Codd’s influential work) tools to wrangle data on databases and big data systems. And now rquery is also one of the fastest methods to wrangle data in-memory in R (thanks to data.table, via a thin adaption supplied by rqdatatable).

Teaching rquery and fully benchmarking it is a big task, so in this note we will limit ourselves to a single example and benchmark. Our intent is to use this example to promote rquery and rqdatatable, but frankly the biggest result of the benchmarking is how far out of the pack data.table itself stands at small through large problem sizes. This is already known, but it is a much larger difference and at more scales than the typical non-data.table user may be aware of.

The R package development candidate rquery 0.5.0 incorporates a number of fixes and improvements. One interesting new feature is the DBI package is now suggested or optional, instead of required. This means rquery is ready to talk to non-DBI big data systems such as SparkR (example here) and it let us recruit a very exciting new rquery service provider: data.table!

data.table is, by far, the fastest way to wrangle data at scale in-memory in R. Our experience is that it starts to outperform base Rinternals and all other packages at moderate data sizes such as mere tens or hundreds of rows. Of course data.table is most famous for its performance in the millions of rows and gigabytes of data range.

However, because of the different coding styles there are not as many comparative benchmarks as one would like. So performance is often discussed as anecdotes or rumors. As a small step we are going to supply a single benchmark based on our “score a logistic regression by hand” problem from “Let’s Have Some Sympathy For The Part-time R User” (what each coding solution looks like can be found here).

In this note we compare idiomatic solutions to the example problem using: rquerydata.table, base R (using stats::aggregate()), and dplyrdplyr is included due to its relevance and popularity. Full details of the benchmarking can be found here and full results here. One can always do more benchmarking and control for more in experiments. One learns more from a diversity of benchmarks than from critiquing any one benchmark, so we will work this example briefly and provide links to a few others benchmarks. Our measurements confirm the common (correct) observation and conclusion: that data.table is very fast. Our primary new observation is that the overhead from the new rqdatatableadapter is not too large and rqdatatable is issuing reasonable data.table commands.

Both the rquery and dplyr solutions can be run in multiple modalities: allowing the exact same code to be used in memory or on a remote big data system (a great feature, critical for low-latency rehearsal and debugging). These two systems can be run as follows.

  • rquery is a system for describing operator trees. It deliberately does not implement the data operators, but depends on external systems for implementations. Previously any sufficiently standard SQL92 database that was R DBI compliant could serve as a back-end or implementation. This already includes the industrial scale database PostgreSQL and the big data system Apache Spark (via the SparklyR package). The 0.5.0 development version of rquery relaxes the DBI requirement (allowing rquery to be used directly with SparkR) and admits the possibility of non-SQL based implementations. We have a new data.table based implementation in development as the rqdatatable package.
  • dplyr also allows multiple implementations (in-memory, DBI SQL, or data.table). We tested all three, and the dplyr pipeline worked identically in-memory and with PostgreSQL. However, the dtplyrpipeline did not generate valid data.table commands, due to an issue with window functions or ranking, so we were not able to time it using data.table.

We are thus set up to compare to following solutions to the logistic scoring problem:

  • A direct data.table solution running in memory.
  • The base R stats::aggregate() solution working on in-memory data.frames.
  • The rquery solution using the data.table service provider rqdatatable to run in memory.
  • The rquery solution sending data to PostgreSQL, performing the work in database, and then pulling results back to memory.
  • The dplyr solution working directly on in-memory data.frames.
  • The dplyr solution working directly on in-memory tibble:tbls (we are not counting any time for conversion).
  • The dplyr solution sending data to PostgreSQL, performing the work in database, and then pulling results back to memory.

Running a 1,000,000 row by 13 column example can be summarized with the following graph.

Presenttimings 18

The vertical dashed line is the median time that repeated runs of the base R stats::aggregate() solution took. We can consider results to the left of it as “fast” and results to the right of it as “slow.” Or in physical terms: data.table and rquery using data.table each take about 1.5 seconds on average. Whereas dplyr takes over 20 seconds on average. These two durations represent vastly different user experiences when attempting interactive analyses.

We have run some more tests to try to see how this is a function of problem scale (varying the number of rows of the data). Due to the large range (2 to 10,000,000 rows) we are using log scales, but they unfortunately are just not as readable as the linear scales.

Presenttimings 22
What we can read off this graph includes:

  • data.table is always the fastest system (or at worst indistinguishable from the fastest system) for this example,
    at the scales of problems tested, and for this configuration and hardware.
  • The data.table backed version of rquery becomes comparable to native data.table itself at around 100,000 rows. This is evidence the translation overhead is not too bad for this example and that the sequence of data.table commands issued by rqdatatable are fairly good practice data.table.
  • The database backed version of rquery starts to outperform dplyrat around 10,000 rows. Note: all database measurements include the overhead of moving the data to the database and then moving the results back to R. This is slower than how one would normally use a database in production: with data starting and ending on the database and no data motion between R and the database.
  • dplyr appears to be slower than the base R stats::aggregate() solution at all measured scales (it is always above the shaded region).
  • It is hard to read, but changes in heights are ratios of runtimes. For example the data.table based solutions are routinely over 10 times faster that the dplyr solutions once we get to 100,000 rows or more. This is an object size of only about 10 megabytes and is well below usual “use data.table once you are in the gigabytes range” advice.

Of course benchmarks depend on the example problems, versions, and machines- so results will vary. That being said, large differences often have a good chance of being preserved across variations of tests (and we share another grouped example here, and a join example here; for the join example dplyr is faster at smaller problem sizes- so results do depend on task and scale).

We are hoping to submit the rquery update to CRAN in August and then submit rqdatatable as a new CRAN package soon after. Until then you can try both packages by a simple application of:

devtools::install_github("WinVector/rqdatatable")

These are new packages, but we think they can already save substantial development time, documentation time, debugging time, and machine investment in “R and big data” projects. Our group (Win-Vector LLC) is offering private training in rquery to get teams up to speed quickly.

Note: rqdatatable is an implementation of rquery supplied by data.table, not a data.table scripting tool (as rqdatatable does not support important data.table features not found in rquery, such as slice based operations, and rolling joins).


 

Original Source

John Mount

John Mount

My specialty is analysis and design of algorithms, with an emphasis on efficient implementation. I work to find applications of state of the art methods in optimization, statistics and machine learning in various application areas. Currently co-authoring "Practical Data Science with R"

Open Data Science - Your News Source for AI, Machine Learning & more