fbpx
Query Generation in R Query Generation in R
R users have been enjoying the benefits of SQL query generators for quite some time, most notably using the dbplyr package. I would like to talk about some features... Query Generation in R

R users have been enjoying the benefits of SQL query generators for quite some time, most notably using the dbplyr package. I would like to talk about some features of our own rquery query generator, concentrating on derived result re-use.

[Related Article: Jupyter Notebook: Python or R—Or Both?]

Introduction

SQL represents value use by nesting. To use a query result within another query one writes the query to be used inside the query using the values. R query generators can also represent value use by nesting, but they also tend to expose a sequential pipe notation where value use is represented by sequencing.

We will demonstrate this with an example.

Notice how in the rquery operation we write operations one after another (using the dot-arrow pipe notation), and the operations are similarly printed.

Whereas the generated SQL is written in a nested format.

Notice the database handle is kept separate from the operators. Furthermore execution is not entangled with operator definition, but is a separate step (performed through materialze()execute() or even sending the query to the database via a pipe).

Re-using derived values

The piped operator notation runs into trouble if we re-use derived values. SQL can re-use tables and views, but can not conveniently re-use queries. So if we use query results more than once the notation expands our directed acyclic graph specification into a possibly much larger tree.

The following is an artificial example to demonstrate the issue.

Notice the depth 3 expression exploded into tree with 7 joins.

rquery‘s query diagrammer can help spot and diagnose these issues.

The gold nodes are possibly repeated calculations, and the warning also notes the issue.

The above example may seem unnatural- but there are a number of places where it is natural to re-use intermediate results. One such circumstance is comparing values in groups without the use of window functions (as demonstrated here).

The query explosion issue is not unique to rquerydplyr has the same issue.

One could hope the query optimizer will detect and eliminate the common sub-expressions, but that is not always going to be the case. In fact sometimes the very size of a query turns off the query optimizer in systems such as Spark. It is better to organize your calculation to not emit so many common sub-expressions in the first place. We share here an example showing explicit value re-use on Spark (preventing a crash in dplyr, but actually slowing down rquery– implying the Spark optimizer is very good when it survives). And we also share a variation of the example here using PostreSQL where landing intermediate results speeds up both dplyr and rquery (dplyr being faster in this particular example; which is not always the case and may be the case, and here may be due to rquery‘s natural_join() coalesce semantics).

Basic solutions

dplyr can easily overcome this limitation with its compute() node, which lands or materializes intermediate results for re-use.

Notice the dplyr::compute() results are actually tables (not general queries).

rquery can also fix the issue by landing intermediate results, though the table lifetime tracking is intentionally more explicit (and the separation between code and results is also deliberately much more explicit).

More advanced solutions

With a more advanced “relop_list” notation we can both build the efficient query plan, but also the diagram certifying the lack of redundant stages.

The relop_list collector is introducing and managing intermediate tables. It is simple to materialized inspect the results.

Conclusion

[Related Article: Validating Type I and II Errors in A/B Tests in R]

rquery is an advanced SQL query generator for R that includes a number of tools for managing complex queries. If your queries are taking substantial development time or substantial run time you should definitely consider trying the rquery system. For convenience there is also data.table based implementation of the rquery grammar called rqdatatable (which tends to be much faster that dplyr).

Originally Posted Here

John Mount

John Mount

My specialty is analysis and design of algorithms, with an emphasis on efficient implementation. I work to find applications of state of the art methods in optimization, statistics and machine learning in various application areas. Currently co-authoring "Practical Data Science with R"

1