R users have been enjoying the benefits of
SQL query generators for quite some time, most notably using the
dbplyr package. I would like to talk about some features of our own
rquery query generator, concentrating on derived result re-use.
[Related Article: Jupyter Notebook: Python or R—Or Both?]
SQL represents value use by nesting. To use a query result within another query one writes the query to be used inside the query using the values.
R query generators can also represent value use by nesting, but they also tend to expose a sequential pipe notation where value use is represented by sequencing.
We will demonstrate this with an example.
Notice how in the
rquery operation we write operations one after another (using the dot-arrow pipe notation), and the operations are similarly printed.
Whereas the generated
SQL is written in a nested format.
Notice the database handle is kept separate from the operators. Furthermore execution is not entangled with operator definition, but is a separate step (performed through
execute() or even sending the query to the database via a pipe).
Re-using derived values
The piped operator notation runs into trouble if we re-use derived values.
SQL can re-use tables and views, but can not conveniently re-use queries. So if we use query results more than once the notation expands our directed acyclic graph specification into a possibly much larger tree.
The following is an artificial example to demonstrate the issue.
Notice the depth 3 expression exploded into tree with 7 joins.
rquery‘s query diagrammer can help spot and diagnose these issues.
The gold nodes are possibly repeated calculations, and the warning also notes the issue.
The above example may seem unnatural- but there are a number of places where it is natural to re-use intermediate results. One such circumstance is comparing values in groups without the use of window functions (as demonstrated here).
One could hope the query optimizer will detect and eliminate the common sub-expressions, but that is not always going to be the case. In fact sometimes the very size of a query turns off the query optimizer in systems such as
Spark. It is better to organize your calculation to not emit so many common sub-expressions in the first place. We share here an example showing explicit value re-use on
Spark (preventing a crash in
dplyr, but actually slowing down
rquery– implying the
Spark optimizer is very good when it survives). And we also share a variation of the example here using
PostreSQL where landing intermediate results speeds up both
dplyr being faster in this particular example; which is not always the case and may be the case, and here may be due to
natural_join() coalesce semantics).
dplyr can easily overcome this limitation with its
compute() node, which lands or materializes intermediate results for re-use.
dplyr::compute() results are actually tables (not general queries).
rquery can also fix the issue by landing intermediate results, though the table lifetime tracking is intentionally more explicit (and the separation between code and results is also deliberately much more explicit).
More advanced solutions
With a more advanced “relop_list” notation we can both build the efficient query plan, but also the diagram certifying the lack of redundant stages.
relop_list collector is introducing and managing intermediate tables. It is simple to materialized inspect the results.
[Related Article: Validating Type I and II Errors in A/B Tests in R]
rquery is an advanced
SQL query generator for
R that includes a number of tools for managing complex queries. If your queries are taking substantial development time or substantial run time you should definitely consider trying the
rquery system. For convenience there is also
data.table based implementation of the
rquery grammar called
rqdatatable (which tends to be much faster that
Originally Posted Here