Complex Queries in SQL Complex Queries in SQL
Hopefully, your SQL queries aren’t tangled like this. (Photo: Author) Texts on SQL are good at providing the basic templates of... Complex Queries in SQL

Hopefully, your SQL queries aren’t tangled like this. (Photo: Author)

Texts on SQL are good at providing the basic templates of SQL syntax, but sometimes the queries in those books are a little idealized compared to real life, with no more than two or three tables to join and a small number of fields to deal with. As a result, the queries themselves are usually no more than a dozen lines long.

[Related Article: How to Get Started with SQL]

It often happens in the wild that you end up joining across many tables with long-winded ‘WHERE’ clauses that are easy to lose track of. These are hard enough to use the first time you employ them, but it can be impossible to debug and understand if you need to re-use them. What can be done to make long and complicated queries easier to write and easier to read?

Firstly, though, ask yourself whether you need a long, complex query at all. Is it vital that your single query does all the work? If you still think it is, ask yourself at least two more times. In the excellent SQL Anti-patterns, using a single query to solve a complex task or multiple tasks is identified as a key anti-pattern of queries, underlining its status as something to be wary of. One particular danger that this book identifies is that as the query becomes more difficult to follow, the risk of accidentally making a Cartesian join (a join where the number of rows returned is the product of the number of rows in the tables involved in the join. By contrast, the maximum rows in an inner or outer join will be the number of rows in the largest table), taking a protracted period of time to return many unwanted rows grows enormously.

Hence the crucial piece of advice is to check whether or not the large query could be performed with more than one smaller easier to understand queries.

If you think that the complex query you are writing is completely unavoidable, there are a few ways to make it a less painful process.

  1. Specify the column names within the query; don’t fall into the ‘SELECT *’ trap. A complex query that doesn’t specify which columns to return is likely to return columns that aren’t needed, making it harder to find the information you are looking for. Additionally, there is a decent chance of returning redundant columns.
  2. Format your query carefully to make it as readable as possible for another human being. Put each field to be selected on its own line, and put each new table joined onto on its own line, and each element of a ‘WHERE’ clause on its own line.
  3. Use meaningful aliases for tables to further aid readability. You are likely to need to refer to your tables multiple times within your query, and if you are working within someone else’s database with their naming convention, the table names can contain redundant information about they refer e.g. ‘MyDBTable1’—so you call a table like that ‘T1’ as you already know you’re working within ‘MyDB’
  4. Create a redundant ‘Where’ clause e.g. ‘WHERE 0=0’ to enable easy testing of different parts of the WHERE clause (h/t former colleague Mark Watson). That is, because the ‘WHERE 0=0’ is automatically sitting at the beginning of the ‘WHERE’ clause, you can turn the rest of it on and off (e.g. comment it out) without a syntax error to check the effect.
  5. Plan your query, potentially making a sketch on paper so that you know what you are trying to achieve in terms of where your data is located and how the tables in your query relate to each other before you start converting it to code.

The end result of taking these steps is that your code will look a bit like this:

[Related Article: All The Cool Things You Can Do With PostgreSQL for Data Analysis]

Overall, it is preferable to have more small queries rather than fewer large queries. Some database structures make this difficult, for example, if you need to join on to get a secondary key to find what you really need from table B. In these cases, planning and careful formatting can assist in avoiding your code turning to spaghetti.

Originally Posted Here

Robert De Graaf

Robert de Graaf

Robert de Graaf began his career as an engineer, but switched to data science after discovering the power of statistics to solve real world problems. He is Senior Data Scientist at RightShip and a founding partner of OutputAI Labs. He is the author of many articles on data science, focusing on techniques which ensure you tackle the right problem, as well as being the author of the book 'Managing Your Data Science Projects (Apress)'.