Given the vast set of commands in SQL, there are often multiple ways to obtain the required result set. However, it is important to know and use efficient, computationally inexpensive ways to achieve your goal.
Therefore, I have listed here five best practices for writing SQL queries that you should know as a data professional.
I kept this article short so that you can finish it quickly and master these time-saving, correct ways of writing SQL queries.
Prefer using specific column names in SELECT
The most common and easiest way to retrieve all the data from a specific table in a database is
SELECT *. However, it is also the most expensive way when you try to fetch data from a table with billions of rows occupying hundreds of gigabytes.
In real-world scenarios, all the columns from a dataset are rarely useful for a specific task. Therefore, it makes more sense to pass the required column names in the SELECT statement.
For example, when you need Order ID, Quantity, and Status columns from our dummy sales dataset,
SELECT * FROM Dummy_Sales_Data_v1 will generate output with all the rows and columns of the dataset.
However, you need only three columns, which means the other information extracted by
SELECT * is not useful for your task.
Instead, when you use the right way,
SELECT OrderID, Quantity, Status FROM Dummy_Sales_Data_v1
You will get only the information needed for your task, shown below:
SELECT column names in SQL | Image by Author
In my opinion, here are the five reasons why you should NOT use SELECT *:
🔸 Complications in SQL JOINs:
Using SELECT * in JOINs may introduce ambiguities between column names when two tables in the JOIN statement have the same column names.
🔸 Loss of transparency:
SELECT * gives your query's readers no idea which columns the data is fetched from.
🔸 Increased Network traffic:
SELECT * certainly returns more data than your task needs, and that is NOT free of cost: your database and network suffer because they have to handle all of it.
🔸 Increased Memory consumption:
Dealing with more data than required unnecessarily increases memory consumption. This is especially problematic when you work with a group of professionals on a single computation cluster.
🔸 Incorrect Data copying:
One of the common ways to copy data from one table to another is using SELECT inside an INSERT statement. When you use SELECT * instead of specific columns, you might copy incorrect data into the new table if the column order is not the same in both tables.
Ultimately, SELECT * is a bad practice and should be avoided. Rather, use specific column names to write an easy-to-read, easy-to-debug, time- and memory-efficient query.
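The column-order copy pitfall from the last point is easy to demonstrate. Below is a minimal sketch using Python's built-in sqlite3 module; the tables src and dst and their values are invented for illustration, not the article's dataset:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Source table, plus a destination table that declares the SAME columns
# in a DIFFERENT order (all names and values are made up).
con.execute("CREATE TABLE src (OrderID INTEGER, Quantity INTEGER, Status TEXT)")
con.execute("INSERT INTO src VALUES (1, 100, 'Shipped')")
con.execute("CREATE TABLE dst (Quantity INTEGER, OrderID INTEGER, Status TEXT)")

# SELECT * copies by POSITION, not by name: OrderID lands in Quantity.
con.execute("INSERT INTO dst SELECT * FROM src")
print(con.execute("SELECT Quantity, OrderID FROM dst").fetchone())  # (1, 100), swapped

# Naming the columns on both sides copies each value into the right place.
con.execute("DELETE FROM dst")
con.execute("INSERT INTO dst (OrderID, Quantity, Status) "
            "SELECT OrderID, Quantity, Status FROM src")
print(con.execute("SELECT Quantity, OrderID FROM dst").fetchone())  # (100, 1)
```

The first copy silently swaps OrderID and Quantity; only the explicit column list puts each value where it belongs.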
Next, let’s discuss why it is important to understand the difference between WHERE and HAVING clauses.
Use WHERE and HAVING wisely
Both clauses are used in SQL to filter data logically. However, there is an important difference in how they work.
The WHERE clause selects records according to the conditions mentioned in it, whereas the
HAVING clause selects records from groups, based on aggregations of one or more columns. That's why
HAVING comes after the
GROUP BY clause, which essentially groups the rows.
Unlike the WHERE clause, you can use aggregate functions such as SUM, MIN, and MAX with HAVING.
But, sometimes coders use them interchangeably like below,
❌ Bad way — Using HAVING to filter records using non-aggregated columns
SELECT OrderID, Quantity, Status
FROM Dummy_Sales_Data_v1
GROUP BY OrderID, Quantity, Status
HAVING Quantity = 100
AND Status LIKE 'Shipped'
AND Shipping_Cost > 30
✅ Best way — Using WHERE to filter records using non-aggregated columns
SELECT OrderID, Quantity, Status
FROM Dummy_Sales_Data_v1
WHERE Quantity = 100
AND Status LIKE 'Shipped'
AND Shipping_Cost > 30
Both of the above queries return the same output, shown below:
Same output by WHERE and HAVING | Image by Author
However, one should always consider the order of execution of SQL operations.
As mentioned by Sisense,
WHERE is executed before GROUP BY in the query. This means
WHERE selects rows before they are grouped and before any aggregate calculations are performed on their columns.
And HAVING is executed after the GROUP BY clause, which means it selects rows after the aggregate functions have been applied to the grouped rows.
That's why WHERE is a pre-filter and HAVING is a post-filter. So,
WHERE will usually be faster.
If your task can be done with the
WHERE clause, I would suggest not using
HAVING in its place.
HAVING is the best solution when you want to select rows based on column aggregations.
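To make the pre-filter versus post-filter behaviour concrete, here is a minimal runnable sketch using Python's built-in sqlite3 module; the table, managers, and quantities are all invented stand-ins for the article's dataset:

```python
import sqlite3

# Tiny made-up stand-in for the sales table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (Sales_Manager TEXT, Quantity INTEGER)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("Pablo", 50), ("Pablo", 60), ("Maria", 10), ("Maria", 20)])

# WHERE drops rows BEFORE grouping; HAVING filters groups AFTER the
# aggregate is computed, so SUM() belongs in HAVING, not in WHERE.
rows = con.execute("""
    SELECT Sales_Manager, SUM(Quantity) AS total
    FROM sales
    WHERE Quantity > 15              -- pre-filter: removes the 10-unit row
    GROUP BY Sales_Manager
    HAVING SUM(Quantity) > 100       -- post-filter: keeps totals over 100
""").fetchall()
print(rows)  # [('Pablo', 110)]
```

Only Pablo's group survives: Maria's 10-unit row is removed by WHERE before grouping, and her remaining total of 20 is then rejected by HAVING.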
Also, another common mistake is using WHERE to join two tables, which is not a good practice in SQL. Let's discuss it more below.
Prioritize using SQL JOINs over WHERE
The JOIN clause is used to combine rows from multiple tables based on a related column between them, whereas the
WHERE clause is used to select rows based on a condition mentioned in it.
However, data professionals sometimes use the
WHERE clause to select rows from two tables when certain columns and their values are present in both tables.
Let’s see this with a simple example.
First, let's create two CTEs, or Common Table Expressions, named orders and prices, so we can see how rows can be selected from two tables. (This step is not needed when you already have two tables.)
WITH orders AS (
    SELECT OrderID, Product_Category, Quantity
    FROM Dummy_Sales_Data_v1
    WHERE Quantity = 100
    AND Status LIKE 'Shipped'
    AND Product_Category LIKE 'Healthcare'
),
prices AS (
    SELECT OrderID, UnitPrice, Shipping_Cost, Shipping_Address
    FROM Dummy_Sales_Data_v1
    WHERE Quantity = 100
    AND Product_Category LIKE 'Healthcare'
)
Now, let’s select all the rows and columns — OrderID, Product_Category, Quantity, UnitPrice — from both the tables where Order ID is matching.
❌ Here is the bad practice you should avoid:
SELECT orders.OrderID, orders.Product_Category, orders.Quantity, prices.UnitPrice
FROM orders, prices
WHERE orders.OrderID = prices.OrderID
In this case, the
WHERE clause automatically served as an
INNER JOIN, which means you performed a join without explicitly mentioning the JOIN clause. Here,
WHERE becomes an implicit join.
This leads to poor readability and understanding.
In contrast, an explicit join, i.e. mentioning the type of join such as
INNER JOIN, is the right way. Written like this, your query is more readable and easier to understand. 💯
It also gives you the flexibility to choose which type of join operation you want, LEFT or RIGHT, which the
WHERE clause doesn't.
✅ Here is the good practice you should follow.
SELECT orders.OrderID, orders.Product_Category, orders.Quantity, prices.UnitPrice
FROM orders
INNER JOIN prices
ON orders.OrderID = prices.OrderID
Both queries retrieve the same result set, shown below:
SQL JOINs — Implicit and Explicit | Image by Author
In this way, you can combine both tables on a common column and leave all the conditional selection of rows to the
WHERE clause that follows the JOIN.
From my experience, the main advantages of using explicit JOINs over the
WHERE clause are:
🔸 SQL JOINs give you the ability to control the order in which tables are scanned and joined
🔸 A query written with an explicit join is easy to read, understand, and modify, as the
JOIN criteria remain separate from the WHERE clause
🔸 While writing an implicit join, you can easily forget the join condition in the
WHERE clause after mentioning the table names in the FROM clause; that is not the case with JOIN criteria, because when you combine two tables with JOIN you will certainly write the ON condition
As a result, explicit is always better than implicit as it is easier to understand and maintain. 💯
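As a hedged sketch of the point above, the snippet below builds two tiny invented tables with Python's sqlite3 module and shows that the implicit (WHERE) and explicit (INNER JOIN) forms return the same rows, while only the explicit form can express a LEFT join:

```python
import sqlite3

# Minimal invented stand-ins for the orders/prices CTEs from the article.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (OrderID INTEGER, Quantity INTEGER)")
con.execute("CREATE TABLE prices (OrderID INTEGER, UnitPrice REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 100), (2, 100)])
con.executemany("INSERT INTO prices VALUES (?, ?)", [(1, 9.99), (3, 4.50)])

# Implicit join: the WHERE clause does the matching.
implicit = con.execute("""
    SELECT orders.OrderID, orders.Quantity, prices.UnitPrice
    FROM orders, prices
    WHERE orders.OrderID = prices.OrderID
""").fetchall()

# Explicit join: same rows, but the join criteria are stated up front.
explicit = con.execute("""
    SELECT orders.OrderID, orders.Quantity, prices.UnitPrice
    FROM orders
    INNER JOIN prices ON orders.OrderID = prices.OrderID
""").fetchall()
print(implicit == explicit)  # True

# Only the explicit form can become a LEFT JOIN that keeps unmatched orders.
left = con.execute("""
    SELECT orders.OrderID, prices.UnitPrice
    FROM orders LEFT JOIN prices ON orders.OrderID = prices.OrderID
""").fetchall()
print(left)  # [(1, 9.99), (2, None)]
```

The LEFT JOIN keeps OrderID 2 with a NULL price, something the comma-plus-WHERE style cannot express.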
Next, let’s discuss the best alternatives for DISTINCT to select unique records from the table.
Select unique records without using DISTINCT
Whenever you need to extract unique records from the database, a simple answer is to use DISTINCT.
However, DISTINCT can be expensive, as it will not tell you whether the JOINs and filters you used are correct or incorrect, and those are often the actual cause of the duplicates.
As long as you want to select unique values from a single column,
DISTINCT works well. But when you want to select unique rows (a combination of multiple columns),
GROUP BY is the best alternative. 💯
For example, selecting unique rows of Sales Manager, Shipping Address, and Shipping Cost from our dataset:
-- Using DISTINCT
SELECT DISTINCT Sales_Manager, Shipping_Address, Shipping_Cost
FROM Dummy_Sales_Data_v1
WHERE Status LIKE 'Shipped'
AND Product_Category LIKE 'Healthcare'

-- Using GROUP BY
SELECT Sales_Manager, Shipping_Address, Shipping_Cost
FROM Dummy_Sales_Data_v1
WHERE Status LIKE 'Shipped'
AND Product_Category LIKE 'Healthcare'
GROUP BY Sales_Manager, Shipping_Address, Shipping_Cost
Both queries return exactly the same rows:
Select unique records using DISTINCT and GROUP BY | Image by Author
As this query does not involve any complex join statements and the dataset is small compared to a commercial one, both queries perform the same here.
However, almost 90% of the time, I find
GROUP BY more convenient, as I usually want to do some other calculation using aggregate functions, and that can be done with GROUP BY.
Apart from GROUP BY, there are still three good alternatives for getting unique records without using DISTINCT:
UNION, INTERSECT, and ROW_NUMBER()
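Here is a small runnable sketch of these alternatives, using Python's sqlite3 module and an invented mini-table (ROW_NUMBER() needs SQLite 3.25 or newer, which recent Python builds ship):

```python
import sqlite3

# Invented mini-table with one duplicate row to deduplicate.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (Sales_Manager TEXT, Shipping_Address TEXT)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("Pablo", "Berlin"), ("Pablo", "Berlin"), ("Maria", "Lisbon")])

distinct = con.execute(
    "SELECT DISTINCT Sales_Manager, Shipping_Address FROM sales ORDER BY 1, 2"
).fetchall()

# GROUP BY over every selected column deduplicates just like DISTINCT.
grouped = con.execute(
    "SELECT Sales_Manager, Shipping_Address FROM sales "
    "GROUP BY Sales_Manager, Shipping_Address ORDER BY 1, 2"
).fetchall()
print(distinct == grouped)  # True

# ROW_NUMBER(): keep only the first row of each duplicate group.
row_number = con.execute("""
    SELECT Sales_Manager, Shipping_Address FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY Sales_Manager, Shipping_Address) AS rn
        FROM sales
    ) WHERE rn = 1 ORDER BY 1, 2
""").fetchall()
print(row_number == distinct)  # True
```

All three queries return the same two unique rows; the ROW_NUMBER() form has the extra advantage of letting you choose which duplicate to keep via an ORDER BY inside the window.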
Last but not least, let's discuss how you can use SQL's built-in features, such as indexes, to improve your query performance.
Use LIKE for filtering the dataset Rows
The most common way to filter rows is by using comparison operators such as =, < and > in the WHERE clause. However, they generate boolean values (True or False), and only the rows for which the expression evaluates to True are selected.
This process can be slow when it does not use a SQL index, which is an on-disk structure associated with a table that speeds up the retrieval of rows from that table.
As per Microsoft,
An index contains keys built from one or more columns in the table or view. These keys are stored in a structure (B-tree) that enables SQL Server to find the row or rows associated with the key values quickly and efficiently.
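To make the index effect visible, here is a small sketch in Python's sqlite3 module (the table and index names are invented); EXPLAIN QUERY PLAN reveals whether SQLite scans the whole table or searches the index:

```python
import sqlite3

# Invented table to show how an index changes SQLite's access path.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (OrderID INTEGER, Status TEXT)")

# Without an index, the planner can only scan the whole table.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM sales WHERE Status = 'Shipped'"
).fetchall()
print(plan)  # the plan text reports a SCAN of the table

# With an index on Status, the same query becomes an index SEARCH.
con.execute("CREATE INDEX idx_status ON sales (Status)")
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM sales WHERE Status = 'Shipped'"
).fetchall()
print(plan)  # the plan text reports SEARCH ... USING INDEX idx_status
```

The exact plan wording varies between SQLite versions, but the switch from a full scan to an index search is the behaviour the Microsoft quote describes.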
A quick alternative to these expressions is using operators such as LIKE.
Certainly, wherever you need to filter on numerical or aggregated values, LIKE would not be a good choice.
But for text values, you can always use the LIKE operator. 💯
For example, let's retrieve all the rows of our dataset where the Product Category is 'Healthcare' and the Status is 'Shipped'.
❌ The bad practice is,
SELECT OrderID, Product_Category, Status
FROM Dummy_Sales_Data_v1
WHERE Product_Category = 'Healthcare'
AND Status = 'Shipped'
✅ The good practice is,
SELECT OrderID, Product_Category, Status
FROM Dummy_Sales_Data_v1
WHERE Product_Category LIKE 'Healthcare'
AND Status LIKE 'Shipped'
In this way, even if you don't know the exact text value to compare with, you can use wildcards with the LIKE operator.
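As a quick sketch with Python's sqlite3 module (the statuses are invented), the example below shows LIKE matching an exact text value and then a wildcard pattern:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (OrderID INTEGER, Status TEXT)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [(1, "Shipped"), (2, "Not Shipped"), (3, "Delivered")])

# Without wildcards, LIKE behaves like an equality check on these values.
exact = con.execute(
    "SELECT OrderID FROM sales WHERE Status LIKE 'Shipped'").fetchall()
print(exact)  # [(1,)]

# The % wildcard matches any prefix, so 'Not Shipped' now matches too.
fuzzy = con.execute(
    "SELECT OrderID FROM sales WHERE Status LIKE '%Shipped'").fetchall()
print(fuzzy)  # [(1,), (2,)]
```

One caveat worth knowing: in SQLite, LIKE is case-insensitive for ASCII text by default, whereas plain = is not, so the two are not always interchangeable.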
I recommend reading 5 Practical SQL Queries You Should Know In 2022 for deeper insights into how to use the LIKE operator.
I hope you finished this article quickly and found it useful for sharpening your SQL skills.
I’ve been using SQL for 3 years, and I found these best practices through experience. Also, these practices are indirectly tested during data science job interviews. Adopting them will certainly add more value to your SQL skills.
Article originally posted here by Suraj Gurav. Reposted with permission.