5 Most Useful SQL Best Practices You Should Follow

Given the vast set of commands in SQL, there can be multiple ways to get the required result data set. However, it is important to know and implement efficient, computationally inexpensive ways to achieve your goal.

Therefore, I have listed here 5 best practices for writing SQL queries that you should know as a data professional.

I kept this article pretty short so that you can finish it quickly and master the time-saving, right ways of writing SQL queries.

Prefer using specific column names in SELECT

The most common and easiest way to retrieve all the data from a specific table in a database is SELECT *. However, it is also the most expensive way when you fetch data from a dataset with billions of rows occupying hundreds of gigabytes.

In real-world scenarios, all the columns from a dataset are rarely useful for a specific task. Therefore, it makes more sense to pass the required column names in the SELECT statement.

For example, when you need Order ID, Quantity, and Status columns from our dummy sales dataset, SELECT * FROM Dummy_Sales_Data_v1 will generate output with all the rows and columns of the dataset.

However, you need only three columns, so the rest of the information extracted by SELECT * is not useful for your task.

Instead, when you use the right way,

SELECT OrderID,
       Quantity,
       Status
FROM Dummy_Sales_Data_v1

You will get only the information needed for your task, as shown below.

SELECT column names in SQL | Image by Author

In my opinion, below are the 5 reasons why you should NOT use SELECT *

🔸 Complications in SQL JOINs:
Using SELECT * in JOINs may introduce ambiguities between column names when two tables in the JOIN statement have the same column names.

🔸 Loss of transparency:
SELECT * gives the readers of your query no idea which columns you are fetching data from.

🔸 Increased Network traffic:
SELECT * certainly returns more data than your task needs, and that is NOT free of cost. Your database and network suffer because they have to handle all of that data.

🔸 Increased Memory consumption:
Dealing with more data than required unnecessarily increases memory consumption. This can be especially problematic when you share a single computation cluster with a group of professionals.

🔸 Incorrect Data copying:
One of the common ways to copy the data from one table to another table is using SELECT in the INSERT statement. When you use SELECT * instead of specific columns, you might copy incorrect data into a new table if the column order is not the same in both tables.

Ultimately, SELECT * is a bad practice and should be avoided. Instead, use specific column names to write an easy-to-read, easy-to-debug, time- and memory-efficient query.
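The data-copying problem is the easiest one to overlook, so here is a minimal sketch of it using Python's built-in sqlite3 module. The tables src, dst_star, and dst_named and their values are made up for illustration; they are not part of the dummy sales dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src (OrderID INTEGER, Quantity INTEGER, Shipping_Cost INTEGER);
    -- Hypothetical targets: same columns, but Quantity and Shipping_Cost swapped.
    CREATE TABLE dst_star  (OrderID INTEGER, Shipping_Cost INTEGER, Quantity INTEGER);
    CREATE TABLE dst_named (OrderID INTEGER, Shipping_Cost INTEGER, Quantity INTEGER);
    INSERT INTO src VALUES (1, 100, 35);
""")

# SELECT * copies values by position, so Quantity lands in Shipping_Cost.
conn.execute("INSERT INTO dst_star SELECT * FROM src")

# Naming the columns on both sides copies each value into the intended column.
conn.execute("""
    INSERT INTO dst_named (OrderID, Quantity, Shipping_Cost)
    SELECT OrderID, Quantity, Shipping_Cost FROM src
""")

query = "SELECT OrderID, Quantity, Shipping_Cost FROM {}"
star_row = conn.execute(query.format("dst_star")).fetchone()
named_row = conn.execute(query.format("dst_named")).fetchone()

print(star_row)   # (1, 35, 100)  -> Quantity and Shipping_Cost are swapped
print(named_row)  # (1, 100, 35)  -> matches the source row
```

No error is raised in the SELECT * case because both columns are integers, which is exactly why this kind of mistake can go unnoticed.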

Next, let’s discuss why it is important to understand the difference between WHERE and HAVING clauses.

Use WHERE and HAVING wisely

Both clauses are used in SQL to filter the data logically. However, they differ slightly in how they work.

The WHERE clause is used to select the records according to the conditions mentioned in it, whereas the HAVING clause is used to select records from groups based on aggregations of a column or multiple columns. That’s why HAVING comes after the GROUP BY clause, which essentially groups the rows.

Additionally, unlike the WHERE clause, the HAVING clause can use aggregate functions such as SUM, MIN, and MAX.

But, sometimes coders use them interchangeably like below,

❌ Bad way — Using HAVING to filter records using non-aggregated columns

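SELECT OrderID,
       Quantity,
       Status
FROM Dummy_Sales_Data_v1
-- Shipping_Cost is grouped as well so that HAVING can reference it in strict SQL dialects
GROUP BY OrderID, Quantity, Status, Shipping_Cost
HAVING Quantity = 100
AND Status LIKE 'Shipped'
AND Shipping_Cost > 30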

✅ Best way — Using WHERE to filter records using non-aggregated columns

SELECT OrderID,
       Quantity,
       Status
FROM Dummy_Sales_Data_v1
WHERE Quantity = 100
AND Status LIKE 'Shipped'
AND Shipping_Cost > 30

Both of the above queries return the same output, shown below.

Same output by WHERE and HAVING | Image by Author

However, you should always keep the order of execution of SQL operations in mind.

As mentioned by Sisense,

WHERE is executed before GROUP BY in the query. This means, WHERE selects the rows before grouping them or aggregating calculations of columns.

And, HAVING is executed after the GROUP BY clause, which means it selects rows after aggregate functions are performed on grouped rows.

That’s why WHERE is a pre-filter and HAVING is a post-filter. WHERE reduces the number of rows before any grouping happens, so it is generally the faster choice.

If your task can be done with the WHERE clause, I suggest you don’t use HAVING in its place.

HAVING is the right solution when you want to select rows based on column aggregations.
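To see the two clauses working together, here is a minimal sketch using Python's built-in sqlite3 module. The rows and manager names are made up for illustration; only the table and column names follow the dummy sales dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Dummy_Sales_Data_v1 "
    "(OrderID INTEGER, Sales_Manager TEXT, Status TEXT, Quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO Dummy_Sales_Data_v1 VALUES (?, ?, ?, ?)",
    [
        (1, "Maria", "Shipped", 100),
        (2, "Maria", "Shipped", 60),
        (3, "Maria", "Delivered", 500),
        (4, "John", "Shipped", 40),
        (5, "John", "Not Shipped", 300),
    ],
)

rows = conn.execute("""
    SELECT Sales_Manager,
           SUM(Quantity) AS Total_Quantity
    FROM Dummy_Sales_Data_v1
    WHERE Status = 'Shipped'       -- row-level filter, applied before grouping
    GROUP BY Sales_Manager
    HAVING SUM(Quantity) > 100     -- group-level filter, applied after aggregation
    ORDER BY Sales_Manager
""").fetchall()

print(rows)  # [('Maria', 160)]
```

The non-shipped orders never reach the grouping step, so Maria's total is 160 rather than 660, and John's group (40) is dropped by HAVING.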

Another common mistake is using WHERE to join two tables, which is not good practice in SQL. Let’s discuss it below.

Prioritize using SQL JOINs over WHERE

In SQL, the JOIN clause is used to combine rows from multiple tables based on a related column between them, whereas the WHERE clause is used to select rows based on a condition mentioned in it.

However, sometimes data professionals use WHERE clauses to select rows from both tables when certain columns and their values are present in both tables.

Let’s see this with a simple example.

First, let’s create two CTEs (Common Table Expressions), orders and prices. We can then see how rows can be selected from two tables. (This step is not needed when you already have two tables.)

WITH orders AS
(
SELECT OrderID,
       Product_Category,
       Quantity
FROM Dummy_Sales_Data_v1
WHERE Quantity = 100
AND Status LIKE 'Shipped'
AND Product_Category LIKE 'Healthcare'
),
prices AS
(
SELECT OrderID,
       UnitPrice,
       Shipping_Cost,
       Shipping_Address
FROM Dummy_Sales_Data_v1
WHERE Quantity = 100
AND Product_Category LIKE 'Healthcare'
)

Now, let’s select the columns OrderID, Product_Category, Quantity, and UnitPrice from both tables for the rows where OrderID matches.

❌ Bad practice you should avoid

SELECT orders.OrderID,
       orders.Product_Category,
       orders.Quantity,
       prices.UnitPrice
FROM orders, prices
WHERE orders.OrderID = prices.OrderID

In this case, the WHERE clause automatically serves as an INNER JOIN, which means you are performing a join without explicitly mentioning the JOIN clause. Here, WHERE becomes an implicit join.

This leads to poor readability and understanding.

An explicit join, where you state the type of join (LEFT JOIN, RIGHT JOIN, or INNER JOIN), is the right way. Your query becomes more readable and easier to understand. 💯

It also gives you the flexibility to choose the type of join you want, such as LEFT or RIGHT, which the WHERE clause doesn’t.

✅ Here is the good practice you should follow.

SELECT orders.OrderID,
       orders.Product_Category,
       orders.Quantity,
       prices.UnitPrice
FROM orders
INNER JOIN prices
ON orders.OrderID = prices.OrderID

Both queries retrieve the same result set, shown below.

SQL JOINs — Implicit and Explicit | Image by Author

In this way, you can combine both tables on a common column and leave all the conditional selection of rows to the WHERE clause after the JOIN clause.

From my experience, the main advantages of using JOIN over WHERE clause are —

🔸 SQL JOINs let you state in the query itself the order in which tables are joined (the query optimizer may still pick a different execution order).

🔸 A query written with an explicit join is easy to read, understand, and modify, as the JOIN criteria remain separate from the WHERE clause.

🔸 While writing a query, you can easily forget the WHERE clause after listing table names in the FROM clause, which silently pairs every row of one table with every row of the other. That is much harder to do with JOIN criteria: when you want to combine two tables, you will certainly use JOIN after FROM.

In short, explicit is better than implicit, as it is easier to understand and maintain. 💯
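Here is a minimal sketch of these points using Python's built-in sqlite3 module. The two small tables stand in for the orders and prices CTEs, and their values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (OrderID INTEGER, Quantity INTEGER);
    CREATE TABLE prices (OrderID INTEGER, UnitPrice INTEGER);
    INSERT INTO orders VALUES (1, 100), (2, 100), (3, 100);
    INSERT INTO prices VALUES (1, 20), (2, 35);  -- hypothetical: no price for order 3
""")

implicit = conn.execute("""
    SELECT orders.OrderID, prices.UnitPrice
    FROM orders, prices
    WHERE orders.OrderID = prices.OrderID
    ORDER BY orders.OrderID
""").fetchall()

explicit = conn.execute("""
    SELECT orders.OrderID, prices.UnitPrice
    FROM orders
    INNER JOIN prices ON orders.OrderID = prices.OrderID
    ORDER BY orders.OrderID
""").fetchall()

# Forgetting the WHERE clause pairs every order with every price (3 x 2 rows).
cross_count = conn.execute("SELECT COUNT(*) FROM orders, prices").fetchone()[0]

# Changing the join type is a one-word edit: LEFT JOIN keeps the unmatched order 3.
left = conn.execute("""
    SELECT orders.OrderID, prices.UnitPrice
    FROM orders
    LEFT JOIN prices ON orders.OrderID = prices.OrderID
    ORDER BY orders.OrderID
""").fetchall()

print(implicit == explicit)  # True
print(cross_count)           # 6
print(left)                  # [(1, 20), (2, 35), (3, None)]
```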

Next, let’s discuss the best alternatives for DISTINCT to select unique records from the table.

Select unique records without using DISTINCT

Whenever you need to extract unique records from the database, a simple answer is to use DISTINCT!

However, DISTINCT can be expensive, and it will not tell you whether the JOINs and filters you used are correct, even though those are often the actual cause of duplicates.

As long as you want to select unique values from a single column, DISTINCT works well. But, when you want to select unique rows (a combination of multiple columns), GROUP BY is the best alternative. 💯

For example, let’s select unique rows of Sales Manager, Shipping Address, and Shipping Cost from our dataset.

-- Using DISTINCT
SELECT DISTINCT Sales_Manager,
                Shipping_Address,
                Shipping_Cost
FROM Dummy_Sales_Data_v1
WHERE Status LIKE 'Shipped'
AND Product_Category LIKE 'Healthcare'

-- Using GROUP BY
SELECT Sales_Manager,
       Shipping_Address,
       Shipping_Cost
FROM Dummy_Sales_Data_v1
WHERE Status LIKE 'Shipped'
AND Product_Category LIKE 'Healthcare'
GROUP BY Sales_Manager,
         Shipping_Address,
         Shipping_Cost

Both queries return exactly the same rows.

Select unique records using DISTINCT and GROUP BY | Image by Author

As this query contains no complex joins and the dataset is small compared to a commercial dataset, both queries have the same performance.

However, almost 90% of the time, I find GROUP BY more convenient, as I usually want to do some other calculation with aggregate functions, which GROUP BY supports directly.
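To make the point about aggregates concrete, here is a minimal sketch using Python's built-in sqlite3 module. The rows and manager names are made up for illustration; only the table and column names follow the dummy sales dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Dummy_Sales_Data_v1 "
    "(OrderID INTEGER, Sales_Manager TEXT, Shipping_Address TEXT, Shipping_Cost INTEGER)"
)
conn.executemany(
    "INSERT INTO Dummy_Sales_Data_v1 VALUES (?, ?, ?, ?)",
    [
        (1, "Maria", "Germany", 35),
        (2, "Maria", "Germany", 35),
        (3, "John", "Italy", 40),
    ],
)

distinct_rows = conn.execute("""
    SELECT DISTINCT Sales_Manager, Shipping_Address, Shipping_Cost
    FROM Dummy_Sales_Data_v1
    ORDER BY Sales_Manager
""").fetchall()

# GROUP BY returns the same unique combinations and can also aggregate per group.
grouped_rows = conn.execute("""
    SELECT Sales_Manager, Shipping_Address, Shipping_Cost,
           COUNT(*) AS Order_Count
    FROM Dummy_Sales_Data_v1
    GROUP BY Sales_Manager, Shipping_Address, Shipping_Cost
    ORDER BY Sales_Manager
""").fetchall()

print(distinct_rows)  # [('John', 'Italy', 40), ('Maria', 'Germany', 35)]
print(grouped_rows)   # [('John', 'Italy', 40, 1), ('Maria', 'Germany', 35, 2)]
```

The extra Order_Count column also shows how many source rows collapsed into each unique row, which helps when you are tracking down where duplicates come from.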

Apart from GROUP BY, there are three more alternatives to get unique records without using DISTINCT:

UNION, INTERSECT, and ROW_NUMBER()
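As a minimal sketch of two of these alternatives, the example below uses Python's built-in sqlite3 module with made-up rows. ROW_NUMBER() is a window function, so this assumes a SQLite build of version 3.25 or later; the aliases ranked and row_num are names I chose for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Dummy_Sales_Data_v1 "
    "(OrderID INTEGER, Sales_Manager TEXT, Shipping_Address TEXT, Shipping_Cost INTEGER)"
)
conn.executemany(
    "INSERT INTO Dummy_Sales_Data_v1 VALUES (?, ?, ?, ?)",
    [
        (1, "Maria", "Germany", 35),
        (2, "Maria", "Germany", 35),
        (3, "John", "Italy", 40),
    ],
)

# ROW_NUMBER(): number the rows inside each duplicate group and keep the first one.
# Unlike DISTINCT, this lets you keep OrderID and decide which duplicate survives.
deduped = conn.execute("""
    SELECT OrderID, Sales_Manager, Shipping_Address, Shipping_Cost
    FROM (
        SELECT OrderID, Sales_Manager, Shipping_Address, Shipping_Cost,
               ROW_NUMBER() OVER (
                   PARTITION BY Sales_Manager, Shipping_Address, Shipping_Cost
                   ORDER BY OrderID
               ) AS row_num
        FROM Dummy_Sales_Data_v1
    ) AS ranked
    WHERE row_num = 1
    ORDER BY OrderID
""").fetchall()

# UNION (without ALL) removes duplicate rows from the combined result.
union_rows = conn.execute("""
    SELECT Sales_Manager, Shipping_Address FROM Dummy_Sales_Data_v1
    UNION
    SELECT Sales_Manager, Shipping_Address FROM Dummy_Sales_Data_v1
    ORDER BY Sales_Manager
""").fetchall()

print(deduped)     # [(1, 'Maria', 'Germany', 35), (3, 'John', 'Italy', 40)]
print(union_rows)  # [('John', 'Italy'), ('Maria', 'Germany')]
```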

Last but not least, let’s discuss how you can use SQL’s built-in features such as an index to improve your query performance.

Use LIKE for filtering the dataset Rows

The most common way to filter rows is with comparison operators such as = and >. Each comparison evaluates to a boolean value, TRUE or FALSE, and only the rows where it is TRUE are selected.

How fast this is depends largely on whether the database can use an index, which is an on-disk structure associated with a table that speeds up retrieval of rows from the table. Without a suitable index, every row has to be scanned.

As per Microsoft,

An index contains keys built from one or more columns in the table or view. These keys are stored in a structure (B-tree) that enables SQL Server to find the row or rows associated with the key values quickly and efficiently.

For text columns, an alternative to these expressions is an operator such as LIKE. In many databases, LIKE can still use an index as long as the pattern does not begin with a wildcard.

Certainly, LIKE is not a good choice wherever you need to filter on numerical or aggregate values.

But, for the text values, you can always use the LIKE operator. 💯

For example, let’s retrieve all the rows of our dataset where the Product category is ‘Healthcare’ and the Status is ‘Shipped’.

❌ Less flexible: an exact match with =, which needs the complete value,

SELECT OrderID, 
        Product_Category,
        Status
FROM Dummy_Sales_Data_v1
WHERE Product_Category = 'Healthcare'
AND Status = 'Shipped'

✅ More flexible: the same filter written with LIKE,

SELECT OrderID, 
        Product_Category,
        Status
FROM Dummy_Sales_Data_v1
WHERE Product_Category LIKE 'Healthcare'
AND Status LIKE 'Shipped'

This way, even if you don’t know the exact text value to compare with, you can use wildcards with the LIKE operator.
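Here is a minimal sketch of the two standard wildcards, % (any sequence of characters) and _ (exactly one character), using Python's built-in sqlite3 module. The rows are made up for illustration, and matching_ids is a helper name I chose for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Dummy_Sales_Data_v1 "
    "(OrderID INTEGER, Product_Category TEXT, Status TEXT)"
)
conn.executemany(
    "INSERT INTO Dummy_Sales_Data_v1 VALUES (?, ?, ?)",
    [
        (1, "Healthcare", "Shipped"),
        (2, "Home", "Shipped"),
        (3, "Healthcare", "Not Shipped"),
        (4, "Office", "Delivered"),
        (5, "Health Supplements", "Shipped"),
    ],
)


def matching_ids(category_pattern, status_pattern):
    """Return the OrderIDs whose category and status match the LIKE patterns."""
    rows = conn.execute(
        "SELECT OrderID FROM Dummy_Sales_Data_v1 "
        "WHERE Product_Category LIKE ? AND Status LIKE ? "
        "ORDER BY OrderID",
        (category_pattern, status_pattern),
    ).fetchall()
    return [row[0] for row in rows]


exact = matching_ids("Healthcare", "Shipped")  # no wildcard: whole value must match
prefix = matching_ids("Health%", "Shipped")    # anything starting with 'Health'
suffix = matching_ids("H%", "%Shipped")        # status ending in 'Shipped'
single = matching_ids("H_me", "Shipped")       # '_' stands for exactly one character

print(exact)   # [1]
print(prefix)  # [1, 5]
print(suffix)  # [1, 2, 3, 5]
print(single)  # [2]
```

One thing to keep in mind is that case sensitivity of LIKE differs between databases. In SQLite, which is used here, LIKE ignores case for ASCII letters by default.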

I recommend reading 5 Practical SQL Queries You Should Know In 2022 for deeper insights into how to use LIKE operator.

That’s all!

I hope you finished this article quickly and found it useful for sharpening your SQL skills.

I’ve been using SQL for 3 years, and I found these best practices through experience. Also, these practices are indirectly tested during data science job interviews. Adopting them will certainly add more value to your SQL skills.

Article originally posted here by Suraj Gurav. Reposted with permission.
