Real-world data is almost always messy. As a data scientist or analyst, you need to discover the facts hidden in the data, and to do so, the data must be tidy and free from errors. Hence, the very first step is to clean the data.
So, I have summarized five tips that you can practice to clean messy data in SQL. It is a quick read you can finish with your coffee while picking up some interesting tricks for cleaning dirty data.
You can easily navigate to your favorite part using this index.
· Find Missing Values in Your Data · Flag Certain Records · Clean Up Messy Values · Bucket Values into Segments or Bins · Correct Data Types
Okay, let’s get started.
Find Missing Values in Your Data
Missing values are quite common in many real-life datasets. They can bias the results of data analysis or machine learning models, reduce the accuracy of a model, or simply make the analysis difficult. Use these ready-to-go queries to find missing values in the data and to check whether any patterns are associated with them.
- Missing data in a particular column
For instance, find the orders where Product_Category is missing.

```sql
SELECT OrderID,
       Sales_Manager,
       Product_Category,
       Shipping_Address
FROM Dummy_Sales_Data_v1
WHERE Product_Category IS NULL;
```
Find missing values in Data | Image by Author
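Beyond listing the affected rows, it often helps to quantify how widespread the problem is. Here is a quick sketch against the same table, using conditional aggregation to count NULLs per column (the second counted column, Delivery_Time, is another field from this dataset):

```sql
SELECT COUNT(*) AS total_rows,
       SUM(CASE WHEN Product_Category IS NULL THEN 1 ELSE 0 END) AS missing_product_category,
       SUM(CASE WHEN Delivery_Time IS NULL THEN 1 ELSE 0 END) AS missing_delivery_time
FROM Dummy_Sales_Data_v1;
```

Comparing these counts to the total row count tells you immediately whether a column is missing a handful of values or a large share of them.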
Moreover, you can find out whether the dataset follows any trends when a particular column has missing values.
- Patterns associated with Missing Data
For instance, in our example, let’s try to find out whether the data follows any specific pattern when the values in the column Product_Category are missing.

```sql
SELECT Shipping_Address,
       OrderID,
       Quantity,
       Product_Category
FROM Dummy_Sales_Data_v1
WHERE Product_Category IS NULL
GROUP BY Shipping_Address, OrderID, Quantity, Product_Category;
```
Patterns associated with missing data in SQL | Image by Author
In this way, you can examine whether the data follows any specific pattern when Product_Category is missing. You can pass any number of columns in the SELECT and GROUP BY clauses to surface the patterns associated with those missing values.
Furthermore, instead of missing values, you can always discover patterns in data when a column has a specific value, as shown below.
```sql
SELECT Shipping_Address,
       OrderID,
       Quantity
FROM Dummy_Sales_Data_v1
WHERE Product_Category LIKE 'Healthcare'
GROUP BY Shipping_Address, OrderID, Quantity
ORDER BY Quantity DESC;
```
Patterns in Data | Image by Author
In this example, you found the Quantity and Shipping_Address patterns associated with Healthcare products and can clearly see the top shipping addresses by order quantity.
Flag Certain Records
Another best practice is to create a simple binary column to flag certain records. This flag can then be used in later stages of data analysis.
For example, suppose you want to mark a record as Dirty_Data whenever Delivery_Time is missing from the dataset.

```sql
SELECT OrderID,
       Sales_Manager,
       Shipping_Address,
       Delivery_Time,
       CASE WHEN Delivery_Time IS NULL THEN 1
            ELSE 0
       END AS Dirty_Data
FROM Dummy_Sales_Data_v1;
```
Flagging missing values in SQL | Image by Author
A new column, Dirty_Data, gets added to the output, with the value 1 wherever Delivery_Time is missing. When this output is exported to Excel or any editor, it will certainly make further tasks easier, such as finding the orders where Delivery_Time is missing.
In this way, you can include as many conditions as you need to define dirty data and flag it.
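For instance, here is a sketch that combines two conditions, flagging a record as dirty when either Delivery_Time or Shipping_Address is missing:

```sql
SELECT OrderID,
       Delivery_Time,
       Shipping_Address,
       CASE WHEN Delivery_Time IS NULL
              OR Shipping_Address IS NULL THEN 1
            ELSE 0
       END AS Dirty_Data
FROM Dummy_Sales_Data_v1;
```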
Next, let’s see how to clean up data that is more granular than required.
Clean Up Messy Values
One of the best ways to clean up granular values is to standardize them. ✅
Often, column values are available at a more granular level than required. In such scenarios, it makes more sense to standardize them at a higher level.
For example, the shipping address column contains names of countries, which can be standardized to region names such as Europe, North America, and so on, as shown in the query below.
```sql
SELECT OrderID,
       Shipping_Address,
       CASE WHEN Shipping_Address IN ('Germany', 'UK') THEN 'Europe'
            ELSE 'Other'
       END AS region
FROM Dummy_Sales_Data_v1;
```
Standardize values using CASE-WHEN | Image by Author
So, the shipping address information at the country level was too granular, and you standardized it at the regional level.
For simplicity, I used only one WHEN statement within CASE; however, you can create more values in the region column at any time by using multiple WHEN statements.
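For example, here is a sketch with multiple WHEN clauses; the extra country-to-region mappings are illustrative and may not match the actual values in the dataset:

```sql
SELECT OrderID,
       Shipping_Address,
       CASE WHEN Shipping_Address IN ('Germany', 'UK') THEN 'Europe'
            WHEN Shipping_Address IN ('USA', 'Canada') THEN 'North America'
            ELSE 'Other'
       END AS region
FROM Dummy_Sales_Data_v1;
```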
Going a step further and making the query a little more complex, you can analyze the data across multiple records.
Bucket Values into Segments or Bins
This is a classic use case of the CASE statement in SQL. The CASE statement is SQL’s way of handling if-then logic.
You can get deeper insights into the
CASE statement from my latest article — 5 Advanced SQL Concepts You Should Know in 2022.
Let’s see how this can be used to bucket values into bins.
Suppose you want to categorize the Shipping_Cost into Low, Medium, and High depending on its value.
```sql
SELECT OrderID,
       Shipping_Cost,
       CASE WHEN Shipping_Cost < 25 THEN 'Low'
            WHEN Shipping_Cost BETWEEN 25 AND 32 THEN 'Medium'
            WHEN Shipping_Cost > 32 THEN 'High'
       END AS cost_range
FROM Dummy_Sales_Data_v1;
```
Bucket values into Bins SQL | Image by Author
As a result, a new column, cost_range, gets created containing the values Low, Medium, or High depending on the column Shipping_Cost.
Well, these bins serve the purpose of categorizing the shipping cost. However, if you want to order the records by cost_range, either ascending (from Low to High) or descending (from High to Low), these bins do not work. See below:
```sql
SELECT OrderID,
       Shipping_Cost,
       CASE WHEN Shipping_Cost < 25 THEN 'Low'
            WHEN Shipping_Cost BETWEEN 25 AND 32 THEN 'Medium'
            WHEN Shipping_Cost > 32 THEN 'High'
       END AS cost_range
FROM Dummy_Sales_Data_v1
ORDER BY cost_range DESC;
```
Bins ordered in Descending in SQL | Image by Author
The records are ordered in descending order of cost_range, and Medium comes out on top, which is incorrect. ❌
The correct output would have the rows with cost_range High on top, followed by Medium and Low. To get this correct result, you need to name your bins smartly. ✅ One of the easiest ways to achieve this is to use number prefixes, as shown below.
```sql
SELECT OrderID,
       Shipping_Cost,
       CASE WHEN Shipping_Cost < 25 THEN '1. Low'
            WHEN Shipping_Cost BETWEEN 25 AND 32 THEN '2. Medium'
            WHEN Shipping_Cost > 32 THEN '3. High'
       END AS cost_range
FROM Dummy_Sales_Data_v1
ORDER BY cost_range DESC;
```
records ordered in descending in SQL | Image by Author
Now, you got it correct. Here, note the use of 3, 2, and 1 to reorder the values. 🎯
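Alternatively, if you would rather keep the plain Low/Medium/High labels, one common pattern (sketched here, not the only option) is to sort by the underlying numeric column instead of the bin name:

```sql
SELECT OrderID,
       Shipping_Cost,
       CASE WHEN Shipping_Cost < 25 THEN 'Low'
            WHEN Shipping_Cost BETWEEN 25 AND 32 THEN 'Medium'
            ELSE 'High'
       END AS cost_range
FROM Dummy_Sales_Data_v1
ORDER BY Shipping_Cost DESC;
```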
Going ahead, incorrect data types for columns can be one of the significant obstacles in data extraction and analysis. Let’s see how you can deal with it in SQL.
Correct Data Types
In the real world, sometimes the data type of a column is not assigned accurately, or the data type assigned initially no longer serves the purpose of your task. For example, a date-time column is often stored as a text column, or a column with float values is assigned the integer data type. Merely extracting the data may not cause any problem with incorrect data types; however, if you wish to join two tables and the data type of the common column differs between them, the query will return an error.
❓ Then what’s the solution?
The CAST function!
The CAST function explicitly converts a value of one data type into a different data type in a SQL database. It has the simplest of syntaxes:
```sql
CAST(column_name AS new_data_type)
```
column_name is the name of the column whose data type you want to change.
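A few illustrative conversions follow; the exact type names (INT, REAL, and so on) vary slightly between databases, so treat this as a sketch rather than portable syntax:

```sql
SELECT CAST('25' AS INT) AS text_to_int,    -- text to integer
       CAST(25 AS REAL) AS int_to_float,    -- integer to float
       CAST(25.75 AS INT) AS float_to_int;  -- fraction is dropped or rounded, depending on the database
```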
An interesting fact about CAST is that it is portable across database management systems such as Microsoft SQL Server, MySQL, and Oracle, and works the same on each of them. 💯
Let’s quickly go through an example.
Suppose you want to convert the Shipping_Cost, which is originally stored as an integer, to a float value.
```sql
SELECT Shipping_Cost,
       CAST(Shipping_Cost AS REAL) AS Shipping_Cost_float
FROM Dummy_Sales_Data_v1;
```
Integer to Float | Image by Author
Simple as that!
However, CAST also has some limitations: you cannot convert a column to just any other data type. For instance, a date column cannot be converted into a float; the two data types, the existing one and the new one, should be similar in nature.
Microsoft has provided an interesting chart that serves as a quick reference for which data types are compatible with each other; you can download it from here.
That’s all! 🏆
I hope you finished this article quickly and found it refreshing and useful.
I have been using SQL for the past three years, and I found these tricks through practice. You can always let me know what else you do to clean your data. These concepts are very useful while working on real projects.