Real-world data is almost always messy. As a data scientist or analyst, you need to discover the facts hidden in the data, and to do so, the data must be tidy and free from errors. Hence, the very first step is to clean the data.
So, I have summarized five tips that you can practice to clean messy data in SQL. It is a quick read you can finish with your coffee while picking up some interesting tricks for cleaning dirty data.
You can easily navigate to your favorite part using this index.
· Find Missing Values in Your Data · Flag Certain Records · Clean Up Messy Values · Bucket Values into Segments or Bins · Correct Data Types
Okay, let’s get started.
Find Missing Values in Your Data
Missing values are quite common in many real-life datasets. They can bias the results of data analysis or machine learning models, reduce the accuracy of a model, or simply make the analysis difficult. Use these ready-to-go queries to find missing values in the data and to check whether any patterns are associated with them.
- Missing data in a particular column
For instance, find the orders where Product_Category is missing.

```sql
SELECT OrderID,
       Sales_Manager,
       Product_Category,
       Shipping_Address
FROM Dummy_Sales_Data_v1
WHERE Product_Category IS NULL;
```
Find missing values in Data | Image by Author
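Beyond listing the affected rows, it often helps to quantify how widespread the problem is. Here is a quick sketch against the same table, using conditional aggregation to count NULLs per column (the second counted column, Delivery_Time, is another field from this dataset):

```sql
SELECT COUNT(*) AS total_rows,
       SUM(CASE WHEN Product_Category IS NULL THEN 1 ELSE 0 END) AS missing_product_category,
       SUM(CASE WHEN Delivery_Time IS NULL THEN 1 ELSE 0 END) AS missing_delivery_time
FROM Dummy_Sales_Data_v1;
```

Comparing these counts to the total row count tells you immediately whether a column is missing a handful of values or a large share of them.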
Moreover, you can find out whether the dataset follows any trends when a particular column has missing values.
- Patterns associated with Missing Data
For instance, in our example, let’s try to find out whether the data follows any specific pattern when the values in the column Product_Category are missing.

```sql
SELECT Shipping_Address,
       OrderID,
       Quantity,
       Product_Category
FROM Dummy_Sales_Data_v1
WHERE Product_Category IS NULL
GROUP BY Shipping_Address, OrderID, Quantity, Product_Category;
```
Patterns associated with missing data in SQL | Image by Author
In this way, you can examine whether the data follows any specific pattern when Product_Category is missing. You can pass any number of columns in the SELECT and GROUP BY clauses to surface the patterns associated with those missing values.
Furthermore, instead of missing values, you can always discover patterns in data when a column has a specific value, as shown below.
```sql
SELECT Shipping_Address,
       OrderID,
       Quantity
FROM Dummy_Sales_Data_v1
WHERE Product_Category LIKE 'Healthcare'
GROUP BY Shipping_Address, OrderID, Quantity
ORDER BY Quantity DESC;
```
Patterns in Data | Image by Author
In this example, you found the Quantity and Shipping_Address patterns associated with Healthcare products and can clearly see the top shipping addresses by order quantity.
Flag Certain Records
Another best practice is to create a simple binary column to flag certain records. This flag can then be used in later stages of data analysis.
For example, suppose you want to mark a record as Dirty_Data whenever Delivery_Time is missing from the dataset.

```sql
SELECT OrderID,
       Sales_Manager,
       Shipping_Address,
       Delivery_Time,
       CASE WHEN Delivery_Time IS NULL THEN 1
            ELSE 0
       END AS Dirty_Data
FROM Dummy_Sales_Data_v1;
```
Flagging missing values in SQL | Image by Author
A new column, Dirty_Data, gets added to the output, with the value 1 wherever Delivery_Time is missing. When this output is exported to Excel or any editor, it will certainly make further tasks easier, such as finding the orders where Delivery_Time is missing.
In this way, you can include as many conditions as you need to define dirty data and flag it.
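For instance, here is a sketch that combines two conditions, flagging a record as dirty when either Delivery_Time or Shipping_Address is missing:

```sql
SELECT OrderID,
       Delivery_Time,
       Shipping_Address,
       CASE WHEN Delivery_Time IS NULL
              OR Shipping_Address IS NULL THEN 1
            ELSE 0
       END AS Dirty_Data
FROM Dummy_Sales_Data_v1;
```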
Next, let’s see how to clean up data that is more granular than required.
Clean Up Messy Values
One of the best ways to clean up granular values is to standardize them. ✅
Often, column values are available at a more granular level than required. In such scenarios, it makes more sense to standardize them at a higher level.
For example, the shipping address column contains names of countries, which can be standardized to region names such as Europe, North America, and so on, as shown in the query below.
```sql
SELECT OrderID,
       Shipping_Address,
       CASE WHEN Shipping_Address IN ('Germany', 'UK') THEN 'Europe'
            ELSE 'Other'
       END AS region
FROM Dummy_Sales_Data_v1;
```
Standardize values using CASE-WHEN | Image by Author
So, the shipping address information at the country level was too granular, and you standardized it at the regional level.
For simplicity, I used only one WHEN statement within CASE; however, you can create more values in the region column at any time by using multiple WHEN statements.
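For example, here is a sketch with multiple WHEN clauses; the extra country-to-region mappings are illustrative and may not match the actual values in the dataset:

```sql
SELECT OrderID,
       Shipping_Address,
       CASE WHEN Shipping_Address IN ('Germany', 'UK') THEN 'Europe'
            WHEN Shipping_Address IN ('USA', 'Canada') THEN 'North America'
            ELSE 'Other'
       END AS region
FROM Dummy_Sales_Data_v1;
```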
Going a step further and making the query a little more complex, you can analyze the data across multiple records.
Bucket Values into Segments or Bins
This is a classic use case of the CASE statement in SQL. The CASE statement is SQL’s way of handling if-then logic.
You can get deeper insights into the
CASE statement from my latest article — 5 Advanced SQL Concepts You Should Know in 2022.
Let’s see how this can be used to bucket values into bins.
Suppose you want to categorize the Shipping_Cost into Low, Medium, and High depending on its value.
```sql
SELECT OrderID,
       Shipping_Cost,
       CASE WHEN Shipping_Cost < 25 THEN 'Low'
            WHEN Shipping_Cost BETWEEN 25 AND 32 THEN 'Medium'
            WHEN Shipping_Cost > 32 THEN 'High'
       END AS cost_range
FROM Dummy_Sales_Data_v1;
```
Bucket values into Bins SQL | Image by Author
As a result, a new column, cost_range, gets created containing the values Low, Medium, or High depending on the column Shipping_Cost.
Well, these bins serve the purpose of categorizing the shipping cost. However, if you want to order the records by cost_range, either ascending (from Low to High) or descending (from High to Low), these bins do not work. See below:
```sql
SELECT OrderID,
       Shipping_Cost,
       CASE WHEN Shipping_Cost < 25 THEN 'Low'
            WHEN Shipping_Cost BETWEEN 25 AND 32 THEN 'Medium'
            WHEN Shipping_Cost > 32 THEN 'High'
       END AS cost_range
FROM Dummy_Sales_Data_v1
ORDER BY cost_range DESC;
```
Bins ordered in Descending in SQL | Image by Author
The records are ordered in descending order of cost_range, and Medium comes out on top, which is incorrect. ❌
The correct output would have the rows with cost_range High on top, followed by Medium and Low. To get this correct result, you need to name your bins smartly. ✅ One of the easiest ways to achieve this is to use number prefixes, as shown below.
```sql
SELECT OrderID,
       Shipping_Cost,
       CASE WHEN Shipping_Cost < 25 THEN '1. Low'
            WHEN Shipping_Cost BETWEEN 25 AND 32 THEN '2. Medium'
            WHEN Shipping_Cost > 32 THEN '3. High'
       END AS cost_range
FROM Dummy_Sales_Data_v1
ORDER BY cost_range DESC;
```
records ordered in descending in SQL | Image by Author
Now, you got it correct. Here, note the use of 3, 2, and 1 to reorder the values. 🎯
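Alternatively, if you would rather keep the plain Low/Medium/High labels, one common pattern (sketched here, not the only option) is to sort by the underlying numeric column instead of the bin name:

```sql
SELECT OrderID,
       Shipping_Cost,
       CASE WHEN Shipping_Cost < 25 THEN 'Low'
            WHEN Shipping_Cost BETWEEN 25 AND 32 THEN 'Medium'
            ELSE 'High'
       END AS cost_range
FROM Dummy_Sales_Data_v1
ORDER BY Shipping_Cost DESC;
```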
Going ahead, incorrect data types for columns can be one of the significant obstacles in data extraction and analysis. Let’s see how you can deal with it in SQL.
Correct Data Types
In the real world, sometimes the data type of a column is not assigned accurately, or the data type assigned initially no longer serves the purpose of your task. For example, a date-time column is often stored as a text column, or a column with float values is assigned the integer data type. Merely extracting the data may not cause any problem with incorrect data types; however, if you wish to join two tables and the data type of the common column differs between them, the query will return an error.
❓ Then what’s the solution?
The CAST function!
The CAST function explicitly converts a value of one data type into a different data type in a SQL database. It has the simplest of syntaxes:
```sql
CAST(column_name AS new_data_type)
```
column_name is the name of the column whose data type you want to change.
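A few illustrative conversions follow; the exact type names (INT, REAL, and so on) vary slightly between databases, so treat this as a sketch rather than portable syntax:

```sql
SELECT CAST('25' AS INT) AS text_to_int,    -- text to integer
       CAST(25 AS REAL) AS int_to_float,    -- integer to float
       CAST(25.75 AS INT) AS float_to_int;  -- fraction is dropped or rounded, depending on the database
```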
An interesting fact about CAST is that it is portable across database management systems such as Microsoft SQL Server, MySQL, and Oracle, and works the same on each of them. 💯
Let’s quickly go through an example.
Suppose you want to convert the Shipping_Cost, which is originally stored as an integer, to a float value.
```sql
SELECT Shipping_Cost,
       CAST(Shipping_Cost AS REAL) AS Shipping_Cost_float
FROM Dummy_Sales_Data_v1;
```
Integer to Float | Image by Author
Simple as that!
However, CAST also has some limitations: you cannot convert a column to just any other data type. For instance, a date column cannot be converted into a float; the two data types, the existing one and the new one, should be similar in nature.
Microsoft has provided an interesting chart that serves as a quick reference for which data types are compatible with each other; you can download it from here.
That’s all! 🏆
I hope you finished this article quickly and found it refreshing and useful.
I have been using SQL for the past three years, and I found these tricks through practice. You can always let me know what else you do to clean your data. These concepts are very useful while working on real projects.