Data wrangling is an essential job function for data engineering, data science, or machine learning roles. As knowledgable coders, many of these professionals can rely on their programming skills and help from libraries like Pandas to wrangle data. However, it can often be optimal to manipulate data directly at the source with SQL scrips. Here are some data wrangling techniques every data expert should know.
Note for the purpose of this blog post I will not go into the details of the many versions of SQL. Much of the syntax mentioned is conformant with the SQL-92 standard, but where there are platform-specific clauses and functions, I will try and point them out (PostgreSQL, MySQL, T-SQL, PSQL, etc) but not extensively. Remember, however, it’s common to implement a missing function directly from your favorite flavor of SQL. For example, the correlation coefficient function, CORR, is present in PostgreSQL and PLSQL but not MySQL and can be implemented directly or imported from an external library. To keep this blog brief, I don’t specify the implementation of these techniques but they are widely available and will also be covered in my upcoming virtual training session where I’ll go into further detail on data wrangling techniques and how to wrangle data.
1. Data Profiling with SQL
Data profiling is often the first step in any data science or machine learning project and an important preliminary step to explore the data prior to data preparation or transformation. The goal is straightforward: determine if the data is accurate, reliable, and representative to build performing algorithms since the resulting models are only as good as the data employed therein.
Core SQL Skills for Data Profiling
- Descriptive statistics: MIN, MAX, MEAN, AVERAGE, MEDIAN,, STDEV, , STDDEV(PostgreSQL), STDEVP (Transact-SQL)
- Correlation functions:: CORR (PostgreSQL),
- Aggregation functions: COUNT, SUM, GROUP BY
Data profiling starts with some basic descriptive statistic functions such as MIN, MAX, STDEV, etc to help you understand the distribution and range of your data. Even basic aggregation functions like COUNT and GROUP BY are useful. From there, standard deviation functions including STDEV/STDEVP can tell you a lot about your data (high, low) and its distribution. For a quick look at correlations across features and the target feature, the CORR correlation coefficient function can give some quick insight.
2. Detecting Outliers with SQL
An essential part of data profiling is identifying outliers, and SQL can be a quick and easy way to find extreme values. You can get quite far with the below simple techniques and employing basic sorting techniques.
Core SQL Skills for Detecting Outliers
- Sorting & Filtering: ORDER, CASE, WHERE HAVING
- Descriptive statistics: AVG, AVERAGE, MEAN, MEDIAN
- Winsorization: PERCENTILE_CONT, PERCENT_RANK(T-SQL), Subqueries, VAR (T-SQL)
- Fixing outliers: UPDATE, NULL, AVG
Simple sorting and ordering using ORDER BY will produce quick results and further classifying using CASE statements and using groups and summation. More sophisticated techniques can be used by implementing various forms of Winsorization and variance functions VAR(T-SQL). To identify outliers you can use percentile calculations, and VAR(T-SQL) can be used to measure the statical variants. A good blog on how to implement it can be found here.
There are various techniques for handling outliers from ignoring to removing (trimming). Replacement techniques such as Winsorization where we essentially change extreme values in the dataset to less extreme values with observations closest to them. This typically requires us to calculate percentiles. We can use UPDATE with PERCENTILE_CONT (present in PostgreSQL, T-SQL, and PLSQ) to, for example, sets all observations greater than the 95th percentile equal to the value at the 95th percentile and all observations less than the 5th percentile equal to the value at the 5th percentile.
3. Time Series, ReIndexing, and Pivoting with SQL
Time-series databases such as Prometheus, TDengine, and InfluxDB are quite popular among data wrangling techniques but more often than not you may be dealing with a standard relational database, and pivoting, reindexing, or transforming data into time series can be a challenge.
Core SQL Skills for TimeSeries, ReIndexing, and Pivoting with SQL
- Window Functions: LAG, OVER, (PostgreSQL, T-SQL, MySQL) OVER
- Time-series generation: Pivoting
- PIVOT (T-SQL) CROSSTAB, CASE
- Data Trends: COALSE
For time series and data realignment, window functions perform calculations on a set of rows that are related together and thus perfect for this type of data transformation. For example, if we have a set of time-stamped data and we wanted to get the difference at each step, you can use the LAG function. The LAG() function provides access to a row that comes before the current row. This works because window operations like LAG() do not collapse groups of query rows into a single output row. Instead, they produce a result for each row. Using the OVER clause specifies you can partition query rows into groups for processing by the window function:
Reindexing (not to be confused with database indexing) or aligning datasets is another key technique. For example, you have sales from different time periods you’d like to compare. Using windowing functions and subqueries you can reset the data from the same data point or zero or align on the same period units in time.
Pivoting is a widely used spreadsheet technique among data wrangling techniques. Often when looking at periodic or timestamp data you get a single column of data you’d like to pivot to multispoke columns. For example, pivoting a single column of monthly sales listed by rows into multiple columns of sales by month. PIVOT and CROSSTAB can be used but for other SQL favors that don’t support those functions, you can implement them with CASE statements.
4. Fake Data Generation with SQL
There are a number of reasons to generate fake data, including:
- Anonymizing production and sensitive data before porting it to other data stores
- Generating more data that has similar characteristics to existing data for load testing, benchmarking, robustness testing, or scaling.
- Varying the distribution of your data to understand how your data science or ML model performs for bias, loss function, or other metrics
Core SQL skills:
- RANDOM, RAND
- CREATE, UPDATE, SELECT, INSERT, and subqueries
- JOIN and CROSS JOIN, MERGE CAST
- String manipulation: functions, CONCAT SUBSTR, TRIM, etc.
SQL can be useful for generating fake data for load testing, benchmarking, or model training and testing. For numerical data, RANDOM, RAND(), and its variants are classic random generators. Generation data for new tables (CREATE) or updating existing ones (UPDATE). Data selection and subqueries are generally a prerequisite. CROSS JOIN is a favorite as it lets you combine data (aka cartesian join) to make new data sets and MEGE can be used similarly. CAST is useful for converting variables into different data types. When generating text string manipulation CONCAT SUBSTR, TRIM and a host of other functions can be used to create new text variables.
5. Loading Flat Files with SQL
This may look like a bonus round data wrangling technique but I’ve worked with many data experts that struggle to load data files directly into a database (the L in ETL). Why do this? For many research projects data is not readily available in data lakes, APIs, and other data stores. Generally, it has to be collected in the wild and more often than not in a flat-file format or multiple flat files.
Core Skills required
- SQL Clients; Understand delimiters, text delimiting, and special characters
- Table creation CREATE, TEMPORARY TABLE
- Importing data, IMPORT, INSERT, SELECT INTO
- Normalize data with SQL clauses including, SELECT, UPDATE, WHERE, JOIN, and subqueries
- Transaction control: BEGIN, COMMIT, and ROLLBACK
ETL (Extraction, Transformation, Loading) is the most common way to simply import CSV files. You can use SQL for the all-important verification step. Simple verification includes checking for nulls, understanding the range and distribution of values imported, finding extreme values, and event simple count checks.
However, loading flat file data into a database is not as trivial as it seems thanks to the need for specific delimiters. Most SQL clients allow you to specify custom delimiters. For text-heavy files, you can’t use simple text quote delimiters. Double quotes and other techniques must be employed. Be careful to check for whitespace characters also including newline characters.
Importing a single flat file like a CSV file is somewhat straightforward once the above steps are done. However, by nature flat files are not normalized thus you can use SQL to create a temporary table to load the data and then use the SQL filter clause and Update clause to port that data to the final destination tables. A common technique is to import that data to temporary tables in “raw” format and then normalize the data to other tables. As always when inserting or updating data employ transaction controls.
Get started with SQL for data science and add it to your skillset at ODSC West 2022
If you’re looking to add an in-demand, evergreen, and broad-use skill to your repertoire, then maybe it’s time to learn SQL or other core data science skills. At ODSC West 2022, we’ll have an entire mini bootcamp track where you can start with core beginner skills and work your way up to more advanced data science skills, such as working with NLP or neural networks. By registering now, you’ll also gain access to Ai+ Training on demand for a year. Sign up now, start learning today!