Using a Smoother in SQL

A problem found throughout the world of data is how to distinguish signal from noise. When dealing with data that comes in a sequence, such as time-series data (the most familiar, though by no means the only, example), a frequent way of dealing with the problem is to apply a smoother. Smoothers range from the very simple to the very sophisticated, but we will concentrate on the simpler end of the spectrum.

[Related Article: How to Get Started with SQL]

All these smoothers, no matter how simple or how complicated, have one thing in common: they rely on the data appearing in a particular order. To calculate the smoother, you need to be able to refer to multiple rows around the row in focus, something SQL was not always well set up to do.

In more recent years, major implementations of SQL, including MySQL, SQLite, PostgreSQL, and SQL Server, have added Lead() and Lag() functions, allowing calculations to be performed on rows appearing before and after a row of interest. We can make use of these functions to implement simple smoothers in SQL, such as a weighted moving average.

Like all window functions, Lead() and Lag() give you flexibility in specifying the data the function applies to, using the OVER clause with the PARTITION BY and ORDER BY keywords. In semantic terms:
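The original code block does not survive here, but the general shape is something like the following sketch, where the table and column names are placeholders rather than the article's actual schema:

```sql
-- Schematic only: 'Scores', 'Team', 'Round', and 'Value' are illustrative names.
SELECT
    Round,
    Value,
    -- Value from the previous row within each team, ordered by round
    Lag(Value) OVER (PARTITION BY Team ORDER BY Round) AS PrevValue
FROM Scores;
```

PARTITION BY restarts the window for each team, while ORDER BY defines what "previous row" means.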

Note that the default offset is one row, but you can specify the offset within the Lag function, e.g. Lag(Value, 2) returns the value from two rows behind the current row.

However, we can't refer to a window function's result by its column alias elsewhere at the same query level, so we can't directly do the arithmetic we need to calculate the weighted moving average itself. Therefore we need to create a Common Table Expression (CTE) to provide the lagged columns, and run a subsequent query to calculate the weighted moving average itself. Again, in purely semantic terms:
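Again the original block is not reproduced here, but the pattern might be sketched as follows; the names and the example weights (0.5, 0.3, 0.2) are illustrative assumptions, not the article's actual choices:

```sql
-- Purely schematic: CTE provides the lagged columns, the outer query
-- combines them into a (hypothetically weighted) moving average.
WITH Lagged AS (
    SELECT
        Round,
        Value,
        Lag(Value, 1) OVER (ORDER BY Round) AS Lag1,
        Lag(Value, 2) OVER (ORDER BY Round) AS Lag2
    FROM Scores
)
SELECT
    Round,
    0.5 * Value + 0.3 * Lag1 + 0.2 * Lag2 AS WeightedMA
FROM Lagged;
```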

However, to illustrate properly we need a good example. For better or worse, I'm an Australian from Victoria, so the football code I follow is Australian Rules (less likely to be true if I were an Australian from NSW or Queensland). My team is the Essendon Bombers, who recently recorded one of the biggest losses in their history against the Western Bulldogs, only managing to score 33 points to the Western Bulldogs' 137.

By way of self-flagellation, I thought it would be instructive to calculate the moving average of both teams' scores in recent matches in order to better see how each of them is trending.

Firstly, we need to create a table containing the scores from the last several matches.
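The article's original table definition is not reproduced here, but a plausible schema might look like the sketch below. The column names and the round number are assumptions; the only scores shown are the ones from the match mentioned above, and earlier rounds would be inserted the same way:

```sql
-- Hypothetical schema: one row per round, one score column per team.
CREATE TABLE Scores (
    Round    INTEGER,  -- round number (value below is illustrative)
    Essendon INTEGER,  -- Essendon's total points for the round
    Bulldogs INTEGER   -- Western Bulldogs' total points for the round
);

-- The match discussed in the article; earlier rounds not reproduced here.
INSERT INTO Scores (Round, Essendon, Bulldogs) VALUES (21, 33, 137);
```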

Then we have the query itself, following the pattern introduced above: a CTE to create the lagged columns, and a subsequent query to calculate the moving average itself. Note that this is the simple moving average, but in many cases a weighted moving average is used, with a variety of different weighting schemes to suit differing tastes.
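The query itself does not survive in this copy. A reconstruction along those lines, assuming the schema sketched earlier and a three-round simple moving average, might look like:

```sql
-- Sketch only: three-round simple moving average for each team,
-- built from the current row plus the two preceding rounds.
WITH Lagged AS (
    SELECT
        Round,
        Essendon,
        Bulldogs,
        Lag(Essendon, 1) OVER (ORDER BY Round) AS Ess1,
        Lag(Essendon, 2) OVER (ORDER BY Round) AS Ess2,
        Lag(Bulldogs, 1) OVER (ORDER BY Round) AS Dog1,
        Lag(Bulldogs, 2) OVER (ORDER BY Round) AS Dog2
    FROM Scores
)
SELECT
    Round,
    (Essendon + Ess1 + Ess2) / 3.0 AS EssendonMA,
    (Bulldogs + Dog1 + Dog2) / 3.0 AS BulldogsMA
FROM Lagged
ORDER BY Round;
```

Note that the first two rounds will produce NULL averages, since Lag() returns NULL where no earlier row exists.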

Giving the result:

Possibly a little comfort for Essendon fans there, in the sense that the Western Bulldogs appear to have been improving for several rounds in a row, a stronger trend than Essendon's more recent collapse in scoring.

[Related Article: All The Cool Things You Can Do With PostgreSQL for Data Analysis]

In any case, a useful, if personally painful, way to illustrate how SQL can be used to calculate a moving average with window functions.

Originally Posted Here

Robert de Graaf

Robert de Graaf began his career as an engineer, but switched to data science after discovering the power of statistics to solve real world problems. He is Senior Data Scientist at RightShip and a founding partner of OutputAI Labs. He is the author of many articles on data science, focusing on techniques which ensure you tackle the right problem, as well as being the author of the book 'Managing Your Data Science Projects (Apress)'.