Comparing Five Different Smooths – Which One Rules Them All?
Modeling | Posted by Brandon Dey, ODSC | September 18, 2018
Short answer: it depends on how fast and non-smooth (read: wiggly) a smooth your data demands. If you only need a line drawn roughly through a cloud of points, the best use of your time is probably to take Ockham’s razor to the problem and deploy the simplest approach: bin smoothing, a simple moving average, or Loess.
On the flip side, if you want something more complex or wigglier, you should compare smooths further on three criteria: smoothness, accuracy, and speed. There is no free lunch, though: no single smooth rules them all.
Let’s visually evaluate the smoothness of a few choice smooths from my last post by how well they fit Friedman’s formula, which I picked because it is itself a smooth function with both a fast and a slow curve (plus there’s literature on it, originally Friedman & Stuetzle (1984) on Supersmoother). (Not to be confused with the Friedmann equation, which is derived from Einstein’s theory of general relativity and predicts how fast the universe is expanding.)
We’ll start by fitting Friedman’s formula with a simple moving average and an exponential moving average, then try Loess with varying spans, and finish by fitting a simple generalized additive model and a cubic spline. I conclude the post with some general findings on the performance of a handful of popular smoothing techniques, per Breiman and Peters (1992).
That doesn’t look nice. The simple and exponential moving averages clearly don’t fit the underlying Friedman function (the red line) very well, so let’s try Loess with two separate parameter settings: (i) the default span from ggplot2 and (ii) a smaller span for a wigglier line. Keep in mind that in the wild (you’re at your desk, boss breathing down your neck) you won’t know what the red line actually is. I show it only so you can see the function that generated the gray data points, which is what we’re trying to recover.
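Before moving on to Loess, here is a minimal sketch of the two moving averages, written in plain Python rather than the R used for the plots. Everything in it is an assumption for illustration: the function names, the trailing (rather than centered) window, the window size and alpha, and the noisy sine wave standing in for Friedman’s formula, which this post does not spell out.

```python
import math
import random


def simple_moving_average(y, window):
    """Trailing mean of the last `window` values (fewer at the start)."""
    out = []
    running = 0.0
    for i, value in enumerate(y):
        running += value
        if i >= window:
            running -= y[i - window]  # drop the value that left the window
        out.append(running / min(i + 1, window))
    return out


def exponential_moving_average(y, alpha):
    """s[0] = y[0]; s[t] = alpha * y[t] + (1 - alpha) * s[t - 1]."""
    out = [y[0]]
    for value in y[1:]:
        out.append(alpha * value + (1 - alpha) * out[-1])
    return out


# Hypothetical stand-in data: a noisy sine wave, NOT Friedman's formula.
random.seed(0)
x = [i / 199 for i in range(200)]
truth = [math.sin(2 * math.pi * xi) for xi in x]
y = [t + random.gauss(0, 0.3) for t in truth]

sma = simple_moving_average(y, window=15)   # window size is an arbitrary choice
ema = exponential_moving_average(y, alpha=0.1)  # so is alpha
```

Both functions only look backwards, so the fitted line lags the data. That lag is one plausible reason a moving average struggles wherever the underlying curve bends quickly.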
Panel A (left) of the Loess plot clearly fits Friedman’s formula a lot better than either of the moving averages, but it has a hard time pulling up at the bends, especially at the fast-bending minimum around x = 0.2. Panel B (right) addresses this issue with a shorter span: the fit is more local, which lets the curve flex at both the fast and the slow bend.
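To make the effect of the span concrete, here is a rough Loess-style sketch in plain Python. It is not R’s loess(): it fits a local straight line (R’s loess() defaults to local quadratics), skips the robustness iterations, and is written for readability rather than speed. The function name local_linear_smooth and the demo data are hypothetical.

```python
import math


def local_linear_smooth(x, y, span):
    """Loess-style smooth: at each x0, fit a weighted least-squares line
    to the nearest span * n points, using tricube weights."""
    n = len(x)
    k = max(2, min(n, math.ceil(span * n)))  # neighbours per local fit
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = max(dists[k - 1], 1e-12)  # local bandwidth; guard against 0
        sw = swx = swy = swxx = swxy = 0.0
        for xi, yi in zip(x, y):
            u = abs(xi - x0) / h
            if u >= 1.0:
                continue  # outside the local window: weight is zero
            w = (1.0 - u ** 3) ** 3  # tricube weight
            sw += w
            swx += w * xi
            swy += w * yi
            swxx += w * xi * xi
            swxy += w * xi * yi
        denom = sw * swxx - swx * swx
        if abs(denom) < 1e-12:
            fitted.append(swy / sw)  # degenerate window: weighted mean
        else:
            slope = (sw * swxy - swx * swy) / denom
            intercept = (swy - slope * swx) / sw
            fitted.append(intercept + slope * x0)
    return fitted


# Hypothetical demo: the same data smoothed with a wide and a narrow span.
xs = [i / 99 for i in range(100)]
ys = [math.sin(2 * math.pi * xi) for xi in xs]
wide = local_linear_smooth(xs, ys, span=0.75)
narrow = local_linear_smooth(xs, ys, span=0.2)
```

A smaller span means each local line sees fewer, closer points, so the curve can turn more sharply, at the price of following more of the noise.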
Loess is O(n^2) in memory, so while it looks nicer, it can be slow on large datasets. In fact, ggplot2::geom_smooth() switches its default smoothing method from Loess to a generalized additive model (GAM) with formula = y ~ s(x, bs = "cs") once n reaches 1,000, which is shown here, fitted:
Unsurprisingly, the GAM fits our data just as well as the Loess with the shorter span did. The difference is cost: as n increases and memory gets increasingly precious, Loess stops being practical and we’re forced to switch to a GAM or another method.
How about we try fitting a spline?
If you’re interested in an excellent paper that goes deep on the available density estimation techniques, and evaluates the accuracy and speed of the R packages that implement them, I recommend Deng and Wickham (2011).
For a more comprehensive comparison of popular smoothers, see Breiman and Peters (1992), who ran simulations comparing a handful of smooths on five criteria (on a Sun 3/160 with only 16 MB, with an M!, of RAM and 16.67 MHz…). In general, they concluded:
- No single smoother outperformed the rest across all five criteria (root mean square error, root mean square bias, maximum deviation, smoothness, and bandwidth), distributions, and functions. Even so, delete-knot regression splines were the smoothest, most efficient, and performed most accurately on straight-line data.
- On large samples (n >= 225), smoothing splines were the slowest and roughest.
- On small samples (n = 25), smoothing splines generally had the highest accuracy as measured by RMSE.
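If you want to score your own smooths along similar lines, a minimal sketch might look like the one below. These are the textbook definitions of RMSE and maximum absolute deviation, plus a hypothetical roughness measure (the sum of squared second differences) used as a stand-in for smoothness. I am not claiming these match the exact definitions in Breiman and Peters (1992), and all three require knowing the true function, which you only do in a simulation.

```python
import math


def rmse(fitted, truth):
    """Root mean square error between fitted values and the true function."""
    return math.sqrt(
        sum((f - t) ** 2 for f, t in zip(fitted, truth)) / len(truth)
    )


def max_deviation(fitted, truth):
    """Largest absolute gap between the fit and the true function."""
    return max(abs(f - t) for f, t in zip(fitted, truth))


def roughness(fitted):
    """Sum of squared second differences (an assumed smoothness proxy):
    0 for a straight line, larger for a wigglier fit."""
    return sum(
        (fitted[i + 1] - 2 * fitted[i] + fitted[i - 1]) ** 2
        for i in range(1, len(fitted) - 1)
    )
```

Lower is better on all three. The tension between them is the whole story of this post: driving RMSE down at a sharp bend usually means accepting a rougher curve.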
I glossed over a ton of detail on every smoother I talked about, which I’d love to dig into in a longer post; I didn’t even touch on the differences between parametric and non-parametric techniques.
The R code behind these plots can be found on my GitHub, here.
- Deng and Wickham (2011):
- Dr. David Banks of Duke University, Department of Statistical Science, Lecture 2:
- “That’s Smooth”, Statistical Research, Blog: