Multi-step Time Series Forecasting with ARIMA, LightGBM, and Prophet Multi-step Time Series Forecasting with ARIMA, LightGBM, and Prophet
Time series forecasting is a quite common topic in the data science field. Companies use forecasting models to get a clearer... Multi-step Time Series Forecasting with ARIMA, LightGBM, and Prophet

Time series forecasting is a quite common topic in the data science field. Companies use forecasting models to get a clearer view of their future business. Choosing the right algorithm might be one of the hard decisions when you develop time series forecasting model. In this article, we are comparing three different algorithms, namely ARIMA/SARIMA, LightGBM, and Prophet, on different types of time series datasets. ARIMA/SARIMA is one of the most popular classical time series models. Prophet is the newer statical time series model developed by Facebook in 2017. LightGBM is a popular machine learning algorithm that is generally applied to tabular data and can capture complex patterns in it.

We are using the following four different time series data to compare the models:

  1. Cyclic time series (Sunspots data)
  2. Time Series without trend and seasonality (Nile dataset)
  3. Time series with a strong trend (WPI dataset)
  4. Time series with trend and seasonality (Airline dataset)

While we will try ARIMA/SARIMA and LightGBM on all the four different time series, we will model Prophet only on the Airline dataset as it is designed to work on seasonal time series.

1. Cyclic Time Series (Sunspots data)

Cyclic time series have rises and falls that are not of a fixed frequency which is different from seasonal time series having a fixed and known frequency. The dataset below is yearly (1700–2008) data on sunspots from the National Geophysical Data Center.

import statsmodels.api as sm

data = sm.datasets.sunspots.load_pandas()
ts_sun = data.data.set_index('YEAR').SUNACTIVITY
ts_sun.plot(figsize=(12, 5))

Sunspots dataset

First, we are examining the stationarity of the time series. Stationarity means time series does not change its statistical properties over time, specifically its mean and variance. Time series with cyclic behavior is basically stationary while time series with trends or seasonalities is not stationary (see this link for more details). We need stationary time series to develop stable linear models, such as ARIMA.

Below we are setting up and executing a function that shows autocorrelation (ACF) and partial autocorrelation (PACF) plots along with performing Augmented Dickey–Fuller unit test.

import pandas as pd
import matplotlib.pyplot as plt
import warnings

# adapted from https://www.kaggle.com/kashnitsky/topic-9-part-1-time-series-analysis-in-python?scriptVersionId=50985180&cellId=80
def tsplot(y, lags=None, figsize=(12, 7)):
        Plot time series, its ACF and PACF, calculate Dickey–Fuller test
        y - timeseries
        lags - how many lags to include in ACF, PACF calculation
    if not isinstance(y, pd.Series):
        y = pd.Series(y)
    fig = plt.figure(figsize=figsize)
    layout = (2, 2)
    ts_ax = plt.subplot2grid(layout, (0, 0), colspan=2)
    acf_ax = plt.subplot2grid(layout, (1, 0))
    pacf_ax = plt.subplot2grid(layout, (1, 1))

    p_value = sm.tsa.stattools.adfuller(y)[1]
    ts_ax.set_title('Time Series Analysis Plots\n Dickey-Fuller: p={0:.5f}'.format(p_value))
    smt.graphics.plot_acf(y, lags=lags, ax=acf_ax)
    smt.graphics.plot_pacf(y, lags=lags, ax=pacf_ax)

Autocorrelation (ACF) plot can be used to find if time series is stationarity. It also can be helpful to find the order of moving average part in ARIMA model. Partial autocorrelation (PACF) plot is useful to identify the order of autoregressive part in ARIMA model. Augmented Dickey–Fuller unit test examines if the time series is non-stationary. The null hypothesis is that the series is non-stationary, hence if the p-value is small, it implies the time series is NOT non-stationary.

In the picture above, Dickey-Fuller test p-value is not significant enough (> 5%). We are going to take the first difference to make the series more stationary.

ts_sun_diff = (ts_sun - ts_sun.shift(1)).dropna()

This time, Dickey-Fuller test p-value is significant which means the series now is more likely to be stationary.

The ACF plot shows a sinusoidal pattern and there are significant values up until lag 8 in the PACF plot. This implies ARIMA(8,1,0) model (We took the first difference, hence d=1). You can see the general rules to determine the orders on ARIMA parameters from ACF/PACF plots in this link. In the next step, we are going to use AutoARIMA in sktime package which automatically optimizes the orders of ARIMA parameters. Given that, the plot analysis above to find the right orders on ARIMA parameters looks unnecessary, but it still helps us to determine the search range of the parameter orders and also enables us to verify the outcome of AutoARIMA.

Before modeling, we are splitting the data into a training set and a test set. The first 80% of the series is going to be the training set and the rest 20% is going to be the test set.

test_len = int(len(ts_sun) * 0.2)
sun_train, sun_test = ts_sun.iloc[:-test_len], ts_sun.iloc[-test_len:]

1.1 ARIMA on Sunspots dataset

ARIMA is one of the most popular time series forecasting models which uses both past values of the series (autoregression) and past forecasting errors (moving average) in a regression-like model. The model has three different parameters p, d, and qp is the order of the autoregressive part, d is the degree of first difference involved, and q is the order of the moving average part. We need to find the right values on these parameters to get the most suitable model on our time series. We are using sktime’s AutoARIMA here which is a wrapper of pmdarima and can find those ARIMA parameters (p, d, q) automatically. pmdarima is a Python project which replicates R’s auto.arima functionality. You can see how auto.arima automatically tunes the parameters in this link. As the analysis above suggests ARIMA(8,1,0) model, we set start_p and max_p with 8 and 9 respectively.

from sktime.forecasting.arima import AutoARIMA

forecaster = AutoARIMA(start_p=8, max_p=9, suppress_warnings=True)
sun_train.index = sun_train.index.astype(int)

It turned out AutoARIMA picked slightly different parameters from our beforehand expectation.

Next, we are setting up a function below which plots the model forecast along with evaluating the model performance. We are using mean absolute error (MAE) and mean absolute percentage error (MAPE) for the performance metrics. MAE averages absolute prediction error over the prediction period:

𝑡 is time, 𝑦𝑡 is the actual y value at 𝑡, 𝑦̂ 𝑡 is the predicted value, and 𝑛 is the forecasting horizon.

MAPE is the scaled metric of MAE which is dividing absolute error by the actual 𝑦:

from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_absolute_percentage_error

def plot_forecast(series_train, series_test, forecast, forecast_int=None):

    mae = mean_absolute_error(series_test, forecast)
    mape = mean_absolute_percentage_error(series_test, forecast)

    plt.figure(figsize=(12, 6))
    plt.title(f"MAE: {mae:.2f}, MAPE: {mape:.3f}", size=18)
    series_train.plot(label="train", color="b")
    series_test.plot(label="test", color="g")
    forecast.index = series_test.index
    forecast.plot(label="forecast", color="r")
    if forecast_int is not None:
    plt.legend(prop={"size": 16})

    return mae, mape

fh = np.arange(test_len) + 1
forecast, forecast_int = forecaster.predict(fh=fh, return_pred_int=True, alpha=0.05)
sun_arima_mae, sun_arima_mape = plot_forecast(
    sun_train, sun_test, forecast, forecast_int

The table below summarizes the outcome of the two different models. For this time series data, LightGBM performs better than ARIMA.

Model performance on Sunspots data

Nile dataset contains measurements on the annual flow of the Nile as measured at Ashwan for 100 years from 1871–1970. The time series does not have any seasonality nor obvious trend.

ts_nl = sm.datasets.get_rdataset("Nile").data
ts_nl = ts_nl.set_index('time').value


While Dickey-Fuller test implies it’s stationary, there is some autocorrelation as can be seen in ACF plot. We are trying to see how its first difference looks like.

ts_nl_diff = (ts_nl - ts_nl.shift(1)).dropna()

This looks more stationary than the original as the ACF plot shows an immediate drop and also Dicky-Fuller test shows a more significant p-value. From this analysis, we would expect ARIMA with (1, 1, 0), (0, 1, 1), or any combination values on p and q with d = 1 since ACF and PACF shows significant values at lag 1. Let’s see what parameter values AutoARIMA picks.

2.1 ARIMA on Nile dataset

test_len = int(len(ts_nl) * 0.3)
nl_train, nl_test = ts_nl.iloc[:-test_len], ts_nl.iloc[-test_len:]

forecaster = AutoARIMA(suppress_warnings=True)

The model picked d = 1 as expected and has 1 on both p and q. Then, we are creating a forecast with its evaluation.

fh = np.arange(test_len) + 1
forecast, forecast_int = forecaster.predict(fh=fh, return_pred_int=True, alpha=0.05)
nl_arima_mae, nl_arima_mape = plot_forecast(nl_train, nl_test, forecast, forecast_int)

As there are no clear patterns in the time series, the model predicts almost constant value over time.

2.2 LightGBM on Nile dataset

We are using the same functions as the previous data to develop LightGBM.

param_grid = {"window_length": [5, 10, 15, 20, 25, 30]}
forecaster = create_forecaster()
nl_lgb_mae, nl_lgb_mape = grid_serch_forecaster(nl_train, nl_test, forecaster, param_grid)

It turned out LightGBM creates a similar forecast as ARIMA. The summary table below shows there is not much difference between the two models.

3. Time series with a strong trend (WPI dataset)

U.S. Wholesale Price Index (WPI) from 1960 to 1990 has a strong trend as can be seen below.

import requests
from io import BytesIO

wpi1 = requests.get("https://www.stata-press.com/data/r12/wpi1.dta").content
data = pd.read_stata(BytesIO(wpi1))
ts_wpi = data.set_index("t").wpi


We are taking the first difference to make it stationary.

ts_wpi_diff = (ts_wpi - ts_wpi.shift(1)).dropna()

It still looks non-stationary as the ACF drops slowly over time and Dicky-Fuller also does not show a significant p-value. Hence, we are taking one more difference.

ts_wpi_2diff = (ts_wpi_diff - ts_wpi_diff.shift(1)).dropna()

Now, it looks stationary with the Dicky-Fuller’s significant value and the ACF plot showing the rapid drop. From this analysis, we would expect d = 2 as it required second difference to make it stationary. As the ACF has a significant value at lag 1 and the PACF has the ones untile lag 2, we can expect q = 1 or p = 2.

3.1 ARIMA on WPI dataset

We are splitting the time series into training and test set, then train ARIMA model on it.

ts_wpi.index = ts_wpi.index.to_period("Q")
test_len = int(len(ts_wpi) * 0.25)
wpi_train, wpi_test = ts_wpi.iloc[:-test_len], ts_wpi.iloc[-test_len:]

forecaster = AutoARIMA(suppress_warnings=True)

As confirmed in the previous analysis, the model has a second degree of differences. Next, we are creating a forecast along with its evaluation.

fh = np.arange(test_len) + 1
forecast, forecast_int = forecaster.predict(fh=fh, return_pred_int=True, alpha=0.05)
wpi_arima_mae, wpi_arima_mape = plot_forecast(
    wpi_train, wpi_test, forecast, forecast_int

3.2 LightGBM on WPI dataset

We are modeling LightGBM in the same way as before to see how it works on this time series.

param_grid = {"window_length": [5, 10, 15, 20, 25, 30]}
forecaster = create_forecaster()
wpi_lgb_mae, wpi_lgb_mape = grid_serch_forecaster(
    wpi_train, wpi_test, forecaster, param_grid

LightGBM is clearly not working well. As the regression tree algorithm cannot predict values beyond what it has seen in training data, it suffers if there is a strong trend on time series. In this case, we need to detrend the time series before modeling. sktime offers a convenient tool Detrender and PolynomialTrendForecasterto detrend the input series which can be included in the training module.

Before including it in the training module, we are demonstrating PolynomialTrendForecaster below to see how it works.

from sktime.forecasting.trend import PolynomialTrendForecaster
from sktime.transformations.series.detrend import Detrender

# linear detrending
forecaster = PolynomialTrendForecaster(degree=1)
transformer = Detrender(forecaster=forecaster)
yt = transformer.fit_transform(wpi_train)

forecaster = PolynomialTrendForecaster(degree=1)
fh_ins = -np.arange(len(wpi_train))  # in-sample forecasting horizon
y_pred = forecaster.fit(wpi_train).predict(fh=fh_ins)

    wpi_train, y_pred, yt, labels=["y_train", "fitted linear trend", "residuals"]

You can see the trend forecaster captures the trend in the time series in the picture above.

Next, we are creating a forecaster using TransformedTargetForecaster which includes both Detrender wrapping PolynomialTrendForecasterand LGBMRegressor wrapped in make_reduction function, then train it with grid search on window_length.

from sktime.forecasting.compose import TransformedTargetForecaster

def create_forecaster_w_detrender(degree=1):

    # creating forecaster with LightGBM
    regressor = lgb.LGBMRegressor()
    forecaster = TransformedTargetForecaster(
            ("detrend", Detrender(forecaster=PolynomialTrendForecaster(degree=degree))),
                make_reduction(regressor, window_length=5, strategy="recursive"),

    return forecaster

param_grid = {"forecast__window_length": [5, 10, 15, 20, 25, 30]}
forecaster = create_forecaster_w_detrender(degree=1)
wpi_lgb_mae, wpi_lgb_mape = grid_serch_forecaster(
    wpi_train, wpi_test, forecaster, param_grid

This time LightGBM is forecasting the value beyond the training target range with the help of the detrender.

The table below summarizes the performance of the two different models on the WPI data. LightGBM again performs better than ARIMA.

Model performance on WPI data

4. Time series with trend and seasonality (Airline dataset)

The Box-Jenkins airline dataset consists of the number of monthly totals of international airline passengers (thousand units) from 1949–1960. This data has both trend and seasonality as can be seen below.

from sktime.datasets import load_airline
ts_al = load_airline()


First, we are taking a seasonal difference (lag 12) to make it stationary.

ts_al_diff = (ts_al - ts_al.shift(12)).dropna()

It still looks not stationary with ACF dropping slowly, so we are taking an additional first difference on it.

ts_al_2diff = (ts_al_diff - ts_al_diff.shift(1)).dropna()

Now, it looks stationary as Dickey-Fuller’s p-value is significant and the ACF plot shows a quick drop over time. The outcome of this analysis implies SARIMA with d = 1 and (order of seasonal difference) = 1.p or q can be 1 as ACF and PACF plots show significant value at lag 1.

4.1 SARIMA on Airline dataset

Next, we split the data into training and test set and then develop SARIMA (Seasonal ARIMA) model on them. SARIMA model has additional seasonal parameters (P, D, Q) over ARIMA. P, D, and Q represent order of seasonal autocorrelation, degree of seasonal difference, and order of seasonal moving average respectively. To model SARIMA, we need to specify sp parameter (seasonal period. In this case it is 12) on AutoARIMA.

test_len = int(len(ts_al) * 0.3)
al_train, al_test = ts_al.iloc[:-test_len], ts_al.iloc[-test_len:]

forecaster = AutoARIMA(sp=12, suppress_warnings=True)

As expected, the created model has d = 1 and D = 1. Next, we create a forecast with its evaluation.

fh = np.arange(test_len) + 1
forecast, forecast_int = forecaster.predict(fh=fh, return_pred_int=True, alpha=0.05)
al_arima_mae, al_arima_mape = plot_forecast(al_train, al_test, forecast, forecast_int)

4.2 LightGBM on Airline dataset

As the time series has seasonality, we are adding Deseasonalizer in our LightGBM forecaster module. As the seasonality effect varies across years, we are setting “multiplicative” on Deseasonalizer module.

from sktime.transformations.series.detrend import Deseasonalizer

def create_forecaster_w_desesonalizer(sp=12, degree=1):

    # creating forecaster with LightGBM
    regressor = lgb.LGBMRegressor()
    forecaster = TransformedTargetForecaster(
            ("deseasonalize", Deseasonalizer(model="multiplicative", sp=sp)),
            ("detrend", Detrender(forecaster=PolynomialTrendForecaster(degree=degree))),
                make_reduction(regressor, window_length=12, strategy="recursive"),

    return forecaster
forecaster = create_forecaster_w_desesonalizer()
param_grid = {"forecast__window_length": [6, 12, 18, 24, 30, 36]}

al_lgb_mae, al_lgb_mape = grid_serch_forecaster(
    al_train, al_test, forecaster, param_grid

4.3 Prophet on Airline dataset

Prophet is a time series forecasting model developed by Facebook in 2017 which can effectively deal with multiple seasonalities (yearly, weekly, and daily). It also has capabilities incorporating the effects of holidays and implementing custom trend changes in the time series. As our time series do not require all of those functionalities, we are just using Prophet only with yearly seasonality turned on.

from fbprophet import Prophet

al_train_pp = al_train.to_timestamp(freq="M").reset_index()
# Prophet requires specific column names: ds and y
al_train_pp.columns = ["ds", "y"]

# turning on only yearly seasonality as this is monthly data.
# As the seasonality effects varies across years, we need multiplicative seasonality mode
m = Prophet(

future = m.make_future_dataframe(periods=test_len, freq="M")
forecast = m.predict(future)
forecast = forecast.iloc[-test_len:]
forecast.rename(columns={"yhat_lower": "lower", "yhat_upper": "upper"}, inplace=True)
al_pph_mae, al_pph_mape = plot_forecast(
    al_train, al_test, forecast["yhat"], forecast[["lower", "upper"]]

The table below compares the performance metrics with the three different models on the Airline dataset. While there is not much performance difference between those three models, ARIMA performed slightly better than others.

Model performance on Airline data


In this blog post, we compared the three different model algorithms on the different types of time series. LightGBM showed comparable or better performance than ARIMA except for the time series with seasonality (Airline).

Model Forecast MAE by Time Series Dataset

As LightGBM is a non-linear model, it has a higher risk of overfitting to data than linear models. You might want to set up reliable cross-validation when you use it. The machine learning approach also has an advantage over linear models if your data has a lot of different time series (e.g. stock prices of companies or sales by product) as you may be able to forecast multiple time series with a single machine learning model (we didn’t dig into this advantage in this blog post. Please look at some implementation from M5 kaggle competition if you are interested in it).

One of the drawbacks of the machine learning approach is that it does not have any built-in capability to calculate prediction interval while most statical time series implementations (i.e. ARIMA or Prophet) have it. You might want to code your own module to calculate it.

While Prophet does not perform better than others in our data, it still has a lot of advantages if your time series has multiple seasonalities or trend changes.

You can see the full working code on Google Colab or Github.

Article originally posted here by Tomonori Masui. Reposted with permission.

Cover Photo by Markus Winkler on Unsplash

ODSC Community

The Open Data Science community is passionate and diverse, and we always welcome contributions from data science professionals! All of the articles under this profile are from our community, with individual authors mentioned in the text itself.