Most organizations generate time series data. Sales and financial records, core to nearly every business, are common examples. Time series data is any data that carries a temporal component: data recorded across time, not always at consistent intervals, but across time nonetheless.
[Related article: A Practitioner’s Guide To Interrupted Time Series]
The study of time series data and how to model it has been around for generations. Understanding your time series data gives you an avenue to anticipate what the future may hold. As such, most forecasting problems involve some flavor of time series analysis: the forecast model uses knowledge of the past to estimate what should be expected at various future time periods.
There are many methods that are great at solving time series problems. Traditionally, these fell well within the realm of statistical modeling. Auto-Regressive Integrated Moving Average (ARIMA), exponential smoothing, and Fourier transforms have all proven successful at modeling time series data. Many of these methods are univariate in nature; that is, the future is estimated by finding patterns in the single time series of interest. For example, if we want to estimate the sales total for March, we could use past patterns found in the sales data, infer that those patterns will continue into the future, and from that estimate the expected sales total for March. Some of these methods can also incorporate multivariate components. As with any modeling problem, the model actually used depends on the available data, the objective sought, and the comparative performance of candidate models on the data.
Univariate methods may leave a gap in understanding of the time series. In situations where a greater understanding of the influences on the series is required, we can turn to multivariate methods. One such approach is to use machine learning to build time series models that are both accurate in their forecasts and able to include influential features of the series.
Feature Engineering the Time Series
To model time series data using machine learning, the time feature must be broken out into subcomponents of the time series. For any date, the following is a small list of the features that may be created:
- Day of month
- Day of week
- Day of year
- Weekend or weekday
- Start of quarter
- End of quarter
- Days to month-end
- Days from month start
- Days to holiday
- Days from holiday
- Season of year
- N-period lagged values (e.g. yesterday, last week, last month, last quarter)
- Rolling statistics (mean, min, max, etc.)
Each one of these new features is meant to capture some component of the time series. Through this, it becomes possible to gain insight into the time features that are influential on the data.
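Many of the features listed above can be derived directly from a date column with pandas. The sketch below assumes a hypothetical daily sales DataFrame with `date` and `sales` columns; the column names, frequency, and window sizes are illustrative assumptions, not a prescribed setup.

```python
import pandas as pd

# Hypothetical daily sales series; names and frequency are assumptions.
df = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=120, freq="D"),
    "sales": range(120),
})

# Calendar features derived from the date itself.
df["day_of_month"] = df["date"].dt.day
df["day_of_week"] = df["date"].dt.dayofweek          # Monday = 0
df["day_of_year"] = df["date"].dt.dayofyear
df["is_weekend"] = df["date"].dt.dayofweek >= 5
df["is_quarter_start"] = df["date"].dt.is_quarter_start
df["is_quarter_end"] = df["date"].dt.is_quarter_end
df["days_to_month_end"] = df["date"].dt.days_in_month - df["date"].dt.day
df["days_from_month_start"] = df["date"].dt.day - 1

# Lag and rolling-window features on the target series.
df["sales_lag_1"] = df["sales"].shift(1)             # yesterday
df["sales_lag_7"] = df["sales"].shift(7)             # same day last week
# shift(1) before rolling so each row only sees past values (no leakage).
df["sales_roll_mean_7"] = df["sales"].shift(1).rolling(7).mean()
df["sales_roll_max_7"] = df["sales"].shift(1).rolling(7).max()
```

Note the `shift(1)` before the rolling windows: it keeps each row's rolling statistics computed from strictly earlier observations, which avoids leaking the current value into its own features.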
Some of the common univariate methods require stationarity in the series; in other words, the series must have a constant mean and variance. This is similar to the normality assumptions required for regression modeling. By applying feature engineering to the data and using a machine learning model, the stationarity requirement is removed. In general, machine learning methods relax a number of the assumptions that statistically-based time series methods impose. However, trend- and seasonality-related features may still be useful in a model, so they are usually worth investigating.
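One common way to hand trend and seasonality to a machine learning model is to encode them as explicit features rather than differencing the series. The sketch below is one possible encoding, assuming a daily date index: a linear time index for trend, and sine/cosine encodings for month-of-year seasonality (which avoid the artificial jump from December, 12, back to January, 1).

```python
import numpy as np
import pandas as pd

# Hypothetical two-year daily index; purely illustrative.
df = pd.DataFrame({"date": pd.date_range("2022-01-01", periods=730, freq="D")})

# Trend: a monotonically increasing time index the model can use to pick up drift.
df["time_index"] = np.arange(len(df))

# Seasonality: cyclical month encoding, continuous across year boundaries.
df["month_sin"] = np.sin(2 * np.pi * df["date"].dt.month / 12)
df["month_cos"] = np.cos(2 * np.pi * df["date"].dt.month / 12)
```

Tree-based models can often recover seasonality from a raw month number, but the cyclical encoding tends to help linear models and neural networks, which would otherwise treat month 12 and month 1 as far apart.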
Training vs Test Set for Time Series Data
With any machine learning model, the data must be split into a training set and a test set. The model is first built on the training data and then applied to the test set to assess its ability to generalize to new data. There are two primary methods for doing this when modeling time series data with machine learning.
The first is simply to find a cut-off point in the data and separate it into a training and test set. The important thing here is to maintain the time-series order of the data after it is split; if the order is disturbed, the model's ability to generalize will be hindered.
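A cut-off split can be sketched in a few lines. The 80/20 ratio below is an assumption for illustration, as is the toy DataFrame; the essential points are sorting by date first and never shuffling.

```python
import pandas as pd

# Hypothetical daily series; the 80/20 ratio is an illustrative choice.
df = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=100, freq="D"),
    "sales": range(100),
})
df = df.sort_values("date")           # ensure temporal order before splitting

cutoff = int(len(df) * 0.8)           # index of the cut-off point
train, test = df.iloc[:cutoff], df.iloc[cutoff:]

# Every training date precedes every test date; no shuffling occurred.
assert train["date"].max() < test["date"].min()
```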
The second approach is called windowing. Windowing is the process of training the model on a small range of dates and then testing it on the range of dates immediately following. This window then slides forward through the time series. It can be thought of as similar to cross-validation, except that instead of selecting random subsets of the data, the selection is based on dates and preserves their order.
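scikit-learn's `TimeSeriesSplit` implements this kind of ordered, forward-moving validation. By default its training set expands over time; setting `max_train_size` caps it so the window slides instead. The window and fold sizes below are illustrative assumptions for a toy array of 100 ordered observations.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)   # toy feature matrix, already time-ordered
y = np.arange(100)

# max_train_size makes the training window slide rather than expand;
# n_splits, window, and test sizes here are illustrative choices.
tscv = TimeSeriesSplit(n_splits=4, max_train_size=30, test_size=10)
for train_idx, test_idx in tscv.split(X):
    # Each test window begins immediately after its training window.
    assert train_idx[-1] + 1 == test_idx[0]
    print(f"train {train_idx[0]}-{train_idx[-1]}, test {test_idx[0]}-{test_idx[-1]}")
```

Each fold trains on 30 consecutive observations and tests on the next 10, so model performance is measured the same way the model would be used in production: predicting forward from a fixed history.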
Wrapping it up
We covered an initial introduction to using machine learning for modeling time series data. The general idea is that machine learning, while not always the perfect choice, can be powerful for time series because of its ability to handle non-linear data. The feature engineering applied to the time series is the key to how successful such a model will be.