Feature Engineering for Time Series Analysis – ODSC East 2018

What is time series analysis in data science? How can you construct an effective time series analysis? What does feature engineering have to do with it? These questions were among the main talking points of “Feature Engineering for Time Series Data,” a training session given at ODSC East 2018 by Michael Schmidt, PhD, Chief Scientist at DataRobot.

In plain English, a time series is a collection of data points recorded over a period of time, for example at weekly, monthly, or yearly intervals. Time series problems occur throughout different facets of society, such as the stock price of a specific company over time or the meteorological conditions of a city.

Time series analysis is a set of techniques used to understand and extract insights from a collection of data, based on the patterns and behaviors of the data points over a specific time period. Time series data is usually timestamped and collected at regular intervals.

Forecasting is a core part of time series analysis: it tries to predict future values of the analysed signal. It is one of the hardest problems in predictive analytics, both because it is not always obvious which attributes can explain the future values of the signal and because you will often have less data than you would like. For example, monthly data over a 4-year period gives you just 48 data points. Because time series data is temporal, you will often have only one data point per timestamp. The motto “the more data the better” is true only up to a certain point, particularly in time series analysis. Adding more data can actually hurt your model, so there is a sweet spot in how much data you need to run a time series model effectively.
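
To make the data-size point concrete, here is a minimal sketch, assuming pandas and NumPy are available. The values are synthetic and the name `monthly_sales` is hypothetical; it only illustrates how little data a monthly series over four years actually contains.

```python
import numpy as np
import pandas as pd

# Four years of monthly timestamps (month-start frequency).
index = pd.date_range(start="2014-01-01", periods=48, freq="MS")

# Synthetic values for illustration only: one observation per timestamp.
rng = np.random.default_rng(seed=0)
monthly_sales = pd.Series(
    rng.normal(loc=100.0, scale=10.0, size=len(index)), index=index
)

n_points = len(monthly_sales)                    # 4 years x 12 months
n_years = monthly_sales.index.year.nunique()     # distinct calendar years covered
is_one_per_timestamp = monthly_sales.index.is_unique

print(n_points, n_years, is_one_per_timestamp)
```

With so few rows, every derived predictor that needs some history (a lag, a rolling window) also costs a few usable observations, which is part of why the amount of data matters so much here.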

To create a predictive model for time series analysis, you of course need predictors. Predictors are the variables you use as inputs to your model, and they help explain the defined target variable. But how do you know which predictors are relevant to the data and which will help explain the target variable? This is where feature engineering steps in.

Feature engineering involves finding and creating predictors that help understand, explain, and predict the target variable of a time series model, or of any other type of model. It takes a lot of creativity as well as a great deal of domain expertise, and its goal is to come up with the right set of predictors for a model. Because data is scarce in this type of problem, data scientists need the skill of deriving the right set of variables. This is a constant process of trial and error and of thinking hard about which features, or derived variables, would work to predict the target variable. The mission of data scientists running a time series analysis is to get creative in selecting predictor variables, then build and test them to reduce the prediction error as much as possible.
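
As an illustration of what derived variables can look like, here is a minimal sketch, assuming pandas and NumPy. The helper `make_features`, its parameters, and the choice of lags, rolling mean, and calendar fields are assumptions made for this example, not features taken from Schmidt's talk.

```python
import numpy as np
import pandas as pd


def make_features(series: pd.Series, lags=(1, 2, 12), window=3) -> pd.DataFrame:
    """Build a table of derived predictors from a univariate time series."""
    features = pd.DataFrame({"target": series})
    # Lag features: the value observed k steps earlier.
    for k in lags:
        features[f"lag_{k}"] = series.shift(k)
    # Rolling mean of past values only; shift(1) keeps the current target out.
    features[f"rolling_mean_{window}"] = series.shift(1).rolling(window).mean()
    # Calendar features taken from the timestamp itself.
    features["month"] = series.index.month
    features["quarter"] = series.index.quarter
    # Rows without a full history contain NaN and are dropped.
    return features.dropna()


# Synthetic monthly series, for illustration only.
index = pd.date_range("2014-01-01", periods=48, freq="MS")
series = pd.Series(np.arange(48, dtype=float), index=index)

features = make_features(series)
print(features.head())
```

Note that the 12-month lag removes the first year of rows, so 48 observations shrink to 36 usable ones. Each candidate feature is a trade-off between added signal and lost history.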

To that end, Schmidt recommended several libraries in his talk that help with forecasting time series data. These libraries also offer plenty of advice on creating useful features to forecast better in future projects.

Recommended Libraries

 

StatsModels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests and statistical data exploration.

 

 

Prophet (fbprophet) is a procedure for forecasting time series data. It is based on an additive model where non-linear trends are fit with yearly and weekly seasonality, plus holidays. It works best with daily periodicity data with at least one year of historical data. Prophet is robust to missing data, shifts in the trend, and large outliers. It is an open source library developed at Facebook Research.

 

dmlc/xgboost is a “Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more.” It runs on a single machine, Hadoop, Spark, Flink and DataFlow.

Conclusion

Time series analysis is one of the hardest problems to solve in predictive analytics. Feature engineering plays an important role because it explores and creates useful features that improve prediction accuracy. More often than not, this is the type of problem in which creativity and feature engineering help more in building the model than the choice of algorithm does. Keep this in mind when creating a time series model. Try out Schmidt’s recommended libraries: all of them have good documentation and examples to follow. Give them a try and start analysing your time series data more effectively today!

Diego Arenas, ODSC

I've worked in BI, DWH, and Data Mining. MSc in Data Science. Experience with multiple BI and Data Science tools, always thinking about how to solve information needs and add value to organisations from the data available. Experience with Business Objects, Pentaho, Informatica Power Center, SSAS, SSIS, SSRS, MS SQL Server from 2000 to 2017, and other DBMS, Tableau, Hadoop, Python, R, SQL. Predictive modelling. My interests are in Information Systems, Data Modeling, Predictive and Descriptive Analysis, Machine Learning, Data Visualization, Open Data. Specialties: Data modeling, data warehousing, data mining, performance management, business intelligence.
