Tuesday, November 22, 2011

Time series cross-validation 2

In my previous post, I shared a function for parallel time-series cross-validation, based on Rob Hyndman's code.  I thought I'd expand on that example a little bit, and share some additional wrapper functions I wrote to test other forecasting algorithms.  Before you try this at home, be sure to load the cv.ts and tsSummary functions from my last post.

Monday, November 21, 2011

Functional and Parallel time series cross-validation

Rob Hyndman has a great post on his blog with an example of how to cross-validate a time series model.  The basic concept is simple:  you start with a minimum number of observations (k), and fit a model (e.g. an ARIMA model) to those observations.  You then forecast out to a certain horizon (h), and compare your forecasts to the actual values for that series.  You then add the next observation (k+1), and repeat the process until you run out of data.  This gives you a matrix of forecast accuracies at various horizons (1 step ahead, 2 steps ahead, all the way to h steps ahead), with one row per training set.  You then take the mean of each column of the matrix, and get the model's average accuracy at that horizon.  This method is analogous to leave-one-out cross-validation.
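
To make the loop concrete, here is a minimal sketch in base R.  It is not Hyndman's code: the names rolling_cv and naive_forecast are hypothetical, the naive "repeat the last value" forecast is only a stand-in for a real model, and I'm assuming the loop stops once a full h-step test set is no longer available.

```r
# Stand-in model: forecast every horizon with the last observed value.
# Replace with any function(train, h) that returns h point forecasts.
naive_forecast <- function(train, h) rep(train[length(train)], h)

# Growing-window time series cross-validation (illustrative sketch).
# x: series, k: minimum number of training observations, h: horizon.
rolling_cv <- function(x, k, h, forecast_fun = naive_forecast) {
  x <- as.numeric(x)
  stopifnot(k >= 1, k <= length(x) - h)
  origins <- k:(length(x) - h)          # last index of each training set
  errors <- matrix(NA_real_, nrow = length(origins), ncol = h,
                   dimnames = list(NULL, paste0("h", 1:h)))
  for (i in seq_along(origins)) {
    end    <- origins[i]
    train  <- x[1:end]                  # training set grows by 1 each step
    actual <- x[(end + 1):(end + h)]    # the next h observations
    errors[i, ] <- abs(forecast_fun(train, h) - actual)
  }
  errors                                # rows: training sets, cols: horizons
}

# One row per training set, one column per horizon
errs <- rolling_cv(AirPassengers, k = 48, h = 12)

# Column means give the mean absolute error at each horizon
mae_by_horizon <- colMeans(errs)
```

Printing mae_by_horizon shows how the error changes as the horizon moves from 1 to 12 steps ahead.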

There are two variations on this method:
1. Use a fixed training window.  In this case, when you add an observation to your "training" series, you drop the first observation, keeping the length of the training window fixed.
2. Increment by n at each step, rather than 1.  This is analogous to k-fold cross-validation.  In this case, each horizon's error is averaged over fewer forecasts, so it is less stable, and it's a good idea to average error across ALL horizons when evaluating the model.
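
Both variations only change which observations go into each training set, so they can be sketched as a function that returns the train/test indices for every fold.  The function name cv_windows and its argument names are hypothetical, and as an assumption it only keeps folds that have a full h-step test set.

```r
# Index sets for each cross-validation fold (illustrative sketch).
# n: series length, min_obs: k, h: horizon, step_size: n in the text,
# fixed_window: TRUE for variation 1, FALSE for a growing window.
cv_windows <- function(n, min_obs, h, step_size = 1, fixed_window = FALSE) {
  stopifnot(min_obs >= 1, step_size >= 1, min_obs <= n - h)
  ends <- seq(min_obs, n - h, by = step_size)   # last training index per fold
  lapply(ends, function(end) {
    # Fixed window: drop old observations so the length stays at min_obs
    start <- if (fixed_window) end - min_obs + 1 else 1
    list(train = start:end, test = (end + 1):(end + h))
  })
}

# Variation 1: fixed window of 10 observations, sliding by 1
fixed <- cv_windows(n = 20, min_obs = 10, h = 3, fixed_window = TRUE)
range(fixed[[1]]$train)   # 1 10
range(fixed[[2]]$train)   # 2 11

# Variation 2: growing window, incrementing by 5 observations per fold
stepped <- cv_windows(n = 20, min_obs = 10, h = 3, step_size = 5)
length(stepped)           # 2 folds, with training sets 1:10 and 1:15
```

With step_size = 5 there are only 2 folds instead of 8, which is why the per-horizon averages become noisier.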

This technique is very useful, because it allows you to define a horizon of interest (say 1 month or 12 months), and then assess how well your model performs at that horizon.  Furthermore, you can use this data to compare various models, including different types of models, such as linear models vs. ARIMA models vs. exponential smoothing models.

However, time series cross-validation is very time consuming, particularly for ARIMA and exponential smoothing models.  Therefore, I thought it would be a good idea to parallelize Hyndman's algorithm, using the foreach package in R.  Furthermore, I wrapped the entire thing into a single function, which allows you to easily change the type of cross-validation by altering the minObs (k), stepSize (n), and fixed-length or growing window parameters.  My function takes an argument tsControl, which contains each of these parameters, as well as a summary function to calculate your error metric (such as MAE).  I've structured it similarly to the caret package's train function.
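
The folds are independent of one another, which is what makes the algorithm easy to parallelize.  The sketch below illustrates the idea only and is not the function described above: to stay dependency-free it uses mclapply from base R's parallel package rather than foreach, the control list and its field names are hypothetical, and the mean forecast is a stand-in model.  It returns one error value per fold, averaged across all horizons.

```r
library(parallel)  # ships with base R; swapped in here instead of foreach

# Hypothetical control list: window settings plus a summary function (MAE)
ctrl <- list(min_obs = 24, step_size = 1, fixed_window = FALSE,
             summary_fun = function(actual, pred) mean(abs(actual - pred)))

# Stand-in model: forecast every horizon with the training mean
mean_forecast <- function(train, h) rep(mean(train), h)

parallel_cv <- function(x, h, ctrl, forecast_fun = mean_forecast, cores = 1L) {
  x <- as.numeric(x)
  stopifnot(ctrl$min_obs <= length(x) - h)
  ends <- seq(ctrl$min_obs, length(x) - h, by = ctrl$step_size)
  one_fold <- function(end) {
    start <- if (ctrl$fixed_window) end - ctrl$min_obs + 1 else 1
    pred  <- forecast_fun(x[start:end], h)
    ctrl$summary_fun(x[(end + 1):(end + h)], pred)
  }
  # Each fold depends only on x, so folds can run on separate cores
  unlist(mclapply(ends, one_fold, mc.cores = cores))
}

# mclapply forks processes, which Windows does not support, so use 1 core there
cores <- if (.Platform$OS.type == "windows") 1L else 2L
fold_mae <- parallel_cv(AirPassengers, h = 12, ctrl = ctrl, cores = cores)
mean(fold_mae)   # overall MAE across all folds and horizons
```

Changing ctrl$fixed_window or ctrl$step_size switches between the variations without touching the loop itself.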
