This post originally appeared on the ModelOp Tech Blog

If I could predict the future, some aspects of business would become much easier. For instance, if I owned a store and could somehow divine what my next week’s sales would look like, I could make sure that my inventory was perfectly matched to the coming demand. Similarly, if I could predict the movements of the stock market, I could use this information to make advantageous trades. Sadly, I can’t see into the future. So the natural question becomes: what are the best predictions I can make about the future given the information that is available to me today?

This is where the statistical field of forecasting comes in. In its most basic form, the goal is this: given a time-indexed list of numbers (for instance, a stock ticker or the weekly sale numbers of a product), what are the best estimates I can make of future values of the series (e.g. the stock’s next price or tomorrow’s sales)? Note that for the purposes of this blog post, we will be dealing only with the time series itself - no other data will be used. One might hope that if I knew, say, the rate of inflation I might be better able to predict shifts in the stock market. But that is a topic for another post.

Since it’s getting cold in Chicago and I’m dreaming of warmer days, lets take our time series to be the historical high temperature at Chicago’s O’Hare Airport. We’ll try to build a model that can predict the average high temperature next month from this historical data.

Data Preliminaries

We get our data from the National Centers for Environmental information. It consists of the daily high temperatures at O’Hare airport from 1991 to the present, averaged by month. We then create an R ts() object containing just this information. It is plotted in Figure 1 below.

Figure 1

Before we even begin trying to model anything, let’s make a few observations. First, we can see that all of these numbers lie between (say) 20 and 95 degrees Fahrenheit. So an ideal model for this data would have outputs roughly in these ranges. If our model is regularly predicting 110 or -15 degrees for the monthly average high temperature, we would already be able to say that it’s not a very good model. This may seem obvious, but it should be thought of as a toy example of how past data can tell us something about how we should forecast.

Second, we can see that there is a cyclical nature to the data - highs are lower in the winters and higher in the summers. This is a phenomenon known (for obvious reasons) as seasonality in time-series modeling. An ideal model would take this into account. The effect of seasonality is extremely pronounced here (compare to, say, a stock price where the effect of the time of year presumably does not show up nearly as strongly). Finally, there doesn’t seem to be any particularly strong trend in these numbers - the ranges remain roughly constant throughout the time period we’re looking at.

One thing that we can look at with time series data is something called the autocorrelation - roughly, how much are the values of the time series correlated with the surrounding values. The plot of our autocorrelation function is included below as Figure 2. As we can see, the value for a month is highly correlated with the same month in different years, slightly less correlated with the surrounding months, and has negative correlation roughly 6 months ahead. This tells us that this series is strongly seasonal.

Figure 2

Baselines for Forecasting

Alright, we’ve explored the data a bit - let’s try to build a model. First, we want to get some baselines for our predictions. These are usually naive prediction methods that we will compare our future, fancier models against - if we can’t beat the baselines with our fancy model, then the fancy model probably isn’t worth much.

First, what’s the most naive prediciton method we could use? Well, we could just always predict the last value we saw. In fact, this is known as the naive method. The issue with this is that it will consistently over/under estimate in this case - basically, we’re not using the seasonality of the data in any way. Another option that doesn’t really use the seasonality in any way is to take the mean over some prior period (say, 3 months) and use that as the prediction. We might expect that this does slightly better than the naive method, but again we’re not really incorporating the seasonality. The final baseline we’ll look at is the seasonal naive method - namely, predict the value of the series from 12 months ago as the prediction. We might expect that this does pretty well with this time series.

Let’s see how each of these does for 2018.

actual <- window(avgts,2018)
naive <- window((stats::lag(avgts,-1)),2018,c(2018,12))
mean <- window(rollapply(avgts, 3, "mean", align = "right", partial = T, fill = NA),2018)
seasonal <- window(stats::lag(avgts,-12),2018,c(2018,12))
actual

##           Jan      Feb      Mar      Apr      May      Jun      Jul
## 2018 31.87097 38.00000 44.70968 50.03333 76.35484 80.43333 85.12903
##           Aug      Sep      Oct      Nov      Dec
## 2018 84.80645 77.50000 60.67742 39.73333 38.25000

mean

##           Jan      Feb      Mar      Apr      May      Jun      Jul
## 2018 37.66595 34.68817 38.19355 44.24767 57.03262 68.94050 80.63907
##           Aug      Sep      Oct      Nov      Dec
## 2018 83.45627 82.47849 74.32796 59.30358 46.22025

naive

##           Jan      Feb      Mar      Apr      May      Jun      Jul
## 2018 34.19355 31.87097 38.00000 44.70968 50.03333 76.35484 80.43333
##           Aug      Sep      Oct      Nov      Dec
## 2018 85.12903 84.80645 77.50000 60.67742 39.73333

seasonal

##           Jan      Feb      Mar      Apr      May      Jun      Jul
## 2018 34.58065 47.67857 47.12903 63.76667 67.70968 83.26667 83.80645
##           Aug      Sep      Oct      Nov      Dec
## 2018 80.74194 79.73333 65.45161 46.93333 34.19355

For our evaluation metric, lets look at mean absolute error (MAE) - the average of the absoulte values of the errors of the predictions.

naiveMAE <- mean(abs(actual-naive))
meanMAE <- mean(abs(actual-mean))
seasonalMAE <- mean(abs(actual-seasonal))
naiveMAE

## [1] 8.538306

meanMAE

## [1] 8.686111

seasonalMAE

## [1] 5.305876

As expected, the seasonal naive method outperforms the other two. However, lets see if we can do better.

A Brief Intro to ARIMA Models

Everything we’ve done so far is fairly straightforward. A much fancier method of time series forecasting is the ARIMA method. Such a model consists of several parts - first the autoregressive (AR) piece takes weighted sums of some number of previous observations (we can think of this as a much more general version of the naive method we looked at above). The next piece is the integrated (I) piece, which takes difference of the series to force it to have nicer statistical properties. Finally, there’s the moving average (MA) piece, which takes weighted sums of white-noise errors of previous terms. If we combine these all together (plus seasonal analogues of these), we get a general ARIMA model.

This might seem like a lot to deal with, but luckily the forecast package in R has a function that computes such a model for you.

train <- window(avgts, end=c(2017,12))
trainmodel <- auto.arima(train)
trainmodel

## Series: train
## ARIMA(2,0,2)(1,1,0)[12] with drift
##
## Coefficients:
##           ar1     ar2     ma1      ma2     sar1   drift
##       -0.0746  0.7579  0.3338  -0.6444  -0.4218  0.0024
## s.e.   0.0828  0.0689  0.0992   0.0944   0.0524  0.0345
##
## sigma^2 estimated as 23.15:  log likelihood=-931.66
## AIC=1877.31   AICc=1877.68   BIC=1903.51

Let’s break down what this means: the (2,0,2) means the fit model has 2 AR terms, 0 differences (the I part of the model), and 2 MA terms. The (1,1,0) piece corresponds to the seasonal ARIMA terms - one AR term, and one difference. The [12] indicates the seasonality - since there are 12 months in a year, this is the natural choice. Finally, the model incorporates slight drift - a trend in the data. The coefficients and standard errors for each pieces are then listed, as well as some statistics about the fit of the model.

Let’s see how this model does at predicting for 2018:

fullmodel <- Arima(avgts,model=trainmodel)
arimaforecast <- tail(fullmodel$fitted,12)
arimaforecast

##           Jan      Feb      Mar      Apr      May      Jun      Jul
## 2018 33.43440 41.00933 48.22833 58.20269 65.85530 82.51868 82.58382
##           Aug      Sep      Oct      Nov      Dec
## 2018 82.65571 78.56760 65.93328 48.73150 30.35375

arimaMAE <- mean(abs(actual-arimaforecast))
arimaMAE

## [1] 4.729957

So this model outperforms the seasonal naive method, but not by much. There’s definitely a tradeoff between the simplicity of the model and making marginal gains that is problem dependent.

That does it for this brief intro to forecasting. It is a topic that certainly will come up again in future blog posts. If you want to learn more about what was covered here (plus a plethora of other topics in this space), a good reference is Forecasting: Principles and Practices by Rob J Hyndman and George Athanasopoulos, which is available freely online.