Jul 30 2021

# Sales Forecasts – Part 2. More About Evaluation

The lively response to last week’s post on this topic prompted me to dig deeper. First, I take a shot at clarifying the distinction between point forecasts and probability forecasts. Second, I present the idea behind the accuracy metric for probability forecasts that Stefan de Kok recommends as an alternative to the WSPL. Finally, I summarize a few points raised in discussions on LinkedIn and in this blog.

All of this is about *evaluating* forecasts. We still need methods to *generate* them. There are many well-known, published methods for *point* forecasts but not for *probability* forecasts, particularly for sales. This is a topic for another post.


## Point Forecasts Versus Probability Forecasts

In the terminology of the M5 Competition, a *point forecast* provides a *single value* \hat{y}_j for each y_j in the testing set, with j=1,...,h. It is based on the historical values y_1,..., y_n in the training set and external parameters. A *probability forecast* provides a *probability distribution* for each point in the testing set.

If the weather forecast on the radio is “it will rain today,” it is a point forecast; if it says “there is a 50% chance that it will rain today,” it is a probability forecast for the variable with values “true” if it does rain today and “false” otherwise.

Making the point forecast is taking responsibility for telling listeners to pack an umbrella or a raincoat. The probability forecast, by contrast, gives listeners presumably objective information, based on which they make their own decisions.

In Manufacturing, you place an order for a specific quantity of parts, not a probability distribution. With the *probability* forecast, however, the materials manager can choose the risk to take.

With a *point* forecast, the forecasters make this decision, which really shouldn’t be theirs. Their job should be limited to providing the best possible forecast, which is enough of a challenge.

In a more general context, the differences between the two are illustrated as follows:

### Errors in Point versus Probability Forecasts

Point forecasts are usually estimates of the expected value, mode, or median, conditional on the data, and the differences y_j - \hat{y}_j are considered to be forecast *errors*, because \hat{y}_j is presented as *the* forecast for period j.

With a probability forecast, at a given point in the testing set, only an actual value outside the range of the distribution would indicate an error. The distribution may be too broad to be of use in planning production, but all that means is that the evidence from history and external parameters is insufficient for the forecasting method to produce a tighter distribution. *Quantile* forecasts are probability forecasts because they give you the cumulative distribution function of the distribution, and therefore the distribution.

### Point Forecasts and Probability Forecasts

De Kok calls the point forecasts “deterministic,” which I don’t think is fair, given that they are based on the *probabilistic* theory of time series. Newtonian mechanics is deterministic; autoregressive or moving average models are not. “Point forecast” is more descriptive, because the output is a point, even when accompanied by a confidence interval. “Probability forecast” is descriptive but needs clarification because it sounds as if it encompasses point forecasts when it is meant to exclude them.

The problem with point forecasts is their interpretation as a *prediction* when they are, in fact, estimates of the *mean*, *median*, or *mode* of a future value. Even if they were exact, actual values would vary around them. Unless the sales process is deterministic, actual values almost never exactly match any first-order moment, and it’s less than fully logical to consider the differences to be *errors*.

### The Challenge of Evaluating a Probability Forecast

A probability forecast assigns a distribution to each point in the testing set, and these distributions can all be different. How do we know a distribution is an appropriate model based on a single point? Obviously, it’s not possible. The same algorithm, however, is used for all the points in the testing set, and we can extract a characteristic of the relationship between each point in the testing set and its forecast distribution that we can aggregate over the whole set to evaluate the *algorithm*. Although the name does not suggest it, Stefan de Kok pulled off this trick with the Total Percentile Error (TPE).

## The Total Percentile Error (TPE)

The WSPL measures forecasting performance for a family of products and a level \tau. To get a complete picture, you would have to calculate it for multiple values, for example for \tau = 5%, 25%, 50%, 75%, 95%. The TPE evaluates the fit of all the distributions for the testing set in one go.

## Reducing the Testing Set to Quantile Bin Counts

While TPE calculations involve complicated details, the key idea is simple. If P_j is the forecast probability distribution for point Y_j, F_j its cumulative distribution function (CDF), and q_\tau\left ( j \right ) the \tau-quantile of Y_j, then, for any \tau \in \left [ 0, 100\% \right ], by definition:

F_j\left ( q_\tau\left ( j \right ) \right ) = P_j\left ( Y_j \leq q_\tau\left ( j \right )\right ) = \tau

If we choose a sequence of K+1 equally spaced values \tau_k for k = 0,...,K, then Y_j will have equal probabilities of falling in each interval [q_{\tau_k}\left ( j \right ), q_{\tau_{k+1}}\left ( j \right )). For K=10, for example, the \tau_k are 0\%, 10\%,..., 90\%, 100\%. This means a 10% probability of being below the 10% quantile, …, between the 40% and the 50% quantile, and… above the 90% quantile. This holds regardless of F_j.
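This bin assignment can be sketched in a few lines. The gaussian CDF below is only a stand-in for whatever forecast distribution P_j the method produces; the names are mine, not de Kok’s:

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of a gaussian, standing in for a forecast CDF F_j."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def bin_index(y, cdf, K=10):
    """Index k of the interval [q_{tau_k}, q_{tau_{k+1}}) containing y.
    Since tau_k = k/K, this is simply floor(K * F(y))."""
    return min(int(K * cdf(y)), K - 1)  # clamp F(y) = 1 into the last bin

# An actual value equal to the median of a N(10, 2) forecast distribution
# has F(y) = 0.5 and lands in bin 5 (bins are numbered 0 to 9)
k = bin_index(10.0, lambda y: normal_cdf(y, 10.0, 2.0), K=10)  # k == 5
```

The clamp in `bin_index` only matters for the edge case F(y) = 1; everywhere else, floor(K · F(y)) is the bin number directly.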

The following picture explains the procedure:

We can assign a sequence number to each interval as a “bin number,” place each actual value y_j, j = 1,...,h, from the testing set in its bin, and count the number c_k of values in each bin.

The differences between the distributions P_j don’t matter: the bin counts \left (C_1,...,C_K \right ) should follow the multinomial distribution with equal probabilities p_k = 1/K for k = 1,...,K, because each point’s bin number follows the *discrete uniform distribution.*
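A quick simulation illustrates this. Here the per-point forecast distributions are hypothetical gaussians that all differ, and the actuals are drawn from those same distributions, so the forecasts are “right” by construction and the bin counts should cluster around h/K:

```python
import random
from statistics import NormalDist

random.seed(42)
K, h = 10, 2000
counts = [0] * K

for j in range(h):
    mu, sigma = 50 + j % 7, 5 + j % 3    # a different forecast per point
    dist = NormalDist(mu, sigma)
    y = random.gauss(mu, sigma)          # actual drawn from the forecast dist.
    k = min(int(K * dist.cdf(y)), K - 1) # quantile bin of the actual value
    counts[k] += 1

# With well-matched forecasts, every count hovers near h / K = 200
```

Rerunning with actuals drawn from a *different* distribution than the forecast (say, a shifted mean) would pile the counts into the extreme bins, which is exactly the discrepancy the TPE measures.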

## Evaluating the Fit of Probability Forecasts

We look for discrepancies between the observed bin counts \left (c_1,…,c_K \right ) and the discrete uniform distribution.

De Kok uses mean absolute deviation as a distance between the \left (c_1,…,c_K \right ) and their expected values \left (h/K,...,h/K \right ) if the forecast distributions match:

D_0 = \sum_{k=1}^{K}\left | c_k - \frac{h}{K} \right |

and normalizes it to remove the influence of the size h of the testing set.
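A minimal sketch of D_0 follows. The normalization shown is an assumption on my part: the source only says the influence of h is removed, and dividing by the maximum possible value, 2h(K-1)/K, is one plausible way to do it:

```python
def d0(counts):
    """Mean-absolute-deviation distance between observed bin counts
    and their uniform expectation h / K."""
    h, K = sum(counts), len(counts)
    return sum(abs(c - h / K) for c in counts)

# Hypothetical normalization: the worst case (all h values in one bin)
# gives D_0 = 2 * h * (K - 1) / K, so dividing by it maps D_0 into [0, 1].
def d0_normalized(counts):
    h, K = sum(counts), len(counts)
    return d0(counts) / (2 * h * (K - 1) / K)
```

A perfectly uniform count vector gives 0; all values falling in a single bin gives 1 under this normalization.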

The literature on the multinomial distribution recommends using the following instead:

D_1^2 = \frac{K}{h}\sum_{k=1}^{K}\left ( c_k - \frac{h}{K} \right )^2

on the grounds that, as h rises, the distribution of D_1^2 approaches a \chi^2 with K-1 degrees of freedom. The authors don’t usually bother to prove it, and I won’t try either, but we can still think about why it is true.

Except for the constraint that C_1 +...+C_K = h, each C_k resembles a Poisson variable of mean h/K, which, as soon as h/K \geq 20, is approximately gaussian, with mean h/K and standard deviation \sqrt{h/K}. This makes D_1^2 approximately a sum of the squares of K centered gaussian variables with unit variance, which, by definition, would be a \chi^2 with K degrees of freedom. The constraint then drops the number of degrees of freedom by 1.
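The statistic and the test are easy to sketch. The counts below are made-up illustration data, and 16.92 is the standard-table 95% critical value for a \chi^2 with K-1 = 9 degrees of freedom:

```python
def d1_squared(counts):
    """Chi-square goodness-of-fit statistic against equal bin probabilities."""
    h, K = sum(counts), len(counts)
    return (K / h) * sum((c - h / K) ** 2 for c in counts)

# Illustrative bin counts, close to the uniform expectation h / K = 200
counts = [212, 195, 188, 204, 199, 210, 190, 206, 194, 202]
stat = d1_squared(counts)   # = 3.03 here
CHI2_95_DF9 = 16.92         # 95% critical value, K - 1 = 9 degrees of freedom
fits = stat < CHI2_95_DF9   # True: no evidence against the forecast distributions
```

For method comparison rather than hypothesis testing, you would simply compute `d1_squared` on the same testing set for each candidate method and prefer the smallest value.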

The classical statistics approach is to use the \chi^2 statistic to test the goodness-of-fit of a model and reject it if it exceeds thresholds associated with given significance levels. We can also use it to compare forecasting methods and, among competing methods, select the one with the lowest D_1^2.

## Discussion

The previous post elicited 17 comments on LinkedIn and 10 on the blog. Here is a summary of my responses.

### Visibility of Probability Forecasting Methods

Sales forecasting is all about helping managers in operations make better decisions. It’s *applied* math, and its methods are justified by empirical success, not by theory. In fact, the internal logic of forecasting tools is often inaccessible to their users, either because they are proprietary to software suppliers or because the algorithms are intrinsically unable to justify their results, as in, for example, “deep learning.”

Even when the procedures are visible, as is the case for the winners of the M5 Competition, their mathematical foundation may be less than compelling. Some successful time-series tools like the AIC or exponentially-weighted moving averages don’t have much of a theoretical basis.

This context gives vital importance to the metrics used to assess the accuracy and precision of forecasts, which must be understandable to forecasters. De Kok sells a forecasting tool that is a black box to its users, and this is how he describes the way he gains acceptance for it:

“The approach I’ve used is a head-to-head bake-off to make the case initially. Then when the customer accepts that it was not even remotely close, go forward with implementation. Once delivered, allow anyone to override the forecast with their preferred alternative and measure the impact. Within 3 months not a single user is overriding anything anymore. Guaranteed. Because the metrics show they are consistently making it worse, dramatically so. So rather than me having to keep on proving what I did initially I let them try to disprove it themselves. Works like a charm.”

What users need to know about any method, however, is its underlying assumptions. If they don’t match the actual situation, then the method should not be used.

### Absolute Deviations Versus Squares

When asked about why he uses D_0 rather than D_1, de Kok responds that absolute values are related to medians while squares are related to means or averages. He prefers medians because they are less sensitive to outliers than means. When Bill Gates walks into a bar, he sends the patrons’ average income off the charts but the median barely budges. One problem here is that this discussion is about the K-dimensional vector \left (C_1,...,C_K \right ). While the multivariate mean is simply the vector of the means of the coordinates, there is no consensus on what a multivariate median might be.

This analysis is all about how far the bin counts \left (c_1,…,c_K \right ) derived from the testing set are from their expected values \left (h/K,...,h/K \right ) based on the forecast distributions. The relevance of the mean versus median distinction in this context is not obvious.

### Equal Probability Bins Versus σ-Intervals

The above discussion is limited to equal probability bins with equal weights and simple counts of forecasts in each bin, for simplicity. De Kok discusses, for example, using bins with unequal probabilities corresponding to the intervals \left [ n\sigma, \left ( n+1 \right )\sigma \right ] of a gaussian. It is a different way to assess the fit of the forecast distributions but it does not place any constraint on their shapes.

D_0 or D_1 may then be modified to give more weight to some intervals than others, and, rather than just tallying the number of forecasts in each bin, you may want to aggregate a measure of volume.
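As an illustration of unequal-probability bins under a gaussian assumption, the σ-interval probabilities and a weighted variant of D_0 might look like this. The weighting scheme is hypothetical; de Kok does not specify one:

```python
from math import inf
from statistics import NormalDist

z = NormalDist()  # standard gaussian
edges = [-inf, -2, -1, 0, 1, 2, inf]  # sigma-interval boundaries
probs = [z.cdf(b) - z.cdf(a) for a, b in zip(edges, edges[1:])]
# probs ~ [0.023, 0.136, 0.341, 0.341, 0.136, 0.023] -- unequal by design

def weighted_d0(counts, probs, weights=None):
    """D_0 generalized to unequal bin probabilities, with optional
    per-bin weights (a hypothetical extension of the equal-bin version)."""
    h = sum(counts)
    weights = weights or [1.0] * len(counts)
    return sum(w * abs(c - h * p) for w, c, p in zip(weights, counts, probs))
```

With all weights equal to 1 and probabilities equal to 1/K, this reduces to the D_0 defined above.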

### “Overfitting a metric”

De Kok discusses “overfitting metrics.” “Overfitting” usually means using a model with too many parameters, one that fits the training data closely but has no predictive value on the testing set. A metric is a tool you use to assess the fit, so what does it mean to overfit a *metric*?

### Probability Forecasting Metrics and Business Performance

De Kok’s concern is the evaluation of forecasting methods with metrics that don’t relate to business performance. As discussed in the previous post, the assessment of a forecast should measure the losses its errors generate. Such an assessment, however, would be application-specific and difficult to do. This is why forecasters rely on simpler and more abstract metrics. It leaves open the question of the relationship between these metrics and stock-outs, rush orders, long lead times, etc.

### Probability Forecasts for Weather on the Radio

The weather reports on the radio are a simple case illustrating this point. Nate Silver reports that the relative frequency of rainy days is *lower* than the forecast probability of rain broadcast on the radio: it doesn’t rain on 25% of the days for which the weather report announces a 25% probability of rain.

The loss caused to the station by inaccurate forecasts is in terms of listener complaints and it is asymmetric: fewer city listeners complain when the weather was fair on a day predicted to be rainy than the other way around. They can’t announce rain every day because they would lose credibility but they can reduce complaints by boosting the probability of rain.

There are, in fact, two different functions here: generating a forecast and putting it on the air. The forecaster’s job is to generate probability forecasts for rain that match its relative frequency in the circumstances. The announcer’s job is to communicate with listeners. The forecaster doesn’t need to have a voice for radio and the announcer doesn’t need to understand the weather. The forecaster is, or should be, rated on accuracy; the announcer, on rapport with listeners.

The radio weather report case is conceptually troubling but simple; the effects of a forecast on a manufacturing supply chain, more complex.

## References

- De Kok, S. (2015) *Measure Total Percentile Error to See the Big Picture of Forecast Accuracy,* LinkedIn Pulse
- Small, C. G. (1990) *A survey of multidimensional medians,* International Statistical Review, 58, 263–277
- Silver, N. (2012) *The Signal and the Noise: Why So Many Predictions Fail-but Some Don’t,* Penguin Publishing Group, ISBN: 9781101595954

#probabilityforecast, #salesforecast, #tpe, #totalpercentileerror

Muhammad Afzal

July 31, 2021 @ 9:07 am

Very insightful and practical!

Stefan de Kok

August 3, 2021 @ 7:56 pm

Michel, Great overview!

I appreciate the evaluation and fresh perspective on TPE. Frankly, it had not even occurred to me at the time to test a squared error variation. I must have tested over a 100 metrics, many experimental, with all kinds of variations. Right now I do not have a live case to test it on, but on my next one I will surely try a comparison.

Regarding your statement on forecasters using simpler metrics because the impact on business, like stock-outs, rush orders, are application-specific is in my experience just a historical artifact. A metric like TPE measures pure accuracy, and a metric like Mean Inter-Percentile Range measures pure precision. In my benchmarks, accuracy (when pure) correlates strongly to stability encountered as stock-outs, lost sales, rush orders etc. Whilst precision correlates strongly with efficiency, such as inventory buffers and manufacturing capacity. So yes, there is some application-specific part to it. Companies with bloated inventory should put more emphasis on improving precision. Whilst companies with service issues should focus more on accuracy. You should be able to determine an appropriate weighting between those two per application.