Evaluating Sales Forecasts

When sizing a new factory or production line, or when setting work hours for the next three months, most manufacturers have no choice but to rely on sales forecasts as a basis for decisions.

But how far can you trust sales forecasts? You use a training set of data to fit a particular model and a testing set of actual data observed over a time horizon of interest following the end of the training set period. The training set may, for example, cover 5 years of data about product sales up to June 30, 2021, and the testing set the actual sales in July, 2021. 

The forecasters’ first concern is to establish how well a method works on the testing set so that the decision makers can rely on it for the future. For this, they need metrics that reflect end results and that end-users of forecasts can understand. You cannot assume that they are up to speed or interested in forecasting technology.

Forecasters also need to compare the performance of different algorithms and to monitor the progress of an algorithm as it “learns,” and only they need to understand the metrics they use for this purpose. 


Sales Forecasting in the Literature

The most current source is the M5 forecasting competition of 2020. The challenge was to forecast Walmart’s sales of 3,049 products in 10 stores for 28 days, based on a 5.5-year history of daily sales supplemented by external variables including, for example, special days, weather events, and promotions. Out of ~20,000 candidates, the winning entry was based on LightGBM, “a gradient boosting framework that uses tree based learning algorithms.”

Otherwise, a literature search for “sales forecasts” yields few titles on this exact subject, and many more on forecasting in general. They cover topics like call volumes at a call center, the weather, and securities prices. Sales in general, and in particular the demand for manufactured goods, are not covered extensively. The message of the few authors who cover this subject is either discouraging or, compared to M5, outdated.

Writing in 2002, Tom Wallace tells you not to focus on sales forecast accuracy, presumably because it’s a hopeless pursuit. The other references, up to 2019, cover 1970s vintage methods and some accuracy metrics but not methods to predict quantiles or prediction intervals directly from the data.

In their defense, more recent developments failed to outperform these methods until the M4 competition in 2018. They do now, as M5 confirmed. 

Forecast Performance for End-Users

For each element of the testing set, the residual is the difference between the actual value y and the forecast \bar{y} and the ratio 

 s = \frac{\left |y-\bar{y}\right |}{y}


expressed as a percentage, is a measure of the forecast’s accuracy that everyone understands, as in, “the sales forecast is within ±5% of the actual.” 

By specifying a list of thresholds, like 5%, 10%, 15%, 20%, 30%, you can tally the number of forecasts in the testing set that are within ±5%, ±10%, ±15%, ±20%, and ±30% of the actual, as a proportion of the size of the testing set. This gives you a “Proportional Estimation Accuracy Table” (PEAT).
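
As a minimal sketch of the tally described above, the following Python function builds such a table from lists of actuals and forecasts. The function name `peat`, the sample numbers, and the choice to skip data points with zero actual sales (where the ratio s is undefined) are assumptions made for illustration, not part of any standard library.

```python
def peat(actuals, forecasts, thresholds=(0.05, 0.10, 0.15, 0.20, 0.30)):
    """Return {threshold: share of forecasts within ±threshold of the actual}."""
    # Relative absolute error s = |y - forecast| / y for each data point.
    # Assumption: points with zero actual sales are skipped, since s is undefined.
    errors = [abs(y - f) / y for y, f in zip(actuals, forecasts) if y != 0]
    if not errors:
        raise ValueError("no data points with nonzero actuals")
    return {t: sum(e <= t for e in errors) / len(errors) for t in thresholds}


# Hypothetical testing set of four data points.
actuals = [100, 200, 50, 80]
forecasts = [104, 172, 56, 81]

table = peat(actuals, forecasts)
for threshold, share in table.items():
    print(f"within ±{threshold:.0%}: {share:.0%} of forecasts")
```

With these made-up numbers, half of the forecasts fall within ±5% and all of them within ±15%. Calling the same function on subsets of the testing set gives one table per slice.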

It usually makes sense to slice and dice the testing set in a variety of ways.  Each data point in the testing set is for a given product on a given day. 

As you can expect sales forecasts to be more accurate 1 day than 30 days ahead, you can generate a different table for each day. As you can also expect forecasts for Runners to be more accurate than for Repeaters or Strangers, you can generate a table for each category, and so on.

Without knowing anything about algorithms, the production control manager will know what to do with the information that sales forecasts for Runners are within  ±5% of actuals one week ahead 90% of the time and that this goes down to ±15% when you look ahead one month. The same manager may also conclude that the forecasts for Strangers are useless. 

Sales Forecast Performance for Forecasters

In 2021, practicing forecasters do not need to code their own algorithms or explain the underlying math. They buy and use them as a service, or download them. They need to understand them as a driver understands a car but not as a mechanic or a designer. This understanding includes purpose, input requirements, and outputs but not the math. 

This use calls for more summarized and more abstract metrics of performance, which may not be of interest to the end-users. 

Every forecast error leads to wrong decisions, and therefore to losses. An overall assessment of a forecasting tool can therefore be described as some measure of the losses errors generate.

A complete assessment of these economic losses would be complex and not necessarily mathematically tractable. For these reasons, much simpler loss functions are actually used.

The Metrics Used In The M5 Forecasting Competition

We describe here step by step the construction of the two main metrics used in M5. Once unpacked, they are easier to understand than their names might suggest:

  1. The “Weighted Root Mean Square Scaled Error” (WRMSSE) of sales forecasting accuracy for a family of products. 
  2. The “Weighted Scaled Pinball Loss” (WSPL) for the bracketing of the values in terms of quantiles and confidence intervals. 

Among candidates, the algorithm that performs best in terms of WRMSSE does not necessarily excel in WSPL, and vice versa. 

Open-source Python and R software is available to calculate both WRMSSE and WSPL, and the following explanations are about their meaning. They are not intended as guides to coding. 

The “Weighted Root Mean Square Scaled Error” (WRMSSE) 

The WRMSSE is built on the quadratic loss function, which is implicitly used whenever a model is fitted using least squares. As it starts out flat at 0, it inflicts small losses for small errors, rising quadratically with the errors. It is usually suitable in sales forecasting for manufactured products, where small errors are relatively harmless but large ones crippling. 

By contrast, sometimes you need to hit a target and your loss is the same whether you miss by an inch or a mile. In those cases, you use the 0-1 loss function and maximum likelihood estimators.  In quality, classical tolerancing classifies a part as within specs or out of specs, which means it uses the 0-1 loss function. Taguchi replaced it with the quadratic loss function. 

The 0-1 loss function versus quadratic loss function (Source Phadke)


The Root Mean Square Error (RMSE)

The most common summary of the forecast errors is the mean of the squared errors. If the testing set has h points, the actual values are y_1,...,y_h and the forecasts  \bar{y}_1,...,\bar{y}_h, then the Mean Squared Error (MSE) is

L^2= \frac{1}{h}\times \sum_{i=1}^{h} \left ( y_i - \bar{y}_i \right )^2


The summary is usually expressed as L, the Root Mean Squared Error (RMSE), to match the dimension of the y_i.

For the same reason, you use the standard deviation \sigma rather than the variance \sigma^2 of a random variable: the interval \mu \pm n\sigma around the mean \mu makes sense. 
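
The arithmetic can be sketched in a few lines of Python. The function name `rmse` and the sample values are illustrative assumptions.

```python
import math


def rmse(actuals, forecasts):
    """Root Mean Squared Error L over a testing set of h points."""
    h = len(actuals)
    # Mean of the squared residuals (y_i - forecast_i)^2, i.e. L^2.
    mse = sum((y - f) ** 2 for y, f in zip(actuals, forecasts)) / h
    return math.sqrt(mse)


# Hypothetical testing set: the residuals are -1, 2, 0, -2.
actuals = [10, 12, 9, 11]
forecasts = [11, 10, 9, 13]

# L^2 = (1 + 4 + 0 + 4) / 4 = 2.25, so L = 1.5, in the same units as the y_i.
print(rmse(actuals, forecasts))
```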

The Root Mean Square Scaled Error (RMSSE)

The RMSE, however, is affected by the scale of the variability of the y_i, not just the quality of the forecast. To eliminate this, you compare  L instead to the RMSE of the naive one-step forecast  \bar{y}_{i+1} = y_i in which the demand for period  i+1 is assumed to match the consumption of period  i. This is implicitly used in the Kanban system. If the training set has n points, the RMSE S of the naive forecast is 

S^2= \frac{1}{n-1}\times \sum_{i=2}^{n} \left ( y_i - y_{i-1} \right )^2


The ratio T = L/S then measures how much better the forecasting algorithm is than the naive forecast when the loss function is quadratic. We call this ratio the Root Mean Squared Scaled Error (RMSSE).
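
A minimal Python sketch of this ratio follows, assuming plain lists for the training and testing data. The function name `rmsse` and the numbers are hypothetical. The sketch does not handle a constant training series, for which S is zero.

```python
import math


def rmsse(train, actuals, forecasts):
    """Root Mean Squared Scaled Error T = L / S."""
    n = len(train)
    h = len(actuals)
    # S^2: mean squared error of the naive one-step forecast on the training set.
    s2 = sum((train[i] - train[i - 1]) ** 2 for i in range(1, n)) / (n - 1)
    # L^2: mean squared error of the forecast on the testing set.
    l2 = sum((y - f) ** 2 for y, f in zip(actuals, forecasts)) / h
    return math.sqrt(l2 / s2)


# Hypothetical data for one product.
train = [10, 12, 11, 13, 12]   # S^2 = (4 + 1 + 4 + 1) / 4 = 2.5
actuals = [10, 12, 9, 11]      # L^2 = (1 + 4 + 0 + 4) / 4 = 2.25
forecasts = [11, 10, 9, 13]

t = rmsse(train, actuals, forecasts)
print(t)  # sqrt(2.25 / 2.5), a little below 1
```

A value below 1 means the forecast has a smaller quadratic loss on the testing set than the naive forecast had on the training set.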

The Weighted Root Mean Square Scaled Error (WRMSSE)

When you are forecasting sales of more than one product, you can compute an RMSSE for each. It gives you a vector of assessments but not a single assessment for the whole family. Unless forecast errors have equivalent consequences for all products, taking a straight average of the RMSSEs does not make sense. For example, you want to give the RMSSE of a Runner more weight than that of a Stranger. 

But how do you assign a weight to a product? In retail, you can use recent sales, in monetary terms. In manufacturing, it is more complex. Loss leaders, for example, may tie up as many resources as the high margin items. In machine-based operations, it may make more sense to assign a weight based on recent consumption of a bottleneck; in manual operations, you could use labor requirements. Regardless of which weight you use, if w_p is the weight you assign to product p in the family P of  products, and T_p is its RMSSE, then the WRMSSE of the entire product family is:

 U\left ( P \right ) = \sum_{p\in P}^{} w_p\times T_p 
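
The weighted sum can be sketched as follows. The product names, revenue figures, and RMSSE values are made up, and normalizing the weights so that they add up to 1 is an assumption of this sketch; the formula above works with any weights.

```python
def wrmsse(rmsse_by_product, weight_by_product):
    """Weighted sum U(P) of the per-product RMSSE values T_p."""
    return sum(weight_by_product[p] * t for p, t in rmsse_by_product.items())


# Hypothetical recent sales in monetary terms, used as the basis for weights.
recent_revenue = {"runner": 8000.0, "repeater": 1500.0, "stranger": 500.0}
total = sum(recent_revenue.values())
# Assumption: weights are revenue shares, so they sum to 1.
weights = {p: r / total for p, r in recent_revenue.items()}

# Hypothetical per-product RMSSE values.
rmsse_values = {"runner": 0.6, "repeater": 0.9, "stranger": 1.4}

u = wrmsse(rmsse_values, weights)
print(u)  # 0.80 * 0.6 + 0.15 * 0.9 + 0.05 * 1.4, about 0.685
```

In this example the poor forecast for the Stranger barely moves the total, because its weight is only 5%.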

The Weighted Scaled Pinball Loss (WSPL)

Besides producing numbers for daily demand by product, the forecaster wants to provide a confidence interval. Traditionally, this is based on an analysis of the residuals y_i - \bar{y}_i in the training set that assumes them to be white noise.

More sophisticated methods to forecast quantiles directly from the raw data have emerged in the past 10 years. They are available as black boxes. Again, forecasters don’t need to know their inner workings but they do need to measure how well they work. The Weighted Scaled Pinball Loss (WSPL) is the metric used for this in M5. 

The Pinball Loss Function

The pinball loss function is used when the objective of the forecast is a quantile of the distribution. The forecast output, for example, may be a level that 80% of the future values will fall below, or above. The name comes from the graphic shape of the loss function. It bounces like a pinball off the x-axis where actual values exactly match the forecast quantile: 

Figure: the pinball loss function

The pinball function is 

\varrho_\tau\left ( y \right ) = y\left ( \tau - I_{\left [ y\leq 0\right ]}\right )


where I_{\left [ y\leq 0\right ]} is the indicator function, worth 1 when y\leq 0 and 0 otherwise.   

The Pinball Loss L_{\tau}\left ( y,z \right ) for actual value y and \tau-quantile forecast z is then:

L_{\tau}\left ( y,z \right ) = \varrho_\tau\left ( y -z\right )
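
To make the asymmetry concrete, here is a small Python sketch of this definition. The function name `pinball_loss` and the numbers are illustrative assumptions.

```python
def pinball_loss(y, z, tau):
    """Pinball loss L_tau(y, z) for actual value y and tau-quantile forecast z."""
    u = y - z
    # Indicator of [u <= 0]: 1 when the actual is at or below the forecast.
    indicator = 1.0 if u <= 0 else 0.0
    return u * (tau - indicator)


tau = 0.8
# Actual 20 units above the forecast quantile: loss = 20 * 0.8 = 16.
above = pinball_loss(120, 100, tau)
# Actual 20 units below the forecast quantile: loss = -20 * (0.8 - 1), about 4.
below = pinball_loss(80, 100, tau)
print(above, below)
```

For the 80% quantile, an actual that exceeds the forecast costs four times as much as one that falls short by the same amount, which pushes the minimizing forecast upward.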


While this is by no means obvious, according to Biau and Patra, the true \tau-quantile minimizes the expected value of the pinball loss.

Those who remember calculus can see it when a random variable Y has a probability density function f\left ( y \right ) and a cumulative distribution function F\left ( y \right ) because

E\left (L_{\tau}\left ( Y,q \right ) \right ) = E\left ( \varrho_\tau\left ( Y -q\right ) \right ) =\int_{-\infty}^{+\infty}\left (y -q \right )\left ( \tau - I_{\left [ y \leq q\right ]}\right )f\left ( y \right )dy


Knowing that the derivative of the indicator I_{\left [ y \leq q\right ]} is the Dirac \delta_q, we have:

\frac{\partial }{\partial q}E\left (L_{\tau}\left ( Y,q \right ) \right )= -\tau + F\left ( q \right )


which zeroes at q = F^{-1}\left ( \tau \right ) and \frac{\partial^2 }{\partial q^2}E\left (L_{\tau}\left ( Y,q \right ) \right ) = f\left ( q \right ) \geq 0, which says that E\left (L_{\tau}\left ( Y,q \right ) \right ) has a minimum where q is the \tau-quantile of the distribution. 

Therefore, among different quantile forecasting algorithms, it makes sense to select the one with the smallest pinball loss. 
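
The result can also be checked numerically. The sketch below draws a simulated sample from a standard normal distribution, searches a grid of candidate values for the one with the smallest average pinball loss, and compares it with the theoretical 80% quantile. The sample size, seed, and grid are arbitrary choices made for illustration.

```python
import random
from statistics import NormalDist


def mean_pinball_loss(sample, q, tau):
    """Average pinball loss of the candidate quantile q over the sample."""
    return sum(
        (y - q) * (tau - (1.0 if y <= q else 0.0)) for y in sample
    ) / len(sample)


random.seed(0)
tau = 0.8
# Simulated observations from a standard normal distribution.
sample = [random.gauss(0.0, 1.0) for _ in range(2000)]

# Candidate quantile values from -1.50 to 1.50 in steps of 0.01.
candidates = [i / 100 for i in range(-150, 151)]
best_q = min(candidates, key=lambda q: mean_pinball_loss(sample, q, tau))

# Theoretical tau-quantile F^{-1}(tau) of the standard normal, about 0.84.
true_q = NormalDist().inv_cdf(tau)
print(best_q, true_q)
```

The candidate with the smallest average loss should land close to the theoretical quantile, with a gap that reflects sampling noise and the grid step.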

The Scaled Pinball Loss (SPL) for one product 

For this metric, the scaling factor is also based on the naive forecast on the training set. It uses absolute values rather than squares, for consistency with the Pinball Loss function in the numerator:

S = \frac{1}{n-1}\times \sum_{i=2}^{n}\left |y_i - y_{i-1} \right |


And the Scaled Pinball Loss Function for the testing set is:

T_{\tau} = \frac{1}{h}\sum_{i=1}^{h}\frac{L_{\tau}\left ( y_i, z_i \right )}{S}

The Weighted Scaled Pinball Loss Function (WSPL)

The weights of the different series are the same as for the WRMSSE calculations, and, therefore, for the family P of products, with T_{\tau,p} denoting the Scaled Pinball Loss of product p, 

 U_\tau\left ( P \right ) = \sum_{p\in P}^{} w_p\times T_{\tau,p} 
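
Putting the last two steps together, the following Python sketch computes a Scaled Pinball Loss per product and then the weighted sum for a family of two products. The function names, the data, and the weights are hypothetical, and the sketch covers a single quantile level.

```python
def scaled_pinball_loss(train, actuals, quantile_forecasts, tau):
    """Scaled Pinball Loss of one product for a single quantile level tau."""
    n = len(train)
    h = len(actuals)
    # Scale S: mean absolute error of the naive forecast on the training set.
    scale = sum(abs(train[i] - train[i - 1]) for i in range(1, n)) / (n - 1)
    # Sum of the pinball losses L_tau(y_i, z_i) over the testing set.
    total = sum(
        (y - z) * (tau - (1.0 if y <= z else 0.0))
        for y, z in zip(actuals, quantile_forecasts)
    )
    return total / (h * scale)


def wspl(data_by_product, weight_by_product, tau):
    """Weighted sum of the per-product Scaled Pinball Losses."""
    return sum(
        weight_by_product[p] * scaled_pinball_loss(train, actuals, z, tau)
        for p, (train, actuals, z) in data_by_product.items()
    )


# Hypothetical data: (training set, testing actuals, 80%-quantile forecasts).
data = {
    "runner": ([10, 12, 11, 13, 12], [12, 15, 11], [14, 14, 14]),
    "stranger": ([0, 2, 0, 0, 4], [0, 3, 1], [2, 2, 2]),
}
# Hypothetical weights, assumed here to sum to 1.
weights = {"runner": 0.9, "stranger": 0.1}

spl_runner = scaled_pinball_loss(*data["runner"], 0.8)
u_tau = wspl(data, weights, 0.8)
print(spl_runner, u_tau)  # about 0.4 and about 0.383
```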


  • Makridakis, S., Spiliotis, E. & Assimakopoulos, V. (2021) The M5 competition: Background, organization and implementation
  • Wiseman, P.K. (2019) Sales Forecasting: Process and Methodology in Practice, CreateSpace Independent Publishing Platform, ISBN: 978-1719012652
  • Carlberg, C. (2016) Excel Sales Forecasting For Dummies, 2nd Edition, For Dummies, ISBN: 978-1119291428
  • Kolassa, S. & Siemsen, E. (2016) Demand Forecasting for Managers, Business Expert Press, ISBN: 9781606495032
  • Thomopoulos, N. T. (2014) Demand Forecasting for Inventory Control, Springer International Publishing, ISBN: 9783319119762
  • Biau, G. & Patra, B. (2011) Sequential Quantile Prediction of Time Series
  • Bourdonnais, R. & Usunier, J.C. (2007) Prévision des Ventes, Economica, ISBN: 978-2-7178-5344-8
  • Wallace, T. & Stahl, R. (2002) Sales Forecasting, T.F. Wallace and Company, ISBN: 096748841-9
  • Phadke, M. (1989) Quality Engineering Using Robust Design, Prentice Hall, ISBN: 0-13-745167-9

#salesforecast, #productionplanning, #kanban, #M5Competition