My last two long posts were about evaluating sales forecasts. They begged the question of how you generate these forecasts. This is a partial answer, about what you can tell from a history of sales through both classical methods and recent developments, particularly probability forecasting.

Rob Hyndman stated five conditions a phenomenon that is easy to forecast:

You understand the factors that contribute to its variation.

There is plenty of data available.

The forecasts cannot affect the thing you are trying to forecast.

The future is similar to the past.

There is relatively little natural or unexplainable variation.

Sales and Hyndman’s Conditions

It’s not a foregone conclusion that your company’s sales within your planning horizon meet these conditions. Market research can improve your understanding, and you can collect more detailed and accurate data.

Demand is what you would like to know; Sales is what you have historical data about, to end-users or intermediaries like distributors, dealers, or retailers. Intermediaries buffer the manufacturer against fluctuations in demand but they also reduce the visibility of end-user demand to manufacturers.

Data Categories for Complete Forecasting

For a complete picture, David Simchi-Levi lists the following categories of data that a consumer products manufacturer considers for sales forecasting:

Internal data on shipments to retailers, prices, discounts, promotions, and various product characteristics.

Data on consumer demand, accessed through retailers’ point-of-sale technology or purchased from market research companies.

Macroeconomic information—including quarterly GDP, the Purchasing Managers’ Index, the Consumer Price Index, and unemployment and inflation rates.

Other external data, on web searches, social media mentions of products, the weather, and the competition.

All we are talking about here is the time-series analysis of historical sales data. It doesn’t mean it’s OK to ignore the other elements.

Why Probability Forecasting?

Rob Hyndman defines forecasting as “the estimation of a probability distribution of a variable to be observed in the future.” Point forecasting, however, has defined most professionals’ understanding.

Forecasts and Decisions

As Bill Waddell put it, “in the end, the forecast is used to drive decisions (i.e. how many long lead time parts to buy); and this reduces all of it to a point.”

Indeed, a purchase order is for a specific quantity of parts, not a probability distribution. With the probability forecast, however, the materials manager chooses the risk to take. When issuing a point forecast, forecasters make that decision, and it’s not really theirs to make. Their job should be limited to providing the best possible forecast. It is enough of a challenge.

Logical Issues with Point Forecasting

There is also something logically questionable in using an estimate of the mean, median, or mode of a distribution as a forecast. Numerical variables never match it, and we call the difference between the actual value and the forecast an “error.”

A random variable, however, will fluctuate around this number, no matter how perfectly you have estimated it. That’s why you call the variable “random.” No actual error is involved.

The real reason point forecasts have dominated the field is that probability forecasts require more computing power. It is readily available now but wasn’t when the most common point forecasting methods were developed decades ago.

The Challenge

Probability forecasting is straightforward from a sample of independent and identically distributed variables. It’s not when you want to predict quantiles for the future of a time series. It should be noted that the justification for forecasting methods is not theoretical but empirical.

Forecasting and Math

The math must make sense but math cannot prove a method works. Its validation depends on forecasting performance measured on both simulated and actual data. This is why Biau & Patra devote several pages of their paper to experimental results on call center data, and why forecasting competitions, from M1 in 1982 to M5 in 2020, are essential to progress in this field.

Forecasting and Software

Software tools like Excel’s Data Analysis add-on or R’s lm function are not bashful about putting out prediction intervals. The way they do it, however, is by using multiples of the Root Mean Squared Error (RMSE), essentially the \sigma of the residuals in the training set. The implicit assumption is that the residuals are gaussian — a.k.a “normal” — and form a white noise, meaning that they are independent of each other. Residuals from data don’t always satisfy both.

We now have the computing power and the math available to predict quantiles from the raw data with looser assumptions. Please do not let the math intimidate you. Since forecasting is applied math, there is no clearer, simpler, or more concise way to describe the methods. As much as possible, we backed up every math statement with graphics illustrating it.

Simple Cases

To infer a probability distribution, we need a sample of data. Much of the challenge of probability forecasting is extracting from history a sample that is representative of the current condition. First, we need to discuss what to do once you have such a sample. It differs from common model fitting practices. Then we’ll go into the means of finding an appropriate sample, starting from the simplest cases.

Independent and Identically Distributed Variables

What we are trying to do here differs from the traditional approach to choosing a model for a set \left ( y_1,...,y_n \right ) of numbers that we interpret as independent and identically distributed (i.i.d.) instances of a random variable Y.

Common Practice With a Standard Distributions

Usually, based on prior knowledge of the meaning of the numbers, we select one from a small number of distribution families, like the Gaussian, binomial, Poisson, hypergeometric, Weibull, \chi^2, etc., for which the math is well-known. Each family has a small number of parameters, like mean, standard deviation, or degrees of freedom. We fit them to the data and then test for goodness of fit.

Quantiles of Any Distribution

All we want to do with the distribution here is calculate its quantiles. For this purpose, we don’t need to limit ourselves to this small list from the infinite universe of possible distributions. We want to infer a distribution from the data without worrying whether it matches a well-known one.

The first step is sorting of the y_i by increasing value, which we denote by \left ( y_{\left (1 \right )},..., y_{\left (n \right )}\right ). Then y_{\left (i \right )} has exactly i-1 points below and n-i points over it.

Then, if we plot i/n versus y_{\left (i \right )}, we have the cumulative relative frequencies of the values in the sample, as shown in the following examples, with samples of increasing sizes drawn from a set of 1,000,000 points:

This staircase function is \hat{F}\left ( q \right ) = \frac{1}{n}\times\sum_{i=1}^{n}I_{\left ( q > y_{i}\right )} where I_{\left ( q > y_{i}\right )} is the indicator function that is 1 when q > y_{i} and 0 otherwise.

It estimates the cumulative distribution function F of Y. Then, if it were strictly monotonous, it would have an inverse \hat{F}^{-1} and \hat{q} = \hat{F}^{-1}\left (\tau\right ) would be the estimate of the \tau-quantile that minimizes the pinball loss function for \hat{F}, as discussed in a previous post.

From Staircase to Smooth Cumulative Frequency Curves

A staircase, however, is not monotonous, and \hat{F}^{-1}\left (\tau\right ) is actually the interval [y_{\left ( i \right )}, y_{\left ( i+1 \right )}[ whenever \tau matches the top of a step. With large samples, the steps are so narrow that the curve looks continuous and, in this example, monotonous.

You don’t choose the size of the sample to work with. If life gives you 30 points, that’s what you have. With 30 points, this method gives you an estimate for the 50% quantile, also known as the median, between 18 and 38. The center of this interval, at 28, is used as estimate. With 30,000 points, your median estimate is 42.

The Probability Distribution Function and KDE

The corresponding probability distribution function (pdf) \hat{f}\left ( q \right ) = \frac{1}{n}\times\sum_{i=1}^{n}\delta_{y_i}\left ( q \right ) has a Dirac unit impulse at each y_i and is 0 in every ]y_{\left ( i \right )}, y_{\left ( i+1 \right )}[ interval.

While strictly based on the sample and free of any assumption, it doesn’t make much sense. If you only have one point, y, the pdf is \hat{f}\left ( q \right ) = \delta_{y}\left ( q \right ), which gives a probability of 1 to any interval containing y and 0 to any other interval, no matter how close to y. Prior knowledge of sales says otherwise: the next value is likely to be in such an interval.

Rather than \delta_{y}, a symmetric distribution g\left (q - y \right ) with a single mode at y and tapering off as you get away from y would make more sense. If we call this distribution a kernel and substitute it for the Dirac at every y_i, we obtain a density g\left (q - y_i \right ) for each data point.

We discussed this concept in Updating the 7 tools of QC as an improvement over histograms when you have software to do it for you. It’s not free of assumptions because you have to choose a scale for the kernel. If the steps are visible in the cumulative proportion plot, the KDE is both more realistic and easier to work with.

Now, let’s look at how we can apply these methods to probability forecasting.

Control Charts: Constant + Noise

This is the simplest model for a time series, and one for which probability forecasting is straightforward. When drawing SPC control charts, the sequence of points y_1,...,y_n,... is assumed the be the realization of random variables Y_1, ...,Y_n,… of the form

Y_i = C + S_i

where the S_i are a white noise, meaning that they are independent and identically distributed and have a mean of 0. SPC sets control limits based on the additional assumption that the S_i are gaussian, a.k.a “normal.” It makes sense where, as in bar{X}-charts, each Y_i is the average of a sample of measurements of workpiece dimensions.

The Constant + Noise Model of Sales

On the other hand, if we want to apply the “Constant + Noise” model in sales forecasting, the noises can have any distribution, which we need to infer from the s_1,…,s_n,... observed over time. To dramatize that the noise does not have to be Gaussian, we simulated the deliberately ugly camelback distribution. It’s bimodal, skewed, and pointy. Visually, as seen below, such a series may appear to have oscillations but they are an optical illusion:

With a Gaussian white noise, it would look different:

Forecasting with the Constant + Noise Model

In sales forecasting, this model is often applied to the residuals of a point forecast. What remains once you subtract the point forecast from the data in the training set is supposed to be noise. If the mean of the noise is not 0, the point forecast is biased, and removing biases is a major issue with point forecasting. In probability forecasting, on the other hand, the output is not the point forecast, and the bias just indicates the location of the distribution.

To check that the noises keep the same distribution over time, you check them for homogeneity of variance. Does their \sigma change over time? If it doesn’t the time series is said to be homoscedastic. It is often visually obvious. Then, if you can’t find any autocorrelation among the noises, you assume them to be independent.

If all of this holds, the y_i are i.i.d. and treatable by the methods described above. The probability forecast for this case is the same distribution, regardless of how far into the future we look.

Naive Forecasting: Next = Last + Noise

Replenishment systems use the quantity consumed in period i as a point forecast for period i+1. The implicit time series model is

Y_{i+1} = Y_i + S_{i+1}

where the noises S_i, subject to the same conditions as in the Constant + Noise model, are assumed i.i.d.

Naive Model Simulation

A simulation of this process with the same camelback noise distribution as above but with 1/4 of its amplitude, looks as follows:

The variability is lower day-to-day but higher quarterly. The quarterly oscillations on this chart are also an optical illusion because there is nothing periodic in this process.

If we have a sample y_1,..., y_n of n, we have a sample s_1,..., s_{n-1} of n-1 noises given by s_i = y_i-y_{i-1}. Then you estimate the distribution g of S_i as above. The distribution of Y_{i+1}, knowing that Y_i = y_i is

Then we can take quantiles from the cumulative distribution G. This is if we want to look ahead just one period. For k periods, then

Y_{i+k} = Y_i + \sum_{j=1}^{k}S_{i+j}

About the distribution of the sum of k i.i.d. noises is that, if they have a mean \mu and a standard deviation \sigma:

Its mean is k\mu

Its standard deviation is \sqrt{k}\times\sigma

If k\sim 10 or more, the distribution is indistinguishable from a Gaussian N\left ( k\mu, \sqrt{k}\sigma \right ).

Since not all distributions have means and standard deviations, the assumption that S does is not trivial. We have known that sums of i.i.d. variables with such a distribution converge to a Gaussian for over 200 years. What is still remarkable is how quickly it happens.

In the general case and for small k, we obtain the distribution of the sum of k noises by convolvingg with itself k times. “Convolving” is less intimidating than it sounds once you realize that there is no more to it than invoking a function called “convolve.” The following figure shows the convergence happening with the camelback distribution, which converges to a smooth bell-shape in just 15 iterations:

Autoregressive(AR) Models

When forecasting a time series from its own history, you assume that its past values inform its future. A common way of expressing this is in terms of correlations between values in different time periods, assuming they depend only on the lag between them. The correlation between today’s sales and the sales of a week ago is the same as between the sales of one day 6 months ago and the day 6 months + 1 week ago. Then the correlation between values that are j days apart is a function of j called the series autocorrelation, and it usually has a finite range. Beyond a certain lag, the correlations vanish.

If there is autocorrelation of range p in your time series, you can use regression to forecast future values against the last p. Without autocorrelation, the forecaster is limited to using external factors like the weather, the season, or economic indicators, and applying the Constant + Noise model to residuals. With autocorrelation, the forecaster can learn from the time series’ own history.

Examples of Autocorrelation

The Naive Model is an example of positive, 1-step autoregression. If today’s Sales is the point forecast for tomorrow’s, the more we sell today, the more we expect to sell tomorrow.

Eli Godratt’s Hockey Stick Effect

By contrast, what Eli Goldratt called the hockey stick effect has high sales on the last day of an accounting period followed by low sales on the first day of the next. By scrambling to close deals at the end of the reporting period, the sales reps are actually borrowing from future demand and thereby depleting the sales of the following days. This is negative autocorrelation, and the Naive Model would be no help.

Autocorrelation Induced by Averaging

Autocorrelation may be present in raw data but is also routinely introduced in processing it. For example, many published time series are moving averages of daily values. For example, the daily case counts of COVID-19 for the county of Santa Clara in California are reported as follows:

The daily cases are in light blue; the 7-day averages, in navy blue. The purpose is to provide daily values while filtering fluctuations, assumed to be noise. Advisor Perspectives does the same with monthly car sales data:

“Exponential,” in the legend, means that, in the value for a given month, the values for past months are assigned weights that decrease exponentially with lag. When working with such data, a forecaster needs to know that they are autocorrelated whether the individual values are or not, simply because consecutive values are averages of overlapping series of terms. If the y_i are i.i.d., the k-period moving averages z_i for i > k are

z_i = \frac{y_{i-k+1}+...+y_i}{k}

and if the y_i have 0 mean and a standard deviation of \sigma the covariance of z_{i+1} and z_i is

because the cross terms vanish: when i\neq j, E\left ( Y_i Y_j \right )= E\left ( Y_i \right )E\left ( Y_j \right ) = 0.

Autocorrelation in Moving Range Charts

In SPC, the moving range part of the XmR/Process behavior chart has 1-step autocorrelation because successive values R_i =\left | Y_i - Y_{i-1} \right | and R_{i+1} =\left | Y_{i+1} - Y_i \right | are both functions of Y_i . It is ignored in the way control limits are set.

Probability Forecasting with Autoregression

The naive forecast is a special case of autoregression of order p, known as AR\left ( p \right ). In general AR\left ( p \right ), the point forecast for Y_{i+1} is a linear combination of the last p past values:

Y_{i} = c + \sum_{k=1}^{p}\phi_kY_{i-k} + S_{i}

where c is a constant, the S_{i} are a white noise and the point forecast is \hat{Y}_{i} = c + \sum_{k=1}^{p}\phi_kY_{i-k}.

To fit this kind of model, the procedure is straightforward:

Estimate the autocorrelation function from the training set y_i. This is to know whether there is autocorrelation. If not, don’t use this model. In R, you can use the acf function.

Get a noise sample s_{p+1},...,s_n from s_i = y_i - \hat{y}_i for i = p+1,...,n.

Estimate the distribution of the S_i from the sample as above.

This yields a probability forecast for the next period. You can still look ahead k periods into the future, but it is not as simple as for the naive model where the noises just add up.

A More General Model

In all of the above, we have assumed that the noises around point forecasts are i.i.d. Their distribution does not change with the period i, the value y_i , or the history \left (y_1,…,y_i \right ). These assumptions allow us to use all the available past residuals as a sample from which to infer the distribution of the noise. Where they don’t apply, the challenge is to select for each y_i , a sample \left (s_{i_1},...,s_{i_m} \right ) a sample that is representative of the noise around y_{i+1} .

A Few Assumptions

To make some headway, we retain the assumptions that the sales process has the following characteristics:

It is stationary: none of its characteristics change over time. Its expected value, the distribution of fluctuations around it, autocorrelations, etc., all remain the same.

It is ergodic: you can infer the distribution of the values at one time from the variability of the series over time.

These concepts are best understood by example.

Stationary Time Series

Let’s say you have a group of machines on a production floor that are either running, ready, or down. They are all fitted with andon lights that are green when they are running, yellow when ready, and red when down. The dynamics of transitions between these states do not change hour-by-hour or day-by-day. Improvement efforts may reduce downtime within 3 to 6 months, but not by tomorrow. For daily operations, you consider this process to be stationary.

Sales are generally not stationary but there are often ways to extract from sales data a stationary process that contains all the randomness:

Subtracting all growth, seasonal and cyclical trends, or pent-up demand effects leaves you with residuals that have a constant expected value of 0.

If the residuals tend to grow, a monotonous transformation may make them stationary, and the reverse transformation maps quantiles of the transformed data to the quantiles of the original. This is often done with logarithms and exponentials.

If autocorrelations do not vanish beyond a certain lag, you can work with increments y_i - y_{i-1} instead of the y_i.

Ergodic Time Series

At various times during the day, you may count how many Andon lights on machines are green, yellow, and red. If you see that, usually, about 15% of the machines are down, you may infer that any given machine is, on average, down 15% of the time. When you are doing this, you are assuming that snapshots of the states of multiple machines inform you about the behavior of a single machine over time. You want this to be true because taking snapshots is easier and faster than collecting machine histories. The same assumption is made in work sampling.

It is called ergodicity. You would not guess it from the Wikipedia article about ergodicity, except for one sentence that actually states the converse: “the average behavior of the system can be deduced from the trajectory of a ‘typical’ point.” With your machines, you are actually deducing the difficult-to-observe trajectory of a single machine from an easy-to-observe average behavior of the entire shop but it is the same idea. With a time series, you start from the observed trajectory \left (y_1,…,y_i \right ) and deduce the distribution of Y_{i+1}.

For ergodicity to be a reasonable assumption, the data must be homogeneous. If the machines in your shop are the same models, operated and maintained by the same teams, you can make the assumption that their state transitions are ergodic but you shouldn’t if they include both machining centers and drill presses. With sales data, you wouldn’t want to mix items routinely made in high volumes with intermittent demand items.

Biau & Patra’s Method

Biau & Patra’s has the merit of being published. Most of the available probability forecasting algorithms are proprietary to software companies.

We present it here as an example, neither a recommendation nor an endorsement. Neither Biau nor Patra have published more on this subject since 2011. Biau’s later publications are about random forests; Patra has worked with software startups. The discussion is focused on the principles by which they choose samples from which to infer the forecast distribution, omits many details, and uses more graphics than their academic paper.

Outline of the method

Biau & Patra explain an ensemble method for probability forecasting one period ahead, for a stationary, ergodic time series. It means that they forecast quantiles as weighted averages of forecasts obtained by multiple methods, based on sequences of preceding values of increasing lengths and nearest neighbors of these sequences in the data, in increasing numbers.

For the distribution of Y_{n} you start with just y_{n-1} and select samples of the form

J^{\left (1,ell \right )}_n = \left \{ t; 1<t<n,\textup{ such that }y_{t-1}\textup{ is among the }\ell\textup{ nearest neighbors of }y_{n-1}\right \}

where the distance between y_{n-1} and y_t is \left | y_{n-1}-y_t \right |.

Then you select samples for pairs near \left (y_{n-2}, y_{n-1} \right ), triplets near \left (y_{n-3}, y_{n-2}, y_{n-1} \right ), etc.

For brevity, Biau & Patra use the notation Y^n_1 = (Y_1,...,Y_n) for a sequence of n random variables and likewise y^n_1 = (y_1,...,y_n) for a data sample drawn from Y^n_1.

We just need to remember that, in this context, the superscript n is an index and not a power of Y_1 or y_1. We also use NN as an abbreviation for “Nearest Neighbor.” Then

J^{\left (k,\ell \right )}_n = \left \{ t; k<t<n,\textup{ with }y^{t-1}_{t-k}\textup{ among the }\ell \textup{ NNs of }y^{n-1}_{n-k}\textup{ in the } y^k_1,...,y^{n-2}_{n-k-1} \right \}

where the distance between sequences of length k is a norm \left \| y^{n-1}_{n-k} - y^{t-1}_{t-k} \right \| on vectors of dimension k.

Biau & Patra then estimate a quantile of the distribution of Y_{n}, given y^n_1 as a weighted average of the estimates from from each J^{\left (k,\ell \right )}_n.

With many technical details omitted, this is the general idea of their method. They do not justify it by math but by experimental results of workload forecasting in call centers.

Application to Call Center Volumes

Biau and Patra applied their methods to 21 sets of daily call volumes at call centers tallied over 2 years, totaling 16,270 records. As shown below, the level of activity varied between call centers. It also exhibited periodic drops to near 0 that seem to reflect weekend lulls, and the time domains for available data were not the same for all centers.

Pruning the Series

Biau & Patra did not use the complete series but instead a subset of 91 dates between 11/30/2006 and 12/13/2007, some consecutive and others with gaps of up to one month. This shows in the plots of the pruned series, the one-month gap being between the last two points:

Except for the desire to use dates for which volumes are available in all 21 series, they provide no justification for pruning the data in this particular way. Then they identified the J^{\left (k,\ell \right )}_n for 1\leq k \leq 14 and 1\leq \ell \leq 25.

Examples of Nearest-Neighbor Samples

For k =1 and \ell = 25, it means selecting the 25 nearest neighbors of the last point in the pruned series. For k = 5 and \ell = 5, it means finding the 5 sequences of 5 consecutive points in the pruned series that are closest to the sequence of the last 5 points. This is what it looks like for Series 0:

The red dots in the top chart then become one sample from which to infer the distribution of y_{m+1}; the red dots in the third chart become another sample. If you only use the points in the top chart, you will get a 0 probability for the steep drops visible in the timeline. You begin to collect them with longer sequences.

If, as in the bottom chart, the sequence is too long, the sample is almost the entire data set, and not distinctive of the last point.

Biau & Patra collected 14\times 25 = 210 samples for each series, estimated quantiles with each sample, and combined them into a single estimate through weighted averages.

The means for setting the weights are quite complex and given without a justification. The weight given to the sample J^{\left (k,\ell \right )}_n is a function of the pinball loss of the estimate for this sample.

These are the kind of technical details the reader can find in Biau & Patra’s paper. Our focus here has been on the concept of using multiple nearest-neighbor sequences to select samples from which to infer distributions before aggregating these distributions into a composite estimate.

Other Methods

Biau & Patra compare their results with a number of pre-existing methods like Quantile Autoregression (QAR) for 10%, 50% and 90% quantiles in terms of the Pinball Loss and the Ramp Loss — that is, “the empirical fraction of quantile estimates which exceed the observed values y_{m+1}.”

Then they compare their results for the median, the 50% quantile with point forecasting methods like Moving Averages (MA), Autoregression (AR), and others, using metrics for point forecasting, like Average Absolute Error or Mean Absolute Percentage Error (MAPE). The paper is from 2011, and the Total Percentile Error (TPE) emerged later.

They conclude that, on the call center data, their method outperformed the others. The claims, however, would be more compelling with an explanation of the pruning method used to reduce all the time series to 91 points.

Conclusions

Probability forecasting has taken over in weather reports and in politics. TV and radio audiences hear that there is “a 35% probability that it will rain today” and it supports their decision whether to pack raincoats. Prior to elections, analysts like Nat Silver announce that a candidate has “a 60% probability of winning.” This supports decisions by campaigns on strategy and by bookmakers on betting odds.

In sales forecasting for Manufacturing, it has yet to make its mark. It separates cleanly the roles of forecasting from that of making supply chain management decisions based on the forecast, assuming managers know what to do with probability distributions.

A sign that not many do is that the companies offering probability forecasting tools to manufacturers do not make this capability the focus of their presentation:

Toolgroup’s product “decodes demand uncertainty and minimizes inventory so you can be ready for anything.”

Wahupa is about “Next-generation supply chain management.”

Probability forecasting produces a richer output than point forecasting. Exploiting that output wisely, however, also requires more sophistication in, say, Sales & Operations Planning (S&OP) than the methods developed in the 1980s. In other words, for it to take root, other parts must move as well.

References

Tashman, L., Gilliland, M., Sglavo, U. (2021). Business Forecasting: The Emerging Role of Artificial Intelligence and Machine Learning, Wiley, ISBN: 9781119782582

Koenker, R. et al. (2018) Handbook of Quantile Regression, CRC Press, ISBN: 978-1-4987-2528, Chapter 17, pp. 293-330

Hyndman, R. J. & Athanasopoulos, G. (2018) Forecasting: Principles and Practice, Otexts, ISBN: 9780987507105

Ord, J. K., Fildes, R., & Kourentzes, N. (2017). Principles of business forecasting (2nd ed.). Wessex Press Publishing Co. ISBN: 9780999064917

Biau, Gérard & Patra, Benoît. (2011). Sequential Quantile Prediction of Time Series. Information Theory, IEEE Transactions on. 57. 1664 – 1674. 10.1109/TIT.2011.2104610.

Koenker, R (2005) Quantile Regression. Cambridge University Press, Cambridge, ISBN: 9780521608275

Aug 27 2021

## Sales Forecasts – Part 3. Generating Probability Forecasts

My last two long posts were about evaluating sales forecasts. They begged the question of how you

generatethese forecasts. This is a partial answer, about what you can tell from a history of sales through both classical methods and recent developments, particularly probability forecasting.Contents

## Are Sales Easy To Forecast?

Rob Hyndman stated five conditions a phenomenon that is easy to forecast:

## Sales and Hyndman’s Conditions

It’s not a foregone conclusion that your company’s sales within your planning horizon meet these conditions. Market research can improve your understanding, and you can collect more detailed and accurate data.

Demand is what you would like to know; Sales is what you have historical data about, to end-users or intermediaries like distributors, dealers, or retailers. Intermediaries buffer the manufacturer against fluctuations in demand but they also reduce the visibility of end-user demand to manufacturers.

## Data Categories for Complete Forecasting

For a complete picture, David Simchi-Levi lists the following categories of data that a consumer products manufacturer considers for sales forecasting:

All we are talking about here is the time-series analysis of historical sales data. It doesn’t mean it’s OK to ignore the other elements.

## Why Probability Forecasting?

Rob Hyndman

definesforecasting as “the estimation of a probability distribution of a variable to be observed in the future.”Pointforecasting, however, has defined most professionals’ understanding.## Forecasts and Decisions

As Bill Waddell put it, “in the end, the forecast is used to drive decisions (i.e. how many long lead time parts to buy); and this reduces all of it to a point.”

Indeed, a purchase order is for a specific quantity of parts, not a probability distribution. With the probability forecast, however, the materials manager chooses the risk to take. When issuing a point forecast, forecasters make that decision, and it’s not really theirs to make. Their job should be limited to providing the best possible forecast. It is enough of a challenge.

## Logical Issues with Point Forecasting

There is also something logically questionable in using an estimate of the mean, median, or mode of a distribution as a forecast. Numerical variables never match it, and we call the difference between the actual value and the forecast an “error.”

A random variable, however, will fluctuate around this number, no matter how perfectly you have estimated it. That’s why you call the variable “random.” No actual error is involved.

The real reason point forecasts have dominated the field is that probability forecasts require more computing power. It is readily available now but wasn’t when the most common point forecasting methods were developed decades ago.

## The Challenge

Probability forecasting is straightforward from a sample of independent and identically distributed variables. It’s not when you want to predict quantiles for the future of a time series. It should be noted that the justification for forecasting methods is not theoretical but empirical.

## Forecasting and Math

The math must make sense but math cannot prove a method works. Its validation depends on forecasting performance measured on both simulated and actual data. This is why Biau & Patra devote several pages of their paper to experimental results on call center data, and why forecasting competitions, from M1 in 1982 to M5 in 2020, are essential to progress in this field.

## Forecasting and Software

Software tools like Excel’s Data Analysis add-on or R’s lm function are not bashful about putting out prediction intervals. The way they do it, however, is by using multiples of the Root Mean Squared Error (RMSE), essentially the \sigma of the residuals in the training set. The implicit assumption is that the residuals are gaussian — a.k.a “normal” — and form a white noise, meaning that they are independent of each other. Residuals from data don’t always satisfy both.

We now have the computing power and the math available to predict quantiles from the raw data with looser assumptions. Please do not let the math intimidate you. Since forecasting

isapplied math, there is no clearer, simpler, or more concise way to describe the methods. As much as possible, we backed up every math statement with graphics illustrating it.## Simple Cases

To infer a probability distribution, we need a sample of data. Much of the challenge of probability forecasting is extracting from history a sample that is representative of the current condition. First, we need to discuss what to do once you have such a sample. It differs from common model fitting practices. Then we’ll go into the means of finding an appropriate sample, starting from the simplest cases.

## Independent and Identically Distributed Variables

What we are trying to do here differs from the traditional approach to choosing a model for a set \left ( y_1,...,y_n \right ) of numbers that we interpret as independent and identically distributed (i.i.d.) instances of a random variable Y.

## Common Practice With a Standard Distributions

Usually, based on prior knowledge of the meaning of the numbers, we select one from a small number of distribution families, like the Gaussian, binomial, Poisson, hypergeometric, Weibull, \chi^2, etc., for which the math is well-known. Each family has a small number of parameters, like mean, standard deviation, or degrees of freedom. We fit them to the data and then test for goodness of fit.

## Quantiles of Any Distribution

All we want to do with the distribution here is calculate its quantiles. For this purpose, we don’t need to limit ourselves to this small list from the infinite universe of possible distributions. We want to infer a distribution from the data without worrying whether it matches a well-known one.

The first step is sorting of the y_i by increasing value, which we denote by \left ( y_{\left (1 \right )},..., y_{\left (n \right )}\right ). Then y_{\left (i \right )} has exactly i-1 points below and n-i points over it.

Then, if we plot i/n versus y_{\left (i \right )}, we have the cumulative relative frequencies of the values in the sample, as shown in the following examples, with samples of increasing sizes drawn from a set of 1,000,000 points:

This staircase function is \hat{F}\left ( q \right ) = \frac{1}{n}\times\sum_{i=1}^{n}I_{\left ( q > y_{i}\right )} where I_{\left ( q > y_{i}\right )} is the indicator function that is 1 when q > y_{i} and 0 otherwise.

It estimates the cumulative distribution function F of Y. Then, if it were strictly monotonous, it would have an inverse \hat{F}^{-1} and \hat{q} = \hat{F}^{-1}\left (\tau\right ) would be the estimate of the \tau-quantile that minimizes the pinball loss function for \hat{F}, as discussed in a previous post.

## From Staircase to Smooth Cumulative Frequency Curves

A staircase, however, is

notmonotonous, and \hat{F}^{-1}\left (\tau\right ) is actually the interval [y_{\left ( i \right )}, y_{\left ( i+1 \right )}[ whenever \tau matches the top of a step. With large samples, the steps are so narrow that the curve looks continuous and, in this example, monotonous.You don’t choose the size of the sample to work with. If life gives you 30 points, that’s what you have. With 30 points, this method gives you an estimate for the 50% quantile, also known as the median, between 18 and 38. The center of this interval, at 28, is used as estimate. With 30,000 points, your median estimate is 42.

## The Probability Distribution Function and KDE

The corresponding probability distribution function (pdf) \hat{f}\left ( q \right ) = \frac{1}{n}\times\sum_{i=1}^{n}\delta_{y_i}\left ( q \right ) has a Dirac unit impulse at each y_i and is 0 in every ]y_{\left ( i \right )}, y_{\left ( i+1 \right )}[ interval.

While strictly based on the sample and free of any assumption, it doesn’t make much sense. If you only have one point, y, the pdf is \hat{f}\left ( q \right ) = \delta_{y}\left ( q \right ), which gives a probability of 1 to any interval containing y and 0 to any other interval, no matter how close to y. Prior knowledge of sales says otherwise: the next value is likely to be in such an interval.

Rather than \delta_{y}, a symmetric distribution g\left (q - y \right ) with a single mode at y and tapering off as you get away from y would make more sense. If we call this distribution a

kerneland substitute it for the Dirac at every y_i, we obtain a density g\left (q - y_i \right ) for each data point.Then the average of all these distributions is the Kernel Density Estimation (KDE).

\hat{f}\left ( q \right ) = \frac{1}{n}\times\sum_{i=1}^{n}g\left ( q- y_i \right )

## Applying KDE to a Small Sample

We discussed this concept in Updating the 7 tools of QC as an improvement over histograms when you have software to do it for you. It’s not free of assumptions because you have to choose a scale for the kernel. If the steps are visible in the cumulative proportion plot, the KDE is both more realistic and easier to work with.

Now, let’s look at how we can apply these methods to probability forecasting.

## Control Charts: Constant + Noise

This is the simplest model for a time series, and one for which probability forecasting is straightforward. When drawing SPC control charts, the sequence of points y_1,...,y_n,... is assumed the be the realization of random variables Y_1, ...,Y_n,… of the form

Y_i = C + S_i

where the S_i are a

white noise, meaning that they are independent and identically distributed and have a mean of 0. SPC sets control limits based on the additional assumption that the S_i are gaussian, a.k.a “normal.” It makes sense where, as in bar{X}-charts, each Y_i is the average of a sample of measurements of workpiece dimensions.## The Constant + Noise Model of Sales

On the other hand, if we want to apply the “Constant + Noise” model in

sales forecasting, the noises can haveanydistribution, which we need to infer from the s_1,…,s_n,... observed over time. To dramatize that the noise does not have to be Gaussian, we simulated the deliberately ugly camelback distribution. It’s bimodal, skewed, and pointy. Visually, as seen below, such a series may appear to have oscillations but they are an optical illusion:With a Gaussian white noise, it would look different:

## Forecasting with the Constant + Noise Model

In

sales forecasting, this model is often applied to the residuals of apointforecast. What remains once you subtract the point forecast from the data in the training set is supposed to be noise. If the mean of the noise is not 0, the point forecast isbiased, and removing biases is a major issue withpointforecasting. Inprobabilityforecasting, on the other hand, the output is not the point forecast, and the bias just indicates the location of the distribution.To check that the noises keep the same distribution over time, you check them for homogeneity of variance. Does their \sigma change over time? If it doesn’t the time series is said to be homoscedastic. It is often visually obvious. Then, if you can’t find any autocorrelation among the noises, you assume them to be independent.

If all of this holds, the y_i are i.i.d. and treatable by the methods described above. The probability forecast for this case is the same distribution, regardless of how far into the future we look.

## Naive Forecasting: Next = Last + Noise

Replenishment systems use the quantity consumed in period i as a point forecast for period i+1. The implicit time series model is

Y_{i+1} = Y_i + S_{i+1}

where the noises S_i, subject to the same conditions as in the Constant + Noise model, are assumed i.i.d.

## Naive Model Simulation

A simulation of this process with the same camelback noise distribution as above but with 1/4 of its amplitude, looks as follows:

The variability is lower day-to-day but higher quarterly. The quarterly oscillations on this chart are also an optical illusion because there is nothing periodic in this process.

If we have a sample y_1,..., y_n of n, we have a sample s_1,..., s_{n-1} of n-1 noises given by s_i = y_i-y_{i-1}. Then you estimate the distribution g of S_i as above. The distribution of Y_{i+1}, knowing that Y_i = y_i is

\hat{f}_{i+1}\left ( q | Y_i= y_i\right ) = g(q-y_i)

## Forecasting Multiple Steps Ahead

Then we can take quantiles from the cumulative distribution G. This is if we want to look ahead just one period. For k periods, then

Y_{i+k} = Y_i + \sum_{j=1}^{k}S_{i+j}

About the distribution of the sum of k i.i.d. noises is that, if they have a mean \mu and a standard deviation \sigma:

Since not all distributions have means and standard deviations, the assumption that S does is

nottrivial. We have known that sums of i.i.d. variables with such a distribution converge to a Gaussian for over 200 years. What is still remarkable is how quickly it happens.In the general case and for small k, we obtain the distribution of the sum of k noises by

convolvingg with itself k times. “Convolving” is less intimidating than it sounds once you realize that there is no more to it than invoking a function called “convolve.” The following figure shows the convergence happening with the camelback distribution, which converges to a smooth bell-shape in just 15 iterations:## Autoregressive(AR) Models

When forecasting a time series from its own history, you assume that its past values inform its future. A common way of expressing this is in terms of correlations between values in different time periods, assuming they depend only on the lag between them. The correlation between today’s sales and the sales of a week ago is the same as between the sales of one day 6 months ago and the day 6 months + 1 week ago. Then the correlation between values that are j days apart is a function of j called the series

autocorrelation, and it usually has a finite range. Beyond a certain lag, the correlations vanish.If there is autocorrelation of range p in your time series, you can use

regressionto forecast future values against the last p. Without autocorrelation, the forecaster is limited to using external factors like the weather, the season, or economic indicators, and applying the Constant + Noise model to residuals. With autocorrelation, the forecaster can learn from the time series’ own history.## Examples of Autocorrelation

The Naive Model is an example of

positive,1-step autoregression. If today’s Sales is the point forecast for tomorrow’s, the more we sell today, the more we expect to sell tomorrow.## Eli Godratt’s Hockey Stick Effect

By contrast, what Eli Goldratt called the

hockey stick effecthas high sales on the last day of an accounting period followed by low sales on the first day of the next. By scrambling to close deals at the end of the reporting period, the sales reps are actually borrowing from future demand and thereby depleting the sales of the following days. This isnegativeautocorrelation, and the Naive Model would be no help.## Autocorrelation Induced by Averaging

Autocorrelation may be present in raw data but is also routinely introduced in processing it. For example, many published time series are moving averages of daily values. For example, the daily case counts of COVID-19 for the county of Santa Clara in California are reported as follows:

The daily cases are in light blue; the 7-day averages, in navy blue. The purpose is to provide daily values while filtering fluctuations, assumed to be noise. Advisor Perspectives does the same with monthly car sales data:

“Exponential,” in the legend, means that, in the value for a given month, the values for past months are assigned weights that decrease exponentially with lag. When working with such data, a forecaster needs to know that they are autocorrelated whether the individual values are or not, simply because consecutive values are averages of overlapping series of terms. If the y_i are i.i.d., the k-period moving averages z_i for i > k are

z_i = \frac{y_{i-k+1}+...+y_i}{k}

and if the y_i have 0 mean and a standard deviation of \sigma the covariance of z_{i+1} and z_i is

\begin{aligned}cov(Z_{i+1}, Z_i) &= \frac{1}{k^2}E\left [(Y_{i-k+1}+...+Y_i)\left (Y_{i-k+2}+...+Y_{i+1} \right ) \right ]\\& = \frac{1}{k^2}E\left [ \left (Y^2_{i-k+2}+...+Y^2_i \right ) \right ]\\& = \frac{\left ( k-1 \right )\sigma^2}{k^2}\end{aligned}

because the cross terms vanish: when i\neq j, E\left ( Y_i Y_j \right )= E\left ( Y_i \right )E\left ( Y_j \right ) = 0.

## Autocorrelation in Moving Range Charts

In SPC, the moving range part of the XmR/Process behavior chart has 1-step autocorrelation because successive values R_i =\left | Y_i - Y_{i-1} \right | and R_{i+1} =\left | Y_{i+1} - Y_i \right | are both functions of Y_i . It is ignored in the way control limits are set.

## Probability Forecasting with Autoregression

The naive forecast is a special case of autoregression of order p, known as AR\left ( p \right ). In general AR\left ( p \right ), the point forecast for Y_{i+1} is a linear combination of the last p past values:

Y_{i} = c + \sum_{k=1}^{p}\phi_kY_{i-k} + S_{i}

where c is a constant, the S_{i} are a white noise and the point forecast is \hat{Y}_{i} = c + \sum_{k=1}^{p}\phi_kY_{i-k}.

To fit this kind of model, the procedure is straightforward:

This yields a probability forecast for the next period. You can still look ahead k periods into the future, but it is not as simple as for the naive model where the noises just add up.

## A More General Model

In all of the above, we have assumed that the noises around point forecasts are i.i.d. Their distribution does not change with the period i, the value y_i , or the history \left (y_1,…,y_i \right ). These assumptions allow us to use

allthe available past residuals as a sample from which to infer the distribution of the noise. Where they don’t apply, the challenge is to select for each y_i , a sample \left (s_{i_1},...,s_{i_m} \right ) a sample that is representative of the noise around y_{i+1} .## A Few Assumptions

To make some headway, we retain the assumptions that the sales process has the following characteristics:

stationary: none of its characteristics change over time. Its expected value, the distribution of fluctuations around it, autocorrelations, etc., all remain the same.ergodic: you can infer the distribution of the values at one time from the variability of the series over time.These concepts are best understood by example.

## Stationary Time Series

Let’s say you have a group of machines on a production floor that are either running, ready, or down. They are all fitted with andon lights that are green when they are running, yellow when ready, and red when down. The dynamics of transitions between these states do not change hour-by-hour or day-by-day. Improvement efforts may reduce downtime within 3 to 6 months, but not by tomorrow. For daily operations, you consider this process to be

stationary.Sales are generally

notstationary but there are often ways to extract from sales data a stationary process that contains all the randomness:## Ergodic Time Series

At various times during the day, you may count how many Andon lights on machines are green, yellow, and red. If you see that, usually, about 15% of the machines are down, you may infer that any given machine is, on average, down 15% of the time. When you are doing this, you are assuming that snapshots of the states of multiple machines inform you about the behavior of a single machine over time. You want this to be true because taking snapshots is easier and faster than collecting machine histories. The same assumption is made in

work sampling.It is called

ergodicity. You would not guess it from the Wikipedia article about ergodicity, except for one sentence that actually states the converse: “the average behavior of the system can be deduced from the trajectory of a ‘typical’ point.” With your machines, you are actually deducing the difficult-to-observe trajectory of a single machine from an easy-to-observe average behavior of the entire shop but it is the same idea. With a time series, you start from the observed trajectory \left (y_1,…,y_i \right ) and deduce the distribution of Y_{i+1}.For

ergodicityto be a reasonable assumption, the data must be homogeneous. If the machines in your shop are the same models, operated and maintained by the same teams, you can make the assumption that their state transitions are ergodic but you shouldn’t if they include both machining centers and drill presses. With sales data, you wouldn’t want to mix items routinely made in high volumes with intermittent demand items.## Biau & Patra’s Method

Biau & Patra’s has the merit of being

published. Most of the available probability forecasting algorithms are proprietary to software companies.We present it here as an example, neither a recommendation nor an endorsement. Neither Biau nor Patra have published more on this subject since 2011. Biau’s later publications are about random forests; Patra has worked with software startups. The discussion is focused on the principles by which they choose samples from which to infer the forecast distribution, omits many details, and uses more graphics than their academic paper.

## Outline of the method

Biau & Patra explain an ensemble method for probability forecasting one period ahead, for a stationary, ergodic time series. It means that they forecast quantiles as weighted averages of forecasts obtained by multiple methods, based on sequences of preceding values of increasing lengths and nearest neighbors of these sequences in the data, in increasing numbers.

For the distribution of Y_{n} you start with just y_{n-1} and select samples of the form

J^{\left (1,ell \right )}_n = \left \{ t; 1<t<n,\textup{ such that }y_{t-1}\textup{ is among the }\ell\textup{ nearest neighbors of }y_{n-1}\right \}where the distance between y_{n-1} and y_t is \left | y_{n-1}-y_t \right |.

Then you select samples for pairs near \left (y_{n-2}, y_{n-1} \right ), triplets near \left (y_{n-3}, y_{n-2}, y_{n-1} \right ), etc.

For brevity, Biau & Patra use the notation Y^n_1 = (Y_1,...,Y_n) for a sequence of n random variables and likewise y^n_1 = (y_1,...,y_n) for a data sample drawn from Y^n_1.

We just need to remember that, in this context, the superscript n is an index and not a power of Y_1 or y_1. We also use NN as an abbreviation for “Nearest Neighbor.” Then

J^{\left (k,\ell \right )}_n = \left \{ t; k<t<n,\textup{ with }y^{t-1}_{t-k}\textup{ among the }\ell \textup{ NNs of }y^{n-1}_{n-k}\textup{ in the } y^k_1,...,y^{n-2}_{n-k-1} \right \}where the distance between sequences of length k is a norm \left \| y^{n-1}_{n-k} - y^{t-1}_{t-k} \right \| on vectors of dimension k.

Biau & Patra then estimate a quantile of the distribution of Y_{n}, given y^n_1 as a weighted average of the estimates from from each J^{\left (k,\ell \right )}_n.

With many technical details omitted, this is the general idea of their method. They do not justify it by math but by experimental results of workload forecasting in call centers.

## Application to Call Center Volumes

Biau and Patra applied their methods to 21 sets of daily call volumes at call centers tallied over 2 years, totaling 16,270 records. As shown below, the level of activity varied between call centers. It also exhibited periodic drops to near 0 that seem to reflect weekend lulls, and the time domains for available data were not the same for all centers.

## Pruning the Series

Biau & Patra did not use the complete series but instead a subset of 91 dates between 11/30/2006 and 12/13/2007, some consecutive and others with gaps of up to one month. This shows in the plots of the pruned series, the one-month gap being between the last two points:

Except for the desire to use dates for which volumes are available in all 21 series, they provide no justification for pruning the data in this particular way. Then they identified the J^{\left (k,\ell \right )}_n for 1\leq k \leq 14 and 1\leq \ell \leq 25.

## Examples of Nearest-Neighbor Samples

For k =1 and \ell = 25, it means selecting the 25 nearest neighbors of the last point in the pruned series. For k = 5 and \ell = 5, it means finding the 5 sequences of 5 consecutive points in the pruned series that are closest to the sequence of the last 5 points. This is what it looks like for Series 0:

The red dots in the top chart then become one sample from which to infer the distribution of y_{m+1}; the red dots in the third chart become another sample. If you only use the points in the top chart, you will get a 0 probability for the steep drops visible in the timeline. You begin to collect them with longer sequences.

If, as in the bottom chart, the sequence is too long, the sample is almost the entire data set, and not distinctive of the last point.

Biau & Patra collected 14\times 25 = 210 samples for each series, estimated quantiles with each sample, and combined them into a single estimate through

weightedaverages.The means for setting the weights are quite complex and given without a justification. The weight given to the sample J^{\left (k,\ell \right )}_n is a function of the pinball loss of the estimate for this sample.

These are the kind of technical details the reader can find in Biau & Patra’s paper. Our focus here has been on the concept of using multiple nearest-neighbor sequences to select samples from which to infer distributions before aggregating these distributions into a composite estimate.

## Other Methods

Biau & Patra compare their results with a number of pre-existing methods like Quantile Autoregression (QAR) for 10%, 50% and 90% quantiles in terms of the Pinball Loss and the Ramp Loss — that is, “the empirical fraction of quantile estimates which exceed the observed values y_{m+1}.”

Then they compare their results for the median, the 50% quantile with

pointforecasting methods like Moving Averages (MA), Autoregression (AR), and others, using metrics for point forecasting, like Average Absolute Error or Mean Absolute Percentage Error (MAPE). The paper is from 2011, and the Total Percentile Error (TPE) emerged later.They conclude that, on the call center data, their method outperformed the others. The claims, however, would be more compelling with an explanation of the pruning method used to reduce all the time series to 91 points.

## Conclusions

Probability forecasting has taken over in weather reports and in politics. TV and radio audiences hear that there is “a 35% probability that it will rain today” and it supports their decision whether to pack raincoats. Prior to elections, analysts like Nat Silver announce that a candidate has “a 60% probability of winning.” This supports decisions by campaigns on strategy and by bookmakers on betting odds.

In sales forecasting for Manufacturing, it has yet to make its mark. It separates cleanly the roles of forecasting from that of making supply chain management decisions based on the forecast, assuming managers know what to do with probability distributions.

A sign that not many do is that the companies offering probability forecasting tools to manufacturers do not make this capability the focus of their presentation:

Probability forecasting produces a richer output than point forecasting. Exploiting that output wisely, however, also requires more sophistication in, say, Sales & Operations Planning (S&OP) than the methods developed in the 1980s. In other words, for it to take root, other parts must move as well.

## References

Business Forecasting: The Emerging Role of Artificial Intelligence and Machine Learning,Wiley, ISBN: 9781119782582Handbook of Quantile Regression,CRC Press, ISBN: 978-1-4987-2528, Chapter 17, pp. 293-330Forecasting: Principles and Practice,Otexts, ISBN: 9780987507105Principles of business forecasting (2nd ed.). Wessex Press Publishing Co. ISBN: 9780999064917Sequential Quantile Prediction of Time Series. Information Theory, IEEE Transactions on. 57. 1664 – 1674. 10.1109/TIT.2011.2104610.Quantile Regression. Cambridge University Press, Cambridge, ISBN: 9780521608275#probabilityforecast, #salesforecast, #demandforecast

## Share this:

## Like this:

RelatedBy Michel Baudin • Data science • 0 • Tags: Demand Forecast, Probability Forecast, Sales forecast