# Sales Forecasts – Part 4. Generating Point Forecasts with Trends and Seasonality

This fourth post about sales forecasts addresses what you actually start with — that is, visualizing the time series of historical sales and generating point estimates for the future. Then you analyze the residuals to determine probability forecasts.

What prompted me to review this field is the realization, based on news of the M5 forecasting competition, that it has been the object of intense development in recent years. Some techniques from earlier decades are now accessible through open-source software that can crunch tens of thousands of data points on an ordinary office laptop.

Others are new developments. Thanks to Stefan de Kok, John Darlington, Nicolas Vandeput, and Bill Waddell for comments and questions on the previous posts that made me dig deeper.

## The Starting Point

The first step in analyzing trends and periodic variations in a time series is plotting it. You summarize and smooth it at different levels to make salient features stand out. Often, but not always, this tells you everything the data can tell you, but you can also misread it.

### The Backstory of the Numbers

The numbers have a backstory. It makes a difference whether they are sensor readings, event counts, monetary transactions, or poll results. In sales forecasts, you expect to see a baseline of activity, a trend, and periodic variations. The trend can be growth or contraction, and not necessarily linear. Periodicity can be by day of the week, day of the month, or season, and possibly multi-year cycles. Different periodic variations can coexist, and cycle durations vary.

Often, the plot confirms your expectations. Sometimes, it refutes them, and, sometimes, it makes you see patterns that aren’t there. That’s why you should use visualizations to explore your time series, yet be wary of hasty conclusions. Visually making sense of the sales history for one product is usually feasible. Doing it for 3,000 products is another challenge that calls for more automation.

### Multiple Periods

The sales of many goods and services vary not just with the time of year but also by day of the month, day of the week, and even time of day. Businesses scrambling to make deliveries in the final days of each month cause a surge in sales, followed by a dip at the start of the next month.

As society organizes around the 7-day cycle of a week, business-to-business sales ebb on weekends while sales to consumers peak. Sales forecasts must reflect these phenomena.

When you plot a time series of daily sales, the existence of variations by day of the week and by month, the presence of an end-of-the-month or end-of-the-quarter peak, and the effect of national holidays may be visible. Whatever periodicity you suspect, you can test for its presence in the training set and quantify it. Detecting periodic variations that are not visually obvious is another matter.

#### Trend versus Plug

The first real challenge I ran into with sales forecasts was for “turns” in the semiconductor industry — that is, sales the company booked and shipped within the same month. The official sales forecast was based on management’s insistence on selling $1M of chips the next month. Since long-term orders added up to $520K, the only way to make the numbers add up was to forecast $480K for turns. Since turns are small orders received from a large base of customers, they lend themselves to trend analysis. In a few minutes, I fit a straight line to the past 18 months of data by least squares. It yielded a sales forecast of $350K ± $50K. The following month, turns came in at $360K. Unsurprisingly, even a crude analysis of historical data outperformed wishful thinking.
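Fitting such a trend line takes only a few lines of Python. The monthly figures below are made up for illustration, not the original data:

```python
import numpy as np

# Hypothetical monthly "turns" in $K for the past 18 months
# (illustrative values only, not the original data)
months = np.arange(18)
turns = np.array([210, 225, 215, 240, 250, 245, 260, 270, 265,
                  280, 290, 285, 300, 310, 305, 320, 330, 325])

# Least-squares straight line: turns ≈ slope × month + intercept
slope, intercept = np.polyfit(months, turns, 1)

# Point forecast for next month, with a rough band from the residual spread
forecast = slope * 18 + intercept
residuals = turns - (slope * months + intercept)
band = 2 * residuals.std(ddof=2)
print(f"${forecast:.0f}K ± ${band:.0f}K")
```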

The lesson from this case was that the first requirement on sales forecasts is objectivity. Advanced analytics are useless if forecasters are punished for producing numbers that the boss doesn’t want to hear.

#### Unseen Spikes

The factory was cutting heat-shrink braided sleeves for automotive wire harnesses and, for one high-volume product, management perceived the demand as unpredictable; no one had plotted the flow of orders over time. Plotting it revealed next to no activity on most days, with spikes of varying heights occurring like clockwork every other Wednesday:

Like Jonesy the sonar operator who tracks a “silent” sub in The Hunt for Red October, you exclaim “That’s got to be man-made.”

Further investigation revealed that the spikes were orders placed by a single buyer at a distributor, who accounted for 90% of the quantities ordered for this product. This meant that the factory could plan around these spikes, and possibly work with the buyer to smooth them out. The whole activity of the line dedicated to this product depended on the presence of this individual and was unrelated to any fluctuation in the consumption of the product by end-users.

The Kaggle site offers a number of datasets on which to try various approaches. Of particular interest for sales forecasts is a 5-month list of online transactions at an e-commerce cosmetics retailer.

##### Online Sales of a Cosmetics Retailer

It contains a record for each action taken online by a customer, including placing items in a shopping cart and, sometimes, buying them. It can be used for many purposes. By looking only at the “purchase” records and aggregating the sales amounts by day, you get the following picture of the activity:

The chart shows clear seasonal peaks preceding the events to which they may be imputed, with a lead presumably due to the retailer’s order fulfillment lead time. The data also show apparent weekly cycles but no end-of-the-month rush.
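The aggregation behind this kind of chart can be sketched in plain Python. The field names below (`event_time`, `event_type`, `price`) are assumptions for illustration, not necessarily the actual Kaggle schema:

```python
from collections import defaultdict
from datetime import datetime

# Toy event records mimicking the structure of such a dataset; the field
# names are assumptions, not the actual schema
events = [
    {"event_time": "2020-02-01 10:15:00", "event_type": "cart", "price": 12.50},
    {"event_time": "2020-02-01 10:20:00", "event_type": "purchase", "price": 12.50},
    {"event_time": "2020-02-01 18:03:00", "event_type": "purchase", "price": 7.99},
    {"event_time": "2020-02-02 09:41:00", "event_type": "view", "price": 3.20},
    {"event_time": "2020-02-02 11:05:00", "event_type": "purchase", "price": 3.20},
]

# Keep only the purchases and aggregate sales amounts by calendar day
daily_sales = defaultdict(float)
for e in events:
    if e["event_type"] == "purchase":
        day = datetime.strptime(e["event_time"], "%Y-%m-%d %H:%M:%S").date()
        daily_sales[day] += e["price"]

for day in sorted(daily_sales):
    print(day, round(daily_sales[day], 2))
```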

##### Which Variations Matter to the Supplier

The supplier clearly needs to anticipate the seasonal peaks in production planning. If the retailer receives weekly deliveries, the weekly cycles don’t matter to the supplier; if daily deliveries, then these cycles do matter.

##### Need to Dig Further

However informative this visualization is, it calls for further mining of the transaction data to know, for example, which products are involved in the peaks and valleys, and in what quantities.

A time series plot can be a Rorschach test: different people will see different things in it based on their assumptions, prior knowledge, or biases. Visualizing the data is essential, but what we see can make us overconfident in our sales forecasts.

#### Oscar Ceremony Viewership

TV viewers don’t spend money to view the Oscar ceremony. They are, in fact, the product that ABC sells to advertisers, but forecasting their number is the closest the network can come to sales forecasts. Back in 2019, Mark Graban plotted the number of viewers of Oscar ceremonies from 1991 to 2019 on an XmR chart from SPC, and concluded that it showed a step-drop in 2006 but had been steady since.

Ryan Casey, applying LOESS regression, concluded that it “started dropping around 2000, leveled off in 2005, then started dropping again around 2015.” He didn’t provide a chart.

Looking at the same data and enlarging the scale of the y-axis, all I saw was a linear declining trend with a few outliers, most of which could be explained by the specifics of each year.

The key here is the backstory. The data are not measurements on a manufactured part that you are trying to keep on target. They are estimates for the entire US population based on a poll of 5,000 households. The tools of SPC are not applicable. Is there a plateau from 2005 to 2015? If you draw it, you may see it, but you should wonder what could explain this plateau and whether the more complex model isn’t just overfitting.

#### Cyclical Airliner Sales

In 2009, I plotted the sales of two competing airliners since 1967, the Boeing 737 and the Airbus A320. They appeared to show 3- to 5-year boom-and-bust cycles that I advised my client to anticipate going forward:

Revisiting the same series 10 years later, I was stunned by how mistaken my sales forecasts had been:

##### Extrapolation Mistake

It was a mistake in 2010 to assume that the 3- to 5-year cycles would continue. It is most likely a mistake in 2019 to assume that the linear growth of the past 9 years will keep going forever. The chart is insufficient as a basis for long-term capacity planning.

##### Checking the Backstory

If you consider the backstory of the data, you may wonder whether it makes sense to place data from 1967 to 2018 on the same chart. While the products have kept the same names and outer shapes, everything else has changed, from the engines to fuselage materials and cabin equipment. As a result, they fly more passengers farther on less fuel. Meanwhile, the world’s population has gone from 3.5 billion people in 1967 to 7.6 billion in 2018.

The cycles through 2010 are related to the general economic growth in the countries where the buying airlines are located, combined with multi-year delivery lead times. Airlines overbuy planes based on their own optimistic sales forecasts. Three years later, as the expected growth has failed to materialize, they have a surplus of planes and stop buying. Another three years later, activity has picked up, the airlines are short of planes, and they overbuy again. Why has this pattern been broken?

I asked an industry insider, who said: “The business has changed on both the supplier and the customer sides. By improving manufacturing operations, the aircraft makers have reduced their order fulfillment lead times to 10 months, making it easier for airlines to forecast. In addition, the airlines have been applying data science to improve their short-term ability to fill planes with passengers and their sales forecasts.”

#### Fake Periodicity

The early cycles were real to manufacturers in the aircraft supply chain, who saw the demand for their products double or drop by half within one year, but this does not necessarily mean that the series had real cyclical dynamics. Observe customers entering a retail store on a busy Saturday afternoon in a shopping district. The activity inside the store appears to be an alternation of feast and famine.

There is a crowd, and then no one for an extended time, and then a crowd again. It looks periodic but it isn’t. If customers arrive independently of each other at a constant mean rate, the time between two consecutive customers follows an exponential distribution, with many short intervals creating the crowds, and a few long ones creating the idle periods between crowds.

If you look at the arrivals per unit time through the afternoon, their ups and downs will look like periodic variations, even though the process that generated the data has no periodic component. As a consequence, any periodic model you fit to a training dataset will have no predictive value on a testing set, as can be seen in the following plots of cumulative customer arrivals in two 100-minute periods.
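This is easy to verify by simulation. The sketch below generates a Poisson arrival process, which has no periodic component at all, and bins the arrivals per 10-minute interval; the bin counts rise and fall like cycles:

```python
import random

random.seed(42)

# Customers arriving independently at a constant mean rate form a Poisson
# process; interarrival times are exponential, and nothing is periodic
mean_rate = 1.0  # customers per minute
minutes = 100

arrival_times = []
t = 0.0
while t < minutes:
    t += random.expovariate(mean_rate)  # exponential interarrival time
    if t < minutes:
        arrival_times.append(t)

# Count arrivals in 10-minute bins: the ups and downs look like cycles
bins = [0] * (minutes // 10)
for t in arrival_times:
    bins[int(t // 10)] += 1

print(bins)  # uneven counts from a process with no periodic component
```

Fitting a periodic model to one afternoon of such data would only capture noise, which is why it generalizes so poorly to the next afternoon.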

### Periodic Variations with Multiple Frequencies

The inconsistencies of our calendars make these variations more complex than they look.

#### Variations in Period Length

Months vary in length; March is almost 11% longer than February. Neither months nor years are an integral number of weeks. And we need to distinguish between business days and calendar days. Consumers are active on all calendar days but many businesses aren’t, and the end-of-the-month rush, in particular, is a function of the number of business days remaining until the end of the month.
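A helper for the business-days logic can be sketched as follows; it counts Monday through Friday only and, as a simplification, ignores national holidays, which a real business calendar would also exclude:

```python
from datetime import date, timedelta

def business_days_left(d: date) -> int:
    """Count business days (Mon-Fri) from d through the end of its month,
    inclusive. A sketch: ignores national holidays."""
    count = 0
    cur = d
    while cur.month == d.month:
        if cur.weekday() < 5:  # Monday=0 ... Friday=4
            count += 1
        cur += timedelta(days=1)
    return count

print(business_days_left(date(2020, 2, 24)))  # → 5 (Mon Feb 24 - Fri Feb 28)
```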

#### Variations in Holidays

All countries also have holidays, and they affect sales. Some of them occur at fixed dates, like Christmas or New Year’s Day; others vary. Mother’s Day, in the US, is the 2nd Sunday in May — which fell on May 9 in 2021 and on May 14 in 2023. Other than blurring the picture a bit, the oscillations of the exact date of Mother’s Day won’t affect production plans in a major way. More problematic holidays that impact business include the Chinese New Year, which was on February 20 in 1985 and on January 22 in 2004, and Ramadan, which has moved from November in 2000 to April in 2021.

Multi-year cycles also differ from the other variations in that they are not driven by time but by the internal dynamics of the series. The summer peak in bathing suits and the Christmas peak in toys are based on time. The booms and busts of the economy are not on a fixed calendar and are driven by events in the economy itself, like the bursting of a speculative bubble in the stock market. Occurrence is predictable but timing is not, and it is qualitatively different from time-driven periodic variations.

## Holt-Winters

The Holt-Winters approach comes up frequently in discussions of sales forecasts as a model of seasonality. It dates back to 1960 and is a refinement by Peter Winters of an exponential smoothing technique developed by Charles C. Holt.

At each time, conditionally on the past, it represents the future of the time series as the sum of a level, a trend, a seasonal component, and noise. The trend is linear and the seasonal component can be either additive or multiplicative. For example, if, in the two weeks before Christmas, we sell about 40,000 more units, it’s additive; if we sell 30% more, it’s multiplicative.

### Seasonality in Holt-Winters

“Seasonality,” in Holt-Winters, is any periodic variation, but there is only one in the model. Inside this period, there may be subperiods, but the full period must be divisible into these subperiods. Since the days of a week don’t fall on the same calendar days in consecutive years, weeks within years don’t qualify. Months obviously don’t either.

Generally, I prefer to reserve the term “seasonality” for variations in level of activity that are tied to the time of year, as are frequently present, not just in manufacturing but in business in general. This is simply because human life is organized around the seasons of the year. Air conditioners sell more in summer than in winter; prestige cosmetics around Mother’s Day and Christmas; toys around Christmas, etc.

### Noise and Residuals

The term “noise” also implies that the values are independent and identically distributed, with zero mean. I prefer to call them residuals, a term that carries no such assumptions. A residual is simply what remains of your time series after you have taken out level, trend, and periodic variations, by subtraction in an additive model or division in a multiplicative one. These residuals may have a richer structure than just noise, and they are what you apply probability forecasting to.

### Exponential Smoothing

The Holt-Winters model is a refinement on exponential smoothing, where the estimate for time n+1, knowing \left (y_1,..., y_n\right ) is of the form

\hat{y}_{n+1|n} = \alpha y_n + \left ( 1-\alpha \right )\hat{y}_{n|n-1}

If \alpha = 1, it’s the naive sales forecast \hat{y}_{n+1|n} = y_n and, if \alpha = 0, \hat{y}_{n+1|n} = \hat{y}_{n|n-1} and the forecast never moves, which works when the series is of the form Constant + Noise. With 0 < \alpha < 1, it is somewhere between these two extremes.

The reason it’s called exponential is that, when you express the estimate in terms of the \left (y_1,..., y_n\right ), it becomes

\hat{y}_{n+1|n} = \sum_{j= 0}^{n-1}\alpha\left (1 - \alpha \right )^{j}y_{n-j} + (1- \alpha)^{n}\ell_0

where \ell_0 is an initial value with vanishing effect as n grows and what remains is an exponentially weighted average of the last n values of the series.
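The recursion is short enough to sketch in Python (the post’s actual computations used R; the series and initial level below are made up for illustration):

```python
def exponential_smoothing(y, alpha, level0):
    """One-step-ahead forecasts: ŷ_{t+1|t} = α·y_t + (1-α)·ŷ_{t|t-1},
    starting from the initial level ℓ0."""
    forecast = level0
    forecasts = []
    for obs in y:
        forecasts.append(forecast)   # ŷ_{t|t-1}, before seeing y_t
        forecast = alpha * obs + (1 - alpha) * forecast
    forecasts.append(forecast)       # ŷ_{n+1|n}, the point forecast
    return forecasts

series = [10, 12, 11, 13, 12, 14]
print(exponential_smoothing(series, alpha=0.5, level0=10))
```

With alpha=1 this reproduces the naive forecast y_n; with alpha=0 it never moves off the initial level.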

If we assume that the residuals \varepsilon_{n+1\mid n} = y_{n+1} -\hat{y}_{n+1\mid n} are independent and identically distributed with zero mean and a standard deviation of \sigma, and look ahead h steps instead of just 1, we find that

\hat{y}_{n+h\mid n}= E\left (\hat{y}_{n+h\mid n+h-1} \right ) = \hat{y}_{n+1\mid n}

and that the standard deviation of the residual \varepsilon_{n+h\mid n} = y_{n+h} -\hat{y}_{n+h\mid n} increases like  \sqrt{h}:

\sigma_{n+h\mid n} = \sqrt{\left (\alpha^2\times\left ( h-1 \right ) +1 \right )}\times\sigma
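This formula is easy to check by simulation: under these assumptions, the h-step error works out to \varepsilon_{n+h} plus \alpha times the sum of the h-1 intermediate one-step errors. A Monte-Carlo sketch, with illustrative parameter values:

```python
import random, math

random.seed(0)
alpha, sigma, h = 0.3, 1.0, 5
trials = 100_000

# h-step error: e_{n+h} + α·(e_{n+1} + ... + e_{n+h-1}), with e_i iid N(0, σ²)
errors = []
for _ in range(trials):
    e = [random.gauss(0, sigma) for _ in range(h)]
    errors.append(e[-1] + alpha * sum(e[:-1]))

empirical = math.sqrt(sum(x * x for x in errors) / trials)
theoretical = math.sqrt(alpha**2 * (h - 1) + 1) * sigma
print(round(empirical, 3), round(theoretical, 3))  # the two should agree closely
```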

#### Example

Exponential smoothing is simpler than autoregression (AR) because it has only one parameter, \alpha, instead of a weight for each lag. On the other hand, it’s intended for a series with no trend and no periodicity. As sales usually have trends and seasonal variations, we illustrate this concept with a simulated example: a series of 2,000 points with autocorrelation of range 5. We used Hyndman’s forecast package in R to fit a model to the first 1,980 points, and kept the last 20 as a testing set. The following figure shows the testing set with the predicted value and the tail of the training set:

The dark purple area is supposed to hold 80% of the values, and the light purple area 95%. While the testing set data are all within the dark purple area, this one-parameter model does not do much as a predictor!

### Holt-Winters

Holt-Winters generalizes exponential smoothing to series that have both a trend and periodicity, which is more realistic for sales data.

The formula is simpler to explain when predicting just the next step. We have a series \left ( y_1,...,y_n \right ) and we are looking for a point estimate \hat{y}_{n+1|n}. Then we refine it to look ahead h > 1 steps — that is, to estimate \hat{y}_{n+h|n}.

In Hyndman & Athanasopoulos’s notations, the 1-period lookahead estimation in the Holt-Winters additive method reduces to:

\hat{y}_{n+1|n} = \ell_n + b_n + s_{n+1-m}

where

• \hat{y}_{n+1|n} is the estimate of y_{n+1} knowing \left (y_1,..., y_n\right )
• \ell_n and b_n are the estimates of level and slope going forward from n. If we look ahead h periods, the non-seasonal component becomes \ell_n + h\times b_n
• m is the number of steps in the period. If the period is 1 year and you have daily data, m = 365
• s_{n+1-m} is the seasonal adjustment from m intervals before n+1.

The \ell_n, b_n, and s_n are estimated from prior values as a weighted average of a value derived from the last observation y_n and one derived exclusively from the previous estimates \ell_{n-1}, b_{n-1}, and s_{n-m}:

• \ell_n = \alpha\left ( y_n - s_{n-m} \right ) + \left (1 - \alpha\right )\left ( \ell_{n-1} + b_{n-1} \right )
• b_n = \beta\left ( \ell_n - \ell_{n-1}\right )+\left (1 - \beta\right )b_{n-1}
• s_n = \gamma\left ( y_n - \ell_{n-1} - b_{n-1} \right ) + \left (1 - \gamma\right )s_{n-m}

where

• \alpha,  \beta, and   \gamma are coefficients fitted to the \left (y_1,..., y_n\right )
• \left ( y_n - s_{n-m} \right ) is the seasonally adjusted observation.
• \left ( \ell_{n-1} + b_{n-1} \right ) is the non-seasonal forecast
• s_n is the seasonal adjustment.

Because it has three parameters, each used in some form of exponential smoothing, Holt-Winters is described as “triple exponential smoothing.”

When you look h \leqslant m steps into the future rather than just 1, the induction formula becomes

\hat{y}_{n+h \mid n}=\ell_{n}+h b_{n}+s_{n+h-m}

For example, if m = 365 days, as long as h \leqslant 365, or at most a year ahead, the periodic term is s_{n+h-m}, from exactly a year before n+h.

When h > m, you have to go back more than one period for the periodic term. Then you set k = \left \lfloor \frac{h-1}{m} \right \rfloor to get the total number of completed periods between n and n+h, and

\hat{y}_{n+h \mid n}=\ell_{n}+h b_{n}+s_{n+h-m(k+1)}

If h = 366 , it will be from 2 years before n+h , etc.
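These recursions can be sketched in Python. The initialization below is deliberately crude (the first two periods set the starting level, trend, and seasonal indices); real software such as ets fits these and the smoothing parameters to the data:

```python
def holt_winters_additive(y, m, alpha, beta, gamma):
    """Additive Holt-Winters recursions, as a sketch with crude initialization."""
    level = sum(y[:m]) / m                                   # ℓ0: mean of period 1
    trend = (sum(y[m:2 * m]) - sum(y[:m])) / (m * m)         # b0: period-over-period slope
    season = [y[i] - level for i in range(m)]                # s: first-period deviations

    for t in range(m, len(y)):
        last_level, last_trend = level, trend
        # ℓ_t = α(y_t - s_{t-m}) + (1-α)(ℓ_{t-1} + b_{t-1})
        level = alpha * (y[t] - season[t % m]) + (1 - alpha) * (last_level + last_trend)
        # b_t = β(ℓ_t - ℓ_{t-1}) + (1-β)b_{t-1}
        trend = beta * (level - last_level) + (1 - beta) * last_trend
        # s_t = γ(y_t - ℓ_{t-1} - b_{t-1}) + (1-γ)s_{t-m}
        season[t % m] = gamma * (y[t] - last_level - last_trend) + (1 - gamma) * season[t % m]

    def forecast(h):
        # the % m index wraps back k = ⌊(h-1)/m⌋ periods: s_{n+h-m(k+1)}
        return level + h * trend + season[(len(y) + h - 1) % m]

    return forecast

# Toy series with period m = 7 and an upward trend, for illustration
data = [i * 0.5 + [5, 3, 4, 6, 8, 12, 9][i % 7] for i in range(28)]
fc = holt_winters_additive(data, m=7, alpha=0.3, beta=0.1, gamma=0.1)
print([round(fc(h), 1) for h in range(1, 8)])
```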

Unlike the models of periodic variations from waves, telecommunications, or control systems, Holt-Winters is free of sines and cosines. The periodic component simply takes the same value at times i, i+m, i+2m, ..., i+km,..., and the waveform doesn’t matter.

#### Example of Daily Cosmetics Sales

We tried this approach on the cosmetics online daily sales data from Kaggle visualized above, cutting off the training set on February 15, 2020, and using the data from February 15 to 29 as a testing set.

##### Using Hyndman’s ets Function

We applied Rob Hyndman’s ets function in the forecast package in R, first to the complete series, and then to a series where we trimmed off the large peaks and valleys that we could impute to external events like holidays.

As discussed above, the Holt-Winters method deals with one and only one periodic variation, and you have to specify its length. In this case, it is one week, and you can verify this by tallying the sales by day of the week.

The ets software can automatically choose whether the level, trend, and seasonal components are additive terms or multiplicative factors. “(M,N,M)” means that the level and seasonal components are both multiplicative and that it found no trend. The holiday peaks and valleys did not prevent the software from detecting the seasonal component; the main difference apparent after trimming them off is more precise estimates.

##### The Shape of the Purple Zones

In the first chart, based on the complete Cosmetic Daily Sales dataset, the purple zones for the 80% and 95% confidence intervals in the sales forecasts are clearly wider above the black prediction line than below. This is because the model is multiplicative rather than additive. In additive models, the residuals are commonly treated as centered Gaussians and the boundaries of the purple zones are set at the estimate \pm 1.28\times\sigma and \pm 1.96\times\sigma.

Multiplicative models are usually built from additive models of the logs of their factors, with residuals that follow a lognormal distribution, which is skewed. This has two consequences:

1. The point sales forecasts are medians rather than means, which makes them biased. If the log of the forecast on a given day is a Gaussian \operatorname{N}\left(\mu, \sigma^{2}\right), then the forecast itself has median \operatorname{exp}\left(\mu\right) and mean \operatorname{exp}\left(\mu + \frac{\sigma^2}{2}\right). In other words, the mean of the exp is larger than the exp of the mean.
2. The purple zones are asymmetric.
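The gap between the median and the mean is easy to demonstrate by simulation, with illustrative values of \mu and \sigma:

```python
import random, math

random.seed(1)
mu, sigma = 2.0, 0.8   # parameters of the log of the forecast (illustrative)
n = 200_000

# Lognormal samples: exp of a Gaussian
samples = sorted(math.exp(random.gauss(mu, sigma)) for _ in range(n))

median = samples[n // 2]
mean = sum(samples) / n

print(round(median, 2), round(math.exp(mu), 2))               # median ≈ exp(μ)
print(round(mean, 2), round(math.exp(mu + sigma**2 / 2), 2))  # mean ≈ exp(μ + σ²/2)
```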

Basing production plans on median sales forecasts means giving up on the few large orders that occasionally occur, which doesn’t sound like a good idea. This is only an issue, however, if the point forecast is the final product given to the planner. A planner who instead receives a probability sales forecast can use the distribution to find the expected loss associated with every production quantity and decide accordingly.

### Software for Holt-Winters

In R, you have the HoltWinters function in the stats package and the ets function in the forecast package.

## Other approaches to Seasonality

Holt-Winters deals with one and only one periodic component in the series, and you have to tell it the period. To deal with multiple, known periods, you could add more components to the model and, for holidays that occur at varying dates, they could be a function of the number of days to or from that date. This would, however, put you outside the range of the available Holt-Winters software.

It is a different problem when you are dealing with time series with multiple, unknown periodic components. As discussed above, multiple periodicities do occur in sales, but they only affect manufacturing planning and scheduling if they are long enough: variations in sales volume within a day don’t, but monthly variations do. Furthermore, the periodicity that matters is usually visible in a simple plot.

Sales, however, are neither the first nor the only kind of time series to be objects of analysis and forecasting. Historically, methods that analyzed time series for unknown periodic patterns expressed in sines or cosines came first, particularly in telecommunications. This analysis in the frequency domain dominated until the 1980s, when the focus shifted to working directly in the time domain. The Holt-Winters method is from 1960.

### Seasonal Variations in Mortality

Series of Events in Manufacturing digs into the way the US CDC models seasonal variations in mortality with the Farrington Surveillance Algorithm. Pre-pandemic, US mortality had a pointy peak in January and a wide valley in the summer months. It is otherwise not affected by the day of the week or of the month.

If \mu\left ( t \right ) is the death rate in week t, the model they fit to the data is of the form:

log\left [ \mu\left ( t \right ) \right ] = \alpha + \beta\times t + \gamma\times sin \left(\omega t + \phi \right )

The sin term in log\left [ \mu\left ( t \right ) \right ] turns into a factor of the form exp\left ( sin \right ) in \mu\left ( t \right ), which is suitably pointy. What’s remarkable about this model of seasonality is that it doesn’t require anything beyond high school math, just sin, exp, and log.
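A sketch of such a fit in Python, on simulated data: since \gamma\times sin\left(\omega t + \phi \right ) = A\times sin\left(\omega t\right) + B\times cos\left(\omega t\right), the model is linear in (\alpha, \beta, A, B) and ordinary least squares recovers it. This is a simplification; the CDC’s actual procedure is a quasi-Poisson regression with further refinements.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated weekly death counts with a linear trend and an annual cycle in
# the log rate: log μ(t) = α + β·t + γ·sin(ω·t + φ), ω = 2π/52
weeks = np.arange(260)  # five years of weekly data
omega = 2 * np.pi / 52
true_log_mu = 4.0 + 0.001 * weeks + 0.3 * np.sin(omega * weeks + 1.0)
y = np.log(rng.poisson(np.exp(true_log_mu)))

# γ·sin(ωt + φ) = A·sin(ωt) + B·cos(ωt): linear in (α, β, A, B)
X = np.column_stack([np.ones_like(weeks, dtype=float), weeks,
                     np.sin(omega * weeks), np.cos(omega * weeks)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

alpha, beta, A, B = coef
gamma = np.hypot(A, B)      # amplitude of the seasonal term
print(round(alpha, 2), round(beta, 4), round(gamma, 2))
```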

### Periodic Variations and Spectra

The theory of time series started in telecommunications. In physics, periodic variations are oscillations and waves. They are viewed as the superposition of variations in multiple frequencies and amplitudes, a priori unknown.

Blackman & Tukey, back in 1959, bragged about the discovery of a low-frequency peak in ocean waves induced by “a swell 1 mm high and 1 km wide, 10,000 km away in the Indian Ocean.”

Human speech is a sound wave and therefore a time series at any point it passes through, and its power spectrum decomposes it into a range of frequencies that is a signature of the person’s voice and shows the bandwidth a channel needs to transmit it:

Priestley, in 1981, still organized the whole theory of time series around spectral analysis but, by 2017, Shumway & Stoffer treated it as a subtopic for time series with a cyclical or periodic component, analyzed in the frequency domain as opposed to the time domain. In 2019, writing about forecasting time series, Hyndman & Athanasopoulos ignored it entirely and worked exclusively in the time domain.

Work in the frequency domain lives on in signal processing and control systems but not in sales forecasts, even for fitting models with periodic variations, possibly because sales forecasting is a domain with smaller datasets where the possible cycles of interest are few and known.