Series Of Events In Manufacturing

Factories are controlled environments, designed to put out consistent products in volumes according to a plan. Controls, however, are never perfect, and managers respond to series of events of both internal and external origin.

An event is an instantaneous state change, with a timestamp but no duration. An operation on a manufacturing shop floor is not an event, but its start and its completion are. Machine failures and quality problem reports from customers are events; so are customer orders and truck arrivals.

For rare events, you measure the times between occurrences and mark each occurrence on a timeline. For frequent events, you count occurrences per unit of time and plot these counts over time. On individual machines, you record the times between failures; for incoming online orders for a product, you count how many you receive each hour or each day. On average, about every 200,000 years, the Earth’s magnetic field reverses and compass needles change direction. That is a rare event, recorded in basalt oozing from the mid-Atlantic ridge as it cools down, as on a magnetic tape. (See Paleomagnetism.)

[Figure: Series of events (rare). Earth Magnetic Field Reversals (image: Chmee2)]

An American TV viewer tuning in to the Academy Awards, once a year, is a frequent event: what we care about is how many millions do it, not how much time elapses between two tune-ins.

[Figure: Series of events (frequent)]

Responding to series of events is central to many businesses. Stores respond to customers coming in, airlines to passengers with reservations showing up to fly (or not), maintenance crews to machine failures, social networks to subscriber activities, and so on. The challenges posed by the randomness of the arrival and service processes have given rise to queuing theory and to commonly used results like Little’s Law, about lead times, throughput, and inventory in steady state, or Kingman’s Rule, about the way lead times explode when systems saturate.
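
For reference, and in their textbook forms rather than anything specific to these examples: with L the average number of items in a system, λ the throughput, and W the average time in the system, and with ρ the utilization, c_a and c_s the coefficients of variation of the interarrival and process times, and τ the mean process time, Little’s Law and Kingman’s approximation for the mean queuing time read

L = \lambda W \qquad\qquad W_q \approx \left(\frac{\rho}{1-\rho}\right)\left(\frac{c_a^{2}+c_s^{2}}{2}\right)\tau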

We are concerned here with a different and simpler topic: monitoring series of events to detect changes in their rates of occurrence and tell fluctuations apart from shifts with assignable causes. If the arrival rate of quality problem reports from the field suddenly doubles, a sophisticated analysis is not needed. If it increases by 10%, the conclusion is not so obvious.

The Old SPC Approach

While still taught, the SPC approach to this problem is not widely used in industry, and it is not used today by organizations that actually monitor series of events, like the U.S. Centers for Disease Control and Prevention (CDC). These charts are mentioned here because they routinely come up in discussions of the subject in manufacturing, but they are truly only of interest as historical background.

In his system of Control Charts from the 1920s, Walter Shewhart included three about attribute characteristics that are counts of events:

  • The p-chart of the fraction defective by lot.
  • The c-chart of defect counts in lots of constant size.
  • The u-chart of defect counts per unit in lots of varying sizes.

They are described in Chapter 5 of Douglas Montgomery’s Introduction to Statistical Quality Control, pp. 147-200.
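
As a refresher rather than a recommendation, the arithmetic behind these charts is simple. The sketch below, in R and on made-up defect counts, computes the classic 3-sigma limits of a c-chart, using the Poisson assumption that the variance equals the mean:

# Made-up defect counts in 10 consecutive lots of constant size
defects <- c(4, 7, 3, 5, 6, 2, 8, 5, 4, 6)
c_bar <- mean(defects)                    # center line
ucl   <- c_bar + 3 * sqrt(c_bar)          # upper control limit
lcl   <- max(0, c_bar - 3 * sqrt(c_bar))  # lower control limit, floored at zero
c(center = c_bar, LCL = lcl, UCL = ucl)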

A slightly more modern approach, from the early 1940s, is the XmR chart, also known as the Process Behavior Chart, which is heavily promoted by Don Wheeler.
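
For completeness, a similar sketch of XmR limits on made-up individual values; 2.66 is the standard constant 3/d2, with d2 = 1.128 for moving ranges of two observations:

x      <- c(52, 47, 53, 49, 51, 55, 48, 50, 54, 49)  # made-up individual values
mr     <- abs(diff(x))                               # moving ranges
x_bar  <- mean(x)
mr_bar <- mean(mr)
c(LNPL = x_bar - 2.66 * mr_bar,                      # natural process limits
  UNPL = x_bar + 2.66 * mr_bar)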

The CDC’s Excess Deaths Model

But enough about the old stuff… Let’s see what we can learn from modern data science as practiced by CDC epidemiologists in determining, for example, whether weekly death counts for the entire U.S. or for individual states show evidence of a surge in mortality and, if so, in quantifying this surge.

The stakes in their work are higher than in manufacturing. They are counting losses of human lives, and the decisions based on their analyses are more consequential. If they make the wrong call, they can be accused of causing harm by overreacting or by failing to react.

To meet this challenge, they need the best tools data science has to offer; to meet the easier challenges of manufacturing, it’s worth taking a look at what they are doing.

Visualizations

It is best to view the CDC’s results on the CDC website for two reasons:

  1. They generate the charts with Tableau, and the details about a bar pop up when you hover over it.
[Figure: Pop-up box in a CDC chart]
  2. You can narrow the view from the whole U.S. to selected territories.
[Figure: Restricting the territory]

The Approach

Based on explanations on the CDC website, they use a two-step approach:

  1. Determine whether there is evidence of excess deaths in a given week by comparing the Observed Deaths in that period with a Threshold for the period.
  2. If there is evidence of excess deaths, then estimate it as Excess Deaths = Observed Deaths – Expected Deaths.

They use a model inferred from historical data to estimate the Expected Deaths for each week, along with a distribution for the fluctuations around the Expected Deaths, which they use to set the Threshold, arbitrarily, at the 95th percentile of this distribution. The Threshold only serves to determine whether there have been Excess Deaths, not how many.

The goals and the strategy are the same as for SPC but the data sets are larger, the methods used for data cleaning are different, and the models are more sophisticated.

The Data

The CDC keeps records of Observed Deaths in the National Vital Statistics System database. Downloadable data sets include Weekly Counts of Deaths by State and Select Causes. Anyone can download weekly counts going back to 2017. The CDC itself uses data going back to 2013. The raw data since 2013 consists of ~20 million deaths, occurring in a population on the order of 330 million. The SPC charts were about handfuls of defectives found in lots of, say, 500 units.

According to the CDC, some states report deaths daily and others weekly, which makes a week the shortest time bucket for which the CDC can aggregate nationwide death counts. In addition, the time of death is rarely known exactly. The legal time of death, recorded in death certificates, may be off by hours from the actual time, making hourly and even daily tallies meaningless. On the other hand, it’s unlikely to be off by a week.

The CDC’s Model

To set Thresholds and estimate Expected Deaths by week and area, the CDC uses the Farrington Surveillance Algorithms. Paddy Farrington is a British epidemiologist who has been working on such algorithms since the 1990s. The CDC references versions developed in the wake of the SARS and Swine Flu outbreaks of the 2000s and applied in England and Wales.

The CDC also says they use the ‘surveillance’ package in R to implement the Farrington algorithms. This is remarkable for two reasons:

  1. The CDC data, the R software, apps to facilitate its use like RStudio, and packages like ‘surveillance’ are all free for anyone to download. Of course, you need to know R, and it’s not a trivial skill to learn, but there is no technical barrier to replicating the analysis done by the CDC. It is all open.
  2. The ‘surveillance’ package includes documents called vignettes explaining the models it implements and the way to use it. They contain explanations of the Farrington algorithms.
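
To give a concrete feel for this openness, here is a sketch of a call to the package on made-up weekly counts; the control parameters shown are only a subset of those the function accepts, and defaults and details may differ between versions:

library(surveillance)

counts <- matrix(rpois(5 * 52, lambda = 1000), ncol = 1)  # five years of fake weekly counts
series <- sts(observed = counts, start = c(2015, 1), frequency = 52)

alarms <- farringtonFlexible(series, control = list(
  range = 209:260,  # weeks to evaluate: the most recent year
  b     = 3,        # years of history to use as reference
  w     = 3,        # window of +/- 3 weeks around the same week in past years
  alpha = 0.05      # threshold at the 95th percentile
))
plot(alarms)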

Expected Deaths in the Absence of an Outbreak

The model for the expected death rate μ(t) in week t in the absence of an outbreak is the product of three factors:

  • A baseline that is independent of time.
  • An exponential factor corresponding to a steady trend over time.
  • A periodic oscillation to account for higher mortality in the winter than in the summer.

You work with log[μ(t)] rather than μ(t) to have an additive rather than a multiplicative model: log[μ(t)] = α + βt + γ×sin[ω(t + ϕ)] is the model they fit to historical data since 2013. Of course, there were a few outbreaks of other diseases between 2013 and 2019, like the one in January 2018, but their modeling tool is able to separate those from the regular pattern.

The sine is a simple model of periodic variation, familiar to anyone who learned trigonometry in high school. The CDC’s plot of weekly deaths does not suggest any period other than 12 months, but it shows January peaks that are pointier and summer valleys that are flatter than a sine. Remember, however, that the sine is a term in the formula for log[μ(t)]; the corresponding factor in μ(t) is therefore exp(sine), and it is suitably pointy. The following chart shows this pattern:

[Figure: Periodic pattern with pointy peaks]
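
A few lines of R reproduce this shape; the amplitude below is exaggerated relative to anything fitted to the CDC data, purely to make the asymmetry visible:

t     <- 1:156                                  # three years, in weeks
gamma <- 0.5                                    # exaggerated amplitude, for illustration only
plot(t, exp(gamma * sin(2 * pi * t / 52)),      # exp(sine): pointy peaks, flat valleys
     type = "l", ylab = "Multiplicative seasonal factor")
lines(t, 1 + gamma * sin(2 * pi * t / 52), lty = 2)  # plain sine, for comparison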

This raises the question of why weekly death counts should have pointy peaks in winter, and whether it is a general pattern or specific to the U.S. In any case, it is clearly present in the CDC data.

Manufacturing events often have different seasonal patterns. Product demand is often seasonal, but equipment performance may also be affected by temperature and humidity. Other periodic effects, like tool wear, are driven by use counts rather than time; in this case, you plot defect occurrences against production activity, as Shewhart did in his control charts.

Outbreaks in the Data

The CDC model includes a toggle between regular and outbreak times that switches to a higher death rate during outbreaks. This toggle is called a hidden Markov process. For details, see the Getting Started vignette of the ‘surveillance’ package.

In the SPC model, you had a constant rate estimated from a sample, and you manually excluded outliers from this sample to prevent values due to assignable causes from influencing the control limits against which you check for such values.

Thresholding

The Threshold is intended to be an upper bound of the fluctuations around the Expected Deaths that are compatible with the absence of an outbreak. For this purpose, they model the death count D(t) of week t as a Poisson variable with mean μ(t).

Why Poisson? Assume you know μ(t). Strictly speaking, the number of people D(t) in a population of size N who die in Week t follows a binomial distribution with a probability μ(t)/N of “success.” With this ratio small and N large, it is nicely approximated by the simpler Poisson distribution with mean μ(t) and standard deviation \sigma(t) = \sqrt{\mu(t)}.
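
As a quick numerical check of this approximation, with round, made-up values for the population and the weekly mean:

N  <- 330e6                              # population size
mu <- 55000                              # assumed weekly mean
dbinom(55000, size = N, prob = mu / N)   # exact binomial probability of that count
dpois(55000, lambda = mu)                # Poisson approximation, nearly identical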

For weekly death counts D(t) between 50,000 and 60,000, the mean μ(t) is also within that range and the standard deviation σ(t) is therefore between 224 and 245. In this range, the distribution of the death counts is approximately Gaussian (or “Normal”) and its 95th percentile is at μ(t) + 1.645×σ(t).

If you look at the CDC tables, however, you find that the difference between Threshold and Expected Deaths is larger than the Poisson formula predicts. For the week of January 13, 2018, for example, based on the CDC numbers,

Threshold – Expected Deaths = 61,881 – 59,690 = 2,191

where the formula would lead you to expect this difference to be 402.

Why is the Threshold so much higher? The reason is that we don’t know μ(t). All we have is an estimate \hat{\mu}(t), based on the algorithm that fits log[μ(t)] = α + βt + γ×sin[ω(t + ϕ)] to the observed weekly death counts. Knowing the estimate, the true value of μ(t) is only known to be somewhere around it.

The extra variability around μ(t) is called overdispersion, and they build it into the model through a dispersion factor θ, so that the standard deviation of D(t) is

\sigma(t) = \sqrt{\theta\times\mu(t)}

The CDC references this as quasi-Poisson regression.
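
A back-of-the-envelope calculation with the published numbers for that week backs out the implied dispersion factor:

expected  <- 59690
threshold <- 61881
z95       <- qnorm(0.95)                 # ~1.645
theta     <- ((threshold - expected) / (z95 * sqrt(expected)))^2
sqrt(theta)                              # ~5.5, of the same order as the 5.99 estimated below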

Approximating the CDC Model

How close can we get to the CDC’s results with simpler methods that we apply to series of events in manufacturing? Quite close. The method described below comes within 2% of the CDC’s estimates for both Expected Deaths and Thresholds, which is close enough to match the CDC’s diagnosis of whether weekly Death Counts reveal Excess Deaths in a Testing Set of recent weeks.

The tools are documented in Nina Zumel and John Mount’s Practical Data Science with R.

The Data

The CDC lets you download weekly death counts in the US from the beginning of 2017, or 4 years less than they use internally. Still, we can take more than 3 years of weekly data from January 2017 to March 2020 as a Training Set to fit a model and check it against the Testing Set from March to August 2020.

Seasonality

For the seasonal factor, we already know from visualizing the data that the period is one year. Therefore, with times in weeks, ω = 2π/52. Furthermore, since the maximum falls on Week 2 every year, we can rewrite the periodic component as cos[2π(t − 2)/52]. Then all we need to estimate are the linear coefficients α, β, and γ.

Cleaning the Data

Some of the weekly death counts in the Training Set are exceptionally high due to outbreaks of diseases like the flu. We want to filter them out and model just the regular mortality, as observed in the absence of special causes boosting it. Rather than manually plucking out the exceptional values as in Shewhart’s days, we detect the high-influence data points automatically.

You fit a model to the complete training set, then redo the fit with one point removed from the training set, and assess how much difference this point makes to the resulting estimates. This measure is called the Cook’s distance of the point, and it is not a distance in the usual sense of the word. You calculate it for all the points in the training set and eliminate the ones with the highest Cook’s distances.

The method used for model fitting is the Generalized Linear Model (GLM), which fits a linear model not to the data directly but to a transformation of the expected response specified by a link function, here the log, and expresses the result in terms of the original data. For the CDC weekly death counts, we used the above formula and told the glm function in R’s ‘stats’ package to fit a quasi-Poisson model; then we used the influence.measures function from the same package to identify the high-influence points.
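
As a sketch of this step, with hypothetical names, assume the Training Set is a data frame train with a week counter (1, 2, …) in column week and the weekly counts in column deaths:

train$seasonal <- cos(2 * pi * (train$week - 2) / 52)  # seasonal regressor from above

fit <- glm(deaths ~ week + seasonal,
           family = quasipoisson(link = "log"),
           data   = train)

# influence.measures() reports Cook's distance among other diagnostics
# and marks the points it considers influential
infl    <- influence.measures(fit)
flagged <- which(apply(infl$is.inf, 1, any))
train[flagged, c("week", "deaths")]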

It flagged roughly the same periods that the CDC pointed out as having had Excess Deaths, in January 2017 and January 2018:

Week Ending Date    US Death Count
01/21/2017          59,444
01/28/2017          58,186
02/04/2017          58,537
01/06/2018          66,317
01/13/2018          67,664
01/20/2018          64,820
01/27/2018          62,922

Fitting the Model

The next step was to fit a model to the Training Set with the high-influence points removed. The result of running the GLM to fit the exponential trend and the periodic factor simultaneously, however, was that only the periodic factor was retained: it accounts for much more of the variability than the trend and crushed it.

If, on the other hand, we first fit the trend to the data, we get a 0.3% growth rate. Then we fit the periodic model to the residuals, combine the two models, and estimate the dispersion factor θ as the variance of D(t)/\sqrt{\hat{\mu}(t)}. As measured on the training set with the common estimator of the standard deviation,

\sqrt{\hat{\theta}} = \text{Standard Deviation}\left(\frac{D(t)}{\sqrt{\hat{\mu}(t)}}\right) = 5.99
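
Continuing the sketch above, and assuming that fitting the periodic model to the residuals means fitting it, on the log scale, to the ratio of the observed counts to the fitted trend (the actual procedure may differ in detail):

clean <- train[-flagged, ]                # Training Set minus the high-influence weeks

trend_fit <- glm(deaths ~ week, family = quasipoisson(link = "log"), data = clean)
clean$log_resid <- log(clean$deaths) - log(fitted(trend_fit))

season_fit <- lm(log_resid ~ seasonal, data = clean)      # periodic factor, on the log scale

mu_hat    <- fitted(trend_fit) * exp(fitted(season_fit))  # combined Expected Deaths
theta_hat <- var(clean$deaths / sqrt(mu_hat))             # dispersion factor, per the formula above
sqrt(theta_hat)                                           # the text reports 5.99 for its version of this estimate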

Our estimate of Expected Deaths for week t is then \hat{\mu}(t), and

Threshold = \hat{\mu}(t) + 1.645\times\sqrt{\hat{\theta}\times\hat{\mu}(t)}
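
The last step of the sketch, applied to a Testing Set held in a data frame test with the same hypothetical week and deaths columns:

test$seasonal <- cos(2 * pi * (test$week - 2) / 52)

mu_test   <- predict(trend_fit, newdata = test, type = "response") *
             exp(predict(season_fit, newdata = test))                   # Expected Deaths
threshold <- mu_test + qnorm(0.95) * sqrt(theta_hat * mu_test)          # Threshold
excess    <- ifelse(test$deaths > threshold, test$deaths - mu_test, 0)  # Excess Deaths, where flagged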

Comparing with the CDC Model

The justification of the model is based not on theory but on its predictive ability on a Testing Set that is disjoint from the Training Set. The Testing Set contains 23 weeks starting with Week 10 of 2020, in early March. Comparing this alternate model with the CDC’s on the Testing Set, we find that it is within 2% for both the Expected Deaths and the Thresholds. The following chart shows how close the two models are and, in particular, how they flag the same weeks as having Excess Deaths:

[Figure: Comparison of the CDC and alternate models]

Conclusions

The analysis of series of events with the data science tools demonstrated here is no more difficult for a manufacturing professional in 2020 than a process capability study was for Walter Shewhart in 1924. The tools are more powerful and work on larger data sets.

Data on series of events in manufacturing are not as sensitive as death counts, and analyzing them usually does not require the superior expertise of the CDC epidemiologists, whom factories could not recruit in any case. Fortunately, methods that are similar to theirs but easier to use can produce useful results in this context.

Free software is available to do all the calculations. You need to understand what it does but not how it does it. It’s like a driver’s understanding of a car versus a mechanic’s.

Further Reading

#seriesofevents, #timeseries, #queueing, #alarmgeneration, #equipmentfailure