True And False Alarms in Quality Control

The SPC literature does not consider what happens when an organization successfully uses its tools: it stabilizes unstable processes, so that disruptions from assignable causes become increasingly rare. While this happens, false alarms from common causes keep occurring at the same frequency, and the ratio of true to false alarms drops to a level that destroys the credibility of the alarms.

This is a signal that further quality improvement can only be pursued with other tools, typically the conversion to one-piece flow to accelerate the detection of problems and, once human error becomes the dominant cause of defects, error-proofing. This article digs into the details of how this happens with control charts. 

SPC and Statistical Testing

In his latest article in Quality Digest, Don Wheeler celebrates the 100th anniversary of Shewhart’s control charts. He describes the use of these charts for process control as follows:

“A baseline period is used to compute limits that define what a predictable process is likely to produce, and then these limits are used with additional data as they become available. Every time we add a point to our chart, we perform an act of analysis; the chart asks, ‘Has a change occurred?’”

So, first you do a process capability study to set control limits, and then you issue an alarm, true or false, every time a new point crosses one of these limits.

Never mind that factories where production operators routinely update control charts in this fashion are so hard to find that I have never seen one; this concept still commands attention as part of quality control training courses.

SPC experts, Wheeler in particular, deny that this is a null hypothesis significance test (NHST), but it walks and quacks like one. In the vocabulary of orthodox statistics, control limits for measured variables are calculated under the null hypothesis that the measurements are independent, identically distributed, and Gaussian. A control limit crossing supports the rejection of this hypothesis at a significance level of 0.27%.
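
As a sanity check on that figure, here is a minimal sketch (in Python with scipy, not from the original article) of the two-sided tail probability beyond \pm 3\sigma under the iid Gaussian null hypothesis:

```python
from scipy.stats import norm

# Probability that a single point falls outside +/-3 sigma when the
# null hypothesis holds: independent, identically distributed, Gaussian data.
p_crossing = 2 * norm.sf(3)   # norm.sf(3) is the one-sided tail beyond 3 sigma
print(f"P(limit crossing | in control) = {p_crossing:.4%}")   # about 0.27%
```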

SPC is an Approach to Process Control

SPC stands for “Statistical Process Control.” Process control generally refers to programming machines, applying automatic feedback as needed to keep parameters from drifting, and feedforward to let later steps compensate when possible for deviations in earlier steps. This is meant to ensure output consistency.

When SPC emerged, machines could only be programmed by punch tapes like Jacquard looms. Feedback control was governor valves on steam engines, and feedforward was chefs tweaking recipes to match the day’s vegetables. The idea of SPC was to apply statistics to control processes by measuring finished workpieces manually and responding promptly to disruptions.

It was about what just happened, not what happened a day or a week ago. In other words, it was about controlling, not auditing, the processes. Unlike closed-loop feedback control, it does not automatically adjust machine settings like an aircraft autopilot. Instead, it’s about stabilizing the process by manually eliminating assignable causes of variation.

From Significance Levels to p-Values

In the statistics literature, the p-value generalizes the significance level in NHST. The p-value of a test is the probability that the test statistic exceeds its observed value if the null hypothesis is true. In the context of control charts, it is the probability of a false alarm.

A few decades ago, you had to look up the limits for a given p-value in printed statistical tables, which existed only for a few distributions and a few arbitrarily chosen p-values called “levels of significance.” This led statisticians to focus on a handful of values like 5% or 1%. Today, this is no longer a constraint: p-values are easily calculated for many distributions and statistics.

In the social sciences, p = 0.05 and p = 0.01 are widely used. For control charts, Shewhart chose p = 0.0027, corresponding to \pm 3\sigma limits, “because this is about the magnitude customarily used in engineering practice.”
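
Going the other way, from a chosen level of significance to the corresponding limits, takes one line per value with today’s software. A minimal sketch, assuming a Gaussian distribution and two-sided limits, using the values quoted above:

```python
from scipy.stats import norm

# Two-sided limits for a few conventional levels of significance
for p in (0.05, 0.01, 0.0027):
    z = norm.isf(p / 2)   # one tail of p/2 on each side
    print(f"p = {p}: limits at +/- {z:.2f} sigma")
# p = 0.05 -> 1.96 sigma, p = 0.01 -> 2.58 sigma, p = 0.0027 -> 3.00 sigma
```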

The p-value is well-defined, and software readily provides it. Instead of setting a level of significance upfront, you calculate your statistic and its p-value, and it tells you the level of significance at which you can reject the null hypothesis. Trouble starts with where you go from there:

Randall Munroe’s cartoon about p-values

P-values are controversial among data scientists and commonly misused, particularly through p-hacking. The most common mistake is to take a high p-value as a reason to accept the null hypothesis: failing to reject it, in one test or several, does not prove it is true. Conversely, a low p-value does not tell you what to replace the null hypothesis with.

What Are You Trying To Do?

When social scientists run statistical tests, they usually aim to refute the null hypothesis. For example, the null hypothesis may be that there is no difference in outcomes between two methods for teaching kids to read. If you fail to find a “significant” difference, you don’t have anything to publish, and that’s why p-hacking is common.

Within a manufacturing process, on the other hand, you do not want to conclude that it has gone out of statistical control unless you have to. That’s why you use a p-value of 0.27%, 18.5 times lower than the 5% commonly used in social sciences.

Control Limits And Tolerance Intervals

As discussed in an earlier post, tolerances are a one-way filter. If a workpiece is out of spec on any characteristic, we know it’s defective; if it is within all specs, on the other hand, we don’t know that the product will work as expected.

The control limits of a control chart are supposed to be within the spec limits, and a measurement that crosses them does not mean the workpiece is defective. Instead, it signals a shift that, if not nipped in the bud, may eventually push workpieces out of spec. A value within the control limits is also within the spec limits, but that does not allow you to conclude that the parts you measured are actually good.

Process Tweaking And Assignable Causes

The response to an out-of-control-limits signal does not involve scrapping parts or stopping production, and is not supposed to be limited to tweaking machine settings.

When the materials used in a process are mined from the earth, they have variations due to geology; when they are agricultural crops, variations due to the weather. To produce consistent results, these processes often require homogenizing batches, analyzing samples, and adjusting process settings based on the results. I have seen control charts used for this, but it was not their intended purpose.

Instead, points that cross control limits are supposed to trigger the search for an assignable cause through problem-solving. If a team finds a cause and eliminates it promptly enough, then it denies this cause the opportunity to ever cause defects.

Disaster prevention is less visible than disaster recovery, and those who do it are vulnerable to accusations of overreacting. If you miss a true alarm, it may cause workpieces to go out of spec sometime in the future; if you issue a false alarm, you send a team on a wild goose chase right now.

Issuing True Alarms

As discussed in a previous post, what matters is the ratio of true to false alarms. If it is too low, the whole practice of issuing these alarms loses credibility with the organization it is supposed to serve.

Shewhart was aware of the problem. In his 1931 book, Economic Control of Quality of Manufactured Product, on pp. 347-348, he discusses this issue in terms of the probability P that a statistic on a sample of data falls between two limits:

“We must try to strike a balance between the advantages to be gained by increasing the value P through reduction in the cost of looking for trouble when it does not exist and the disadvantages occasioned by overlooking troubles that do exist.”

What I couldn’t find in his works, however, is the notion that the thresholds for issuing alarms should be a function of existing quality. In a process that is never disrupted by assignable causes, \pm 3\sigma limits will produce, on average, one alarm for every 370 samples, all of them false; in one that is routinely disrupted, \pm 2\sigma limits may produce mostly true alarms.
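
The figure of 370 is the average run length between false alarms, that is, the reciprocal of the false alarm probability per point. A minimal sketch (not from the original) comparing \pm 2\sigma and \pm 3\sigma limits under the Gaussian assumption:

```python
from scipy.stats import norm

# Average run length (ARL) between false alarms for a process subject
# only to common causes, modeled as iid Gaussian data.
for k in (2, 3):
    p_false = 2 * norm.sf(k)   # two-sided tail beyond +/- k sigma
    print(f"+/-{k} sigma limits: P(false alarm) = {p_false:.3%}, "
          f"about 1 false alarm per {1 / p_false:.0f} points")
# +/-2 sigma: ~4.6%, ~22 points; +/-3 sigma: ~0.27%, ~370 points
```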

The Ratio of True to False Alarms

Let’s consider an example. You conduct a process capability study and set control limits that account for common cause variability. Assignable causes in unknown numbers lurk in the environment, waiting for an opportunity to disrupt the process.

The Example

The following assumptions are realistic for many manufacturing processes in 2024:

  • A sensing and data acquisition system attached to your operation automatically collects one value for every completed workpiece.
  • The measurements are accurate and precise, meaning that any bias and variability in taking measurements are negligible compared with those due to the process.

The production volume is stable at 1,000 units/day, generating 1,000 data points/day, which you plot on an X-chart, among other charts. Every time a point falls outside the control limits, the production team, with engineering support, investigates to find and remove an assignable cause. Most of the time, they find one; on occasion, they don’t, and conclude that the alarm was false. As Wheeler puts it in his latest Quality Digest article:

“When a process is being operated unpredictably, there will be signals of process changes. In this case, having one or two false alarms per hundred points is of no real consequence because the number of signals will usually greatly outnumber the number of false alarms.”

Fast-forward one year and 300,000 units. These efforts have culled the assignable causes that problem-solving has focused on, but not the common causes. How fast does this happen? Using the logic of learning or experience curves, let us assume that true alarm generation drops by 30% every time the cumulative volume doubles. For the cumulative production count to go from 1,000 units at the end of day 1 to 300,000 after one year takes 8.23 doublings, bringing the rate of true alarm generation down to 0.7^{8.23} \approx 5\% of what it was on day 1.
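
Under the stated assumptions, the arithmetic works out as follows (a sketch, not from the original; the 30% drop per doubling is the assumed learning-curve rate):

```python
import math

start_units = 1_000        # cumulative production at the end of day 1
end_units = 300_000        # cumulative production after one year
drop_per_doubling = 0.30   # assumed drop in true alarm generation per doubling

doublings = math.log2(end_units / start_units)       # about 8.23
remaining = (1 - drop_per_doubling) ** doublings      # 0.7**8.23, about 5%
print(f"{doublings:.2f} doublings -> {remaining:.1%} of the day-1 true alarm rate")
```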

Evolution of the Ratio of True to False Alarms

If 1 piece in 10 generated a true alarm on Day 1, it works out to 100 true alarms/day. One year later, we are down to 5 true alarms/day. Meanwhile, the number of false alarms, generated by common causes, has not changed. It started out at a mean of 2.7 false alarms/day and it is still at this level. In other words, the ratio of true to false alarms has plummeted from 37 to 1.85.

We can use a simulation to visualize how true alarms rarefy while false alarms don’t:
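
Here is a minimal sketch of such a simulation (Python, not from the original), using the example’s figures: 1,000 points/day, a 1-in-10 chance of a true alarm per point on day 1 decaying along the learning curve, and a constant 0.27% false alarm probability per point:

```python
import math
import random

random.seed(0)                # for a reproducible run
points_per_day = 1_000
p_false = 0.0027              # false alarm probability per point, common causes only
p_true_day1 = 0.10            # assumed true alarm probability per point on day 1
drop_per_doubling = 0.30      # assumed learning-curve rate

for day in (1, 30, 90, 180, 365):
    # The cumulative volume at the end of `day` is day * points_per_day,
    # so it has doubled log2(day) times since the end of day 1.
    p_true = p_true_day1 * (1 - drop_per_doubling) ** math.log2(day)
    true_alarms = sum(random.random() < p_true for _ in range(points_per_day))
    false_alarms = sum(random.random() < p_false for _ in range(points_per_day))
    print(f"day {day:3d}: {true_alarms:3d} true alarms, {false_alarms} false alarms")
```

True alarms fall from roughly 100/day to roughly 5/day while false alarms hover around 2 to 3/day, which is how the ratio collapses from about 37 to under 2.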

Dealing with Low True to False Alarm Ratios

This happens without any change to the control procedure; it is entirely due to the condition of the process. Fewer alarms, of course, mean less problem-solving work, but a higher proportion of that work is wasted.

In the beginning, you may not even notice that some alarms are false, because there are so many assignable causes around that you are likely to bump into one when investigating an alarm that is actually false. That is no longer the case once the ratio of true to false alarms drops below 2.

Wheeler doesn’t see a problem, “because the generic, three-sigma limits will naturally result in a very low false alarm rate.” If you keep diligently investigating all alarms, it will often be a waste of time; if you don’t, there is no value in plotting the chart.

Options on Moving Forward

At this point, you have several options:

  1. Use the chart for process auditing rather than control. You take the chart offline and feed it a handful of measurements daily to focus on long-term trends.
  2. Enhance the process capability by focusing on common causes of variability. As we know, just staying within spec limits does not guarantee the quality of the product. If we adopt the Taguchi model, in which the quality characteristic has a target value and losses grow with the square of the distance to the target (written out after this list), then it pays to improve the process and tighten the control limits around the target. This is engineering work based on experimentation, not routine quality control, and it calls for other tools.
  3. Switch to other methods of in-process quality control. Lack of process capability is not the only cause of defects. When, after an operation, you drop parts into a heap in a wire basket and do not pick any up until 500 have accumulated, you delay the detection of defectives. In addition, as you lose track of the process sequence, you make it more difficult to diagnose when the process went awry.
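
For reference, the Taguchi model invoked in option 2 is usually written as a quadratic loss function, with y the measured characteristic, T its target value, and k a cost constant:

L\left ( y \right ) = k\left ( y-T \right )^{2}

The loss is zero only on target and grows with the square of the deviation, which is why merely staying within spec limits does not guarantee product quality.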

General Formulas

The above discussion was based on an example; we can formalize it in a more general way. The ROC curve of the test is one input, but the sensitivity and specificity it is built from are conditional probabilities:

\text{Sensitivity} = P\left ( \text{Alarm}\,\vert\, \text{Problem} \right )

 

and

\text{Specificity} = 1- P\left ( \text{Alarm}\,\vert\, \text{No Problem} \right )

 

For an unconditional probability of true and false alarms, we need to factor in the probability of a problem:

P\left ( \text{True Alarm} \right ) = P\left ( \text{Alarm}\,\vert\, \text{Problem} \right )\times P\left ( \text{Problem} \right )

 

and

P\left ( \text{False Alarm} \right ) = P\left ( \text{Alarm}\,\vert\, \text{No Problem} \right )\times P\left ( \text{No Problem} \right )

 

If problems almost never happen, then P\left ( \text{No Problem} \right ) is high, which boosts the probability of false alarms. Conversely, when problems frequently occur, then P\left ( \text{Problem} \right ) is high, which boosts the probability of true alarms.
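
A minimal sketch (not from the original) that puts these formulas together, using the sensitivity and specificity of the 2\sigma-shift, 3\sigma-limit example discussed below:

```python
from scipy.stats import norm

def alarm_probabilities(sensitivity, specificity, p_problem):
    """Unconditional probabilities of true and false alarms."""
    p_true_alarm = sensitivity * p_problem
    p_false_alarm = (1 - specificity) * (1 - p_problem)
    return p_true_alarm, p_false_alarm

# Illustrative values: an upward 2-sigma shift tested against a +3-sigma limit,
# so the limit sits 1 sigma above the shifted mean.
sensitivity = norm.sf(3 - 2)    # P(Alarm | Problem), about 16%
specificity = norm.cdf(3)       # 1 - P(Alarm | No Problem), about 99.865%

for p_problem in (0.5, 0.1, 0.001):
    p_true, p_false = alarm_probabilities(sensitivity, specificity, p_problem)
    print(f"P(Problem) = {p_problem}: P(True Alarm) = {p_true:.4%}, "
          f"P(False Alarm) = {p_false:.4%}")
```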

Occurrence Rate of Assignable Causes

But how do we know the probability of a problem, namely a disruption of the process from an assignable cause? The process capability study focused on the variability from common causes, not assignable causes.

Once the process capability study has let you set control limits, you start monitoring limit crossings for evidence of assignable causes. With a new process, it is prudent to assume that there is always a problem, which means pretending that P\left ( \text{Problem} \right ) = 100\% and focusing exclusively on the sensitivity of the test.

True to False Alarms Ratio: A Lower Bound

A process that has been running for months or years has a history of quality problem reports from downstream operations or the field. These reports point out disruptions large enough to cause defects that others detected, whereas we are looking to respond to disruptions large enough to be noticed but too small to cause defects. The relative frequency of reported defectives is therefore a lower bound on the probability of a problem. Let d = \text{Proportion of Defectives}; then

P\left ( \text{True Alarm} \right ) \geq P\left ( \text{Alarm}\,\vert\, \text{Problem} \right )\times d

 

and

P\left ( \text{False Alarm} \right ) \leq P\left ( \text{Alarm}\,\vert\, \text{No Problem} \right )\times\left ( 1-d\right )

 

In the example used in the ROC for the Gaussian, the ratio R of true to false alarms for an upward shift by 2\sigma with a threshold set at +3\sigma satisfies the following inequality:

R \geq \frac{16\%}{0.135\%} \times \frac{d}{1-d} \approx 119 \times \frac{d}{1-d}

 

For a high defective rate like d = 10\%, R \geq 13, meaning that there are at least 13 times more true alarms than false ones. With d = 3\%, R \geq 3.7, or about 4 times more true than false alarms. Go down to d = 100\,\text{ppm}, however, and the inequality only gives you R \geq 1.2\%, which merely guarantees no more than about 83 false alarms for every true alarm.
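
The same arithmetic, as a sketch (not from the original; exact Gaussian tails give a factor of about 117.5 rather than the rounded 119):

```python
from scipy.stats import norm

sensitivity = norm.sf(3 - 2)   # P(Alarm | Problem) for an upward 2-sigma shift, ~16%
p_false_rate = norm.sf(3)      # P(Alarm | No Problem) for the upper limit alone, ~0.135%

for d in (0.10, 0.03, 100e-6): # proportion of defectives, a lower bound on P(Problem)
    r_lower = (sensitivity / p_false_rate) * d / (1 - d)
    print(f"d = {d}: R >= {r_lower:.3f}")
# d = 0.1 -> R >= ~13; d = 0.03 -> R >= ~3.6; d = 100 ppm -> R >= ~0.012
```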

This is only about testing for a 2\sigma shift with a 3\sigma limit, and it only gives a lower bound for the ratio of true to false alarms. There is no way to define the sensitivity of a test for an arbitrary shift in an arbitrary direction; we can only calculate it for an upward or downward shift of a given magnitude. For no shift, it equals the false alarm probability, essentially 0, and it tends to 1 as the shift grows.
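
To illustrate this last point, a sketch (not from the original) of the sensitivity of a one-sided +3\sigma limit to upward shifts of increasing magnitude, under the Gaussian model:

```python
from scipy.stats import norm

# After an upward shift of `shift` sigmas, the +3-sigma limit sits
# (3 - shift) sigmas above the new mean.
for shift in (0, 0.5, 1, 2, 3, 4, 5):
    sensitivity = norm.sf(3 - shift)
    print(f"shift = {shift} sigma: sensitivity = {sensitivity:.3%}")
# 0 -> 0.135% (the false alarm rate), 2 -> ~16%, 3 -> 50%, 5 -> ~97.7%
```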

Conclusion

The relevance of an in-process quality control method depends on the condition of the process. Methods that work to stabilize a new process lose their effectiveness once they achieve this goal, and must give way to others with a different focus.

References

#falsealarm, #truealarm, #spc, #controlchart