Measurement Errors

Like spouses in murders, errors are always the prime suspect when measurements go awry. As soon as Apollo 13 had a problem, a Mission Control engineer exclaimed, “It’s got to be the instrumentation!”

It wasn’t the instrumentation. In general, however, before searching for a root cause in your process, you want to rule out the instrumentation. For that, you need to make sure it always gives you accurate and precise data.

Industry Practice

The quality of scientific measurements is an internal issue among scientists. In Manufacturing, on the other hand, the concern about measurement errors goes beyond internal operations. Manufactured goods have technical characteristics shared with customers and government agencies, ranging from emissions in gasoline engines to power consumption in appliances or fat content in foods. As a consequence, industrial measurement practices are subject to regulations.

Measurement System Analysis (MSA) is the process capability analysis of measurement instruments or systems. It encompasses Calibration for accuracy and Gage Repeatability & Reproducibility (Gage R&R) for precision. Gage R&R is where the Gaussian distribution comes up.

You need both calibration and Gage R&R, but which one should you pursue first? Some of the literature recommends accuracy first but, if we follow Taguchi’s thinking, we should focus first on precision. If you have precision, you can aim, and, once you can aim, you can adjust the aim if it is off target.

Gage R&R

Repeatability and reproducibility are both components of the variability in measurements of the same variable taken on the same part with the same instruments by one or more appraisers. The difference is that repeatability is measured with the same appraiser, while reproducibility is with different ones. They are both precision in systems with different boundaries. “Gage,” in this context, is synonymous with instrument: it designates any device used to measure.

Focus on Human Variability

The emphasis on the effect of people is striking because the elimination of human involvement in measurement is well underway. In measuring blood pressure 30 years ago, for example, a nurse attached the armband, pumped up and released the pressure manually, and read numbers off a dial. Today, the patient attaches the armband, presses a button on a small electronic device, and reads the numbers off a screen.


The nurse is gone, and so is the variability in measurements across nurses. If the machines in your home and in the doctor’s office give different results, it’s due to the machines, not human factors.

In Manufacturing in 2024, the role of people in measurement is usually limited to loading workpieces into instruments and taking them off when done. The collection and storage of measurements should not involve any human intervention. You still find shop floors where operators write down measurements on paper travelers with pens, but it is an obsolete practice. There is no technical impediment to having instrument controllers collect, process, and store the measurements, eliminating the human as a source of variability. Is reproducibility in Gage R&R a truly current concern, or is it a legacy of a still recent past when these tasks could only be done manually?

Reproducibility in Gage R&R versus Science

Reproducibility in Gage R&R is narrower than in science, where it is the ability of different scientists to replicate an experiment based on its published description and get the same results as the authors; that is, to use the same method with different materials and equipment, in a different location.

Gage R&R and Gaussian Residuals

In Gage R&R, people take measurements on the same instrument, and you group these measurements by appraiser and workpiece. Then, you apply Analysis Of Variance (ANOVA) to compare the variability within each group to the variability between groups. If measurements are both repeatable and reproducible, the variance of group averages will be low compared to the variance within each group, and vice versa.

If X_{i1},\dots, X_{ik} are the k measurements in group i, then the differences R_{ij} = X_{ij} - \overline{X}_i for j=1,\dots,k are called residuals and, for ANOVA to work, the residuals are assumed to be Gaussian.
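
As a minimal sketch of this comparison (with simulated numbers, not a worked Gage R&R protocol), here is how it looks in Python with scipy’s one-way ANOVA:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical data: 3 appraisers measure the same part 10 times each,
# with a small personal bias on top of a common repeatability noise
groups = [25.0 + bias + rng.normal(0.0, 0.02, size=10)
          for bias in (0.00, 0.01, -0.01)]

# One-way ANOVA compares between-group to within-group variability
f_stat, p_value = stats.f_oneway(*groups)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")

# The residuals within each group are what ANOVA assumes to be Gaussian
residuals = np.concatenate([g - g.mean() for g in groups])
print(f"Shapiro-Wilk normality p-value: {stats.shapiro(residuals).pvalue:.3f}")
```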

In his Practical Guide to Measurement System Analysis, Mark Allen Durivage equates MSA with Gage R&R, and, in his Figure 1.4, does not appear to consider the possibility that measurement errors could be anything but Gaussian.

Why should they be? To answer this question, we need to consider the nature of measurements.

Calibration

Calibration is usually done by using the instrument on a standard – that is, an object for which the value is known. But what might that be? The definitions of meters, kilograms, seconds, amps, etc., are now based on physical constants. Until recently, they were based on physical prototypes.

Through 2019, the kilogram, for example, was, by definition, the mass of one cylinder of platinum-iridium kept under controlled conditions in one location for the entire world. It is now defined in terms of the speed of light in vacuum, Planck’s constant, and an atomic transition frequency of Cesium 133.

These constants, based on current physics, are more stable than any chunk of metal, no matter how well protected under three nested glass bell jars. You can’t, however, use formulas to calibrate your instruments.

You must first embody them in imperfect physical objects that you can use as references, like the cast-iron weights grocers used to calibrate their scales. These weights change mass as they rust but are still good enough for carrots and onions.

With calibration based on a physical standard, you have to wonder about the calibration of the standard itself. In any case, calibration is about bias in the measurements, not about the distribution of the errors.

Sound Level Measurement

Let’s start with sound level, as an example that is slightly more complex than measuring a rod length with a tape or a caliper. For under $50, you can buy a handheld sound level meter to measure ambient noise. You press a button and you get a reading in decibels (dB).

Measurement Process

The rounded square on top of the device is a microphone that picks up sound within a frequency range, converts it to voltage oscillations, and digitizes this analog signal. The meter then calculates the signal’s root mean square (RMS) over one of three user-selected time windows to get a power, from which it derives an intensity I in W/m^2. Finally, it converts this intensity into decibels as X= 10\log_{10}(I/I_0), with the reference intensity I_0 = 10^{-12} W/m^2 at the threshold of hearing, and displays the result.

The steps to produce the decibel level are summarized below:
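
A minimal sketch of this chain in code, assuming a digitized signal, a hypothetical `to_intensity` calibration factor, and the standard reference intensity I_0 = 10^{-12} W/m^2:

```python
import numpy as np

I_REF = 1e-12  # reference intensity in W/m^2, the threshold of hearing

def sound_level_db(signal, to_intensity=1e-6):
    """RMS of the digitized signal over the window, a nominal conversion
    to intensity, then the decibel scale. `to_intensity` is a hypothetical
    calibration factor mapping squared signal values to W/m^2."""
    rms = np.sqrt(np.mean(np.square(signal)))  # root mean square
    intensity = to_intensity * rms**2          # power is proportional to RMS squared
    return 10.0 * np.log10(intensity / I_REF)

# Example: one second of a 1 kHz tone sampled at 48 kHz
t = np.linspace(0.0, 1.0, 48_000, endpoint=False)
print(sound_level_db(np.sin(2 * np.pi * 1000 * t)))  # ~57 dB
```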

Errors in Intensity versus Decibels

The sound entering the microphone goes through multiple stages to arrive at a display in decibels, and each stage contributes to the final error. The maker of the above meter quotes a margin of error of \pm 2 dB.

This suggests that the error in dB is symmetric around the true value, consistent with a Gaussian model for the error in decibels. However, it doesn’t prove that the Gaussian is a valid model. You would need to test it on actual data.

The intensity I is a measure of noise level, but so rarely used that it doesn’t even have a named unit: it’s in W/m^2. Instead, we talk about dB, where an increase of 10 dB represents an intensity that is 10 times higher.

I and X are both measurements of noise level and, obviously, cannot both be Gaussian. For example, if X is Gaussian, then I is lognormal. This example shows that it makes no sense to assume all measurements are Gaussian.
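
A quick simulation illustrates this, assuming for the sake of the example Gaussian errors of 2 dB around a 60 dB level:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x_db = rng.normal(loc=60.0, scale=2.0, size=100_000)  # Gaussian in decibels
intensity = 10 ** (x_db / 10)                         # back to intensity (in units of I_0)
# The dB sample is symmetric; the intensity sample is right-skewed, i.e., lognormal
print(f"skew in dB: {stats.skew(x_db):+.2f}")
print(f"skew in intensity: {stats.skew(intensity):+.2f}")
```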

ANOVA, however, does not assume the measurements to be Gaussian, only their residuals within each group. We’ll see that it makes a difference.

Never Perfect Measurements

Measurements are never perfect. When measuring noise level, the relationship between the observation and the final number is indirect. When applying a vernier caliper to a hole in a turned part, it is direct.

As discussed in the earlier post on Tolerances, mathematically, measurements are always rational approximations of real numbers. In the last post, we recalled John Herschel’s analysis of measurement errors in the positions of stars in the sky. He showed they were Gaussian under two conditions:

  1. The errors in the two coordinates had to be independent.
  2. They depended only on the distance from the true point.

While this result is remarkable for this special case, it doesn’t mean that all measurement errors are Gaussian.

True Values

As pointed out in an earlier post, Deming had a surprising perspective on true values, asserting in his foreword to Walter Shewhart’s Statistical Method from the Viewpoint of Quality Control that there is “no true value of anything.” John R. Taylor put it differently on p.130 of his treatise on measurement errors:

“What is the ‘true value’ of a physical quantity? This question is a hard one that has no satisfactory, simple answer. Because no measurement can exactly determine the true value of any continuous variable (a length, a time, etc.), whether the true value of such a quantity exists is not even clear. Nevertheless, I will make the convenient assumption that every physical quantity does have a true value.”

By definition, a physical quantity can be measured as a number with a unit, and should have a true value that exists independently of our ability to measure it. The problem is that these quantities are features not of reality itself but of our models of it, and models are human-generated abstractions that may have parameters with no counterpart in reality. In such cases, the pursuit of accuracy or precision is a wild goose chase.

How can you tell?

There is no certain way to determine that a physical quantity exists but there are telltale signs that it doesn’t.

Consensus of Experts

The absence of a consensus among experts should raise suspicion. In situations that arise in engineering, there is not much discussion of the nature of distance, mass, temperature, voltage, pH, or time. Stephen Hawking may have issues with time, but engineers working on the ignition timing of a car engine don’t. Accountants, on the other hand, can’t agree on the unit cost of a manufactured good.

Definition in a Physical Model

Thermometers predated the theory of heat by 2,000 years if you include early devices to measure “hotness” in antiquity. When you measure temperature today, however, it is as a characteristic of a physical system and within the context of a theory of heat. The theory says this temperature exists regardless of your ability to measure it. The temperature is not defined as the result of a measurement procedure. Instead, your measurement, however taken, approximates the true value.

The COVID-19 pandemic provided an example of a characteristic first defined within a model: the reproduction number R_0 (pronounced “R naught”). Take a given disease and a given population where no one is infected, and introduce one infected person into it; R_0 is the expected number of other population members that this person will infect while contagious.

This number, obviously difficult to estimate, soon made its way into news media and political speech as if it were as straightforward as temperature. For COVID-19 in China, the US CDC gave a 95% confidence interval ranging from 3.8 to 8.9, while the Chinese government placed it between 2 and 3. But the existence of R_0 was never in question.

Procedurals

Quantities that are defined by procedures differ from physical quantities in that, if you apply the procedure, the result is exact by definition, and there is no physical reality to compare it to.

Adjusted Gross Income (AGI) is defined by the algorithm the tax authorities use. It is real in that income taxes are based on it, but it is not an estimate of any kind of true value in economics that would exist independently of our ability to measure it.

Flawed math

The trickiest cases are for measurements that we intuitively think of as obviously existing physical quantities. The length of a coastline is a case in point. We think of it as well-defined until we try to measure it on maps of increasing scale. If the length of a coast exists, the measurements should converge.

Effect of Map Scale

As more detailed features emerge on maps of increasing scale, however, the lengths grow to infinity. And if, instead of looking at maps, you stand on a beach and observe the surf, you don’t see any boundary line with a length you could measure.

What is tricking us here is a mistaken assumption. We set bounds around a path and assume that all the paths within these bounds have lengths close to the original’s.

The Mistake

Assume a mountain road from A to B. By hugging the inside of the turns, you can follow a path shorter than the road. As the road narrows, the length of this shortest path converges toward that of the road. It doesn’t work this way for longer paths: even on the narrowest of roads, you can trace a path as long as you want near the original by zig-zagging.

This says that the length of a path is lower semi-continuous rather than continuous. A line on a small-scale map expands to a winding ribbon on the next larger-scale map. Within this ribbon, you can draw a path of arbitrary length. And this path again turns into a ribbon on the next larger-scale map…

The length of a coastline, as a physical quantity, does not exist. Wikipedia, however, still provides a ranking of countries by length of coastline. The problem is similar with measuring an inland border: to Spain, the length of its border with Portugal is 986 km; to Portugal, the same border is 1,214 km, 23% longer.
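
The Koch curve, as a geometric stand-in for a coastline, makes the divergence concrete: each refinement plays the role of a map of larger scale and multiplies the measured length by 4/3.

```python
# Each refinement replaces every segment with 4 segments of 1/3 the length,
# so the total length is multiplied by 4/3 at each step and never converges.
for n in range(9):
    length = (4 / 3) ** n
    print(f"refinement {n}: {4**n:5d} segments, measured length = {length:.2f}")
```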

Scaling

You can assess the distribution of measurement errors on a prototype, but it leaves open the question of scaling. The measurement errors with a vernier caliper do not depend on the lengths you measure. In other cases, they vary with the measured value.

Are Measurement Errors Gaussian?

In his Introduction to Error Analysis, the closest John R. Taylor comes to justifying the use of the Gaussian distribution for all measurement errors is the following:

“If the measurements are subject to many small random errors but negligible systematic error, their limiting distribution will be the normal, or Gauss, distribution.” (p. 153)

Let’s unpack this statement and check it out. His “systematic error” is the inaccuracy, or bias, of the measurement – that is, the expected value of the error, eliminated by calibration. His “many small random errors” are imprecision, and their expected value is 0. The point of qualifying them as “small” is to justify using the Gaussian law of error propagation introduced in the discussion of Tolerance Stacking.

Contributions to Measurement Errors

The sources of fluctuations in the value of a measurement may include the following:

  • The environment – that is, the ambient temperature, humidity, noise or vibration level,…
  • The precision of the sensor itself.
  • The interaction between the sensor and the object of the measurement.
  • The transmission of the sensor reading to the data acquisition system.
  • The digitization of the sensor reading.
  • The transformation of the reading into a data point displayed or stored.

Model of Small Measurement Errors

This makes the measurement Y a function Y = f\left (\mathbf{X} \right ) of all its k influences \mathbf{X} = \left ( X_1, \dots, X_k \right ). If the true value is the unknown \mu_Y and the measurement error is \Delta Y, then Y =\mu_Y + \Delta Y, with \Delta Y having 0 mean. If f has a gradient at \mathbf{ \mu_X} = \left ( \mu_1,\dots,\mu_k\right ) and \Delta\mathbf{X} = \mathbf{X}- \mu_{\mathbf{X}} = \left (\Delta X_1, \dots,\Delta X_k \right ), then

\Delta Y \approx \text{grad}\left [ f\left (\mathbf{ \mu_X} \right ) \right ]\cdot \Delta\mathbf{X} = \frac{\partial f }{\partial x_1}\times \Delta X_1 +\dots+ \frac{\partial f }{\partial x_k}\times \Delta X_k


If the errors are small enough for this approximation to be usable, it portrays the measurement error as a sum of terms of the form \frac{\partial f }{\partial x_i}\times \Delta X_i: many independent errors \Delta X_i, each weighted by its contribution \frac{\partial f}{\partial x_i} to the gradient of the measurement.

This approximates \Delta Y with a linear function of \Delta\mathbf{X} whose coefficients, unless f is linear, vary with \mu_{\mathbf{X}}.

We know that its expected value is E\left ( \Delta Y \right ) = 0. Assuming that the \Delta X_i are independent with standard deviations \sigma_i, as discussed about Tolerance Stacking, we know that the standard deviation \sigma_{\Delta Y} of \Delta Y is

\sigma_{\Delta Y}\ =\sqrt{\left (\frac{\partial f}{\partial x_1} \right )^2\times \sigma_1^2 +\dots+ \left (\frac{\partial f}{\partial x_k} \right )^2\times \sigma_k^2}
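
As a numerical sketch of this formula, with a hypothetical two-influence measurement f(x_1, x_2) = x_1 \times x_2 and central-difference estimates of the partial derivatives:

```python
import numpy as np

def f(x):
    # Hypothetical measurement function, e.g., power from voltage and current
    return x[0] * x[1]

def propagated_sigma(f, mu, sigmas, h=1e-6):
    """Standard deviation of the measurement error from the linearized
    propagation formula, with partial derivatives taken at mu."""
    mu = np.asarray(mu, dtype=float)
    grad = np.array([(f(mu + h * e) - f(mu - h * e)) / (2 * h)
                     for e in np.eye(len(mu))])
    return float(np.sqrt(np.sum((grad * np.asarray(sigmas)) ** 2)))

print(propagated_sigma(f, mu=[12.0, 1.5], sigmas=[0.1, 0.02]))  # ~0.28
```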


Does this make \Delta Y a Gaussian variable? We’ll examine two imperfect arguments.

Argument Based on the Central Limit Theorem

These terms in the approximation of \Delta Y are independent but not identically distributed, which precludes application of the classical Central Limit Theorem (CLT). Its conditions, however, were weakened in the 20th century, notably by Lyapunov and Lindeberg, which makes it reasonable to assume the Gaussian model for a finite sum of many small, independent terms with different distributions, with a few caveats.

This is a rationale for making this assumption, not a proof of its validity. You can, however, test it with actual measurements. The looser condition applies to the variances of terms in infinite sequences, and any finite sequence can be extended with i.i.d. variables into a sequence that meets it, but it says nothing about how close the finite sum comes to a Gaussian.

How Many Terms in the Sum?

If one of the \frac{\partial f }{\partial x_i}\times \Delta X_i dwarfs all others, then, for practical purposes, \Delta Y inherits the distribution of X_i, which may not be Gaussian. More commonly, out of the dozens of variables that may influence the result, only a handful actually do.

Convergence Speed with Classical versus Extended CLT

With the classical CLT, we know that convergence can be fast. For example, we can see what happens with sums of uniformly distributed variables.

We can visualize the rapid convergence to a Gaussian by plotting the densities of sums of 2, 4, 8, and 12 of them.

In fact, a classical method for simulating a standard Gaussian is to add just 12 uniformly distributed variables from a random number generator, like the RAND() function in Excel, and subtract 6.
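
A sketch of this trick: the sum of 12 independent U(0,1) variables has mean 6 and variance 12 \times 1/12 = 1, so subtracting 6 yields an approximately standard Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)
# Sum of 12 U(0,1) draws, shifted by 6: mean 0, variance 1, nearly Gaussian
samples = rng.uniform(size=(100_000, 12)).sum(axis=1) - 6
print(samples.mean(), samples.std())    # ~0 and ~1
print(np.mean(np.abs(samples) < 1.96))  # ~0.95, as for a standard Gaussian
```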

With the extended CLT, we can’t say in general. We have a handful of terms, perhaps independent but not identically distributed, and with unequal contributions to the measurement error. In this general case, convergence to a Gaussian takes more than a handful of terms.

The Information Argument

As the CLT argument doesn’t quite cut it, we can look to information theory for another justification. In a way that we’ll clarify below, it says that the Gaussian distribution is the “most random” distribution for a continuous variable with a given mean \mu and standard deviation \sigma. Therefore, this argument is that, if all we know about the distribution of a measurement error is its \mu and \sigma, it makes sense to use a Gaussian model.

Shannon’s Entropy

Until the late 1940s, we had no metric of randomness. Then Claude Shannon borrowed the term “entropy” from physics for a quantity that we can use as such. If you take a binary variable X that is true with probability p and false with probability q=1-p, Shannon’s entropy is

H(X) = -p\times log(p) - q\times log(q)


If p=1 or q=1, the outcome is deterministic, there is no randomness, and H(X) = 0. On the other hand, if p=q= \frac{1}{2}, we think of it as the most random X can be, and H(X) = log(2) is maximized. If you take the log in base 2, then H(X) = 1, an amount of entropy that Shannon proposed calling a “bit.” For other values of p and q, 0 < H(X) < 1 can be used as a metric of randomness.
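
In code, with base-2 logs so that the entropy comes out in bits:

```python
import math

def binary_entropy(p):
    """Shannon entropy of a binary variable, in bits."""
    return sum(-x * math.log2(x) for x in (p, 1 - p) if x > 0)

print(binary_entropy(0.5))   # 1.0 bit: maximal randomness
print(binary_entropy(0.99))  # ~0.08 bits: nearly deterministic
```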

The definition extends easily to variables that can take k values with probabilities p_i, i=1,\dots, k:

H\left ( X \right ) = -p_1 \log \left ( p_1 \right ) -\dots - p_k \log \left ( p_k \right )


And not so easily to a continuous, real variable with p.d.f. f as

H\left ( X \right ) = -E\left (\log \left [f\left ( X \right ) \right ] \right ) = -\int_{-\infty }^{+\infty}\log \left [f\left ( x \right ) \right ]f\left ( x \right ) dx


This integral converges for the Gaussian distribution, which can be shown to maximize entropy among continuous distributions with given \mu and \sigma.
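
Plugging the Gaussian density into this integral gives a closed form, the maximum among continuous distributions with the same \sigma:

H\left ( X \right ) = \frac{1}{2}\log\left ( 2\pi e\sigma^2 \right )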

Application to Gage R&R?

Can we apply this to Gage R&R residuals? By construction, \mu = 0, but we don’t know \sigma. In fact, the whole point of using ANOVA is to assess the difference between the \sigma within and between groups from estimates, and the groups are small samples.

If we estimate \sigma from the sample and then declare we know it, we can decide to apply the Gaussian model to any continuous variable. If there are 10 points within a group, the estimate of \sigma will be too loose a basis for any model; if 30,000 points, it will be a close approximation of the true value, but then you have plenty of data to fit a distribution, knowing much more than just the standard deviation.
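
A quick simulation shows how loose an estimate of \sigma from 10 points is, even when the underlying distribution really is Gaussian with \sigma = 1:

```python
import numpy as np

rng = np.random.default_rng(0)
# Sample standard deviations of 10,000 groups of 10 points with true sigma = 1
s = rng.normal(0.0, 1.0, size=(10_000, 10)).std(axis=1, ddof=1)
print(np.percentile(s, [2.5, 97.5]))  # roughly [0.55, 1.45]: a wide range
```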

As G.L. Bretthorst put it, “this is a rather ad hoc thing to do and has no justification whatsoever.” On the other hand, what does make sense, as Bretthorst shows, is using the Gaussian as a prior distribution to be refined based on data with Bayesian methods.

So the maximum entropy argument doesn’t quite work either.

When the Errors are Large

It is also essential that the errors be small. The COVID-19 reproduction number R_0, introduced above, provides a counter-example: for China, the US CDC gave a 95% confidence interval ranging from 3.8 to 8.9. The small-error assumption is not warranted in this case, and our basis for assuming Gaussian errors is out the window.

Conclusions

Because they must document their experiments so that peers around the world can replicate them, scientists have their professional reputations at stake in the accuracy and precision of their instruments. They understand the physical principles used in the measurements, take the time to study their performance, and describe the specifics in the methodology sections of their papers.

Compliance with Standards

The technicians who perform Gage R&R studies in Manufacturing are in a different world. They are trained on standards they are expected to follow and have nothing to gain by second-guessing them.

The literature on Gage R&R offers one-size-fits-all methods, based on standard deviations, as discussed here, or on ranges. It is oblivious to the backstory of the measurements: whether they are lengths, pressures, or pH values, they are just numbers to which you apply the same formulas.

Small Samples

The forum moderator of The Elsmar Cove made the following comment:

“Technically, the residuals of the ANOVA analysis should be normally distributed. However, with a typical sample size of 90 (3 x 3 x 10), the normality assumption is no longer important.”

Presumably, he is talking about groups of 3 appraisers measuring 3 parts, 10 times each, which is a small sample. For his Honest Gage R&R Study, Wheeler considered even smaller groups: 2 to 4 measurements for each appraiser and part.

A Gage R&R study is a designed experiment, and involves collecting data specifically for its purpose. It is based on small data sets, for which distribution assumptions are needed to decide whether the differences you observe between groups are significant at any level. It is with the large data sets you encounter in data mining that the distributions matter less, as any difference you can observe between groups of 30,000 points ends up having a minuscule p-value. Practically, you don’t need to worry about their significance. 
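
As an illustration with made-up numbers, a shift of 0.05 standard deviations between two groups of 30,000 points, negligible for most practical purposes, still produces a minuscule p-value:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(0.00, 1.0, size=30_000)
b = rng.normal(0.05, 1.0, size=30_000)  # practically negligible shift
print(stats.ttest_ind(a, b).pvalue)     # minuscule p-value anyway
```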

One Size Fits All?

A single method applicable to all measurement errors regardless of the variable may be desirable, but it is not clear that it is technically feasible. In particular, I have not been able to find a compelling justification for assuming that all measurement residuals are Gaussian. This assumption is expedient, as it allows you to apply an established body of math, but the primary purpose of Gage R&R is not expediency but the objective assessment of a measurement system’s precision.


#measuredvariable, #measurementerror, #measurement, #gager&r, #anova