Jul 18 2022

The Most Basic Problem in Quality

Two groups of parts are supposed to be identical in quality: they have the same item number and are made to the same specs, at different times in the same production lines, at the same time in different lines, or by different suppliers.

One group may be larger than the other, and both may contain defectives. Is the difference in fraction defectives between the two groups a fluctuation or does it have a cause you need to investigate? It’s as basic a question as it gets, but it’s a real problem, with solutions that aren’t quite as obvious as one might expect. We review several methods that have evolved over the years with information technology.

A/B Testing

The latest incarnation of this problem is not about manufacturing parts but about interface designs for web services. A/B testing evaluates users’ response to web page variant A versus variant B. If the question is whether or not users press a Submit key, then the fraction of users that do is mathematically identical to the fraction of purchased components that fail an incoming quality test. You can summarize the results in a 2×2 contingency table:

	Candidate Supplier (A)	Existing Supplier (B)	Total
Pass	500	995	1,495
Fail	0	5	5
Total	500	1,000	1,500

The main solutions described in the literature are Fisher’s Exact Test (1935) and Barnard’s Exact Test (1945). In 2022, both methods look old. What is most surprising, however, is that this basic problem was not solved earlier.

Fisher’s Exact Test

Ronald Fisher published a method for testing the difference in performance between two groups in 1935. It is simple and easy to understand but, except for small samples, computationally intractable at the time.

We view each unit as having a probability $p$ of failing and all units as independent. It makes the number of fails in a group a binomial variable, entirely determined by the group size and $p$ , with $p$ estimated by the relative frequency $\hat{p}= \frac{Number\,Failed}{Group\,Size}$ .

The Exact Question

We tend to think of the problem as one of comparing the relative frequencies between the two groups but, in fact, $p$ is not what we are after. We are trying to establish whether the data tell us that one group has better quality than the other, not by how much. It makes $p$ a nuisance parameter, and the beauty of Fisher’s Exact Test is that it dodges it.

The null hypothesis $H_0$ is that the two groups have the same $p$ . In this case, we can pool them and treat each group as the result of sampling without replacement from that homogeneous pool.

The number of fails in each group then follows a hypergeometric distribution, and we can test the compatibility of the actual results with that distribution. It involves no approximation and no estimation of $p$ . It is based exclusively on the data and makes no use of any backstory of quality that the analyst might be aware of.

Details

If the two groups are $A$ and $B$ , $\left ( r_A, s_A \right )$ and $\left ( r_B, s_B \right )$ are respectively the numbers of fails and passes in each group, and $n=r_A+s_A+r_B+s_B$ is the total number of units, then the contingency table is as follows:

	A	B	Total
Pass	$s_A$	$s_B$	$s_A+s_B$
Fail	$r_A$	$r_B$	$r_A+r_B$
Total	$s_A+r_A$	$s_B+r_B$	$n$

Given $H_0$ , the probability of experiencing $r_A$ fails in group $A$ is

h\left ( r_A, s_A, r_B, s_B \right )=\frac{\left(\begin{array}{c} r_A+s_A \\ r_A \end{array}\right)\left(\begin{array}{c} r_B+s_B \\ r_B \end{array}\right)}{\left(\begin{array}{c} n \\ r_A+r_B \end{array}\right)}

If we want to know whether Group A has a lower fail rate than Group B, the cumulative distribution function of $r_A$ gives us the p-value — that is, the probability that Group A is at least this good under $H_0$ . In classical statistical testing, you reject $H_0$ if the p-value is below a given threshold.
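As a minimal sketch of this calculation, the following Python code sums the hypergeometric terms $h$ for 0 through $r_A$ fails in group A, keeping every row and column total fixed. It uses exact integer arithmetic from the standard library. The function names are mine, not taken from any package, and this is an illustration of the formula rather than a description of how any statistical software implements the test.

```python
from fractions import Fraction
from math import comb


def hypergeom_prob(r_a, s_a, r_b, s_b):
    """h(r_A, s_A, r_B, s_B): probability of this table given its margins."""
    n = r_a + s_a + r_b + s_b
    return Fraction(comb(r_a + s_a, r_a) * comb(r_b + s_b, r_b), comb(n, r_a + r_b))


def fisher_lower_pvalue(r_a, s_a, r_b, s_b):
    """P(fails in A <= r_a) under H0, with all row and column totals fixed."""
    n_a, n_b = r_a + s_a, r_b + s_b
    fails = r_a + r_b
    total = Fraction(0)
    for k in range(r_a + 1):
        if fails - k > n_b:  # impossible table: B cannot hold that many fails
            continue
        total += hypergeom_prob(k, n_a - k, fails - k, n_b - (fails - k))
    return total


# First table of this post: 0 fails out of 500 versus 5 fails out of 1,000
p_value = float(fisher_lower_pvalue(r_a=0, s_a=500, r_b=5, s_b=995))
print(f"{p_value:.4f}")
```

Python integers have arbitrary precision, so the binomial coefficients that defeated human computers are evaluated exactly here, at the cost of working with very long integers for large tables.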

Computational Challenges

With the information technology of the 1920s and 30s, Fisher could apply his test only to tiny problems like the lady tasting tea, where the object was to determine whether Fisher’s Rothamsted colleague Muriel Bristol could recognize in a blind test whether tea had been poured over milk or milk over tea. She could. The counts in the contingency table were in single digits, and human computers could evaluate the formula. It wasn’t the case for samples with thousands or even hundreds of points.

Results

Now, we can pass the above contingency table to an R function called fisher.test that instantly tells us that the p-value is 13%, meaning that a perfect sample of 500 from a candidate supplier is not sufficient to establish that it performs better through our incoming quality checks than a sample of 1,000 with a 0.5% fail rate at incoming inspection from our existing supplier.

If we instead receive a 1,000-unit perfect sample from the candidate, the p-value goes down to 3.1%, which some traditional statisticians may deem significant, as it is below 5%.

In general, however, we have received for production many more pieces from the existing supplier than as an evaluation sample from the candidate. Let’s say that we still receive a 1,000-unit perfect sample from the candidate, and have received 100,000 units from the existing supplier over the past 3 months, with 0.5% failing our incoming tests or 500 fails. It gives the following contingency table:

	Candidate Supplier (A)	Existing Supplier (B)	Total
Pass	1,000	99,500	100,500
Fail	0	500	500
Total	1,000	100,000	101,000

For this table, fisher.test then gives us a p-value of 0.68%. It doesn’t mean that our incoming quality checks effectively detect defectives or that the candidate supplier can ramp up to tens of thousands of units per month at the same level of quality, but it’s a basis for further investigations.
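Because the candidate's sample is perfect in all three tables, $r_A = 0$ and the one-sided p-value reduces to the single term $h\left ( 0, s_A, r_B, s_B \right )$, which simplifies to a product of $r_B$ ratios. The Python sketch below, with a hypothetical function name, evaluates that product. Its results should agree with the rounded p-values quoted above.

```python
def perfect_sample_pvalue(n_a, n_b, fails_b):
    """P(0 fails in A) under H0 when all pooled fails are found in B.

    Equal to C(n_b, fails_b) / C(n_a + n_b, fails_b), written as a running
    product so that no large binomial coefficient is ever formed.
    """
    n = n_a + n_b
    p = 1.0
    for i in range(fails_b):
        p *= (n_b - i) / (n - i)
    return p


# (candidate sample size, existing supplier sample size, fails from existing supplier)
tables = {
    "500 vs 1,000": (500, 1_000, 5),
    "1,000 vs 1,000": (1_000, 1_000, 5),
    "1,000 vs 100,000": (1_000, 100_000, 500),
}
results = {name: perfect_sample_pvalue(*t) for name, t in tables.items()}
for name, p in results.items():
    print(f"{name}: {p:.2%}")
```

This shortcut only applies when one group has no fails. Otherwise, the p-value is a sum of several hypergeometric terms.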

Barnard’s Exact Test

While Fisher’s Exact Test is the most commonly used, Barnard’s Exact Test also has followers. It requires even more computation than Fisher’s and was impossible to use on anything but small problems when George Barnard developed it in 1945. Times have changed, but I found Barnard’s test to still take too long with large numbers.

Barnard’s Criticism of Fisher’s Test

In our example, we compare a perfect sample with one that has defectives. Then, $r_A = 0$ . In the general case of $r_A = k \gt 0$ , the p-value is the probability that $r_A \leq k$ , or

\begin{array}{llll} P\left ( r_A\leq k \right )= & h\left ( 0, s_A+k, r_B+k, s_B-k \right )+\\ & h\left ( 1, s_A+k-1, r_B+k-1, s_B-k+1 \right )+ \dots+\\ & h\left ( k, s_A, r_B, s_B \right )\\ \end{array}

It’s a consequence of the model setup, with its constant column and row totals in the contingency table. As a result, fewer fails from supplier A means more from supplier B, which Barnard saw as an artifact of Fisher’s Test rather than a feature of the actual problem.

Barnard’s Alternative

Instead of bypassing the nuisance parameter $p$ altogether, Barnard’s Test goes through a range of values of $p$ and chooses the one that maximizes the p-value. In other words, whatever the value of $p$ is, the corresponding p-value of the test will be no higher than its maximum over the $\left [ 0,1 \right ]$ range for $p$ . Given that I have found the test to be unusable on large samples, I put the details of the algorithm in an Appendix for the curious.

Results

The Barnard package in R is one of many tools available to run this test. While, on my laptop, R produces instant answers for the Fisher test, it’s not the case for the Barnard test, and it doesn’t seem to scale well.

For the case of:

	Candidate Supplier (A)	Existing Supplier (B)	Total
Pass	1,000	995	1,995
Fail	0	5	5
Total	1,000	1,000	2,000

it takes about 15 minutes to return a p-value of 1.3%, whereas the Fisher test gave 3.1%.

To know what answers the Barnard test provides on a realistic contingency table I tried to run it on the following:

	Candidate Supplier (A)	Existing Supplier (B)	Total
Pass	1,000	99,500	100,500
Fail	0	500	500
Total	1,000	100,000	101,000

On my computer, it ran overnight without finishing, which says that it’s not usable for big data. The input is only four numbers, but it does an enormous number of operations on them. For the same numbers on the same computer, fisher.test gave the answer instantly.

Tests Based On Approximations

The Fisher and Barnard tests are exact in that they apply directly to the model. We assume that the number of fails in a group follows a binomial distribution and the tests are based directly on this distribution. However, if any part fails with probability $p$ , we know that, as the group size $n$ rises, the distribution of the relative frequency $\hat{p} = \frac{d}{n}$ , where $d$ is the number of fails, becomes indistinguishable from a Gaussian with mean $p$ and standard deviation $\sqrt{\frac{p\left ( 1-p \right )}{n}}$ .

This approximation is usable with as few as 30 points and has been the basis for tests that were practically usable 100 years ago. In SPC, p-charts are supposed to answer the question for a sequence of samples of equal size coming out of one production operation or line, with control limits based on the Gaussian approximation, but you cannot use p-charts to compare the quality of a large volume of parts received from a supplier with a small sample from an alternative source. We’ll review two methods that were usable pre-computers and answer the question in some, if not all cases.

Binomial Probability Paper

This geometric method is a clever trick from the paper-and-pencil age that gives you an answer by plotting a few points and lines on special graph paper. A technician can learn it, and it takes next to no calculation. “Computers,” up to the 1950s, were people with slide rules, mechanical calculators, and, in some cultures, abacuses, who looked up values in books of tables and penciled in results on paper spreadsheets. We met them in Hidden Figures, which featured in particular the role of Katherine Johnson in the transition to electronic computing at NASA.

Human computing took time and was error-prone, and the pioneers of statistical quality worked to reduce the complexity of routine calculations. Range charts, for example, became part of SPC because sample ranges are easier to calculate on a shop floor than sample standard deviations. While we admire their achievements, they have lost their relevance, as we don’t live under the same constraints.

Background

Mosteller & Tukey came up with Binomial Probability Paper in 1946 and published it in 1949. I learned it from a Japanese textbook on quality control in 1981 and used it on occasion but never met anyone else who did. At the time, you could buy Binomial Probability Paper from a US stationery store. Today, Amazon-Japan still sells it as Nikou Kakuritsu Kami (二項確率紙) and even sells books from the 1950s on how to use it, as quality training manuals.

Ratios To Angles

The idea is to plot the number $d$ of fails as a function of the number $s$ of passes on a chart with square-root axes, so that the distance from the origin $\left ( 0,0 \right )$ to the point representing $\left ( s,d \right )$ is $\sqrt{n}$ , where $n = d + s$ is the total number in the lot. Binomial Probability Paper then focuses on the angle $\phi$ of this vector with the x-axis.
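To make the geometry concrete, here is a small Python sketch that places a lot on square-root axes and returns the length of its vector and its angle with the x-axis. The function name and the choice of passes on the x-axis and fails on the y-axis are my reading of the description, not a standard API.

```python
from math import asin, hypot, sqrt


def bpp_point(passes, fails):
    """Coordinates of a lot on square-root axes, plus vector length and angle."""
    x, y = sqrt(passes), sqrt(fails)
    n = passes + fails
    length = hypot(x, y)         # distance from the origin, equal to sqrt(n)
    phi = asin(sqrt(fails / n))  # angle with the x-axis, in radians
    return x, y, length, phi


# Existing supplier in the first example: 995 passes and 5 fails
x, y, length, phi = bpp_point(passes=995, fails=5)
print(f"point=({x:.2f}, {y:.2f}) length={length:.2f} phi={phi:.4f} rad")
```

The sine of the angle is the ratio of the y-coordinate to the length, so its square is the fraction of fails in the lot.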

Example 1

The following example applies this method to our first contingency table above, with a perfect sample of 500 pieces from candidate supplier A and a sample of 1,000 pieces with 5 fails from our existing supplier B. It has the y-axis truncated because there is no point above 100. The split line in black is for the pooled output of the two suppliers and shows 5 failed units for 1,495 that pass.

The corresponding failure rate, ~0.33%, can be read at the intersection of this line with the quarter circle. The red lines delimiting the ±3σ interval around the split line are parallel to it because the standard deviation of the fluctuations across the split line is independent of $n$ . The points for suppliers A and B are well within this interval, leading to the same conclusion as the exact tests: we cannot reject $H_0$ .

Example of Analysis Using Binomial Probability Paper

Example 2

What happens if we have a perfect sample of 1,000 units from the candidate supplier instead? The split line is slightly flatter, and the candidate supplier makes it to the 3σ limit.

With a larger sample, the candidate supplier makes it!

Example 3

With 100,000 pieces from the existing supplier and 500 failing incoming tests and 1,000 perfect pieces from the candidate, we have a difference in sample sizes that becomes difficult to show on Binomial Probability Paper. The only issue that really matters, however, is the distance between the split line and the candidate’s point, which crosses 5 at 1,500. In the following, the existing supplier is off the chart but the candidate is well below the red line:

1,000 perfect pieces versus 100,000 with 500 fails
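The three charts can be cross-checked numerically. The Python sketch below computes the perpendicular distance from the candidate's point to the pooled split line and expresses it in multiples of σ. It assumes, as my reading of the method, that σ is 1/2 in chart units and that the red lines sit at 3σ from the split line. The function name is hypothetical.

```python
from math import asin, sin, sqrt


def distance_to_split_line(passes, fails, pooled_passes, pooled_fails):
    """Perpendicular distance from a lot's point to the pooled split line,
    in chart units (square-root axes)."""
    n = passes + fails
    phi = asin(sqrt(fails / n))
    phi_pooled = asin(sqrt(pooled_fails / (pooled_passes + pooled_fails)))
    return sqrt(n) * abs(sin(phi - phi_pooled))


SIGMA = 0.5  # assumed standard deviation across the split line, in chart units
# (candidate passes, candidate fails, pooled passes, pooled fails)
examples = {
    "Example 1": (500, 0, 1_495, 5),
    "Example 2": (1_000, 0, 1_995, 5),
    "Example 3": (1_000, 0, 100_500, 500),
}
sigmas = {
    name: distance_to_split_line(*args) / SIGMA for name, args in examples.items()
}
for name, k in sigmas.items():
    print(f"{name}: candidate is {k:.2f} sigma from the split line")
```

Under these assumptions, the candidate stays inside the 3σ band in Example 1, reaches it in Example 2, and moves further out in Example 3, in line with the charts.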

Details

Ronald Fisher had observed that, if failures follow a binomial distribution with $d$ as the number of fails and $n$ as the total number of units received, then the angle $\phi$ defined by $\sin^2\phi = \frac{d}{n}$ has a variance that depends almost exclusively on $n$ and not on the probability of failure. $\phi$ is known as the arcsine transform of the ratio $\frac{d}{n}$ .

If quality fails are binomial with probability $p$ and $\hat{p}$ is the relative frequency of failures in a sample of size $n$ , then, with $\phi = arcsin\left [ \sqrt{p} \right ]$ , the estimate $\hat{\phi} = arcsin\left [ \sqrt{\hat{p}} \right ]$ lends itself to the following approximation:

\hat{\phi}= \phi + \frac{\mathrm{d} }{\mathrm{d} p}arcsin\left [ \sqrt{p} \right ]\times \left ( \hat{p}-p \right ) +o\left (\hat{p}-p \right )= \phi + \frac{\hat{p}-p}{2\sqrt{p\left ( 1-p \right )}} +o\left (\hat{p}-p \right )

It is therefore approximately Gaussian, with mean $E\left ( \hat{\phi} \right ) \approx \phi$ and variance

Var\left ( \hat{\phi} \right ) \approx \frac{Var\left ( \hat{p} -p \right )}{4p\left ( 1-p \right )} = \frac{1}{4n}

Its standard deviation is $\sigma \approx \frac{1}{2\sqrt{n}}$ . Therefore, at the end of a vector of length $\sqrt{n}$ , the standard deviation of the fluctuations perpendicular to the vector will be $\sqrt{n}\times\frac{1}{2\sqrt{n}} = \frac{1}{2}$ regardless of $n$ .
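The approximation can be checked without simulation by summing over the binomial distribution. The following Python sketch computes the exact variance of $\hat{\phi}$ for a few values of $p$ and scales it by $4n$, which should come out close to 1 if the approximation holds. The sample size of 1,000, the values of $p$ and the function names are arbitrary choices of mine.

```python
from math import asin, exp, lgamma, log, sqrt


def binom_pmf(d, n, p):
    """Binomial probability of d fails out of n, computed on the log scale."""
    log_coeff = lgamma(n + 1) - lgamma(d + 1) - lgamma(n - d + 1)
    return exp(log_coeff + d * log(p) + (n - d) * log(1 - p))


def arcsine_variance(n, p):
    """Exact variance of arcsin(sqrt(d/n)) when d is binomial(n, p)."""
    pmf = [binom_pmf(d, n, p) for d in range(n + 1)]
    phi = [asin(sqrt(d / n)) for d in range(n + 1)]
    mean = sum(w * a for w, a in zip(pmf, phi))
    return sum(w * (a - mean) ** 2 for w, a in zip(pmf, phi))


n = 1_000
scaled = {p: 4 * n * arcsine_variance(n, p) for p in (0.1, 0.3, 0.5)}
for p, v in scaled.items():
    print(f"p={p}: 4n * Var = {v:.3f}")
```

The check only covers values of $p$ that are not close to 0 or 1. The first-order expansion above divides by $p\left ( 1-p \right )$ and should not be expected to hold as well near those extremes.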

This is what makes the Binomial Probability Paper possible.

Post-Mortem

In the paper-and-pencil age, this was by far the fastest, easiest way to do this kind of work. It involved no calculation. You didn’t even need a slide rule or a book of tables. JUSE, in Japan, made it part of Quality Control training but the ASQ never did, even though it still teaches older techniques.

Juran’s Quality Control Handbook does not cover it, and neither does Montgomery’s Introduction to Statistical Quality Control. Mosteller & Tukey’s paper came out in 1949; the first stored-program electronic computer ran in 1948. Perhaps they were just too late for the technique to get adopted in the US.

In the US and Europe, however, the arcsine transform had a life in other applications. According to David I. Warton & Francis K. C. Hui (2011),

“the arcsine square root transformation has long been standard procedure when analyzing proportional data in ecology, with applications in data sets containing binomial and non-binomial response variables.”

They think it’s “asinine” and recommend logistic regression instead, but logistic regression estimates the probability $p$ that a unit is good from a set of predictors that can be measurements or attributes. It can help you set quality acceptance procedures for deliveries from suppliers, but it’s not about deciding whether one group is different from another.

Z-Test

Like the Binomial Probability Paper, the Z-test compares the relative frequencies $\hat{p}_A$ and $\hat{p}_B$ of failures in groups A and B but through a formula rather than a chart:

Z=\frac{\hat{p_{A}}-\hat{p_{B}}}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_{A}}+\frac{1}{n_{B}}\right)}}

Details

This assumes the group sizes $n_{A}$ and $n_{B}$ large enough for the Gaussian approximation of $\hat{p_A}$ and $\hat{p_B}$ to be usable. Then $\hat{p_{A}}-\hat{p_{B}}$ is also Gaussian, with 0 mean and variance $Var\left ( \hat{p_A} \right )+Var\left ( \hat{p_{B}} \right )$ .

Per $H_0$ both groups have the same probability of failure $p$ , and $E\left ( \hat{p_{A}} \right )= E\left ( \hat{p_{B}} \right )= p$ . Therefore

Var\left ( \hat{p_A}-\hat{p_B} \right )= \frac{p(1-p)}{n_A} + \frac{p(1-p)}{n_B} = p(1-p)\times\left [ \frac{1}{n_A} + \frac{1}{n_B}\right ]

Over the entire population, the relative frequency of failure is

\hat{p} = \frac{\hat{p_A}\times n_A + \hat{p_B}\times n_B}{n_A +n_B}

If we estimate $p$ by the relative frequency $\hat{p}$ , we see that $Z$ is approximately Gaussian with 0 mean and unit variance and tells us how many $\sigma$ s apart the two groups are, under $H_0$ .

Results

The cumulative distribution function of the Gaussian at $Z$ then gives us one-sided p-values for the different contingency tables:

Supplier	A	B	A	B	A	B
Pass	500	995	1000	995	1000	99500
Fail	0	5	0	5	0	500
p-value	5.66%		1.26%		1.25%
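As a cross-check of this table, here is a minimal Python sketch of the Z-test using only the standard library. The function name is hypothetical, and the one-sided p-value is taken as the Gaussian cumulative distribution function at $Z$, as described above.

```python
from math import sqrt
from statistics import NormalDist


def z_test_lower_pvalue(fails_a, n_a, fails_b, n_b):
    """Z and one-sided p-value for 'A fails less often than B', pooled variance."""
    p_a, p_b = fails_a / n_a, fails_b / n_b
    p_pooled = (fails_a + fails_b) / (n_a + n_b)
    z = (p_a - p_b) / sqrt(p_pooled * (1 - p_pooled) * (1 / n_a + 1 / n_b))
    return z, NormalDist().cdf(z)


# (fails in A, size of A, fails in B, size of B) for the three tables
cases = [(0, 500, 5, 1_000), (0, 1_000, 5, 1_000), (0, 1_000, 500, 100_000)]
z_results = [z_test_lower_pvalue(*c) for c in cases]
for z, p in z_results:
    print(f"Z = {z:.2f}, p-value = {p:.2%}")
```

The p-values it prints can be compared with the last row of the table.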

Conclusions

On three different contingency tables, the different tests all produce different p-values but they all lead to the same decisions:

A perfect sample of 500 units is insufficient evidence that a candidate supplier is better than the existing supplier with 0.5% of fails at incoming tests.
A perfect sample of 1,000 is sufficient evidence to look into the matter further.
A larger sample size for the existing supplier does not change the conclusion.

For this problem as for many others, the quality statisticians of the 1920s through the 1940s developed a number of techniques based on approximations to work around the limitations of their era’s information technology. While we can respect and admire their work, we no longer need to use it.

References

Barnard, G.A. (1945) A New Test for 2 × 2 Tables. Nature 156, 177. https://doi.org/10.1038/156177a0
Barnard, G.A. (1947) Significance tests for 2×2 tables. Biometrika, 34:123-138.
Barnard, G. A. (1990). Must clinical trials be large? The interpretation of p-values and the combination of test results. Statistics in Medicine, 9(6), 601–614. doi:10.1002/sim.4780090606
Fisher, R. A. (1935). The Design of Experiments. Oliver and Boyd.
Mehta, C. & Senchaudhuri, P. (2003). Conditional versus Unconditional Exact Tests for Comparing Two Binomials, Cytel Software Corporation. https://www.nbi.dk/~petersen/Teaching/Stat2009/Barnard_ExactTest_TwoBinomials.pdf
Mosteller, F. & Tukey, J. (1949) The Uses and Usefulness of Binomial Probability Paper, Journal of the American Statistical Association 44, pp. 174–212
Nakazato, H. & Takeda, T. (1959). 二項確率紙の使い方: 付・正規確率紙の使い方 (How to use binomial probability paper, with how to use normal probability paper). JUSE.
Warton, D.I. and Hui, F.K.C. (2011), The arcsine is asinine: the analysis of proportions in ecology. Ecology, 92: 3-10. https://doi.org/10.1890/10-0340.1

Appendix: Details of Barnard’s Exact Test

When we look at the algorithmic details of Barnard’s test, it becomes clear why it’s only practically usable with small data, even with the computing power available in 2022.

The Binomial Density

Given $p$ , under $H_0$ , the joint binomial probability of having $r_A$ fails out of $n_A = r_A + s_A$ units from Supplier A and $r_B$ fails out of $n_B = r_B+ s_B$ units from Supplier B is

b\left ( r_A, s_A, r_B, s_B |p \right )= \binom{n_A}{r_A}\binom{n_B}{r_B}p^{r_A+r_B}\left ( 1-p \right )^{s_A+s_B}

A 2-Dimensional p-value?

In Fisher’s test, the constraint that $r_A + r_B$ is constant made the p-value equal to the cumulative distribution function of $r_A$ . Removing this constraint makes it more complex because we must consider all the contingency tables representing at least as great a separation between the two suppliers as the one we have observed. For brevity, let us introduce the following notation:

\textbf{X}= \begin{pmatrix} s_A & s_B \\ r_A& r_B \\ \end{pmatrix}

Using it, we write

$b\left (\textbf{X} |p \right ) = b\left ( r_A, s_A, r_B, s_B |p \right )$ and use $\textbf{X}_0$ for the values in the example:

\textbf{X}_0 = \begin{pmatrix} 1000 & 99500 \\ 0 & 500 \\ \end{pmatrix}

Then the Wald statistic $T\left ( \textbf{X} \right)$ compares the relative frequencies $\hat{p}_A = \frac{r_A}{n_A}$ and $\hat{p}_B= \frac{r_B}{n_B}$ of failures in groups A and B:

T\left ( \textbf{X} \right)=\frac{\hat{p_{B}}-\hat{p_{A}}}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_{B}}+\frac{1}{n_{A}}\right)}}

For large $n_A$ and $n_B$ , $T\left ( \textbf{X} \right)$ is approximately Gaussian with 0 mean and unit variance, and can be interpreted as a number of $\sigma$ s separating the two groups under $H_0$ . For $\textbf{X}_0$ ,

T\left ( \textbf{X}_0 \right)= 2.24

and the p-value of $\textbf{X}_0$ , given $p$ is

v\left ( \textbf{X}_0 | p\right)= \sum_{\textbf{X}:\,T\left ( \textbf{X} \right )\geq 2.24} b\left (\textbf{X} |p \right )

For group sizes in the thousands, this can be a sum over millions of terms, most of them negligible.

Removing the Condition on $p$

We still need to get rid of the nuisance parameter $p$ . Under $H_0$ , all the units in groups A and B have an equal, unknown probability $p$ of failing at incoming inspections or tests. Barnard’s approach is to find the maximum of $v\left ( \textbf{X}_0 | p\right)$ for $p$ in $\left [ 0,1 \right ]$ . This maximum value is not the exact p-value but an upper bound.
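The procedure can be sketched as a brute-force Python function that is only practical for small tables. It enumerates every table with the same group sizes, keeps those with $T\left ( \textbf{X} \right ) \geq T\left ( \textbf{X}_0 \right )$ , and maximizes the sum of their binomial probabilities over a grid of values of $p$ . The grid search, the convention that $T = 0$ when no unit or every unit fails, and the function names are assumptions of mine. Packaged implementations may differ.

```python
from math import comb, sqrt


def wald_t(r_a, n_a, r_b, n_b):
    """T(X) with pooled variance; set to 0 when no unit or every unit fails."""
    p_hat = (r_a + r_b) / (n_a + n_b)
    if p_hat in (0.0, 1.0):
        return 0.0
    return (r_b / n_b - r_a / n_a) / sqrt(p_hat * (1 - p_hat) * (1 / n_a + 1 / n_b))


def barnard_pvalue(r_a, n_a, r_b, n_b, grid=1000):
    """Maximum over a grid of p of the probability of tables with T >= T(X_0)."""
    t0 = wald_t(r_a, n_a, r_b, n_b)
    # Tables at least as extreme as the observed one: (total fails, coefficient)
    extreme = [
        (i + j, comb(n_a, i) * comb(n_b, j))
        for i in range(n_a + 1)
        for j in range(n_b + 1)
        if wald_t(i, n_a, j, n_b) >= t0 - 1e-12
    ]
    n = n_a + n_b
    best = 0.0
    for step in range(grid + 1):
        p = step / grid
        v = sum(c * p**fails * (1 - p) ** (n - fails) for fails, c in extreme)
        best = max(best, v)
    return best


# Tiny example: 0 fails out of 3 in group A versus 3 fails out of 3 in group B
p_small = barnard_pvalue(r_a=0, n_a=3, r_b=3, n_b=3)
print(p_small)
```

In this tiny example, the observed table is the only one at least as extreme as itself, so the p-value is the maximum of $p^3\left ( 1-p \right )^3$ , reached at $p = 0.5$ . The number of tables to enumerate is $\left ( n_A+1 \right )\left ( n_B+1 \right )$ , and the sum is repeated at every grid point, which is why this kind of approach does not scale to large groups.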


By Michel Baudin • Data science • Tags: A/B testing, Barnard's Test, Binomial Probability Paper, Fisher's Test, Incoming QA, Z-test
