The spirited discussions on LinkedIn about whether probabilities are relative frequencies or quantifications of beliefs are guaranteed to baffle practitioners. They come up in threads about manufacturing quality, supply-chain management, and public health, and do not generate much light. Their participants trade barbs without much civility, and without actually engaging on substance.

The latest one, by Alexander von Felbert, is among the more thoughtful, and therefore unlikely to inspire rants. I do, however, fault it for using words like “aleatory” or “epistemic” that I don’t think are helpful. I am trying to discuss it here in everyday language, and to apply the concepts to numerically specific cases, with an eye to operations.

While there are genuinely great and not-so-great ideas, the root of the most violent disagreements is elsewhere, with individuals generalizing from different experience bases. You may map probability to reality differently depending on whether you are developing drugs in the pharmaceutical industry, enhancing yield in a semiconductor process, or driving down dppms in auto parts. The math doesn’t care as long as you follow its rules, and it doesn’t invalidate other interpretations.

Manufacturing professionals solve problems by using their prior knowledge, direct observation of the actual situation, and available data. These debates are about which tools to use with data, and, rather than reasoning philosophically, the analyst is better off asking John Seddon’s three questions:

Who invented this tool?

What problem was he or she trying to solve?

Do we have this problem?

Education In Math, Science, And Tech Ignores History

The way math, science, and technology are taught makes these questions more difficult to answer than they should be. You learn the current body of knowledge, not its backstory. Most of what follows I learned recently, in reaction to clashing dogmas in my news feed.

The Technology Context Matters

Once you know the inventors, you need to consider every facet of the context they worked in, not only whose shoulders they were standing on. In manufacturing, for example, Ken Alder reports that in 18th-century metal working, instructions for heat treatment in forging were to make the iron “cherry red,” which covers a range. Without appropriate thermometers, they had to work without temperature data. Forging quality suffered but sophisticated analysis tools would have been moot.

100 years ago, Walter Shewhart had better instruments but was limited to working with paper spreadsheets, slide rules, and distribution tables… When we have the same problems today, we can leverage the technology developed since, but some authors have not noticed. Mary McShane-Vaughn’s Probability Handbook (2016) ends with 42 pages of distribution tables that solve a problem we no longer have. Today, the most common software packages provide built-in functions for these distributions.
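As an illustration of what replaces a printed table, here is a minimal sketch using Python's standard library. The choice of Python and of the standard normal distribution is mine, not something the authors cited above discuss; any statistics package offers equivalents.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
z = NormalDist(mu=0.0, sigma=1.0)

# What a printed table gives you: the cumulative probability up to z = 1.96.
p_below = z.cdf(1.96)

# The reverse lookup: the value that leaves 97.5% of the probability below it.
z_975 = z.inv_cdf(0.975)

print(f"P(Z <= 1.96) = {p_below:.4f}")
print(f"97.5th percentile = {z_975:.4f}")
```

The two calls are the forward and reverse lookups you would otherwise do by running a finger down a table.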

Relative Frequencies

The view of probabilities as relative frequencies was used and expanded by several thinkers over 200 years. Jakob Bernoulli, Pierre-Simon Laplace, and Richard von Mises were three of them, each a century apart, all looking in the same direction.

Jakob Bernoulli First Used The Term “Probability”

300 years ago, Jakob Bernoulli was the first to use the term “probabilitas” in an analysis of games of chance. In his Ars Conjectandi (“The art of guessing”), from 1713, he uses it to denote the relative frequency of outcomes.

Laplace Formalized Bernoulli’s Definition

100 years later, in 1812, Laplace chimed in with the following definition, cited by E.T. Jaynes:

“The Probability of an event is the ratio of the number of cases favorable to it to the number of cases possible when nothing leads us to expect that any one of those cases should occur more than any other, which renders them, for us, equally possible.”

Von Mises Expanded Laplace’s Definition

Another century later, in Probability, Statistics, and Truth (1928), Richard von Mises introduced another wrinkle, the notion that the probability is the limit of relative frequencies when the number of observations increases, within a population, which he calls a “collective.” He adds the two caveats that such a limit must exist and that it must be independent of the sequencing of the observations.

The latter is a way of saying that they must be independent and identically distributed random variables without using the concepts of random variables, independence, and distributions. He couldn’t use these concepts in defining probability because he needs the concept of probability to define them. Von Mises’s full definition is as follows:

“The limiting value of the relative frequency of a given attribute, assumed to be independent of any place selection, will be called ‘the probability of that attribute within the given collective.'”

What about today?

It’s now 2022, yet another century later, and we are calculating probabilities in ways that are not covered by von Mises’s definition, as pointed out in an earlier post. Both the math of probability and its applications to reality have moved on, in ways I will elaborate on below. If you don’t know the proportion of red and white beads in an urn and keep pulling out one bead at a time, recording its color, and putting it back, the proportion of red beads among the draws will tend to a limit that satisfies the mathematical definition of a probability, with the probability space having just two elements, “red” and “white.” Such limits of relative frequencies are probabilities but not all probabilities are such limits.
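The urn experiment is easy to simulate. The sketch below is only an illustration: the 30% share of red beads and the function name are assumptions of mine, and a pseudo-random generator stands in for the urn.

```python
import random

def red_fraction(p_red, n_draws, seed=0):
    """Relative frequency of red beads in n_draws draws with replacement."""
    rng = random.Random(seed)
    reds = sum(1 for _ in range(n_draws) if rng.random() < p_red)
    return reds / n_draws

TRUE_P_RED = 0.3  # hypothetical urn composition, unknown to the person drawing

# The relative frequency settles down as the number of draws grows.
for n in (10, 100, 10_000, 100_000):
    print(n, red_fraction(TRUE_P_RED, n))
```

With few draws the relative frequency wanders; with many, it stays close to the urn's actual proportion, which is the behavior von Mises turned into a definition.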

Flows versus Single Events

When you are making auto parts at a takt time of 10 seconds, it makes sense to view fluctuations in critical dimensions as a sequence of independent draws that keeps growing as long as you keep producing. There are, however, many cases where this view is not an option.

Analysis of Spatial Data

My first brush with the application of probabilities to a real problem was the estimation of an orebody with nonferrous metals, to support the decision of whether to mine it, as a student of Georges Matheron. Starting from values measured in a grid of boreholes, the job entailed modeling the grades of the various metals as a random function of location within the orebody, for which there was only one draw. There was no relative frequency to go by.

The “heavenly croupier,” geology, folded the rock layers and trapped the ore just once. By no stretch of the imagination could any connection be made between the probability of a copper grade being between 5% and 10% within a given cube underground and a relative frequency observed in repeated, independent draws.

Yet probability theory could be used, with inferences based on the mutual influence between measurements taken across different distances. Obviously, it was a different concept of probability from von Mises’s.

This is about whether to mine the orebody, and we can only know the accuracy of the estimate if we decide to mine it. If the estimate is too low, you don’t mine and, as a result, can never know the accuracy of the estimate. Regardless of the method you use, you can’t validate your estimate unless you actually mine.

Never or Rarely Observed Events

Acceptance Sampling In The Age Of Low PPM Defectives discusses the assignment of a probability to an event you have never observed, the probability that a supplier with a perfect record will next ship a defective unit. The relative frequency of the event in the history of shipments from the supplier is 0. Yet you know that the next one will be defective with a probability >0. Furthermore, it is not the same if the perfect run was for 1,000 or 1,000,000 units.
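One way to see why the length of the perfect run matters is to ask which defect rates remain plausible after it. The sketch below uses a classical one-sided confidence bound, which is my choice for illustration and not necessarily the approach of the post cited above. It assumes that units are independent and share one constant defect probability.

```python
def upper_bound_defect_rate(n_good, confidence=0.95):
    """Largest defect probability p still consistent, at the given
    confidence level, with n_good consecutive defect-free units.
    Solves (1 - p) ** n_good = 1 - confidence for p."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n_good)

for n in (1_000, 1_000_000):
    p_max = upper_bound_defect_rate(n)
    print(f"{n:>9,} good units: p <= {p_max * 1e6:,.1f} ppm")
```

Under these assumptions, the bound shrinks roughly as 3/n, so a perfect run of 1,000,000 units supports a claim about 1,000 times stronger than a perfect run of 1,000.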

In some cases, the data set is too small and heterogeneous to provide meaningful relative frequencies. There have been four disasters in nuclear power plants in the past 65 years, in three different countries with varying levels of openness, and with four different reactor designs. While no one wishes for a richer data set, the existing one provides no way to calculate relative frequencies. Probabilistic Risk Assessment in such cases certainly entails assigning probabilities by a logical analysis of incomplete information.

Beliefs

Von Mises was aware that he was giving probability a meaning it doesn’t have in everyday life. His contemporary Frank Ramsey wanted instead to formalize the everyday life meaning into a foundation for the theory.

Meaning of Probability in Everyday Life

The German word for probability is “Wahrscheinlichkeit,” literally “degree of appearance of truth,” which sounds like Stephen Colbert’s truthiness. We describe events as probable when we think they will happen but aren’t certain they will. If we don’t have to respond, it makes no difference.

On the other hand, if we must act, we must decide whether to assume the events will happen, as in packing an umbrella because the weather report announces rain. Then probability becomes the degree to which an event is probable.

Von Mises’s View

Von Mises quotes several statements along these lines:

“The probable is something that lies between truth and error” Thomasius (1688)

and

“Probability, in the subjective sense, is a degree of certainty which is based on strong or even overwhelming reasons for making an assumption.[…] In the objective sense, the probable is that which is supported by a number of objective arguments, […] There are differing degrees of probability and these depend upon the kind and number of reasons or facts upon which the assertion or conclusion of probability is based.” Robert Eisler’s Dictionary of Philosophic Concepts (1910)

Dictionary definitions attempt to capture this notion but, as von Mises points out, they are often useless and circular, and it is just as true today as it was 100 years ago. His own does not have these flaws but does not cover the entire field. He describes a way to calculate or approximate a number but not what this number means.

Frank Ramsey and Probability As Extension Of Logic

Frank Ramsey undertook to formalize the everyday usage of probability into an extension of logic, in which, instead of being true or false, propositions have a probability between 0 and 1. Different individuals may assign different values, and these probabilities are therefore subjective, which von Mises was trying to avoid.

There are cases, however, where subjectivity doesn’t matter. If, for example, you are responsible for repairing a broken machine that is holding up a production line, you have to choose between courses of action, from shutdown-and-restart to replacement of a subsystem.

Your assessment of the probability of success of each is the only one that matters because you are accountable for the outcome. Objectivity matters, on the other hand, when you measure the position of a star with a given telescope. The measurement errors follow a probability distribution that should not vary with each observer.

Probability In The Abstract

The seemingly irreconcilable perspectives on probability as relative frequency or strength of belief are not discussed in math courses on probability. These concerns are with the way the theory maps to reality and not the theory itself. Geometry was invented to restore the boundaries of fields when the Nile floods receded in ancient Egypt. Then it evolved into a general theory that serves us in architecture, engineering, and other fields. You can apply it without knowing anything about farming in ancient Egypt.

Likewise, probability has evolved into a mathematical theory that maps to reality in many more ways than its originators thought of. The theory determines which computations are legitimate with any kind of probability model; the pragmatics then provide guidelines on developing probability models to fit a variety of situations.

A.N. Kolmogorov: the Euclid of Probability

It happened with probability in the last 100 years, versus 2300 years ago for geometry. The Euclid of probability was A.N. Kolmogorov. His 1933 Foundations of the Theory of Probability provided a list of five axioms from which the entire theory could be logically deduced. The English translation came out in 1950.

You can summarize his axioms further by saying that a probability is any measure defined on an event-space and that it is 1 for the entire space. Underlying this simple sentence, however, is a body of early 20th-century concepts of measure theory developed for other purposes by H. Lebesgue and others.
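In the notation of most current textbooks, this one-sentence summary can be written as follows. The symbols are assumptions of mine, and this is the modern condensed form, not a transcription of Kolmogorov's own list of axioms: \Omega is the space of outcomes and \mathcal{F} the collection of events to which probabilities are assigned.

```latex
\begin{aligned}
&P : \mathcal{F} \to [0, 1], \qquad P(\Omega) = 1, \\
&P\Bigl(\bigcup_{i=1}^{\infty} A_i\Bigr) = \sum_{i=1}^{\infty} P(A_i)
\quad \text{for pairwise disjoint } A_1, A_2, \ldots \in \mathcal{F}
\end{aligned}
```

The second line is what makes P a measure: the probabilities of mutually exclusive events add up.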

From Vertical To Horizontal Stripes

In High School, following Bernhard Riemann, you calculate the area under a curve by dividing it into vertical stripes of shrinking widths; Lebesgue had the idea of dividing it into horizontal stripes instead. It doesn’t sound like much but it made a difference. Measure theory is now taught as part of calculus in Graduate School, but only to mathematicians, not to physicists or engineers. One book called Mathematics for Physicists (1967) has an overview of it but Calculus for Scientists and Engineers (2019) omits it.
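To make the two ways of slicing concrete, here is a rough numerical sketch for the area under x^2 between 0 and 1, which is 1/3. It is only an illustration of the idea, not of measure theory proper: the function names are mine, the horizontal version only works for nonnegative functions, and it estimates lengths by counting points on a grid.

```python
def riemann_sum(f, a, b, n):
    """Vertical stripes: slice the x-axis into n equal widths (midpoint rule)."""
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx

def horizontal_stripe_sum(f, a, b, f_max, n_levels, n_grid=1_000):
    """Horizontal stripes: slice the y-axis into n_levels equal heights.
    For each height t, estimate the length of the set {x : f(x) > t}
    by counting grid points, then add up length * stripe height.
    Assumes 0 <= f(x) <= f_max on [a, b]."""
    dx = (b - a) / n_grid
    values = [f(a + (i + 0.5) * dx) for i in range(n_grid)]
    dt = f_max / n_levels
    total = 0.0
    for k in range(n_levels):
        t = (k + 0.5) * dt
        length = sum(1 for v in values if v > t) * dx
        total += length * dt
    return total

def square(x):
    return x * x

vertical = riemann_sum(square, 0.0, 1.0, n=1_000)
horizontal = horizontal_stripe_sum(square, 0.0, 1.0, f_max=1.0, n_levels=500)
print(vertical, horizontal)  # both should be close to 1/3
```

The horizontal version asks, for each height, how much of the x-axis lies under the curve at that height. That is the question a measure answers, and it is why this way of slicing generalizes to probabilities.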

Diffusion of Kolmogorov’s Theory

If you take a college course on probability theory or check out any textbook from the last 50 years, you learn Kolmogorov’s theory. It has become the standard for teaching probability, as can be seen in college syllabi. Kolmogorov’s own book was only 74 pages. William Feller’s modestly titled but forbidding 1,200-page Introduction to Probability Theory and Its Applications Vol. 1 and 2 (1957) was the first American book explicitly based on Kolmogorov’s method. Patrick Billingsley’s 624-page Probability and Measure (3rd edition, 2012) is a more modern treatment.

Yet, some authors still stick to the von Mises view. Mary McShane-Vaughn still defines probability in terms of relative frequencies. The same restrictive conception is in Don Wheeler’s Myths About Data Analysis (2012), pp. 16-17, where he wrote “In the mathematical sense all probability models are limiting functions for infinite sequences of random variables.” This is just not sufficient.

Mapping To Reality

Kolmogorov’s theory is a formal game played with symbols, based on the rules in the axioms. It fills volumes with results on how to combine the probabilities of different events into a variety of quantities, but it is mute, or agnostic, on the right way to assign probabilities to basic actual events like “the next card out of the shoe is a jack of spades” or “the rod is 10.42 inches long.”

Kolmogorov does not map the concept of probability to the realities of coin tosses, poker hands, beads in urns, baseball scores, insurance underwriting, critical dimensions of goods, demand forecasts, failure diagnoses, passenger no-shows at flights, election results, or excess deaths from a pandemic. In other words, he doesn’t address pragmatics. You do this mapping when you apply the theory to a problem but his theory gives you no guidance. This is where relative frequencies, Ramsey’s degrees of belief, or both come into play.

Application To Manufacturing Quality

When Walter Shewhart developed statistical methods for quality control in the 1920s, he was applying the state of the art in probability theory, but his work predates Kolmogorov’s. As a result, his vocabulary is confusing to a reader who learned probability decades later.

For example, Shewhart uses terms like a “statistical universe” for a population, and “probability limits” for limits on control charts set for a particular p-value. If, however, you google “probability limit,” it now refers to a type of convergence for sequences of random variables. For these reasons, if we want to quote Shewhart in a way that is intelligible to a modern reader, we must translate his words but doing it accurately is a challenge.

“What we consider to be fully half of probability theory as it is needed in current applications – the principles for assigning probabilities by logical analysis of incomplete information — is not present at all in the Kolmogorov system.”

Teaching the subject at Stanford University, E.T. Jaynes, whom I quote above, found that students had no problem following the math but struggled to connect a real problem to abstract math. Pragmatics was the hard part. Through many examples, Jaynes shows you how to leverage what you know about a problem by using, for example, symmetries.

Management Decisions

For an analysis of management decision-making, I prefer to start with what Kaoru Ishikawa said in his book on TQC. His summary of decision-making by managers as he had observed it was “KKD,” which stood for three words:

Keiken (経験), or experience, as in “We’ve always done it this way.”

Kan (感), feeling, intuition, or truthiness as in “I can’t tell you why it’s the right decision, I just know it is.”

Dokyo (度胸), courage or guts, as in “A leader must decide quickly. You play to win.”

He contrasted this with his recommendation to base decisions on data and statistical analysis.

Should We Ignore the KKD?

Entirely disregarding experience, intuition, and guts and focusing exclusively on the data is exactly what Classical Statistics does. Ronald Fisher uses an experiment to determine whether a particular lady is able to tell tea with milk poured first from tea with milk poured last. The numbers showed that she could. The experimenter had no knowledge whatsoever about the way tastebuds work.
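The arithmetic behind the tea experiment fits in a few lines. The sketch assumes the usual textbook account of the design, which is not spelled out above: eight cups, four with milk poured first, and the lady asked to pick out those four.

```python
from math import comb

CUPS = 8        # total cups in the textbook version of the experiment
MILK_FIRST = 4  # cups prepared with milk poured first

def p_at_least(k_correct):
    """Probability of identifying at least k_correct of the milk-first cups
    when choosing MILK_FIRST cups purely at random (hypergeometric)."""
    total = comb(CUPS, MILK_FIRST)
    return sum(
        comb(MILK_FIRST, k) * comb(CUPS - MILK_FIRST, MILK_FIRST - k)
        for k in range(k_correct, MILK_FIRST + 1)
    ) / total

print(p_at_least(4))  # all four right by luck
print(p_at_least(3))  # three or four right by luck
```

Under this design, a perfect score happens by pure guessing once in 70 tries. That figure uses nothing but the counts, with no model of how taste works.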

In real situations, however, there is information embedded in an individual’s experience and intuition. It is the backstory of the data and ignoring it is neither realistic nor wise. You can’t get humans to do it and, even if you did, it is doubtful that it would yield better decisions.

What you can do is use experience and intuition to formulate hypotheses or theories and analyze data to refute or support them. Then, knowing the imperfections of all stages of this process, it still takes guts to make a decision.

Also, statistical analysis, as I believe Ishikawa meant it, is too restrictive. The classical methods used in quality control are all centered on learning characteristics of populations from measurements on samples. They can tell you that a process is drifting from measurements on workpieces leaving it, but they won’t guide you in finding the root cause of one failure.

Needles and Haystacks

The classical methods can tell you the average number of needles per haystack and where it’s trending, but not how to find a needle in a haystack. This kind of search problem, since World War II, has been approached as a separate application of probability theory, not taught in statistics courses. It starts with prior knowledge, including any theory you might have learned, your experience of previous, similar situations, and responses that have become conditioned reflexes and are what we mean by intuition.

When troubleshooting an electronic device, you don’t need data analysis to tell you to first check whether it is plugged in or has an empty battery. It’s the most common cause, and it’s easy to check and fix. In search, you use prior knowledge to assign probabilities that the needle is in various sections of the haystack. Then you have different methods to find it that vary in time required, cost, and effectiveness, defined as the probability of finding the needle by this method in a section if it is there.

You can visually inspect the outside of the haystack, use a metal detector, take apart a section and sift through the hay, or you can burn the hay. As you proceed through your search, you update the probabilities based on what you learn and adjust your methods. It works for finding wrecks on the ocean floor, diagnosing diseases, and finding the causes of failure in defective units in manufacturing.
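The updating step can be sketched as follows for the simplest case: you search one section with a method of known effectiveness and find nothing. The section names and numbers are hypothetical, and the function is an illustration of Bayes' rule applied to search, not a full search-planning tool.

```python
def update_after_failed_search(priors, section, effectiveness):
    """Bayes update of the location probabilities after searching one
    section with a method of given effectiveness and finding nothing.

    priors: dict mapping section name -> probability the needle is there
    effectiveness: P(find the needle | it is in the searched section)
    """
    p_miss = 1.0 - priors[section] * effectiveness  # P(search comes up empty)
    posterior = {}
    for name, p in priors.items():
        if name == section:
            posterior[name] = p * (1.0 - effectiveness) / p_miss
        else:
            posterior[name] = p / p_miss
    return posterior

# Hypothetical haystack split into three sections.
priors = {"top": 0.5, "middle": 0.3, "bottom": 0.2}
after = update_after_failed_search(priors, "top", effectiveness=0.8)
print(after)
```

In this made-up case, an unsuccessful search of the most likely section with an 80%-effective method drops it from first to last place, and the middle section becomes the best place to look next.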

Example: Measurement Errors in Two Dimensions

Perhaps the most compelling example of pragmatic thinking that E.T. Jaynes gives is from astronomer John Herschel. In 1850, he analyzed errors in the measurement of the position of a star based on two assumptions:

The North-South and the East-West errors are independent.

The magnitude of the error is independent of its direction.

Then simple math shows that a Gaussian distribution — also known as “Normal” — is the only model that is consistent with these assumptions.

Warning: High School Math Ahead

Mathphobes can skip this section, at the cost of missing out on an astonishing result, considering the simplicity of its proof. Jaynes outlines it, leaving it to the reader to work out the details that I am including here. With the possible exception of taking a partial derivative, it’s High School calculus.

Assumption 1 tells us that the probability distribution of the errors is of the form f\left ( x \right )\times f\left ( y \right ) as a function of the East-West error x and the North-South error y. Assumption 2 then tells us that it is independent of the angle \theta and therefore of the form g\left ( r \right ) where r = \sqrt{x^2+y^2}. Consequently, we must have

f\left ( x \right )\times f\left ( y \right )= g\left ( r \right )

By noting that, for any positive x, f\left ( x \right )\times f\left ( 0 \right )= g\left ( x\right ) and defining

h\left ( x \right )= \log\left [ \frac{f\left ( x \right )}{f\left ( 0 \right )} \right ]

we have

h\left ( x \right ) + h\left ( y \right )= h\left ( r \right )

and

h^\prime(x)= \frac{\partial h\left ( r \right )}{\partial x}= h^\prime(r)\times \frac{x}{r} = \frac{h^\prime(r)}{r}\times x

This tells us that h^\prime(r)/r doesn’t vary with y, and, by symmetry, that it doesn’t vary with x either. It is therefore a constant, which we can call 2\alpha, so that h^\prime(x)= 2\alpha x. Since h\left (0 \right )= 0, h\left ( x \right )= \alpha x^2. Therefore, the probability distribution f is of the form

f\left ( x \right )= f\left ( 0 \right )\times e^{\alpha x^2}

For it to be a probability distribution, it must integrate to 1, and therefore \alpha must be negative. With \alpha = \frac{-1}{2\sigma^2}, normalization gives you f\left ( 0 \right )= \frac{1}{\sigma \sqrt{2\pi}}, and f is the Gaussian density with mean 0 and standard deviation \sigma.
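As a numerical sanity check, you can verify that the product of two Gaussian densities is the same at every point of a circle around the origin, while the same product for another bell-shaped density is not. The sketch is mine, not Jaynes's; the comparison density (a Laplace, or double exponential) and the radius of 1.5 are arbitrary choices.

```python
import math

def gaussian(x, sigma=1.0):
    """Gaussian density with mean 0 and standard deviation sigma."""
    return math.exp(-x * x / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

def laplace(x, b=1.0):
    """A non-Gaussian density, used only for contrast."""
    return math.exp(-abs(x) / b) / (2 * b)

def joint_on_circle(f, r, angles):
    """Values of f(x) * f(y) at points at distance r from the origin."""
    return [f(r * math.cos(a)) * f(r * math.sin(a)) for a in angles]

angles = [k * math.pi / 12 for k in range(24)]
gauss_vals = joint_on_circle(gaussian, 1.5, angles)
laplace_vals = joint_on_circle(laplace, 1.5, angles)

# Spread of the joint density around the circle: zero means no dependence
# on direction, which is what Assumption 2 requires.
gauss_spread = max(gauss_vals) - min(gauss_vals)
laplace_spread = max(laplace_vals) - min(laplace_vals)
print(gauss_spread, laplace_spread)
```

The Gaussian spread is zero up to rounding error, while the Laplace density, with independent components, clearly favors some directions over others.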

Running Operations Versus Diagnosing Problems

Let’s revisit the example of Final Inspection and Failure Analysis from a different perspective. Manufacturing processes where an average of 5% of the units fail final test perhaps should not exist but they do.

This week and next, production still needs to meet the demand with the current, flawed process. Meanwhile, the yield enhancement teams working on the process need to diagnose the failures. Maintaining the flow is a matter of relative frequencies; failure diagnoses, of matching process and product knowledge with symptoms.

Maintaining Flow With A Flawed Process

The challenge is to produce 100% of the demand with a process that yields less than 100%, knowing that yield losses vary. For this purpose, treating the limit of the relative frequency of failures as the probability of occurrence, à la von Mises, is legitimate, and you can use the control limits of a classical p-chart as a confidence interval, or p \pm 3\times \sqrt{\frac{p \times \left ( 1-p \right )}{n}}
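The limits in this formula take one line to compute. In the sketch below, the 5% failure rate comes from the example above, while the sample size of 200 units and the function name are assumptions for illustration.

```python
import math

def p_chart_limits(p_bar, n):
    """Three-sigma limits for the fraction defective in samples of size n,
    given a long-run fraction defective p_bar."""
    half_width = 3.0 * math.sqrt(p_bar * (1.0 - p_bar) / n)
    lower = max(0.0, p_bar - half_width)  # a fraction cannot be negative
    upper = min(1.0, p_bar + half_width)  # nor can it exceed 1
    return lower, upper

lcl, ucl = p_chart_limits(p_bar=0.05, n=200)
print(f"Expect between {lcl:.1%} and {ucl:.1%} failures per 200 units")
```

With these numbers, the fraction failing in a lot of 200 should stay between roughly 0.4% and 9.6%, which is the range of fluctuation the buffer has to cover.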

Then you protect the output with a finished goods buffer to absorb the fluctuations and you replenish it using, for example, production kanbans. Details vary, particularly when there are rework opportunities, but the key point is that you are only looking at outcomes and using relative frequencies as probabilities.

Diagnosing Failures

Diagnosis is a different challenge, involving more than occurrence counts.

Root Cause Analysis — A Management Skill

Root Cause Analysis (RCA) is a management framework. It entails defining the problem, collecting data, determining root causes, and recommending/implementing solutions.

“Determining root causes” is only one line item. In RCA, the methods to do it are at the level of asking why five times, to determine why a defect was introduced in a unit of product and why it escaped the attention of the people in charge. FMEA and fault-trees are also listed as tools of RCA but, most of the time, RCA is not about analyzing data in terms of probabilities.

Failure Analysis — A Technical Skill

RCA’s cousin, Failure Analysis, sounds indistinguishable from RCA in some applications but is more technical in others, where it involves taking measurements with instruments, running tests, and analyzing data. RCA is a skill that all professionals should have; Failure Analysis, the province of technicians and engineers.

If you are on the receiving end of final test rejects and your job is to feed information back to the defects’ point of occurrence, then Failure Analysis, to you, is a process that you execute routinely and get better at over time. I have discussed the application of Naive Bayes to Failure Analysis in an earlier post. I would like to elaborate here on how, as an application of probability, it differs from what you do to maintain flow.

How Diagnosticians Work

Of course, many diagnosticians do not think in terms of probabilities. A good mechanic zooms in on why water is leaking from a car’s sunroof and solves the problem instantly by opening a clogged channel with a thin rod.

The local family doctor knows what’s “going around.” During the flu season, it will be treated as a priori more likely; in spring, hay fever; since the start of the pandemic, COVID-19. The patient’s history, age, gender, body language, and observable symptoms then provide a context, which the doctor interprets based on prior experience. Often, the doctor has a tentative diagnosis within a second of the patient’s walking into the office, and then it’s only a matter of confirming or refuting it. You use probabilities when dealing with more complex problems or when trying to automate the analysis process.

Inputs to Diagnosis

For a defect in a unit of product, the environmental factors may include which shift the unit was made on, or which part of a shift. If the process uses powders that clump when humid, then the weather can be a factor. Then there is what you know about the physics and chemistry of the product, and the symptoms it may exhibit as a result of various dysfunctions.

Diagnosis certainly entails assigning probabilities by logical analysis of incomplete information. An oil leak will cause oil to ooze out of the unit — that’s how you know there is an oil leak — but a nonworking start switch won’t cause it. Then there are symptoms like temperature, pressure, or radiation anomalies that multiple failure modes may cause some of the time. This is what the failure analyst knows before receiving the defective workpiece.

A simple formula

The issues vary with every product. To discuss diagnosis in a language we all understand, let us switch to common respiratory diseases, just as an illustration. As for defective product units, the diagnosis starts with context. Then come symptoms and their analysis, and, finally, confirmation through lab tests or treatment attempts.

The end result is still a probability, of the form P\left ( disease \vert symptoms, context \right ), or the probability of the disease given the symptoms and the context. The best prior information we can have, however, is the relative frequency of symptoms for a given disease, which we can choose to interpret as a probability: P\left ( symptoms \vert disease, context \right ). The conversion of the latter to the former is formally simple:

P\left ( disease \vert symptoms, context \right )= \frac{P\left ( symptoms \vert disease, context \right )\times P\left ( disease \vert context \right )}{P\left ( symptoms \vert context \right )}

It’s the basic, 280-year-old Bayes formula. As we shall see, using it isn’t quite as straightforward as it looks, because the input data is elusive.

CDC Lists of Symptoms

The following table, based on the CDC’s lists of symptoms, evokes the whiteboards of TV’s Dr. House, or comparative tables of features for electronic products. It could also be a table of observed defects and causes in a product.

Symptom

Common cold

Flu

Hay fever

COVID-19

Bronchitis

Pneumonia

Allergic shiners

✔︎

Body aches

✔︎

✔︎

✔︎

Chest pain

✔︎

Chills

✔︎

✔︎

✔︎

Cough

✔︎

✔︎

✔︎

✔︎

✔︎

✔︎

Diarrhea

✔︎

✔︎

Difficulty breathing

✔︎

Excess phlegm or sputum

Fatigue

✔︎

✔︎

✔︎

✔︎

✔︎

Fever

✔︎

✔︎

✔︎

✔︎

Headaches

✔︎

✔︎

✔︎

Itchy nose

✔︎

Itchy roof of mouth

✔︎

Itchy throat

✔︎

Loss of smell

✔︎

Loss of taste

✔︎

Muscle aches

✔︎

✔︎

Nausea

✔︎

Post-nasal drip

✔︎

✔︎

Runny nose

✔︎

✔︎

✔︎

✔︎

Shortness of breath

✔︎

✔︎

Sneezing

✔︎

✔︎

Sore chest

✔︎

Sore throat

✔︎

✔︎

✔︎

✔︎

Stuffy nose

✔︎

✔︎

✔︎

✔︎

Vomiting

✔︎

✔︎

Watery eyes

✔︎

Watery, itchy, red eyes

✔︎

It makes the problem appear more complex than it is. Most patients self-diagnose and self-medicate for the common cold, the flu, and even hay fever. By now, COVID-19 testing is readily available, and it’s the other diseases that challenge health professionals.

Then there are symptoms that appear with only one disease, like Wheezing and Excess phlegm or sputum for COPD, Chest pain for pneumonia, or loss of smell and taste for COVID-19. If a patient has any of them, the search is over but not all patients with the disease do. If they have lost taste or smell, you know it’s COVID-19 but they can have COVID-19 without losing taste or smell.

The CDC lists of symptoms, from which this table comes, sometimes use different names for what appears to be the same symptom, like “congestion” and “stuffy nose,” one term being more colloquial than the other. A patient is more likely to talk about a stuffy nose than congestion. Likewise, “Muscle aches” and “Myalgia” are the same. The same happens in quality or maintenance when different technicians use different words for the same defect or the same machine failure.

No Frequencies In CDC Data

Not all patients with each of these diseases have all the symptoms listed, which raises the question of how often they occur. At least on its public pages, the CDC website does not give relative frequencies more specifically than “often” or “sometimes.” A patient’s list of symptoms can come from several diseases, and, based on symptoms alone, a medical doctor can easily confuse a common cold with flu or COVID-19, or bronchitis with pneumonia. The following table is from the Mayo Clinic website:

Symptom | COVID-19 | Flu
--- | --- | ---
Cough | Usually (dry) | Usually
Muscle aches | Usually | Usually
Tiredness | Usually | Usually
Sore throat | Usually | Usually
Runny or stuffy nose | Usually | Usually
Fever | Usually | Usually, not always
Nausea or vomiting | Sometimes | Sometimes (more common in children)
Diarrhea | Sometimes | Sometimes (more common in children)
Shortness of breath or difficulty breathing | Usually | Usually
New loss of taste or smell | Usually (early, often without a runny or stuffy nose) | Rarely

Per the CDC, the tests to tell them apart include RIDT, RT-PCR, viral culture, immunofluorescence assays, spirometry, blood work, chest X-rays, pulse oximetry, sputum tests, CT scans, pleural fluid cultures,… The diseases have different severities and transmission rates, and all the tests take different amounts of time and resources, with different relative frequencies of false positives and negatives.

Symptom Frequency Studies

While the CDC is no slouch at probability when it comes to the surveillance of epidemics, the documents it provides on disease symptoms don’t include relative frequencies and do not use a standard nomenclature for symptoms. The closest its data comes to frequencies is telling you whether symptoms are “common” or occur “sometimes.” Numbers are only available through recent studies of samples of at most a few thousand patients.

The closest I could come for just the flu and COVID-19 is shown in the following table, which still does not provide all the details we need to apply Bayes formula:

Symptom | Flu | COVID-19
--- | --- | ---
Body or muscle aches | 94% | 62.0%
Cough | 93% | 71.0%
Diarrhea | 6% | 30.8%
Fatigue | 94% | 88.8%
Fever | 68% | 93.6%
Headaches | 91% | 35.1%
Runny or stuffy nose | 91% | 21.6%
Vomiting | 15% | 19.6%

We can apply it to a patient who has only symptoms whose relative frequencies are at least roughly known for both the flu and COVID-19.

Based on this combination of symptoms, we know that the patient doesn’t have any of the conditions other than Flu or COVID-19 in our list, and we don’t know which one of the two.

Prior probabilities of the diseases

From the monitoring done by the state of California, we know that, at the end of May, 2022, 13.9% of patients tested for flu were positive, as were 7.9% of patients tested for COVID-19, and 0.03% had both, or “flurona,” which we’ll choose to neglect.

This is not the infection rate of the population at large. It could be known by testing random samples but such studies have never been done in California. A person who seeks medical attention with the above symptoms is within the population that gets tested, and, with other options eliminated by logic, we know further that they have either the flu or COVID-19. The only question is therefore which one of the two diseases they have. Therefore, prior to learning from the relative frequencies of the symptoms,
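One plausible reading of this step can be sketched in a few lines of Python. It assumes that the priors come from simply renormalizing the two positivity rates quoted above so that they sum to 100%, with “flurona” neglected. The names `positivity`, `normalize`, and `priors` are introduced here for illustration only.

```python
# Test positivity rates quoted in the text for California, end of May 2022.
positivity = {"flu": 0.139, "covid19": 0.079}


def normalize(weights):
    """Rescale non-negative weights so that they sum to 1."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Assumption: the patient has exactly one of the two diseases, so the
# priors are the positivity rates renormalized to sum to 1.
priors = normalize(positivity)

for disease, p in priors.items():
    print(f"P({disease} | context) = {p:.1%}")
```

Under this assumption, the flu starts out a bit less than twice as likely as COVID-19, before any symptom is taken into account.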

The assumption that the symptoms are independently occurring in Flu or COVID-19 patients is naive. We know it’s generally not true but commonly make it because (1) it simplifies calculations and (2) it is difficult to improve upon. It lets you multiply the probabilities of each symptom into a probability that a patient has them all:
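As an illustration of that multiplication, here is a minimal Naive Bayes sketch in Python using the symptom frequencies from the table above. The patient's symptom set and the renormalized priors are assumptions made for this example, not the case worked in this post, and the names `SYMPTOM_FREQ`, `naive_likelihoods`, and `posterior` are hypothetical. For simplicity, the sketch only multiplies the frequencies of the symptoms that are present and ignores the absent ones.

```python
from math import prod

# Relative frequencies of symptoms from the table above: (flu, COVID-19).
SYMPTOM_FREQ = {
    "body or muscle aches": (0.94, 0.620),
    "cough": (0.93, 0.710),
    "diarrhea": (0.06, 0.308),
    "fatigue": (0.94, 0.888),
    "fever": (0.68, 0.936),
    "headaches": (0.91, 0.351),
    "runny or stuffy nose": (0.91, 0.216),
    "vomiting": (0.15, 0.196),
}


def naive_likelihoods(symptoms):
    """P(symptoms | disease) under the naive independence assumption."""
    flu = prod(SYMPTOM_FREQ[s][0] for s in symptoms)
    covid = prod(SYMPTOM_FREQ[s][1] for s in symptoms)
    return flu, covid


def posterior(symptoms, prior_flu, prior_covid):
    """Posterior for flu vs. COVID-19 when all other diseases are excluded."""
    like_flu, like_covid = naive_likelihoods(symptoms)
    joint_flu = like_flu * prior_flu
    joint_covid = like_covid * prior_covid
    total = joint_flu + joint_covid
    return joint_flu / total, joint_covid / total


# Hypothetical patient: this symptom set is an assumption for illustration.
patient = ["cough", "fatigue", "fever"]
# Hypothetical priors: the 13.9% and 7.9% positivity rates, renormalized.
p_flu, p_covid = posterior(patient, 0.139 / 0.218, 0.079 / 0.218)
print(f"flu: {p_flu:.1%}, COVID-19: {p_covid:.1%}")
```

Dividing by `total` is what excludes every disease other than the flu and COVID-19. Without that step, the two joint probabilities do not add up to 100%.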

Because of the new information, these probabilities do not add up to 100%. Given the relative frequencies of the symptoms for each disease, there is a 34.8% probability that the patient has another disease. WebMD, for example, lists many other diseases as matches for these symptoms, including lung cancer and Lyme disease. If, for whatever reason, we exclude consideration of all these alternatives,

Assume that, instead of starting with relative frequencies of symptoms, you have a database of individual patients. For each one, you have symptoms and lab-confirmed diagnoses. Then you can apply logistic regression to predict a disease as a function of the symptoms. It requires more detailed data but, unlike Naive Bayes, it does not assume the predictors to be independent. It is the method used in Clinical Signs and Symptoms Predicting Influenza Infection.

Their study covered 3,744 subjects with flu symptoms, of whom 2,470 (66%) were confirmed to have the flu. Using logistic regression, they calculated Positive Predictive Values (PPV) and Negative Predictive Values (NPV) for multiple symptom sets. The PPV is the probability of having the flu if you have a given set of symptoms; the NPV, the probability of not having the flu if you don’t have them.
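To make the two definitions concrete, the following Python sketch computes a PPV and an NPV directly from counts of patients. The counts are hypothetical and are not the study's data. The study derived its values through logistic regression, whereas this sketch only illustrates what the two quantities mean. The function name `predictive_values` is introduced here for illustration.

```python
def predictive_values(tp, fp, tn, fn):
    """Return (PPV, NPV) from counts of patients.

    tp: has the symptom set and the flu
    fp: has the symptom set but not the flu
    tn: lacks the symptom set and does not have the flu
    fn: lacks the symptom set but has the flu
    """
    ppv = tp / (tp + fp)  # P(flu | symptom set present)
    npv = tn / (tn + fn)  # P(no flu | symptom set absent)
    return ppv, npv


# Hypothetical counts, NOT taken from the study.
ppv, npv = predictive_values(tp=800, fp=200, tn=600, fn=400)
print(f"PPV = {ppv:.0%}, NPV = {npv:.0%}")
```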

As reported, the study did not use the context that informs the prior distribution above. They could turn the context information into additional predictors, at the risk of overfitting.

As discussed before, “logistic” in this context has nothing to do with logistics as used in a supply chain. It’s named after the logarithm of the odds, or logit function $\mathrm{logit}(p) = \log\left[p/(1-p)\right]$, which logistic regression actually predicts.
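The logit function and its inverse are short enough to write out. In the Python sketch below, the intercept and the symptom weights are made up for illustration and do not come from any fitted model. The point is only to show how a linear score on the log-odds scale maps back to a probability.

```python
import math


def logit(p):
    """Log of the odds p / (1 - p), for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def inverse_logit(z):
    """Map a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical linear score: the intercept and weights are invented.
intercept = -1.0
weights = {"fever": 1.2, "cough": 0.9}
patient = {"fever": 1, "cough": 1}  # 1 = symptom present, 0 = absent

score = intercept + sum(weights[s] * patient[s] for s in weights)
print(f"log-odds = {score:.2f}, probability = {inverse_logit(score):.3f}")
```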

Where Do We Go From Here?

Having excluded all other diseases, in context, a patient with the set of symptoms is three times more likely to have the flu than COVID-19. Hence, further testing to confirm the flu will be three times more “productive” than testing for COVID-19.

Focusing on the flu is like attacking defects based on a Pareto chart of occurrence counts. It would, however, be forgetting vital differences between the diseases. Among patients seeking medical attention, COVID-19 requires hospitalization twice as often as the flu and kills three times more often. Once you factor in severity, you decide to test first for COVID-19.

Conclusions

Needles hurt cattle eating hay, and you may want to make sure no haystack contains any. Alternatively, you may need the needle and search the haystack for it. Depending on your problem, you may use different methods to assign probabilities.

If you follow the axioms, you can use the math

As long as the way you do it is consistent with the axioms of probability theory, it’s a body of knowledge that you can leverage to solve your problem. Reality will then validate or refute your choices. It’s a practical, not a philosophical issue.

With No Prior Knowledge, Use Relative Frequencies

In one recent comment, a researcher in pharmaceuticals reported that he was mandated to use Bayesian methods, which refine prior knowledge with data, in a situation where his team had no prior knowledge, and that the results were the same as if they had used frequentist methods, relying exclusively on the data.

This person’s experience makes one clear point: frequentist methods are best when you don’t have any prior knowledge or when you are not allowed to use it. Ronald Fisher’s tea tasting experiment is an example of the absence of prior knowledge. The point was to establish whether a particular woman could tell whether tea had been added to milk or milk to tea.

The experimenter had no clue as to what mechanism could possibly allow her to make this distinction and had to work only with the data. The woman in question, Muriel Bristol, was a colleague of Fisher’s at Rothamsted Research, and Fisher’s experiment showed that she could tell the difference.

It’s like James Bond being able to tell whether a martini is shaken or stirred, except that it was a real person.

The other situation where you are not allowed to use prior knowledge is when seeking certification from an external body like the Food & Drug Administration that doesn’t have this knowledge even if you do. It usually bases its decision exclusively on the data.

With Prior Knowledge, Use It

John Herschel had prior knowledge about measurement errors in star positions in the sky and used it to narrow down the family of applicable models. Prior knowledge tells you not to interpret data about TV viewership as if it were measurements of a critical dimension on a manufactured product. This is true whether or not you use this knowledge to determine a prior distribution to be refined based on data.

Jun 12 2022

## Perspectives On Probability In Operations

The spirited discussions on LinkedIn about whether probabilities are relative frequencies or quantifications of beliefs are guaranteed to baffle practitioners. They come up in threads about manufacturing quality, supply-chain management, and public health, and do not generate much light. Their participants trade barbs without much civility, and without actually exchanging on substance.

The latest one, by Alexander von Felbert, is among the more thoughtful, and therefore unlikely to inspire rants. I do, however, fault it with using words like “aleatory” or “epistemic” that I don’t think are helpful. I am trying to discuss it here in everyday language, and to apply the concepts to numerically specific cases, with an eye to operations.

While there are genuinely great and not-so-great ideas, the root of the most violent disagreements is elsewhere, with individuals generalizing from different experience bases. You may map probability to reality differently depending on whether you are developing drugs in the pharmaceutical industry, enhancing yield in a semiconductor process, or driving down dppms in auto parts. The math doesn’t care as long as you follow its rules, and it doesn’t invalidate other interpretations.

Contents

Tools And Context

Education In Math, Science, And Tech Ignores History

The Technology Context Matters

Relative Frequencies

Jakob Bernoulli First Used The Term “Probability”

Laplace Formalized Bernoulli’s Definition

Von Mises Expanded Laplace’s Definition

What about today?

Flows versus Single Events

Analysis of Spatial Data

Never or Rarely Observed Events

Beliefs

Meaning of Probability in Everyday Life

Von Mises’s View

Frank Ramsey and Probability As Extension Of Logic

Probability In The Abstract

A.N. Kolmogorov: the Euclid of Probability

From Vertical To Horizontal Stripes

Diffusion of Kolmogorov’s Theory

Mapping To Reality

Application To Manufacturing Quality

The Pragmatics of Probability

Management Decisions

Should We Ignore the KKD?

Needles and Haystacks

Example: Measurement Errors in Two Dimensions

Warning: High School Math Ahead

Running Operations Versus Diagnosing Problems

Maintaining Flow With A Flawed Process

Diagnosing Failures

Root Cause Analysis — A Management Skill

Failure Analysis — A Technical Skill

How Diagnosticians Work

Inputs to Diagnosis

A simple formula

CDC Lists of Symptoms

No Frequencies In CDC Data

Symptom Frequency Studies

Prior probabilities of the diseases

Prior probabilities of the symptoms

Probabilities of the diseases, given the symptoms

Use of Logistic Regression versus Naive Bayes.

Where Do We Go From Here?

Conclusions

If you follow the axioms, you can use the math

With No Prior Knowledge, Use Relative Frequencies

With Prior Knowledge, Use It

## Tools And Context

Manufacturing professionals solve problems by using their prior knowledge, direct observation of the actual situation, and available data. These debates are about which tools to use with data, and, rather than reasoning philosophically, the analyst is better off asking John Seddon’s three questions:

Who invented this tool?

What problem was he or she trying to solve?

Do we have this problem?

## Education In Math, Science, And Tech Ignores History

The way math, science, and technology are taught makes these questions more difficult to answer than they should be. You learn the current body of knowledge, not its backstory. Most of what follows I learned recently, in reaction to clashing dogmas in my news feed.

## The Technology Context Matters

Once you know the inventors, you need to consider every facet of the context they worked in, not only whose shoulders they were standing on. In manufacturing, for example, Ken Alder reports that in 18th-century metal working, instructions for heat treatment in forging were to make the iron “cherry red,” which covers a range. Without appropriate thermometers, they had to work without temperature data. Forging quality suffered but sophisticated analysis tools would have been moot.

100 years ago, Walter Shewhart had better instruments but was limited to working with paper spreadsheets, slide rules, and distribution tables… When we have the same problems today, we can leverage the technology developed since, but some authors have not noticed. Mary McShane-Vaughn’s Probability Handbook (2016) ends with 42 pages of distribution tables that solve a problem we no longer have. Today, the most common software packages provide built-in functions for these distributions.
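As one example of such built-in functions, Python's standard library has included a `statistics.NormalDist` class since version 3.8. The sketch below uses it for the two lookups a printed table of the standard normal distribution supports; the variable names are chosen for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# What a printed z-table gives you: the cumulative probability up to z.
p_below = standard_normal.cdf(1.96)

# The reverse lookup: the z value for a given cumulative probability.
z_975 = standard_normal.inv_cdf(0.975)

print(f"P(Z <= 1.96) = {p_below:.4f}")
print(f"z for 97.5%  = {z_975:.2f}")
```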

## Relative Frequencies

The view of probabilities as relative frequencies was used and expanded by several thinkers over 200 years. Jakob Bernoulli, Pierre-Simon Laplace, and Richard von Mises were three of them, each a century apart, all looking in the same direction.

## Jakob Bernoulli First Used The Term “Probability”

300 years ago, Jakob Bernoulli was the first to use the term “probabilitas” in an analysis of games of chance. In his Ars Conjectandi (“The art of guessing”), from 1713, he uses it to denote the relative frequency of outcomes.

## Laplace Formalized Bernoulli’s Definition

100 years later, in 1812, Laplace chimed in with the following definition, cited by E.T. Jaynes:

## Von Mises Expanded Laplace’s Definition

Another century later, in Probability, Statistics, and Truth (1928), Richard von Mises introduced another wrinkle, the notion that the probability is the limit of relative frequencies when the number of observations increases, within a population, which he calls a “collective.” He adds the two caveats that such a limit must exist and that it must be independent of the sequencing of the observations.

The latter is a way of saying that they must be independent and identically distributed random variables without using the concepts of random variables, independence, and distributions. He couldn’t use these concepts in defining probability because he needs the concept of probability to define them. Von Mises’s full definition is as follows:

## What about today?

It’s now 2022, yet another century later, and we are calculating probabilities in ways that are not covered by von Mises’s definition, as pointed out in an earlier post. Both the math of probability and its applications to reality have moved on, in ways I will elaborate on below. If you don’t know the proportion of red and white beads in an urn and keep pulling out one bead at a time with replacement and recording its color, the proportion of red beads drawn will tend to a limit that satisfies the mathematical definition of a probability, with the probability space having just two elements, “red” and “white.” Such limits of relative frequencies are probabilities but not all probabilities are such limits.

## Flows versus Single Events

When you are making auto parts at a takt time of 10 seconds, it makes sense to view fluctuations in critical dimensions as a sequence of independent draws that keeps growing as long as you keep producing. There are, however, many cases where this view is not an option.
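For a flow like the auto-parts example, a short simulation shows what the relative-frequency view relies on. The Python sketch below draws independent units with a hypothetical 5% chance of a dimension falling out of tolerance and reports the relative frequency for growing run lengths. The rate, the run lengths, and the function name `relative_frequency` are assumptions for illustration.

```python
import random


def relative_frequency(true_rate, n_units, seed=0):
    """Simulate independent units; return the fraction that are out of tolerance."""
    rng = random.Random(seed)
    out_of_tolerance = sum(1 for _ in range(n_units) if rng.random() < true_rate)
    return out_of_tolerance / n_units


# Hypothetical out-of-tolerance rate of 5%, for illustration only.
for n in (100, 10_000, 100_000):
    print(n, relative_frequency(0.05, n))
```

With short runs the observed fraction can be well off the underlying rate. It settles near that rate only as the run grows, which is the situation a single, unrepeatable event never offers.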

## Analysis of Spatial Data

My first brush with the application of probabilities to a real problem was the estimation of an orebody with nonferrous metals, to support the decision of whether to mine it, as a student of Georges Matheron. Starting from values measured in a grid of boreholes, the job entailed modeling the grades of the various metals as a random function of location within the orebody, for which there was only one draw. There was no relative frequency to go by.

The “heavenly croupier,” geology, folded the rock layers and trapped the ore just once. By no stretch of the imagination could any connection be made between the probability of a copper grade being between 5% and 10% within a given cube underground and a relative frequency observed in repeated, independent draws.

Yet probability theory could be used, with inferences based on the mutual influence between measurements taken across different distances. Obviously, it was a different concept of probability from von Mises’s.

This is about whether to mine the orebody, and we can only know the accuracy of the estimate if we decide to mine it. If the estimate is too low, you don’t mine and, as a result, can never know the accuracy of the estimate. Regardless of the method you use, you can’t validate your estimate unless you actually mine.

## Never or Rarely Observed Events

Acceptance Sampling In The Age Of Low PPM Defectives discusses the assignment of a probability to an event you have never observed, the probability that a supplier with a perfect record will next ship a defective unit. The relative frequency of the event in the history of shipments from the supplier is 0. Yet you know that the next one will be defective with a probability >0. Furthermore, it is not the same if the perfect run was for 1,000 or 1,000,000 units.

In some cases, the data set is too small and heterogeneous to provide meaningful relative frequencies. There have been four disasters in nuclear power plants in the past 65 years, in three different countries with varying levels of openness, and with four different reactor designs. While no one wishes for a richer data set, the existing one provides no way to calculate relative frequencies. Probabilistic Risk Assessment in such cases certainly entails assigning probabilities by a logical analysis of incomplete information.
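Returning to the supplier with a perfect record, one classical way to get a nonzero probability out of zero observed defectives is Laplace's rule of succession. It is not necessarily the method used in the post cited above, and it assumes a uniform prior on the defective rate, which is usually far too pessimistic for a qualified supplier. The Python sketch below, with the hypothetical function name `prob_next_defective`, only shows that the length of the perfect run changes the answer.

```python
def prob_next_defective(units_shipped, defectives=0):
    """Laplace's rule of succession: (k + 1) / (n + 2), assuming a uniform prior."""
    return (defectives + 1) / (units_shipped + 2)


for n in (1_000, 1_000_000):
    p = prob_next_defective(n)
    print(f"{n:>9,} perfect units: P(next one defective) is about {p:.2e}")
```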

## Beliefs

Von Mises was aware that he was giving probability a meaning it doesn’t have in everyday life. His contemporary Frank Ramsey wanted instead to formalize the everyday life meaning into a foundation for the theory.

## Meaning of Probability in Everyday Life

The German word for probability is “Wahrscheinlichkeit,” literally “degree of appearance of truth,” which sounds like Stephen Colbert’s truthiness. We describe events as probable when we think they will happen but aren’t certain they will. If we don’t have to respond, it makes no difference.

On the other hand, if we must act, we must decide whether to assume the events will happen, as in packing an umbrella because the weather report announces rain. Then probability becomes the degree to which an event is probable.

## Von Mises’s View

Von Mises quotes several statements along these lines:

and

Dictionary definitions attempt to capture this notion but, as von Mises points out, they are often useless and circular, and it is just as true today as it was 100 years ago. His own does not have these flaws but does not cover the entire field. He describes a way to calculate or approximate a number but not what this number means.

## Frank Ramsey and Probability As Extension Of Logic

Frank Ramsey undertook to formalize the everyday usage of probability into an extension of logic, in which, instead of being true or false, propositions have a probability between 0 and 1. Different individuals may assign different values, and these probabilities are therefore subjective, which von Mises was trying to avoid.

There are cases, however, where subjectivity doesn’t matter. If, for example, you are responsible for repairing a broken machine that is holding up a production line, you have to choose between courses of action, from shutdown-and-restart to replacement of a subsystem. Your assessment of the probability of success of each is the only one that matters because you are accountable for the outcome. Objectivity matters, on the other hand, when you measure the position of a star with a given telescope. The measurement errors follow a probability distribution that should not vary with each observer.

## Probability In The Abstract

The seemingly irreconcilable perspectives on probability as relative frequency or strength of belief are not discussed in math courses on probability. These concerns are with the way the theory maps to reality and not the theory itself. Geometry was invented to restore the boundaries of fields when the Nile floods receded in ancient Egypt. Then it evolved into a general theory that serves us in architecture, engineering, and other fields. You can apply it without knowing anything about farming in ancient Egypt.

Likewise, probability has evolved into a mathematical theory that maps to reality in many more ways than its originators thought of. The theory determines which computations are legitimate with any kind of probability model; the pragmatics then provide guidelines on developing probability models to fit a variety of situations.

## A.N. Kolmogorov: the Euclid of Probability

It happened with probability in the last 100 years, versus 2300 years ago for geometry. The Euclid of probability was A.N. Kolmogorov. His 1933 Foundations of the Theory of Probability provided a list of five axioms from which the entire theory could be logically deduced. The English translation came out in 1950.

You can summarize his axioms further by saying that a probability is any measure defined on an event-space and that it is 1 for the entire space. Underlying this simple sentence, however, is a body of early 20th-century concepts of measure theory developed for other purposes by H. Lebesgue and others.

## From Vertical To Horizontal Stripes

In High School, following Bernhard Riemann, you calculate the area under a curve by dividing it into vertical stripes of shrinking widths; Lebesgue had the idea of dividing it into horizontal stripes instead. It doesn’t sound like much but it made a difference. Measure theory is now taught as part of calculus in Graduate School, but only to mathematicians, not to physicists or engineers. One book called Mathematics for Physicists (1967) has an overview of it but Calculus for Scientists and Engineers (2019) omits it.

## Diffusion of Kolmogorov’s Theory

If you take a college course on probability theory or check out any textbook from the last 50 years, you learn Kolmogorov’s theory. It has become the standard for teaching probability, as can be seen in college syllabi. Kolmogorov’s own book was only 74 pages. William Feller’s modestly titled but forbidding 1,200-page Introduction to Probability Theory and Its Applications Vol. 1 and 2 (1957) was the first American book explicitly based on Kolmogorov’s method. Patrick Billingsley’s 624-page Probability and Measure (3rd edition, 2012) is a more modern treatment.

Yet, some authors still stick to the von Mises view. Mary McShane-Vaughn still defines probability in terms of relative frequencies. The same restrictive conception is in Don Wheeler’s Myths About Data Analysis (2012), pp. 16-17, where he wrote “In the mathematical sense all probability models are limiting functions for infinite sequences of random variables.” This is just not sufficient.

## Mapping To Reality

Kolmogorov’s theory is a formal game played with symbols, based on the rules in the axioms. It fills volumes with results on how to combine the probabilities of different events into a variety of quantities, but it is mute, or agnostic, on the right way to assign probabilities to basic actual events like “the next card out of the shoe is a jack of spades” or “the rod is 10.42 inches long.”

Kolmogorov does not map the concept of probability to the realities of coin tosses, poker hands, beads in urns, baseball scores, insurance underwriting, critical dimensions of goods, demand forecasts, failure diagnoses, passenger no-shows at flights, election results, or excess deaths from a pandemic. In other words, he doesn’t address pragmatics. You do this mapping when you apply the theory to a problem but his theory gives you no guidance. This is where relative frequencies, Ramsey’s degrees of belief, or both come into play.

## Application To Manufacturing Quality

When Walter Shewhart developed statistical methods for quality control in the 1920s, he was applying the state of the art in probability theory, but his work predates Kolmogorov’s. As a result, his vocabulary is confusing to a reader who learned probability decades later.

For example, Shewhart uses terms like a “statistical universe” for a population, and “probability limits” for limits on control charts set for a particular p-value. If, however, you google “probability limit,” it now refers to a type of convergence for sequences of random variables. For these reasons, if we want to quote Shewhart in a way that is intelligible to a modern reader, we must translate his words but doing it accurately is a challenge.

## The Pragmatics of Probability

In Probability Theory: The Logic of Science (2003), E.T. Jaynes addresses pragmatics head-on. As he says in his introduction:

In his teaching of the subject at Stanford University, he found that students had no problem following the math but struggled to connect a real problem to abstract math. Pragmatics was the hard part. Through many examples, Jaynes shows you how to leverage what you know about a problem by using, for example, symmetries.

## Management Decisions

For an analysis of management decision-making, I prefer to start with what Kaoru Ishikawa said in his book on TQC. His summary of decision-making by managers as he had observed it was “KKD,” which stood for three words: keiken (experience), kan (intuition), and dokyō (guts).

He contrasted this with his recommendation to base decisions on data and statistical analysis.

## Should We Ignore the KKD?

Entirely disregarding experience, intuition, and guts and focusing exclusively on the data is exactly what Classical Statistics does. Ronald Fisher uses an experiment to determine whether a particular lady is able to tell tea with milk poured first from tea with milk poured last. The numbers showed that she could. The experimenter had no knowledge whatsoever about the way tastebuds work.

In real situations, however, there is information embedded in an individual’s experience and intuition. It is the backstory of the data and ignoring it is neither realistic nor wise. You can’t get humans to do it and, even if you did, it is doubtful that it would yield better decisions.

What you can do is use experience and intuition to formulate hypotheses or theories and analyze data to refute or support them. Then, knowing the imperfections of all stages of this process, it still takes guts to make a decision.

Also, statistical analysis, as I believe Ishikawa meant it, is too restrictive. The classical methods used in quality control are all centered on learning characteristics of populations from measurements on samples. They can tell you that a process is drifting from measurements on workpieces leaving it, but they won’t guide you in finding the root cause of one failure.

## Needles and Haystacks

The classical methods can tell you the average number of needles per haystack and where it’s trending, but not how to find a needle in a haystack. This kind of search problem, since World War II, has been approached as a separate application of probability theory, not taught in statistics courses. It starts with prior knowledge, including any theory you might have learned, your experience of previous, similar situations, and responses that have become conditioned reflexes and are what we mean by intuition.

When troubleshooting an electronic device, you don’t need data analysis to tell you to first check whether it is plugged in or has an empty battery. It’s the most common cause, it’s easy to check and fix. In search, you use prior knowledge to assign probabilities that the needle is in various sections of the haystack. Then you have different methods to find it that vary in time required, cost, and effectiveness, defined as the probability of finding the needle by this method in a section if it is there.

You can visually inspect the outside of the haystack, use a metal detector, take apart a section and sift through the hay, or you can burn the hay. As you proceed through your search, you update the probabilities based on what you learn and adjust your methods. It works for finding wrecks on the ocean floor, diagnosing diseases, and finding the causes of failure in defective units in manufacturing.
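The updating step described above can be sketched with Bayes' rule in a few lines of Python. The haystack sections, their prior probabilities, and the 80% effectiveness are made-up numbers, and the function name `update_after_failed_search` is hypothetical. The sketch covers a single unsuccessful search of one section and leaves out time and cost.

```python
def update_after_failed_search(priors, searched, detection_prob):
    """Update location probabilities after searching one section without success.

    priors: dict of section name -> probability the needle is there (sums to 1)
    searched: name of the section that was searched
    detection_prob: probability of finding the needle in that section if it is there
    """
    # Probability that the search comes up empty, wherever the needle is.
    p_nothing_found = 1.0 - priors[searched] * detection_prob
    posterior = {}
    for section, p in priors.items():
        if section == searched:
            # The needle is there but was missed.
            posterior[section] = p * (1.0 - detection_prob) / p_nothing_found
        else:
            # The needle is elsewhere, so nothing could be found in `searched`.
            posterior[section] = p / p_nothing_found
    return posterior


# Hypothetical haystack split into three sections, with made-up numbers.
priors = {"top": 0.5, "middle": 0.3, "bottom": 0.2}
posterior = update_after_failed_search(priors, "top", detection_prob=0.8)
print(posterior)
```

After the failed search, the most promising section is no longer the one you started with, which is why the search plan shifts as you go.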

## Example: Measurement Errors in Two Dimensions

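Perhaps the most compelling example of pragmatic thinking that E.T. Jaynes gives is from astronomer John Herschel. In 1850, he analyzed errors in the measurement of the position of a star based on two assumptions: (1) the errors in the East-West and North-South directions are independent, and (2) the probability of an error does not depend on its direction.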

Then simple math shows that a Gaussian distribution — also known as “Normal” — is the only model that is consistent with these assumptions.

## Warning: High School Math Ahead

Mathphobes can skip this section, at the cost of missing out on an astonishing result, considering the simplicity of its proof. Jaynes outlines it, leaving it to the reader to work out the details that I am including here. With the possible exception of taking a partial derivative, it’s High School calculus.

Assumption 1 tells us that the probability distribution of the errors is of the form $f(x)\times f(y)$ as a function of the East-West error $x$ and the North-South error $y$. Assumption 2 then tells us that it is independent of the angle $\theta$ and therefore of the form $g(r)$, where $r = \sqrt{x^2+y^2}$. Consequently, we must have

$$f(x)\times f(y) = g(r)$$

By noting that, for any positive $x$, $f(x)\times f(0) = g(x)$ and defining

$$h(x) = \log\left[\frac{f(x)}{f(0)}\right]$$

we have

$$h(x) + h(y) = h(r)$$

and, taking the partial derivative with respect to $x$,

$$h^\prime(x) = \frac{\partial h(r)}{\partial x} = h^\prime(r)\times \frac{x}{r} = \frac{h^\prime(r)}{r}\times x$$

This tells us that $h^\prime(r)/r = h^\prime(x)/x$ doesn’t vary with $y$, and, by symmetry, that it doesn’t vary with $x$ either. It is therefore constant as a function of $r$, which means that $h^\prime(x)$ is proportional to $x$. Since $h(0) = 0$, $h(x)$ is proportional to $x^2$. Therefore, the probability distribution $f$ is of the form

$$f(x) = f(0)\times \exp\left[h(x)\right] = f(0)\,e^{\alpha x^2}$$

For it to be a probability distribution, it must integrate to 1, and therefore $\alpha$ must be negative. With $\alpha = \frac{-1}{2\sigma^2}$, normalization gives you $f(0) = \frac{1}{\sigma \sqrt{2\pi}}$
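A quick numerical check of this result can be done in plain Python. The sketch below takes $\sigma = 1$ as an arbitrary choice, verifies that the product $f(x)\times f(y)$ is the same at several angles for a fixed radius, and approximates the integral of $f$ with a crude Riemann sum. The function name `joint_density` and the chosen radius, angles, and step are assumptions for illustration.

```python
import math


def f(x, sigma=1.0):
    """Gaussian density: f(0) * exp(alpha * x^2) with alpha = -1 / (2 sigma^2)."""
    return math.exp(-x * x / (2.0 * sigma**2)) / (sigma * math.sqrt(2.0 * math.pi))


def joint_density(r, theta, sigma=1.0):
    """f(x) * f(y) at the point with polar coordinates (r, theta)."""
    x, y = r * math.cos(theta), r * math.sin(theta)
    return f(x, sigma) * f(y, sigma)


# Same radius, different angles: the joint density should not change.
values = [joint_density(1.5, theta) for theta in (0.0, 0.7, 2.0, 4.0)]

# Crude Riemann sum of f over [-10, 10]: should be close to 1.
step = 0.001
area = sum(f(-10.0 + i * step) for i in range(20_001)) * step

print(values)
print(area)
```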

## Running Operations Versus Diagnosing Problems

Let’s revisit the example of Final Inspection and Failure Analysis from a different perspective. Manufacturing processes where an average of 5% of the units fail final test perhaps should not exist but they do.

This week and next, production still needs to meet the demand with the current, flawed process. Meanwhile, the yield enhancement teams working on the process need to diagnose the failures. Maintaining the flow is a matter of relative frequencies; failure diagnoses, of matching process and product knowledge with symptoms.

## Maintaining Flow With A Flawed Process

The challenge is to produce 100% of the demand with a process that yields less than 100%, knowing that yield losses vary. For this purpose, treating the limit of the relative frequency of failures as the probability of occurrence, à la von Mises, is legitimate, and you can use the control limits of a classical p-chart as a confidence interval, or $p \pm 3\times \sqrt{\frac{p \times (1-p)}{n}}$
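The p-chart limits above are straightforward to compute. The Python sketch below uses the 5% average failure rate from the example; the lot size of 400 units and the function name `p_chart_limits` are assumptions for illustration. It also clips the limits to the range from 0 to 1, a common practical convention that the formula itself does not state.

```python
import math


def p_chart_limits(p, n):
    """Three-sigma limits for the fraction failing in samples of n units."""
    half_width = 3.0 * math.sqrt(p * (1.0 - p) / n)
    lower = max(0.0, p - half_width)  # a proportion cannot go below 0
    upper = min(1.0, p + half_width)  # or above 1
    return lower, upper


# 5% average failure rate from the example; the lot size of 400 is hypothetical.
lower, upper = p_chart_limits(p=0.05, n=400)
print(f"{lower:.4f} to {upper:.4f}")
```

The width of the band shrinks with the square root of the lot size, which is one input into how large a finished goods buffer needs to be.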

Then you protect the output with a finished goods buffer to absorb the fluctuations and you replenish it using, for example, production kanbans. Details vary, particularly when there are rework opportunities, but the key point is that you are only looking at outcomes and using relative frequencies as probabilities.

## Diagnosing Failures

Diagnosis is a different challenge, involving more than occurrence counts.

## Root Cause Analysis — A Management Skill

Root Cause Analysis (RCA) is a management framework. It entails defining the problem, collecting data, determining root causes, and recommending/implementing solutions.

“Determining root causes” is only one line item. In RCA, the methods to do it are at the level of asking why five times, to determine why a defect was introduced in a unit of product and why it escaped the attention of the people in charge. FMEA and fault-trees are also listed as tools of RCA but, most of the time, RCA is not about analyzing data in terms of probabilities.

## Failure Analysis — A Technical Skill

RCA’s cousin, Failure Analysis, sounds indistinguishable from RCA in some applications but is more technical in others, where it involves taking measurements with instruments, running tests, and analyzing data. RCA is a skill that all professionals should have; Failure Analysis, the province of technicians and engineers.

If you are on the receiving end of final test rejects and your job is to feed information back to the defects’ point of occurrence, then Failure Analysis, to you, is a process that you execute routinely and get better at over time. I have discussed the application of Naive Bayes to Failure Analysis in an earlier post. I would like to elaborate here on how, as an application of probability, it differs from what you do to maintain flow.

## How Diagnosticians Work

Of course, many diagnosticians do not think in terms of probabilities. A good mechanic zooms in on why water is leaking from a car’s sunroof and solves the problem instantly by opening a clogged channel with a thin rod.

The local family doctor knows what’s “going around.” During the flu season, the flu will be treated as a priori more likely; in spring, hay fever; since the start of the pandemic, COVID-19. The patient’s history, age, gender, body language, and observable symptoms then provide a context, which the doctor interprets based on prior experience. Often, the doctor has a tentative diagnosis within a second of the patient’s walking into the office, and then it’s only a matter of confirming or refuting it. You use probabilities when dealing with more complex problems or when trying to automate the analysis process.

## Inputs to Diagnosis

For a defect in a unit of product, the environmental factors may include which shift the unit was made on, or which part of a shift. If the process uses powders that clump when humid, then the weather can be a factor. Then there is what you know about the physics and chemistry of the product, and the symptoms it may exhibit as a result of various dysfunctions.

It certainly entails assigning probabilities by logical analysis of incomplete information. An oil leak will cause oil to ooze out of the unit — that’s how you know there is an oil leak — but a nonworking start switch won’t cause it. Then there are symptoms like temperature, pressure, or radiation anomalies that multiple failure modes may cause some of the time. This is what the failure analyst knows before receiving the defective workpiece.

## A simple formula

The issues vary with every product. To discuss diagnosis in a language we all understand, let us switch to common respiratory diseases, just as an illustration. As for defective product units, the diagnosis starts with context. Then come symptoms and their analysis, and, finally, confirmation through lab tests or treatment attempts.

The end result is still a probability, of the form $P\left ( disease \vert symptoms, context \right )$, or the probability of the disease given the symptoms. The best prior information we can have, however, is the relative frequency of symptoms for a given disease, which we can choose to interpret as a probability: $P\left ( symptoms \vert disease, context \right )$. The conversion of the latter to the former is formally simple:

$$P\left ( disease \vert symptoms, context \right ) = \frac{P\left ( symptoms\vert disease, context \right )\times P\left ( disease\vert context \right )}{P\left ( symptoms\vert context \right )}$$

It’s the basic, 280-year-old Bayes formula. As we shall see, using it isn’t quite as straightforward as it looks, because the input data is elusive.
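As a minimal sketch of the formula above, the following Python function takes its three inputs and returns the posterior. The function name `bayes_posterior` and the numbers in the usage lines are made up for illustration; they do not come from the CDC or from any study.

```python
def bayes_posterior(likelihood: float, prior: float, evidence: float) -> float:
    """Return P(disease | symptoms, context) from Bayes formula.

    likelihood: P(symptoms | disease, context)
    prior:      P(disease | context)
    evidence:   P(symptoms | context), which must be positive
    """
    for name, p in (("likelihood", likelihood), ("prior", prior), ("evidence", evidence)):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} must be a probability, got {p}")
    if evidence == 0.0:
        raise ValueError("evidence must be positive")
    return likelihood * prior / evidence


# Hypothetical inputs, for illustration only.
example = bayes_posterior(likelihood=0.30, prior=0.10, evidence=0.05)
print(f"P(disease | symptoms, context) = {example:.1%}")
```

The code is trivial on purpose: the difficulty lies in finding credible values for the three inputs, not in the arithmetic.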

## CDC Lists of Symptoms

The following table, based on the CDC’s lists of symptoms, evokes the whiteboards of TV’s Dr. House, or comparative tables of features for electronic products. It could also be a table of observed defects and causes in a product.

It makes the problem appear more complex than it is. Most patients self-diagnose and self-medicate for the common cold, the flu, and even hay fever. By now, COVID-19 testing is readily available, and it’s the other diseases that challenge health professionals.

Then there are symptoms that appear with only one disease, like “Wheezing” and “Excess phlegm or sputum” for COPD, “Chest pain” for pneumonia, or “loss of smell and taste” for COVID-19. If a patient has any of them, the search is over, but not all patients with the disease do. If they have lost taste or smell, you know it’s COVID-19, but they can have COVID-19 without losing taste or smell.

The source of this table, the CDC lists of symptoms, sometimes uses different names for what appears to be the same symptom, like “congestion” and “stuffy nose,” one term being more colloquial than the other. A patient is more likely to talk about a stuffy nose than congestion. Likewise, “Muscle aches” and “Myalgia” are the same. The same happens in quality or maintenance when different technicians use different words for the same defect or the same machine failure.

## No Frequencies In CDC Data

Not all patients with each of these diseases have all the symptoms listed, which raises the question of how often they occur. At least on its public pages, the CDC website does not give relative frequencies more specifically than “often” or “sometimes.” A patient’s list of symptoms can come from several diseases, and, based on symptoms alone, a medical doctor can easily confuse a common cold with flu or COVID-19, or bronchitis with pneumonia. The following table is from the Mayo Clinic website:

Per the CDC, the tests to tell them apart include RIDT, RT-PCR, viral culture, immunofluorescence assays, spirometry, blood work, chest X-rays, pulse oximetry, sputum tests, CT scans, pleural fluid cultures,… The diseases have different severities and transmission rates, and all the tests take different amounts of time and resources, with different relative frequencies of false positives and negatives.

## Symptom Frequency Studies

While the CDC is no slouch at probability when it comes to the surveillance of epidemics, the documents it provides on disease symptoms don’t include relative frequencies and do not use a standard nomenclature for symptoms. The closest its data comes to frequencies is telling you whether symptoms are “common” or occur “sometimes.” Numbers are only available through recent studies of samples of at most a few thousand patients.

The closest I could come for just the flu and COVID-19 is shown in the following, which still does not provide all the details we need to apply Bayes formula:

We can apply it to a patient who has only symptoms whose frequencies are roughly known for both the flu and COVID-19.

Based on this combination of symptoms, we know that the patient doesn’t have any of the conditions other than Flu or COVID-19 in our list, and we don’t know which one of the two.

## Prior probabilities of the diseases

From the monitoring done by the state of California, we know that, at the end of May, 2022, 13.9% of patients tested for flu were positive, as were 7.9% of patients tested for COVID-19, and 0.03% had both, or “flurona,” which we’ll choose to neglect.

This is not the infection rate of the population at large. It could be known by testing random samples but such studies have never been done in California. A person who seeks medical attention with the above symptoms is within the population that gets tested, and, with other options eliminated by logic, we know further that they have either the flu or COVID-19. The only question is therefore which one of the two diseases they have. Therefore,

prior to learning from the relative frequencies of the symptoms,

$$P\left ( Flu\vert context \right )= \frac{13.9\%}{13.9\% +7.9\%}= 63.8\%$$

and

$$P\left ( COVID\text{-}19\vert context \right )= \frac{7.9\%}{13.9\% +7.9\%}= 36.2\%$$

## Prior probabilities of the symptoms

The assumption that the symptoms are independently occurring in Flu or COVID-19 patients is naive. We know it’s generally not true but commonly make it because (1) it simplifies calculations and (2) it is difficult to improve upon. It lets you multiply the probabilities of each symptom into a probability that a patient has them all:

$$P\left ( symptoms \vert Flu, context \right ) = 94\%\times 93\%\times \cdots\times 15\% = 0.44\%$$

and

$$P\left ( symptoms \vert COVID\text{-}19, context \right ) = 62\%\times 71\%\times \cdots\times 19.6\% = 0.27\%$$

To get unconditional symptom probabilities, we weigh the relative frequencies for each disease with the probabilities of this disease, so that,

$$P\left ( symptoms \vert context \right ) = \left ( 94\%\times 63.8\% + 62\%\times 36.2\% \right )\times \cdots\times \left ( 15\%\times 63.8\% + 19.6\%\times 36.2\% \right ) = 0.58\%$$

## Probabilities of the diseases, given the symptoms

$$P\left ( Flu \vert symptoms, context \right ) = \frac{0.44\%\times 63.8\%}{0.58\%} = 48.4\%$$

and

$$P\left (COVID\text{-}19 \vert symptoms, context \right ) = \frac{0.27\%\times 36.2\%}{0.58\%} = 16.8\%$$

Because of the new information, these probabilities do not add up to 100%. Given the relative frequencies of the symptoms for each disease, there is a 34.8% probability that the patient has another disease. WebMD, for example, lists many other diseases as matches for these symptoms, including lung cancer and Lyme disease. If, for whatever reason, we exclude consideration of all these alternatives,

$$P\left ( Flu \vert symptoms, context, Flu\,or\,COVID\text{-}19 \right ) = \frac{48.4\%}{48.4\% + 16.8\%} = 74.2\%$$

and

$$P\left (COVID\text{-}19 \vert symptoms, context, Flu\,or\,COVID\text{-}19 \right ) = \frac{16.8\%}{48.4\% + 16.8\%} = 25.8\%$$

## Use of Logistic Regression versus Naive Bayes

Assume that, instead of starting with relative frequencies of symptoms, you have a database of individual patients. For each one, you have symptoms and lab-confirmed diagnoses. Then you can apply logistic regression to predict a disease as a function of the symptoms. It requires more detailed data but, unlike Naive Bayes, it does not assume the predictors to be independent. It is the method used in Clinical Signs and Symptoms Predicting Influenza Infection.

Of the 3744 subjects with flu symptoms in their study, 2470 (66%) were confirmed to have the flu. Using logistic regression, they calculated Positive Predictive Values (PPV) and Negative Predictive Values (NPV) for multiple symptom sets. The PPV is the probability of having the flu if you have a given set of symptoms; the NPV, the probability of not having the flu if you don’t have them.

As reported, the study did not use the context that informs the prior distribution above. They could turn the context information into additional predictors, at the risk of overfitting.
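The two definitions can be sketched from a 2×2 table of counts. This computes PPV and NPV directly from counts rather than from a fitted logistic regression, and the counts are hypothetical, chosen only to make the arithmetic easy to follow. They are not the study’s figures, and the function name is an assumption.

```python
def predictive_values(tp: int, fp: int, fn: int, tn: int) -> tuple[float, float]:
    """Return (PPV, NPV) for one symptom set.

    tp: patients with the symptom set who have the flu
    fp: patients with the symptom set who do not have the flu
    fn: patients without the symptom set who have the flu
    tn: patients without the symptom set who do not have the flu
    """
    if tp + fp == 0 or tn + fn == 0:
        raise ValueError("need at least one patient with and one without the symptom set")
    ppv = tp / (tp + fp)  # P(flu | symptom set present)
    npv = tn / (tn + fn)  # P(no flu | symptom set absent)
    return ppv, npv


# Hypothetical counts, for illustration only.
ppv, npv = predictive_values(tp=80, fp=20, fn=30, tn=70)
print(f"PPV = {ppv:.0%}, NPV = {npv:.0%}")
```

Both values depend on how common the flu is among the patients studied, so they do not transfer unchanged to a population with a different mix of diseases.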

As discussed before, “logistic” in this context has nothing to do with logistics as used in a supply chain. It’s named after the logarithm of the odds, or logit function $logit\left ( p \right )= log\left [p/\left ( 1-p \right )\right ]$, which logistic regression actually predicts.

## Where Do We Go From Here?

Having excluded all other diseases, in context, a patient with the set of symptoms is three times more likely to have the flu than COVID-19. Hence, further testing to confirm the flu will be three times more “productive” than testing for COVID-19.
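The calculation behind this ratio can be sketched in a few lines of Python. The inputs are the rounded figures quoted in the text: the California test positivity rates and the two symptom likelihoods. The helper names are assumptions, and `naive_likelihood` only shows how the likelihoods would be obtained from a full per-symptom list, which the text elides. Note that $P\left ( symptoms \vert context \right )$ cancels out once the posteriors are renormalized over Flu or COVID-19, so it is not needed here.

```python
import math


def naive_likelihood(symptom_freqs):
    """P(all symptoms | disease) under the naive independence assumption."""
    return math.prod(symptom_freqs)


def normalized_posterior(likelihoods, priors):
    """Posterior over the diseases in `priors`, assumed to be the only candidates."""
    joint = {d: likelihoods[d] * priors[d] for d in priors}
    total = sum(joint.values())
    return {d: j / total for d, j in joint.items()}


# Test positivity rates quoted in the text (California, end of May 2022).
positivity = {"Flu": 0.139, "COVID-19": 0.079}
total_positivity = sum(positivity.values())
priors = {d: rate / total_positivity for d, rate in positivity.items()}

# Likelihoods of the symptom set as quoted in the text, already multiplied
# across symptoms; with the full list they would come from naive_likelihood().
likelihoods = {"Flu": 0.0044, "COVID-19": 0.0027}

posterior = normalized_posterior(likelihoods, priors)
odds_flu_vs_covid = posterior["Flu"] / posterior["COVID-19"]

for disease, p in posterior.items():
    print(f"P({disease} | symptoms, context, Flu or COVID-19) = {p:.1%}")
print(f"Odds of flu versus COVID-19: {odds_flu_vs_covid:.2f}")
```

Because the inputs are rounded to the digits shown in the text, the last digit of the output can differ slightly from the figures above.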

Focusing on the flu is like attacking defects based on a Pareto chart of occurrence counts. It would, however, be forgetting vital differences between the diseases. Among patients seeking medical attention, COVID-19 requires hospitalization twice as often as the flu and kills three times more often. Once you factor in severity, you decide to test first for COVID-19.

## Conclusions

Needles hurt cattle eating hay, and you may want to make sure no haystack contains any. Alternatively, you may need the needle and search the haystack for it. Depending on your problem, you may use different methods to assign probabilities.

## If you follow the axioms, you can use the math

As long as the way you do it is consistent with the axioms of probability theory, it’s a body of knowledge that you can leverage to solve your problem. Reality will then validate or refute your choices. It’s a practical, not a philosophical issue.

## With No Prior Knowledge, Use Relative Frequencies

In one recent comment, a researcher in pharmaceuticals reported that he was mandated to use Bayesian methods, which refine prior knowledge with data, in a situation where his team had no prior knowledge, and that the results were the same as if they had used frequentist methods, relying exclusively on the data.

This person’s experience makes one clear point: frequentist methods are best when you don’t have any prior knowledge or when you are not allowed to use it. Ronald Fisher’s tea tasting experiment is an example of the absence of prior knowledge. The point was to establish whether a particular woman could tell whether tea had been added to milk or milk to tea.

The experimenter had no clue as to what mechanism could possibly allow her to make this distinction and had to work only with the data. The woman in question, Muriel Bristol, was a colleague of Fisher’s at Rothamsted Research, and Fisher’s experiment showed that she could tell the difference.

It’s like James Bond being able to tell whether a martini is shaken or stirred, except that it was a real person.
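For readers who want to see the arithmetic behind such a data-only test, here is a sketch of the tea tasting case. It assumes the design Fisher describes in The Design of Experiments: eight cups, four prepared each way, with the taster asked to pick out the four milk-first cups. The function name is made up for illustration.

```python
from math import comb


def prob_correct(k: int, cups_per_kind: int = 4) -> float:
    """P(exactly k milk-first cups are picked correctly) under pure guessing.

    The taster picks `cups_per_kind` cups out of 2 * cups_per_kind, so the
    number of correct picks follows a hypergeometric distribution.
    """
    n = cups_per_kind
    return comb(n, k) * comb(n, n - k) / comb(2 * n, n)


p_all_correct = prob_correct(4)                       # 1 chance in 70
p_three_or_more = prob_correct(3) + prob_correct(4)   # 17 chances in 70

print(f"P(all four correct by guessing) = {p_all_correct:.4f}")
print(f"P(three or more correct by guessing) = {p_three_or_more:.4f}")
```

Under this design, only a perfect score is rare enough under guessing to fall below the conventional 5% threshold. No prior knowledge about the taster enters the calculation.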

The other situation where you are not allowed to use prior knowledge is when seeking certification from an external body like the Food & Drug Administration that doesn’t have this knowledge even if you do. It usually bases its decision exclusively on the data.

## With Prior Knowledge, Use It

John Herschel had prior knowledge about measurement errors in star positions in the sky and used it to narrow down the family of applicable models. Prior knowledge tells you not to interpret data about TV viewership as if it were measurements of a critical dimension on a manufactured product. This is true whether or not you use this knowledge to determine a prior distribution to be refined based on data.


By Michel Baudin • Data science • • Tags: Bayesian Statistics, data science, Probability, statistics