Oct 4 2022
In his latest column in Quality Digest, Don Wheeler wrote the following blanket statements, free of any caveat:
- “Probability models are built on the assumption that the data can be thought of as observations from a set of random variables that are independent and identically distributed.”
- “In the end, which probability model you may fit to your data hardly matters. It is an exercise that serves no practical purpose.”
Source: Wheeler, D. (2022). “Converting Capabilities: What difference does the probability model make?” Quality Digest.
Michel Baudin’s comments:
Not all models assume i.i.d. variables
Wheeler’s first statement might have applied 100 years ago. Today, however, there are many models in probability that are not based on the assumption that data are “observations from a set of random variables that are independent and identically distributed”:
- ARIMA models for time series are used, for example, in forecasting beer sales.
- Epidemiologists use models that assume existing infections cause new ones, so counts for successive periods are not independent.
- The spatial data analysis tools used in mining and oil exploration assume that an observation at any point informs you about its neighborhood. The analysts don’t assume that observations at different points are independent.
- The probability models used to locate a wreck on the ocean floor, find a needle in a haystack, and other similar search problems have nothing to do with series of independent and identically distributed observations.
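The dependence these models build in is easy to see in the simplest case: an AR(1) process, which is ARIMA(1,0,0). The sketch below, with assumed parameter values, simulates such a process and measures the lag-1 autocorrelation, which independence would put near zero:

```python
import numpy as np

# Simulate an AR(1) process x[t] = phi * x[t-1] + noise,
# the simplest ARIMA(1,0,0) model. In steady state the observations
# are identically distributed but NOT independent.
rng = np.random.default_rng(0)  # seeded for reproducibility
phi = 0.8                       # assumed autoregressive coefficient
n = 20_000
x = np.empty(n)
x[0] = rng.normal()
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal()

# Lag-1 autocorrelation: near phi (0.8), far from the 0
# that independence would imply.
lag1 = np.corrcoef(x[:-1], x[1:])[0, 1]
print(round(lag1, 2))
```

An analyst who assumed i.i.d. observations here would badly misjudge the uncertainty of any forecast.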
Probability Models Are Useful
In his second statement, Wheeler seems determined to deter engineers and managers from studying probability. If a prominent statistician tells them it serves no useful purpose, why bother? It is particularly odd when you consider that Wheeler’s beloved XmR/Process Behavior charts use control limits based on the model of observations as the sum of a constant and a Gaussian white noise.
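Those control limits come straight out of a probability model. A minimal sketch of the standard XmR computation, natural process limits at the average plus or minus 2.66 times the average moving range, with hypothetical measurement values:

```python
# XmR (individuals and moving range) natural process limits:
# mean +/- 2.66 * average moving range. The 2.66 scaling factor
# is 3/d2, with d2 = 1.128 for subgroups of size 2.
def xmr_limits(data):
    mean = sum(data) / len(data)
    moving_ranges = [abs(b - a) for a, b in zip(data, data[1:])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    return mean - 2.66 * mr_bar, mean + 2.66 * mr_bar

# Hypothetical measurements, for illustration only.
observations = [10, 12, 11, 13, 12]
lcl, ucl = xmr_limits(observations)
print(round(lcl, 2), round(ucl, 2))
```

The 3-sigma width of these limits only means what practitioners take it to mean because of the underlying model of a constant plus Gaussian white noise.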
Probability models have many useful purposes. They can keep you from pursuing special causes for mere fluctuations and help you find the root causes of actual problems. They also help you plan your supply chain and dimension your production lines.
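Dimensioning a production line is one example: a queueing model tells you how work-in-process grows with utilization. A minimal sketch, assuming an M/M/1 model (Poisson arrivals, exponential service times) as an illustration:

```python
# Average queue length Lq at an M/M/1 station as a function of
# utilization rho = arrival_rate / service_rate, for rho < 1:
#     Lq = rho**2 / (1 - rho)
# Queues explode as utilization approaches 100%, which is why
# lines are not dimensioned to run at full capacity.
def mm1_queue_length(rho):
    if not 0 <= rho < 1:
        raise ValueError("utilization must be in [0, 1)")
    return rho ** 2 / (1 - rho)

for rho in (0.5, 0.8, 0.95):
    print(rho, round(mm1_queue_length(rho), 2))
```

Real lines call for richer models, but even this simplest case quantifies a trade-off that no deterministic calculation reveals.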
Histograms Are Old Hat; Use KDE Instead
As Wheeler also says, “Many people have been taught that the first step in the statistical inquisition of their data is to fit some probability model to the histogram.” It’s time to learn something new that takes advantage of IT developments since Karl Pearson invented the histogram in 1891.
Fitting models to a sample of 250 points based on a histogram is old hat. A small dataset today is more like 30,000 points, and you visualize its distribution with kernel density estimation (KDE), not histograms.
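A minimal sketch of what this looks like in practice, using `scipy.stats.gaussian_kde` on a simulated 30,000-point sample standing in for real measurement data:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Simulated 30,000-point sample; in practice this would be
# your measurement data.
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=30_000)

# Kernel density estimate: a smooth density curve with no
# arbitrary bin boundaries, unlike a histogram.
kde = gaussian_kde(sample)      # bandwidth set by Scott's rule
grid = np.linspace(-4, 4, 200)
density = kde(grid)             # smooth estimate of the pdf

# Sanity check: the estimate at 0 should be close to the true
# N(0, 1) density there, 1/sqrt(2*pi), about 0.399.
print(round(float(kde(0.0)[0]), 2))
```

With samples this size, the KDE curve is stable and bin-free, where a histogram's shape still shifts with the choice of bin width and origin.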