Sep 3 2024
Using Regression to Improve Quality | Part I – What for?
In quality, regression serves to identify substitutes for true characteristics that are hard to observe and to find the root causes of technically challenging process problems. It is a major topic in data science, but oddly, the most extensive coverage I could find in the literature on quality is in Shewhart’s first book, from 1931! Later books, including Shewhart’s second, discuss it briefly or not at all. The ASQC, forerunner of the ASQ, published an 80-page guide on how to use regression analysis in quality control in 1985, but has not updated it since.
Regression analysis has been around for almost 140 years and has grown massively in scope, capabilities, and dataset size. Perhaps it is time for professionals involved with quality to take another look at it.
Uses of Regression in Quality
Regression is a set of statistical/data science techniques for explaining a random variable through other random variables. At the dawn of statistical quality, Shewhart applied regression in the search for substitute characteristics and in root cause analysis when logic and process knowledge were insufficient.
Identifying Substitute Characteristics
The characteristics of products that customers expect can be numeric variables or go/nogo attributes that are not always easy to capture or understand. Observing these true characteristics may take too long, require expensive instruments, or entail the destruction of the product. We need to find easily observable substitute characteristics to use instead of the true ones.
A scatterplot gives a visual hint of a relationship between a substitute and a true characteristic. Correlation analysis validates its existence, and regression quantifies it to support decisions about the product.
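As a minimal sketch of the last two steps, the Python below computes a correlation coefficient and a least-squares line on made-up data. The variable names, the sample size of 60, and the numbers used to generate the data are all assumptions for illustration, not measurements from any real process.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fit_line(x, y):
    """Least-squares slope and intercept for y approximated by a line in x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


# Made-up data: 'substitute' stands for an easy measurement,
# 'true_value' for the characteristic that is hard to observe.
rng = random.Random(0)
substitute = [rng.uniform(50, 100) for _ in range(60)]
true_value = [10 + 0.4 * s + rng.gauss(0, 2) for s in substitute]

r = pearson_r(substitute, true_value)
slope, intercept = fit_line(substitute, true_value)

print(f"correlation: {r:.3f}")
print(f"fitted line: true_value = {intercept:.2f} + {slope:.3f} * substitute")
print(f"prediction at substitute = 80: {intercept + slope * 80:.2f}")
```

The correlation says whether the substitute is worth considering at all; the fitted line is what you would use to turn a reading of the substitute into an estimate of the true characteristic.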
This is what Lonnie Wilson had to say about substitute characteristics and regression in oil refining:
“A refinery is a perfect place to learn about ‘substitute quality characteristics,’ as everything varies. For example, to be classified as ‘gasoline,’ the mixture must have, among many other things, an endpoint less than 437°F. There is no way to control that directly but by controlling the temperature and the pressure you can come close. But you cannot control the temperature directly; it is also indirect control, and so on. If you work in a refinery for more than a week, you become adept at correlation and regression. […] When I went into industry with Chevron, correlation and regression was a basic skill taught right away. In a refinery, we would have over 50,000 control points, and less than 2% were direct measures of quality; almost all were indirect measures.”
Root Cause Analysis
Sometimes, logic and process knowledge are sufficient to identify the causes of problems, à la Sherlock Holmes or Dr. House. This works for failure analysis on a specific door that flew off an airliner in flight. On the other hand, it is different when you are investigating a population of diecast cases that went from 5% to 10% of leakers.
Beyond die change and handling at leak testing, there is almost no human intervention in diecasting, but many settings and process variables can influence the outcome in ways that you need to identify and quantify. For this purpose, correlation and regression analysis are helpful.
The Literature
The literature on statistical quality control says surprisingly little about scatterplots, correlation, and regression, even though they are major topics in statistics/data science in general.
Shewhart’s First Book
Shewhart’s first book, published in 1931, extensively discusses regression. Measurements of tensile strength on aluminum die castings are destructive, and he examines the feasibility of using hardness and density as substitutes you can measure without tearing apart the workpiece. Fig. 14, on p. 52, summarizes his findings in terms of simple regression for hardness and density separately, and multiple regression for both jointly.
Almost 100 years later, the search for substitutes to tensile strength is ongoing, now using X-ray diffraction.
Shewhart used a sample of 60 data points, and provided the raw data on p. 60. He discusses regression in many other locations throughout the book but doesn’t dwell on the challenges of computing coefficients or even plotting the data with the technology of his day. Contrary to the first, his second book, from 1939, does not contain any instance of “regression.”
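The comparison between simple regressions on each substitute and a multiple regression on both can be sketched as follows. This is not Shewhart's data or his computation: the 60 points are simulated, the coefficients and units are invented, and the function names are hypothetical. It only illustrates that the fit on both predictors explains at least as much variance as either one alone.

```python
import random


def cross_sum(u, v):
    """Sum of products of deviations from the means of u and v."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


def simple_r2(x, y):
    """R-squared of the least-squares line of y on a single predictor x."""
    return cross_sum(x, y) ** 2 / (cross_sum(x, x) * cross_sum(y, y))


def multiple_fit(x1, x2, y):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2 via the normal equations."""
    s11, s22, s12 = cross_sum(x1, x1), cross_sum(x2, x2), cross_sum(x1, x2)
    s1y, s2y, syy = cross_sum(x1, y), cross_sum(x2, y), cross_sum(y, y)
    det = s11 * s22 - s12 ** 2  # zero only if x1 and x2 are collinear
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    n = len(y)
    b0 = sum(y) / n - b1 * sum(x1) / n - b2 * sum(x2) / n
    r2 = (b1 * s1y + b2 * s2y) / syy
    return b0, b1, b2, r2


# Simulated stand-ins for hardness, density, and tensile strength
# (arbitrary units, invented coefficients).
rng = random.Random(1)
hardness = [rng.gauss(70, 8) for _ in range(60)]
density = [rng.gauss(2.6, 0.1) for _ in range(60)]
strength = [2.0 * h + 50.0 * d + rng.gauss(0, 5)
            for h, d in zip(hardness, density)]

r2_hardness = simple_r2(hardness, strength)
r2_density = simple_r2(density, strength)
b0, b1, b2, r2_both = multiple_fit(hardness, density, strength)

print(f"R^2, hardness only: {r2_hardness:.3f}")
print(f"R^2, density only:  {r2_density:.3f}")
print(f"R^2, both jointly:  {r2_both:.3f}")
```

In practice, you would let a statistical package do this and also report standard errors; the closed-form solution above works only because there are just two predictors.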
The English-language Literature on Quality
Considering the coverage of regression – or lack thereof – in my collection of English-language books on statistical quality, it is clear that the authors did not consider it to be an important topic in this context:
Author | Title | Page range | Pages on regression | Total pages |
---|---|---|---|---|
Douglas C. Crocker | How to Use Regression Analysis in Quality Control (1985) | | 80 | 80 |
Edward J. Dudewicz | Juran’s Quality Handbook (5th edition, 1999) | 44.88-198 | 20 | 1730 |
Thomas Pyzdek | Handbook for Quality Management (2012) | 319-331 | 12 | 483 |
Gary S. May & Costas Spanos | Fundamentals of Semiconductor Manufacturing and Process Control (2006) | 273-283 | 11 | 488 |
Multiple authors | Western Electric Statistical Quality Control Handbook (1956) | 143-148 | 6 | 328 |
John Early | Juran’s Quality Handbook (7th edition, 2016) | 697-698 | 2 | 1131 |
Douglas Montgomery | Introduction to Statistical Quality Control (2020) | | 0 | 674 |
Henry Neave | The Deming Dimension (1990) | | 0 | 440 |
Don Wheeler | Understanding Statistical Process Control (1992) | | 0 | 406 |
The Crocker book is dedicated to the use of regression analysis in quality and the ASQC published it in 1985. I couldn’t find a reference to it on the ASQ site today. Instead, it references books on regression analysis that are not focused on quality. Asked why he did not cover the topic, Douglas Montgomery wrote back, “Basically lack of space…the book is already too long for some users.”
Quality Magazine published a one-page Simple Guide to Regression Analysis in 2023. For more specifics, there is a 2016 conference paper on ScienceDirect about Regression Methods for Predicting the Product’s Quality in the Semiconductor Manufacturing Process.
The Literature on Statistics and Data Science
Occurrences of the phrase “linear regression” are more and more frequent in the American literature, as seen in Google Books.
There is abundant literature on regression under statistics, data science or machine learning. Douglas Montgomery, who skipped it in his own book on Quality Control, devoted an entire 704-page book to it: Introduction to Linear Regression Analysis. You can find this book in lists like the 20 Best Regression Books of All Time, and it contains no reference to Quality Control. This list of “20 best…” actually has 66 entries. In addition to dedicated books, most general books on statistics, data science, or machine learning cover regression. Even the Harvard Business Review published a Refresher on Regression Analysis in 2015, covering only the most basic method.
Books About Math, Algorithms, and Applications
Some books explain the math. They are useful to students seeking certification and to inventors of new tools. Others explain algorithms and are useful to software developers, or to applicants for data science jobs who face coding interviews. Others yet describe the application of tools to problems and are useful to practitioners engaged in achieving, sustaining, or enhancing process capabilities.
A recent post described data science job interviews in which candidates were asked to code a regression program from scratch. Algorithm books can prepare them for such challenges, which do not come up in the jobs they are applying for. As practicing data scientists, they will never have to do anything like it again. It is irrelevant to the job but is used to bias the selection in favor of recent university graduates over seasoned professionals.
The applied books tell you what the techniques are for, what assumptions they require, how to prepare input data, how to operate the software, and how to interpret the outputs. To use a tool, you need a mental model of what it does, like “a straight line approximation of Y as a function of X,” but you don’t need the details. To drive a car, you need to know what the engine accomplishes, but you don’t need to know how fuel injection works.
Effect of Vintage
A book’s vintage has much to do with its category. Until the 1970s, data analysis, including regression, was manual, and 100 points was a large data set. The books focused on computations you had to perform manually and included statistical tables.
The IT of the 1980s made computing power available in business. This allowed knowledgeable analysts to write their own code, while commercial statistical software gradually took over.
Fast-forward another four decades, and 30,000 points is a small data set. And there is usually a software package for any published technique you may want to try. The challenge now is to find one you can trust and learn how to use it.
The books that help you navigate this world, like Nina Zumel and John Mount’s, reference a specific software technology, in their case R. Such a book won’t be as useful to analysts who work in Python, SAS, or Matlab, and this kind of book ages faster than a book on math or algorithms.
Regression in Business
Fifty-six years after the publication of Ishikawa’s book, scatterplots are almost nowhere to be found in business graphics in the US, even though they are easy to generate and are taught in middle school, using the example of time between eruptions versus duration of eruptions at the Old Faithful geyser in Yellowstone. In The Visual Display of Quantitative Information (1983), Edward Tufte pointed out that he couldn’t find statistical graphics based on more than one variable in American print media for the general public, except for Business Week and the New York Times.
Regression in Business Intelligence Software
In 2018, in Where have all the scatterplots gone?, I noted that none of the reviewed Business Intelligence packages bragged about generating scatterplots, except TIBCO Spotfire. Scientists and engineers routinely consider more than one variable at a time, but it is still a bridge too far for the business community and the newspaper-reading public. The latest scatterplot in the New York Times was in 2021.
An audience that doesn’t even look at scatterplots isn’t likely to be interested in quantitative models of the kind of relationships that scatterplots reveal in qualitative form.
Kaoru Ishikawa’s 7 Tools of QC
In his 1968 QC Methods for the shop floor, later known as the “7 tools of QC,” Kaoru Ishikawa included scatterplots but not regression. The English translation came out in 1976, as Guide to Quality Control.
Scatterplots are visualizations of relationships between variables, and regression goes further by fitting a mathematical model to this relationship. The only inference Ishikawa discusses is testing for the presence of a correlation by splitting the cloud of points by the medians of both variables and tallying the points in quadrants I and III versus II and IV.
Perhaps Ishikawa thought regression was beyond what he could teach operators in QC circles. In this case, however, why did he include control charts, which seem to have the same level of statistical sophistication?
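The median-split tally described above is simple enough to sketch in a few lines of Python. The tally follows the description in the text. Turning it into a p-value with an exact two-sided sign test is my assumption about how to read the counts, and it is only an approximation because the median split ties the quadrant counts together. The data are simulated, and the function names are hypothetical.

```python
import math
import random
import statistics


def median_quadrant_tally(x, y):
    """Count points in quadrants I and III versus II and IV of the median split."""
    mx, my = statistics.median(x), statistics.median(y)
    concordant = discordant = 0
    for a, b in zip(x, y):
        if a == mx or b == my:
            continue  # points sitting on a median line are left out
        if (a > mx) == (b > my):
            concordant += 1  # quadrant I (both high) or III (both low)
        else:
            discordant += 1  # quadrant II or IV
    return concordant, discordant


def sign_test_p_value(concordant, discordant):
    """Two-sided exact sign test: are the two tallies plausibly equal?"""
    n = concordant + discordant
    k = min(concordant, discordant)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Simulated data with a moderate positive relationship.
rng = random.Random(7)
x = [rng.gauss(0, 1) for _ in range(50)]
y = [0.6 * a + rng.gauss(0, 1) for a in x]

n_plus, n_minus = median_quadrant_tally(x, y)
print(f"quadrants I + III: {n_plus}, quadrants II + IV: {n_minus}")
print(f"sign-test p-value: {sign_test_p_value(n_plus, n_minus):.4f}")
```

The appeal for the shop floor is that the whole procedure requires only counting, with no sums of squares.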
Why is it called “Regression”?
In everyday language, regression is the opposite of progress. You acquire a skill, you progress; you lose a skill, you regress. In software, “regression testing” applies to the testing of pre-existing features in upgrades. What does it have to do with fitting a linear combination of variables to another variable? Why this term?
In the late 19th century, Francis Galton observed that the children of extremely tall people, while taller than average, tended to be shorter than their parents. He called this phenomenon “regression to mediocrity,” and the name stuck. Based on a sample of 205 families, he concluded that, on average, the deviation from the mean in children’s heights was ⅔ of their parents’. He summarized his findings in this chart:
It is an invisible grid of 1-in × 1-in deviations from the mean heights for both generations, marked by the number of points in the center of each square. It's a contingency table rather than a scatterplot. Shewhart later used the same method in scatterplots of creosote penetration versus depth of sap wood in a sample of 1370 telephone poles:
It is the same concept that is now used to display scatterplots with thousands of points as heat maps, marking tiny grid squares with colors instead of numbers, as in the blowtorch chart of two characteristics of 6,135 Atlantic hurricanes:
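The grid-count idea can be sketched with simulated data in the spirit of Galton's chart. None of the numbers below are his: the 900 points, the standard deviations, and the ⅔ slope used to generate the children's deviations are assumptions chosen to mimic the pattern he described. The function name is hypothetical.

```python
import math
import random
from collections import Counter


def grid_counts(x, y, cell=1.0):
    """Count points per square cell; keys are (column, row) cell indices."""
    return Counter(
        (math.floor(a / cell), math.floor(b / cell)) for a, b in zip(x, y)
    )


# Simulated deviations from the mean height, in inches.
rng = random.Random(42)
parent_dev = [rng.gauss(0, 1.8) for _ in range(900)]
child_dev = [(2 / 3) * p + rng.gauss(0, 1.5) for p in parent_dev]

counts = grid_counts(parent_dev, child_dev, cell=1.0)

# Text version of the chart: one number per 1-in x 1-in square,
# children's deviations from high (top row) to low (bottom row).
for row in range(3, -5, -1):
    print(" ".join(f"{counts.get((col, row), 0):4d}" for col in range(-4, 4)))

# Least-squares slope of children's deviations on parents' deviations.
mp = sum(parent_dev) / len(parent_dev)
mc = sum(child_dev) / len(child_dev)
slope = (sum((p - mp) * (c - mc) for p, c in zip(parent_dev, child_dev))
         / sum((p - mp) ** 2 for p in parent_dev))
print(f"fitted slope: {slope:.3f}")
```

Replacing the printed numbers with colors gives the heat-map version used for large data sets.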
In Bernoulli’s Fallacy, Aubrey Clayton advocates replacing the cryptic “linear regression” with the more descriptive, nonjudgmental “linear modeling.” Some already use this term. For example, in R, the function for linear regression is called “lm,” short for “linear modeling.” However, statisticians and data scientists have called it “regression” for 140 years, and such an entrenched term is unlikely to be replaced.
Conclusions
Since Shewhart’s first book, except in semiconductors, quality professionals seem to have regressed in using regression. Perhaps it is time to draw their attention back to it.
Part II is an overview of what regression actually consists of, the many ways its scope has expanded in the past 140 years, and its limitations.
Part III will cover tools available to validate regression models, with an emphasis on the analysis of residuals and the identification of high-influence and high-leverage points.
References
- Clayton, A. (2021). Bernoulli's Fallacy: Statistical Illogic and the Crisis of Modern Science. Columbia University Press.
- Crocker, D. C. (1990). How to Use Regression Analysis in Quality Control. American Society for Quality Control.
- Draper, N. R., Smith, H. (2014). Applied Regression Analysis. Wiley.
- Defeo, J. A. (2016). Juran's Quality Handbook 7E (PB). McGraw Hill LLC.
- Frost, J. (2020). Regression Analysis. Statistics By Jim Publishing.
- Gallo, A. (2015). A Refresher on Regression Analysis. Harvard Business Review, 11/4/2015.
- Hastie, T., Tibshirani, R., Friedman, J. (2013). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.
- Ishikawa, K. (1968). 現場のQC手法: やさしい解説と演習. 日本科学技術連盟.
- Ishikawa, K. (1976). Guide to Quality Control. Asian Productivity Organization (Translation of the above).
- Juran, J. M., Godfrey, A. B. (1999). Juran's Quality Handbook. McGraw-Hill Education.
- May, G. S., Spanos, C. J. (2006). Fundamentals of Semiconductor Manufacturing and Process Control. Wiley.
- Melhem, M., Ananou, B., Ouladsine, M., & Pinaton, J. (2016). Regression Methods for Predicting the Product’s Quality in the Semiconductor Manufacturing Process. IFAC-PapersOnLine, 49(12), 83-88. https://doi.org/10.1016/j.ifacol.2016.07.554
- Montgomery, D. C. et al. (2021). Introduction to Linear Regression Analysis. Wiley.
- Montgomery, D. C. (2020). Introduction to Statistical Quality Control. Wiley.
- Neave, H. R. (1990). The Deming Dimension. SPC Press.
- Pyzdek, T., Keller, P. (2012). The Handbook of Quality Management 2E (PB). McGraw Hill LLC.
- Shewhart, W. A. (1931). Economic Control of Quality of Manufactured Product. Martino Publishing.
- Shewhart, W. A. (1939). Statistical Method from the Viewpoint of Quality Control. Dover Publications.
- Tufte, E. R. (1983). The Visual Display of Quantitative Information. Graphics Press.
- Webber, L., Wallace, M. (2011). Quality Control for Dummies. Wiley.
- Baudin, M. (2018). Where have all the scatterplots gone?
- Wheeler, D. J., Chambers, D. S. (1992). Understanding Statistical Process Control. SPC Press.
- Zumel, N., Mount, J., (2019). Practical Data Science with R, Second Edition. United States: Manning.