# Using Regression to Improve Quality | Part I – What for?

In quality, regression serves to identify substitutes for true characteristics that are hard to observe and to find the root causes of technically challenging process problems. It is a major topic in data science, but oddly, the most extensive coverage I could find in the literature on quality is in Shewhart’s first book, from 1931! Later books, including Shewhart’s second, discuss it briefly or not at all. The ASQC, forerunner of the ASQ, published an 80-page guide on how to use regression analysis in quality control in 1985, but has not updated it since.

Regression analysis has been around for almost 140 years and has grown massively in scope, capabilities, and dataset size. Perhaps it is time for professionals involved with quality to take another look at it.

## Uses of Regression in Quality

Regression is a set of statistical/data science techniques for explaining a random variable through other random variables. At the dawn of statistical quality, Shewhart applied regression in the search for substitute characteristics and in root cause analysis when logic and process knowledge were insufficient.

### Identifying Substitute Characteristics

The characteristics of products that customers expect can be numeric variables or go/no-go attributes that are not always easy to capture or understand. Observing these true characteristics may take too long, require expensive instruments, or entail the destruction of the product. We need to find easily observable substitute characteristics to use instead of the true ones.

A scatterplot gives a visual hint of a relationship between a substitute and a true characteristic. Correlation analysis validates its existence, and regression quantifies it to support decisions about the product.

### Root Cause Analysis

Sometimes, logic and process knowledge are sufficient to identify the causes of problems, à la Sherlock Holmes or Dr. House. This works for failure analysis on a specific door that flew off an airliner in flight. It is a different matter when you are investigating a population of diecast cases that went from 5% to 10% of leakers.

Beyond die change and handling at leak testing, there is almost no human intervention in diecasting, but many settings and process variables can influence the outcome in ways that you need to identify and quantify. For this purpose, correlation and regression analysis are helpful.

## The Literature

The literature on statistical quality control says surprisingly little about scatterplots, correlation, and regression, even though they are major topics in statistics/data science in general.

### Shewhart’s First Book

Shewhart’s first book, published in 1931, extensively discusses regression. Measurements of tensile strength on aluminum die castings are destructive, and he examines the feasibility of using hardness and density as substitutes you can measure without tearing apart the workpiece. Fig. 14, on p. 52, summarizes his findings in terms of simple regression for hardness and density separately, and multiple regression for both jointly:


Almost 100 years later, the search for substitutes for tensile strength is ongoing, now using X-ray diffraction.

Shewhart used a sample of 60 data points, and provided the raw data on p. 60. He discusses regression in many other locations throughout the book but doesn’t dwell on the challenges of computing coefficients or even plotting the data with the technology of his day. Contrary to the first, his second book, from 1939, does not contain any instance of “regression.”
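To make the distinction between simple and multiple regression concrete, here is a hedged sketch using NumPy's `numpy.linalg.lstsq`. The numbers are invented stand-ins, not Shewhart's data; only the sample size of 60 echoes his. The helper `fit` and all coefficients are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60  # same sample size as Shewhart's, but the values below are invented

# Synthetic stand-ins, NOT Shewhart's data: arbitrary units and coefficients.
hardness = rng.normal(70.0, 8.0, n)
density = rng.normal(2.7, 0.05, n)
strength = 5.0 + 0.3 * hardness + 40.0 * density + rng.normal(0.0, 1.5, n)


def fit(predictors, y):
    """Least-squares fit of y on an intercept plus the given predictors.

    Returns (coefficients, r_squared); coefficients[0] is the intercept.
    """
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    r_squared = 1.0 - residuals.var() / y.var()
    return coef, r_squared


# Simple regression on each substitute separately, then both jointly.
_, r2_hardness = fit([hardness], strength)
_, r2_density = fit([density], strength)
coef_joint, r2_joint = fit([hardness, density], strength)

print(f"R² hardness only: {r2_hardness:.2f}")
print(f"R² density only:  {r2_density:.2f}")
print(f"R² both jointly:  {r2_joint:.2f}")
```

With least squares and an intercept, the joint fit can never explain less variance than either single-predictor fit; whether the gain justifies measuring both characteristics is a practical judgment.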

### The English-language Literature on Quality

Considering the coverage of regression – or lack thereof – in my collection of English-language books on statistical quality, it is clear that the authors did not consider it to be an important topic in this context:

| Author | Title | Pages covering regression | Topic pages | Total pages |
|---|---|---|---|---|
| Douglas C. Crocker | How to Use Regression Analysis in Quality Control (1985) | | 80 | 80 |
| Edward J. Dudewicz | Juran’s Quality Handbook (5th edition, 1999) | 44.88-198 | 20 | 1730 |
| Thomas Pyzdek | Handbook for Quality Management (2012) | 319-331 | 12 | 483 |
| Gary S. May & Costas Spanos | Fundamentals of Semiconductor Manufacturing and Process Control (2006) | 273-283 | 11 | 488 |
| Multiple authors | Western Electric Statistical Quality Control Handbook (1956) | 143-148 | 6 | 328 |
| John Early | Juran’s Quality Handbook (7th edition, 2016) | 697-698 | 2 | 1131 |
| Douglas Montgomery | Introduction to Statistical Quality Control (2020) | | 0 | 674 |
| Henry Neave | The Deming Dimension (1990) | | 0 | 440 |
| Don Wheeler | Understanding Statistical Process Control (1992) | | 0 | 406 |

The Crocker book is dedicated to the use of regression analysis in quality, and the ASQC published it in 1985. I couldn’t find a reference to it on the ASQ site today. Instead, the site references books on regression analysis that are not focused on quality.

Quality Magazine published a one-page Simple Guide to Regression Analysis in 2023. For more specifics, there is a 2016 conference paper on ScienceDirect about Regression Methods for Predicting the Product’s Quality in the Semiconductor Manufacturing Process.

### The Literature on Statistics and Data Science

Occurrences of the phrase “linear regression” are more and more frequent in the American literature, as seen in Google Books:

There is abundant literature on regression under statistics, data science or machine learning. Douglas Montgomery, who skipped it in his own book on Quality Control, devoted an entire 704-page book to it: Introduction to Linear Regression Analysis. You can find this book in lists like the 20 Best Regression Books of All Time, and it contains no reference to Quality Control. This list of “20 best…” actually has 66 entries. In addition to dedicated books, most general books on statistics, data science, or machine learning cover regression.

#### Books About Math, Algorithms, and Applications

Some books explain the math. They are useful for students seeking certification and to inventors of new tools. Others explain algorithms, and are useful to software developers. Others yet describe the application of tools to problems, and are useful to practitioners engaged in achieving, sustaining, or enhancing process capabilities.

The applied books tell you what the techniques are for, what assumptions they require, how to prepare input data, how to operate the software, and how to interpret the outputs. To use a tool, you need a mental model of what it does, like “a straight line approximation of Y as a function of X,” but you don’t need the details. To drive a car, you need to know what the engine accomplishes, but you don’t need to know how fuel injection works.

#### Effect of Vintage

A book’s vintage has much to do with its category. Until the 1970s, data analysis, including regression, was manual, and 100 points was a large data set. The books focused on computations you had to perform manually and included statistical tables.

The IT of the 1980s made computing power available in business. This allowed knowledgeable analysts to write their own code, while commercial statistical software gradually took over.

Fast-forward another four decades, and 30,000 points is a small data set. And there is usually a software package for any published technique you may want to try. The challenge now is to find one you can trust and learn how to use it.

The books that help you navigate this world, like Nina Zumel and John Mount’s, reference a specific software technology, in their case R. Such a book won’t be as useful to analysts who work in Python, SAS, or Matlab, and this kind of book ages faster than a book on math or algorithms.

In the US, 56 years after the publication of Kaoru Ishikawa’s 1968 book, scatterplots are almost nowhere to be found in business graphics, even though they are easy to generate and are taught in middle school using the example of time between eruptions versus duration of eruptions from the Old Faithful geyser in Yellowstone. In The Visual Display of Quantitative Information (1983), Edward Tufte pointed out that he couldn’t find statistical graphics based on more than one variable in American print media for the general public, except for Business Week and the New York Times.

### Regression in Business Intelligence Software

In 2018, in Where have all the scatterplots gone?, I noted that none of the reviewed Business Intelligence packages bragged about generating scatterplots, except TIBCO Spotfire. Scientists and engineers routinely consider more than one variable at a time, but it is still a bridge too far for the business community and the newspaper-reading public. The latest scatterplot in the New York Times was in 2021.

An audience that doesn’t even look at scatterplots isn’t likely to be interested in quantitative models of the kind of relationships that scatterplots reveal in qualitative form.

### Kaoru Ishikawa’s 7 Tools of QC

In his 1968 QC Methods for the shop floor, later known as the “7 tools of QC,” Kaoru Ishikawa included scatterplots but not regression. The English translation came out in 1976, as Guide to Quality Control.

Scatterplots are visualizations of relationships between variables, and regression goes further by fitting a mathematical model to this relationship. The only inference Ishikawa discusses is testing for the presence of a correlation by splitting the cloud of points by the medians of both variables and tallying the points in quadrants I and III versus II and IV:

Perhaps Ishikawa thought regression was beyond what he could teach operators in QC circles. In this case, however, why did he include control charts, which seem to have the same level of statistical sophistication?
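The median-split tally is simple enough to sketch in a few lines of standard-library Python. The function names are hypothetical, and two details are assumptions of this sketch rather than statements about Ishikawa's procedure: points lying exactly on a median line are left out, and the tally is judged with a two-sided exact binomial sign test.

```python
import random
import statistics
from math import comb


def median_quadrant_tally(xs, ys):
    """Count points in quadrants I+III versus II+IV around the two medians.

    Points lying exactly on either median line are left out (an assumption
    of this sketch).
    """
    mx, my = statistics.median(xs), statistics.median(ys)
    same = opposite = 0
    for x, y in zip(xs, ys):
        product = (x - mx) * (y - my)
        if product > 0:      # quadrant I or III: both above or both below
            same += 1
        elif product < 0:    # quadrant II or IV: one above, one below
            opposite += 1
    return same, opposite


def sign_test_p_value(same, opposite):
    """Two-sided exact binomial p-value for a 50/50 split of the tally."""
    n = same + opposite
    k = min(same, opposite)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Synthetic example data, invented for illustration only.
random.seed(1)
xs = [random.gauss(0.0, 1.0) for _ in range(50)]
ys = [0.8 * x + random.gauss(0.0, 1.0) for x in xs]
same, opposite = median_quadrant_tally(xs, ys)
print(same, opposite, sign_test_p_value(same, opposite))
```

A lopsided tally suggests a correlation, but unlike regression, it says nothing about how much one variable changes with the other.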

## Why is it called “Regression”?

In everyday language, regression is the opposite of progress. You acquire a skill, you progress; you lose a skill, you regress. In software, “regression testing” applies to the testing of pre-existing features in upgrades. What does it have to do with fitting a linear combination of variables to another variable? Why this term?

In the late 19th century, Francis Galton observed that the children of extremely tall people, while taller than average, tended to be shorter than their parents. He called this phenomenon “regression to mediocrity,” and the name stuck. Based on a sample of 205 families, he concluded that, on average, the deviation from the mean in children’s heights was ⅔ of their parents’. He summarized his findings in this chart:

It is an invisible grid of 1-in × 1-in deviations from the mean heights for both generations, marked by the number of points in the center of each square. It's a contingency table rather than a scatterplot. Shewhart later used the same method in scatterplots of creosote penetration versus depth of sap wood in a sample of 1370 telephone poles:

It is the same concept that is now used to display scatterplots with thousands of points as heat maps, marking tiny grid squares with colors instead of numbers, as in the blowtorch chart of two characteristics of 6,135 Atlantic hurricanes:

In Bernoulli’s Fallacy, Aubrey Clayton advocates replacing the cryptic “linear regression” with the more descriptive, nonjudgmental “linear modeling.” Some already use this term. For example, in R, the function for linear regression is called “lm,” short for “linear model.” However, statisticians and data scientists have called it “regression” for 140 years, and such an entrenched term is unlikely to be replaced.

## Conclusions

Since Shewhart’s first book, except in semiconductors, quality professionals seem to have regressed in using regression. Perhaps it is time to draw their attention back to it.

Part II is an overview of what regression actually consists of, the many ways its scope has expanded in the past 140 years, and its limitations.

Part III will cover tools available to validate regression models, with an emphasis on the analysis of residuals and the identification of high-influence and high-leverage points.