Sep 8 2024
Using Regression to Improve Quality | Part II – Fitting Models
This is a personal guided tour of regression techniques intended for manufacturing professionals involved with quality. Starting from “historical monuments” like simple linear regression and multiple regression, it goes through “mid-century modern” developments like logistic regression. It ends with newer constructions like bootstrapping, bagging, and MARS. It is limited in scope and depth because full coverage would require a book and knowledge of many techniques I have not tried. See the references for more comprehensive coverage.
To fit a regression model to a dataset today, you don’t need to understand the logic, know any formula, or code any algorithm. Any statistical software, starting with electronic spreadsheets, will give you regression coefficients, confidence intervals for them, and, often, tools to assess the model’s fit.
However, treating regression as a black box that magically fits curves to data is risky. You won’t understand what you are looking at and will draw mistaken conclusions. You need some idea of the logic behind regression in general, or behind specific variants, to know when to use them, how to prepare data, and how to interpret the outputs.
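To make the "logic behind the black box" concrete, here is a minimal sketch of what software does for simple linear regression: it picks the intercept and slope that minimize the sum of squared residuals. The function name `fit_simple_linear` and the five data points are made up for illustration; they do not come from the post.

```python
from statistics import fmean

def fit_simple_linear(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    # Sum of squared deviations of x, and of cross-deviations of x and y
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    # The fitted line always passes through the point of means
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Made-up illustrative data, not from the post
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_simple_linear(xs, ys)
print(f"intercept={intercept:.3f}, slope={slope:.3f}")
```

Statistical packages add confidence intervals and diagnostics on top of this, but the coefficients themselves come from nothing more than these two sums.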
Dec 28 2024
Using Regression to Improve Quality | Part III — Validating Models
Whether your goal is to identify substitute characteristics or solve a process problem, regression algorithms can produce coefficients for almost any data. That does not mean the resulting models are any good.
In machine learning, you divide your data into a training set on which you calculate coefficients and a testing set to check the model’s predictive ability. Testing concerns externally visible results and is not specific to regression.
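The train/test split can be sketched in a few lines. The example below is an assumption-laden illustration, not the post's method: it generates synthetic data from a known line plus noise, holds out 20% of the points, fits on the rest, and compares the root mean squared error (RMSE) on both sets. The 80/20 ratio, the helper names `fit_line` and `rmse`, and the data are all hypothetical.

```python
import random
from statistics import fmean

def fit_line(xs, ys):
    """Least squares intercept and slope for y = intercept + slope * x."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def rmse(points, intercept, slope):
    """Root mean squared prediction error over (x, y) pairs."""
    errors = [(y - (intercept + slope * x)) ** 2 for x, y in points]
    return fmean(errors) ** 0.5

# Synthetic data (an assumption): y = 2 + 0.5 x + noise with sd 1
rng = random.Random(0)
points = []
for _ in range(200):
    x = rng.uniform(0.0, 10.0)
    points.append((x, 2.0 + 0.5 * x + rng.gauss(0.0, 1.0)))

# Shuffle, then hold out the last 20% as the testing set
rng.shuffle(points)
cut = len(points) * 4 // 5
train, test = points[:cut], points[cut:]

# Coefficients come from the training set only
intercept, slope = fit_line([x for x, _ in train], [y for _, y in train])
train_rmse = rmse(train, intercept, slope)
test_rmse = rmse(test, intercept, slope)
print(f"slope={slope:.3f}  train RMSE={train_rmse:.3f}  test RMSE={test_rmse:.3f}")
```

A test RMSE far above the training RMSE would suggest the model does not generalize. Nothing in this check is specific to regression, which is the point of the paragraph above.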
Validation, on the other hand, is focused on the training set and involves using various regression-specific tools to detect inconsistencies with assumptions. For these purposes, we review methods provided by regression software.
In this post, we explore the meaning and the logic behind the tools that R provides for this purpose in simple and multiple linear regression, with the understanding that similar functions are available in other software and that similar tools exist for other forms of regression.
This post is an attempt to clarify the meaning of these numbers and plots and to help readers use them. Readers will be the judges of how successful it is.
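As a small taste of what such tools look for, the sketch below fits a straight line to deliberately curved, made-up data and then inspects the residuals on the training set. It is written in Python rather than R, and it is not one of the diagnostics the post reviews; the helper name `fit_line` and the split of residuals into three bands are assumptions chosen for illustration.

```python
from statistics import fmean

def fit_line(xs, ys):
    """Least squares intercept and slope for y = intercept + slope * x."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Made-up data that is curved on purpose: y = x squared
xs = [float(x) for x in range(9)]
ys = [x ** 2 for x in xs]

intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# R squared: share of the variance in y explained by the line
y_bar = fmean(ys)
sse = sum(r ** 2 for r in residuals)
sst = sum((y - y_bar) ** 2 for y in ys)
r_squared = 1.0 - sse / sst

# Mean residual in the low, middle, and high thirds of x.
# A linear model assumes no systematic pattern across these bands.
low, mid, high = (fmean(residuals[i:i + 3]) for i in (0, 3, 6))

print(f"R^2 = {r_squared:.3f}")
print(f"mean residual by band: low={low:.2f}, mid={mid:.2f}, high={high:.2f}")
```

The fit reports a high R², yet the residuals are positive at both ends and negative in the middle, a pattern that contradicts the linearity assumption. This is the kind of inconsistency that validation on the training set is meant to expose.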
The body of the post is about the application of these tools to an example dataset available from Kaggle, with about 30,000 data points. For the curious, some mathematical background is given in the appendix.
Many of the tools are developments from the last 40 years and, therefore, are not covered in the statistics literature from earlier decades.
By Michel Baudin • Data science • 0 • Tags: Linear Model, Quality, regression, Validation