Technology Archives – Michel Baudin's Blog

Apr 3 2026

Deviating Standard Deviations

This basic concept deserves revisiting. The following comes from a 2022 blog post, hosted by a supplier of statistical software, that was intended to explain the meaning of some notations in plain, simple terms:

The author calls two different things by the same name. If the standard deviation of each variable is 1, how could its expected value be anything else? The confusion in this nonsensical statement is the same one we make when we equate the temperature of a soup with a thermometer reading. In our mental model of a bowl of soup, it has a temperature that exists regardless of our ability to measure it, and the thermometer reading is only an estimate of it.

For the purposes of eating soup, confusing the two is harmless, unless the thermometer, poorly calibrated, always gives you an answer that is 15°F off. This is the situation we have with the most commonly used estimator of the standard deviation of a random variable from a small sample. It is biased, and $c_4(N)$ is a correction factor applicable when the random variable is Gaussian.

To describe $c_4(N)$ accurately, we need to dig into probability theory. It is, in fact, the expected value of the estimator $S=\sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left( X_i -\bar{X} \right)^2}$ of the standard deviation from a sample of $N$ independent Gaussian variables $\left( X_1, \dots, X_N \right)$ with unit standard deviation, $\sigma = 1$. This is an accurate statement, but every term in it needs an explanation.
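As a rough illustration of that statement, here is a minimal Python sketch that is not taken from the post. It computes $c_4(N)$ from the standard closed form $c_4(N)=\sqrt{\frac{2}{N-1}}\,\Gamma\left(\frac{N}{2}\right)/\Gamma\left(\frac{N-1}{2}\right)$ and compares it with the average of $S$ over many simulated Gaussian samples with $\sigma = 1$. The function names, the number of trials, and the seed are assumptions made for the example.

```python
import math
import random
import statistics


def c4(n: int) -> float:
    """Closed form: c4(n) = sqrt(2/(n-1)) * Gamma(n/2) / Gamma((n-1)/2)."""
    if n < 2:
        raise ValueError("need at least two observations")
    # Work with log-gamma so that large n does not overflow the gamma function.
    log_ratio = math.lgamma(n / 2) - math.lgamma((n - 1) / 2)
    return math.sqrt(2 / (n - 1)) * math.exp(log_ratio)


def simulated_mean_of_s(n: int, trials: int = 20_000, seed: int = 0) -> float:
    """Average of S over many samples of n independent N(0, 1) draws.

    The trial count and seed are arbitrary choices for this sketch.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total += statistics.stdev(sample)  # stdev uses the N-1 denominator
    return total / trials


if __name__ == "__main__":
    for n in (2, 5, 10, 25):
        print(n, round(c4(n), 4), round(simulated_mean_of_s(n), 4))
```

For $N = 2$, the closed form reduces to $\sqrt{2/\pi} \approx 0.798$, so on average $S$ underestimates $\sigma$ by about 20% for samples of two, and the bias shrinks as $N$ grows.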

By Michel Baudin • Technology • Tags: Control Charts, Probability, Quality, Six Sigma, SPC, Standard Deviation, statistics

Sep 17 2025

Label your charts!

Charts you share with others need a bodyguard of text to be self-explanatory, avert misunderstandings, and support learning. None of this matters when you chart exclusively for your own use, but it is obligatory when communicating with a team or making a case to management.

Generating an informative, actionable chart can take hours; documenting and labeling it should take minutes, yet we encounter charts with missing or unclear labels in business documents, published articles, and even textbooks.

By Michel Baudin • Data science • Tags: Axis label, Chart, SPC

Jun 25 2025

Update on Data Science versus Statistics

Based on the usage of the terms in the literature, I have concluded that statistics has been subsumed under data science. I view statistics as beginning with a dataset and ending with conclusions, while data science starts with sensors and transaction processing, and ends in data products for end users. Kelleher & Tierney’s Data Science views it the same way, and so do tool-specific references like Grolemund’s R for Data Science, or Zumel & Mount’s Practical Data Science with R.

Brad Efron and Trevor Hastie are two prominent statisticians with a different perspective. In the epilogue of their 2016 book, Computer Age Statistical Inference, they describe data science as a subset of statistics that emphasizes algorithms and empirical validation, while inferential statistics focuses on mathematical models and probability theory.

Efron and Hastie’s book is definitely about statistics, as it contains no discussion of data acquisition, cleaning, storage and retrieval, or visualization. I asked Brad Efron about it and he responded: “That definition of data science is fine for its general use in business and industry.” He and Hastie were looking at it from the perspective of researchers in the field.

By Michel Baudin • Data science, Uncategorized • Tags: data science, math, statistics

Dec 28 2024

Using Regression to Improve Quality | Part III — Validating Models

Whether your goal is to identify substitute characteristics or solve a process problem, regression algorithms can produce coefficients for almost any data. However, it doesn’t mean the resulting models are any good.

In machine learning, you divide your data into a training set on which you calculate coefficients and a testing set to check the model’s predictive ability. Testing concerns externally visible results and is not specific to regression.
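The split described above can be sketched as follows. This is a minimal Python illustration, not the R workflow the post uses, and it runs on synthetic data rather than on a real dataset. The helper names (`fit_line`, `rmse`, `split_fit_test`, `make_synthetic`), the 25% test fraction, and the "true" coefficients of the synthetic data are all assumptions made for the example.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def rmse(xs, ys, a, b):
    """Root mean squared prediction error of the line a + b*x."""
    return math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs))


def split_fit_test(xs, ys, test_fraction=0.25, seed=0):
    """Shuffle, hold out a test set, fit on the rest, score on both."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    rng.shuffle(indices)
    n_test = max(1, int(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train_x = [xs[i] for i in train_idx]
    train_y = [ys[i] for i in train_idx]
    test_x = [xs[i] for i in test_idx]
    test_y = [ys[i] for i in test_idx]
    a, b = fit_line(train_x, train_y)  # coefficients come from training data only
    return a, b, rmse(train_x, train_y, a, b), rmse(test_x, test_y, a, b)


def make_synthetic(n=400, seed=1):
    """Hypothetical data: y = 2 + 0.5*x plus Gaussian noise with sigma = 1."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 + 0.5 * x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


if __name__ == "__main__":
    xs, ys = make_synthetic()
    a, b, train_rmse, test_rmse = split_fit_test(xs, ys)
    print(f"a={a:.3f} b={b:.3f} train RMSE={train_rmse:.3f} test RMSE={test_rmse:.3f}")
```

A test error far above the training error would suggest that the coefficients do not generalize beyond the data they were calculated on.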

Validation, on the other hand, is focused on the training set and involves using various regression-specific tools to detect inconsistencies with assumptions. For these purposes, we review methods provided by regression software.
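One example of such a regression-specific check, sketched here in Python under simplifying assumptions, is to look at standardized residuals on the training set. For simple linear regression, the leverage of point $i$ is $h_i = \frac{1}{n} + \frac{(x_i-\bar{x})^2}{\sum_j (x_j-\bar{x})^2}$, and the standardized residual is $r_i = e_i / \left(s\sqrt{1-h_i}\right)$, where $s$ is the residual standard error. The function names and the threshold of 3 are illustrative choices, not something prescribed by the post or by any particular software.

```python
import math


def standardized_residuals(xs, ys):
    """Fit y = a + b*x by least squares; return (standardized residuals, leverages).

    Assumes the fit is not perfect, so the residual standard error is nonzero.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    a = y_bar - b * x_bar
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    # Residual standard error, with n - 2 degrees of freedom.
    s = math.sqrt(sum(e * e for e in residuals) / (n - 2))
    leverages = [1 / n + (x - x_bar) ** 2 / sxx for x in xs]
    standardized = [e / (s * math.sqrt(1 - h)) for e, h in zip(residuals, leverages)]
    return standardized, leverages


def flag_outliers(xs, ys, threshold=3.0):
    """Indices whose standardized residual exceeds the (arbitrary) threshold."""
    standardized, _ = standardized_residuals(xs, ys)
    return [i for i, r in enumerate(standardized) if abs(r) > threshold]


if __name__ == "__main__":
    # Hypothetical data: a nearly exact line with one point shifted upward.
    xs = list(range(20))
    ys = [1 + 2 * x + (0.1 if x % 2 else -0.1) for x in xs]
    ys[10] += 10
    print(flag_outliers(xs, ys))
```

A flagged point is a prompt to investigate the data behind it, not a verdict that the point is wrong.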

In this post, we explore the meaning and the logic behind the tools provided for this purpose in simple and multiple linear regression in R, with the understanding that similar functions are available from other software and that similar tools exist for other forms of regression.

It is an attempt to clarify the meaning of these numbers and plots and help readers use them. They will be the judges of how successful it is.

The body of the post is about the application of these tools to an example dataset available from Kaggle, with about 30,000 data points. For the curious, some mathematical background is given in the appendix.

Many of the tools are developments from the last 40 years and, therefore, are not covered in the statistics literature from earlier decades.

By Michel Baudin • Data science • Tags: Linear Model, Quality, regression, Validation

Sep 8 2024

Using Regression to Improve Quality | Part II – Fitting Models

This is a personal guided tour of regression techniques intended for manufacturing professionals involved with quality. Starting from “historical monuments” like simple linear regression and multiple regression, it goes through “mid-century modern” developments like logistic regression. It ends with newer constructions like bootstrapping, bagging, and MARS. It is limited in scope and depth because full coverage would require a book and knowledge of many techniques I have not tried. See the references for more comprehensive coverage.

To fit a regression model to a dataset today, you don’t need to understand the logic, know any formula, or code any algorithm. Any statistical software, starting with electronic spreadsheets, will give you regression coefficients, confidence intervals for them, and, often, tools to assess the model’s fit.

However, treating it as a black box that magically fits curves to data is risky. You won’t understand what you are looking at and will draw mistaken conclusions. You need some idea of the logic behind regression in general or behind specific variants to know when to use them, how to prepare data, and to interpret the outputs.

By Michel Baudin • Data science • Tags: Bagging, Bootstrapping, Kriging, Linear regression, Logistic regression, MARS, Multiple regression, Multivariate regression, Substitute characteristic, True characteristic

Sep 3 2024

Using Regression to Improve Quality | Part I – What for?

In quality, regression serves to identify substitutes for true characteristics that are hard to observe and to find the root causes of technically challenging process problems. It is a major topic in data science, but oddly, the most extensive coverage I could find in the literature on quality is in Shewhart’s first book, from 1931! Later books, including Shewhart’s second, discuss it briefly or not at all. The ASQC, forerunner of the ASQ, published an 80-page guide on how to use regression analysis in quality control in 1985, but has not updated it since.

Regression analysis has been around for almost 140 years and has grown massively in scope, capabilities, and dataset size. Perhaps it is time for professionals involved with quality to take another look at it.

By Michel Baudin • Data science, Tools • Tags: Quality, regression, Statistical Process Control
