Jun 25 2025
Update on Data Science versus Statistics
Based on the usage of the terms in the literature, I have concluded that statistics has been subsumed under data science. I view statistics as beginning with a dataset and ending with conclusions, while data science starts with sensors and transaction processing, and ends in data products for end users. Kelleher & Tierney’s Data Science views it the same way, and so do tool-specific references like Wickham & Grolemund’s R for Data Science, or Zumel & Mount’s Practical Data Science with R.


Brad Efron and Trevor Hastie are two prominent statisticians with a different perspective. In the epilogue of their 2016 book, Computer Age Statistical Inference, they describe data science as a subset of statistics that emphasizes algorithms and empirical validation, while inferential statistics focuses on mathematical models and probability theory.
Efron and Hastie’s book is definitely about statistics, as it contains no discussion of data acquisition, cleaning, storage and retrieval, or visualization. I asked Brad Efron about it and he responded: “That definition of data science is fine for its general use in business and industry.” He and Hastie were looking at it from the perspective of researchers in the field.
Apr 3 2026
Deviating Standard Deviations
This basic concept deserves revisiting. The following is from a 2022 blog post, hosted by a supplier of statistical software, that was intended to explain the meaning of some notations in plain, simple terms:
The author calls two different things by the same name. If the standard deviation of each variable is 1, how could its expected value be anything else? The confusion in this nonsensical statement is the same one we commit when we equate the temperature of a soup with a thermometer reading. In our mental model, a bowl of soup has a temperature that exists regardless of our ability to measure it, and the thermometer reading is only an estimate of it.
For the purposes of eating soup, confusing the two is harmless, unless the thermometer is poorly calibrated and always gives you a reading that is 15°F off. This is the situation we have with the most commonly used estimator of the standard deviation of a random variable from a small sample. It is biased, and c_4(N) is the correction factor that applies when the random variable is Gaussian.
To describe c_4(N) accurately, we need to dig into probability theory. It is, in fact, the expected value of the estimator S=\sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left ( X_i -\bar{X} \right )^2} of the standard deviation from a sample of N independent Gaussian variables \left ( X_1, \dots, X_N \right ) with unit standard deviation, \sigma = 1. This is an accurate statement, but every term in it needs an explanation.
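To make the definition concrete, here is a minimal Python sketch. It is not from the blog post quoted above. It assumes the standard closed form c_4(N) = \sqrt{\frac{2}{N-1}}\,\Gamma\left(\frac{N}{2}\right)/\Gamma\left(\frac{N-1}{2}\right) and compares it with the average of S over many simulated samples of N independent Gaussian variables with \sigma = 1. The function names `c4` and `simulated_mean_s`, the number of trials, and the seed are illustrative choices.

```python
import math
import random


def c4(n: int) -> float:
    """Closed-form expected value of S for n independent Gaussians with sigma = 1."""
    if n < 2:
        raise ValueError("c4 needs at least two observations")
    # Work with log-gamma so that large n does not overflow Gamma(n/2).
    log_ratio = math.lgamma(n / 2) - math.lgamma((n - 1) / 2)
    return math.sqrt(2.0 / (n - 1)) * math.exp(log_ratio)


def simulated_mean_s(n: int, trials: int = 20_000, seed: int = 0) -> float:
    """Average of S over `trials` simulated samples of size n with sigma = 1."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean = sum(sample) / n
        # Same estimator as in the text: divide the sum of squares by n - 1.
        total += math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    return total / trials


if __name__ == "__main__":
    for n in (2, 5, 10, 25):
        print(f"N={n:2d}  c4={c4(n):.4f}  simulated mean of S={simulated_mean_s(n):.4f}")
```

For every N, the simulated average of S falls below the true value \sigma = 1, and the gap shrinks as N grows. That is the bias that dividing S by c_4(N) is meant to remove.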
By Michel Baudin • Technology 0 • Tags: Control Charts, Probability, Quality, Six Sigma, SPC, Standard Deviation, statistics