statistics Archives – Michel Baudin's Blog

Jun 25 2025

Update on Data Science versus Statistics

Based on the usage of the terms in the literature, I have concluded that statistics has been subsumed under data science. I view statistics as beginning with a dataset and ending with conclusions, while data science starts with sensors and transaction processing, and ends in data products for end users. Kelleher & Tierney’s Data Science views it the same way, and so do tool-specific references like Grolemund’s R for Data Science, or Zumel & Mount’s Practical Data Science with R.

Brad Efron and Trevor Hastie are two prominent statisticians with a different perspective. In the epilogue of their 2016 book, Computer Age Statistical Inference, they describe data science as a subset of statistics that emphasizes algorithms and empirical validation, while inferential statistics focuses on mathematical models and probability theory.

Efron and Hastie’s book is definitely about statistics, as it contains no discussion of data acquisition, cleaning, storage and retrieval, or visualization. I asked Brad Efron about it and he responded: “That definition of data science is fine for its general use in business and industry.” He and Hastie were looking at it from the perspective of researchers in the field.

By Michel Baudin • Data science, Uncategorized • 0 • Tags: data science, math, statistics


Jun 12 2022

Perspectives On Probability In Operations

The spirited discussions on LinkedIn about whether probabilities are relative frequencies or quantifications of beliefs are guaranteed to baffle practitioners. They come up in threads about manufacturing quality, supply-chain management, and public health, and do not generate much light. Their participants trade barbs without much civility, and without actually engaging on substance.

The latest one, by Alexander von Felbert, is among the more thoughtful, and therefore unlikely to inspire rants. I do, however, fault it for using words like “aleatory” or “epistemic” that I don’t think are helpful. I am trying to discuss it here in everyday language, and to apply the concepts to numerically specific cases, with an eye to operations.

While there are genuinely great and not-so-great ideas, the root of the most violent disagreements is elsewhere, with individuals generalizing from different experience bases. You may map probability to reality differently depending on whether you are developing drugs in the pharmaceutical industry, enhancing yield in a semiconductor process, or driving down dppms in auto parts. The math doesn’t care as long as you follow its rules, and it doesn’t invalidate other interpretations.
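As a rough illustration of how the two readings can sit side by side on the same data, here is a minimal sketch. The inspection counts are invented, and the uniform prior is an assumption chosen only for simplicity, not something taken from the discussion above. The first function reads probability as an observed relative frequency; the second reads it as a belief updated from a uniform Beta(1, 1) prior, whose posterior mean is (defects + 1) / (units + 2).

```python
from fractions import Fraction

def relative_frequency(defects: int, units: int) -> Fraction:
    """Frequency reading: the observed share of defective units."""
    return Fraction(defects, units)

def posterior_mean_uniform_prior(defects: int, units: int) -> Fraction:
    """Belief reading: mean of the Beta(defects + 1, units - defects + 1)
    posterior obtained from an assumed uniform Beta(1, 1) prior."""
    return Fraction(defects + 1, units + 2)

# Hypothetical inspection results, invented for illustration only.
for defects, units in [(0, 50), (3, 10_000), (120, 1_000_000)]:
    freq = relative_frequency(defects, units)
    belief = posterior_mean_uniform_prior(defects, units)
    print(f"{defects}/{units}: frequency={float(freq):.6f}, "
          f"posterior mean={float(belief):.6f}")
```

With few units inspected and no defects found, the two numbers differ visibly; with a million units, they are nearly indistinguishable. Which one is more useful depends on the experience base you are reasoning from.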

By Michel Baudin • Data science • 0 • Tags: Bayesian Statistics, data science, Probability, statistics

May 14 2015

Where Have The Scatterplots Gone?

What passes for “business intelligence” (BI), as advertised by software vendors, is limited to basic and poorly designed charts that fail to show interactions between variables, even though the use of scatterplots and elementary regression is taught to American middle schoolers and to shop floor operators participating in quality circles.

But the software suppliers seem to think that it is beyond the cognitive ability of executives. Technically, scatterplots are not difficult to generate, and there are even techniques to visualize more complex interactions than between pairs of variables, like trendalyzers or 3D scatterplots. And, of course, visualization is only the first step. You usually need other techniques to base any decision on data.
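To give a sense of how little is involved in elementary regression, here is a minimal sketch in plain Python. The temperature and defect figures are hypothetical, and `fit_line` is a name made up for this example. It computes the least-squares line and the correlation coefficient for one pair of variables; plotting the points themselves would take a charting library on top of this.

```python
from math import sqrt

def fit_line(xs, ys):
    """Return (slope, intercept, r) for the least-squares line
    y = slope * x + intercept, with r the correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r

# Hypothetical data: oven temperature (deg C) vs. defects per 1,000 units.
temperature = [180, 185, 190, 195, 200, 205]
defects = [12, 11, 9, 8, 6, 5]

slope, intercept, r = fit_line(temperature, defects)
print(f"slope={slope:.3f}, intercept={intercept:.1f}, r={r:.3f}")
```

A negative slope with a correlation close to -1 is the numerical counterpart of what a scatterplot of these points would show at a glance.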

By Michel Baudin • Data science • 4 • Tags: data science, Quality, scatterplot, statistics

Jul 30 2014

“Studies show…” or do they?

Various organizations put out studies that, for example, purport to “identify performances and practices in place among U.S. manufacturers.” The reports contain tables and charts, with narratives about “significant gaps” — without stating any level of significance — or “exponential growth” — as if there were no other kind. They borrow the vocabulary of statistics or data science, but don’t actually use the science; they just use the words to support sweeping statements about what manufacturers should do for the future.

At the bottom of the reports, there usually is a paragraph about the study methodology, explaining that the data was collected as answers to questionnaires mailed to manufacturers and made available online, with the incentive for recipients to participate being a free copy of the report. The participants are asked, for example, to rate “the importance of process improvement to their organization’s success over the next five years” on a scale of 1 to 5.

The results are a compilation of subjective answers from a self-selected sample. In marketing, this kind of survey makes sense. You throw out a questionnaire about a product or a service. The sheer proportion of respondents gives you information about the level of interest in what you are offering, and the responses may further tell you about popular features and shortcomings.

But it is not an effective approach to gauge the state of an industry. For this purpose, you need objective data, either on all companies involved or on a representative sample that you select. Government bodies like the Census Bureau or the Bureau of Labor Statistics collect useful global statistics like value-added per employee or the ratio of indirect to direct labor by industry, but they are just a starting point.
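The gap between a self-selected sample and the population it claims to describe can be sketched with a few lines of arithmetic. Everything in the example below is hypothetical: the distribution of ratings in the population, and the assumption that firms giving higher ratings are more likely to return the questionnaire. It compares the average rating across all firms with the average you would expect among respondents.

```python
# Hypothetical population: number of firms that would give each rating (1 to 5).
population = {1: 200, 2: 300, 3: 300, 4: 150, 5: 50}

# Assumed response propensities: firms that care more about the topic
# are taken to be more likely to answer. These rates are invented.
response_rate = {1: 0.02, 2: 0.04, 3: 0.08, 4: 0.20, 5: 0.40}

def mean_rating(counts):
    """Weighted mean rating for a {rating: count} mapping."""
    total = sum(counts.values())
    return sum(rating * n for rating, n in counts.items()) / total

# Expected number of respondents per rating under the assumed propensities.
respondents = {r: population[r] * response_rate[r] for r in population}

print(f"population mean: {mean_rating(population):.2f}")
print(f"respondent mean: {mean_rating(respondents):.2f}")
```

Under these assumptions, the respondents' average sits about a full point above the population's, with nobody having lied on the form.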

Going beyond is so difficult that I don’t know of any successful case. Any serious assessment of a company or factory requires visiting it, interviewing its leaders in person, and reviewing its data. It takes time, money, know-how, and a willing target. It means that the sample has to be small, but there is a clash between the objective of having a representative sample and the constraint of having a sample of the willing.

For these reasons, benchmarking is a more realistic approach, and I know of at least two successful benchmarking studies in manufacturing, both of which, I believe, were funded by the Sloan Foundation:

The first was the International Assembly Plant Study, conducted in the late 1980s about the car industry, whose findings were summarized in The Machine That Changed The World in 1990. The goal was not to identify the distribution of manufacturing practices worldwide but to compare the approaches followed in specific plants of specific companies, for the purpose of learning. Among other things, the use of the term “Lean” came out of this study.
The second is the Competitive Semiconductor Manufacturing Program, which started in the early 1990s with a benchmarking study of wafer fabrication facilities worldwide. It did not have the public impact of the car assembly plant study, but it did provide valuable information to industry participants.

The car study was conducted out of MIT; the semiconductor study, out of UC Berkeley. Leadership from prestigious academic organizations helped in convincing companies to participate and provided students to collect and analyze the data. Consulting firms might have had better expertise, but could not have been perceived as neutral with respect to the approaches used by the different participants.

The bottom line is that studies based on subjective answers from a self-selected sample are not worth the disk space you can download them onto.

By Michel Baudin • Management • 4 • Tags: Benchmarking, data science, Lean, Manufacturing, statistics, survey

May 13 2014

The GM Toyota Rating Scale | Bill Waddell

See on Scoop.it – lean manufacturing

“In a survey of suppliers on their working relationships with the six major U.S. auto makers – Toyota, Honda, Nissan, Ford, Chrysler and GM – GM scored the worst. But of course they did. They are GM and we can always count on such results from them. […] Toyota scored highest with a ranking of 318, followed by Honda at 295, Nissan at 273, Ford at 267, Chrysler at 245, with GM trotting along behind the rest with an embarrassing 244.”

Michel Baudin’s comments:

While I am not overly surprised at the outcome, I am concerned about the analysis method. The scores are weighted counts of subjective assessments, with people being asked to rate, for example, the “Supplier-Company overall working relationship” or “Suppliers’ opportunity to make acceptable returns over the long term.”

This is not exactly like the length of a rod after cutting or the sales of Model X last month. There is no objective yardstick, and two individuals might rate the same company behavior differently.

It is not overly difficult to think of more objective metrics, such as, for example, the “divorce rate” within a supplier network. What is the rate at which existing suppliers disappear from the network and others come in? The friction within a given Supplier-Customer relationship could be assessed from the number of incidents like the customer paying late or the supplier missing deliveries…

Such data is more challenging to collect, but supports more solid inferences than opinions.
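As a sketch of what such a metric could look like, the following computes a supplier “divorce rate” from two snapshots of a supplier list. The function name, the supplier names, and the choice of dividing by the number of suppliers at the start of the period are all assumptions made for illustration, not an established definition.

```python
def supplier_turnover(start: set[str], end: set[str]) -> dict[str, float]:
    """Compare two snapshots of a supplier network.

    departure_rate: share of starting suppliers no longer present at the end.
    arrival_rate: new suppliers at the end, relative to the starting count.
    """
    departed = start - end
    arrived = end - start
    return {
        "departure_rate": len(departed) / len(start),
        "arrival_rate": len(arrived) / len(start),
    }

# Hypothetical supplier lists at the beginning and end of a year.
suppliers_jan = {"Acme Stamping", "Borealis Plastics", "Corvus Castings", "Delta Wiring"}
suppliers_dec = {"Acme Stamping", "Borealis Plastics", "Echo Fasteners"}

print(supplier_turnover(suppliers_jan, suppliers_dec))
```

Unlike a rating on a scale, two people working from the same supplier lists will arrive at the same numbers.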

See on www.idatix.com

By Michel Baudin • Blog clippings • 1 • Tags: GM, statistics, Subjective data, Supply Chain Management, Toyota
