Jun 25 2025
Update on Data Science versus Statistics
Based on the usage of the terms in the literature, I have concluded that statistics has been subsumed under data science. I view statistics as beginning with a dataset and ending with conclusions, while data science starts with sensors and transaction processing and ends in data products for end users. Kelleher & Tierney’s Data Science views it the same way, as do tool-specific references like Grolemund & Wickham’s R for Data Science or Zumel & Mount’s Practical Data Science with R.
Brad Efron and Trevor Hastie are two prominent statisticians with a different perspective. In the epilogue of their 2016 book, Computer Age Statistical Inference, they describe data science as a subset of statistics that emphasizes algorithms and empirical validation, while inferential statistics focuses on mathematical models and probability theory.
Efron and Hastie’s book is definitely about statistics, as it contains no discussion of data acquisition, cleaning, storage and retrieval, or visualization. I asked Brad Efron about it and he responded: “That definition of data science is fine for its general use in business and industry.” He and Hastie were looking at it from the perspective of researchers in the field.
Efron & Hastie’s view of statistics since 1900
They summarize the trajectory of the field of statistics since the 19th century in a triangle diagram, and then explain each of its milestones:
Mathematical statistics
From Karl Pearson’s theory of the chi-square in 1900 to Abraham Wald’s decision theory in 1950, Efron & Hastie see statistics morphing into a branch of applied mathematics, relying on theoretical sophistication to compensate for the paucity of data, and drifting away from applications. This led to what they describe as “a nadir of the influence of the statistics discipline on scientific applications” in the 1950s.
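To make the contrast with what follows concrete, here is a minimal sketch of the style of inference Pearson introduced in 1900: a chi-square goodness-of-fit test, in which the conclusion rests on a probability model and tabulated critical values rather than on empirical validation. The counts below are made up for illustration.

```python
# Chi-square goodness-of-fit test, in plain Python.
# Illustrative data: observed counts in 4 categories versus the
# counts expected under the null hypothesis of equal frequencies.
observed = [18, 22, 29, 31]
expected = [25, 25, 25, 25]

# Pearson's statistic: sum of (O - E)^2 / E over all categories
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Critical value of the chi-square distribution with 3 degrees of
# freedom at the 5% significance level, from standard tables
CRITICAL_5_PERCENT_DF3 = 7.815

reject_null = chi_square > CRITICAL_5_PERCENT_DF3
print(chi_square)   # 4.4
print(reject_null)  # False
```

The verdict comes entirely from the theoretical sampling distribution of the statistic, which is what made this approach workable when data were scarce.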
The advent of the computer
Computers changed the game by making massive data sets accessible and tractable, as dramatized in the 1957 movie Desk Set. Efron himself noted that, while 100 points had been a large data set when he started out in the 1960s, 30,000 points was a small one by the time he retired.
In academic statistics, computers triggered a shift away from mathematical theories back towards applications, with an emphasis on what programs could do with data. This prompted Mosteller and Tukey to suggest renaming the discipline to data analysis.
Efron and Hastie view data science as branching out, around 1995, from the probability models that had been essential to support conclusions from small datasets. They see it as a subset of statistics that extracts information from large datasets using methods validated empirically rather than theoretically. The descriptions of these methods, like SVM, read more like geometry on a cloud of points, each with as many dimensions as the observations have features.
Conclusions
Logically, data science cannot be both a superset and a subset of statistics. This is, however, the kind of dissonance that we routinely tolerate in geography, for example when we use the same name to refer to the territory of Hong Kong and to an island within it.
Likewise, we can get used to using “data science” both for the entire process of working with data, from collection to the delivery of results, and for a subset of statistics that applies empirically validated algorithms to large data sets.
What matters is that data science in the sense of Efron and Hastie represents a fundamental departure from the approach that had dominated statistics through the 20th century, and that computers enabled this departure. Writing in 2016, Efron & Hastie were aware of tools that they would place under data science, like Random Forests, SVM, Bagging and Boosting, but not of later developments, like LLMs.
By Michel Baudin • Data science, Uncategorized • 0 • Tags: data science, math, statistics