Mar 25 2024
A Review of Bernoulli’s Fallacy
Aubrey Clayton’s book, Bernoulli’s Fallacy, covers the same ground as Jaynes’s Probability Theory: The Logic of Science, for a broader audience. It is also an easier read, at 347 pages versus 727. In addition, the author discusses the socio-political context of mathematical statistics in the late 19th and early 20th centuries. According to his account, the resulting mistakes affected fields ranging from justice and medicine to the social sciences. The book ends with recommendations for avoiding their repetition.
This book is definitely a shot from the Bayesian side in the war between Bayesians and frequentists. This war is tearing apart the world of statisticians, but most data scientists have no wish to enlist on either side. They should nonetheless read the book, for its challenging ideas and historical background.
The Flawed Fathers of Orthodox Statistics
Clayton discusses at length the Jekyll-and-Hydes of 19th and 20th-century statistics, Francis Galton, Karl Pearson, and Ronald Fisher. Their toxic agendas make it challenging today to acknowledge their contributions. For better or worse, we owe them the bulk of “orthodox statistics,” including tools like correlation, regression, significance testing, histograms, t-tests, and analysis of variance. It is, however, impossible to forget how their promotion of the pseudo-science of eugenics encouraged racial discrimination, justified colonialism, and paved the way for the Holocaust.
Clayton argues that their involvement in eugenics influenced their approach to statistics, particularly their insistence on using only frequencies measured on collected data. First, you enslave people. Then you deny them education. Finally, you subject them to the same tests as others not so artificially handicapped. Unsurprisingly, you conclude that they underperform and are, therefore, inferior.
The Legacy of Eugenics
I have previously posted on this topic. In 2018, the school board of Palo Alto, CA, changed the name of Jordan Middle School to Greene. David Starr Jordan had been the first president of Stanford University, from 1891 to 1913, and a notorious promoter of eugenics, advocating, in particular, forced sterilizations.
Changing the school name was controversial, as many Palo Altans viewed Jordan as a historical figure who should get a pass for behaviors that we find abhorrent today, like Thomas Jefferson’s ownership of slaves. But Jefferson was born into that system. Galton, on the other hand, invented eugenics. Pearson and Fisher gave it “scientific” cover in the UK, and Jordan promoted it in California.
An Indian-American woman born in California in 1972 posted on Nextdoor about finding out as an adult why she was an only child: her mother’s obstetrician had sterilized the mother against her will after she gave birth, on the grounds that “we have enough brown babies.” This personal story of the recent after-effects of eugenics ended the discussions. The school is now named after Frank Greene.
Mathematics and Pragmatics
Defining probability as plausibility is not the strongest part of Jaynes’s work, as it begs the question of how you define plausibility. For clarity and mutual understanding, we need more than a substitute word.
While neither Clayton nor Jaynes says it in so many words, I think the key is to separate the math from the pragmatics. For my own take on these matters, see Perspectives On Probability In Operations.
Bernoulli’s Fallacy
The Bernoulli name is so illustrious that one is loath to name a fallacy after it. From the late 17th to the late 18th century, this Basel family produced eight mathematicians and scientists who have numbers, variables, theorems, and laws named after them. Jakob Bernoulli, in particular, is considered the founder of probability theory, and the fallacy Clayton blames him for is the confusion between the probability of data given a hypothesis and the probability of a hypothesis given the data.
If you have the flu, there is an 80% probability that you have a fever, but, if all you know is that you have a fever, what is the probability that you have the flu? This is the general problem of diagnosis. The Bayes formula to reverse the conditionality is simple in appearance:
P\left ( Flu\,\vert\,Fever, Season \right )= \frac{P\left ( Fever\,\vert\, Flu, Season \right )\times P\left ( Flu \,\vert\,Season\right )}{P\left ( Fever\,\vert\, Season \right )}
The challenge in applying it is that it requires more data than just P\left ( Fever\,\vert\, Flu \right ). Clayton presents numerical examples in a format like the following table:
Relative frequencies among patients in a doctor’s office

| Season | Flu | Fever | Fever, given Flu | Flu, given Fever |
|---|---|---|---|---|
| Flu season | 30% | 50% | 80% | 80% × 30% / 50% = 48% |
| Not flu season | 5% | 20% | 80% | 80% × 5% / 20% = 20% |
The numbers here are assumptions for illustration. Real general practitioners could collect this data but usually don’t. They just know that the same symptoms don’t carry the same weight as evidence of a disease in different contexts. Ignoring this is what Clayton calls “Bernoulli’s Fallacy.”
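To make the arithmetic explicit, here is a minimal Python sketch that reproduces the last column of the table. The frequencies are the illustrative assumptions above, not clinical data.

```python
# Minimal sketch of the table's Bayes-rule arithmetic; the frequencies
# are the illustrative assumptions from the table, not clinical data.
p_fever_given_flu = 0.80  # P(Fever | Flu, Season), same in both rows

seasons = {
    "Flu season":     {"p_flu": 0.30, "p_fever": 0.50},
    "Not flu season": {"p_flu": 0.05, "p_fever": 0.20},
}

for season, p in seasons.items():
    # Bayes: P(Flu | Fever, Season) =
    #   P(Fever | Flu, Season) * P(Flu | Season) / P(Fever | Season)
    p_flu_given_fever = p_fever_given_flu * p["p_flu"] / p["p_fever"]
    print(f"{season}: P(Flu | Fever) = {p_flu_given_fever:.0%}")
```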
Significance Testing
Much of Clayton’s book is about debunking significance testing as being based on Bernoulli’s Fallacy and leading to mistakes. In particular, he points out irreproducible results in the social sciences, published in top journals where significance testing was required for publication. According to Clayton, the consensus on this is fraying in scientific journals, and he cites Ron Wasserstein, executive director of the American Statistical Association (ASA) in the 2010s, as distancing himself from significance testing. Today, the ASA still publishes a magazine called Significance.
In 2020, COVID-19 made me look into the method used by the CDC to estimate the Excess Deaths it caused. Reading Clayton four years later reminded me of an aspect of the method involving significance testing that bothered me. However, I didn’t voice my concern then because I was trying to understand the method, not second-guess it.
An Arbitrary Threshold
The CDC used Farrington’s surveillance algorithm to estimate Expected Deaths by week, with a threshold at the 95th percentile of the distribution of Expected Deaths. When the Observed Deaths for a given week exceeded this threshold, they estimated
\text{Excess Deaths}\left ( Week \right )= \text{Observed Deaths}\left ( Week \right )- \text{Expected Deaths}\left ( Week \right )
The following chart shows realistic numbers for actual weekly death counts in the US during the COVID-19 pandemic and expected death counts based on the Farrington surveillance model trained on pre-pandemic data.
Below this threshold, they considered the Observed Deaths to be fluctuations around the Expected Deaths level and not evidence of Excess Deaths.
Considering the context of a pandemic that unfolded over more than two years, I failed to see the point of checking weekly data against an arbitrary threshold, a check that results in an undercount of Excess Deaths.
In weeks n-1 and n+1, the Observed counts exceed the Thresholds, and the difference between the Observed and Expected counts estimates Excess Deaths. In week n, the Observed count doesn’t, we don’t reject the hypothesis that there are no Excess Deaths, and the count for this week is therefore 0:
- \text{Excess Deaths}\left ( n-1 \right ) = \text{Observed}\left ( n-1 \right ) - \text{Expected}\left ( n-1 \right )= 13,000
- \text{Excess Deaths}\left ( n \right ) = 0
- \text{Excess Deaths}\left ( n+1 \right ) = \text{Observed}\left ( n+1 \right ) - \text{Expected}\left ( n+1 \right )= 12,000
The notion that the number of excess deaths due to a pandemic should oscillate from thousands to zero and back week to week makes no sense, especially considering that the epidemiological models don’t make these quantities independent over time. The idea that counts of Excess Deaths could jump by thousands when a weekly count crosses a threshold also strains credulity.
The failure to cross the threshold in Week n tells you that this particular test does not allow you to reject the null hypothesis of no excess deaths at a significance level of 5%, but it does not validate this hypothesis.
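To make the undercount mechanism concrete, here is a minimal Python sketch of the threshold rule. The weekly numbers are made up to match the 13,000 and 12,000 above, and a normal prediction distribution stands in for the Farrington model, which is not reproduced here.

```python
import numpy as np
from scipy.stats import norm

# Made-up weekly numbers for weeks n-1, n, n+1, chosen to match the
# 13,000 and 12,000 excess deaths quoted in the text.
observed = np.array([60_000, 55_000, 62_000])   # Observed Deaths
expected = np.array([47_000, 54_000, 50_000])   # Expected Deaths (model mean)
sigma    = np.array([ 1_500,  1_500,  1_500])   # assumed prediction std. dev.

# Threshold at the 95th percentile of the weekly prediction distribution.
threshold = norm.ppf(0.95, loc=expected, scale=sigma)

# Weekly Excess Deaths count only when the observed count crosses the
# threshold; otherwise the week contributes 0, hence the undercount.
excess = np.where(observed > threshold, observed - expected, 0)
print(excess, excess.sum())   # [13000     0 12000] 25000
```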
Focusing on Distributions Instead
The Farrington surveillance algorithm provides not just an estimate of Expected Deaths by week but also a complete probability distribution for the count of deaths not caused by the pandemic, which we could call Regular Deaths, so that, each week,
\text{Expected Deaths}\left ( Week \right ) = E\left [\text{Regular Deaths}\left (Week \right ) \right]
is the expected value of Regular Deaths.
Except for measurement errors that we can assume small, Observed Deaths are exact, and therefore we can deduce the distribution of \text{Excess Deaths} per week W:
\text{Excess Deaths}\left ( W \right )= \text{Observed Deaths}\left ( W \right )- \text{Regular Deaths}\left ( W\right )
from the distribution of Regular Deaths. Then
E\left [\text{Excess Deaths}\left ( W\right ) \right ]= \text{Observed Deaths}\left ( W \right )- \text{Expected Deaths}\left ( W \right )
The randomness is all in the fluctuations of Regular Deaths, which are independent week by week. It means that we can add the E\left [\text{Excess Deaths}\left ( W \right ) \right ] and the Var\left [\text{Excess Deaths}\left ( W\right ) \right ] to aggregate the counts over the whole pandemic.
And this is free of any arbitrary threshold or significance test. Reporting a distribution rather than a “significant difference” is, in fact, what Clayton advocates.
A distribution is richer information than just an expected value. For Week n, we have
E\left [\text{Excess Deaths}\left ( n \right ) \right ] = \text{Observed}\left ( n \right ) - \text{Expected}\left ( n \right )
rather than 0, even though there is a nonzero probability of Regular Deaths exceeding Observed Deaths.
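Continuing with the same made-up weekly numbers, here is a minimal sketch of the distribution-based alternative, under the assumptions stated above: weekly Regular Deaths are independent, with known means and standard deviations.

```python
import numpy as np

# Same made-up weekly numbers as in the previous sketch.
observed = np.array([60_000, 55_000, 62_000])   # Observed Deaths
expected = np.array([47_000, 54_000, 50_000])   # E[Regular Deaths] per week
sigma    = np.array([ 1_500,  1_500,  1_500])   # assumed Std[Regular Deaths]

# Excess Deaths(W) = Observed Deaths(W) - Regular Deaths(W), so per week:
mean_excess = observed - expected               # E[Excess Deaths(W)]
var_excess  = sigma ** 2                        # Var[Excess Deaths(W)]

# Independence of the weekly fluctuations lets us add means and variances
# to aggregate over the whole period, with no threshold and no test.
total_mean = int(mean_excess.sum())
total_sd = float(np.sqrt(var_excess.sum()))
print(f"Excess deaths: {total_mean:,} +/- {2 * total_sd:,.0f} (2 sd)")
```

With these numbers, the total is 26,000 rather than the 25,000 obtained with the threshold rule, which illustrates where the undercount comes from.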
Historical Details
Along the way, Clayton provided details I didn’t know and enjoyed learning, particularly about the origins of metaphors and terms we use constantly.
Origins of the Urn
The metaphor of urns and beads runs through discussions of discrete probabilities, from Bernoulli to Deming’s Red Bead Experiment. The problems of drawing beads from urns focus discussions on combinatorics that are both complicated and only mildly relevant to probability in general. The metaphor is so common that I never thought to ask where it came from.
Clayton traces it to the election system used to choose the doges of Venice for over 500 years, from 1268 to 1797. Even more arcane and confusing than US presidential elections, it was designed to prevent manipulation by powerful families. It involved multiple stages of drawing balls (“ballotte”) from urns, from which we got the word “ballot.” It also spawned probabilists’ obsession with thought experiments about drawing beads from urns.
An 18th-century illustration of the Venice doge election process
Why “Expected Values”
350 years ago, Pascal and Fermat discussed the fair allocation of the pot in a game of chance between two players that is interrupted before it ends with one of them holding all of it.
If you consider all the ways the game could finish and count the relative frequencies of resulting wins for each player, it seems fair to allocate the pot between the two players according to these frequencies.
If 30% of the possible completions of the game resulted in player A winning, then it is fair for A to get 30% of the pot for the unfinished game. This is the expected value of A’s position at the time of interruption.
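A minimal Python sketch of this reasoning, for a hypothetical game where the first player to win a set number of rounds takes the pot and each round is a 50/50 toss; the stake and the scores at interruption are made up.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def p_a_wins(a_needs: int, b_needs: int) -> float:
    """Probability that A eventually wins when A still needs a_needs
    rounds and B still needs b_needs rounds, each round a 50/50 toss."""
    if a_needs == 0:
        return 1.0
    if b_needs == 0:
        return 0.0
    return 0.5 * p_a_wins(a_needs - 1, b_needs) + 0.5 * p_a_wins(a_needs, b_needs - 1)

pot = 100
# Game interrupted with A needing 3 more rounds and B needing only 1:
share_a = p_a_wins(3, 1) * pot
print(f"A's expected value: {share_a:.2f}, B's: {pot - share_a:.2f}")
# A's expected value: 12.50, B's: 87.50
```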
The Way Out
This chapter is not a plan for overhauling the practice of data science or even a manifesto. It recommends practices for individuals to follow and takes stock of the changes in progress in professional societies and among publishers of scientific papers.
Whose Prior Knowledge?
To get the probability of a hypothesis given data, you need to take prior knowledge into account with the Bayes formula. You can do this directly if you are the decision maker, for example, a production engineer in charge of a process or a medical doctor diagnosing a patient. The outcomes validate your decisions when the process quality improves, or the patient heals.
It is different when you have to convince another party that may not have the same prior knowledge. The only third party considered in the book is the editors of scientific journals. Professionals working with data also need support and approval from managers or investors with an economic stake in the results of their studies, government agencies regulating their industry, or judges adjudicating a case.
A pharmaceutical company seeking approval for a new drug must follow the protocols set by a government agency to protect the public, and changing those is a much taller order than the requirements on papers in social science.
What Do You Replace Significance Testing With?
In 1902, after refining tons of pitchblende down to 0.1 g, Marie Curie knew she had isolated radium chloride because it glowed in the dark. In 1953, when Crick and Watson proposed their double helix model for DNA, they had circumstantial evidence for it. As for many other discoveries in physics, chemistry, biology, or geology, logical reasoning sufficed, and statistical significance testing was not required.
On the other hand, logic won’t get you far if you want to know whether web users will click through more often on a yellow or a green button. You test groups of users with the two colors and apply some method to tell whether a 2% difference is meaningful or just a fluctuation that could as well go in the opposite direction on a follow-up test. Null hypothesis significance testing was supposed to solve this kind of problem.
If it is misleading, then what is the alternative? Clayton says “report a probability distribution showing the likely size of the difference.” In the case of Excess Deaths from COVID-19, I see how the use of an arbitrary threshold on weekly counts could lead to an underestimation. I could also see how to do better by focusing instead on the distribution of Regular Deaths. In general, however, I don’t see how Clayton’s recommendation answers the original question of whether an observed difference is more than a fluctuation.
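For what it is worth, here is a minimal sketch of one way to follow that advice on the button example, with made-up click counts and a uniform Beta prior. The point is that the output is a distribution of the difference, not a verdict.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up A/B test results: clicks and impressions for each button color.
yellow_clicks, yellow_views = 520, 10_000
green_clicks,  green_views  = 540, 10_000

# With a uniform Beta(1, 1) prior, each click-through rate has a
# Beta(clicks + 1, non-clicks + 1) posterior; sample from both.
n = 100_000
p_yellow = rng.beta(yellow_clicks + 1, yellow_views - yellow_clicks + 1, n)
p_green  = rng.beta(green_clicks + 1,  green_views - green_clicks + 1,  n)

# Report the distribution of the difference rather than a yes/no verdict.
diff = p_green - p_yellow
low, high = np.percentile(diff, [2.5, 97.5])
print(f"P(green better than yellow) = {np.mean(diff > 0):.2f}")
print(f"95% credible interval for the difference: [{low:.4f}, {high:.4f}]")
```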
On LinkedIn, Adrian Olszewski cited several papers published since 2015 debating these matters for scientific publication, which do not reflect any consensus.
Should We Change the Vocabulary?
Clayton feels that the vocabulary of statistics is tainted by the association of Galton, Pearson, and Fisher with eugenics and wants to replace it with more neutral and descriptive terms.
The language of probability and statistics is inconsistent and confusing, but it is not clear that changing it is practically feasible or would do much good. There is no way the many worldwide users could reach a consensus on new terms and, even if they did, the change would make the bulk of the existing body of knowledge less accessible.
Yes, probabilistic, random, and stochastic are synonyms, the only difference being their respective Latin, Germanic, and Greek roots. Regression is an absurd name for linear modeling, and standard deviation sounds judgmental. Logistic regression has nothing to do with logistics; instead, it is linear modeling of the logarithm of the odds of an event. Finally, to confuse matters further, statistical physics uses a completely different vocabulary for the same concepts. For example, “Bose-Einstein statistics” is elsewhere known as a distribution.
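As a reminder of what the name covers, in generic notation, with P\left ( Event \right ) the probability of the event and x_1, \dots, x_k the explanatory variables, the model is linear in the logarithm of the odds:
\log \frac{P\left ( Event \right )}{1-P\left ( Event \right )}= \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k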
Clayton wants to replace Random Variable with Unknown, but he is unclear about what that would improve. Sample data are values taken by a random variable, which would make them known values of an unknown? This would generate more confusion than it would clear up.
For better or worse, reforming languages is practically impossible. Take English spelling: we are stuck with it.
Stochastic Processes
Clayton includes stochastic processes among the “useless concepts.” His reason is that, per Jaynes, “there is no such thing as a ‘stochastic process’ in the sense that individual events have no specific causes.”
Jaynes meant that, in reality, a person catches a disease through exposure to a germ, or a machine stalls because of metal fatigue. In no way does this invalidate stochastic processes as modeling tools for contagion or machine reliability. These models are useful when we don’t know the specific causes.
Review of the Book by ChatGPT
After reading the book cover to cover and writing the above review from scratch, I prompted ChatGPT for its own review. I received the following answer:
“I’m sorry, but as an AI language model, I do not have access to real-time data. However, I can generate a fictitious review of ‘Bernoulli’s Fallacy’ by Aubrey Clayton if you’d like. Just let me know if you would like me to provide a sample review!”
I told it to go ahead, and it produced the following:
“Aubrey Clayton’s ‘Bernoulli’s Fallacy’ is a riveting exploration of the complexities of human nature and the fragile boundaries between truth and deception. Clayton masterfully weaves a narrative that delves into the depths of psychological manipulation, challenging readers to question their own perceptions and beliefs.
The protagonist’s journey through a world filled with intrigue and uncertainty is both captivating and thought-provoking, as Clayton skillfully unravels the intricacies of Bernoulli’s enigmatic theory. The intricate plot twists and turns keep readers on the edge of their seats, while the rich character development adds depth and authenticity to the story.
Clayton’s writing is elegant and precise, with a keen eye for detail that brings the story to life in vivid detail. The thematic exploration of trust, illusion, and the human capacity for self-deception elevates ‘Bernoulli’s Fallacy’ beyond a mere thriller, transforming it into a profound meditation on the nature of reality itself.
In conclusion, Aubrey Clayton’s ‘Bernoulli’s Fallacy’ is a mesmerizing tale that will linger in the mind long after the final page is turned. A must-read for fans of psychological suspense and philosophical fiction alike.”
The next time you see such a review in your favorite magazine, you will know that the critic has not read the book.
References
Adam, D. (2023). What reproducibility crisis? New research protocol yields ultra-high replication rate. Nature.
Baudin, M. (1977). Geostatistical analysis of data on an orebody with nonferrous metals. Hahn-Meitner Institut, Berlin. Call number EMF 3909(094) at the ENSMP Library in Fontainebleau.
Billingsley, P. (2012). Probability and Measure. Wiley.
Devlin, K. (2008). The Unfinished Game: Pascal, Fermat, and the Seventeenth-Century Letter that Made the World Modern. Basic Books.
Mayo, D. G., & Hand, D. (2022). Statistical significance and its critics: practicing damaging science, or damaging scientific practice? Synthese, 200, 220. https://doi.org/10.1007/s11229-022-03692-0
#bayesianstatistics, #frequentiststatistics
John Mount
June 1, 2024 @ 6:52 am
Definitely some interesting points in your review.
I myself have a mixed view of what can be changed in statistics versus what is suggested by the book. The terminology changes (to me) seem unlikely. Reporting clinical effect sizes (also called clinical significances) seems critical.
I found I care much less about the simple statistical question “did Covid cause excess deaths in a region” than about “what are the likely counts and rates of excess Covid deaths” (sorry, that phrasing makes it seem like we are assuming Covid exists or not; it not existing would be the excess having a zero or small expected value).
Roughly I came away with “statistics is not the owner of probability.” That is in fact compatible with my own history, having used measure theory and probability in analysis of algorithms long before I used it in statistics.