A Review of Bernoulli’s Fallacy

Aubrey Clayton’s book, Bernoulli’s Fallacy, covers the same ground as Jaynes’s Probability Theory: The Logic of Science, for a broader audience. It is also an easier read, at 347 pages versus 727. In addition, the author discusses the socio-political context of mathematical statistics in the late 19th and early 20th centuries. According to his account, the resulting mistakes affected fields ranging from justice and medicine to the social sciences. The book ends with recommendations to avoid repeating them.

This book is definitely a shot from the Bayesian side in the war between Bayesians and frequentists. That war is tearing apart the world of statisticians, but most data scientists have no wish to enlist on either side. They should nonetheless read the book, for its challenging ideas and historical background.

The Flawed Fathers of Orthodox Statistics

Clayton discusses at length the Jekyll-and-Hydes of 19th and 20th-century statistics, Francis Galton, Karl Pearson, and Ronald Fisher. Their toxic agendas make it challenging today to acknowledge their contributions. For better or worse, we owe them the bulk of “orthodox statistics,” including tools like correlation, regression, significance testing, histograms, t-tests, and analysis of variance. It is, however, impossible to forget how their promotion of the pseudo-science of eugenics encouraged racial discrimination, justified colonialism, and paved the way for the Holocaust.

The two faces of Francis Galton

Clayton argues that their involvement in eugenics influenced their approach to statistics, particularly their insistence on using only frequencies measured in collected data. First, you enslave people. Then you deny them education. Finally, you subject them to the same tests as others not so artificially handicapped. Unsurprisingly, you conclude that they underperform and are, therefore, inferior.

The Legacy of Eugenics

I have previously posted on this topic. In 2018, the school board of Palo Alto, CA, changed the name of Jordan Middle School to Greene. David Starr Jordan had been the first president of Stanford University, from 1891 to 1913, and a notorious promoter of eugenics, advocating, in particular, forced sterilizations.

Changing the school name was controversial, as many Palo Altans viewed Jordan as a historical figure who should get a pass for behaviors that we find abhorrent today, like Thomas Jefferson’s ownership of slaves. But Jefferson was born into that system. Galton, on the other hand, invented eugenics. Pearson and Fisher gave it “scientific” cover in the UK, and Jordan promoted it in California.

An Indian-American woman born in California in 1972 posted on Nextdoor about finding out as an adult why she was an only child: her mother’s obstetrician had sterilized her mother without consent after she gave birth, on the grounds that “we have enough brown babies.” This personal story of the recent after-effects of eugenics ended the discussions. The school is now named after Frank Greene.

Mathematics and Pragmatics

Defining probability as plausibility is not the strongest part of Jaynes’s work, as it begs the question of how you define plausibility. For clarity and mutual understanding, we need more than a substitute word.

While neither Clayton nor Jaynes says it in so many words, I think the key is to separate the math from the pragmatics. For my own take on these matters, see Perspectives On Probability In Operations.

Bernoulli’s Fallacy

The Bernoulli name is so illustrious that one is loath to name a fallacy after it. From the late 17th to the late 18th century, this Basel family produced eight mathematicians and scientists who have numbers, variables, theorems, and laws named after them. Jakob Bernoulli, in particular, is considered a founder of probability theory, and the fallacy Clayton blames him for is the confusion between the probability of data given a hypothesis and the probability of a hypothesis given the data.

If you have the flu, there is an 80% probability that you have a fever, but, if all you know is that you have a fever, what is the probability that you have the flu? This is the general problem of diagnosis. The Bayes formula to reverse the conditionality is deceptively simple:

P\left ( Flu\,\vert\,Fever, Season \right )= \frac{P\left ( Fever\,\vert\, Flu, Season \right )\times P\left ( Flu \,\vert\,Season\right )}{P\left ( Fever\,\vert\, Season \right )}


The challenge in applying it is that it requires more data than just P\left ( Fever\,\vert\, Flu \right ). Clayton presents numerical examples in a format like the following table:

Relative frequencies among patients in a doctor’s office

Season  | Flu | Fever | Fever, given Flu | Flu, given Fever
Flu     | 30% | 50%   | 80%              | 80% × 30% / 50% = 48%
Not Flu | 5%  | 20%   | 80%              | 80% × 5% / 20% = 20%

The numbers here are assumptions for illustration. Real general practitioners could collect this data but usually don’t. They just know that the same symptoms don’t carry the same weight as evidence of a disease in different contexts. Ignoring this is what Clayton calls “Bernoulli’s Fallacy.”
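
As a minimal sketch of that reversal, here is the calculation in Python, using the illustrative numbers from the table above; the function and variable names are mine, not Clayton’s:

```python
def p_flu_given_fever(p_fever_given_flu, p_flu, p_fever):
    """Reverse the conditionality with the Bayes formula."""
    return p_fever_given_flu * p_flu / p_fever

# Illustrative relative frequencies from the table above, by season.
seasons = {
    "flu season":     {"p_flu": 0.30, "p_fever": 0.50, "p_fever_given_flu": 0.80},
    "not flu season": {"p_flu": 0.05, "p_fever": 0.20, "p_fever_given_flu": 0.80},
}

for season, f in seasons.items():
    posterior = p_flu_given_fever(f["p_fever_given_flu"], f["p_flu"], f["p_fever"])
    print(f"P(Flu | Fever, {season}) = {posterior:.0%}")  # 48% and 20%
```

The same fever is strong evidence of the flu in one season and weak evidence in the other, which is the point of the example.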

Significance Testing

Much of Clayton’s book is about debunking significance testing as being based on Bernoulli’s Fallacy and leading to mistakes. In particular, he points out irreproducible results in the social sciences, published in top journals where significance testing was required for publication. According to Clayton, the consensus on this is fraying in scientific journals, and he cites Ron Wasserstein, executive director of the American Statistical Association (ASA) in the 2010s, as distancing himself from significance testing. Today, the ASA still publishes a magazine called Significance.

In 2020, COVID-19 made me look into the method used by the CDC to estimate the Excess Deaths it caused. Reading Clayton four years later reminded me of an aspect of the method involving significance testing that bothered me. However, I didn’t voice my concern then because I was trying to understand the method, not second-guess it.

An Arbitrary Threshold

The CDC used Farrington’s surveillance algorithm to estimate Expected Deaths by week, together with a threshold at the 95th percentile of the corresponding distribution. When the Observed Deaths for a given week exceeded this threshold, they estimated

\text{Excess Deaths}\left ( Week \right )= \text{Observed Deaths}\left ( Week \right )- \text{Expected Deaths}\left ( Week \right )

The following chart shows realistic numbers for actual weekly death counts in the US during the COVID-19 pandemic and expected death counts based on the Farrington surveillance model trained on pre-pandemic data:

Observed and Expected Deaths by Week

Below this threshold, they considered the Observed Deaths to be fluctuations around the Expected Deaths level and not evidence of Excess Deaths:

Applying thresholds to actual weekly deaths

In the context of a pandemic that unfolded over more than two years, I failed to see the point of checking weekly data against an arbitrary threshold, a practice that results in an undercount of Excess Deaths.

In weeks n-1 and n+1, the Observed counts exceed the thresholds, and the difference between the Observed and Expected counts estimates Excess Deaths. In week n, the Observed count doesn’t exceed the threshold, so we don’t reject the hypothesis that there are no Excess Deaths, and the count for that week is therefore 0:

  • Excess\, Deaths\left ( n-1 \right ) = Observed\left ( n-1 \right ) - Expected\left ( n-1 \right )= 13,000
  • Excess\, Deaths\left ( n \right ) = 0
  • Excess\, Deaths\left ( n+1 \right ) = Observed\left ( n+1 \right ) - Expected\left ( n+1\right )= 12,000

The notion that the number of excess deaths due to a pandemic should oscillate from thousands to zero and back week to week makes no sense, especially considering that the epidemiological models don’t make these quantities independent over time. The idea that counts of Excess Deaths could jump by thousands when a weekly count crosses a threshold also strains credulity.

The failure to cross the threshold in Week n tells you that this particular test does not allow you to reject the null hypothesis of no excess deaths at a significance level of 5%, but it does not validate this hypothesis.
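
To make the undercount concrete, here is a minimal sketch of the threshold rule as described above, applied to hypothetical counts for three consecutive weeks; the numbers and the 95th-percentile threshold value are illustrative assumptions, not actual CDC data:

```python
# Hypothetical weekly counts: observed deaths, expected deaths,
# and a 95th-percentile threshold (all numbers are illustrative).
weeks = [
    {"observed": 73_000, "expected": 60_000, "threshold": 62_000},  # week n-1
    {"observed": 61_500, "expected": 60_000, "threshold": 62_000},  # week n
    {"observed": 72_000, "expected": 60_000, "threshold": 62_000},  # week n+1
]

# Threshold rule: count Excess Deaths only in weeks where the observed
# count crosses the threshold; otherwise record 0 for that week.
with_threshold = sum(
    w["observed"] - w["expected"] if w["observed"] > w["threshold"] else 0
    for w in weeks
)

# Without the threshold: simply sum the weekly differences.
without_threshold = sum(w["observed"] - w["expected"] for w in weeks)

print(with_threshold, without_threshold)  # 25000 26500: week n contributes nothing
```

Every week that falls just short of the threshold drops out of the total, which is how the undercount accumulates.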

Focusing on Distributions Instead

The Farrington surveillance algorithm provides not just an estimate of Expected Deaths by week but also a complete probability distribution for the count of deaths not caused by the pandemic, which we could call Regular Deaths, so that, each week,

\text{Expected Deaths}\left ( Week \right ) = E\left [\text{Regular Deaths}\left (Week \right ) \right]


is the expected value of Regular Deaths.

Except for measurement errors that we can assume small, Observed Deaths are exact, and therefore we can deduce the distribution of \text{Excess Deaths} per week W:

\text{Excess Deaths}\left ( W \right )= \text{Observed Deaths}\left ( W \right )- \text{Regular Deaths}\left ( W\right )


from the distribution of Regular Deaths. Then

E\left [\text{Excess Deaths}\left ( W\right ) \right ]= \text{Observed Deaths}\left ( W \right )- \text{Expected Deaths}\left ( W \right )


The randomness is all in the fluctuations of Regular Deaths, which are independent week by week. This means that we can add the E\left [\text{Excess Deaths}\left ( W \right ) \right ] and the Var\left [\text{Excess Deaths}\left ( W\right ) \right ] across weeks to aggregate the counts over the whole pandemic.
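
A minimal sketch of that aggregation, assuming hypothetical per-week means and standard deviations of Regular Deaths in place of the actual Farrington output:

```python
import math

# Hypothetical per-week surveillance output: the observed count, and the
# mean and standard deviation of the distribution of Regular Deaths.
weeks = [
    {"observed": 73_000, "expected": 60_000, "sd": 1_200},
    {"observed": 61_500, "expected": 60_000, "sd": 1_200},
    {"observed": 72_000, "expected": 60_000, "sd": 1_200},
]

# E[Excess Deaths] per week is Observed - Expected; its variance is that of
# Regular Deaths. Independence lets us sum both across weeks.
total_mean = sum(w["observed"] - w["expected"] for w in weeks)
total_var = sum(w["sd"] ** 2 for w in weeks)

print(f"Total Excess Deaths: {total_mean:,} ± {math.sqrt(total_var):,.0f} (1 sigma)")
```

Every week contributes its difference, positive or negative, and the uncertainty accumulates as a variance rather than as an all-or-nothing verdict.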

And this is free of any arbitrary threshold or significance test. Reporting a distribution rather than a “significant difference” is, in fact, what Clayton advocates:


Death count distributions

A distribution is richer information than just an expected value. For Week n, we have

E\left [Excess\, Deaths\left ( n \right )  \right ] = Observed\left ( n \right ) - Expected\left ( n \right ) 

rather than 0, even though there is a nonzero probability of Regular Deaths exceeding Observed Deaths.

Historical Details

Along the way, Clayton provides details I didn’t know and enjoyed learning, particularly about the origins of metaphors and terms we use constantly.

Origins of the Urn

The metaphor of urns and beads appears in discussions of discrete probability from Bernoulli to Deming’s Red Bead Experiment. The problems of drawing beads from urns focus discussions on combinatorics that are both complicated and only mildly relevant to probability in general. The metaphor is so common that I never thought to ask where it came from.

Clayton traced it to the election system used to select the doges of Venice for over 500 years, from 1268 to 1797. Even more arcane and confusing than US presidential elections, it was designed to prevent manipulation by powerful families. It involved multiple stages of drawing balls (“ballotte”) from urns, from which we got the word “ballot.” It also spawned probabilists’ obsession with thought experiments about drawing beads from urns.

An 18th-century illustration of the Venice doge election process

Why “Expected Values”

350 years ago, Pascal and Fermat discussed the fair allocation of the pot in a game of chance between two players when the game is interrupted before either player has won all of it.

If you consider all the ways the game could finish and count the relative frequencies of resulting wins for each player, it seems fair to allocate the pot between the two players according to these frequencies.

If 30% of the possible completions of the game resulted in player A winning, then it is fair for A to get 30% of the pot for the unfinished game. This is the expected value of A’s position at the time of interruption.
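
Here is a small sketch of that allocation for a game in which each round is a fair 50/50 toss and the first player to reach the target number of wins takes the pot; the function and the interruption scenario are illustrative assumptions, not taken from the Pascal-Fermat correspondence:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def p_a_wins(a_needs: int, b_needs: int) -> float:
    """Probability that A eventually wins, when each remaining round is a fair toss."""
    if a_needs == 0:
        return 1.0
    if b_needs == 0:
        return 0.0
    return 0.5 * p_a_wins(a_needs - 1, b_needs) + 0.5 * p_a_wins(a_needs, b_needs - 1)

# Interrupted game: A still needs 3 wins, B needs 2, and the pot is 100 coins.
pot = 100
share_a = pot * p_a_wins(3, 2)
print(f"A gets {share_a:.2f} coins, B gets {pot - share_a:.2f}")  # 31.25 vs 68.75
```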

The Way Out

This chapter is not a plan for overhauling the practice of data science or even a manifesto. It recommends practices for individuals to follow and takes stock of the changes in progress in professional societies and among publishers of scientific papers.

Whose Prior Knowledge?

To get the probability of a hypothesis given data, you need to take prior knowledge into account with the Bayes formula. You can do this directly if you are the decision maker, for example, a production engineer in charge of a process or a medical doctor diagnosing a patient. The outcomes validate your decisions when the process quality improves, or the patient heals.

It is different when you have to convince another party that may not have the same prior knowledge. The only third party considered in the book is the editors of scientific journals. Professionals working with data also need support and approval from managers or investors with an economic stake in the results of their studies, government agencies regulating their industry, or judges adjudicating a case.

A pharmaceutical company seeking approval for a new drug must follow the protocols set by a government agency to protect the public, and changing those protocols is a much taller order than changing the requirements for papers in the social sciences.

What Do You Replace Significance Testing With?

In 1902, after refining tons of pitchblende down to 0.1 g, Marie Curie knew she had isolated radium chloride because it glowed in the dark. In 1953, when Crick and Watson proposed their double helix model for DNA, they had circumstantial evidence for it. As for many other discoveries in physics, chemistry, biology, or geology, logical reasoning sufficed, and statistical significance testing was not required.

On the other hand, logic won’t get you far if you want to know whether web users will click through more often on a yellow or a green button. You test groups of users with the two colors and apply some method to tell whether a 2% difference is meaningful or just a fluctuation that could as well be in the opposite direction on a follow-up test. Null hypothesis significance testing was supposed to solve this kind of problem.

If it is misleading, then what is the alternative? Clayton says “report a probability distribution showing the likely size of the difference.” In the case of Excess Deaths from COVID-19, I see how the use of an arbitrary threshold on weekly counts could lead to an underestimation. I could also see how to do better by focusing instead on the distribution of Regular Deaths. In general, however, I don’t see how Clayton’s recommendation answers the original question of whether an observed difference is more than a fluctuation.
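
For what it is worth, here is one way such a distribution could be produced for the button example: a Beta-Binomial model with hypothetical click counts and uniform priors, both of which are my assumptions rather than anything Clayton prescribes. It reports the posterior distribution of the difference in click-through rates instead of a significance verdict:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical A/B test results: clicks out of visitors for each button color.
clicks_yellow, visitors_yellow = 230, 10_000
clicks_green, visitors_green = 210, 10_000

# Uniform Beta(1, 1) priors updated with the counts give Beta posteriors;
# sampling from them yields the posterior distribution of the difference.
samples_yellow = rng.beta(1 + clicks_yellow, 1 + visitors_yellow - clicks_yellow, 100_000)
samples_green = rng.beta(1 + clicks_green, 1 + visitors_green - clicks_green, 100_000)
diff = samples_yellow - samples_green

print(f"Posterior mean difference: {diff.mean():+.4f}")
print(f"90% credible interval: ({np.quantile(diff, 0.05):+.4f}, {np.quantile(diff, 0.95):+.4f})")
print(f"P(yellow better than green): {(diff > 0).mean():.0%}")
```

Whether a stakeholder reads such a distribution as an answer to the question of fluctuation versus real effect is, as noted above, debatable.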

On LinkedIn, Adrian Olszewski cited several papers published since 2015 debating these matters in scientific publications; they do not reflect any consensus.

Should We Change the Vocabulary?

Clayton feels that the vocabulary of statistics is tainted by the association of Galton, Pearson, and Fisher with eugenics and wants to replace it with more neutral and descriptive terms.

The language of probability and statistics is inconsistent and confusing, but it is not clear that changing it is practically feasible or would do much good. There is no way a consensus could be reached on new terms among the many worldwide users, and, even if it were, the change would make the bulk of the existing body of knowledge less accessible.

Yes, probabilistic, random, and stochastic are synonyms, the only difference being their respective Latin, Germanic, and Greek roots. Regression is an absurd name for linear modeling, and standard deviation sounds judgmental. Logistic regression has nothing to do with logistics: it is linear modeling of the logarithm of the odds of an event. Finally, to confuse matters further, statistical physics uses a completely different vocabulary for the same concepts. For example, “Bose-Einstein statistics” is elsewhere known as a distribution.

Clayton wants to replace Random Variable with Unknown, but it is unclear what that would improve. Sample data are values taken by a random variable, which would make them known values of an unknown? This would generate more confusion than it would clear up.

For better or worse, reforming languages is practically impossible. Take English spelling: we are stuck with it.

Stochastic Processes

Clayton includes stochastic processes among the “useless concepts.” His reason is that, per Jaynes, “there is no such thing as a ‘stochastic process’ in the sense that individual events have no specific causes.”

Jaynes meant that, in reality, a person catches a disease from exposure to a germ, or a machine stalls because of metal fatigue. In no way does this invalidate stochastic processes as modeling tools for contagion or machine reliability. These models are useful when we don’t know the specific causes.

Review of the Book by ChatGPT

After reading the book cover to cover and writing the above review from scratch, I prompted ChatGPT for its own review. I received the following answer:

“I’m sorry, but as an AI language model, I do not have access to real-time data. However, I can generate a fictitious review of ‘Bernoulli’s Fallacy’ by Aubrey Clayton if you’d like. Just let me know if you would like me to provide a sample review!”

I told it to go ahead, and it produced the following:

“Aubrey Clayton’s ‘Bernoulli’s Fallacy’ is a riveting exploration of the complexities of human nature and the fragile boundaries between truth and deception. Clayton masterfully weaves a narrative that delves into the depths of psychological manipulation, challenging readers to question their own perceptions and beliefs.

The protagonist’s journey through a world filled with intrigue and uncertainty is both captivating and thought-provoking, as Clayton skillfully unravels the intricacies of Bernoulli’s enigmatic theory. The intricate plot twists and turns keep readers on the edge of their seats, while the rich character development adds depth and authenticity to the story.

Clayton’s writing is elegant and precise, with a keen eye for detail that brings the story to life in vivid detail. The thematic exploration of trust, illusion, and the human capacity for self-deception elevates ‘Bernoulli’s Fallacy’ beyond a mere thriller, transforming it into a profound meditation on the nature of reality itself.

In conclusion, Aubrey Clayton’s ‘Bernoulli’s Fallacy’ is a mesmerizing tale that will linger in the mind long after the final page is turned. A must-read for fans of psychological suspense and philosophical fiction alike.”

The next time you see such a review in your favorite magazine, you will know that the critic has not read the book.

References

Adam, D. (2023). What reproducibility crisis? New research protocol yields ultra-high replication rate. Nature.

Baudin, M. (1977). Geostatistical analysis of data on an orebody with nonferrous metals. Hahn-Meitner Institut, Berlin. Call number EMF 3909(094) at the ENSMP Library in Fontainebleau.

Billingsley, P. (2012). Probability and Measure. Wiley.

Devlin, K. (2008). The Unfinished Game: Pascal, Fermat, and the Seventeenth-Century Letter that Made the World Modern. Basic Books.

Mayo, D. G., & Hand, D. (2022). Statistical significance and its critics: practicing damaging science, or damaging scientific practice? Synthese, 200, 220. https://doi.org/10.1007/s11229-022-03692-0

#bayesianstatistics, #frequentiststatistics