Apr 10 2023
Since the turn of the 21st century, the possession of data has become a greater source of economic power than ever before, dwarfing, in particular, the power that comes from analyzing it. It’s data about us humans, not about the physical world: our demographic segment, where we live, what we buy, our opinions, and our relationships.
This is a development that manufacturing companies have not been leading. To survive this disruption, their managers need to think more deeply about data, how they can collect it, and what they can do with it to stay competitive.
- The Business of Data
- The World’s Knowledge in Four Drawers?
- Data Science
- Misuses of Data Science
- Social Sciences Need More Data, Not Less
- Further reading
The Business of Data
Alphabet/Google and Meta/Facebook are now two of the top ten companies in the world by market capitalization, and their main asset is data. Ads account for 80% and 100% of their respective revenues. These ads target users who volunteer data about themselves in exchange for free services centered on access to other data.
By comparison, the leader in analytics software, SAS Institute, is privately held and had $3.2B in revenue in 2021, about 1/20 of Google Ads. This is counterintuitive, as we might expect the ability to make sense of data to be valued more highly than its mere possession.
As reflected in the functions offered by Business Intelligence (BI) packages, data analysis for most business people is limited to producing pie charts, stacked bar charts, and the occasional time series plot. They don’t even use scatterplots, even though American middle schools teach them in 7th grade. As a result, the market for sophisticated tools is small.
In addition, individuals and small teams still invent new ways of analyzing data, as it does not require the server farms with petabytes of data that only large companies can maintain. Netflix now has 231 million subscribers and the history of their viewing choices, based on which it offers custom recommendations. In 2006, Netflix had 6.3 million subscribers and organized a contest to improve its recommendation system. A 7-person, 4-country team called BellKor’s Pragmatic Chaos won in 2009. The winners of the Makridakis forecasting competitions are likewise often individuals or small teams.
OpenAI’s ChatGPT and Dall-E are attracting attention in 2023. Both create new materials from user prompts, text and pictures respectively, but this is only possible thanks to the massive datasets used to train them. What enables ChatGPT to have a conversation with you is its Large Language Model (LLM), which OpenAI built from a corpus of text containing trillions of words.
The World’s Knowledge in Four Drawers?
In the April 3, 2023 issue of The New Yorker, Jill Lepore published an article called The Data Delusion in the electronic version and Data-Driven in the print version. While she makes valid points about data misuse, her framework for the issues does not strike me as helpful, and I think a critique of it is a good starting point. Lepore organizes the world’s knowledge into four drawers, top to bottom:
- Mysteries: what “only God knows.”
- Facts: “things humans can prove.”
- Numbers: measurements and counts, collected since the eighteenth century.
- Data: what “must be extracted by a computer.”
Later in the article, Lepore does not follow her own framework. For example, she writes about “converting facts into data to be read by machines” in the 1930s. If facts are “things humans can prove,” what does it mean to convert them into data? If data “must be extracted by a computer,” how could you do this in the 1930s, before computers were invented? Let’s look at each “drawer” in more detail, bottom to top.
The word “data” has been in use with a consistent meaning since 1640, roughly 310 years before the computer was invented. At the turn of the 20th century, about 50 years before computers, the Wright brothers famously collected lift data in notebooks.
Computers are great at crunching data, but you don’t need a computer to explain what data is. In the context of computers, as Don Knuth put it, data is “the stuff that’s input or output.” In a more general context, it’s the stuff that’s read or written.
While data can be many things, it is not knowledge. Data can be wrong. There is nothing in the concept of data that implies truth. That’s why data scientists spend most of their time cleaning and validating data. What Lepore calls “numbers” is not a separate category. It is data that has been collected since the eighteenth century and happens to be in the form of measurements or counts.
Numbers are a special case of data, but data is not limited to numbers. Data can be text, pictures, sound and video recordings, scent, texture,…, and anything else that can be read or written. That computers translate it into 1s and 0s is a technical detail that is relevant only to microcode programmers.
Facts are not files. They are true statements, and they are data, but you can’t jump from raw observations to facts without going through information, which is what you learn from reading data. It is a fact that Barcelona beat Madrid at soccer on March 20, 2022. That the score was 4 to 0 is raw data. If you didn’t already know that Barcelona won, reading the score gives you that information. As it is true, it is also a fact.
Scientists understand that observation, detection, and experiment cannot prove an assertion but only fail to refute it. The plausibility of the assertion rises as failures to refute it accumulate but it never becomes a certainty. Accepting an assertion as fact is deciding to neglect the probability that it is false. What we think of as knowledge comprises the assertions we accept as facts and logical inferences from these assertions.
This leads to a fourth concept that is missing from Lepore’s “drawers.” As you accumulate knowledge, you become more resourceful in responding to familiar and unexpected situations. You can call this resourcefulness Wisdom.
If, as Lepore says, mysteries are what “only God knows,” then no human does, and it’s not knowledge.
Data Science
Lepore’s article discusses data science without actually explaining what it is or how it differs from statistics. The answer is not obvious from looking up the terms, but there are at least two sources worth looking into:
- The course syllabi of reputable schools.
- The requirements from employers in job offers.
The two may or may not match. For data science, they mostly overlap. The picture that emerges is that statistics is subsumed in data science. Statistics starts with a dataset and ends with conclusions about it. Data science starts with the acquisition or retrieval of data and ends with the delivery of results.
Data scientists do more than statisticians:
1. They collect data: sensor readings, event occurrences, documents, video recordings,…
2. They organize, store, and retrieve data from databases, data warehouses, data lakes, or web scrapings.
3. They build and validate datasets.
4. They analyze datasets and draw conclusions.
5. They package the conclusions into data products, including visual displays and explanations.
6. They present the results in different forms as needed to decision-makers.
The whole field of Statistics is in Step 4. The data scientist doesn’t stop with the “Aha!” moment of discovering a pattern. “Data science” is not a particularly apt name for this range of activities because it is too broad. One could argue, for example, that all science is data science, but current usage is specific.
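As a miniature illustration of this range of activities, here is a hypothetical Python sketch that walks through the steps on simulated sensor readings. The readings and the validation thresholds are made up for the example:

```python
import statistics

# Step 1: collect -- here, simulated sensor readings (hypothetical data)
readings = [20.1, 19.8, None, 20.4, 87.0, 20.0, 19.9]  # None = failed read

# Steps 2-3: retrieve and build a validated dataset: drop missing
# values and readings outside a plausible range for this sensor
valid = [r for r in readings if r is not None and 0.0 < r < 50.0]

# Step 4: analyze -- the statistician's territory
mean = statistics.mean(valid)
spread = statistics.stdev(valid)

# Steps 5-6: package and present the conclusion for decision-makers
print(f"{len(valid)} valid readings out of {len(readings)}: "
      f"mean {mean:.2f}, std dev {spread:.2f}")
```

Most of the code deals with collecting and validating the data; the statistical analysis proper is two lines, which mirrors how the effort divides in practice.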
Misuses of Data Science
Except in physics, statistics is always about deriving characteristics of populations from observations on individuals, as in aggregating buyers’ behavior in a supermarket into sales of beer and diapers. In statistical physics, you go in the opposite direction, as in inferring how individual gas molecules behave from pressure and temperature measured over large populations.
A point that Lepore misses is that data science gets into trouble when using population characteristics to predict individual behaviors. You can do this for a gas because it is composed of identical molecules. Recognizing or classifying an inanimate object like a handwritten “8” is not a problem; selectively breeding cows to increase milk production is not a problem; making decisions about humans based on characteristics of groups they were born into, on the other hand, is the very definition of discrimination, and it is a problem.
Sins of the past
From Francis Galton and Karl Pearson to Ronald Fisher, the founders of classical statistics in the late 19th and early 20th century had a dark side, where they wrapped their prejudices in the pseudoscience of eugenics. Their analyses purported to show that some human races were innately superior to others based on data that were biased by prior discrimination. Deny education and economic opportunities to a group, steal their property, and they will underperform, based on which you justify continuing discrimination.
The March-April 2016 issue of Harvard Magazine recounts Harvard’s Eugenics Era, including a cartoon that shows the descendants of immigrants using eugenics to keep new immigrants out. In 2018, two early advocates of eugenics, David Starr Jordan, the first president of Stanford University, and Lewis Terman, a professor at that same university, had their names stripped off the two middle schools of Palo Alto, CA. This was the subject of a heated debate on NextDoor until a woman came forward. Born in California in 1972 to an Indian mother, she explained that she has no siblings because a eugenics-addled obstetrician had sterilized her mother against her will on the grounds that “we have enough brown babies.” This ended the discussion. The Palo Alto middle schools now bear the names of Frank Greene and Ellen Fletcher.
Sins of the present
In the 20th century, discrimination was the fate of many ethnic or racial minorities, as well as women, in many countries that have since made it illegal. Today’s AI, however, perpetuates it when applied to job or loan applications, and the key reason is that it still attributes to people the characteristics of the intersection of groups they belong to, denying their individual character.
Gender, age group, or skin color do not determine human beings, and group averages do not predict their individual behavior or performance. Women may, on average, have less upper body strength than men, but Susan, here, can bench-press 200 lbs while Joe, there, can barely lift 70. Susan and Joe are individuals, not averages.
This is not to say that group membership is never relevant, particularly when it is the result of an individual’s efforts. You want a lawyer who is a member of the Bar and a surgeon who is board-certified. Even groups that an individual is born into may be relevant to effectiveness in some occupations, like policing. Beyond such special cases, we must judge each person based on individual characteristics that are not in any data.
We can apply data science to optical character recognition, the identification of active principles for drugs, the validation of bridge designs, tutorials to help human learners,…, and many other fields where it does not perpetuate discrimination. We don’t want to use the technology of the 21st century to repeat the monstrosities of the 20th.
Social Sciences Need More Data, Not Less
Later in the article, Lepore bemoans the supposed reliance of social sciences on data and advocates returning to “other ways of knowing.” Meaning what? Numbers are a special case of data, you derive facts from validated data, and mysteries are not knowledge at all.
A theory that can predict nothing can find followers if plausible and attractively presented, but it won’t solve any problem. The conventional wisdom of management today is laden with theories from social sciences that have, at best, flimsy bases in experiments or observations, such as the Hawthorne effect, the Dunning-Kruger effect, Kübler-Ross and her stages of grief, Maslow and his hierarchy of needs, the Myers-Briggs personality tests, Hofstede’s national culture dimensions, or IQ tests.
The Flaws of Surveys
A hundred years ago, Mary Parker Follett recommended that social scientists observe people. It is still good advice. What this has come to mean, however, is relying on surveys. As Emrah Mahir commented on “Hundreds of Studies Show…”: “In social sciences, the dataset almost always consists of subjectively rated answers to questions obtained from questionnaires.”
Voluntary surveys suffer from self-selection bias in ways that aren’t necessarily easy to assess. In addition, the effort put into making questions objective is never fully successful. On a scale of one to ten, two individuals with the same sentiment may assign ratings as different as 3 and 8. Comparisons and averages of such ratings are meaningless, but that doesn’t seem to stop peer-reviewed journals from publishing them. You can do better if you have limited objectives and are willing to put in the work, as seen in the examples below.
Analyzing E-Commerce Reviews
Assume an e-commerce company collects reviews from customers, and only one in ten customers turns one in. What about the remaining 90%? Do they not provide reviews because they are happy? Are unhappy customers more likely to respond? To test the hypothesis that unhappy customers are overrepresented among reviewers, you need to work out measurable consequences.
Happy customers tend to buy again. If it’s true that happy customers are unlikely to write reviews, there should be more repeat customers among the non-reviewers than among the reviewers. In e-commerce, you know who your customers are because you ship goods to them. You can therefore test the counts of repeat customers among non-reviewers and reviewers for a difference.
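One way to run such a test is a two-sided z-test for the difference between the repeat-purchase rates of reviewers and non-reviewers. This is a minimal Python sketch using only the standard library; the counts are hypothetical, and the function name is mine:

```python
from math import sqrt, erfc

def two_proportion_test(repeat1, n1, repeat2, n2):
    """Two-sided z-test for a difference between two proportions,
    e.g. repeat-purchase rates among reviewers vs. non-reviewers."""
    p1, p2 = repeat1 / n1, repeat2 / n2
    pooled = (repeat1 + repeat2) / (n1 + n2)        # rate under the null
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = erfc(abs(z) / sqrt(2))                # two-sided tail probability
    return p1, p2, p_value

# Hypothetical counts: 300 of 1,000 reviewers bought again,
# versus 4,500 of 9,000 non-reviewers
rate_rev, rate_non, p = two_proportion_test(300, 1000, 4500, 9000)
print(f"repeat rate: reviewers {rate_rev:.0%}, non-reviewers {rate_non:.0%}, "
      f"p = {p:.2g}")
```

With these made-up counts, the higher repeat rate among non-reviewers would be consistent with happy customers buying again rather than writing reviews; real data could of course come out either way.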
Analyzing Employee Morale
If you survey your own workforce about job satisfaction, you only collect subjective answers from self-selected respondents who may lie out of fear. On the other hand, some employee actions reveal their morale.
Among those, the most drastic is quitting. The employee turnover rate is an objective indicator of morale, but external factors like the economic environment affect it. Its value, relative to comparable companies in the same area, is more informative than its absolute value.
For example, a turnover rate of 10%, in absolute terms, is high but it’s not in relative terms if the average of local companies is 40%, as is common among the maquiladoras lining the US-Mexico border. Absenteeism is another barometer of morale, also more meaningful in relative than in absolute terms.
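The arithmetic of such a comparison is simple; a hypothetical Python sketch, with made-up peer rates averaging 40%:

```python
def relative_turnover(company_rate, peer_rates):
    """Compare a company's annual turnover rate with the average
    of comparable local companies (hypothetical helper)."""
    benchmark = sum(peer_rates) / len(peer_rates)
    return company_rate / benchmark

# 10% turnover looks high in isolation, but not next to a 40% local average
ratio = relative_turnover(0.10, [0.35, 0.45, 0.40])
print(f"turnover is {ratio:.0%} of the local benchmark")
```

The same ratio works for absenteeism or any other rate that is more meaningful relative to peers than in absolute terms.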
Mining Centuries of Tax Records
In Capital in the 21st Century, Thomas Piketty plows through the tax records of multiple countries to assess the evolution of wealth and income distribution over two centuries. It’s 1,000 pages, at the end of which you know at what age Canadians inherit, how much, and how this has changed over time. He makes cautious inferences from massive data, which you cannot say of most of his fellow economists.
Further reading
- Scientific Thinking and Manufacturing Improvement (2022)
- A Lifetime Of Systems Thinking | Russell Ackoff (2020)
- From Statistics to Data Science (2018)
- There is More to Data than just Numbers (2017)
- From Observation To Knowledge (2017)
- “Hundreds of Studies Show…” (2016)
- Piketty’s Capital in the 21st Century, or Inequality Through the Ages (2015)
- Where have the scatterplots gone? (2015)
- “Wisdom” and “Continuous Improvement” in the Toyota Way (2015)
- Data, information, knowledge, and lean (2012)
- The inside story of how ChatGPT was built from the people who made it (2023)
- Piketty, T. (2017) Capital in the Twenty-First Century. Cambridge, MA: Harvard University Press
- Gomez-Uribe, C. A. & Hunt, N. (2015) The Netflix Recommender System: Algorithms, Business Value, and Innovation
- Follett, M. P. (2013 reprint) Creative Experience. Martino Fine Books
- The BigChaos Solution to the Netflix Grand Prize (2009)