There Is More To Data Than Just Numbers

Don Wheeler’s Understanding Variation starts with a chapter entitled “Data are random and miscellaneous” that contains no discussion of any part of its title. Implicit in Wheeler’s book, however, is the view that data consists of tables of numbers, representing either measured variables — lengths, weights, densities,… — or event occurrence counts — defective units, defects, machine failures,…

Many times, I have quoted computer scientist Don Knuth on this subject, saying that data is “the stuff that’s input or output,” meaning anything that can be read or written, and it includes much more than tables of numbers. The data we work with today includes, for example, the following:

  • Unstructured text, like 25,000 incident reports written by maintenance techs all over the world in their versions of English about problems with jet engines, or thousands of product reviews posted by consumers on e-commerce sites.
  • Images, like photographs of visual defects on products, or electron-microscope images of integrated circuits.
  • Video recordings of operations.

Analyzing data about a manufacturing process today means extracting information from all sources. The state of the art, based on automatic data acquisition and databases, includes analytical techniques that were unthinkable in Shewhart’s day, known under the labels of data science, data mining or machine learning.

Unstructured Text

Business data in the form of unstructured text is common because, whatever IT is used, data collection is focused on administration rather than technical issues. There are boxes or fields to formally capture when an event happened, what it affected, who intervened or when it was resolved, all of which is input to timesheets, but the technical observations of the responders, their diagnosis and the solutions they implemented are documented in their own words in comment blocks.

The most plausible reason for this is that administration is the same across many activities — making it possible to sell many software licenses — while technical content is specific to each. The same maintenance work order forms may be usable for injection molding machines and port cranes, but the technical content of the work will be different.

What technicians put into these blocks of text are descriptions of what happened and how they responded. The challenge is to get at it even though each technician may sequence the information differently. It may all be different ways of saying “the pump leaked and I replaced the gasket.”

Product reviews are also in unstructured text because you don’t want to constrain customers giving feedback. To analyze these, you first need to filter out the fake reviews that have proliferated on the web. Then, in the reviews you have authenticated, you need to infer the emotions of the authors. This is called sentiment analysis.

The tools for this kind of analysis are generically known as text mining. The basic technique is feature extraction — that is, searching text for keywords or phrases and extracting the corresponding values. Then the text is turned into a list of name-value pairs that can be tabulated for number crunching. Beyond feature extraction is concept mining, which is the identification of topics covered in the texts that are not in a predetermined list. It’s about finding out what you didn’t know was an issue. It works by identifying words or phrases that appear more frequently in the incident reports about, say, a particular model of injection-molding machines than in the reports on all models in the company’s plants.
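As a rough illustration of the two techniques just described, here is a minimal Python sketch using only the standard library. The keyword patterns, the function names, the thresholds and the sample reports are all hypothetical, made up for this example; real text-mining tools are far more elaborate. The first function turns a report into name-value pairs; the second flags words that are more frequent in a subset of reports than in the full collection, and it assumes the subset is included in that collection.

```python
import re
from collections import Counter

# Hypothetical keyword patterns; a real list would come from domain experts.
FEATURE_PATTERNS = {
    "component": r"\b(pump|gasket|valve|bearing)\b",
    "symptom": r"\b(leak(?:ed|ing)?|crack(?:ed)?|overheat(?:ed|ing)?)\b",
    "action": r"\b(replaced|repaired|cleaned|tightened)\b",
}


def extract_features(report):
    """Feature extraction: turn one free-text report into name-value pairs."""
    text = report.lower()
    features = {}
    for name, pattern in FEATURE_PATTERNS.items():
        # Collect every distinct match, regardless of word order in the report.
        matches = sorted(set(re.findall(pattern, text)))
        if matches:
            features[name] = matches
    return features


def word_counts(reports):
    """Count lowercase alphabetic words over a list of reports."""
    counts = Counter()
    for report in reports:
        counts.update(re.findall(r"[a-z]+", report.lower()))
    return counts


def overrepresented_words(subset_reports, all_reports, min_count=2, min_ratio=2.0):
    """Crude concept mining: words relatively more frequent in the subset.

    Assumes subset_reports is part of all_reports (illustrative thresholds).
    """
    sub_counts = word_counts(subset_reports)
    all_counts = word_counts(all_reports)
    sub_total = sum(sub_counts.values())
    all_total = sum(all_counts.values())
    flagged = {}
    for word, count in sub_counts.items():
        if count < min_count or all_counts[word] == 0:
            continue
        ratio = (count / sub_total) / (all_counts[word] / all_total)
        if ratio >= min_ratio:
            flagged[word] = ratio
    return flagged


if __name__ == "__main__":
    print(extract_features("The pump leaked and I replaced the gasket."))
    one_model = ["nozzle clogged again", "cleaned clogged nozzle"]
    all_models = one_model + ["pump leaked", "replaced gasket on pump",
                              "belt worn", "replaced belt"]
    print(overrepresented_words(one_model, all_models))
```

Two reports that sequence the same information differently, such as “the pump leaked and I replaced the gasket” and “gasket replaced after the pump leaked,” yield the same pairs, which is what makes them tabulable.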

An international organization that produces unstructured text can either require employees to write it all in the same language or let them use their own. Requiring technicians whose native tongues may be Thai, Mandarin or Arabic to document what they do in English may restrict the content of their reports and make them difficult to understand; letting them use their own languages requires translation. Translation is a more attractive option today than it was even 10 or 15 years ago, as machine translation software has improved, particularly on specialized topics. Facebook’s general translator can give you an idea of what a post in a different language is about, but usually not what it is saying. Machine translations can be much more accurate and readable if focused on, say, defects observed in a category of products or failures of particular machine types.

Images

Photographs have been used as data for a long time by spies, in criminal investigations and in the documentation of traffic accidents for insurance. In 1962, aerial photographs were the data that proved the presence of Soviet missiles in Cuba. Pictorial data in the form of first drawings and then photographs have been central in sciences like geology or zoology.

Until pictures were digitized, their analysis was entirely manual, by human experts. Many techniques developed over the past 40 years to make decisions based on pictures involve transforming them to enhance features of interest and measuring these features. They enable you, for example, to detect blemishes in vegetables or scratches on a painted surface. These techniques reduce the pictures to numbers, as feature extraction does to unstructured text.
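To make the idea of reducing a picture to numbers concrete, here is a deliberately simple Python sketch. It assumes a grayscale image given as rows of values from 0 (black) to 255 (white) and treats any pixel darker than a threshold as part of a feature of interest. The function name, the threshold and the tiny sample image are hypothetical; real inspection software applies far more sophisticated transformations before measuring.

```python
def measure_dark_features(image, threshold=80):
    """Reduce a grayscale image (list of rows, values 0-255) to a few numbers.

    Returns the count of pixels darker than `threshold` and their bounding
    box as (top_row, left_col, bottom_row, right_col), or None if none found.
    """
    dark = [
        (r, c)
        for r, row in enumerate(image)
        for c, value in enumerate(row)
        if value < threshold
    ]
    if not dark:
        return {"dark_pixels": 0, "bounding_box": None}
    rows = [r for r, _ in dark]
    cols = [c for _, c in dark]
    return {
        "dark_pixels": len(dark),
        "bounding_box": (min(rows), min(cols), max(rows), max(cols)),
    }


if __name__ == "__main__":
    # A bright 4x5 surface with a short dark diagonal mark (made-up data).
    surface = [
        [200, 200, 200, 200, 200],
        [200, 30, 30, 200, 200],
        [200, 200, 200, 30, 200],
        [200, 200, 200, 200, 200],
    ]
    print(measure_dark_features(surface))
```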

Bullet holes in returning World War II bombers

Other techniques retain and enhance pictorial details like the locations of features of interest, which can be accumulated over many similar pictures. A famous example, cited about sample bias in an earlier post, is the picture of bullet holes in World War II bombers returning to England. Likewise, if you superimpose the pictures of many units of the same painted surface, you may find that scratches all occur in the same area of the bottom left-hand corner, which helps you identify the cause. This is also a reduction — many photographic details are filtered out — but to a map rather than a set of numbers, and this map can evolve over time.
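Superimposing many pictures can be sketched as accumulating defect locations into a grid. The Python example below assumes that scratch positions have already been extracted from each picture as (x, y) coordinates, with the origin at the top-left corner and y increasing downward. The function name, the cell size and the sample coordinates are hypothetical.

```python
from collections import Counter


def accumulate_defect_map(defects_per_unit, cell_size=10):
    """Count defects per grid cell over many units of the same surface.

    defects_per_unit: one list of (x, y) defect positions per unit inspected.
    Returns a Counter mapping (column, row) grid cells to defect counts.
    """
    defect_map = Counter()
    for defects in defects_per_unit:
        for x, y in defects:
            cell = (int(x // cell_size), int(y // cell_size))
            defect_map[cell] += 1
    return defect_map


if __name__ == "__main__":
    # Made-up scratch positions on three units of a 100 x 100 surface.
    units = [
        [(3, 95), (50, 50)],
        [(7, 92)],
        [(4, 98), (80, 10)],
    ]
    defect_map = accumulate_defect_map(units)
    # The most frequent cell points to where scratches concentrate.
    print(defect_map.most_common(1))
```

Because the map is just a set of counts per cell, it can be updated as new units are inspected, which is how it evolves over time.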

Scanned images of text are a special case of picture data. If the text is printed, you can turn it into unstructured text by Optical Character Recognition (OCR). If, however, your scans are of paper forms with handwritten comments from technicians or operators, OCR won’t help. For print, today’s OCR software is usable, if not 100% accurate. Writer-independent recognition of handwriting, on the other hand, is still in the future, and companies that collect handwritten forms still pay people to manually transcribe them into electronic text.

Videos

As explained in an 8-part series of articles, using videos to improve operations is an art that is still in its infancy, 100 years after Frank Gilbreth first demonstrated the power of the approach with films, in spite of the low cost and universal accessibility of video cameras and players.

Technically, recording a video is not a challenge, but analyzing data in video format still is. The available software makes it easy for human analysts to move back and forth inside a video, slow it down or speed it up, and break it down into steps to name, categorize and comment on. It’s a process that is supported by software but where all the decisions are made by human users. We don’t have any video analysis tool to automatically split a video into steps and gather information about what happens at each step. Such a tool might exist 10 years from now, but it doesn’t today.


Stabilizing high-technology processes is a challenge that deserves the most powerful data analytics you can get. One reason the traditional tools of statistical quality play, at best, a modest role in it is that they are based on an excessively narrow perception of what data is. It’s not just numbers in tables; it’s anything that can be read or written. Some of the data captured in other forms can be reduced to tables of numbers, but not all of it can, and useful, actionable information can be retrieved directly from it.

#datascience, #informationtechnology, #textmining, #SPC, #SixSigma