Manufacturing Data Cleaning: Thankless But Necessary

Whether you manage operations with paper and pencil as in 1920 or use the state of the art in information technology (IT), you need clean data. If you don’t have it, you will suffer all sorts of dysfunctions. You will order materials you already have or don’t need, and be surprised by shortages. You will make delivery promises you can’t keep, and ship wrong or defective products. And you will have no idea what works and what doesn’t in your plant.

I have never seen a factory with perfect data, and perhaps none exists. Dirty data is the norm, not the exception, and the reason most factories are able to ship anything at all is that their people find ways to work around the defects in their data, from using expediters to find parts that aren’t where the system thought they were, to engineers who work directly with production to make sure a technical change is implemented. Mei-chen Lo, of Kainan University in Taiwan, proposed a useful classification of the issues with data quality. What I would like to propose here is pointers on addressing them.

Data scrubbing
Data scrubbing

I call this “data cleaning.” Other terms in use include data cleansing and data scrubbing. The difference between cleaning and cleansing is depth. Cleaning is removing dirt from a surface; cleansing, from a volume. I don’t see the distinction as meaningful with data. Data scrubbing, on the other hand, conjures up the idea of vigorously brushing the data on your hands and knees, and I don’t see it as a particularly helpful metaphor either.

The company’s information system may include multiple, separately implemented computer systems for supply chain management, manufacturing execution control, quality management, equipment maintenance, human resources, accounting, etc., for departments that are not always motivated to share information. There are three main levels of issues with the data in these systems:

  1. Violations of the system’s internal logic. If you consider the system independently of the plants it supports, it looks like a game, with rules specifying what players can and cannot do, but the data in the systems’ databases is not always compatible with these rules. If this were poker, it would by like playing with an deck of cards with two aces of spades and no jack of clubs, and rating two pairs over three-of-a-kind.
  2. Information retrieval challenges. You need to extract and decode the information embedded in “smart” part numbers, which is a different challenge from simply reviewing database extracts. Some of the older systems also do not produce extracts in tab-delimited text files but instead report files intended for line printers, divided in pages with headers and footers, with columns separated by spaces.
  3. Inaccurate mapping to physical reality. Unlike the cards and hands in poker, the objects in the company’s information system are supposed to map to counterparts in the physical world: materials, components, machines, tools, process programs, inventory, etc.

Following are details and examples from my personal experience with these issues:

Violations Of The System’s Internal Logic

First, the collections of computer systems that more or less talk to each other, the spreadsheets various individuals have developed to remedy the shortcomings of these systems, and the manually maintained visual management signs and boards must be the expression of the same model of the plant, and this model must be logically consistent. The first step is to identify where it isn’t, and fix it.


Products, machines, processes, operations, people,… all have IDs in the system to which their characteristics and history are attached, and legacy systems can be expected to have records with wrong IDs, for a variety of causes.

A vexing problem that I have encountered multiple times is what happens to numeric IDs when you use Excel. Employee numbers, customer numbers, account numbers are identifiers that are usually entirely numeric. In some companies, even Product IDs are numeric, and managers routinely extract tables from the ERP system to analyze them in Excel. But, when Excel sees a string like “007,” by default it overzealously converts it to an integer — fit for adding or multiplying– rather than an ID, and stores it as the number 7 which no longer matches the text 007. And you can’t correct it manually after the fact, because 7, 07, and 007 may all be valid IDs for different agents.

On the other hand, you can regenerate your spreadsheet and force Excel to treat a numeric column as text. If you are importing a table into Excel from a delimited text file, it is done by specifying the “text” format for the the ID column when running the import wizard. As long as what is done in the spreadsheet is purely analytical, you can replace the data table and the calculations will not be affected, but the identifiers will then match those in the original system. The formatting step is not difficult, but easy to forget, and the problem will keep recurring as long as managers and engineers keep using Excel independently to wrangle company data.

In transaction processing, IDs should never be entered manually, but still often are, even though bar codes have been around for over 40 years. And random errors creep in through manual data entry. Most systems’ input validation features will detect IDs that match nothing but will accept IDs that match the wrong item. Such errors have consequences — like the apparent shortage of a component that has been duly delivered — that cause them to be fixed. If, however, you have more inventory than you need, it may take weeks or months before these consequences cause problems. When you analyze the data, these errors also appear as outliers, such as an item delivered once in 1,000-lb big bags when every other time, it is in cases of 50 units. Therefore, running queries and reports may bring to light problems that have yet to affect operations, but it may not catch everything.


Inconsistencies are cases where different names are used for the same object, or when parameters that should have the same values are recorded as different. Inconsistencies may be difficult to detect. The strangest case I have encountered was a system for configured products where each configuration had a name that was changed every three months. It was a feature coded into the system 25 years before. No one currently working at the plant knew its purpose, but no one knew how to turn it off. What tipped me off about it was looking at monthly production quantities for the most popular configurations, and noticing that some configurations were popular for three months and then vanished, to be replaced by other ones. This is not the way you would expect a market to behave, and it made me suspicious that it was actually the result of systematic name changes.

To confirm this hypothesis, I checked whether the bills of materials (BOMs) matched. In principle it is possible for  two distinct configurations to have the same BOM, with different assembly processes, or subject to different treatments. Printed circuit boards with the same components and made by the same process routinely end up as different products based on test performance. In this particular case, it was unlikely but, in any case, matching BOMs were not the final proof. When I brought up the issue with the department in charge of technical data, it turned out that they knew this was going on, but felt powerless to do anything about it.

The plant manager and his direct reports, on the other hand, had no idea and were being misled by reports that showed the number of different configurations built in a year as four times larger than it was. An immediate countermeasure, however, was possible. By maintaining a cross-reference table of configuration names, you could aggregate the production history of a configuration across its multiple names and produce accurate reports. It did not require waiting until the system was replaced.

More generally, flaws in the revision management system are a major cause of inconsistencies. Once a technical change is approved, it must be made effective so that, as of a given date or a given serial number, a product is made in a new way.From the point of view of materials handlers and production operators, the instructions are an output of the plant’s information system, regardless of whether they are electronic or hardcopy. If all the pick lists and work instructions are electronic, they are automatically updated; if in hardcopy, it depends on the diligence and thoroughness of the people who distribute and post them and mistakes happen.

Each purchased component must have a unique part number inside the company, even when bought from multiple suppliers. The origin of each part must of course be recorded for traceability, but it must be separate from its ID.  For their products, most companies still use so-called smart numbering systems, with information encoded in each ID, and no two companies use the same, as a result of which mergers produce part catalogs with different encoding methods. As long as each ID is unique, it is not an issue of consistency issue but of usability. It is necessary for analytical purposes to extract and decode the information embedded in the IDs, but this topic deserves a separate discussion below.

Logical Integrity Violations

This is data that makes no  sense, such as:

  • Products with no sales reported with positive revenue.
  • A supposedly unique ID used for multiple objects. This is the converse of the problem discussed above, of having multiple IDs for the same object.
  • The times for the steps in a sequence not adding up to the time for the whole sequence.
  • Some male employees are described as pregnant.

Systems that are professionally designed database applications are not supposed to let this happen, but many of the most frequently used functions in a company’s legacy systems are in Excel spreadsheets built by end users, and devoid of safeguards.

Bringing to light the existence of such errors in end-user spreadsheets is useful to make management realize that these spreadsheets are not a viable solution. The existence of these spreadsheets, however, is evidence of the functional shortcomings of the systems they feed off. As such, the spreadsheets deserve attention, but correcting all the errors they contain is hopeless.

The uniqueness of IDs is easy to check, by exporting their list to a relational database and attempting to use them as a primary key. If it fails, it means there are duplicates, which can then be identified through a “find duplicates” query.

To identify logical violations that involve multiple fields or multiple records, you first need to spell out the relationships that they should have, and then develop queries to isolate records that violate them. It’s a fishing expedition. You have to approach the task determined to find what is wrong, not convince yourself that everything is fine. Once you have identified logical integrity violations, there usually is no systematic way to clean them up: they have to be resolved one by one.

Missing Records And Missing Values

Missing records and missing field values are easy to identify but difficult to understand and fix. A field in a data table may be empty for many different reasons:

  • It may be irrelevant to a specific record. For example, the company’s products used to be all liquids, and viscosity was relevant to all. Then the company got into the business of steel balls, and all its steel-ball products have an empty viscosity field. This is best addressed by storing product characteristics in lists of property name and value pairs, which allows different products to have different lists of characteristics without missing values. This is converted into dimensional tables for analysis by cross-tabulating homogeneous subsets, using pivot tables in Excel, crosstab queries in Access, or casting in R.
  • It may mean that a default value applies. In product data sheets that tout selling points, a beautiful brushed metal enclosure may be mentioned among the characteristics, while a plain plastic one won’t be. In this case, the data is deliberately omitted. The omission has a meaning, and the missing data must be replaced by a statement of that meaning. If an empty “Enclosure” field means “nothing special,” then, for internal use, it must be filled out with the mention “Nothing special” to remove any ambiguity. Of course, data sheets shown to customers just omit the enclosure field if it is not a selling point.
  • It may be a mistake. The data should have been collected but wasn’t. The safest response is to just mark it as such. Some analysts will fill in a value by interpolating from neighboring points, or by making assumptions. While such practices may make it easier to manipulate the data, it is not clear at all that they clean the data. Instead, they may further muddy it. It is safer to replace the missing value with a symbol having no other meaning, like NA in R. If interpolations or assumptions are used, it should not be done in the database but during the analysis and explicitly stated.

Information Retrieval Challenges

Sometimes the data in the legacy systems is accurate, but difficult to retrieve for analytical purposes. The first example is the information embedded in “smart” part numbers, which needs to be parsed and decoded before you can use it to slice and dice the variables and attributes collected by product. The second example covers the case where a legacy system is unable to provide extracts in a directly usable format, because its output is limited to reports formatted for human readability.

Retrieval of information embedded in “smart” part numbers

Information retrieval from “smart” part numbers is a three step process:

  1. Parse: Break down the part numbers into their components.
  2. Identify the components: Retrieve identifiers for the components from the the part numbering system specs, like Material, Size, Style, Color, Version number, etc.
  3. Decode the content: If, for example, colors are coded as numbers from 1 to 6, you replace the numbers with plain text, like blue, green, yellow, etc.

While most professionals at a plant have been indoctrinated into the idea that “smart” part numbers are the way to go, few are actually capable of reading them. Usually, only the most senior member of the technical data department fully understands the part numbers and can dig out the spec.

Since the spec may say, for example, that the first three characters are a material code, the following two, a size, etc., it is tempting to just split the IDs at these positions. But it will split the IDs inherited from the company you merged with last year at meaningless positions. The splits will be off by one position where, unbeknownst to the spec, the materials codes have been expanded to four characters,  and wherever a leading blank was introduced by mistake. You need a more robust approach, that leaves the merged company’s IDs alone and copes with variations in length.

The key is a tool called regular expressions, that is best understood as wildcards on steroids and that only IT specialists can love. Jokers in poker can stand in for any other cards, and wildcards in queries substitute for any string of characters. In Linux or DOS, “” is a wildcard and, if you ask for a list of prod, it will give you the names of all the files in the directory that begin with prod. With regular expressions, you can match more elaborate patterns, like the structure of a part number as described in the spec, and can separate it into components. It deals with components of varying lengths, and skips over the IDs inherited through a merger and that don’t match the pattern. Then you can go back and apply the technique with a different pattern to the remaining IDs. Once you have thus successfully parsed thousands of item numbers, you may be left with a handful that still resist but, as long as it is a handful, they can be treated manually with the help of the technical data management group.

The remaining steps are straightforward, but require the help of at least one organization member who understands the “smart” numbering system. Technically, the work may be doable with just the spec, but human validation is necessary to ensure both accuracy and acceptance by a technical data management group that may feel threatened by the release of previously encrypted information in plain text to everyone.

Retrieval from line-printer reports

line printer
A line printer

The ability to extract data into delimited text files for use by other systems has been available since the MRP-II era, in the 1980s but it doesn’t mean that the legacy systems in your plant today have it. Sometimes, all they can produce is reports in the form of text files intended to be printed on accordion 11×17 paper with line printers and read by humans. Sometimes these reports are in PDF format rather than files. There are software packages that can be used to recover usable tables of data from these reports, including Monarch, ReportMiner, Tabula, PDFtables, PDF2XL, and others.

Inaccurate Mapping To Physical Reality

In his classic on systems analysis, Tom DeMarco uses the example of the Astro Pony Tout Sheet, a software system to predict the performance of race horses based on their zodiacal signs. The system’s internal logic is perfect, but it fails nonetheless because astrology doesn’t work. Ultimately, an information system’s value is in its connection to an outside reality. The objects in its database must map to real, physical objects on the race track or the shop floor, and their state changes and flows of data with other objects must reflect actual events. Although it’s not a commonly used term, I have heard this mapping to physical reality referred to as the pragmatics of the system.

Discrepancies in pragmatics are much more difficult to rectify than the internal issues discussed above. If the logic of the systems is simplistic, it is technically impossible within the systems, but even correcting inaccuracies in the data may be practically impossible. The task may require too much labor, and face headwind from department managers, who may not be keen to provide the top with total information access.

The real challenge is to understand which data can be trusted and stop supporting decisions with reports or summaries from wrong data. While correcting all the wrong data may be practically impossible. improvement projects like setup time reduction or the cell conversion of a process segment cause the collection of accurate, detailed data about their target operations, and the next challenge is to keep and maintain it in an appropriate database, which usually is not a legacy system.

You assess pragmatics by direct observation:

  • You spot check the warehouse to see whether items are where the system thinks they are, in the recorded quantities. This is an area where discrepancies can be fixed, recurrence reduced by reviewing picking policies, and future performance effectively monitored by cycle counting.
  • You measure some process times with a stopwatch and check whether the results agree with the parameters in the system.
  • You infer the downtime rate of machines by taking snapshots of machine status at different times, and compare the results with the performance reported by the system.
  • You observe the extent to which people in operations rely on information from the system to take action. For a production scheduling system to work, it must provide operators with instructions and they must follow these instructions. If it provides ambiguous guidelines like a dispatch list, instructions that are technically impossible to follow because of missing parts or tools, or if it is ignored because the shop floor team doesn’t want to initiate a setup in the last hour of the shift, the system is not working.
  • When transactions for operation start and end are entered manually with a keyboard, it is rarely done by operators at the time of the events. Instead, operators handwrite the data on a paper traveler and, at the end of the shift, a clerk enters all the accumulated transactions, making their system timestamps meaningless in any unit of time shorter than a shift.
  • A common issue is systems that contain a model at too high a level of aggregation to direct and support operations. In most ERP systems, for example, operations have a setup time and a runtime, which cannot be used in daily scheduling, because actual setups can differ within or between product families, or take a matrix of different values depending on both the product you are changing from and the one you are changing to. The machines themselves may process one piece at a time, one load at a time like a shuttle oven, or a continuous stream of pieces, like an assembly line or a tunnel oven.


The quality of a business organization’s legacy systems varies. For all their flaws, most do contain valuable information about operations that can be captured neither by direct observation on the shop floor, nor by interviewing stakeholders. But the data must be vetted before use in analytics, and the quality of the data itself is a reflection of the organization’s culture.