Feb 28 2016
Manufacturing Data Cleaning: Thankless But Necessary
Whether you manage operations with paper and pencil as in 1920 or use the state of the art in information technology (IT), you need clean data. If you don’t have it, you will suffer all sorts of dysfunctions. You will order materials you already have or don’t need, and be surprised by shortages. You will make delivery promises you can’t keep, and ship wrong or defective products. And you will have no idea what works and what doesn’t in your plant.
I have never seen a factory with perfect data, and perhaps none exists. Dirty data is the norm, not the exception, and the reason most factories are able to ship anything at all is that their people find ways to work around the defects in their data, from using expediters to find parts that aren’t where the system thought they were, to engineers who work directly with production to make sure a technical change is implemented. Mei-chen Lo, of Kainan University in Taiwan, proposed a useful classification of the issues with data quality. What I would like to offer here are pointers on addressing them.
I call this “data cleaning.” Other terms in use include data cleansing and data scrubbing. The difference between cleaning and cleansing is depth. Cleaning is removing dirt from a surface; cleansing, from a volume. I don’t see the distinction as meaningful with data. Data scrubbing, on the other hand, conjures up the idea of vigorously brushing the data on your hands and knees, and I don’t see it as a particularly helpful metaphor either.
The company’s information system may include multiple, separately implemented computer systems for supply chain management, manufacturing execution control, quality management, equipment maintenance, human resources, accounting, etc., for departments that are not always motivated to share information. There are three main levels of issues with the data in these systems:
- Violations of the system’s internal logic. If you consider the system independently of the plants it supports, it looks like a game, with rules specifying what players can and cannot do, but the data in the systems’ databases is not always compatible with these rules. If this were poker, it would be like playing with a deck of cards with two aces of spades and no jack of clubs, and rating two pairs over three-of-a-kind.
- Information retrieval challenges. You need to extract and decode the information embedded in “smart” part numbers, which is a different challenge from simply reviewing database extracts. Some of the older systems also do not produce extracts in tab-delimited text files but instead report files intended for line printers, divided in pages with headers and footers, with columns separated by spaces.
- Inaccurate mapping to physical reality. Unlike the cards and hands in poker, the objects in the company’s information system are supposed to map to counterparts in the physical world: materials, components, machines, tools, process programs, inventory, etc.
Following are details and examples from my personal experience with these issues:
Violations Of The System’s Internal Logic
First, the collections of computer systems that more or less talk to each other, the spreadsheets various individuals have developed to remedy the shortcomings of these systems, and the manually maintained visual management signs and boards must be the expression of the same model of the plant, and this model must be logically consistent. The first step is to identify where it isn’t, and fix it.
Identifiers
Products, machines, processes, operations, people… all have IDs in the system to which their characteristics and history are attached, and legacy systems can be expected to have records with wrong IDs, for a variety of reasons.
A vexing problem that I have encountered multiple times is what happens to numeric IDs when you use Excel. Employee numbers, customer numbers, and account numbers are identifiers that are usually entirely numeric. In some companies, even product IDs are numeric, and managers routinely extract tables from the ERP system to analyze them in Excel. But, when Excel sees a string like “007,” by default it overzealously converts it to an integer, fit for adding or multiplying, rather than an ID, and stores it as the number 7, which no longer matches the text 007. And you can’t correct it manually after the fact, because 7, 07, and 007 may all be valid IDs for different agents.
On the other hand, you can regenerate your spreadsheet and force Excel to treat a numeric column as text. If you are importing a table into Excel from a delimited text file, you do this by specifying the “text” format for the ID column when running the import wizard. As long as what is done in the spreadsheet is purely analytical, you can replace the data table without affecting the calculations, and the identifiers will then match those in the original system. The formatting step is not difficult, but it is easy to forget, and the problem will keep recurring as long as managers and engineers keep using Excel independently to wrangle company data.
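The same trap, and the same fix, exist outside Excel. Here is a minimal sketch in Python with pandas, using a made-up extract with a hypothetical employee_id column:

```python
import pandas as pd
from io import StringIO

# A made-up ERP extract where employee IDs carry leading zeros
csv_data = "employee_id,hours\n007,40\n07,38\n7,42\n"

# Default parsing coerces the IDs to integers, collapsing 007, 07, and 7 into 7
naive = pd.read_csv(StringIO(csv_data))
print(naive["employee_id"].tolist())  # [7, 7, 7]: three distinct IDs lost

# Forcing the ID column to text preserves the identifiers exactly
safe = pd.read_csv(StringIO(csv_data), dtype={"employee_id": str})
print(safe["employee_id"].tolist())  # ['007', '07', '7']
```

As in Excel, the fix must be applied at import time; once the IDs have been coerced to numbers, the original strings cannot be recovered.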
In transaction processing, IDs should never be entered manually, but still often are, even though bar codes have been around for over 40 years, and random errors creep in through manual data entry. Most systems’ input validation features will detect IDs that match nothing but will accept IDs that match the wrong item. Such errors have consequences — like the apparent shortage of a component that has been duly delivered — that cause them to be fixed. If, however, you have more inventory than you need, it may take weeks or months before these consequences cause problems. When you analyze the data, these errors also appear as outliers, such as an item delivered once in 1,000-lb big bags when, every other time, it comes in cases of 50 units. Therefore, running queries and reports may bring to light problems that have yet to affect operations, but it may not catch everything.
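A query for such outliers can be sketched as follows, with a hypothetical delivery table in which one receipt deviates from the item’s usual pack size:

```python
import pandas as pd

# Made-up delivery records: one 1,000-lb big bag among routine 50-unit cases
deliveries = pd.DataFrame({
    "item": ["A1", "A1", "A1", "A1", "A1"],
    "pack_size": [50, 50, 50, 1000, 50],
})

# For each item, find its most common pack size, then flag records that deviate
usual = deliveries.groupby("item")["pack_size"].transform(lambda s: s.mode()[0])
outliers = deliveries[deliveries["pack_size"] != usual]
print(outliers)
```

Each flagged record is only a candidate for investigation: the 1,000-lb bag may be a data entry error, or a genuine one-off delivery.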
Inconsistencies
Inconsistencies are cases where different names are used for the same object, or when parameters that should have the same values are recorded as different. Inconsistencies may be difficult to detect. The strangest case I have encountered was a system for configured products where each configuration had a name that was changed every three months. It was a feature coded into the system 25 years before. No one currently working at the plant knew its purpose, but no one knew how to turn it off. What tipped me off about it was looking at monthly production quantities for the most popular configurations, and noticing that some configurations were popular for three months and then vanished, to be replaced by other ones. This is not the way you would expect a market to behave, and it made me suspicious that it was actually the result of systematic name changes.
To confirm this hypothesis, I checked whether the bills of materials (BOMs) matched. In principle it is possible for two distinct configurations to have the same BOM, with different assembly processes, or subject to different treatments. Printed circuit boards with the same components and made by the same process routinely end up as different products based on test performance. In this particular case, it was unlikely but, in any case, matching BOMs were not the final proof. When I brought up the issue with the department in charge of technical data, it turned out that they knew this was going on, but felt powerless to do anything about it.
The plant manager and his direct reports, on the other hand, had no idea and were being misled by reports that showed the number of different configurations built in a year as four times larger than it was. An immediate countermeasure, however, was possible. By maintaining a cross-reference table of configuration names, you could aggregate the production history of a configuration across its multiple names and produce accurate reports. It did not require waiting until the system was replaced.
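The countermeasure can be sketched in a few lines; the configuration names and quantities below are invented for illustration:

```python
import pandas as pd

# Hypothetical cross-reference of the rotating names to one canonical name
aliases = {"CFG-1903A": "CFG-A", "CFG-1906B": "CFG-A", "CFG-A": "CFG-A"}

# Production history as recorded under the rotating names
history = pd.DataFrame({
    "config": ["CFG-1903A", "CFG-1906B", "CFG-A"],
    "qty": [120, 95, 110],
})

# Map each recorded name to its canonical name and aggregate across aliases
history["canonical"] = history["config"].map(aliases)
totals = history.groupby("canonical")["qty"].sum()
print(totals)  # CFG-A: 325
```

The cross-reference table has to be maintained as new names appear, but that is a small price for accurate reports in the meantime.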
More generally, flaws in the revision management system are a major cause of inconsistencies. Once a technical change is approved, it must be made effective so that, as of a given date or a given serial number, a product is made in a new way. From the point of view of materials handlers and production operators, the instructions are an output of the plant’s information system, regardless of whether they are electronic or hardcopy. If all the pick lists and work instructions are electronic, they are automatically updated; if in hardcopy, it depends on the diligence and thoroughness of the people who distribute and post them, and mistakes happen.
Each purchased component must have a unique part number inside the company, even when bought from multiple suppliers. The origin of each part must of course be recorded for traceability, but it must be separate from its ID. For their products, most companies still use so-called smart numbering systems, with information encoded in each ID, and no two companies use the same, as a result of which mergers produce part catalogs with different encoding methods. As long as each ID is unique, this is not a consistency issue but a usability one. It is necessary for analytical purposes to extract and decode the information embedded in the IDs, but this topic deserves a separate discussion below.
Logical Integrity Violations
This is data that makes no sense, such as:
- Products with no recorded sales nonetheless reported with positive revenue.
- A supposedly unique ID used for multiple objects. This is the converse of the problem discussed above, of having multiple IDs for the same object.
- The times for the steps in a sequence not adding up to the time for the whole sequence.
- Some male employees are described as pregnant.
- …
Systems that are professionally designed database applications are not supposed to let this happen, but many of the most frequently used functions in a company’s legacy systems are in Excel spreadsheets built by end users, and devoid of safeguards.
Bringing to light the existence of such errors in end-user spreadsheets is useful to make management realize that these spreadsheets are not a viable solution. The existence of these spreadsheets, however, is evidence of the functional shortcomings of the systems they feed off. As such, the spreadsheets deserve attention, but correcting all the errors they contain is hopeless.
The uniqueness of IDs is easy to check, by exporting their list to a relational database and attempting to use them as a primary key. If it fails, it means there are duplicates, which can then be identified through a “find duplicates” query.
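A minimal sketch of this check, using Python’s built-in SQLite and an invented ID list:

```python
import sqlite3

# A made-up ID list containing one duplicate
ids = ["A39HW", "B2201", "A39HW"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id TEXT PRIMARY KEY)")
try:
    # Attempting to use the IDs as a primary key is the uniqueness test
    conn.executemany("INSERT INTO items (id) VALUES (?)", [(i,) for i in ids])
    print("all IDs unique")
except sqlite3.IntegrityError:
    # The failed insert proves duplicates exist; a "find duplicates" query isolates them
    dupes = sorted({i for i in ids if ids.count(i) > 1})
    print(dupes)  # ['A39HW']
```

The same test works in Access or any relational database: define the ID column as a primary key and see whether the import succeeds.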
To identify logical violations that involve multiple fields or multiple records, you first need to spell out the relationships that they should have, and then develop queries to isolate records that violate them. It’s a fishing expedition. You have to approach the task determined to find what is wrong, not convince yourself that everything is fine. Once you have identified logical integrity violations, there usually is no systematic way to clean them up: they have to be resolved one by one.
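As one example of such a query, here is a sketch that isolates routings whose step times do not add up to the recorded total; the routing data is hypothetical:

```python
import pandas as pd

# Made-up routing data: the step times should add up to the recorded total
steps = pd.DataFrame({
    "routing": ["R1", "R1", "R2", "R2"],
    "step_time": [3.0, 4.5, 2.0, 2.0],
})
totals = pd.DataFrame({"routing": ["R1", "R2"], "total_time": [7.5, 5.0]})

# Sum the step times per routing and compare against the recorded totals
check = steps.groupby("routing")["step_time"].sum().reset_index().merge(totals)
violations = check[check["step_time"] != check["total_time"]]
print(violations["routing"].tolist())  # ['R2']
```

Each relationship you can articulate between fields or records yields one such query, and each query is one more net to fish with.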
Missing Records And Missing Values
Missing records and missing field values are easy to identify but difficult to understand and fix. A field in a data table may be empty for many different reasons:
- It may be irrelevant to a specific record. For example, the company’s products used to be all liquids, and viscosity was relevant to all. Then the company got into the business of steel balls, and all its steel-ball products have an empty viscosity field. This is best addressed by storing product characteristics in lists of property name and value pairs, which allows different products to have different lists of characteristics without missing values. This is converted into dimensional tables for analysis by cross-tabulating homogeneous subsets, using pivot tables in Excel, crosstab queries in Access, or casting in R.
- It may mean that a default value applies. In product data sheets that tout selling points, a beautiful brushed metal enclosure may be mentioned among the characteristics, while a plain plastic one won’t be. In this case, the data is deliberately omitted. The omission has a meaning, and the missing data must be replaced by a statement of that meaning. If an empty “Enclosure” field means “nothing special,” then, for internal use, it must be filled out with the mention “Nothing special” to remove any ambiguity. Of course, data sheets shown to customers just omit the enclosure field if it is not a selling point.
- It may be a mistake. The data should have been collected but wasn’t. The safest response is to just mark it as such. Some analysts will fill in a value by interpolating from neighboring points, or by making assumptions. While such practices may make it easier to manipulate the data, it is not clear at all that they clean the data. Instead, they may further muddy it. It is safer to replace the missing value with a symbol having no other meaning, like NA in R. If interpolations or assumptions are used, it should not be done in the database but during the analysis and explicitly stated.
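Two of the points above — property name-value pairs cross-tabulated into a dimensional table, and missing values left marked rather than filled in — can be sketched with pandas on invented product data:

```python
import pandas as pd

# Made-up property lists: liquids have a viscosity, steel balls a diameter
props = pd.DataFrame({
    "product": ["L-1", "L-1", "SB-9", "SB-9"],
    "property": ["viscosity_cP", "density", "diameter_mm", "density"],
    "value": [12.0, 0.9, 5.0, 7.8],
})

# Cross-tabulate into a dimensional table; properties irrelevant to a product
# come out as NaN, pandas's counterpart of R's NA
wide = props.pivot(index="product", columns="property", values="value")
print(wide)

# Aggregations skip missing values by default, so there is no need to fill them in;
# if interpolation is ever used, do it explicitly at analysis time, not in the database
print(wide["density"].mean())  # averages over the two real values only
```

The marker keeps the distinction between “measured as zero” and “never measured,” which any fill-in value would erase.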
Information Retrieval Challenges
Sometimes the data in the legacy systems is accurate, but difficult to retrieve for analytical purposes. The first example is the information embedded in “smart” part numbers, which needs to be parsed and decoded before you can use it to slice and dice the variables and attributes collected by product. The second example covers the case where a legacy system is unable to provide extracts in a directly usable format, because its output is limited to reports formatted for human readability.
Retrieval of information embedded in “smart” part numbers
Information retrieval from “smart” part numbers is a three-step process:
- Parse: Break down the part numbers into their components.
- Identify the components: Retrieve identifiers for the components from the part numbering system specs, like Material, Size, Style, Color, Version number, etc.
- Decode the content: If, for example, colors are coded as numbers from 1 to 6, you replace the numbers with plain text, like blue, green, yellow, etc.
While most professionals at a plant have been indoctrinated into the idea that “smart” part numbers are the way to go, few are actually capable of reading them. Usually, only the most senior member of the technical data department fully understands the part numbers and can dig out the spec.
Since the spec may say, for example, that the first three characters are a material code, the following two a size, etc., it is tempting to just split the IDs at these positions. But this will split the IDs inherited from the company you merged with last year at meaningless positions. The splits will be off by one position wherever, unbeknownst to the spec, the material codes have been expanded to four characters, and wherever a leading blank was introduced by mistake. You need a more robust approach, one that leaves the merged company’s IDs alone and copes with variations in length.
The key is a tool called regular expressions, best understood as wildcards on steroids, and one that only IT specialists can love. Jokers in poker can stand in for any other cards, and wildcards in queries substitute for any string of characters. In Linux or DOS, “*” is a wildcard and, if you ask for a list of “prod*”, you get the names of all the files in the directory that begin with “prod”. With regular expressions, you can match more elaborate patterns, like the structure of a part number as described in the spec, and separate it into components. This approach deals with components of varying lengths, and skips over the IDs inherited through a merger that don’t match the pattern. Then you can go back and apply the technique with a different pattern to the remaining IDs. Once you have thus successfully parsed thousands of item numbers, you may be left with a handful that still resist but, as long as it is a handful, they can be treated manually with the help of the technical data management group.
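A sketch of the three steps with Python regular expressions follows; the pattern, the decode table, and the part numbers are all hypothetical stand-ins for a real spec:

```python
import re

# Hypothetical spec: a 3-to-4-letter material code, a 2-digit size, a 1-letter color code
pattern = re.compile(r"^(?P<material>[A-Z]{3,4})(?P<size>\d{2})(?P<color>[A-Z])$")
color_names = {"B": "blue", "G": "green", "Y": "yellow"}  # made-up decode table

part_numbers = ["ALU42B", "STEL07G", "X9-OLDCO-113"]  # the last inherited via a merger

parsed, unmatched = [], []
for pn in part_numbers:
    m = pattern.match(pn)
    if m:
        fields = m.groupdict()  # step 1 and 2: parse and identify the components
        fields["color"] = color_names[fields["color"]]  # step 3: decode the content
        parsed.append(fields)
    else:
        unmatched.append(pn)  # left alone for another pattern or manual review

print(parsed)
print(unmatched)  # ['X9-OLDCO-113']
```

Note that the pattern copes with 3- or 4-character material codes, and that the merged company’s ID falls through to the unmatched list instead of being mangled.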
The remaining steps are straightforward, but require the help of at least one organization member who understands the “smart” numbering system. Technically, the work may be doable with just the spec, but human validation is necessary to ensure both accuracy and acceptance by a technical data management group that may feel threatened by the release of previously encrypted information in plain text to everyone.
Retrieval from line-printer reports
The ability to extract data into delimited text files for use by other systems has been available since the MRP-II era, in the 1980s, but that doesn’t mean the legacy systems in your plant today have it. Sometimes, all they can produce is reports in the form of text files intended to be printed on accordion 11×17 paper with line printers and read by humans. Sometimes these reports are in PDF format rather than text files. There are software packages that can be used to recover usable tables of data from these reports, including Monarch, ReportMiner, Tabula, PDFtables, PDF2XL, and others.
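For simple cases, a general-purpose tool can also do the job. Here is a sketch using pandas’s read_fwf on an invented line-printer report, with the column positions specified explicitly:

```python
import pandas as pd
from io import StringIO

# A made-up line-printer report: a page header, column headings, fixed-width columns
report = (
    "INVENTORY REPORT                PAGE 1\n"
    "ITEM      QTY   LOC\n"
    "A39HW      120  W1-03\n"
    "B2201       40  W2-17\n"
)

# Skip the page header, then slice each line at the columns' character positions
table = pd.read_fwf(StringIO(report), skiprows=1, colspecs=[(0, 8), (8, 15), (15, 22)])
print(table)
```

Real reports also have page footers, repeated headings, and subtotal lines to strip out, which is exactly what the dedicated packages automate.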
Inaccurate Mapping To Physical Reality
In his classic on systems analysis, Tom DeMarco uses the example of the Astro Pony Tout Sheet, a software system to predict the performance of race horses based on their zodiacal signs. The system’s internal logic is perfect, but it fails nonetheless because astrology doesn’t work. Ultimately, an information system’s value is in its connection to an outside reality. The objects in its database must map to real, physical objects on the race track or the shop floor, and their state changes and flows of data with other objects must reflect actual events. Although it’s not a commonly used term, I have heard this mapping to physical reality referred to as the pragmatics of the system.
Discrepancies in pragmatics are much more difficult to rectify than the internal issues discussed above. If the logic of a system is simplistic, fixing that within the system is technically impossible, and even correcting inaccuracies in the data may be practically impossible: the task may require too much labor, and face headwinds from department managers who may not be keen to give top management total information access.
The real challenge is to understand which data can be trusted and to stop supporting decisions with reports or summaries based on wrong data. While correcting all the wrong data may be practically impossible, improvement projects like setup time reduction or the cell conversion of a process segment cause the collection of accurate, detailed data about their target operations, and the next challenge is to keep and maintain it in an appropriate database, which usually is not a legacy system.
You assess pragmatics by direct observation:
- You spot check the warehouse to see whether items are where the system thinks they are, in the recorded quantities. This is an area where discrepancies can be fixed, recurrence reduced by reviewing picking policies, and future performance effectively monitored by cycle counting.
- You measure some process times with a stopwatch and check whether the results agree with the parameters in the system.
- You infer the downtime rate of machines by taking snapshots of machine status at different times, and compare the results with the performance reported by the system.
- You observe the extent to which people in operations rely on information from the system to take action. For a production scheduling system to work, it must provide operators with instructions and they must follow these instructions. If it provides ambiguous guidelines like a dispatch list, instructions that are technically impossible to follow because of missing parts or tools, or if it is ignored because the shop floor team doesn’t want to initiate a setup in the last hour of the shift, the system is not working.
- When transactions for operation start and end are entered manually with a keyboard, it is rarely done by operators at the time of the events. Instead, operators handwrite the data on a paper traveler and, at the end of the shift, a clerk enters all the accumulated transactions, making their system timestamps meaningless in any unit of time shorter than a shift.
- A common issue is systems that contain a model at too high a level of aggregation to direct and support operations. In most ERP systems, for example, operations have a setup time and a runtime, which cannot be used in daily scheduling, because actual setups can differ within or between product families, or take a matrix of different values depending on both the product you are changing from and the one you are changing to. The machines themselves may process one piece at a time, one load at a time like a shuttle oven, or a continuous stream of pieces, like an assembly line or a tunnel oven.
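The snapshot-based downtime check above amounts to work sampling. A minimal sketch, with invented observations:

```python
# Made-up status snapshots taken at random times: 1 = machine down, 0 = running
snapshots = [0, 0, 1, 0, 0, 1, 0, 0, 0, 0]
n = len(snapshots)
p = sum(snapshots) / n  # estimated downtime rate

# Rough 95% confidence half-width for a proportion estimated from n samples
half_width = 1.96 * (p * (1 - p) / n) ** 0.5
print(f"estimated downtime: {p:.0%} +/- {half_width:.0%}")
```

With only 10 snapshots the margin of error is wide; the point is that even a crude independent estimate exposes a system that reports, say, 2% downtime for a machine you observe down one time in five.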
Conclusions
The quality of a business organization’s legacy systems varies. For all their flaws, most do contain valuable information about operations that can be captured neither by direct observation on the shop floor, nor by interviewing stakeholders. But the data must be vetted before use in analytics, and the quality of the data itself is a reflection of the organization’s culture.
Shiva Prasad
March 6, 2016 @ 4:29 am
Hi Michel,
Thank you for a comprehensive layout of most common data quality challenges organizations face.
A nuance closely associated with Logical integrity is that of Semantic integrity – because of a highly heterogeneous IT landscape with a mix of multiple Commercial Applications and home-grown systems built over a period of time, many organizations suffer from a lack of uniform, consistent definition and meaning of a single data or application object, and how it is related to the business context applied across the enterprise. As a result, decision-making is frequently impaired because there is less trust in the data.
You mention that professionally-designed database applications are not supposed to let this happen – unfortunately, in real world, apps from many of the biggest IT vendors are barely able to support such logical and semantic integrity needs. They are fairly good at maintaining referential integrity. Some organizations who understand this limitation of these Apps, overlay an additional layer of semantic/logical integrity in the form of extra metadata and business rules / policies that help to create “trusted” data in the first place (usually, at the time of the creation of master data instances in the IT app). Often these “high quality data hubs” then are used to distribute clean, usable information to consumers.
Data cleaning is “after the fact” and hence costly. We should encourage a move to a regime where “first time right” becomes the normal operating behavior when it comes to information management using Automated systems. This might call for not just new tools capable of real-time information validation and enrichment using trusted sources (a combination of internal and external 3rd party ones) at the point of data entry/updates, but also process-level and organizational behavior-level changes that explicitly recognize & reward good data management practices of individuals, teams and departments.
Regards
Shiva
Michel Baudin
March 6, 2016 @ 9:19 am
Thank you for your comments, and I generally agree with your recommendations.
As I am addressing manufacturing professionals, I try to avoid vocabulary that will stop them reading further. That’s why I am not discussing semantic or referential integrity. “Logical” is a less intimidating word, and I am betting that, as long as I am clear as to what I mean by “logical integrity,” I won’t lose them.
Yes, data cleaning is costly and shouldn’t be necessary, but the reality is that the data available in most companies is dirty and we don’t have a choice.
Srinath Alamela
March 6, 2016 @ 11:11 am
You obviously have put in a lot of thought along with your experience to come up with the conclusions – which are right on target. To solve the smart part number problem we did exactly what you suggested – that is to break down the components, store them as attributes (or properties) so that they are atomic and reportable and classifiable – but went an additional step to concatenate it back into a smart part number and solve the human readable problem as it minimizes the other problem caused by ID based part numbering and mitigating the wrong ID being used in transactions. (It also saved us whole lot of time during implementation of the ERP on intelligent vs non-intelligent part numbering!)
Michel Baudin
March 6, 2016 @ 2:17 pm
The main feature that makes IDs readable by humans is brevity. “A39HW” is readable and even memorable because it’s short; a 15-digit “smart” part number isn’t because it’s long. The “smart” part numbers don’t actually convey any information to any human, because nobody is trained to decode them. Furthermore, “smart” part numbers are error prone because these conventions give similar names to similar parts, making them easier to confuse.
Srinath Alamela
March 7, 2016 @ 12:44 pm
It depends on how the smart number is configured – in our case it was the concatenation of the specifications in conjunction with the device ID, and when we measured the outcomes after 3 years, transaction errors were down to less than 0.0001%. After 15 years in production, inventory accuracy is close to 99.99%, resulting in the customers being able to close their books in 1 day and run daily automated supply chain plans.
Michel Baudin
March 7, 2016 @ 2:45 pm
If you are satisfied with the rate of transaction errors you have achieved, that’s great. It doesn’t change the fact that “smart” part numbers are a legacy of the paper-and-pencil age, and obsolete in the database age.
From your own words, there is nothing in your part numbers that you cannot retrieve from the database. If you want to print all the attributes on labels, there is nothing preventing you doing it, with plain text names and values.
Marty Preuss
March 7, 2016 @ 5:04 am
You’ve done a great job in classifying the sources of data problems. From a practical standpoint, they exist at one level or another in almost every organization. Resolving these issues quickly, to make the data useful for a team, often requires data manipulation (or cleaning). Many people with limited data manipulation skills act as though they are standing in front of a giant, insurmountable wall when confronted with data that isn’t “served on a silver platter” – they take no action at all.
It’s a shame because many answers are buried in that data. I believe the Lean Six Sigma “curriculum” is fundamentally flawed in this area – overemphasis on advanced statistics and zero emphasis on practical data manipulation skills.
I need to explore the programs you mention in 2.2, I usually attack text delimited files with VBA, especially if the team is using such reports for decision making…
Thankless is a strong word for the title, from my perspective 🙂 I enjoy the challenge – and the results.
Thanks for another good post.