Prof. Mei-Chen Lo, of National University and Kainan University in Taiwan, worked with operations managers in two semiconductor companies to establish a list of 16 dimensions of data quality. Most are not parameters that can be measured, and should be considered instead as questions to be asked about a company’s data. I learned of this list from her at an IE conference in Kitakyushu in 2009, and found it useful by itself as a checklist for a thorough assessment of a current state. Her research is about methods for ranking the importance of these criteria.
They are grouped in four main categories:
- Intrinsic. Agreement of the data with reality.
- Context. Usability of the information in the data to support decisions or solve problems.
- Representation. The way the data is structured, or not.
- Accessibility. The ability to retrieve, analyze and protect the data.
Each category breaks further down as follows:
- Intrinsic quality
- Accuracy. Accuracy is the most obvious issue, and is measurable. If the inventory data says that slot 2-3-2 contains two bins of screws, then can we be confident that, if we walk to aisle 2, column 3, level 2 in the warehouse, we will actually find two bins of screws?
- Fact or judgement. That slot 2-3-2 contains two bins of screws is a statement of fact. Its accuracy is in principle independent of the observer. On the other hand, “Operator X does not get along with teammates” is a judgement made by a supervisor and cannot carry the same weight as a statement of fact.
- Source credibility. Is the source of the data credible? Credibility problems may arise due to the following:
- Lack of training. For example, measurements that are supposed to be taken on “random samples” of parts are not, because no one in the organization knows how to draw a random sample.
- Mistake-prone collection methods. For example, manually collected measurements are affected by typing errors.
- Conflicts of interest. Employees collecting data stand to be rewarded or punished depending on the values of the data. For example, forecasters are often rewarded for optimistic forecasts.
- Believability of the content. Data can be unbelievable because it is valid news of extraordinary results, or because it is inaccurate. In either case, it warrants special attention.
- Relevance. Companies often collect data because they can, rather than because it is relevant. It is the corporate equivalent of looking for keys at night under the street light rather than next to the car. In the semiconductor industry, where this list of criteria was established, measurements are routinely taken after each step of the wafer process and plotted in control charts. This data is relatively easy to collect but of little relevance to the control and improvement of the wafer process as a whole. Most of the relevant data cannot be captured until the circuits can be tested at the end of the process.
- Value added. Some of the data produced in a plant has a direct economic value. Aerospace or defense goods, for example, are delivered with documentation containing a record of their production process, and this data is part of the product. More generally, the data generated by commercial transactions, such as orders, invoices, shipping notices, or receipts, is at the heart of the company’s business activity. This is to be contrasted with data that is generated to satisfy internal needs, such as the number of employees trained in transaction processing on the ERP system.
- Timeliness. Is the data available early enough to be actionable? A field failure report on a product that is due to problems with a manufacturing process as it was 6 months ago is not timely if this process has been the object of two engineering changes since then.
- Completeness. Measurements must be accompanied by all the data characterizing where, when, how and by whom they were collected and in what units they are expressed.
- Sufficiency. Does the data cover all the parameters needed to support a decision or solve a problem?
- Interpretability. What inferences can you draw directly from the data? If the demand for an item has been rising 5%/month for the past 18 months, it is no stretch to infer that this trend will continue next month. On the other hand, if you are told that a machine has an Overall Equipment Effectiveness (OEE) of 35%, what can you deduce from it? The OEE is the product of three ratios: availability, yield, and actual over nominal speed. The 35% figure may tell you that there is a problem, but not where it is.
- Ease of understanding. Management accounting exists for the purpose of supporting decision making by operations managers. Yet the reports provided to managers are often in a language they don’t understand. This does not have to be, and financial officers like Orrie Fiume have modified the vocabulary used in these reports to make them easier for actual managers to understand. The understandability of technical data can also be impaired when engineers use cryptic codes and abbreviations instead of plain language.
- Conciseness. A table with 100 columns and 20,000 rows with 90% of its cells empty is a verbose representation of a sparse matrix. A concise representation would be a list of the row and column IDs of the non-empty cells, with their values.
- Consistency. Consistency problems often arise as a result of mergers and acquisitions, when the different data models of the companies involved need to be mashed together.
- Convenience of access. Data that an end-user can retrieve directly through a graphic interface is conveniently accessible; data in paper folders on library shelves is not. Neither are databases in which each new query requires the development of a custom report by a specially trained programmer.
- Usability. High-usability data, for example, comes in the form of lists of property names and values that can easily be tabulated into spreadsheets or database tables and, from that point on, selected, filtered and summarized in a variety of informative ways. Low-usability data often comes in the form of a string of characters that first needs to be separated, with characters 1 to 5 being one field, 6 to 12 another, etc. The meaning of each of these substrings then needs to be retrieved from a correspondence table, to find that ‘00at3’ means “lime green.”
- Security. Manufacturing data contain some of the company’s intellectual property, which must be protected not only from theft but from inadvertent alterations by unqualified employees. But effective security must also be provided efficiently, so that qualified, authorized employees are not slowed down by security procedures when accessing data.
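The interpretability point about OEE can be made concrete with a small sketch. The three machines and their ratios below are invented for illustration, as are the names `oee` and `weakest_factor`; the only thing taken from the list above is that OEE is the product of availability, yield, and actual over nominal speed.

```python
def oee(availability: float, speed_ratio: float, yield_ratio: float) -> float:
    """OEE as the product of availability, actual/nominal speed, and yield."""
    for name, ratio in (
        ("availability", availability),
        ("speed_ratio", speed_ratio),
        ("yield_ratio", yield_ratio),
    ):
        if not 0.0 <= ratio <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {ratio}")
    return availability * speed_ratio * yield_ratio


def weakest_factor(ratios: dict) -> str:
    """Name of the lowest of the three ratios, i.e. where to look first."""
    return min(ratios, key=ratios.get)


# Hypothetical machines: the ratios are made up so that each OEE is about 35%.
machines = {
    "A": {"availability": 0.50, "speed_ratio": 0.875, "yield_ratio": 0.80},
    "B": {"availability": 0.875, "speed_ratio": 0.50, "yield_ratio": 0.80},
    "C": {"availability": 0.875, "speed_ratio": 0.80, "yield_ratio": 0.50},
}

for name, ratios in machines.items():
    print(f"Machine {name}: OEE = {oee(**ratios):.0%}, "
          f"weakest factor = {weakest_factor(ratios)}")
```

All three machines report the same OEE, yet one mostly sits idle, one runs slowly, and one makes defective parts. The aggregate figure alone cannot distinguish between them; the three ratios can.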
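The conciseness example can also be sketched in a few lines. In the 100-column, 20,000-row table, 2,000,000 cells hold only 200,000 values, so a list of (row ID, column ID, value) entries needs 600,000 items instead of 2,000,000. The tiny table and the helper names `to_sparse` and `to_dense` below are assumptions made for illustration, with `None` standing in for an empty cell.

```python
def to_sparse(table):
    """Map (row ID, column ID) to value for the non-empty cells only."""
    return {
        (r, c): value
        for r, row in enumerate(table)
        for c, value in enumerate(row)
        if value is not None
    }


def to_dense(cells, n_rows, n_cols):
    """Rebuild the full table, filling the missing cells with None."""
    table = [[None] * n_cols for _ in range(n_rows)]
    for (r, c), value in cells.items():
        table[r][c] = value
    return table


# Hypothetical 3 x 4 table with 9 of its 12 cells empty.
dense = [
    [None, 4.2, None, None],
    [None, None, None, None],
    [7.0, None, None, 1.5],
]

sparse = to_sparse(dense)
print(sparse)  # {(0, 1): 4.2, (2, 0): 7.0, (2, 3): 1.5}

# Nothing is lost: the verbose form can be rebuilt from the concise one.
assert to_dense(sparse, 3, 4) == dense
```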
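To show what the low-usability case in the list costs in practice, here is a minimal sketch of the work needed before such a record can even be tabulated. The field names, the sample record, and the second field's content are hypothetical; only the positions 1 to 5 and 6 to 12 and the mapping of ‘00at3’ to “lime green” come from the text.

```python
# Hypothetical layout: (field name, first character, last character),
# with positions counted from 1 and inclusive, as in the text.
LAYOUT = [("color_code", 1, 5), ("part_code", 6, 12)]

# Correspondence table; only this one entry is given in the text.
COLOR_NAMES = {"00at3": "lime green"}


def parse_record(record: str) -> dict:
    """Cut a fixed-width string into named fields."""
    expected_length = LAYOUT[-1][2]
    if len(record) < expected_length:
        raise ValueError(
            f"record has {len(record)} characters, expected {expected_length}"
        )
    return {name: record[start - 1:end] for name, start, end in LAYOUT}


def decode(fields: dict) -> dict:
    """Replace the color code with its plain-language name when known."""
    code = fields["color_code"]
    return {
        "color": COLOR_NAMES.get(code, code),  # keep the raw code if unknown
        "part_code": fields["part_code"],
    }


raw = "00at3AB12345"  # made-up 12-character record
fields = parse_record(raw)
print(fields)          # {'color_code': '00at3', 'part_code': 'AB12345'}
print(decode(fields))  # {'color': 'lime green', 'part_code': 'AB12345'}
```

The decoded dictionary is the high-usability form: property names with plain-language values that drop straight into a spreadsheet or database table. Everything above it is overhead that every consumer of the raw string has to repeat.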
Prof. Mei-Chen Lo’s research on this topic was published in “The assessment of the information quality with the aid of multiple criteria analysis” (European Journal of Operational Research, Volume 195, Issue 3, 16 June 2009, Pages 850-856).