A management perspective on data quality

Prof. Mei-Chen Lo, of National University and Kainan University in Taiwan, worked with Operations Managers in two semiconductor companies to establish a list of 16 dimensions of data quality. Most are not parameters that can be measured and should instead be treated as questions to ask about a company’s data. I learned of the list from her at an IE conference in Kitakyushu in 2009 and found it useful by itself as a checklist for a thorough assessment of a current state. Her research is about methods for ranking the importance of these criteria.

They are grouped into four main categories:

  1. Intrinsic. Agreement of the data with reality.
  2. Context. Usability of the information in the data to support decisions or solve problems.
  3. Representation. The way the data is structured, or not.
  4. Accessibility. The ability to retrieve, analyze and protect the data.

Each category breaks further down as follows:

  1. Intrinsic quality

    • Accuracy. Accuracy is the most obvious issue and is measurable. If the inventory data says that slot 2-3-2 contains two bins of screws, then can we be confident that, if we walk to aisle 2, column 3, level 2 in the warehouse, we will actually find two bins of screws?
    • Fact or judgment. That slot 2-3-2 contains two bins of screws is a statement of fact. Its accuracy is in principle independent of the observer. On the other hand, “Operator X does not get along with teammates” is a subjective judgment and cannot carry the same weight as a statement of fact.
    • Source credibility. Is the source of the data credible? Credibility problems may arise due to the following:
      • Lack of training. For example, measurements that are supposed to be taken on “random samples” of parts are not, because no one in the organization knows how to draw a random sample.
      • Mistake-prone collection methods. For example, manually collected measurements are affected by typing errors.
      • Conflicts of interest. Employees collecting data stand to be rewarded or punished depending on the values of the data. For example, forecasters are often rewarded for optimistic forecasts.
    • Believability of the content. Data can be unbelievable because it is valid news of extraordinary results, or because it is inaccurate. In either case, it warrants special attention.
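The "random samples" problem mentioned under Source credibility becomes mechanical once the parts in a lot are listed by identifier. Here is a minimal sketch in Python; the function name `draw_sample`, the serial-number format, and the lot size are hypothetical choices made for illustration only.

```python
import random

def draw_sample(part_ids, sample_size, seed=None):
    """Return sample_size distinct part IDs, every subset being equally likely."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible for audits
    return rng.sample(list(part_ids), sample_size)

# Hypothetical lot of 200 parts identified by serial number.
lot = [f"SN{n:04d}" for n in range(1, 201)]
to_measure = draw_sample(lot, 5, seed=2009)
print(to_measure)
```

Recording the seed with the measurements lets anyone check later that the parts measured were the parts drawn, rather than the parts closest at hand.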
  2. Context

    • Relevance. Companies often collect data because they can, rather than because it is relevant. It is the corporate equivalent of looking for keys at night under the street light rather than next to the car. The semiconductor industry, where this list of criteria was established, routinely takes measurements after each step of the wafer process and plots them in control charts. This data is relatively easy to collect but of little relevance to the control and improvement of the wafer process as a whole. The engineers cannot capture most of the relevant data until the circuits undergo tests at the end of the process.
    • Value-added. Some of the data produced in a plant have direct economic value. Aerospace or defense goods, for example, include documentation of their production process as part of the product. More generally, the data from commercial transactions, such as orders, invoices, shipping notices, or receipts, is at the heart of the company’s business activity. By contrast, the organization generates other data only to satisfy internal needs, such as the number of employees trained in transaction processing on the ERP system.
    • Timeliness. Is the data available early enough to be actionable? A field failure report on a product that is due to problems with a manufacturing process as it was 6 months ago is not timely if this process has been the object of two engineering changes since then.
    • Completeness. Measurements must come with units of measure and all the data describing who collected them, where, when, and how.
    • Sufficiency. Does the data cover all the parameters needed to support a decision or solve a problem?
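The Completeness criterion above lends itself to an automatic check on each record. The following sketch assumes a hypothetical record layout; the field names (`value`, `unit`, `collected_by`, `location`, `timestamp`, `method`) are illustrative, not a standard, and the sample record is invented.

```python
# Fields assumed necessary for a measurement to count as complete (illustrative).
REQUIRED_FIELDS = ("value", "unit", "collected_by", "location", "timestamp", "method")

def missing_fields(record):
    """List the required fields that are absent or blank in a measurement record."""
    return [name for name in REQUIRED_FIELDS
            if record.get(name) is None or str(record.get(name)).strip() == ""]

# Hypothetical record: the operator is blank, location and method are absent.
record = {"value": 12.7, "unit": "mm", "collected_by": "", "timestamp": "2009-10-12T08:30"}
print(missing_fields(record))  # ['collected_by', 'location', 'method']
```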
  3. Representation

    • Interpretability. What inferences can you draw directly from the data? If the demand for an item has been rising 5%/month for the past 18 months, it is no stretch to infer that this trend will continue next month. On the other hand, if a chart tells you that a machine has an Overall Equipment Effectiveness (OEE) of 35%, what can you deduce from it? The OEE is the product of three ratios: availability, yield, and actual over nominal speed. For example, 70% availability, 80% yield, and 62.5% of nominal speed multiply out to 35%, but so do 50% availability, 100% yield, and 70% of nominal speed. The 35% figure may tell you that there is a problem, but not where it is.
    • Ease of understanding. Management accounting exists for the purpose of supporting decision making by operations managers. Yet the reports provided to managers are often in a language they don’t understand. It does not have to be this way, and financial officers like Orrie Fiume have modified the vocabulary used in these reports to make them easier for actual managers to understand. Engineers using cryptic abbreviations instead of plain language make technical data more difficult to understand.
    • Conciseness. A table with 100 columns and 20,000 rows with 90% of its cells empty is a verbose representation of a sparse matrix. A concise representation would be a list of the row and column IDs of the non-empty cells, with their values.
    • Consistency. Consistency problems often arise as a result of mergers and acquisitions, when companies mash together their different data models.
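The Conciseness example above can be made concrete. This is a minimal sketch, assuming the table is held as a list of rows with `None` marking empty cells; the function name `to_triples` and the sample data are illustrative.

```python
def to_triples(table):
    """Convert a mostly empty table into (row, column, value) triples."""
    return [(r, c, value)
            for r, row in enumerate(table)
            for c, value in enumerate(row)
            if value is not None]

# Small hypothetical table: 3 rows x 4 columns, 10 of its 12 cells empty.
table = [
    [None, None, 7.5, None],
    [None, None, None, None],
    [2.0, None, None, None],
]
triples = to_triples(table)
print(triples)  # [(0, 2, 7.5), (2, 0, 2.0)]
```

At the scale given in the text, 100 columns by 20,000 rows is 2,000,000 cells; with 90% of them empty, the list holds 200,000 triples.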
  4. Accessibility

    • Convenience of access. Data that an end-user can retrieve directly through a graphic interface is conveniently accessible; data in paper folders on library shelves are not. Neither are databases in which each new query requires the development of a custom report by a specially trained programmer.
    • Usability. High-usability data takes the form, for example, of lists of property names and values that you can easily tabulate into spreadsheets or database tables. From that point on, you can select, filter and summarize it in a variety of informative ways. Low-usability data often comes in the form of a string of characters that you first need to parse, with characters 1 to 5 being one field, 6 to 12 another, etc. Then you need to retrieve the meaning of each of these substrings from a correspondence table, to find that ‘00at3’ means “lime green.”
    • Security. Manufacturing data contain some of the company’s intellectual property, which you need to protect not only from theft but also from inadvertent alterations by unqualified employees. You must also implement security efficiently, so that its procedures do not slow down qualified, authorized employees who need access to the data.
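The low-usability format described under Usability can be sketched as follows, using the two field positions given in the text (characters 1 to 5 and 6 to 12). The field names, the sample record, and the fallback value are hypothetical; only the 00at3 entry of the correspondence table comes from the text.

```python
# Field layout: name -> (first character, last character), 1-based and inclusive.
LAYOUT = {"color_code": (1, 5), "item_code": (6, 12)}

# Correspondence table; only the '00at3' entry comes from the text.
COLOR_NAMES = {"00at3": "lime green"}

def parse_record(line):
    """Split a fixed-width string into named fields."""
    return {name: line[start - 1:end] for name, (start, end) in LAYOUT.items()}

fields = parse_record("00at3AB12345")  # hypothetical 12-character record
fields["color"] = COLOR_NAMES.get(fields["color_code"], "unknown")
print(fields)
```

Once the string has been turned into named fields, the result is the kind of property list that can be tabulated directly, which is what distinguishes high-usability data in the first place.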

Prof. Mei-Chen Lo’s research on this topic was published in “The assessment of the information quality with the aid of multiple criteria analysis” (European Journal of Operational Research, Volume 195, Issue 3, 16 June 2009, Pages 850-856).