Tradition, Tradition, Data Visualization, and Pareto Charts

Some of the standard charts used in manufacturing for decades don’t meet today’s criteria for effective visualization. But using them is now a tradition; they are taught in school and their value is unchallenged, but it is time to challenge it. If we were to see these charts for the first time in 2015, would we consider the information they provide useful, and would we want to use the classical formats? This post suggests answers in the case of the venerable Pareto chart.

Tradition, Tradition

The Pareto principle, also known as the law of the vital few and the trivial many, or the 80/20 law, is widely known and accepted, although not always acted upon, but what about the chart? What do we actually need it for, and is its traditional format the best way to present it?

Flaws of the Standard Pareto Chart

From Nancy R. Tague’s The Quality Toolbox, Second Edition, ASQ Quality Press, 2005, pages 376-378.

This chart has two y-axes with different scales. If you are used to it, you know that the scale on the left is for frequencies and the one on the right for the cumulative percentage. On the other hand, if you see the chart for the first time, you may be confused.

In addition, the only conceivable reason to use a line plot for the cumulative values is to make them visually different from the individual values, which is not necessary if they are on a separate chart. Generally, it is confusing to use a line plot when the x-axis has categories, because line plots imply interpolation. A line between two data points assigns a y-value to every point between the two x-values, but there are no points between “Certificate error” and “Certificate missing.”

Pareto chart produced with ggplot2

In fact, ggplot2, a software package that its developer Hadley Wickham describes as “elegant graphics for data analysis,” refuses to produce a chart with two y-axes having different scales. Wickham considers it to be such a bad design that he makes it impossible. The closest you can come to it with ggplot2 is a stack of two charts with the same x-axis, each with its own y-axis. At the cost of taking up more space, it removes any ambiguity as to which axis applies to which data series.

The traditional chart also uses vertical bars for categories that may have long names, and of which there may be many more shown than the five in Nancy Tague’s example. If a product has 100 defect opportunities, it may take the top 20 to account for 80% of the defects, still leaving a large bar for the “other” category. As a result, you may be forced to align the captions vertically or at a slanted angle, making you twist your neck to read them. What would be lost by using horizontal bars instead and having all the text in the same, horizontal orientation? Nothing, other than conformance with the textbook format.
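As a rough illustration of the horizontal layout, the following Python sketch prints one text bar per category, sorted by decreasing count, with every label in the same horizontal orientation. The counts are hypothetical and are not taken from Tague's example; the function name `horizontal_bars` and the `width` parameter are likewise made up for this sketch.

```python
# Hypothetical defect counts, for illustration only (not from Tague's book).
defects = {
    "Certificate error": 12,
    "Certificate missing": 9,
    "Invoice error": 42,
    "Wrong quantity": 5,
    "Packing list error": 27,
}


def horizontal_bars(counts, width=40):
    """Return one text line per category, sorted by decreasing count."""
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    label_width = max(len(name) for name, _ in ranked)
    top = ranked[0][1]  # the largest count gets the full bar width
    lines = []
    for name, count in ranked:
        bar = "#" * max(1, round(width * count / top))
        lines.append(f"{name:<{label_width}}  {bar} {count}")
    return lines


chart_lines = horizontal_bars(defects)
print("\n".join(chart_lines))
```

Because the labels sit to the left of the bars, long category names cost horizontal space but never need to be rotated.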

Fundamental Questions

Vilfredo Pareto (1848-1923)

Fixing presentation details, however, does not answer the fundamental question of what we need the chart for. When Vilfredo Pareto studied the distribution of farmland ownership in Italy in 1900, he did not use what, four decades later, Juran would call a “Pareto chart” or “Pareto diagram.”

Thomas Piketty

In 2013, when Thomas Piketty published Capital in the 21st Century, his 950-page study of economic inequality through the ages, he did not include a single Pareto chart. Instead, he focused on the evolution over time of the fraction of the total wealth, income, or inheritance in a society that went to the top 0.1%, 1% or 10%, compared with what went to the bottom 50%. And he zeroed in on the structure of specific fortunes, such as those of Bill Gates, Liliane Bettencourt, or the endowments of Harvard, Stanford, and other major American universities.

In manufacturing, unequal distributions of a quantity among categories are not limited to quality defects. They are also found in product demand, inventory, and maintenance costs, and what we would really like to know is the following:

Corrado Gini (1884-1965)

  1. How unequal the distributions actually are. When you look at actual data, the “80/20 law” often turns out to be anywhere between 90/10, with the top 10% accounting for 90% of the quantity in question, and 70/30. As this clearly makes a difference, we would like a metric to quantify the degree of inequality in the partition of a quantity among categories. For this, we can turn to another Italian economist, Corrado Gini, and his Gini index.

  2. Which categories are at the top and which ones in the long tail. This is essential for the analysis to be actionable. In quality, we want to know the most frequent defects in order to focus on eliminating them. With product demand and inventory, we want to identify runners, repeaters, and strangers to drive the allocation of manufacturing and logistical resources. In maintenance, we want to decide which machines to replace or overhaul…
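The second question amounts to ranking the categories and tracking how much of the total the top ones account for. A minimal Python sketch, assuming the data is available as a mapping from category name to value; the category names and numbers below are hypothetical:

```python
# Hypothetical values per category, for illustration only.
counts = {"A": 50, "B": 25, "C": 15, "D": 6, "E": 4}


def ranked_with_cumulative_share(counts):
    """Sort categories by decreasing value and attach the cumulative share."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    rows = []
    running = 0
    for rank, (name, value) in enumerate(ranked, start=1):
        running += value
        rows.append((rank, name, value, running / total))
    return rows


rows = ranked_with_cumulative_share(counts)
for rank, name, value, share in rows:
    # Right-aligned numbers, cumulative share shown as a percentage.
    print(f"{rank:>3}  {name:<10} {value:>6}  {share:6.1%}")
```

The output is a ranked list rather than a chart: the top rows are the vital few, and the cumulative share column shows where the long tail begins.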

Quantifying Inequality with a Gini Index

The Gini index, or coefficient, is explained simply. It is based on the Lorenz curve, which is essentially a mirror image of the cumulative line in the Pareto diagram. It is calculated as follows:

  1. Rank the population from poorest to richest.
  2. Calculate cumulative wealth by rank.
  3. Divide the rank by the population size to obtain percentiles.
  4. For each rank n, divide the cumulative wealth by the total wealth to get the percentage of total wealth possessed by the n poorest members.
  5. Plot the percentage of wealth possessed as a function of percentile.

If the wealth is equally distributed, then the graph is the diagonal, meaning that the bottom 10% of members own 10% of the national wealth, the bottom 40% own 40% of the national wealth, and so on. With inequality, the Lorenz curve is convex and lies below the diagonal, as in the following picture. The Gini index is the ratio of the pink area to the total area under the perfect equality curve. It varies from 0 in the case of perfect equality to 1 if one member has all the wealth while all others have nothing.

Definition of the Gini index (Source: mrunal.org)
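The five steps and the area ratio can be sketched in a few lines of Python. This is one possible implementation, not a reference one: it assumes a non-empty list of non-negative values with a positive total, approximates the area under the Lorenz curve with trapezoids, and uses the fact that the area under the diagonal is 1/2, so that the ratio reduces to 1 - 2 × (area under the Lorenz curve). The function names are made up for this sketch.

```python
def lorenz_points(values):
    """Return (percentile, cumulative share) points, poorest member first."""
    ordered = sorted(values)          # step 1: rank from poorest to richest
    n = len(ordered)
    total = sum(ordered)
    points = [(0.0, 0.0)]
    running = 0
    for rank, value in enumerate(ordered, start=1):
        running += value              # step 2: cumulative wealth by rank
        # steps 3 and 4: rank as a percentile, cumulative wealth as a share
        points.append((rank / n, running / total))
    return points                     # step 5 would plot these points


def gini_index(values):
    """Gini index as 1 - 2 * (area under the Lorenz curve)."""
    points = lorenz_points(values)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2   # trapezoid between two points
    return 1 - 2 * area


print(gini_index([1, 1, 1, 1]))    # equal shares
print(gini_index([0, 0, 0, 10]))   # one member has everything
```

With a finite population of n members, this calculation gives 1 - 1/n rather than exactly 1 when a single member holds everything, so the upper bound of 1 is only approached as the number of members or categories grows.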

The data used to generate the horizontal Pareto chart above were in fact for 745 categories, most of which had only one occurrence, and the Gini index for this set is 88%, whereas it is on the order of 60% if the top 20% of the categories account for 80% of the total value. By comparison, the Gini index of income was 41% in the US in 2010, 53% in Brazil in 2012, and 26% in Sweden in 2005. With the defects, the high figure strongly suggests that there are too many categories, most likely with the same defects under different names, and that the organization should perhaps aggregate them into fewer, higher-level categories.
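The 60% figure for an 80/20 split can be checked by hand under a simplifying assumption that is not in the data: the value is spread evenly within the bottom 80% of categories and evenly within the top 20%. The Lorenz curve then consists of two straight segments through (0, 0), (0.8, 0.2), and (1, 1). With A the area under that curve and G the Gini index:

```latex
A = \tfrac{1}{2}(0.8)(0.2) + \tfrac{1}{2}(0.2 + 1)(0.2) = 0.08 + 0.12 = 0.20,
\qquad
G = \frac{\tfrac{1}{2} - A}{\tfrac{1}{2}} = 1 - 2A = 0.60
```

Any unevenness within either group bends the Lorenz curve further below these two segments, so 60% is the lowest Gini index compatible with an 80/20 split, and real 80/20 data would come out somewhat higher.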

Thomas Piketty does not like the Gini index, because the same value can correspond to too many different patterns, and you usually have to investigate further. If, however, you consider the product mixes of different factories, their Gini indices in various measures like bulk, unit counts, or sales give you information about the breakdown of their activities among products. If it is 40% in one and 70% in another, you can expect the first one to have its work more evenly divided among products than the other, and it should prompt you to take a closer look.

Classification of All Items

In quality, knowing which defects occur most frequently is the starting point in deciding which ones to work on first. You have to balance their frequencies with the amount of time and effort needed to eliminate them and arrive at figures of merit in terms of reduction in fraction defective per unit time or per dollar spent, which requires further data collection and analysis. In production, as explained in Lean Assembly, you want a dedicated production line for each Runner, lines dedicated by product family for Repeaters, and a small job-shop for Strangers. In logistics, you want to set up your plan for every part so that you make the most frequently used items always available and most easily accessible, etc.

To support such actions, you need lists, rather than charts, preferably with each item linked to the type of action you may take. The following is an example about sales volumes for products measured in linear feet, like hoses. Note that, as long as the numbers are right-justified, the digits in Total footage visually act like a horizontal bar chart in logarithmic scale, since one more digit on the left reflects a 10-fold increase.

Runners, Repeaters, and Strangers example

Before the age of electronic spreadsheets, school children were taught to always right-align integers so that columns could be added without mistakes. Today, this motivation has disappeared, and tables of numbers are often presented center-aligned, but center-aligned columns of numbers do not double as horizontal bar charts as right-aligned columns do.
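The effect of right alignment is easy to reproduce. The Python sketch below formats a ranked list with the numbers right-justified in a fixed-width column, so that each additional digit on the left marks a 10-fold increase. The product names and footages are hypothetical, as is the function name `right_aligned_table`.

```python
# Hypothetical products and total footages, for illustration only.
footage = {
    "Hose A": 1250000,
    "Hose B": 98400,
    "Hose C": 7310,
    "Hose D": 52,
}


def right_aligned_table(values):
    """Return ranked text rows with the numbers right-justified."""
    ranked = sorted(values.items(), key=lambda kv: kv[1], reverse=True)
    name_width = max(len(name) for name, _ in ranked)
    number_width = max(len(str(v)) for _, v in ranked)
    return [
        f"{name:<{name_width}}  {v:>{number_width}d}"
        for name, v in ranked
    ]


table_lines = right_aligned_table(footage)
print("\n".join(table_lines))
```

In a monospaced font, the left edge of the number column steps inward as the values shrink, which is the bar-chart effect described above. Thousands separators are left out on purpose so that the width of each number depends only on its digit count.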

For the purpose of making policy decisions about a mix of 718 products, a Pareto chart is not only unnecessary but impractical. A plain table is the better tool.