Cars Per Employee And Productivity At Volkswagen Versus Toyota

Seen this morning in a Lean consultant’s blog:

“Two decades later, VW has topped Toyota as the world’s number one automaker, but Toyota generally is considered to be […] far more productive. In 2015, VW employs 600,000 people to produce 10 million cars while Toyota employs 340,000 to produce just under 9 million cars…”

Is it really that simple? VW produces 10 million/600,000 = 16.67 cars/employee/year, and Toyota 9 million/340,000 = 26.47 cars/employee/year. Ergo, Toyota is 60% more productive than VW — that is, if you accept cars/employee/year as an appropriate metric of productivity. Unfortunately, it is a bad metric that can easily be gamed by outsourcing.
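To make the gaming concrete, here is a minimal sketch in Python, with made-up numbers: move employees off the payroll to suppliers, and cars/employee/year improves while output stays exactly the same.

```python
# Hypothetical automaker: 10 million cars/year, 600,000 employees.
cars, employees = 10_000_000, 600_000
print(f"Before outsourcing: {cars / employees:.2f} cars/employee/year")  # 16.67

# Outsource component work: say 200,000 employees move to suppliers.
# Output is unchanged, but the metric improves by 50%.
outsourced = 200_000
print(f"After outsourcing:  {cars / (employees - outsourced):.2f} cars/employee/year")  # 25.00
```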


The Goals That Matter: SQDCM | Mark Graban


Blog post at Lean Blog: “Today is the start of the 2014 World Cup, which means much of the world will be talking about goals. I’m not really a soccer, I mean football, fan but I’m all for goals. In the Lean management system, we generally have five high-level goals. These were the goals taught to us in the auto industry, where I started my career, and they apply in healthcare.”

 

Michel Baudin’s comments:

As I learned it, it was “Quality, Cost, Delivery, Safety, and Morale” (QCDSM) rather than SQDCM. I am not sure the order matters that much. The rationale for grouping Quality, Cost, and Delivery is that they matter to customers, while Safety and Morale are internal issues of your organization, visible to customers only to the extent that they affect the other three.

They are actually dimensions of performance rather than goals. “Safety,” by itself, is not a goal; operating the safest plants in your industry is a goal. In management as taught in school, if you set this goal, you have to be able to assess how far you are from it and to tell when you have reached it. It means translating this goal into objectives that are quantified in metrics.

In this spirit, you decide to track, say, the number of consecutive days without lost time accidents, and the game begins. First, minor cuts and bruises, or repetitive stress, don’t count because they don’t result in the victims taking time off. Then, when a sleeve snagged by a machine pulls an operator’s hand into molten aluminum, the victim is blamed for hurting the plant’s performance.

Similar stories can be told about Quality, Cost, Delivery, and Morale, and the recent scandal in the US Veterans Affairs hospitals shows how far managers will go to fix their metrics.

To avoid this, you need to reduce metrics to their proper role of providing information and possibly generating alarms. In health care, you may measure patients’ temperature to detect an outbreak of fever, but you don’t measure doctors by their ability to keep the temperature of their patients under 102°F, with sanctions if they fail.

Likewise, on a production shop floor, the occurrence of incidents is a signal that you need to act. Then you improve safety by eliminating risks like oil on the floor, frayed cables, sharp corners on machines, unmarked transportation aisles, or inappropriate motions in operator jobs. You don’t make the workplace safer by just rating managers based on metrics.

In summary, I don’t see anything wrong with SQDCM as a list. It covers all the dimensions of performance that you need to worry about in manufacturing operations, as well as many service operations. Mark uses it in health care, but it appears equally relevant in, say, car rental or restaurants. I don’t see it as universal, in that I don’t think it is sufficient in, for example, research and development.

And, in practice, focusing on SQDCM easily degenerates into a metrics game.

See on www.leanblog.org

Metrics in Lean – Chart junk in performance boards and presentations

Manufacturing professionals who read Edward Tufte’s books on visualization may be stunned to discover that their 3D pie charts, stacked bar charts, and green safety crosses are chart junk. These charts are common, both on shop floor performance boards and in management presentations. But they are information-poor, and their decorative elements distract, confuse, and occasionally mislead. The purpose of plotting is not to dazzle, but to discover patterns, understand the underlying phenomena, and communicate with people whose livelihood is affected by these patterns.

For details, see the sections that follow.

Pabulum pies

Following is an example of what Microsoft Support considers to be a pie chart with a “spiffy, professional look,” and a few comments about it.
This chart not only uses buckets of ink to display five data points, it also violates other good design principles:
  1. The legends are remote from their objects, forcing the reader to look in two places to understand each item.
  2. The chart begs a question that it could answer but doesn’t: percentages of what total amount?
  3. The shadow and the light reflection convey no information.
  4. The 3D effect makes the 21% slice appear larger than it actually is.
If we put the legends on their objects, add the total amount as a subtitle, get rid of the vinyl-cushion look, and take a top view so that the wedge sizes are not distorted, we get the following:
Microsoft pie chart example - improved

One feature that now stands out is that the wedges are ordered by decreasing size, except for the last two: “Desserts” is larger than “Beverages” but comes after. It is the example Microsoft uses for training, and they don’t explain the sequence. While this chart is an improvement over the previous version, do we actually need it? A picture may be worth a thousand words but, if what you have to say fits in 10 words, a picture may be overkill. Compare the pie chart with the following table of the data in it:

Microsoft pie chart example - data table

If you compare a pie chart with a sorted table of the data in it, you see that the chart uses orders of magnitude more ink in multiple colors than the table, without telling you anything that isn’t already obvious from the table. By Edward Tufte’s standards, this makes it the kind of chart junk that you should banish from your materials.
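For illustration, a minimal sketch, with invented numbers standing in for the Microsoft example’s data, of the sorted table that makes the pie chart redundant:

```python
# Invented shares for illustration; the Microsoft example's actual values differ.
sales = {"Entrees": 40, "Appetizers": 21, "Salads": 18, "Desserts": 12, "Beverages": 9}

total = sum(sales.values())
print(f"{'Category':<12}{'Share':>7}")
for name, value in sorted(sales.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<12}{value / total:>6.0%}")
```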

3D pie charts are actually worse: they are not only useless but misleading, in that the perspective distorts the apparent size of the wedges. This illustrates another of Tufte’s principles, that a graphic display should not have more dimensions than the data. In a 3D pie chart, the height of the pie is meaningless. The only circumstance in which showing pie thickness would be useful is when showing several pie charts side by side, with both pie diameter and thickness representing parameters of each pie.

Stacks astray

Vertical bar charts, also known as column charts, usually serve to show the evolution of a quantity for which data is available for a sequence of periods. They are used when there are few periods and interpolation between periods does not make sense. If you plot daily sales over a year in a vertical bar chart, the columns will be so densely packed that you will only see the line formed by their tops anyway, and you might as well just use a line plot.
Bar charts - Daily sales as bar versus line plot
On the other hand, if you are plotting monthly sales, you use vertical bars, because interpolation between two points does not give you meaningful intermediate numbers. If you plotted temperature readings taken at fixed intervals, for example on a hospital patient, you would interpolate to estimate the temperature between readings, and therefore you use a line plot rather than vertical bars.
Bar chart of monthly sales

There is nothing objectionable about the ordinary, vertical bar chart. It is simple to produce, easy to read, and provides information that is not obvious in a table of numbers. Stacked bar charts, however, are another matter. They attempt to show the evolution over multiple periods of both an aggregate quantity and its breakdown into multiple categories, and do a poor job of it, especially when a spurious 3rd dimension is added for decoration.
We can return to the Microsoft support web site for a tutorial on stacked bar charts. The building-block look of the Microsoft example may be appropriate for an audience of preschoolers.

Another oddity of the Microsoft example is that it does not follow the convention by which vertical bars are used when the categories on the x-axis represent consecutive time buckets. When the categories are not ordered, you usually prefer horizontal bars, not only because we are used to the x-axis representing time in performance charts but also because a horizontal bar chart has horizontal category labels, readable without tilting your head.

Based on the preceding section, the first question we might ask is whether we need graphics at all, and the answer is yes. When we take a look at the data in table format, even though there are only 16 points, no pattern is immediately apparent:

Stacked bar Microsoft example source table

If we forget the spurious 3rd dimension and the building-block look, and toggle the axes so that the x-axis represents time, we get the following:

Stacked bar Microsoft example 3D removed

From this chart, it is obvious that total sales collapsed from the 2nd to the 4th quarter; it was not obvious from either the table of numbers or the 3D chart Microsoft presented as a model. Even in this form, however, the stacked bar chart is ineffective at answering the immediate follow-up question of whether the decline in sales was more pronounced in some regions than in others. For example, if we try to isolate the Northeast on this chart, we find bars floating at different heights. On the other hand, the answers become visible if you de-stack the bars, as in the following:

Stacked bar Microsoft example unstacked

For example, you can see that sales actually grew from the 1st to the 2nd quarter in both Eastern regions, while declining throughout the year everywhere else. Yes, it takes up more space than the stacked bar, but is that really a concern when, even in a simple example like this one, it lets you see better?
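If you want to reproduce the de-stacking, here is a minimal matplotlib sketch, with invented quarterly numbers in place of the Microsoft data, that gives each region its own panel on a shared scale:

```python
import matplotlib.pyplot as plt

quarters = ["Q1", "Q2", "Q3", "Q4"]
# Invented numbers in place of the Microsoft data (4 quarters x 4 regions).
regions = {
    "Northeast": [90, 95, 60, 40],
    "Southeast": [80, 88, 55, 35],
    "Midwest":   [70, 60, 50, 30],
    "West":      [60, 50, 45, 25],
}

# One panel per region, on a shared y-axis so bar heights are comparable.
fig, axes = plt.subplots(1, len(regions), sharey=True, figsize=(10, 3))
for ax, (name, values) in zip(axes, regions.items()):
    ax.bar(quarters, values)
    ax.set_title(name)
plt.tight_layout()
plt.show()
```

Sharing the y-axis across panels is what keeps the de-stacked bars comparable, which is precisely the property lost when the bars float at different heights in a stack.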

Safety crosses to bear

The safety performance of a plant matters, and not only to the potential victims of accidents. Injuries must be rare events, occur with ever decreasing frequency, and have both immediate countermeasures and permanent solutions to prevent recurrence. In light of this, what kind of graphic summaries would be appropriate to represent the safety performance of a whole plant, or of a shop within it?

A common metric is the number of consecutive days without a lost-time accident. It is a relevant measure, but imperfect in that its use has been known to lead thoughtless managers to blame victims for hurting their department’s performance and pressure them not to report injuries. It is also necessary to show detailed information about each accident, and to categorize injuries, for example, in terms of whether they affected hands, shoulders, feet, etc. and where exactly they occurred.

In light of these considerations, what is industry using? In the US, the National Safety Council uses a green cross in its logo and awards a Green Cross for Safety Medal to one company each year. That makes the green cross a symbol of safety, and it has motivated some to make it the basis of a safety performance tracking chart. You start with a cross shape subdivided into rectangles numbered 1 through 31 and, on each day of the month, you place a magnet on the corresponding spot, which is green if nothing happened, yellow if a minor incident happened, and red for a lost time accident.

It is difficult to think of a chart in the shape of a safety symbol as anything but a gimmick. The green cross has no connection with any requirement we can think of for a chart of safety performance: it does not make it particularly easy to count consecutive days without incident, nor does it carry any information about the nature or location of the accidents. The shape evokes safety but answers none of the key questions about employee safety.

To visualize a sequence of rare events, a technique that comes to mind is the timeline. You arrange events around a centerline that shows the passage of time, as in the following example that summarizes 10 years of the history of the iPhone.

Timeline 10 years of iPhone history

Safety-related events in a factory do not need a 10-year timeline, but possibly a six-month timeline along the following model:

Timeline of sports events, January-June 2010

Note that the time between events is immediately visible, and that each event has some explanation and photographic documentation. For location information, you can pin injury locations on an outline of a human body and a map of the shop floor. While these may not be pleasant charts to consider, they are a means of starting a conversation in the team on what the safety issues are and the means of preventing injuries.

There are, of course, other safety issues, like repetitive stress, that are not associated with discrete events and do not appear on these charts, but they do not appear on the green crosses either.

Recommendation on performance board design

Performance boards come in all sorts of shapes.

As a performance board for a shop floor team, I recommend the following template:

Performance board template

This template has one column per dimension of performance and one row for each type of information, as follows:

  1. The top row is for the latest news, what happened this shift or last.
  2. The second row is for historical trends in aggregate performance.
  3. The third row is for a breakdown of the aggregate into its constituent parts, such as the most common injuries, defect categories, most frequently made products, or the employee skills matrix.
  4. The bottom row is for actions or projects in progress in each of these areas.

Metrics in Lean – Alternatives to Rank-and-Yank in Evaluating People

The July 2012 edition of Vanity Fair had a cover story about Microsoft and the damage caused to the company by its stack ranking system, also known as rank-and-yank. This story caused a flurry of responses in the press, including an article in Forbes defending the practice. The idea is not just that employees should be evaluated individually and ranked, but that the bottom few performers should be mercilessly culled from the work force, thus bringing up the overall level, tightening the performance range, and motivating survivors for the next round. This assumes that the performance of employees can meaningfully be reduced to a single number, and that living in permanent fear of losing your livelihood is the best motivation for improvement.

According to another Forbes article, GE, the company that championed Rank-and-Yank in the 1980s, abandoned it in 2005. Another once celebrated showcase of this method is the notorious Enron, and now Microsoft is under fire because of it.  Yet, according to the same article, the practice is now widespread among large American corporations, hidden under a variety of names. But where is the evidence that it actually works in the long term?

In fact, stack-ranking is the polar opposite of the Lean approach to human resources, and it needs to be said, explained, and repeated. In this spirit, this post covers the following:

  1. What is Rank-and-Yank?
  2. How Rank-and-Yank would apply in the Tour de France.
  3. The effect of Rank-and-Yank on organizational behavior.
  4. How you should evaluate people.

What is Rank-and-Yank?

The Microsoft model is officially called the Vitality Curve. In its latest version, in force since April 2011, the model ranks employees in 5 buckets of pre-defined size:

  • 20% are outstanding
  • 20% are above average
  • 40% are average
  • 13% are fair
  • 7% are poor

In every department, regardless of what it does, the manager is expected to simply rank the members, with far-reaching consequences. All compensation is pre-defined based on the bucket, and employees in the bottom bucket are ineligible to move positions, with the understanding that they will soon be fired.

According to former GE CEO Jack Welch, every team has 20% of A-players, 70% of B-players, and 10% of C-players, who contribute nothing, procrastinate, fail to deliver on promises, and therefore should be fired. This is predicated on the assumption that the performance of individuals follows a bell curve, as in Figure 1:

Figure 1. Stack ranking bell curve (from Nagesh Belludi’s blog)

Underlying Figure 1 are the following assumptions:

  1. Performance is one-dimensional and numeric.
  2. In any given population, it is normally distributed.

Both are obviously false. It is actually difficult to contrive an example where they might hold. For example, you might do the following:

  1. Choose an objectively measurable task, performed individually, such as shoveling dirt.
  2. To perform the task, pick a random sample of the population, to make sure the participants have no special skills.

The quantities of dirt shoveled by individuals might follow a normal distribution, or bell curve. Then you may well be able to group them into A, B, and C categories, and put in the C category the people you would rather not ask to do this task again. And the proportion of people who fail to meet the minimum standard will vary between samples.

There are many circumstances in which you encounter the bell curve, from the distribution of IQ test scores to temperature profiles in a solid in which heat penetrates by conduction from a point source. It doesn’t mean that every variable fits a bell curve. You never assume it does. Instead, you examine the data to determine whether it is a reasonable model, and run statistical tests to confirm it.
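As a minimal sketch of such a check, using scipy on simulated (and deliberately skewed) data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated, deliberately skewed "performance" scores.
scores = rng.lognormal(mean=3.0, sigma=0.5, size=200)

# Shapiro-Wilk tests the null hypothesis that the sample is normally distributed.
statistic, p_value = stats.shapiro(scores)
print(f"Shapiro-Wilk p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Normality rejected: don't force these data onto a bell curve.")
```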

To anyone trained in statistics, the idea of manipulating data to force them onto a given curve is anathema. But that is exactly what Rank-and-Yank does. It mandates that there should be 10% of C-players, in corporate situations where, usually, the following holds:

  1. The people are not a random sample of the population but a group of trained professionals, recruited and vetted for special skills.
  2. They work in teams, not as individuals in parallel.
  3. Performance is multidimensional. There is no single, objective performance metric with which to rank them.

According to Jack Welch, A-players are as follows:

“The As are people who are filled with passion, committed to making things happen, open to new ideas from anywhere, and blessed with lots of runway ahead of them.  They have the ability to energize not only themselves, but everyone who comes in contact with them.  They make business productive and fun at the same time.”

(Jack: Straight from the Gut, page 158)

I am not quite sure what “lots of runway ahead” means, and there are many jobs, including in management, that must be done but that no sane person would consider fun. Otherwise, there are domains in which you can expect the recruitment process to have selected entire teams of A-players. It does not mean that you cannot or should not evaluate them, but Rank-and-Yank may not be the best way to do it.

Rank-and-Yank applied to the Tour de France

Tour de France riders on the Champs Elysees

Known as the toughest bicycle race on the planet, the Tour de France has the following special characteristics that make it usable as a metaphor for performance ranking in other businesses:

  1. The participants are not a random sample of the population but the best riders in the world.
  2. They are measured individually by their total time through all the stages of the race. It is an objective, numeric metric, and the only one that matters in the end.
  3. They work in teams during the race.

Individual rider performance

The 2012 Tour de France covered 2,173 miles in 20 stages. The winner, also known as the yellow jersey, was Englishman Bradley Wiggins, who covered this distance in a total of 87 hours, 34 minutes and 47 seconds. The last finisher, known as the red lantern, arrived 3 hours, 57 minutes and 36 seconds later, meaning that it took him 4.5% longer than the winner. Through the paving stones of the North and the mountain passes of the Alps and the Pyrenees, the yellow jersey averaged 24.82 mph; the red lantern, 23.74 mph. These numbers are not only high but close.
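These figures are easy to verify; a quick sketch of the arithmetic, using the distance and times quoted above:

```python
miles = 2173
winner_h = 87 + 34/60 + 47/3600    # 87h 34m 47s total time
gap_h = 3 + 57/60 + 36/3600        # red lantern finished 3h 57m 36s later

print(f"Gap: {gap_h / winner_h:.1%} longer than the winner")   # 4.5%
print(f"Yellow jersey: {miles / winner_h:.2f} mph")            # ~24.8
print(f"Red lantern:   {miles / (winner_h + gap_h):.2f} mph")  # ~23.7
```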

Figure 2 shows a histogram of how long behind the yellow jersey each rider was, also known as his “gap,” in 10-minute bins. It clearly takes too much imagination to see a bell curve.

Figure 2. Histogram of rider gaps in 10-minute bins, Tour de France 2012

We could just leave it at that and conclude right away that the gaps are not normally distributed. Just to make sure, let us call in the heavy statistical artillery. Another way to examine the data is through the cumulative distributions of the gaps, as in Figure 3, for the 2012 and 2011 Tours de France. These curves are built on the raw data, which avoids the aliasing due to bin size in histograms. The actual data are in burgundy, and the normal distributions in blue. Fitting the normal model involves taking the average and standard deviation of the actual data from the 153 riders for 2012 and 166 from 2011. Theoretically, all tests should therefore be based on the Student t-distribution rather than the normal distribution. The consensus of statisticians, however, is that it makes no difference when you have more than 50 data points.

Figure 3. Cumulative distribution of gaps in Tour de France performance

The actual data for both years have curves that are sufficiently similar to assume it is not a coincidence, and neither fits the normal distribution. This is not obvious at first sight, but it is when you consider the ends and apply the back-of-the-envelope test. In the 2012 race, for example, the normal distribution model shows a probability of 1.7% for a rider to beat the Yellow Jersey and 4.7% to lose to the Red Lantern. With the normal model, the probability that 153 independent riders will all be behind the Yellow Jersey and ahead of the Red Lantern is $(1-1.7\%-4.7\%)^{153} = 0.0039\%$, which shows that the model does not fit the data.
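The back-of-the-envelope test itself fits in a few lines, using the 2012 numbers from the text:

```python
p_beat_yellow = 0.017   # normal-model probability of beating the yellow jersey
p_lose_to_red = 0.047   # normal-model probability of losing to the red lantern
riders = 153

p_all_between = (1 - p_beat_yellow - p_lose_to_red) ** riders
print(f"P(all {riders} riders between the extremes) = {p_all_between:.4%}")  # ~0.004%
```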

There are two possible reasons for the gaps not being normally distributed that immediately come to mind:

  1. The Tour de France riders are not a random sample of human beings riding bikes but the best in the world.
  2. They race in teams, not individually.

While records for cycling speed or endurance are readily available on the web, data on ordinary riders are not, although you might expect cities with millions of bicycle commuters to have some. As a consequence, I have not been able to check their speed distributions. On the other hand, information about teams can be retrieved from the Tour de France website.

Team performance

Although riders are ranked individually, they work in teams during the race. In 2012, they were in 22 teams, named after corporate sponsors. Each team has one star rider, considered a contender, whom all other riders are expected to support. For example, the supporting riders take turns in front of the contender so that he can ride in their wake. With the energy they have left, the supporting riders can draw attention to themselves by winning stages, or by escaping ahead of the pack for an hour or two during a stage.

Bradley Wiggins’s team in 2012, SKY PROCYCLING, had kept only one rider from the 2011 team. On the other hand, the team of 2011 winner Cadel Evans, BMC Racing Team, kept 6 of its 9 riders for 2012, and we can compare their performances year-to-year, as in Figure 4:

Figure 4. BMC Racing Team results in 2011 and 2012

While the 2011 winner fell behind, all other returning team members improved their standings, and in particular the last one. In 2011, Marcus Burghardt was only three positions and 12 minutes ahead of the red lantern; in 2012, he placed 58th, 1 hour and 43 minutes behind the winner. Had Rank-and-Yank been applied, he would have been branded a C-player.

In Jack Welch’s words, “C-players are non-producers. They are likely to “enervate” rather than “energize”, according to Serge Hovnanian’s model. Procrastination is a common trait of C-players, as well as failure to deliver on promises.”  But can this description possibly apply to a professional rider who finishes the Tour de France? Following is how CyclingNews described Burghardt’s performance in 2011:

“One of Cadel Evans’ domestiques at the 2011 Tour de France, Marcus Burghardt, was instrumental in the Australian’s overall victory. The tall German Classics rider is a powerful rouleur and set the pace for the BMC Racing Team whenever it was needed, protecting the squad’s sole leader throughout the three-week race. Having succeeded in the team’s goal to win the Tour de France, Evans gave one of the plush lions that go with the yellow jersey to Burghardt as a present for his baby girl. This will be one of the 28-year-old’s greatest accomplishments as a rider, […]”

The overall time metric obviously does not tell the whole story. What other ways are there to prove a rider’s value? The Tour de France offers two consolation prizes: the green jersey for the points classification and the polka-dot jersey for the best mountain climber. Escapes during stages are not recognized by the Tour de France because they have no effect on the race itself if the pack catches up before the end of the stage. They are, however, valued by team sponsors, whose logos on team jerseys are exposed to TV cameras during escapes. But none of the above indicates how good a team player Marcus Burghardt was.

Here we go from the objective measurement of sport performance, like the total race time, to the subjective assessment that a rider was a good team player because others say he was, and it leads us back to what happens in businesses other than sports.

The effect of Rank-and-Yank on behavior

Your work force may range from an R&D lab populated with PhDs representing the world’s top talent in the domain to a crew of previously unknown day laborers recruited that morning. The notion that the same performance evaluation model could apply throughout a large company is absurd on the face of it and we should not have to even discuss it. Since, however, it is done anyway, it is worth pondering the impact it has on attitudes and behavior.

Rank-and-Yank turns work life into a permanent game of high-stakes musical chairs. Where there are clear metrics, as in shoveling dirt or racing, at least individuals can predict and affect outcomes. Everywhere else, the evaluations are based on subjective assessments, prone to favoritism, and perceived by employees as unfair. The Vanity Fair article shows employees protecting themselves against this process with strategies that are counterproductive for the company, including, for example:

  • Nurturing their image with respect to anyone with influence on their ranking instead of producing output.
  • Withholding key information that might help colleagues, while pretending to collaborate with them.

There is anecdotal evidence that Rank-and-Yank has the same effects in other companies. On a subject this political, with variations across companies in the way the approach is implemented, objective assessments are not easy to find. The overwhelming majority of blog posts are negative, but bloggers are a self-selected group.

How you should evaluate people

In Out of the Crisis, Deming branded evaluation by performance, merit rating, or annual review of performance as a deadly disease. Unfortunately, in every company, management has to make decisions on employee raises, bonuses, stock grants, promotions, transfers,… based on some form of evaluation. Since Deming was also critical of Management By Objectives (MBO), I think his statement on reviews should be viewed in this context.

The less formal the evaluation is, the more arbitrary it is, and the more unfair it is perceived to be, leading employees to distrust the company, disengage from it, and leave it at the first opportunity. It was central to Alfred P. Sloan‘s approach to make General Motors into a company managed through processes that employees could trust. In the 1920s, it was a contrast with rival Ford, which had no such processes in place until a generation later, when the Whiz Kids — including Robert McNamara and Arjay Miller — implemented them after World War II.

The challenge

In general, a fair and objective review process is essential if you want to retain talent, which means that you cannot have a learning organization without it. A learning organization is an organization whose members learn. The term can easily mislead us into thinking that organizations hold knowledge outside the heads of their members. They don’t. All they have otherwise is a library of data, which turns into knowledge only when a member reads it and checks it against reality, a process that restarts from scratch with every new hire. For an organization to learn and retain knowledge, it must retain its people, and it won’t retain them without a review process that they not only perceive to be fair but that also helps them manage their careers, so that they know what they can accomplish by staying and aligning their interests with those of the company.

While Rank-and-Yank is overly harsh, a system can also fail by giving good reviews to everybody, making the company like Garrison Keillor’s Lake Wobegon, “where all the children are above average.” It is a trap that conflict-averse managers easily fall into, leading to complacency and demoralizing high-achievers.

The challenge, therefore, is to have a review process that recognizes high performance and encourages all to emulate it, without creating an environment where employees work in constant fear of losing their jobs for no reason they can understand.

Who gets reviewed?

Companies like Boeing, Unilever, GE, or the old GM are known to have programs to nurture young employees identified as having executive potential, giving them access to special training and rotating them through various jobs designed to help them broaden their understanding of the company and grow a network of relationships. The rest of the professional staff receives less attention, and shop floor workers none at all, unless they have problems.

One rarely discussed feature of the Toyota Production System (TPS) is the way in which it extended the review and career planning process to all permanent employees, including production operators. Not even the Japanese literature says much about it, possibly because it is less unique to Toyota than other aspects of TPS.

Lifetime employment, limited to men under 60, is a practice introduced in Japan after World War II. The men working under this arrangement were supplemented by temporary contractors, retirees, and young women expected to marry and resign. The social origin of candidates was also a factor in hiring, in that, for example, large companies were reluctant to hire heirs to family businesses, based on the assumption that they would leave. Labor mobility was one-way, from larger to smaller organizations. You could leave government service for large companies, or large companies for smaller ones, but not the other way around.

Since the 1990s, the economic crisis has in some places broken the lifetime employment practice, while the evolution of Japanese society has weakened gender, social and ethnic discrimination. But the legacy of the postwar system is that most companies and their employees still have a stronger bond than in the US. The system is self-perpetuating within Japan in many ways. Since there are few opportunities for mid-career job-hopping to an equally prestigious company, few employees can do it, and those who do have a hard time integrating with cohorts that have joined right after school. Also, rotation between jobs inside the company develops skills and relationships that make an employee more valuable to the company but not outside of it.

Long-term retention actually requires you to plan the careers of employees at all levels of education and talent, and that means production operators along with engineers or managers. If a permanent employee cannot keep up with the work, you have to find another way to use his or her talents.

The review process and its consequences

Fairness may be in the eye of the beholder but, in this case, the beholder matters. Figure 5 shows a case where the employee does not agree with the process, which we can imagine was a meeting behind closed doors between the pointy-haired manager and his superiors. The process should instead be open, and give the employee the opportunity to defend his or her record.

Figure 5. Dilbert’s 10/13/1996 strip

In Lean management, along with leading Kaizen activity, performance review and career planning is a major task for supervisors. To keep the personal chemistry between an employee and a supervisor from unduly influencing outcomes, formal reviews are carried out by panels rather than individuals. In some companies, the employee selects one of the panel members. Formal reviews occur at least twice a year, and cover both hard and soft skills, meaning both the technical ability to carry out tasks and the managerial ability to work with others, contribute to or lead projects, and communicate.

Among production operators, you encounter people who, outside of work, may have management roles in clubs and religious or political organizations. At work, however, daily operations do not afford them much opportunity to show this kind of talent, but Kaizen activity does, and this is another reason it is so essential.

It is a multidimensional assessment and includes an analysis of each employee’s ambitions and of the means of realizing them. That a company should do this for anyone is alien to the Silicon Valley culture, and any that did would be a prime target for raiding by competitors. But it is a tradition for professional staff in older, established companies even in the US. Doing it for a machinist who wants to become a maintenance technician, however, is unheard of. It involves rotating such a person around the shop to get acquainted with a broad variety of machines, while arranging for the required training and certification in mechanics, lubrication, power and controls over, say, 5 years, to allow a smooth transition into the role of maintenance technician.

Of course, the opportunity to fulfill an employee’s ambition is contingent on the availability of positions. Not all members of a cohort can be promoted into the management hierarchy, and even technical positions are in limited supply. For the professional staff, the best known, and worst, response is to turn managers who have been passed over into “window people,” who have a desk near a window to wait for retirement without any specific assignment. Another option is to send them as “fallen angels” to take on management responsibilities at subsidiaries or suppliers. On the shop floor, some operators remain in production for their entire career, but the review process ensures that the abilities they accumulate are recognized in their wages and, towards the end of their careers, through titles like Master Craftsman, that honor their contributions and designate them as a source of advice for younger workers.

Another consideration is the scope of differences in compensation introduced by the review process. The compensation offered by a company is a means of communication. Extreme differences in rewards for small differences in performance happen in a race like the Tour de France, but should not among people who work together inside a company, lest they focus on competing with one another instead of with rival companies. The rewards must be large enough to communicate the company’s appreciation and encouragement to keep up the good work, yet small enough that employees do not turn into bounty hunters. Effective teamwork should be acknowledged by team rewards; outstanding individual performance within a team, by an additional individual reward.

Metrics in Lean – Part 6 – Productivity of a Quality Assurance department

Response to a question in the IndustryWeek manufacturing network on LinkedIn:

For metrics of quality itself, see Metrics in Lean – Part 2, but you specifically asked about the productivity of your Quality Assurance (QA) department, meaning that you are interested in its efficiency rather than its effectiveness. It is a legitimate concern, as long as you don’t pursue efficiency at the expense of effectiveness, which is common but not in the best interest of the organization as a whole.

What is the job of the QA department? I can think of several functions it should have:

  1. Assuring compliance with external mandates. If you have customers that require you to be ISO-9001 certified, the QA department is responsible for making sure you are.
  2. Training the production work force in in-process quality management, including the use of go/no-go gauges, response to quality alarms, problem-solving, and mistake-proofing.
  3. Auditing the quality practices in all operations.
  4. Monitoring the quality of incoming materials and working with suppliers as needed to enhance it.
  5. Leading the response to quality problem reports from customers, including immediate countermeasures and root cause analysis.

For each task you want to measure, identify a measure of volume, such as the number of quality problem reports (QPR) per period. Then find a practical way to measure the resources consumed by this task. For example, it may not be practical to track accurately the number of hours technicians spend on a QPR, but you know how many people within the department are dedicated to, or involved with, QPRs. Then the number of QPRs per technician per period can measure the productivity or efficiency of the department at this task.
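As a minimal sketch, with invented numbers, of the resulting metric:

```python
# Invented numbers for illustration.
qprs_per_month = 24        # volume: quality problem reports handled per period
technicians_on_qprs = 3    # headcount dedicated to, or involved with, QPRs

print(f"{qprs_per_month / technicians_on_qprs:.1f} QPRs/technician/month")  # 8.0
```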

Metrics in Lean – Part 5 – Lead times and inventory

As in the article on Nike, lead time reduction is often touted as Lean’s greatest achievement. Improvements in productivity, quality, and new product introduction time make it into the first paragraph, but lead times get the headline. Lead time metrics are popular, but there are many different lead times that are of interest, and they are not easy to define, measure, or interpret. Inventory is easier to measure and, under stable conditions, Little’s Law provides some information about average lead times, while details on lead time distributions can be inferred from inventory age statistics. In addition, inventory metrics are useful in their own right, to support improvements in storage and retrieval, materials handling, and supply chain management.

What do we call a lead time?

In its most general form, the lead time of an object through a system is the interval between the time it enters the system and the time it leaves it. The objects can be material, like manufacturing work pieces, or data, like a passport application, or a combination of both, like a pull signal or a customer order for a manufactured product, which starts as data and ends as materials accompanied with delivery documents. The system is any part of your business that you can draw a boundary around and monitor objects going in and coming out.

Order fulfillment lead time is, in principle, well defined as the interval between the placement of the order and receipt of the goods by the customer. The objects are orders and the system is comprised of your own company and its distribution network. There is no ambiguity as to the time the order is placed when a consumer confirms the checkout for an on-line cart, nor is there about the time of delivery when it is recorded by the delivery service. On the other hand, business-to-business transactions frequently do not have that clarity, particularly on large, long-term orders. If a customer places an order for 12 monthly deliveries, strictly speaking, the order fulfillment lead time is one year, which is not terribly useful. Then you have to identify a trigger point to start the clock for each delivery. If you use Kanbans or other types of pull signals, they can be used for this purpose.

Inside the company, if production work is organized in jobs or work orders, you can measure the time between release by production control and completion, and that gives you an internal, manufacturing lead time. If you produce the same item every day one piece at a time, you can record the times through a production line by serial number. But the existence of scrap and rework makes this a bit more complicated. The parts that do not make it out of the line tie up capacity and slow down the others, and the parts that are reworked undergo extra processing, adding to the lead time and increasing its variability. When calculating lead times for a process, however, you should only consider the units that make it out as good product.

An assembly process involves multiple flows merging. It is somewhat like a river basin, and there is often no objective criterion for deciding which of two merging rivers is the main stream and which one the tributary. Usually, the smaller river is designated as the tributary, but there are exceptions. By this criterion, for example, the river crossing Paris should be called the Yonne rather than the Seine, because, as French kids learn in primary school, where they merge upstream from Paris, the Yonne is the larger of the two (See Figure 1).

Figure 1. A tributary larger than the mainstream

Likewise, in assembly, working upstream from the end, you have to decide which flow is the main line and which are feeder lines. It is a simple call when you are mounting a side rear-view mirror on a car, but less obvious when you are mating an engine and transmission assembly with a painted body.

Measuring lead times directly

Tracing the lead time of completed units through multiple operations requires a history database with timestamps for all the relevant boundary crossings. This is only available if there is a tracking system collecting this data. If the data collection is manual, it often occurs at the end of each shift, meaning that we know in which shift the event occurred but not at what time within that shift, as shown in Figure 2. To measure lead times in weeks, it is accurate enough; in hours, it isn’t.

Figure 2. Operators recording production activity at the end of the shift

The direct measurement of lead times is also problematic with rapidly evolving, high-technology processes that have manufacturing lead times in months. If a part goes through 500 operations in 4 months, its actual lead time will commingle data about current conditions at the last operation with four-month-old data about the first one. Since then, three additional machines may have been brought on line, two engineering changes to the process may have taken place, and volume may have doubled, all of which makes the old data useless. It would be more useful to have a snapshot of the lead time under current conditions, with the understanding that it is an abstraction because, as the process keeps evolving, no actual part will ever make it from beginning to end under these exact conditions. To get such a snapshot, we need to measure lead times for individual operations, which raises the question of how we can infer lead times for an entire process from operation lead times.

Average lead times add up, extreme values don’t

When we have lead times for operations performed in sequence, we want to add them up like the times between stations on a train line, to get a lead time for the entire process. For each object flowing through, it always works: the time it needs to go through operations 1 and 2 is the sum of its times through Operation 1 and Operation 2. When we look at populations of objects flowing through, it is a different story. The averages still add up by simple arithmetic. The problem is that the average is usually not what we are interested in. When accepting customer orders, we want to make promises we are sure to keep, which means that our quotes must be based not on lead time averages but on upper bounds, so that, in the worst-case scenario, we can still deliver on time. We need to be careful, however, because extreme values are not additive. The worst-case scenario for going through operations 1 and 2 is not the sum of the worst-case scenario through Operation 1 and the worst-case scenario through Operation 2.

That it is wrong to add the worst-case times is easiest to see when considering two operations in sequence in a flow line, when variability in the first operation causes you to maintain a buffer between the two. If one part takes especially long through the first operation, then the buffer will be empty by the time it reaches the second, and its time through it will be short, so it makes no sense to add the longest possible times for both operations. If it takes you an unusually long time to go through passport control at an airport, your checked luggage will be waiting for you at the carousel and you won’t have to wait for it. In other words, the times through both operations are not independent.

A job shop is more like a farmers’ market (see Figure 3). At each operation, each part waits in line with other parts arriving by different paths, like customers in queues at different stalls in the market. Then the times through each operation are independent, and the extreme values for a sequence of operations can be calculated simply, but not by addition. This is because, for independent random variables, it is the squares of the standard deviations that are additive, and not the standard deviations themselves. If operations 1 and 2 have independent lead times with standard deviations $\sigma_1$ and $\sigma_2$, the standard deviation for both in sequence is $\sqrt{\sigma_{1}^{2}+\sigma_{2}^{2}}$. If the first operation takes 8±2 hours and the second one 5±1 hours, the sequence of the two will take 13±2.2 hours, not 13±3 hours as would be obtained by just adding the extreme values (see the sketch below). It is like the hypotenuse of a right triangle versus the sum of its two other sides. Of course, the more operation “standard lead times” you add in this fashion, the worse the lead time inflation. For details on this phenomenon, see Measuring Delivery Performance: A Case Study from the Semiconductor Industry, by J. Michael Harrison et al. in Measures for Manufacturing Excellence, pp. 309-351.

Figure 3. A farmers’ market and a machining job shop
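A minimal sketch of this calculation, using the 8±2 and 5±1 hour example above:

```python
import math

means = [8, 5]     # operation lead time means, in hours
sigmas = [2, 1]    # standard deviations, assumed independent

total_mean = sum(means)
total_sigma = math.sqrt(sum(s**2 for s in sigmas))  # variances add, sigmas don't

print(f"Correct:  {total_mean} ± {total_sigma:.1f} hours")  # 13 ± 2.2
print(f"Inflated: {total_mean} ± {sum(sigmas)} hours")      # 13 ± 3 by naive addition
```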

Interpretation and use

Process lead times look like task durations in a project, and it is tempting to load them into a program like Microsoft Project, treat operations like tasks with finish-to-start constraints, and use the project planning and management tools to perform calculations on the production process. Unless you are building a one-of-a-kind prototype or a rarely ordered product, however, manufacturing a product is not a project but an activity involving flow. As a consequence, order fulfillment lead times are usually much shorter than process lead times. You can order a configured-to-order computer on-line and get it delivered within 3 to 5 days, but the processor in it takes months to make. When a manufacturer explains that the business is purely “make-to-order,” it doesn’t usually mean starting by digging for iron ore to make a car. The game is to decide where in the process to start, and how to have just the materials you need when you need them, in order to fill customer orders promptly without hoarding inventory.

Lean manufacturers achieve short lead times indirectly by doing the following:

  1. Engineering production operations for stability and integration into one-piece flow lines. This is never achieved 100% but is always pursued.
  2. Designating products as runners, repeaters, or strangers, and laying out production lines and logistics differently for each category.
  3. In mixed-flow lines, applying SMED to reduce changeover times.
  4. Applying leveled-sequencing (heijunka) as needed in scheduling production lines.
  5. Using a pull system to manage both in-plant and supply-chain logistics.

In an existing factory, the challenge of reducing lead times is often mistakenly perceived as involving only production control and supply chain management, in actions limited to production planning, production scheduling, and materials procurement. Because materials in the factory spend so little of their time being worked on, improving production lines is viewed at best as secondary, and at worst as a waste of time, because “production has already been optimized.” In reality, it is nothing of the kind, and one key reason materials wait so long is dysfunctional production. Improve the productivity and flexibility of manufacturing operations, lay out your lines to make it easiest to do what you do most often, and you see the waiting times melt away, creating the opportunity to use more sophisticated methods in support of production. This perspective is a key difference between Lean Manufacturing and the Theory of Constraints or the approaches proposed in the academic literature on operations management, such as Factory Physics.

Theoretical versus actual lead time

In analyzing lead times, we separate the time the object spends waiting from the time it is being worked on, making progress towards completion. This serves two purposes:

  1. Establishing the lower limit of lead time under current process conditions. The fastest the object can move through the system is if it never waits.
  2. Understanding the ratio of working to waiting, and making it a target for improvement.

The dual timelines at the bottom of a Value Stream Map bear lead time and process time data. The sum of these process time data is often called theoretical lead time or theoretical cycle time, after which actual performance is often described as “We’re running at five times theoretical…” How exactly the theoretical lead time is calculated is usually not specified.

What I recommend to calculate a meaningful theoretical lead time for a product is a thought experiment based on the following assumptions:

  1. The plant has no work to do, except making one piece of the product.
  2. The following is ready and available for this one piece:
    • Materials
    • Equipment
    • Jigs, fixtures, and tools
    • Data, like process programs or build manifests
    • Operators
  3. Transportation between operations is instantaneous.
  4. There is no inspection or testing, except where documented results are part of the product, as is common in aerospace or defense.

Under these conditions, the theoretical lead time is what it would take to make the unit from start to finish. These assumptions have the following consequences:

  1. Since we assume the equipment is ready, no setup time is involved.
  2. The process time through an operation involving a machine includes loading and unloading.
  3. If a machine processes a load of parts simultaneously, the processing time for a single part is the same as for a load. If an oven cures 100 parts simultaneously in two hours, it still takes two hours to cure just one part.

On the other hand, there are cases for which our assumptions still leave some ambiguity. Take, for example, a moving assembly line with 50 stations operating at a takt time of 1 minute. If we treat it as one single operation, our product unit will take 50 minutes to cross it from the first station to the last. On the other hand, to make just one part, the line does not have to move at a constant pace. The amount of assembly work at each station has to be under 1 minute, and the part can be transferred to the next station as soon as this work is done, with the result that it takes less than 50 minutes to go through the whole line. You can make an argument for both methods, and the assumptions are not sufficiently specific to make you choose one over the other. What is important here is that the choice be explicit and documented.
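As a sketch of the two methods on this 50-station example, with invented per-station work contents under the 1-minute takt:

```python
import random

random.seed(0)
takt = 1.0  # minutes
# Invented work content per station, each under the 1-minute takt.
work = [random.uniform(0.7, 0.95) for _ in range(50)]

# Method 1: treat the whole line as one operation moving at takt pace.
print(f"At takt pace:      {takt * len(work):.1f} min")  # 50.0

# Method 2: pass the single piece along as soon as each station finishes.
print(f"Sum of work times: {sum(work):.1f} min")         # less than 50
```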

The difference between the actual and theoretical lead times can then be viewed as gold in the mine, to be extracted by improvements in all aspects of operations except the actual processes. If you find a way to mill a part twice as fast, you change the theoretical lead time itself. Because the theoretical lead time is usually a small fraction of the actual lead time, say, 5 hours versus 2 months, managers often assume that it makes no sense to focus on finding ways to reduce these 5 hours to 4, and that they should instead focus on the time the materials spend waiting. But, as said above, the two are not independent. Faster processing melts away the queues, and reducing the theoretical lead time by 20% may reduce the actual lead time by 50%.

“Days of inventory” and Little’s Law

Inventory levels are often expressed in terms of days of coverage. 200 units in stock, consumed at the rate of 10 units/day, will last 20 days. Therefore, 200 units is equivalent to 20 days of inventory, and this is what the average lead time for one unit will be. This is the method most commonly used to assign durations to “non-value added activities” on Value Stream Maps.

We should not forget, however, that the validity of this number is contingent on consumption. If it doubles, the same number of parts represents 10 days instead of 20. If consumption drops to zero, then the 200 parts will cover the needs forever.

When, on the basis of today’s stock on hand and today’s throughput, a manager declares that it is “20 days of inventory,” it really means one of the two following assertions:

  1. If we keep producing at the exact same rate, the current stock will be used up in 20 days, which is simple arithmetic.
  2. If the production rate and available stock fluctuate around the current levels, the item’s lead time from receiving into the warehouse through production will fluctuate around 20 days, by Little’s Law.

In either one of these interpretations, we have an “instantaneous” lead time that is an abstraction, in the sense that no actual part may take 20 days to go through this process, just as a car going 60 mph this second will not necessarily cover 60 miles in the next hour. In the case of a car, we all understand it is just a speedometer reading; for days of inventory, it is easy to draw conclusions from the number that go beyond what it actually supports.

Inventory, throughput, and average lead times

As we have seen, lead times are difficult to measure directly, because doing so requires you to maintain and retrieve complete histories for units or batches of units. Inventory is easier to measure, because you only need to retrieve data about the present. First, the inventory database is much smaller than the production history databases. Second, because inventory data are used constantly to plan, schedule, and execute production, they are readily accessible and their accuracy is maintained. For similar reasons, throughput data are also easier to access than history, and more accurate. As a result, with all the caveats on assumptions and range of applicability, Little’s Law is the easiest way to infer average lead times.
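As a minimal sketch, with invented figures: Little’s Law gives the average lead time as work in process divided by throughput.

```python
# Invented figures for illustration.
wip_units = 200    # units on hand between receiving and consumption
throughput = 10    # units consumed per day

# Little's Law: average lead time = WIP / throughput, around stable conditions.
print(f"Average lead time: {wip_units / throughput:.0f} days")  # 20
```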

Inventory age analysis and lead time distribution

In some cases, inventory data lets us infer more than just average lead times. Often, the inventory database contains the date and time of arrival into the warehouse by unit, bin, or pallet. If it cannot be retrieved from the database, it is often available directly from the attached paperwork in the warehouse. Then, for a relevant set of items, we can plot a histogram of the ages of the parts in the warehouse, which, as a snapshot of its state, may look like Figure 4.

Figure 4. Inventory age snapshot for one item

If there is “always at least 5 days of inventory,” then we can expect no part to leave the warehouse until it is at least 5 days old, and seek an explanation for the short bar at age 3 days. The bar to the right shows outliers, parts that have been passed over in retrieval for being too hard to reach, or possibly have gone through a special quality procedure. In any case, they are an anomaly that needs investigating.

If the warehouse operations are stable in the sense that there is a lead time distribution, then, if we set aside obvious outliers and take the averages of multiple snapshots taken at different times of the day, the week or the month as needed to smooth out the spikes associated with truck deliveries, the chart should converge to a pattern like that of Figure 5.

Figure 5. Average of multiple snapshots with outliers removed

If a unit is 9 days old in the warehouse, its time in the warehouse will be at least 9 days. The drop between the columns for 9 and 10 days therefore represents the parts that stay at least 9 days but less than 10. In other words, in proportion to the whole, it gives the probability that a part will be pulled on its 10th day in the warehouse. By taking differences, the age distribution thus gives us the complete distribution of the lead times, as shown in Figure 6.

Figure 6. Lead time distribution inferred from inventory age
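The differencing logic is simple enough to express in a few lines. The following sketch uses made-up counts for the averaged histogram of Figure 5; only the shape of the calculation matters:

```python
# A minimal sketch of inferring a lead time distribution from an averaged
# inventory age histogram, as in Figures 5 and 6. The numbers are invented
# for illustration; index t holds the average count of parts aged t days.

avg_age_counts = [120, 118, 115, 110, 98, 80, 55, 30, 12, 4, 0]

# The drop between ages t and t+1 is the number of parts that stay at
# least t days but less than t+1, i.e. that are pulled on their (t+1)th day.
departures = [avg_age_counts[t] - avg_age_counts[t + 1]
              for t in range(len(avg_age_counts) - 1)]

total = sum(departures)
lead_time_distribution = [d / total for d in departures]

for day, p in enumerate(lead_time_distribution, start=1):
    print(f"pulled on day {day}: {p:.1%}")
```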

Admittedly, this approach cannot always be used. Where it can, it gives us detailed information about lead times at a fraction of the cost of measuring them directly. Even where it cannot, snapshots of inventory age still provide valuable information, much like demographers’ population pyramids, as in Figure 7.

Figure 7. Example of population pyramids

Inventory metrics

To accountants, “resource consumption” is synonymous with cost. As discussed in Part 2, for posting on the shop floor, we need metrics that express performance in the language of things. Depending on circumstances, such substitutes may include the amount of work in process used to sustain production, as a measure of the effectiveness of production management and engineering. When it goes down, it is both a one-time reduction in working capital and a reduction in recurring holding costs. The unit of measure of WIP can be set locally in each work area.

Many companies measure inventory in terms of its dollar value, of the time it would take to consume it, or of the turnover frequency. In doing so, they combine measures of the inventory itself with other parameters, such as the values assigned to inventory by Accounting and an assumed throughput rate. These are legitimate derivative metrics and of interest to management, but when you stand next to a rack on the shop floor, you see pallets and bins, not money, days of supply, or turns. The raw inventory data consists of quantities on hand by item over time, and these should also be used as the basis for simple metrics in the language of things, such as the following:

  • Number of pallets, bins and units on hand by item. This is what Operations has to work with, regardless of item cost.
  • Number of partial containers in store. The presence of “partials” in store is a sign of a mismatch between batch sizes in quantities received and consumed.
  • Floor space occupied by inventory. This is of interest because freed up space can be used to increase production.
  • Accuracy of inventory data. This is usually measured by the percentage of items for which database records agree with reality, as observed through cycle counting.
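To show how little processing such metrics require, here is a toy sketch that computes two of them from hypothetical inventory records; every field name and number below is made up for illustration:

```python
# A minimal sketch, on hypothetical inventory records, of two of the
# shop-floor metrics listed above: partial containers in store, and
# inventory record accuracy from cycle counts. Field names are assumptions.

records = [
    {"item": "A12", "containers": 14, "partials": 2, "on_record": 560, "counted": 560},
    {"item": "B07", "containers": 6,  "partials": 0, "on_record": 240, "counted": 238},
]

total_partials = sum(r["partials"] for r in records)
accurate_items = sum(1 for r in records if r["on_record"] == r["counted"])
accuracy = accurate_items / len(records)

print(f"partial containers in store: {total_partials}")
print(f"inventory record accuracy: {accuracy:.0%}")
```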

As discussed above, inventory is easier to measure than lead time, and much about lead time can be inferred from inventory status, using tools like Little’s Law or age analysis. But this is not a mechanical application of formulas to numbers: we need to be careful about the underlying assumptions and about the extent to which the data supports our conclusions.

Metrics on the web versus manufacturing

Last night, I mingled with 209 internet operations “ninjas” at the meetup of the Large Scale Production Engineering (#lspe) group, on Actionable Metrics, hosted by Yahoo! in Sunnyvale, CA, and heard speakers from collectibles marketplace site etsy, web performance testing service SOASTA, and video streaming service Netflix describe how they use metrics in their activities.

Both in style and content, the presentations were radically different from what I have heard in manufacturing on that subject. The speakers went fast, as if they were in a great hurry to give us as much information as possible prior to returning to work. They were also extraordinarily open about the tools they used and how they used them.

The first feature of their work that struck me was the simplicity of their dashboards. Almost all the charts they monitor are simple line plots of time series, free of the jumble of 3D pie charts, stacked bar charts, and other complicated displays commonly found on manufacturing dashboards. The most elaborate display showed a time series of, for example, login response times, with a dynamically adjusted confidence band, based on a statistical model, to help operations engineers tell significant changes from fluctuations. Yet, even with these features, the meaning of the chart was obvious to an outsider to their business. It is not difficult, for example, to understand a plot of the evolution of login times as the number of users grows or additional security is added.
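The speakers did not detail their statistical models, but the idea is easy to illustrate. One simple way to build such a band, as a sketch only, is a rolling mean plus or minus a few rolling standard deviations:

```python
# A minimal sketch of one way to compute a dynamically adjusted confidence
# band around a response-time series: rolling mean +/- k rolling standard
# deviations. The presenters did not describe their actual models; this is
# only an illustration of the concept.

from statistics import mean, stdev

def rolling_band(series, window=20, k=3.0):
    """Yield (lower, upper) band limits for each point past the first window."""
    for i in range(window, len(series)):
        recent = series[i - window:i]
        m, s = mean(recent), stdev(recent)
        yield m - k * s, m + k * s

# e.g., flag a login time that falls outside its band (made-up data)
times = [100 + (i % 5) for i in range(30)] + [180]
for (lo, hi), t in zip(rolling_band(times), times[20:]):
    if not lo <= t <= hi:
        print(f"alert: {t} ms outside [{lo:.0f}, {hi:.0f}]")
```

Points falling outside the band trigger alerts; fluctuations inside it are treated as noise.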

As engineers tweak and enhance the functions provided on their servers every day, they monitor metrics to see whether the changes do any good, and take immediate action if they don’t. The time within which they can and must react is measured in seconds, not hours or days. Their metrics fall into the following categories:

  1. Customer experience/business. This includes the user experience and its translation into business activity, like the number of page views and the conversion ratio of page views to orders. For a subscription-based service like Netflix, it might include the number of times subscribers visit the site without streaming anything, suggesting that they didn’t find what they were looking for.
  2. Infrastructure. This covers the behavior of the servers and the networks through which user inputs are passed to the applications and outputs returned. This has to do with processor and memory utilization, and with the availability of these resources in the face of very large and varying numbers of user interactions.
  3. Application. These metrics rate the ability of the application software to process user data from the moment it receives them until it sends a response. This includes the speed and quality of concurrent searches or commercial transactions, as well as the protection of user data.

Customer experience translates as is to a manufacturing context, while Infrastructure corresponds to Logistics and Application to Production, all with different time scales. While one of the speakers described response time as “one metric to rule them all,” none of the organizations presenting imposed any standard set of metrics on their engineers. Instead, they provide the engineers with tools to capture the data, compute the metrics, display them on charts, and generate alerts, but it is the engineers’ choice and their responsibility to define the metrics.

Response time is to their world what order fulfillment lead time is to manufacturing, but its importance is much greater. There are sectors of manufacturing where short order fulfillment lead times are a competitive advantage with customers, but also many, particularly with big-ticket items, where other factors trump short lead times. If you are a farmer buying a tractor, for example, you will take one with the features you want over one you can have sooner but that doesn’t have them. The pursuit of short production lead times has to do with other considerations, such as reducing inventory, detecting quality problems faster, or reducing the obsolescence cost of engineering changes. When providing services on the web, on the other hand, any slowdown in response causes customers to balk, and, internally, any increase in transaction processing time can cause servers to saturate and customer response times to explode.

For these engineers, capturing one metric is a matter of adding one line of code in their software, and they use open-source tools to generate plots and dashboards. It is not difficult to do. The hard part is identifying the right metrics. The Netflix speaker was quite aware of this. After someone told him “No matter what you measure, it’s useful,” he charted the evolution over one year of the proportion of user IP addresses ending with an even number.
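To make “one line of code” concrete, here is a minimal sketch using the open-source statsd client for Python; the choice of library, the names, and the surrounding structure are mine, not something the speakers showed:

```python
# A minimal sketch of instrumenting an application with one added line per
# metric, using the open-source `statsd` Python client (my choice of tool,
# not one named in the talks). Metrics go to a statsd daemon over UDP.

from statsd import StatsClient

statsd = StatsClient("localhost", 8125, prefix="webapp")

def handle_login(user):
    statsd.incr("logins")                    # the one added line: count an event
    with statsd.timer("login.response_time"):  # or: time a block of existing code
        ...  # existing login logic goes here
```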

Metrics in Lean – Part 4 – Gaming and how to prevent it

As massively practiced today, Management-by-Objectives (MBO) boils down to management imposing numerical targets on a few half-baked metrics, cascading this approach down the organization, and giving individuals a strong incentive to spin their numbers. It is a caricature of the process Peter Drucker recommended almost 60 years ago, and he deserves no more of the blame for it than Toyota does for what passes as Lean in most companies that claim to implement it.

A non-manufacturing example of decadent MBO is the French police under former president Sarkozy, which was tasked by the government with decreasing the crime rate by 3%/year while increasing the proportion of solved cases. According to the French press, this was achieved by gaming the numbers. The journalists first latched on to a reported yearly decrease in identity theft, which seemed unlikely. They found that police stations routinely refused to register complaints about identity theft on the grounds that the victims were the banks, not the individuals whose identities were stolen. A retired officer also explained how crimes were systematically downgraded, with, for example, an attempted break-in recorded as the less severe “vandalism.”

The fastest way the police had found to boost the rate of solved cases was to focus on violations detected through their own actions, such as undocumented aliens found through identity checks. The solution rate for such crimes is 100%, because they are simultaneously discovered and solved. The challenge is to generate just enough such cases to boost the solution rate without increasing the overall crime rate… To achieve this result, packs of police officers stalked train stations in search of offenders, as reported both by cops who felt this was not what they had joined up to do, and by innocent citizens who complained about being harassed for their ethnicity.

In organizations affected by this kind of gaming, members work to make numbers look good rather than to fulfill their missions. It is a widely held belief that you get what you measure and that people will always work to improve their performance metrics, but this is a simplistic view of human nature. This behavior does not come naturally. On their own, schoolteachers focus on educating children, not boosting test scores, and production operators on making parts they can take pride in. It takes heavy-handed management to turn conscientious professionals into metrics-obsessed gamers, in the form, for example, of daily meetings focused entirely on the numbers, backed up by matching human resource policies on retention, promotion, raises, and bonuses.

But enough about police work. Let us return to Manufacturing, and list a few of the most common ways of gaming metrics in our environment:

  1. Taking advantage of bad metrics. As discussed in The Staying Power of Bad Metrics, many metrics commonly used in manufacturing are poorly defined, providing gaming opportunities, such as outsourcing in order to increase sales per employee.
  2. Stealing from the future. In sports, nothing is more dramatic than a game won by points scored in its last seconds. The bell rings right after the ball spirals into the basket, and the Cinderella team wins the championship. In business, the end of an accounting period is the end of a game and, as it approaches, sales scrambles to close last-minute deals and manufacturing to ship a few more orders. This is what Eli Goldratt called the “hockey stick effect.” Of course, it is achieved by moving up activities that would otherwise have taken place a few days later, at the beginning of the next accounting period. As a consequence, the beginning of the period is almost quiescent. Not much is going on, but it will be made up at the end…
  3. Redefining 100%. Many ratios, by definition, top out at 100%. A machine cannot run 25 hours/day, and a manufacturing process cannot produce more good parts than the total it makes, which is why equipment uptime and first-pass yield cannot exceed 100%. Any result under 100%, however, invites questions on how it could be improved. A common way to fob off the questioners is to decree, for example, that a particular machine could not possibly be up more than 85% of the time, and to redefine the scale so that 85% uptime counts as 100% performance; a numerical sketch of this rescaling follows this list. For production rates in manual operations, the ratio of an operator’s output to a work standard is often used instead of process times or piece rates. Such ratios have the advantage of being comparable across operations, and they are not capped at 100%. But their relevance depends on the work standard, and, when everybody in a shop performs at 140% of standard, chances are that the standards are engineered for this purpose.
  4. Leveraging ambiguity. Terms like availability, cycle time, or value added are used with different meanings in different organizations, creating many opportunities to game the metrics. If the product’s market share went from 1% to 2% in the first quarter, you announce that it doubled; if it went back to 1% in the second quarter, you report that it went down by a mere 1%, quietly switching from relative percentages to percentage points.
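Here is the “redefining 100%” rescaling from item 3 above, as a minimal numerical sketch; the decreed ceiling and the names are made up for illustration:

```python
# A minimal sketch of the "redefining 100%" game: rescaling uptime so that
# a decreed 85% ceiling reports as 100%. The ceiling value is invented.

CEILING = 0.85  # decreed "maximum possible" uptime for this machine

def reported_performance(actual_uptime: float) -> float:
    return actual_uptime / CEILING

print(f"{reported_performance(0.80):.0%}")  # 80% actual uptime reports as 94%
```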

Why do people who, in other parts of their lives, may be model citizens, engage in such behaviors, ranging from spinning to cheating? One answer is that, with what MBO has degenerated into in many companies, management is co-opting metrics gamers into its ranks. It is not that gaming is human nature, but that such systems actively weed out those who don’t engage in it. Changing such habits in an organization is obviously not easy.

Assume, for example, that your goal is to be competitive by having a skilled work force, and that your analysis shows that it requires employees to stay for entire careers so that what they learn at the company stays in the company. You then apply a number of different methods to make it happen:

  • Communications. You make sure that all employees know what you are doing.
  • Career planning. You have human resources develop a plan with all employees so that each one knows what he or she can aspire to by staying with the company.
  • Organized professional development. You organize formal training, on-the-job training, and continuous improvement to provide opportunities for employees to develop the skills they need to execute their plans.
  • Job enrichment. You redesign the jobs themselves to make more effective use of each employee’s talents.

If employees appreciate their jobs and have long-term career prospects within the company, few of them should quit or make excuses not to come to work on any given day, and the results should be visible in lower employee turnover and absenteeism.

The metrics are there to validate the approaches taken to reach the goal, but the goal is not to improve the metrics. It is a subtle difference. If you have the flu, you have a fever, but your goal is to heal, not just to bring down the fever. Once you are healed, your fever will be gone, and the decrease in your temperature is therefore a relevant indicator of your healing process, but it is not the healing process. If bringing down the fever were the goal, you could “game” your temperature and bring it down without healing. This distinction existed in Drucker’s original writings about MBO, but got lost in implementation.

So, what can you do to prevent metrics gaming? Let us examine four strategies:

  1. Review the metrics themselves. Use the requirements listed in my first post on this subject. You may not be able to completely game-proof your metrics, but you can at least make sure that they make sense for your business and are not trivially easy to game.
  2. Decouple the metrics from immediate rewards. Piece rates used to be the most common form of payment for production work, but have almost entirely vanished in advanced manufacturing economies, and been replaced by hourly wages. Performance expectations are attached, but there is no direct link to the amount produced in a given hour of a given day. There are many reasons for this evolution:
    • The pace of work is often set by machines or by a moving line, rather than by the individual.
    • The best performance for the plant is not necessarily achieved by every individual maximizing output at all times.
    • More is expected of all individuals than just putting out work pieces, including training or participating in improvement activities.

    One consequence of this decoupling is that time studies are easier and more accurate than in a piece rate environment. The same logic applies in management: the more direct the link between metrics and individual evaluations, the more intense the gaming. Don’t make the metrics the key to promotions or to prizes representing a substantial part of a manager’s compensation. Use them only as indicators to inform discussions on plans and strategies.

  3. Increase the measurement frequency. The longer the reporting period, the more opportunities it offers for gaming the metrics by stealing from the future, and the more pronounced the hockey stick effect. Conversely, you can reduce it by measuring more often, and eliminate it by monitoring continuously, as is done, for example, by the electronic production monitors that keep a running tally of planned versus actual production in a line during a shift; a minimal sketch of such a tally appears at the end of this post. Periods exist in accounting because of the limitations of data processing technology at the time the accounting methods were developed. In the days of J.P. Morgan, closing the books was a major effort that a company could only undertake every so often. In 2012, there is no technical impediment to the “anytime close,” but the publication of periodic reports continues by force of habit. Metrics in the language of things, as well as the language of money, can be monitored continuously.
  4. Have third parties calculate the metrics. In principle, chips are counted more accurately by agents with no stake in where they may fall. In practice, this is not only expensive but does not always produce the desired result. It is the approach used in Management Accounting. A plant’s accounting manager, or comptroller, is not chosen by the plant manager; he or she reports directly to corporate finance, and has no motivation to humor the plant manager. This is a double-edged sword because, with neutrality, comes a distance from the object of the measurement that may cause misunderstandings, and Management Accounting leaders like Robert Kaplan, Orrie Fiume, or Brian Maskell have been struggling with the challenge of providing relevant, actionable information to managers for the past 30 years. Outside of Accounting, for metrics in the language of things, the closest you can come to having a third party produce the measurements is to have a computer system do it, based on automatic data acquisition. There is then no opportunity for gaming, but the issues of relevance are as acute as in Management Accounting.
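As promised under strategy 3, here is a minimal sketch of the running planned-versus-actual tally kept by electronic production monitors; the takt time and the counts are assumptions for illustration:

```python
# A minimal sketch of a running planned-versus-actual production tally,
# as displayed by electronic production monitors during a shift. The takt
# time and the actual count source are made up for illustration.

TAKT_SECONDS = 60  # one unit planned every 60 seconds (assumed)

def planned_count(seconds_into_shift: float) -> int:
    """Units the line should have produced so far, at the planned takt."""
    return int(seconds_into_shift // TAKT_SECONDS)

# e.g., 2 hours into the shift, with 112 units actually counted:
elapsed = 2 * 3600
actual = 112
planned = planned_count(elapsed)
print(f"planned: {planned}, actual: {actual}, gap: {actual - planned}")
```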