Capacity Planning For 1st-Responders

Early in my consulting career, working with Kei Abe, I was surprised to hear him make seemingly contradictory recommendations about the organization of maintenance in a small auto parts plant and in a large car assembly plant. In both, the managers were thinking of splitting the maintenance group into smaller teams, each dedicated to a production line. In the parts plant, Kei Abe talked them out of it; in car assembly, he supported it. When I asked him why, he just said: “The parts plant is too small.”

A Simplistic Model

Reflecting on this later, I worked out a simple but nonetheless useful model of this situation to help me understand the difference. As Kyle Harshberger pointed out in his comments, it is not just simple but simplistic and underestimates the number of technicians needed.  We will return to this issue below. Simplistic though it is, it still makes the point that it is beneficial to split up the maintenance department in some cases but not others.

The Maintenance department in the parts plant had 20 members, who would have been split into 4 groups of 5. Assume that the technicians work in parallel, have the same skills and, combining planned and unplanned tasks, are each busy 80% of the time, independently of one another. Then, in a group of 5, they will all be busy simultaneously

$latex 0.8^5 = 33\% $

of the time. As a consequence, one equipment failure in three has no technician available to respond. It is not an acceptable level of service.

If instead, the whole plant is served by a central group of 20 technicians, each busy 80% of the time, they will all be busy simultaneously

$latex 0.8^{20}= 1\% $

of the time. It means that there is at least one technician available to respond immediately to 99% of the failures, unquestionably better service. With technicians busy only 50% of the time, a subgroup of 7 would have at least one member available 99.2% of the time but, overall, you would then need 7×4=28 technicians to respond as promptly as the central group of 20.

In the car assembly plant, Central Maintenance had 300 technicians serving the entire plant, with limited interchangeability, because they served shops like stamping, painting, machining or assembly with different technologies. The managers were proposing to divide them into 6 groups of 50. If we do the same calculation as above, we find that, in each group of 50, with all technicians busy 80% of the time, they will all be simultaneously busy

$latex 0.8^{50}= 14\, ppm $

of the time.
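As a quick check of the arithmetic above, here is a minimal Python sketch. It assumes, as the simplistic model does, that each technician is busy independently of the others with the same probability. The function name `prob_all_busy` is made up for this illustration.

```python
def prob_all_busy(technicians: int, utilization: float) -> float:
    """Probability that every technician is busy at a random instant.

    Assumes each technician is busy independently with the same
    probability (the simplistic model's assumption).
    """
    return utilization ** technicians


# (label, group size, utilization) for the cases discussed in the text
cases = [
    ("parts plant, subgroup of 5", 5, 0.8),
    ("parts plant, central group of 20", 20, 0.8),
    ("parts plant, subgroup of 7 at 50% utilization", 7, 0.5),
    ("car assembly plant, group of 50", 50, 0.8),
]

for label, n, u in cases:
    p = prob_all_busy(n, u)
    print(f"{label}: all busy {p:.4%} of the time ({p * 1e6:,.0f} ppm)")
```

The same function can be used to sweep over group sizes and see how quickly the gain from adding technicians flattens out.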

As a metric of the effectiveness of the Maintenance department, we could use the probability of having at least one technician available when a machine breaks down, and call it “Responsiveness.” Then the number of technicians needed to achieve a given responsiveness while keeping the technician utilization at a given level would be:

$latex Number\, of \, Technicians = \frac{{\log (1 - Responsiveness)}}{\log (Utilization)} $
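The formula generally returns a fraction, so in practice you round up to the next whole technician. A minimal Python sketch, under the same independence assumption as the simplistic model, could look like this. The name `technicians_needed` is hypothetical.

```python
import math


def technicians_needed(responsiveness: float, utilization: float) -> int:
    """Smallest group size n such that utilization**n <= 1 - responsiveness.

    Both arguments are fractions strictly between 0 and 1.
    Simplistic model: technicians are busy independently of one another.
    """
    if not (0 < responsiveness < 1 and 0 < utilization < 1):
        raise ValueError("responsiveness and utilization must be in (0, 1)")
    # Both logarithms are negative, so the ratio is positive.
    return math.ceil(math.log(1 - responsiveness) / math.log(utilization))


print(technicians_needed(0.99, 0.8))  # strict 99% target at 80% utilization
print(technicians_needed(0.99, 0.5))  # strict 99% target at 50% utilization
```

Note that a strict 99% target at 80% utilization yields 21 rather than 20, because 0.8 to the power 20 is about 1.15%, slightly above 1%. At 50% utilization, the result is the subgroup of 7 mentioned earlier.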

There is, however, a difference between a concept that helps you understand why splitting a maintenance group is a good idea only if the resulting subgroups have a critical mass and a formula that you could apply to determine what this mass is. The key points, relevant to any group that responds to random events, are as follows:

  • The group cannot simultaneously be small, efficient, and responsive. The larger the group, the more you are able to keep its members busy with useful work while retaining the ability to respond to events.
  • Increasing the size of the group has diminishing returns. In the parts plant case, the difference in responsiveness between 4 groups of 5 and one group of 20 was substantial, between failing to respond on average to one out of every three events and one out of every 100 events. In the car assembly plant, on the other hand, going from 6 groups of 50 to one group of 300 meant going from never experiencing a failure to respond in your lifetime to never experiencing it for a millennium.

The managers in both the parts and car assembly plants had a different, and valid, reason to want to distribute maintenance resources among the different shops inside the plant. They wanted the technicians in close contact and constant communication with the production people they supported and reporting to the production managers. They didn’t want the technicians to be strangers dispatched by a distant bureaucracy. But size matters in the extent to which you can pursue this goal.

How To Fix The Simplistic Model

The model is based on taking snapshots of technician availability at different times, noticing, for example, that the number of technicians available to respond to an emergency fluctuates around 4 out of 20 — whether this is due to many failures fixed quickly, or few that take a long time, with planned tasks mixed in. From this, I concluded that each technician has an 80% utilization on these tasks and, therefore, is available to be assigned to a new one 20% of the time. And I did not resist the temptation to simply multiply these utilizations to calculate the probability that all are busy.

The flaw in this reasoning, as Kyle pointed out, is that all technicians being busy means that at least as many machines need attention as there are technicians, not exactly as many, which is what multiplying the utilizations assumes. In other words:

$latex Prob(All\,technicians\,busy)= Prob(Number\,of\,machines\,down \geq Number\,of\,technicians) $

Let us assume, for example, that 20 technicians look after 100 machines and that, in steady state, an average of 16 require attention. Then the average number of technicians required is 16, giving an 80% utilization. Assuming all machines fail independently and at the same rate, the number of machines down at any time follows a binomial distribution with 100 trials and a probability of 16% each, and we can calculate the probability of having at least 20 of them down. In this example, the result is that you would need 26 technicians rather than 20 to bring the probability of all of them being busy at the same time under 1%. You can verify it using the BINOMDIST function in Excel or, in R, by summing dbinom values or calling pbinom.
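For readers who prefer Python to Excel or R, here is a minimal sketch of the same check using only the standard library. It assumes the numbers of the example: 100 machines, each down independently with a probability of 16%. The function names `prob_at_least` and `min_technicians` are made up for this illustration.

```python
from math import comb


def prob_at_least(k: int, machines: int, p_down: float) -> float:
    """P(X >= k) for X ~ Binomial(machines, p_down).

    X is the number of machines needing attention at a random instant.
    """
    return sum(
        comb(machines, i) * p_down**i * (1 - p_down) ** (machines - i)
        for i in range(k, machines + 1)
    )


def min_technicians(machines: int, p_down: float, max_all_busy: float) -> int:
    """Smallest number of technicians n such that
    P(machines down >= n), the probability that all are busy,
    falls below max_all_busy."""
    if max_all_busy <= 0:
        raise ValueError("max_all_busy must be positive")
    # With machines + 1 technicians the tail is empty, so the loop always returns.
    for n in range(1, machines + 2):
        if prob_at_least(n, machines, p_down) < max_all_busy:
            return n
    raise AssertionError("unreachable")


MACHINES = 100  # from the example above
P_DOWN = 0.16   # 16 machines down on average out of 100

print(f"All 20 busy: {prob_at_least(20, MACHINES, P_DOWN):.1%}")
print(f"Technicians for under 1%: {min_technicians(MACHINES, P_DOWN, 0.01)}")
```

Under these assumptions, the probability that all 20 technicians are busy comes out at roughly 17%, far above the 1% suggested by multiplying utilizations, and the smallest group that brings it under 1% is 26.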

In addition, the vision of maintenance as simply dispatching an individual technician whenever a machine breaks down is itself simplistic. The more complex issues encountered in real maintenance include the following:

  1. Maintenance tasks on large machines often require teams rather than individuals. There is little one person can do alone in the maintenance of a nuclear reactor, a chemical plant, or a steel mill.
  2. Not all equipment failures are of equal severity. Some put lives at risk, some stop production, and others can be worked around until help is available. In principle, this should be systematically assessed through FMEA (Failure Mode and Effects Analysis), but it is not as common a practice as the literature suggests.
  3. Technicians are generally not fully interchangeable. Even when they are cross-trained on mechanical, piping, electrical power and controls issues, an individual’s familiarity with specific machines and their operators makes a difference.

A Few Conclusions

A comprehensive policy for maintenance in a factory may include the following:

  • 5S to maintain an uncluttered, clean and tidy environment for production equipment. From a maintenance standpoint, it rules out many causes of equipment problems.
  • Autonomous Maintenance to delegate routine tasks like changing light bulbs and frayed cables to production operators. This frees up maintenance technicians for more complex tasks.
  • Systematic reviews of equipment and maintenance specs. This may lead, for example, to replacing machines whose yearly maintenance cost exceeds their replacement cost, eliminating checks on devices that are no longer used, or reinforcing checks on, and re-engineering, devices that have recently failed.
  • Cross-training of technicians in the different trades involved in maintenance work, to build a corps of “general practitioners”/first responders who can solve the majority of the problems.
  • The assignment of “primary care technicians” by area in the plant, to be called in priority from that area, but backed up by the other technicians when unavailable. The point is to have it both ways: a technician who is familiar with and close to the people he or she is supporting, and technical resources available in an emergency.
  • Having higher-level specialists to help with the problems the technicians can’t handle. Depending on circumstances, these may be equipment engineers employed by the company or outside contractors.

The responsiveness discussion above applies to first responders assumed to have the same skills. Not all maintenance work is responding to emergencies. It is a combination of planned and unplanned work, but even planned maintenance on a machine takes more or less time depending on what the technicians discover when they open it up. You know when a biyearly check on a machine is planned to start, but not whether it will take a day or a week.

Capacity planning for everything a complete Maintenance Department does is more complex and beyond the scope of this post. Managers can be tempted to apply in maintenance the capacity planning methods that are effective in production. The nature of the work and the objectives pursued, however, are sufficiently different to require a different approach. You plan production work with the objective of keeping each operator busy as close to 100% of the takt time as you can. If you attempt this with maintenance, you will be unresponsive.
