Probability For Professionals

In a previous post, I pointed out that manufacturing professionals’ eyes glaze over when they hear the word “probability.” Even outside manufacturing, most professionals’ idea of probability is that, if you throw a die, you have one chance in six of getting an ace. Two thousand years ago, the emperor Claudius wrote a book on how to win at dice, but the field of inquiry has broadened since, producing results that affect business, technology, science, politics, and everyday life.

In the age of big data, all professionals would benefit from digging deeper and becoming, at least, savvy recipients of probabilistic arguments prepared by others. The analysts themselves need a deeper understanding than their audience. With the software available today in the broad categories of data science or machine learning, however, they don’t need to master 1,000 pages of math in order to apply probability theory, any more than you need to understand the mechanics of gearboxes to drive a car.

It wasn’t the case in earlier decades, when you needed to learn the math and implement it in your own code. Not only is it now unnecessary, but many new tools have been added to the kit. You still need to learn what the math doesn’t tell you: which tools to apply, when and how, in order to solve your actual problems. It’s no longer about computing, but about figuring out what to compute and acting on the results.

Following are a few examples that illustrate these ideas, and pointers on concepts I have personally found most enlightening on this subject. There is more to come, if there is popular demand.


If Talk Of Probability Makes Your Eyes Glaze Over…

Few terms cause manufacturing professionals’ eyes to glaze over like “probability.” They perceive it as a complicated theory without much relevance to their work. It is nowhere to be found in the Japanese literature on production systems and supply chains, or in the American literature on Lean. Among influential American thinkers on manufacturing, Deming was the only one to focus on it, albeit implicitly, when he made “Knowledge of Variation” one of the four components of his System of Profound Knowledge (SoPK).


The Value Of Surveys: A Debate With Joseph Paris

Joseph Paris and I debated this issue in the Operational Excellence group on LinkedIn, where he started a discussion by posting the following:

“Riddle me this…

If the Japanese way of management and their engagement with employees is supposedly the best, yielding the best result, why is there such a lack of trust among employment across the spectrum; employers, bosses, teams/colleagues. From Bloomberg and EY.

Japanese Workers Really Distrust Their Employers

Lifetime employment sounds like a great thing, but not if you hate where you work. That seems to be the plight of Japanese “salarymen” and “office ladies.” Only 22 percent of Japanese workers have “a great deal of trust” in their employers, which is way below the average of eight countries surveyed, according to a new report by EY, the global accounting and consulting firm formerly known as Ernst & Young. And it’s not just the companies: Those employees are no more trusting of their bosses or colleagues, the study found.


Where Have The Scatterplots Gone?

What passes for “business analytics” or “business intelligence” (BI), as advertised by software vendors, is limited to basic and poorly designed charts that fail to show interactions between variables, even though the use of scatterplots and elementary regression is taught to American middle schoolers and to shop floor operators participating in quality circles.

Yet the software suppliers seem to think these tools are beyond the cognitive ability of executives. Technically, scatterplots are not difficult to generate, and there are even techniques to visualize interactions involving more than two variables, like Trendalyzer animations or 3D scatterplots. And, of course, visualization is only the first step: you usually need other techniques to base any decision on data.


Tradition, Tradition, Data Visualization, and Pareto Charts

Some of the standard charts used in manufacturing for decades don’t meet today’s criteria for effective visualization. But using them has become a tradition: they are taught in school, and their value goes unchallenged. It is time to challenge it. If we were to see these charts for the first time in 2015, would we consider the information they provide useful, and would we want to use the classical formats? This post suggests answers in the case of the venerable Pareto chart.


The bell curve: “Normal” or “Gaussian”?

Most discussions of statistical quality refer to the “Normal distribution,” but “Normal” is a loaded word: calling one distribution “Normal” implies that all the others are, in some way, abnormal. The “Normal distribution” is also called “Gaussian,” after the discoverer of many of its properties, and I prefer the latter as the more neutral term. Before Germany adopted the Euro, its last 10-Mark note featured the bell curve next to Gauss’s face.

The Gaussian distribution is widely used, and abused, because its math is simple, well known, and wonderful. Here are a few of its remarkable properties:

  1. It applies to a broad class of measurement errors. John Herschel arrived at the Gaussian distribution for measurement errors in the position of bodies in the sky simply from the fact that the errors in x and y should be independent, and that the probability of a given error should depend only on the distance from the true point.
  2. It is stable. If you add independent Gaussian variables, or take any linear combination of them, the result is also Gaussian.
  3. Many sums of variables converge to it. The Central Limit Theorem (CLT) says that, if you add variables that are independent, identically distributed, and have a distribution with a mean and a standard deviation, their sum, once centered and scaled, converges towards a Gaussian. This makes it an attractive model, for example, for the order quantities of a product coming independently from a large number of customers.
  4. It solves the equation of diffusion. The concentration of, say, a dye introduced into clear water through a pinpoint is a Gaussian that spreads over time. You can experience it in your kitchen: fill a white plate with about 1/8 in of water, and drop the smallest amount of mint syrup you can in the center. After a few seconds, the syrup forms a cloud in the water that looks very much like a two-dimensional Gaussian bell shape for concentration. And in fact it is, because the Gaussian density function solves the diffusion equation, with a standard deviation that rises with time. The same happens in gases, but too quickly to observe in your kitchen, and in solids, but too slowly.

  5. It solves the equation of heat transfer by conduction. Likewise, when heat spreads by conduction from a point source in a solid, the temperature profile is Gaussian… The equation is the same as for diffusion.
  6. Unique filter. A time series of raw data, whether temperatures, order quantities, or stock prices, usually has fluctuations that you want to smooth out in order to bring to light the trends or cycles you are looking for. A common way of doing this is replacing each point with a moving average of its neighbors, taken over windows of varying lengths, often with weights that decrease with distance, so that a point from 30 minutes in the past counts for less than the point from 1 second ago. And you would like to set these weights so that, whenever you enlarge the window, the peaks in your signal are eroded and the valleys fill up. A surprising and relatively recent discovery (1986) is that the only weighting function that does this is the Gaussian bell curve, with its standard deviation as the scale parameter (see the short smoothing sketch after this list).
  7. Own transform. This is mostly of interest to mathematicians, but the Gaussian bell curve is, up to scaling, its own Fourier transform, which drastically simplifies calculations.
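As an illustration of point 6, here is a minimal sketch of a Gaussian-weighted moving average, written in Python with numpy; the truncation of the window at plus or minus 3 standard deviations and the synthetic order-quantity series are choices made for this example, not anything the theory prescribes.

import numpy as np

def gaussian_smooth(x, sigma):
    """Smooth a 1-D series with Gaussian weights; sigma is in number of points."""
    half = int(3 * sigma)                         # truncate the window at +/- 3 sigma
    t = np.arange(-half, half + 1)
    weights = np.exp(-t**2 / (2 * sigma**2))
    weights /= weights.sum()                      # normalize so the weights sum to 1
    return np.convolve(x, weights, mode="same")   # 'same' keeps the output aligned with the input

# Example: a noisy series of daily order quantities (synthetic data)
rng = np.random.default_rng(0)
orders = 100 + 10 * np.sin(np.arange(200) / 10) + rng.normal(0, 5, 200)
smoothed_a_little = gaussian_smooth(orders, sigma=2)
smoothed_a_lot = gaussian_smooth(orders, sigma=10)

Enlarging sigma only erodes peaks and fills valleys; it never creates new ones, which is the property that singles out the Gaussian among weighting functions.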

For all these reasons, the Gaussian distribution deserves attention, but it doesn’t mean that there aren’t other models that do too. For example, when you pool the output of independent series of events, like failures of different types on a machine, you tend towards a Poisson process, characterized by independent numbers of events in disjoint time intervals and a constant occurrence rate over time. It is also quite useful, but it doesn’t command the same level of attention as the Gaussian.

The most egregious misuse of the Gaussian distribution is in the rank-and-yank approach to human resources, which forces bosses to rate their subordinates “on a curve.” Measuring several dimensions of people’s performance and examining their distributions might make sense, but mandating that grades be “normally distributed” is absurd.

The meaning(s) of “random”

“That was random!” is my younger son’s response to the many things I say that sound strange to him. My computer has Random Access Memory (RAM), meaning that access to all memory locations is equally fast, as opposed to sequential access, as on a tape, where you have to go through a sequence of locations to reach the one you want.

In this sense, a side-loading truck provides random access to its load, while a back-loading truck provides sequential access.

While these uses of random are common, they have nothing to do with probability or statistics, and that’s no problem as long as the context is clear. In discussions of quality management or production control, on the other hand, randomness is connected with the application of models from probability and statistics, and misunderstanding it as a technical term leads to mistakes.

From the AMS blog (2012)

In factories, the only example I ever saw of Control Charts used as recommended in the literature was in a ceramics plant  that was firing thin rectangular plates for use as electronic substrates in batches of 5,000 in a tunnel kiln. They took dimensional measurements on plates prior to firing, as a control on the stamping machine used to cut them, and they made adjustments to the machine settings if control limits were crossed. They did not measure every one of the 5,000 plates on a wagon. The operator explained to us that he took measurements on a “random sample.”

“And how do you take random samples?” I asked.

“Oh! I just pick here and there,” the operator said, pointing to a kiln wagon.

That was the end of the conversation. One of the first things I remember learning when studying statistics was that picking “here and there” did not generate a random sample. A random sample is one in which every unit in the population has an equal probability of being selected, and it doesn’t happen with humans acting arbitrarily.

A common human pattern, for example, is to refrain from picking two neighboring units in succession. A true random sampler does not know where the previous pick took place and selects the unit next to it with the same probability as any other. This is done by having a system select a location based on a random number generator, and direct the operator to it.
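To make this concrete, here is a minimal sketch, in Python, of a system that picks the sample locations for the operator; the 50-row by 100-column wagon layout and the sample size of 30 are hypothetical numbers chosen for the example.

import numpy as np

rng = np.random.default_rng()  # no fixed seed: each audit gets a fresh draw

n_rows, n_cols, sample_size = 50, 100, 30   # hypothetical kiln-wagon layout: 5,000 plates

# Every plate has the same probability of selection, with no memory of previous picks
picks = rng.choice(n_rows * n_cols, size=sample_size, replace=False)
locations = sorted((p // n_cols, p % n_cols) for p in picks)  # (row, column) to direct the operator
print(locations)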

This meaning of the word “random” does not carry over to other uses even in probability theory. A mistake that is frequently encountered in discussions of quality is the idea that a random variable is one for which all values are equally likely.  What makes a variable random is that probabilities can be attached to values or sets of values in some fashion;  it does not have to be uniform. One value can have a 90% probability while all other values share the remaining 10%, and it is still a random variable.

When you say of a phenomenon that it is random, technically, it means that it is amenable to modeling using probability theory. Some real phenomena do not need it, because they are deterministic:  you insert the key into the lock and it opens, or you turn on a kettle and you have boiling water. Based on your input, you know what the outcome will be. There is no need to consider multiple outcomes and assign them probabilities.

There are other phenomena that vary so much, or on which you know so little, that you can’t use probability theory. They are called by a variety of names; I use uncertain.  Earthquakes, financial crises, or wars can be generically expected to happen but cannot be specifically predicted. You apply earthquake engineering to construction in Japan or California, but you don’t leave Fukushima or San Francisco based on a prediction that an earthquake will hit tomorrow, because no one knows how to make such a prediction.

Between the two extremes of deterministic and uncertain phenomena is the domain of randomness, where you can apply probabilistic models to estimate the most likely outcome, predict a range of outcomes, or detect when a system has shifted. It includes fluctuations in the critical dimensions of a product or in its daily demand.

The boundaries between the deterministic, random, and uncertain domains are fuzzy. Which perspective you apply to a particular phenomenon is a judgment call, and depends on your needs. According to Nate Silver, over the past 20 years, daily weather has transitioned from uncertain to random, and forecasters can now give you an accurate probability that it will rain today. On the air, they overstate the probability of rain, because a wrong rain forecast elicits fewer viewer complaints than a wrong fair-weather forecast. In manufacturing, the length of a rod is deterministic from the assembler’s point of view but random from the perspective of an engineer trying to improve the capability of a cutting machine.

Rods for assemblers vs. engineers

Claude Shannon

This categorization suggests that a phenomenon that is almost deterministic is, in some way, “less random” than one that is near uncertainty. But we need a metric of randomness to give a meaning to an expression like “less random.” Shannon’s entropy does the job. It is not defined for every probabilistic model but, where you can calculate it, it works. It is zero for a deterministic phenomenon and rises to a maximum when all outcomes are equally likely. This brings us back to random sampling. We could more accurately call it “maximum randomness sampling” or “maximum entropy sampling,” but it would take too long.
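As a minimal illustration, Shannon’s entropy for a discrete distribution can be computed in a few lines of Python; the three distributions over six outcomes below are made up for the example, ranging from a deterministic outcome to a fair die.

import numpy as np

def entropy_bits(p):
    """Shannon entropy, in bits, of a discrete distribution p; zero terms contribute nothing."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return np.sum(p * np.log2(1.0 / p))

print(entropy_bits([1, 0, 0, 0, 0, 0]))   # deterministic: 0 bits
print(entropy_bits([0.9] + [0.02] * 5))   # one dominant outcome: low entropy, but still random
print(entropy_bits([1/6] * 6))            # fair die: the maximum, log2(6), about 2.58 bits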

Averages in Manufacturing Data

The first question we usually ask about lead times, inventory levels, critical dimensions, defective rates, or any other quantity that varies, is what it is “on the average.” The second question is how much it varies, but we only ask it if we get a satisfactory answer to the first one, and we rarely do.

When asked for a lead time, people  usually give answers that are either evasive like “It depends,” or weasel-worded like “Typically, three weeks.” The beauty of a “typical value” is that no such technical term exists in data mining, statistics, or probability, and therefore the assertion that it is “three weeks” is immune to any confrontation with data. If the assertion had been that it was a mean or a median, you could have tested it, but, with “typical value,” you can’t.

For example, if the person had said “The median is three weeks,” it would have had the precise meaning that 50% of the orders are delivered in less than 3 weeks and that 50% take longer. If the 3-week figure is true and the lead times of successive orders are independent, then the probability of the next 20 orders all taking longer is 0.5^{20}\approx 0.95\,ppm, or less than one in a million. This means that, if you do observe a run of 20 orders with lead times above 3 weeks, you know the answer was wrong.
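Here is a minimal sketch of how such a check could be run against recorded lead times; the lognormal test data and the 3-week claim are assumptions made up for the example, not a prescription.

import numpy as np

def longest_run_above(values, claimed_median):
    """Length of the longest run of consecutive values strictly above the claimed median."""
    longest = current = 0
    for v in values:
        current = current + 1 if v > claimed_median else 0
        longest = max(longest, current)
    return longest

# Hypothetical lead times in weeks, drawn from a skewed (lognormal) distribution
lead_times = np.random.default_rng(2).lognormal(mean=1.2, sigma=0.4, size=500)
print(longest_run_above(lead_times, claimed_median=3.0))

Assuming independent orders, a run of 20 or more above the claimed median is roughly a one-in-a-million event if the claim is true, which is why observing one discredits the claim.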

In Out of the Crisis, Deming was chiding journalists for their statistical illiteracy when, for example, they bemoaned the fact that “50% of the teachers performed beneath the median.” In the US, today, the meaning of averages and medians is taught in Middle School, but the proper use of these tools does not seem to have been assimilated by adults.

One great feature of averages is that they add up: the average of the sum of two variables is the sum of their averages. If you take two operations performed in sequence in the route of a product, and consider the average time required to go through these operations by different units of product, then the average time to go through operations 1 and 2 is the sum of the average time through operation 1 and the average time through operation 2, as is obvious from the way an average is calculated. If you have n values X_{1},...,X_{n}, the average is just

\bar{X}= \frac{X_{1}+...+X_{n}}{n}

What is often forgotten is that most other statistics are not additive.

To obtain the median, first you need to sort the data so that  X_{\left(1\right)}\leq ... \leq X_{\left(n\right)}. For each point, the sequence number then tells you how many other points are under it, which you can express as a percentage and plot as in the following example:

Median graphic

Graphically, you see the median as the point on the x-axis where the curve crosses 50% on the y-axis. To calculate it, if n is odd, you take the middle value

\tilde{X}= X_{\left(\frac{n+1}{2}\right)}

 and, if n is even, you take the average of the two middle values, or

\tilde{X}= \frac{X_{\left(\frac{n}{2}\right)}+X_{\left(\frac{n}{2}+1\right)}}{2}

The median is not generally additive, and neither are the other statistics based on rank, like the minimum, the maximum, quartiles, percentiles, or stanines.

An ERP system, for example, will add operation times along a route to plan production, but the individual operation times input to the system are not averages but worst-case values, chosen so that they can reliably be achieved. The system therefore calculates the lead time for the route as the sum of extreme values at each operation, and this math is wrong because extreme values are not additive. The worst-case value for the whole route is not the sum of the worst-case values of each operation, and the result is an absurdly long lead time.

In project management, this is also the key difference between the traditional Critical Path Method (CPM) and Eli Goldratt’s Critical Chain. In CPM, task durations are set by the individuals in charge of each task so that they can be confident of completing them. They represent a perceived worst-case value for each task, which means that the duration for the whole critical path is the sum of the worst-case values of the tasks on it. In Critical Chain, each task duration is what the task is actually expected to require, with a time buffer added at the end of the project to absorb delays and take advantage of early completions.

That medians and extreme values are not additive can be experienced, if not proven, with a simple simulation in Excel. Filling two columns of 5,000 rows with the formula “LOGNORM.INV(RAND(),0,1)” gives you, in about a second, 5,000 instances of two highly skewed variables, X and Y, as well as their sum X+Y. On a logarithmic scale, their histograms look as follows:

lognormal histogram with sum

And the summary statistics show the Median, Minimum and Maximum for the sum are not the sums of the values for each term:

Simulation stats
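If you prefer a scripting language to Excel, a rough numpy equivalent of the same experiment is sketched below (the seed and output formatting are arbitrary); it shows that the means add up, to within sampling noise, while the median, minimum, and maximum of X+Y are not the sums of the corresponding statistics of X and Y.

import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.lognormal(mean=0.0, sigma=1.0, size=n)  # same distribution as LOGNORM.INV(RAND(),0,1)
y = rng.lognormal(mean=0.0, sigma=1.0, size=n)
s = x + y

for name, stat in [("mean", np.mean), ("median", np.median), ("min", np.min), ("max", np.max)]:
    print(f"{name:>6}: X={stat(x):8.3f}  Y={stat(y):8.3f}  X+Y={stat(s):8.3f}  "
          f"X stat + Y stat={stat(x) + stat(y):8.3f}")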

Averages are not only additive but have many other desirable properties, so why do we ever consider medians? Because there are real problems with averages when they are used carelessly:

      1. Averages are affected by extreme values. This is illustrated by the “Bill Gates walks into a bar” story, here told with Bill Gates inserted into a promotional picture of San Francisco’s Terroir Bar. Attached to each patron other than Bill Gates is a modest yearly income, but his presence pushes the average yearly income above $100M, which is not a meaningful summary of the population. On the other hand, consider the median. Without Bill Gates, the middle person is Larry, and the median yearly income is $46K. Add Bill Gates, and the median is now the average of Larry and Randy, or $48K. The median barely budged! While, in this story, Bill Gates is a genuine outlier, manufacturing data often have outliers that are the result of malfunctions, as when wrong measurements are recorded because a probe failed to touch the object it was measuring, the instrument was calibrated in the wrong system of units, or a human operator put the decimal point in the wrong place. Large differences between the average and the median are a telltale sign of this kind of phenomenon. Once the outliers are identified, assessed, and filtered, you can go back to using the average rather than the median.
      2. Averages are meaningless over heterogeneous populations. The statement that best explains this is “The average American has exactly one breast and one testicle.” It says nothing useful about the American population. In manufacturing, when you consider, say, a number of units produced, you need to make sure you are not commingling 32-oz bottles with minuscule free samples.
      3. Averages are meaningless for multiplicative quantities. If your data is the sequence Y_{1}, ...,Y_{n} of yields of the n operations in a route, then the overall yield is Y= Y_{1}\times ...\times Y_{n}, and the plain average of the yields is irrelevant. Instead, you want the geometric mean \bar{Y}=\sqrt[n]{Y_{1}\times ...\times Y_{n}}.
        The same logic applies to the compounding of interest rates, and the plain average of rates over several years is irrelevant.
      4. Sometimes, averages do not converge when the sample size grows. This can happen even with a homogeneous population, it is not difficult to observe, and it is mind-boggling. Let us say your product is a rectangular plate. On each one you make, you measure the discrepancies ΔL and ΔW between the actual length and width and the specs.
        Assume then that, rather than the discrepancies in length and width, you are interested in the slope ΔW/ΔL and calculate its average over an increasing number of plates. You are then surprised to find that, no matter how many data points you add, the ratio keeps bouncing around instead of converging as the law of large numbers has led you to expect. So far, we have looked at averages as just a formula applied to data. To go further, we must instead consider them as estimators of the mean of an “underlying distribution” that we use as a model of the phenomenon at hand. Here, we assume that the lengths and widths of the plates are independently and normally distributed around the specs. The slope ΔW/ΔL is then the ratio of two independent normal variables with zero mean, and therefore follows the Cauchy distribution. This distribution has the nasty property of not having a mean, as a consequence of which the law of large numbers does not apply. But it has a median, which is 0, as the short simulation after this list illustrates.
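Here is a minimal simulation of this effect, with made-up tolerances: the running average of the slope keeps wandering no matter how many plates are added, while the median stays near 0.

import numpy as np

rng = np.random.default_rng(3)
n = 200_000
delta_L = rng.normal(0, 0.05, n)   # length deviations from spec, assumed Gaussian with mean 0
delta_W = rng.normal(0, 0.05, n)   # width deviations from spec, same assumption
slope = delta_W / delta_L          # ratio of two independent zero-mean Gaussians: Cauchy-distributed

running_mean = np.cumsum(slope) / np.arange(1, n + 1)
print(running_mean[[999, 9_999, 99_999, n - 1]])  # does not settle down as n grows
print(np.median(slope))                           # the median, by contrast, stays close to 0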

The bottom line is that you should use averages whenever you can, because you can do more with them than with the alternatives, but you shouldn’t use them blindly. Instead, you should do the following:

      1. Clean your data.
      2. Identify and filter outliers.
      3. Make sure that the data represents a sufficiently homogeneous population.
      4. Use geometric means for multiplicative data.
      5. Make sure that averaging makes sense from a probability standpoint.

As Kaiser Fung would say, use your number sense.
