Deviating Standard Deviations

This basic concept deserves revisiting. The following is from a 2022 blog post, hosted by a supplier of statistical software, intended to explain the meaning of some notations in plain, simple terms:

The author calls two different things by the same name. If the standard deviation of each variable is 1, how could its expected value be anything else? The confusion within this nonsensical statement is the same one we make when we equate the temperature of a soup with a thermometer reading. In our mental model of a bowl of soup, it has a temperature that exists regardless of our ability to measure it, and the thermometer reading is only an estimate of it.

For the purposes of eating soup, confusing the two is harmless, unless the thermometer, poorly calibrated, always gives you an answer that is 15°F off. This is the situation we have with the most commonly used estimator of the standard deviation of a random variable from a small sample. It is biased, and c_4(N) is a correction factor applicable when the random variable is Gaussian.

To describe c_4(N) accurately, we need to dig into probability theory. It is, in fact, the expected value of the estimator S=\sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left ( X_i -\bar{X} \right )^2} of the standard deviation from a sample of N independent Gaussian variables \left ( X_1, \dots, X_N \right ) with unit standard deviation, \sigma = 1. This is an accurate statement, but every term in it needs an explanation.

Standard Deviations in Statistical Quality

In statistical quality, the standard deviation is a big deal. There is even an approach named after it,  Six Sigma. But nowhere in Six Sigma do you find a discussion of random variables, their standard deviations, and the issues of estimating them from samples. 

Many texts on SQC, even recent ones like Grant & Leavenworth’s or Montgomery’s, devote less than two pages of their books to the concept of a standard deviation. In particular, they fail to distinguish between the standard deviation \sigma of a random variable and its estimate s from a sample. Given that these authors are PhDs in Engineering with stellar academic careers, they obviously knew all about this distinction, but chose to gloss over it, presumably not to burden students of quality with mathematical subtleties.

Recent authors like Tom Pyzdek or Don Wheeler do not mention random variables at all in their books. And Wheeler made the following incorrect statement in a column in Quality Digest: “Probability models are built on the assumption that the data can be thought of as observations from a set of random variables that are independent and identically distributed.” In fact, you need no such assumption. 

As a result, they present methods, formulas, and coefficients like cookbook recipes, without justification or underlying assumptions. I disagree with this choice. What I find puzzling is attempting to promote an understanding of variation without using probability theory, the math specifically developed for this purpose. 

 

Problems with Sample Standard Deviations

For numbers \mathbf{x} = \left ( x_1,\dots,x_N \right ), s =\sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left ( x_i -\bar{x} \right )^2} is commonly called the “sample standard deviation” of \mathbf{x}. Without anything else, the formula for s is just a “measure of dispersion” pulled out of thin air. And it raises questions like:

  1. What is the intended meaning of this number?
  2. Do you get consistent results when you apply it to multiple samples?
  3. What happens when the sample size grows?
  4. Why does it involve division by N-1 and not N ?
  5. Why use a sum of squares rather than, say, a sum of absolute values?

On the other hand, if the x_i are independent realizations of a random variable X that has a mean and a standard deviation, you can answer these questions.

If Unfamiliar with Random Variables

For a review of what a random variable is, see Appendix A. We will need just three basic concepts associated with random variables:

  1. The probability density function (p.d.f.).
  2. The expected value.
  3. The variance and its square root, the standard deviation.

The probability density function

A probability density function (p.d.f.) f for a real random variable X is a positive function such that the probability of X falling in interval I is the area under the curve of f over interval I. The p.d.f. does not have to be pretty or well-behaved. All that is required of it is to be positive, integrable, and with a total area under the curve of 1:

\int_{-\infty}^{+\infty} f\left( x \right)dx = 1

This probability is, of course, additive over any countable collection of pairwise disjoint intervals: for disjoint intervals I_1, I_2, I_3, for example, P(I_1\cup I_2\cup I_3)= P(I_1)+P(I_2)+P(I_3). Not every random variable has a p.d.f. but many interesting ones do. This concept extends to random vectors in n dimensions, random infinite sequences, or random functions.

The Expected Value of a Random Variable

“Expected value,” as a term, comes from the early work by Pascal on gambling. If a player has a 40% probability of winning $100, the stake has an expected value of $100 × 40% = $40. For a random variable X, with a probability density function (p.d.f.) f, its expected value is

\mu = E\left( X \right) = \int_{-\infty}^{+\infty} xf\left( x \right)dx

 

provided this integral converges. 

There is a more general definition of expected value that covers cases without a density function, but this one is sufficient for our purposes. From this definition, it is not obvious that it has anything to do with the average of a sample \mathbf{x} = (x_1,\dots,x_N) of independent realizations of X. The law of large numbers makes this connection.
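The law of large numbers is easy to watch in action. The following sketch simulates samples from an assumed Gaussian with \mu = 10 and \sigma = 2 (the distribution, seed, and sample sizes are arbitrary choices) and shows the average closing in on \mu as the sample grows:

```python
import random

random.seed(42)
MU, SIGMA = 10.0, 2.0  # assumed process mean and standard deviation

def sample_mean(n):
    """Average of n independent Gaussian realizations."""
    return sum(random.gauss(MU, SIGMA) for _ in range(n)) / n

# The error of the average shrinks as the sample grows.
errors = {n: abs(sample_mean(n) - MU) for n in (10, 1_000, 100_000)}
```

For a sample of 100,000 points, the average is within a few hundredths of \mu; for a sample of 10, it can easily be off by half a \sigma or more.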

To get a handle on Standard Deviations, we only need a few properties of the expected value. For random variables X and Y, E(X+Y) = E(X) + E(Y) and, for any scalar \alpha, E(\alpha X) = \alpha E(X).

Also, if X and Y are independent, E(XY) = E(X)E(Y)

For any function g(X),

E[g(X)] = \int_{-\infty}^{+\infty} g(x)f\left( x \right)dx

 

but, unless g is linear,

E\left [ g\left ( X \right ) \right ]\neq g\left [ E\left ( X \right ) \right ]
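A two-point toy variable makes this inequality concrete. The values and probabilities below are arbitrary choices; with g(x) = x^2, E[g(X)] and g[E(X)] come out different:

```python
# A two-point variable: X takes the values 0 and 2, each with probability 1/2.
values = [0.0, 2.0]
probs = [0.5, 0.5]

def expect(g):
    """E[g(X)] for this discrete variable: sum of p * g(x)."""
    return sum(p * g(x) for x, p in zip(values, probs))

e_of_square = expect(lambda x: x * x)   # E[X^2] = 0.5*0 + 0.5*4 = 2
square_of_e = expect(lambda x: x) ** 2  # E[X]^2 = 1^2 = 1
```

The gap between the two, E[X^2] - E[X]^2, is exactly the variance defined in the next section.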

The Variance of a Random Variable

Then we define Var(X) as follows, and call it \sigma^2 because we are often interested in its square root, the standard deviation:

\sigma^2=Var\left( X \right) = E\left[ \left( X -\mu\right)^2 \right]

 

The expected value, the variance, and the standard deviation exist when these integrals converge. Otherwise, they don’t, and there is no point in trying to estimate them from data. It’s not difficult to find examples for which these quantities do not exist.
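One classic example is the standard Cauchy distribution: it has a perfectly good density, but the integral defining its expected value diverges, and sample averages never settle down, no matter how large the sample. A quick illustration (the seed and sample sizes are arbitrary choices):

```python
import math
import random

random.seed(0)

def cauchy():
    """One standard Cauchy realization: the tangent of a uniform angle."""
    return math.tan(math.pi * (random.random() - 0.5))

# Averages of repeated large samples refuse to settle on any value.
averages = [sum(cauchy() for _ in range(10_000)) / 10_000 for _ in range(5)]
```

Run this a few times with different seeds: the averages jump around erratically instead of converging, which is what “the expected value does not exist” looks like in practice.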

Applying the Law of Large Numbers

If \mu exists and we have a sample \left( x_1,\dots, x_N\right), the law of large numbers tells us that the average \bar{x}=\frac{1}{N}\sum_{i = 1}^{N}x_i of the x_i converges towards \mu when N goes to infinity:

\mu = E\left( X \right) = \lim_{N \to \infty} \frac{1}{N}\sum_{i = 1}^{N}x_i = \lim_{N \to \infty} \bar{x}

 

Likewise,

\sigma^2=Var\left( X \right) =\lim_{N \to \infty} \frac{1}{N}\sum_{i = 1}^{N}\left(x_i - \mu \right)^2

 

Moments of a Random Variable

\mu and \sigma^2 are known as the first and second order moments of the variable X, because \mu = E(X) is a function of the values of X and \sigma^2 of their squares.

\mu indicates where on the real axis the distribution is concentrated; \sigma^2, how concentrated it is. The standardized 3rd-order moment, a function of X^3, is called skewness and shows how asymmetric the distribution is; the standardized 4th-order moment, or kurtosis, shows the pointiness of the distribution. Moments of order 5 and above don’t have names yet.

Many useful distributions have density functions that depend only on the 1st and 2nd order moments:

  • A binomial distribution for N trials has a single parameter p, an expected value of \mu =Np and a variance \sigma^2 = Np(1-p).
  • A Poisson variable is completely characterized by its expected value \mu, with \sigma^2 = \mu.
  • A Gaussian distribution is completely characterized by its expected value \mu and its variance \sigma^2. As it is symmetric, its skewness is always 0. Its kurtosis is always 3.
  • Distributions with kurtosis >3 are called leptokurtic and are pointier with fatter tails than the Gaussian; those with kurtosis <3 are called platykurtic.

When they exist, the moments of probability distributions have many useful properties. 

Variance and Standard Deviation

If we call Var(X) \sigma^2, it’s because we are often interested in its square root

\sigma = \sqrt{E\left[ \left ( X- \mu \right )^2 \right]}

 

It has the same dimension as X and \mu. This allows us, for example, to talk about an interval of the form \mu \pm k\sigma.

Unfortunately, the math of standard deviations is more complex than that of variances. For independent variables X and Y, for example,

Var(X+Y) = Var(X) + Var(Y)

 

but, for standard deviations, it becomes

\sqrt{Var(X+Y)}= \sqrt{Var(X) + Var(Y)}

 

like the hypotenuse of a right triangle with sides \sqrt{Var(X)} and \sqrt{Var(Y)}.
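This Pythagorean behavior shows up clearly in simulation. The sketch below assumes independent Gaussians with \sigma_X = 3 and \sigma_Y = 4 (arbitrary choices): the standard deviation of the sum comes out near 5, the hypotenuse, not 3 + 4 = 7:

```python
import math
import random

random.seed(1)
N = 100_000

# Independent X ~ N(0, 3) and Y ~ N(0, 4): variances add, so the sigma of
# the sum should be sqrt(3^2 + 4^2) = 5, not 3 + 4 = 7.
xs = [random.gauss(0.0, 3.0) for _ in range(N)]
ys = [random.gauss(0.0, 4.0) for _ in range(N)]

def sample_std(data):
    """Sample standard deviation with the N-1 divisor."""
    m = sum(data) / len(data)
    return math.sqrt(sum((v - m) ** 2 for v in data) / (len(data) - 1))

sigma_sum = sample_std([x + y for x, y in zip(xs, ys)])  # near 5
```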

Estimation from Data

In a real situation, we often have a dataset (x_1, \dots, x_N) and we assume the x_i to be independent realizations of identically distributed variables X_i that have an expected value and a variance. We don’t know what they are, and we need to estimate them from the data.

The average \bar{x}=\frac{1}{N}\sum_{i = 1}^{N}x_i then is a realization of the variable \bar{X}=\frac{1}{N}\sum_{i = 1}^{N}X_i, and we immediately see that E(\bar{X})= \mu, which says that \bar{X} is an unbiased estimator of \mu. Colloquially, we say that it’s not wrong on average.

If we knew \mu, we could say likewise that \frac{1}{N}\sum_{i = 1}^{N}\left(X_i - \mu \right)^2 is an unbiased estimator of Var(X). But we don’t know \mu, and replacing it with the average of the data changes things. This is where the N-1 factor comes from. It is known as the Bessel correction.

Unbiased Estimation of Variance

\sum_{i=1}^{N}\left ( X_i - \bar{X} \right )^2 = \sum_{i=1}^{N}X_i^2 + N\bar{X}^2 - 2\bar{X}\sum_{i=1}^{N}X_i = \sum_{i=1}^{N}X_i^2 -N\bar{X}^2

 

In taking the expected value of N\bar{X}^2 = \frac{1}{N}\left ( \sum_{i=1}^{N}X_i \right )^2, we encounter N squared terms of the form

E(X^2_i) = \mu^2 + \sigma^2

 

and N(N-1) cross terms with i\ne j, for which:

E(X_iX_j) = E(X_i)E(X_j) = \mu^2

 

then the expected values of the two terms in the above sum of squares are

E\left [ \sum_{i=1}^{N} X_i^2\right ]= N\left ( \mu^2 + \sigma^2 \right )

 

and

E\left [ N\bar{X}^2\right ]= \frac{1}{N}E\left [\left ( \sum_{i=1}^{N} X_i \right )^2 \right ] = \mu^2 + \sigma^2 + \left ( N-1\right )\mu^2

 

Taking the difference between the two, we get

E\left [ \sum_{i=1}^{N}\left ( X_i - \bar{X} \right )^2 \right ] = \left ( N-1 \right )\sigma^2

 

which means that we need to divide the sum of squares by (N-1) and not N to have an unbiased estimator of the variance. It makes no practical difference when N = 30,000 but it does when N = 5, as in a subgroup of in-process measurements.
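The effect of the Bessel correction is easy to check by simulation. The sketch below assumes standard Gaussian subgroups of 5 (the seed and trial count are arbitrary): dividing the sum of squares by N underestimates \sigma^2 = 1 by the predicted factor (N-1)/N = 0.8, while dividing by N-1 does not:

```python
import random

random.seed(7)
TRIALS, N = 100_000, 5  # subgroups of 5, true sigma^2 = 1

biased_total = unbiased_total = 0.0
for _ in range(TRIALS):
    xs = [random.gauss(0.0, 1.0) for _ in range(N)]
    xbar = sum(xs) / N
    ss = sum((x - xbar) ** 2 for x in xs)
    biased_total += ss / N          # divide by N
    unbiased_total += ss / (N - 1)  # Bessel correction

biased = biased_total / TRIALS      # averages near (N-1)/N = 0.8
unbiased = unbiased_total / TRIALS  # averages near sigma^2 = 1.0
```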

Biased Estimation of Standard Deviation

S^2= \frac{1}{N-1}\sum_{i=1}^{N}\left ( X_i - \bar{X} \right )^2 is an unbiased estimator of \sigma^2 but S is a biased estimator of \sigma, because the square root is a concave function and Jensen’s inequality tells us that E(S) \leq \sigma.

There isn’t much more that can be said without assuming a specific distribution for X. If you Google “standard deviation,” almost all the results are about the Gaussian distribution, also known as “normal.” The Wikipedia article on Standard Deviation even opens with an illustration of this special case.

Just because the Gaussian distribution is mathematically tractable and has many applications, it isn’t a license to ignore all others. It also doesn’t mean that the Gaussian distribution shares all these properties with other distributions. For any given distribution that has a standard deviation, short of working out the math of the estimator bias, you can usually estimate it by simulation. 
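As a sketch of what such a simulation looks like, the following estimates E(S)/\sigma for subgroups of 5 drawn from an exponential distribution with rate 1, for which \sigma = 1 (the distribution, seed, and trial count are arbitrary choices); the published Gaussian value c_4(5) = 0.9400 is included only for comparison:

```python
import math
import random

random.seed(3)
TRIALS, N = 100_000, 5  # small subgroups, as in control charts

total = 0.0
for _ in range(TRIALS):
    xs = [random.expovariate(1.0) for _ in range(N)]  # exponential: sigma = 1
    xbar = sum(xs) / N
    total += math.sqrt(sum((x - xbar) ** 2 for x in xs) / (N - 1))

bias_factor = total / TRIALS  # simulated E(S)/sigma for exponential subgroups
c4_gaussian_5 = 0.9400        # published Gaussian c_4(5), for comparison only
```

The simulated factor comes out below 1, as Jensen’s inequality requires, and differs from the Gaussian c_4(5), illustrating that the correction factor depends on the distribution.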

Also, since the law of large numbers gives us \mu = E\left( X \right) = \lim_{N \to \infty} \bar{x}, we can expect the bias to become negligible for sufficiently large samples. In the Gaussian case, we can quantify this.  

The Special Case of Gaussians

In setting control limits for s-charts, the theory of control charts assumes that X is Gaussian, in which case you have the correction factor Shewhart used:

c_4(N)= \sqrt{\frac{2}{N-1}}\times \frac{\Gamma(N/2)}{\Gamma((N-1)/2)}

 

where \Gamma is Euler’s Gamma function.

E(S) = c_4(N) \sigma is a powerful result but we must remember that it only applies to Gaussian variables. The proof is given in Appendix B.
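As a sanity check, the sketch below compares a simulated E(S) for small Gaussian samples against c_4(N)\sigma, computing c_4 through log-Gamma for numerical stability (the sample size, \sigma, seed, and trial count are arbitrary choices):

```python
import math
import random

random.seed(5)
N, SIGMA, TRIALS = 5, 2.0, 100_000

def c4(n):
    """c_4(n) from the Gamma-function formula, via log-Gamma for stability."""
    return math.sqrt(2.0 / (n - 1)) * math.exp(
        math.lgamma(n / 2) - math.lgamma((n - 1) / 2)
    )

total = 0.0
for _ in range(TRIALS):
    xs = [random.gauss(0.0, SIGMA) for _ in range(N)]
    xbar = sum(xs) / N
    total += math.sqrt(sum((x - xbar) ** 2 for x in xs) / (N - 1))

simulated = total / TRIALS  # simulated E(S)
predicted = c4(N) * SIGMA   # c_4(5) * sigma
```

The two numbers agree to within simulation noise, which is what E(S) = c_4(N)\sigma predicts.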

Bias Correction for Small, Gaussian Samples

Tables of c_4(N) are available from multiple sources. It starts at 0.8 for N = 2 and is above 0.99 for  N \geq 26. It reaches 0.997 for a sample of 100 points. It’s a concern only for the kind of small samples you use in control charts. This is why Shewhart made it a coefficient in setting limits for s-charts. For a sample of 30,000 points, you can ignore it. 

Large Sample Approximations for the Gaussian

To approximate this for large N, let’s express c_4(N) as a function of z = \frac{N-1}{2}, so that

c_4(N)= \sqrt{\frac{1}{z}}\times \frac{\Gamma(z +1/2)}{\Gamma(z)},

Stirling’s series for large z gives us

\frac{\Gamma\left ( z + \alpha \right )}{\Gamma\left ( z + \beta \right )}= z^{\alpha -\beta}\times\left [ 1 + \frac{\left ( \alpha -\beta \right )\left ( \alpha + \beta -1 \right )}{2z}+ O\left ( \frac{1}{z^2} \right ) \right ],

If we plug in \alpha = 1/2 and \beta = 0, we get

\frac{\Gamma\left ( z + 1/2 \right )}{\Gamma\left ( z \right )}= \sqrt{z}\times\left [ 1-\frac{1}{8z}+ O\left ( \frac{1}{z^2} \right ) \right ]

and, translating back in terms of N,

c_4(N)= 1- \frac{1}{4\left ( N-1 \right )}+ O\left ( \frac{1}{N^2} \right )

You can easily verify that 1- \frac{1}{4\left ( N-1 \right )} is a close approximation of c_4(N) for N as low as 26. 
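The verification takes a few lines. This sketch computes the exact c_4(N) through log-Gamma (to avoid overflow in the Gamma ratios) and measures its gap from the approximation 1 - \frac{1}{4(N-1)} at several sample sizes:

```python
import math

def c4(n):
    """Exact c_4(n) = sqrt(2/(n-1)) * Gamma(n/2)/Gamma((n-1)/2), via log-Gamma."""
    return math.sqrt(2.0 / (n - 1)) * math.exp(
        math.lgamma(n / 2) - math.lgamma((n - 1) / 2)
    )

def c4_approx(n):
    """Large-sample approximation 1 - 1/(4(n-1))."""
    return 1.0 - 1.0 / (4.0 * (n - 1))

# Gap between the exact value and the approximation at various sample sizes.
gaps = {n: abs(c4(n) - c4_approx(n)) for n in (2, 5, 26, 100)}
```

At N = 2 the approximation is off by several hundredths, but by N = 26 the gap is under a thousandth, consistent with the O(1/N^2) remainder.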

When the Variables are not Gaussian

When using other models that do have standard deviations, you should not use c_4(N) for bias correction, as it applies only to Gaussians. If you cannot derive or find in the literature a similar bias correction factor for the standard deviation estimator, you can estimate it through simulations.

Conclusions

If you are going to make sigmas, that is, standard deviations, central to your theory of quality, you need the concept of a random variable for it to make sense.

Appendix A. What is a Random Variable, Exactly?

Many authors go out of their way to avoid defining this concept. In earlier posts, I have assumed that readers share a common understanding of random variables.  I now realize it is not so. The closest I came to defining it was in the discussion of regression: The random variable is to the measurements as a die is to a sequence of throws:

 The random variable represents all the values the measurement might take, with probabilities assigned to subsets. While a useful metaphor, it falls short of a formal definition and does not do justice to the concept’s generality.

In Probability as Application of Measure Theory

The definition I learned is that a random variable is “a measurable mapping from a probability space into the real line.” It is concise, but subtle concepts from early 20th-century math lurk behind every word:

  • The word “variable” doesn’t evoke a mapping but an argument of a mapping. In software, it is a name for a place in memory to store and retrieve objects. But we don’t assign values to random variables, and modeling them as mappings accounts for their values coming from somewhere else.
  • A probability space is a concept that is almost never explicitly used beyond this definition but is essential to it. It is a triplet \left ( \Omega,\mathfrak{F}, P \right ) where  
    • \Omega is a set of elementary outcomes. 
    • \mathfrak{F} is a collection of subsets of \Omega, called “events,” on which a measure can be defined. We call it a σ-algebra, which means the following:
      • The entire set \Omega  is in  \mathfrak{F} 
      • If a part A  of \Omega is in \mathfrak{F}, so is its complement \Omega -A 
      • If the A_1,\dots, A_i, \dots  are in  \mathfrak{F}, so is their union \bigcup_{i=1}^{\infty}A_i
  • A measure is a function on \mathfrak{F} that is additive on sequences of pairwise disjoint elements of \mathfrak{F}. A probability P is a positive measure defined on  \mathfrak{F} such that  P(\Omega)=1
  • On the real line \mathbb{R}, we use the σ-algebra generated by intervals, and a mapping from \Omega to  \mathbb{R} is “measurable” if the reciprocal image of any measurable subset of \mathbb{R} is in  \mathfrak{F}

If X is a random variable, this definition gives a precise meaning to the expression  P(a\leq X\leq b). It takes the reciprocal image X^{-1}\left ( \left [ a,b \right ] \right ) in \mathfrak{F}  and assigns it a value through the measure P. From this point forward, the only part of the probability space \left ( \Omega,\mathfrak{F}, P \right ) we work with is the value of  P on reciprocal images of measurable subsets of \mathbb{R}.
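For readers who prefer something concrete, here is the definition shrunk to a single die throw: \Omega is finite, \mathfrak{F} is all subsets, P is the uniform measure, and the random variable is a payoff mapping (the $10 payoff is a made-up example):

```python
from fractions import Fraction

# Omega: the six faces of a die; F: all subsets of Omega; P: uniform measure.
omega = frozenset({1, 2, 3, 4, 5, 6})

def P(event):
    """Probability measure: size of the event over the size of Omega."""
    return Fraction(len(event & omega), len(omega))

# A random variable is a mapping from Omega to the reals; here, a payoff
# of $10 for a six and $0 otherwise (a hypothetical example).
X = {w: 10.0 if w == 6 else 0.0 for w in omega}

def prob_between(a, b):
    """P(a <= X <= b): the measure of the preimage of [a, b] under X."""
    preimage = frozenset(w for w in omega if a <= X[w] <= b)
    return P(preimage)
```

Here prob_between(10, 10) takes the preimage {6} of the interval [10, 10] and assigns it the measure 1/6, exactly the mechanism described above.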

 

Appendix B. Proof of the Bias Correction Formula for Gaussians

Per Cochran’s theorem, with Gaussian X_i, the sum of squares \frac{N-1}{\sigma^2}\times S^2 follows a \chi^2 distribution with N-1 degrees of freedom, which involves Euler’s Gamma function. The probability density function of a \chi^2 with k degrees of freedom is

f\left ( x,k \right )= \frac{1}{2^{k/2}\Gamma\left ( k/2 \right )}x^{k/2-1}e^{-x/2},

If you want to read how to derive this formula from the Gaussian distribution, see Appendix C.

With \sigma = 1, u =(N-1)S^2 is a \chi^2 variable with N-1 degrees of freedom. The expected value of the square root of u is therefore

h\left ( N \right )= \int_{0}^{+\infty}\sqrt{u}f(u,N-1)du=\frac{1}{2^{(N-1)/2}\Gamma\left ( (N-1)/2 \right )}\int_{0}^{+\infty}u^{N/2-1}e^{-u/2}du,

By a change of variable to v = u/2, the integral becomes

\int_{0}^{+\infty}u^{N/2-1}e^{-u/2}du = 2^{N/2} \times \int_{0}^{+\infty}v^{N/2-1}e^{-v}dv,

or

\int_{0}^{+\infty}u^{N/2-1}e^{-u/2}du = 2^{N/2} \times \Gamma(N/2),

and

h\left ( N \right ) = \frac{2^{N/2}\Gamma(N/2)}{2^{(N-1)/2}\Gamma\left ( \left ( N-1 \right )/2 \right )} = \frac{\sqrt{2}\,\Gamma(N/2)}{\Gamma\left ( \left ( N-1 \right )/2 \right )} = \sqrt{N-1}\,E(S)

Therefore

E(S) = \sqrt{\frac{2}{N-1}}\frac{\Gamma(N/2)}{\Gamma\left ( \left ( N-1 \right )/2 \right )} = c_4(N)

 

This was for \sigma =1. For other values of \sigma, E(S) = c_4(N)\sigma and \frac{S}{c_4(N)} is an unbiased estimator of \sigma.

Appendix C. The P.D.F. of a \chi^2

A \chi^2 with k degrees of freedom is the sum of the squares of k independent Gaussians N(0,1), and its value is constant over the sphere r^2 = x^2_1 +\dots+x^2_k.

The area A_{k-1} of the unit sphere S_{k-1} in \mathbb{R}^k is \frac{2\pi^{k/2}}{\Gamma\left ( k/2 \right )}, where \Gamma is Euler’s Gamma function. For the sphere of radius r, it is

A_{k-1}(r)=r^{k-1}\frac{2\pi^{k/2}}{\Gamma\left ( k/2 \right )},

If we change the variable to u = r^2, then dr = \frac{1}{2\sqrt{u}}du and the density of a \chi^2 with k degrees of freedom becomes

f\left ( u,k \right )= A_{k-1}(\sqrt{u})\times \phi(u)\times\frac{1}{2\sqrt{u}},

where \phi(u) is the value of the multivariate Gaussian density at any point on the sphere of radius \sqrt{u}:

\phi(u) =\prod_{i =1}^{k}\frac{1}{\sqrt{2\pi}}e^{-x^2_i/2} = \frac{1}{\left ( 2\pi \right )^{k/2}}e^{-r^2/2} = \frac{1}{\left ( 2\pi \right )^{k/2}}e^{-u/2},

Therefore

f(u,k)=u^{(k-1)/2}\frac{2\pi^{k/2}}{\Gamma\left ( k/2 \right )}\times \frac{1}{\left ( 2\pi \right )^{k/2}}e^{-u/2}\times \frac{1}{2\sqrt{u}},

which simplifies to

f(u,k)=\frac{1}{2^{k/2}\Gamma\left ( k/2 \right )}u^{(k/2-1)}e^{-u/2}

References

 

Grant, E. L., & Leavenworth, R. S. (1972). Statistical Quality Control. McGraw-Hill.

Montgomery, D. C. (2020). Introduction to Statistical Quality Control. Wiley.

Pyzdek, T., & Keller, P. (2012). The Handbook of Quality Management (2nd ed.). McGraw Hill.

Wheeler, D. J., & Chambers, D. S. (1992). Understanding Statistical Process Control. SPC Press.

#quality, #standarddeviation, #statistics, #probability, #sixsigma, #spc