Using Regression to Improve Quality | Part II – Fitting Models

This is a personal guided tour of regression techniques intended for manufacturing professionals involved with quality. Starting from “historical monuments” like simple linear regression and multiple regression, it goes through “mid-century modern” developments like logistic regression. It ends with newer constructions like bootstrapping, bagging, and MARS. It is limited in scope and depth, because a full coverage would require a book and knowledge of many techniques I have not tried. See the references for more comprehensive coverage. 

To fit a regression model to a dataset today, you don’t need to understand the logic, know any formula, or code any algorithm. Any statistical software, starting with electronic spreadsheets, will give you regression coefficients, confidence intervals for them, and, often, tools to assess the model’s fit.

However, treating it as a black box that magically fits curves to data is risky. You won’t understand what you are looking at and will draw mistaken conclusions. You need some idea of the logic behind regression in general or behind specific variants to know when to use them, how to prepare data, and to interpret the outputs.

The Basic Idea

It starts with drawing a scatterplot of a sample \left [ \left ( x_1,y_1 \right )\dots \left ( x_N,y_N \right )\right ] of two real variables X and Y. Then you run a straight line through it to approximate Y as a function of X.

Most explanations of regression, even in modern texts, are in terms of the sample, bypassing any discussion of the random variables that the data are instances of. This results in formulas with summation symbols and indices that look complicated. When we express the problem in terms of the random variables X and Y, it reduces to an orthogonal projection in high school geometry, as we shall see. Let us start with the usual explanation.

Least Squares Fit through a Cloud of Points

On samples \left [ \left ( x_1,y_1 \right )\dots \left ( x_N,y_N \right )\right ], the model is

y = \alpha + \beta x + \epsilon

 

\alpha and \beta are coefficients, and \epsilon is noise. You are fitting a line through the cloud of points by minimizing the sum of its squared distances to the points, measured parallel to the y-axis. x and y play different roles, and you get a different line if you measure distances parallel to the x-axis. The third plot in the following diagram does not illustrate regression but a different tool, Principal Component Analysis (PCA), in which x and y play symmetrical roles:

 

 

Minimizing the Residual Sum of Squares 

You find estimates \hat{\alpha}_N and \hat{\beta}_N by minimizing the residual sum of squares (RSS):

RSS\left ( \alpha, \beta \right )= \sum_{i=1}^{N}\left (y_i - \alpha - \beta x_i \right )^2

 

You zero out the partial derivatives of RSS with respect to \alpha and \beta, and obtain formulas that can be expressed simply by using some standard notations for averages and the inner product of two vectors:

  • \bar{x}_N = \frac{1}{N}\sum_{i=1}^{N}x_i and \bar{y}_N = \frac{1}{N}\sum_{i=1}^{N}y_i

  • The data vectors \mathbf{x} = \left ( x_1,\dots,x_N \right ) and \mathbf{y} = \left ( y_1,\dots,y_N \right )

  • s_N^2 = \frac{1}{N}\sum_{i=1}^{N}x_i^2-\bar{x}_N^2 = \frac{1}{N}\mathbf{x}\cdot\mathbf{x} - \bar{x}_N^2

  • \frac{1}{N}\sum_{i=1}^{N}x_iy_i - \bar{x}_N\bar{y}_N = \frac{1}{N} \mathbf{x}\cdot\mathbf{y} -\bar{x}_N\bar{y}_N

 

In these terms, minimizing the RSS yields the following formulas for the sample estimates \hat{\alpha}_N and \hat{\beta}_N of \alpha and \beta that are found in the literature:

  • \hat{\beta}_N = \frac{\mathbf{x}\cdot \mathbf{y} - N\bar{x}_N\bar{y}_N}{\mathbf{x}\cdot \mathbf{x} - N\bar{x}_N^2}

 

  • \hat{\alpha}_N = \bar{y}_N - \hat{\beta}_N\bar{x}_N
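To make this concrete, here is a minimal Python sketch, with simulated data and NumPy assumed available, that applies these formulas and cross-checks them against NumPy's built-in degree-1 polynomial fit:

```python
import numpy as np

rng = np.random.default_rng(0)               # simulated data for illustration
x = rng.uniform(0, 10, 50)                   # explanatory variable
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, 50)   # response = line + noise

N = len(x)
x_bar, y_bar = x.mean(), y.mean()

# Closed-form least-squares estimates from the formulas above
beta_hat = (x @ y - N * x_bar * y_bar) / (x @ x - N * x_bar ** 2)
alpha_hat = y_bar - beta_hat * x_bar

# Cross-check against NumPy's degree-1 polynomial fit (returns slope, then intercept)
slope, intercept = np.polyfit(x, y, 1)
print(alpha_hat, beta_hat)
print(intercept, slope)   # should match the line above
```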

Interpretation

The software will fit a straight line by least squares through a cloud of points regardless of the data’s meaning and backstory. You need to dig deeper into what this line means about the points you used to calculate it, and about new points.

The y_i may be a true product characteristic like the tensile strength of a cable, measured by pulling it to its breaking point, while the x_i may be a non-destructive substitute. Then you are trying to bound the true characteristic from the substitute. Your interest is not in the points in the cloud but in what these points reveal about the variables X and Y that they are instances of.

In Terms of Random Variables

Random variables are to a data sample as a die is to a sequence of throws. Reasoning about the die itself is reasoning about all the possible sequences of throws. It is more abstract, but also simpler and more enlightening.

That the points in the cloud are pairs of instances of random variables raises several questions:

  1. What does the line mean in terms of these variables that would be true across all possible samples?
  2. What assumptions must you make about the variables for the procedure to be valid?

In fact, linear regression is easier to understand in terms of the random variables than in terms of samples. The relationship between the expected value \mathrm{E}(Z) of a random variable Z and the average of a sample of independent instances z_1,\dots, z_N is given by the law of large numbers:

\mathrm{E}(Z)= \displaystyle \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N}z_i

 

For linear regression with response Y and explanatory variable X, all we need to assume is the existence of the following:

  1. The expected values \mathrm{E}(X) and \mathrm{E}(Y)
  2. The variances \mathrm{Var}(X) = \mathrm{E}\left [ X- \mathrm{E}\left ( X\right ) \right ]^2> 0 and \mathrm{Var}(Y)
  3. The covariance  \mathrm{Cov}\left ( X,Y \right )= \mathrm{E}\left [ \left ( X- \mathrm{E}\left ( X\right ) \right )\left ( Y- \mathrm{E}\left ( Y\right ) \right )\right ].

In terms of the random variables, the basic regression model is simpler:

Y = \alpha + \beta X + \mathcal{E}

 

\alpha and \beta are coefficients, and \mathcal{E} is a random variable. We want to choose \alpha and \beta so that E (\mathcal{E})= 0 and Var(\mathcal{E}) is minimized.

\mathrm{E} (\mathcal{E})= \mathrm{E}(Y - \alpha - \beta X) = 0 and therefore

\alpha = \mathrm{E}(Y) - \beta \mathrm{E}(X)

 

Then \mathrm{Var}(\mathcal{E}) is minimized when \mathcal{E} is uncorrelated with the fitted part \alpha + \beta X, in other words, when

\mathrm{Cov}(\alpha + \beta X , Y - \alpha - \beta X)=  \beta \mathrm{Cov}(X,Y) - \beta^2 \mathrm{Var}(X) = 0

 

Then

\beta = \frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(X)}

 

and

\alpha = \frac{\mathrm{E}(Y)\mathrm{Var}(X)- \mathrm{Cov}(X,Y)\mathrm{E}(X)}{\mathrm{Var}(X)}

Geometric Interpretation

It looks like an orthogonal projection in high school geometry:

In fact, it doesn’t just look like it. Mathematically, centered random variables with a variance form a vector space where covariance is the inner product, the standard deviation is the norm, and the correlation coefficient is the cosine of the angle between two variables.

The 4th edition of Douglas Montgomery’s book on regression showed an orthogonal projection on its cover. Inside, it appears just once, and it is in terms of the sample data, not random variables. Trevor Hastie et al. treat it the same way.

 

The coefficients \alpha and \beta are simple functions of moments of the random variables X and Y. Except in games of chance or simulations – where you construct random variables – these moments are unknown quantities that you estimate from the data. If we replace the moments in the formulas with the estimates from the sample, we get the same formulas obtained above by minimizing the RSS.

As the sample size N grows,

\displaystyle \lim_{N \to \infty}\hat{\beta}_N = \beta

and

\displaystyle \lim_{N \to \infty}\hat{\alpha}_N = \alpha
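A short simulation, with arbitrary made-up parameters, illustrates this convergence: as N grows, the estimates computed from the sample drift toward the values defined by the moments of X and Y.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, beta = 3.0, -1.5           # "true" coefficients of the model Y = alpha + beta*X + noise

for N in (10, 100, 10_000, 1_000_000):
    x = rng.normal(5, 2, N)
    y = alpha + beta * x + rng.normal(0, 1, N)
    beta_hat = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # sample Cov(X,Y) / sample Var(X)
    alpha_hat = y.mean() - beta_hat * x.mean()
    print(N, round(alpha_hat, 3), round(beta_hat, 3))      # drifts toward (3.0, -1.5)
```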

Independence and Process Sequence

The estimation of \hat{\alpha}_N and \hat{\beta}_N from a sample \left [ \left ( x_1,y_1 \right )\dots \left ( x_N,y_N \right )\right ] assumes that the points in it are independent instances of the pair (X,Y) of random variables.

This is not a foregone conclusion when X and Y are measured variables in a manufacturing process. It means, in particular, that you can scramble the sequence of the (x_i, y_i) and it will make no difference to the estimates \hat{\alpha} and \hat{\beta} or the residuals \left(\epsilon_1 \dots \epsilon_N\right).

This is an issue when the data are a sequence of measurements taken on workpieces leaving a machine after an operation that gradually wears a tool or depletes a resource or when an external event starts affecting the (x_i, y_i) at some point between 1 and N.

If, as shown in a. below, you process parts one at a time and preserve the process sequence, it gives you valuable information but not independent observations. If, on the other hand, as in b. you dump the parts into a wire basket as shown on the right side, or c. tumble deburr them, you scramble the process sequence, and the units you pull are independent like beads drawn from an urn.


A scatterplot visualizes a set of points, and will be identical regardless of the order in which you plot the points. If you have a sequence of points, their order matters and an orbit chart is a more relevant visualization. In terms of modeling, you are in the realm of time series rather than regression.

The Residuals

We first examine the residuals as instances of the random variable \mathcal{E} , and then as a sample (\epsilon_1, \dots, \epsilon_N).

Moments of the Random Variable

By construction,

E(\mathcal{E}) = 0

 

and

\mathrm{Var}\left (\mathcal{E}\right ) = \mathrm{E}\left ( Y - \alpha - \beta X \right )^2 = (1-\rho^2)\mathrm{Var}(Y)

 

where

\rho = \frac{\mathrm{Cov}\left (X,Y \right )}{\sqrt{\mathrm{Var}\left ( X \right )}\sqrt{\mathrm{Var}\left ( Y \right )}}

 

is the correlation coefficient of X and Y. This expresses the connection between correlation and regression. If X and Y are uncorrelated, regression achieves nothing, because the residuals inherit the full variance of Y. If \left | \rho \right | = 1, then the centered Y- \mathrm{E}(Y) is a multiple of X-\mathrm{E}(X). The variables are then said to be collinear.

Then, by orthogonality,

\mathrm{Var}\left (\alpha + \beta X \right ) = \mathrm{Var}\left ( Y \right ) - \mathrm{Var}\left (\mathcal{E}\right ) = \rho^2\mathrm{Var}(Y)

 

This discussion in terms of random variables requires no assumption other than the existence of \mathrm{Cov}\left (X,Y \right ), and of nonzero \mathrm{Var}\left ( X \right ) and \mathrm{Var}\left ( Y \right ).

Sample Statistics

If the model applies, the residuals are uncorrelated with X and their variance does not change as a function of X. Reality, however, is not obligated to agree. When you calculate \hat{\alpha}_N and \hat{\beta}_N from a dataset \left [ \left ( x_1,y_1 \right )\dots \left ( x_N,y_N \right )\right ] and plot the residuals \epsilon_1,\dots, \epsilon_N as a function of x_1, \dots, x_N, you may find that their spread does not vary with x, in which case you have homoscedasticity and your model is OK. If the spread varies, you have heteroscedasticity, and your model needs tweaking.
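A quick way to run this check is to fit the line and plot the residuals against x; the sketch below, on simulated data built to be heteroscedastic, assumes NumPy and Matplotlib are available.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
x = rng.uniform(1, 10, 200)
y = 1.0 + 0.8 * x + rng.normal(0, 0.3 * x)   # noise spread grows with x: heteroscedastic by construction

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

plt.scatter(x, residuals, s=10)
plt.axhline(0, color="red")
plt.xlabel("x")
plt.ylabel("residual")
plt.title("Residuals vs. x")                 # a widening funnel is a symptom of heteroscedasticity
plt.show()
```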

Wikipedia illustrates the concept with the following examples:

Deming used a heteroscedastic cloud of points as the cover art for his 1950 book on sampling:

You can see heteroscedasticity in the autocorrelation of moving ranges. The consecutive differences between two values in a sequence of independent measurements are correlated because they have one term in common. The scatterplot of consecutive differences D_1 and D_2 taken on 100,000 simulated measurements shows homoscedasticity. Their correlation coefficient is -0.5.

The corresponding moving ranges R_1 and R_2 are the absolute values of these differences. Their correlation, as estimated from the simulation, is +0.22. It is weaker than the correlation of the differences, and their scatterplot shows heteroscedasticity.
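This simulation is easy to reproduce; a sketch like the following, with its own random data, lands near the -0.5 and +0.22 figures quoted above.

```python
import numpy as np

rng = np.random.default_rng(3)
z = rng.normal(0, 1, 100_000)       # independent measurements

d = np.diff(z)                      # consecutive differences
d1, d2 = d[:-1], d[1:]              # each pair of consecutive differences shares one term
print(np.corrcoef(d1, d2)[0, 1])    # close to -0.5

r1, r2 = np.abs(d1), np.abs(d2)     # the corresponding moving ranges
print(np.corrcoef(r1, r2)[0, 1])    # close to +0.22
```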

Distribution of the Residuals

Much of the literature presents homoscedasticity and the Gaussian distribution of residuals as assumptions for linear regression. In fact, neither is necessary to calculate \hat{\alpha} and \hat{\beta}. Furthermore, the Gauss-Markov theorem states that, under broad conditions that do not require the residuals to be Gaussian, the least-squares fit has the lowest variance among all linear unbiased estimators of the coefficients. It is then said to be the “Best Linear Unbiased Estimator,” or “BLUE” for short.

The confidence bands automatically provided by regression software are based on the assumption that the residuals are Gaussian. You should not take this for granted, particularly if you see many points outside of these bands.

When plotting yearly Atlantic hurricane counts over 55 years based on NOAA data, I found that 32% of the points were outside the “99% confidence interval”:

The residuals are not Gaussian, and they look heteroscedastic. Yet the model is still useful, in that it shows, in trend, a tripling of the yearly number of hurricanes of Category 3 and above, while the total number of storms is stable. Storms go through upgrades and downgrades along their paths. A hurricane of Category 3 or above has, somewhere along its path, winds above 111 mph. That, on average, three times as many hurricanes were thus classified in 2020 as in 1965 means that three times as many were observed, not necessarily that three times as many occurred.

If the residuals are not Gaussian, it does not invalidate the model; it only invalidates the automatically generated confidence bands. You should always analyze the residuals yourself.
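One informal way to analyze them yourself is a normal probability plot. This sketch uses SciPy and deliberately skewed stand-in residuals to show what a departure from the Gaussian looks like.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
residuals = rng.exponential(1.0, 500) - 1.0        # skewed, zero-mean stand-in for real residuals

stats.probplot(residuals, dist="norm", plot=plt)   # points bending away from the line: not Gaussian
plt.title("Normal probability plot of residuals")
plt.show()
```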

Continuity and Regression

In regression, that two values x_1 and x_2 are close does not imply that the matching y_1 and y_2 are close too. In other words, Y is not a continuous function of X. Reality, however, often throws continuous functions our way, such as thermometer readings over time, and we need other modeling methods for them.

Formally, if we consider just two pairs of random variables (X_1, Y_1) and (X_2, Y_2) connected by the simple, homoscedastic regression model described above, then

Y_2-Y_1 = \beta \left ( X_2-X_1 \right ) + \mathcal{E}_2 - \mathcal{E}_1

 

and

\mathrm{Var}\left ( Y_2-Y_1 \right )= \beta^2\times \mathrm{Var}\left ( X_2-X_1 \right ) + 2\times \mathrm{Var}\left ( \mathcal{E} \right ) \geq 2\times \mathrm{Var}\left ( \mathcal{E} \right )

 

Regardless of how close X_2 and X_1 are, \mathrm{Var}\left ( Y_2-Y_1 \right ) will always be at least 2\times \mathrm{Var}\left ( \mathcal{E} \right ), and Y cannot be a continuous function of X.

Regression versus Interpolation

The literature on regression calls interpolation the use of the regression formula at values of x between the minimum and the maximum of the xs in the sample, as opposed to extrapolation for its use outside of this range.

This sense of interpolation is inconsistent with standard usage, where interpolation and regression refer to two different ways of relating variables. The regression line does not go through the y_is! Depending on the nature of Y, this may or may not be a flaw of the method.

Connecting the Dots, or Not

Sometimes, it makes sense to connect the dots, and sometimes not. If Y is an altitude measurement, you definitely want the map to show this exact value for each measured point, and fitted values only between measurements. When the y_is are accurate and precise measurements of a physical quantity that is continuous in X, not traversing the measurements is a flaw of regression.

The Last Measurement

Some quantities are constants until an event occurs that causes a step change:

  • The quantity on-hand of an item in a warehouse is constant until a storage or retrieval transaction changes it.
  • The highest number of consecutive days without anyone getting injured on the job in a factory is constant until a longer stretch occurs. 
  • The value of a security is the price at which it was last traded, and constant until a new trade changes it.
  • A sports record is constant until an athlete breaks it, …

Such variables are staircase functions of time, discontinuous, and not subject to linear interpolation. 

How Records Change

The world record in 100-m racing is precisely measured but not continuous. As a function of time, it should be a staircase function going downward because it doesn’t change until some runner breaks the current record, with a step change of random height, and this occurs at random intervals.

In fact, the scatterplot doesn’t show a clean staircase because of inconsistencies in the data due to factors like wind speeds. In this case, the straight trend line that doesn’t traverse the points is more informative than a broken line through all points, as it unexpectedly shows a steady rate of improvement over 110 years! We would have expected the rate to taper off as performances reach the limits of human capabilities.

Noisy Data

With noisy data, interpolation overfits. For example, if Y is the price per square foot of a piece of real estate sold, then it is the result of a negotiation between buyer and seller and not exactly the “fair market value” of the property if such a quantity even exists. In this case, there is no point in insisting that the line should traverse the measurements.

Traversing Measurements and Estimating In Between

The broken or smooth interpolations traverse the measurements. Still, they have another flaw: they provide a single value for points between measurements when, in fact, the most you can know about them is a distribution, conditionally on the measurements:

 

Kriging is widely used to produce this kind of output in topographic mapping, mining, and other domains, and is the BLUE in this context. Georges Matheron developed it in the 1960s as an improvement over traditional methods used in mining that provided biased results.

Tolerancing on Substitute Characteristics

To the extent that the linear model fits, the regression line gives you the conditional expectation of Y knowing X:

\alpha + \beta X = \mathrm{E}(Y\mid X)

 

When you combine this with the distribution of the residuals, you have the full, conditional distribution of Y knowing X. If the model is homoscedastic, by definition, the standard deviation \sigma of the residuals does not vary with X. If the residuals are also Gaussian, then the conditional distribution of Y knowing X is Gaussian with mean \alpha + \beta X and standard deviation \sigma.

The unconditional distribution of Y may be wide and ugly, and overstep the spec limits. Conditionally on X, however, it is a narrow Gaussian centered at a location that varies with X, based on which, knowing X, we can infer how likely the workpiece is to fall within the spec limits on Y:

Then we can use tolerances for X to control performance in terms of Y, which is what you expect of a substitute characteristic for Y.
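Under these two assumptions, homoscedastic and Gaussian residuals, the calculation is short. In the sketch below, the coefficients, residual standard deviation, and spec limits are made up for illustration.

```python
from scipy.stats import norm

# Made-up fitted model, residual spread, and spec limits on the true characteristic Y
alpha_hat, beta_hat = 12.0, 0.8   # regression coefficients
sigma = 1.5                       # standard deviation of the residuals
lsl, usl = 20.0, 28.0             # spec limits on Y

def p_within_spec(x):
    """P(LSL < Y < USL | X = x) under Gaussian, homoscedastic residuals."""
    mu = alpha_hat + beta_hat * x             # conditional mean of Y given X = x
    return norm.cdf(usl, mu, sigma) - norm.cdf(lsl, mu, sigma)

for x in (10.0, 15.0, 20.0):
    print(x, round(p_within_spec(x), 3))
```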

Beyond the Basics

The scope of regression analysis has been expanding for nearly 140 years in many ways, in the nature of the variables considered, in their dimensions, and in the methods used for fitting models.

Logistic Regression

During the production process, you measure, say, the weight of each unit. Many operations later, at the end of the process you test the finished units and pass or fail them. If no unit ever passed final test with the weight you just observed, you shouldn’t pass it on for the next operations to put more work into it. Your sample \left [ \left ( x_1,y_1 \right )\dots \left ( x_N,y_N \right )\right ] with X the measured in-process variable and Y a sequence of Pass or Fail at final test, may look like this:

Goal: Estimating the Probability of Passing

If a weight anomaly is due to a missing component, it is usually obvious, and so is the decision. If it is due to fluctuations, as in the lengths of cut rods or the size of coffee beans, it is not so easy. What we would like to infer from this data is a probability p(X) = P(Pass \mid X) of the unit passing final test given its value for X:

If we measure X on a unit in process and finished goods sell for V per piece, the expected revenue for the unit in process by the time it is finished is p(X)\times V. Estimating p(X) is the question we focus on here. Whether this amount is worth all the resources needed to finish the unit is the next question, but that's another discussion.

In the above chart, estimating p(X) is a two-sided problem. Logistic regression addresses it on one side only. Encoding the response as Y=1 for “Pass” and Y=0 for “Fail” yields the most common visualization of logistic regression, as shown on Wikipedia:

In this example, all the students who studied \leq 1.5 hours failed and all those who studied \geq 4 hours passed. Between these bounds, there were both passing and failing students.

Outline of the Math

This is just to give an idea of the method in the simplistic case of just one explanatory variable. In practice, you can go straight to software and feed your data to a module in SAS or Minitab, one of the many packages available in Python, or glm in R. If you want to apply logistic regression with Excel, you actually need to go through the steps described below. The software usually helps you assess the fit, but we’ll discuss this in Part III.

Meaning of “Logistic” in this Context

“Logistic” in this context has nothing to do with forklifts. Instead, it refers to the logit function, which is the logarithm of the odds. For a probability p, it takes the following form:

\mathrm{logit}\left ( p \right ) = \mathrm{log}\left ( \frac{p}{1-p} \right )

 

p is between 0 and 1; the odds \frac{p}{1-p}, between 0 and +\infty. Then \mathrm{logit}\left ( p \right ) ranges from -\infty to +\infty, and, if z=\mathrm{logit}(p), then p = \frac{1}{1+e^{-z}}.

The Logistic Regression Model

We want a linear model of the logit, of the form:

\mathrm{logit}\left [ p(X) \right ]=\alpha + \beta X

 

Let \pi(X) = P(Y=1 \mid X); then 1-\pi(X) = P(Y=0 \mid X), and the probability mass function of Y given X can be expressed as:

f\left ( y\mid X \right )= \pi\left ( X \right )^y\times \left [ 1- \pi\left ( X \right ) \right ]^{1-y}, where y= 0 or 1.

Likelihood 

Let’s specialize this to the sample data y_1,\dots, y_N, consisting of 0s and 1s, and x_1,\dots, x_N. Using the model, let’s write

\pi_i(\alpha, \beta) = P(Y=1\mid X= x_i) = \frac{1}{1+e^{-\alpha -\beta x_i}}

 

If the sample data are independent, then the likelihood of (\alpha, \beta) for the dataset is

L\left ( \alpha,\beta \right )= \prod_{i=1}^{N}\pi_i\left ( \alpha,\beta \right )^{y_i}\times\left [ 1 -\pi_i\left ( \alpha,\beta \right ) \right ]^{1-y_i}

 

In logarithmic form, this simplifies to:

\log\left [ L\left ( \alpha, \beta \right ) \right ]= \sum_{i=1}^{N}y_i\left ( \alpha + \beta x_i \right ) - \sum_{i=1}^{N}\log\left [ 1+e^{\alpha +\beta x_i} \right ]

Maximum Likelihood Estimation

Then you use numerical methods to find the \hat{\alpha}_N and \hat{\beta}_N that maximize the log-likelihood. This is maximum likelihood estimation (MLE), not least squares, but least squares doesn’t work here anyway. MLE is more complex, requires more assumptions, and doesn’t always work. It is, however, how we do logistic regression.
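As a sketch of that numerical step, the following feeds the negative log-likelihood to SciPy’s general-purpose optimizer; the pass/fail data are made up, loosely shaped like the study-hours example, not taken from it.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative hours-studied (x) and pass/fail (y) data, not the actual Wikipedia table
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.25, 3.5, 4.0, 4.5, 5.0, 5.5])
y = np.array([0,   0,   0,   0,   1,   0,   1,    0,   1,   1,   1,   1])

def neg_log_likelihood(params):
    a, b = params
    z = a + b * x
    # log L = sum[ y*(a + b*x) - log(1 + e^(a + b*x)) ]; we minimize its negative
    return -np.sum(y * z - np.log1p(np.exp(z)))

result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
alpha_hat, beta_hat = result.x
print(alpha_hat, beta_hat)

def p_pass(x_new):
    return 1.0 / (1.0 + np.exp(-(alpha_hat + beta_hat * x_new)))

print(p_pass(2.0))   # estimated probability of passing final test after 2 hours of study
```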

In linear regression, the coefficients estimated by least squares and MLE coincide if, and only if, the residuals are Gaussian. In general, where both methods apply, they do not coincide. The choice of method is generally made for expediency.

Logistic Regression Post-Dates SPC and SQC

Least squares curve fitting goes back 220 years with Legendre and Gauss; linear regression with least squares, 140 years with Galton and Karl Pearson; MLE, 100 years, with Fisher; logistic regression, 60 years, with David Cox.

Logistic regression is tailor-made to address yield prediction problems or to identify substitutes for true characteristics that are attributes. It is, however, too recent to have caught the attention of the quality profession.

Multiple and Multivariate Regression

Often, you need to relate a variable Y of a finished good, like a switching speed, to k > 1 measurements X_1, \dots, X_k taken during the production process, like geometric dimensions, electrical resistances, or weights. Then the model becomes

Y = \alpha + \beta_1 X_1 + \dots + \beta_k X_k + \mathcal{E}

 

and the logic for finding \alpha, \beta_1,\dots, \beta_k is the same as in the one-dimensional case. The key difference is that you calculate with covariance matrices rather than individual covariances. As in the simplest case, off-the-shelf software takes care of the calculations, and books that describe the algorithms in detail may be useful to software developers, but not to engineers trying to make sense of data.
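For readers who want to see the mechanics anyway, this sketch solves the multiple-regression fit with NumPy’s least-squares routine on simulated data; the column of ones carries the intercept \alpha.

```python
import numpy as np

rng = np.random.default_rng(5)
N, k = 200, 3
X = rng.normal(size=(N, k))                    # k in-process measurements per unit
true = np.array([1.0, 0.5, -2.0, 0.3])         # alpha, beta_1, beta_2, beta_3
y = true[0] + X @ true[1:] + rng.normal(0, 0.2, N)

X_design = np.column_stack([np.ones(N), X])    # column of ones carries the intercept
coeffs, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(coeffs)                                  # close to [1.0, 0.5, -2.0, 0.3]
```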

In the next refinement, the response itself can have more than one dimension, and the problem is again solved with matrix algebra. It is more complicated than simple regression but an extension of the same concept.

The Generalized Linear Model (GLM)

Logistic regression is also a special case of GLM. You can’t model a binary response as a linear function of explanatory variables. So you transform it into a quantity that you can, and you transform the result back into probabilities of success and failure.

GLM is a general method to do this with a transformation g called the link function, applied to the conditional expectation of the response. The model then takes the following form:

g\left [ \mathrm{E}\left ( Y \mid X_1,\dots,X_k \right ) \right ] = \alpha + \beta_1 X_1 + \dots + \beta_k X_k
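If you have the statsmodels package, a GLM fit takes a few lines. This sketch refits the illustrative pass/fail data from the logistic sketch above through the GLM interface with the binomial family, whose default link is the logit.

```python
import numpy as np
import statsmodels.api as sm

# The illustrative pass/fail data from the logistic sketch above
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.25, 3.5, 4.0, 4.5, 5.0, 5.5])
y = np.array([0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1])

X = sm.add_constant(x)                                   # intercept column plus explanatory variable
result = sm.GLM(y, X, family=sm.families.Binomial()).fit()
print(result.params)                                     # alpha and beta, matching the MLE sketch
```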

 

The Generalized Additive Model (GAM)

In the GAM, Hastie and Tibshirani (1986) expanded the GLM to nonlinear functions of explanatory variables that can be binary attributes, categories, or ordered categories.

Y = \alpha + f_1 (X_1) + \dots + f_k ( X_k) + \mathcal{E}

 

where f_1,\dots,f_k are smooth functions, like cubic splines.
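As a sketch, assuming the pygam package is installed, a GAM with one spline term per explanatory variable can be fitted to simulated data as follows.

```python
import numpy as np
from pygam import LinearGAM, s   # assumes the pygam package is installed

rng = np.random.default_rng(6)
X = rng.uniform(0, 10, size=(300, 2))            # two explanatory variables
y = np.sin(X[:, 0]) + 0.1 * X[:, 1] ** 2 + rng.normal(0, 0.2, 300)

gam = LinearGAM(s(0) + s(1)).fit(X, y)           # one smooth spline term per explanatory variable
gam.summary()                                    # prints the fitted terms
```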

Multivariate Adaptive Regression Splines (MARS)

In the simple, one-dimensional case, the idea of Friedman’s MARS (1991) is to fit a piecewise linear function to a cloud of points, automatically choosing its hinge points. It initially assumes all the x_i to be hinge points and then prunes away the irrelevant ones. The Wikipedia example compares simple linear regression with MARS on the same data, and it shows a hinge point that was not visually obvious from the cloud of points:

The location of the hinge points may be the most valuable output of a MARS analysis. Based on real data, the following chart shows the appreciation by category of properties sold in San Francisco in 2012-2014. It has a surprising hinge point in the spring of 2014, affecting only Condominiums:

This hinge point, identified by MARS, prompted a search of the contemporary news media. It revealed a ballot measure introduced in the spring of 2014 for the November 2014 election that would have heavily taxed the flipping of condominiums without affecting other types of properties. San Francisco voters rejected this proposition, and condominium prices rebounded afterwards.
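The one-dimensional hinge idea can be sketched without a specialized library: try each data point as a candidate hinge t, fit y against x and \max(0, x-t) by least squares, and keep the hinge with the lowest RSS. Dedicated MARS implementations automate and generalize this search and pruning; the code below is only an illustration on simulated data with a known hinge.

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.sort(rng.uniform(0, 10, 200))
y = np.where(x < 6, 1.0 + 0.2 * x, 2.2 + 2.0 * (x - 6)) + rng.normal(0, 0.3, 200)  # true hinge at x = 6

best = None
for t in x[5:-5]:                            # candidate hinge points, away from the edges
    hinge = np.maximum(0.0, x - t)           # MARS-style hinge function
    A = np.column_stack([np.ones_like(x), x, hinge])
    coeffs, rss, *_ = np.linalg.lstsq(A, y, rcond=None)
    if best is None or rss[0] < best[0]:
        best = (rss[0], t)

print("estimated hinge point:", round(best[1], 2))   # should land near 6
```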

Bootstrapping, Bagging, and Other Techniques

New techniques keep enriching the regression toolkit. One I have tried is bagging, short for bootstrap aggregation. It is due to Leo Breiman in 1996. There are many others I have not tried. The semiconductor manufacturing paper, for example, discusses lasso and ridge regression.

Bootstrapping

In other contexts, “bootstrapping” usually refers to financing a business from retained earnings or starting up a computer operating system. Here, bootstrapping is a resampling technique that creates alternative samples from the data in an original sample. Brad Efron invented it in 1979, and applied it primarily to biostatistics, not quality. A few academic papers propose applying it in setting control limits for SPC.

To bootstrap a new sample, you throw the data points from your original sample into an urn, mix them up, and then draw points with replacement, one by one, until you have as many as in the original set. Because you are drawing with replacement, each point you draw has the same probability distribution as a point drawn at random from the original sample. For the same reason, you draw some points multiple times and others not at all.

Once you have bootstrapped many such samples, you calculate the same statistic on each and examine the distribution of these statistics, which you may not be able to find in any other way. Consider, for example, the hinge points in MARS. They are outputs of a complicated algorithm, and you neither know whether another sample of the same variables would give you the same number of hinge points nor how close they would be to those in the original sample.

No theory is available to help you, and collecting additional samples is impractical. On the other hand, bootstrapped samples are only a computer simulation away, and you can generate enough of them to infer a distribution for the number and locations of your hinge points. In addition, the mean of the estimates from your bootstrapped samples is often more accurate than the estimate from your original sample.
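Here is a sketch of the mechanics, bootstrapping the slope of a simple linear regression on simulated data; any other statistic, such as a MARS hinge location, would be treated the same way.

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.uniform(0, 10, 60)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, 60)        # the original sample

B = 2000                                          # number of bootstrap samples
slopes = np.empty(B)
for b in range(B):
    idx = rng.integers(0, len(x), len(x))         # draw indices with replacement
    slopes[b] = np.polyfit(x[idx], y[idx], 1)[0]  # slope fitted on the bootstrap sample

print("slope on the original sample:", np.polyfit(x, y, 1)[0])
print("bootstrap mean and standard error:", slopes.mean(), slopes.std())
```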

Bagging

Bagging applies bootstrapping to regression. For any regression model, you estimate coefficients for each bootstrap sample. In bagging, you take the means of the coefficients over all the bootstrap samples. 
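Using the same mechanics, a minimal bagging sketch for simple linear regression averages the fitted coefficients over the bootstrap samples:

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.uniform(0, 10, 60)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, 60)

B = 2000
fits = np.array([np.polyfit(x[idx], y[idx], 1)                       # (slope, intercept) per sample
                 for idx in (rng.integers(0, len(x), len(x)) for _ in range(B))])

bagged_slope, bagged_intercept = fits.mean(axis=0)   # bagging: average the coefficients
print(bagged_intercept, bagged_slope)                # compare with np.polyfit(x, y, 1)
```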

Distributional Regression

While the above methods focus on the conditional mean of the response given the explanatory variables, recent approaches estimate conditional distributions for the response. They usually have “LSS” in their name, which, in this context, stands for “Location, Scale, and Shape.”

According to Ananyapam De, “The most popular distributional regression model you might have come across is GAMLSS, which uses an additive model. There have also been extensions which use neural networks as in NAMLSS or to machine learning models like XGBoostLSS or attention based models like NODE-GAMLSS.” For a review of the state of the art, see Rage Against the Mean.

 

Terminology

We call the Xs explanatory variables and Y the response. You find many different, and sometimes confusing, alternative names in the literature. Given that \mathcal{E} is what remains after you subtract \hat{Y} from Y, it makes sense to call it a residual. It brands \mathcal{E} as the part of Y that cannot be accounted for by a linear function of X.

The Xs are also commonly called predictors, suggesting that \hat{Y}_N is a point estimate of Y. It isn’t, but, if you treat it as such, then \mathcal{E} is an error. The Xs are also sometimes called covariates.

The most common usage is to call the Xs independent variables and Y the dependent variable. Unfortunately, this distinction has nothing to do with independent events as defined in probability theory. The Xs are not assumed to be independent in this sense, and their mutual dependencies are reflected in their covariance matrix. On the other hand, the (x_i, y_i) in a sample are assumed to be independent occurrences of the random variable (X,Y).

Conclusions

The growth of data science and information technology over the past 140 years has spurred the development of many refinements of Galton’s simple linear regression, each a different way of quantifying the relationship between explanatory variables and response. To choose an appropriate one, you must understand the nature of the problem and decide, for example, whether it makes sense to have a curve that traverses the data, or not.

Then you experiment with different models, which leads us to the next topic: how do you evaluate the fit? On new data points, you want the fitted values \hat{y_j} to be as close as possible to the actual y_j. This is the subject of Part III. 

References

#linearregression, #regression, #multipleregression, #multivariateregression, #logisticregression, #substitutecharacteristic, #truecharacteristic, #kriging, #mars, #bootstrapping, #bagging