Using Regression to Improve Quality | Part III — Validating Models

Whether your goal is to identify substitute characteristics or solve a process problem, regression algorithms can produce coefficients for almost any data. However, this doesn’t mean the resulting models are any good.

In machine learning, you divide your data into a training set on which you calculate coefficients and a testing set to check the model’s predictive ability. Testing concerns externally visible results and is not specific to regression.

Validation, on the other hand, is focused on the training set and involves using various regression-specific tools to detect inconsistencies with assumptions. For these purposes, we review methods provided by regression software.

In this post, we explore the meaning and the logic behind the tools provided for this purpose in linear simple and multiple regression in R, with the understanding that similar functions are available from other software and that similar tools exist for other forms of regression.

It is an attempt to clarify the meaning of these numbers and plots and help readers use them. They will be the judges of how successful it is.

The body of the post is about the application of these tools to an example dataset available from Kaggle, with about 30,000 data points. For the curious, some mathematical background is given in the appendix.

Many of the tools are developments from the last 40 years and, therefore, are not covered in the statistics literature from earlier decades.

Visualize, When You Can

In simple linear regression, you have just one numeric explanatory variable and one numeric response. With a data set in the hundreds of points, you can plot the point cloud and the regression line going through it:

If you have so many points that they merge into a blob in the plot, you can use a heat map instead.

In either case, you can often tell visually whether the model is useful or not. You only need to look further into metrics of the fit if it isn’t visually obvious.

Visual validation is not an option in multiple regression, where you have several explanatory variables, or in multivariate regression, where you have several responses. In these cases, you need to consider numbers, with the understanding that some of these overall numbers must be intelligible to all stakeholders, while the more detailed ones are only for data scientists.

Relying on visualization also entails the risk of confirmation bias: You see patterns in the plot that you expect to see but that another observer might not. In a discussion we had a few years ago, Mark Graban and I saw different patterns in TV viewership of Academy Awards ceremonies.

Example

We’ll use an example of multiple linear regression on data about quality extracted from Kaggle. All we know from Kaggle is the following:

“The roasting machine is an aggregate consisting of 5 chambers of equal size, each chamber has 3 temperature sensors. In addition, for this task, you have collected data on the height of the raw material layer and its moisture content. Layer height and humidity are measured when raw materials enter the machine. Raw materials pass through the kiln in an hour. Product quality is measured in the laboratory by samples that are taken every hour at the exit of the roasting machine.”

Conjectures about the Kaggle data set

Kaggle provides the data as just numbers, without even units of measure. Since the documentation refers to the equipment as a kiln, the products could be ceramics if the temperatures are in °C. Then, the quality score could be composed of dimensional shrinkage during firing and surface smoothness class, or other appearance ratings.

Use of the Kaggle data

We use it here to illustrate the summaries and plots used to validate regression models. In a real case, we would want to know the complete backstory of the data, including what process physics and chemistry tell us about the relationship between the control parameters and the quality of the product.

Hourly Summaries of Explanatory Variables

For over three years, temperatures were measured every minute through three sensors in each of the five chambers the product traverses in a tunnel kiln. The height of the raw materials entering the oven and their humidity were measured at the start. The quality of the outgoing material was scored on samples every hour. There were about 2 million points for each measured variable and about 30,000 for quality. The regression model is linear on hourly averages of temperatures by chamber, height, and humidity.

Multiple Regression Model

The quality score Q is modeled as the sum of a constant \alpha, a noise term \mathcal{E}, and a linear combination of seven terms:

  • The hourly averages of temperature T_1,\dots,T_5
  • The height H of the materials flowing through the kiln.
  • The raw materials humidity W

The formula is as follows:

Q= \alpha + \beta_1T_1 +\dots +\beta_5T_5 +\beta_6H + \beta_7 W +\mathcal{E}

 

It assumes a relationship between random variables, which would be true for any alternative dataset collected on the furnace as well as Kaggle’s. The coefficients \alpha, \beta_1, \dots, \beta_7 and the distribution of the error \mathcal{E} are all unknown.

From the dataset q_i, t_{1i}, \dots, t_{5i}, h_i, w_i, with i = 1,\dots, N, we estimate the coefficients as \hat{\alpha},\hat{\beta}_1, \dots, \hat{\beta}_7 and the characteristics of the error \mathcal{E} from the residuals \epsilon_i = q_i - \hat{\alpha} - \hat{\beta}_1t_{1i} -\dots -\hat{\beta}_5t_{5i} -\hat{\beta}_6h_i -\hat{\beta}_7w_i .

 

To use the model, we want the mean of the error \mathcal{E} to be 0 and its standard deviation to be sufficiently small; what “sufficiently” means depends on the context and the nature of the decisions about the process that the model is to support.

Proportional Estimation Accuracy Tables (PEAT)

When we turn to numbers, we want to assess the model’s ability to approximate actual values on data outside the training set — that is, the data set used to estimate the model coefficients.

PEAT Calculations

In the testing set \left ( x_{1i},\dots,x_{ki}, y_i \right ), i = 1,\dots, M, we have values for the explanatory variables and actual responses. Whatever model \hat{Y} = f\left ( X_1,\dots,X_k \right ) we fit to the training set, we can apply it to the explanatory variables in the testing set and measure how close the fitted values \hat{y}_1,\dots,\hat{y}_M come to the actual responses y_1,\dots,y_M:

r_i = \frac{y_i-\hat{y}_i}{y_i}\,\text{for}\,i=1,\dots,M

 

Specifying ranges like \pm 1%, \pm 2%, \pm 5%, \pm 10%, \pm 20%, we can then measure the proportion of the r_i within each range.

PEAT Example

Range    Fitted values within range
±1%      13%
±2%      29%
±5%      68%
±10%     90%
±20%     98%

We call these Proportional Estimation Accuracy Tables (PEAT). They characterize the distribution of residuals in a way that is understandable to end users and applicable regardless of the model used.
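For the curious, here is a minimal sketch of how such a table can be computed in R; the model object fit, the testing data frame test, and its quality column are stand-ins for illustration, not names from the Kaggle file:

# Sketch of a PEAT calculation. `fit` is a model estimated on the training set,
# `test` is a held-out data frame whose response column is named quality.
peat <- function(fit, test, response = "quality",
                 ranges = c(0.01, 0.02, 0.05, 0.10, 0.20)) {
  y      <- test[[response]]
  y_hat  <- predict(fit, newdata = test)
  r      <- (y - y_hat) / y                                  # relative errors
  within <- sapply(ranges, function(p) mean(abs(r) <= p, na.rm = TRUE))
  data.frame(Range  = paste0("±", 100 * ranges, "%"),
             Within = paste0(round(100 * within), "%"))
}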

If the precision shown in the PEAT is sufficient, for example, to establish tolerances on substitute characteristics instead of the elusive, true characteristic, you don’t need to look further. The end user can enjoy the sausage without knowing how you made it. It’s different when the performance is inadequate or marginal, and the analyst needs to dig deeper.

Communicating with Stakeholders

Recently, statistician Alexander Krannich shared his misgivings about showing the performance summary of regression models: “This is good enough for statisticians, but not optimal for presentations to non-statistician stakeholders.”

He was putting it mildly. This is an example from the lm function in R, on the same dataset:

Only a statistician or data scientist could love this summary; it is unintelligible to anyone else. The regression summary from Excel is just as abstruse for non-statisticians. Minitab, SAS, or Python’s statsmodels package all produce similar ones, none suitable for external communication.
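For reference, a summary of this kind comes out of a call along the following lines; the data frame kiln and its column names are hypothetical stand-ins, not the actual headers of the Kaggle file:

# Hypothetical data frame `kiln` of hourly averages; column names are illustrative.
fit <- lm(quality ~ T1 + T2 + T3 + T4 + T5 + height + humidity, data = kiln)
summary(fit)   # prints the residuals, coefficient table, and overall metrics discussed below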

Background of this summary

Let’s explore this example in more detail, and examine the meaning of these numbers. The goal here is to use linear regression to predict a roasting Quality Score of unknown composition from observed characteristics. The hourly measurements of this score taken over more than three years show no improvement:

This plot also suggests a skewed distribution, which Kernel Density Estimation (KDE) confirms:

We don’t know what the Quality Score is made of, but we have records mapping the control parameters of the kiln to it, and we can use multiple regression to find settings that should improve it.

Validation

If, based on the PEAT, you and the end users are happy with the model’s performance, the story ends here. Otherwise, you apply validation tools that focus on the training set. Regression software usually produces two sets of tools:

  1. A set of model metrics based on orthodox statistics, whose validity hinges on residuals being Gaussian.
  2. A set of more recent graphic tools that help you ascertain the distribution of the residuals and identify possible anomalies in the data.

Most of the literature tries to explain these tools without using formulas, when in fact the math is most easily explained with the notation developed for this purpose. Explanations in text are simply more verbose and less precise.

Annotated Summary

Let’s consider first the summary as a whole, and then dig into its parts. One way to make sense of the summary is to annotate it:

The top part, about the formula, is self-explanatory. Let’s dig into the less obvious sections one by one.

The Residuals

The first part of the summary provides rank statistics on the residuals:

We can check their symmetry and range around the median, but we might just as well directly plot the density of the residuals:

It appears more pointy than a Gaussian, which we confirm with a Q-Q plot:
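Both checks are one-liners in R, assuming the fitted model object is the hypothetical fit from above:

# Density and Q-Q plot of the residuals of the hypothetical model `fit`.
plot(density(resid(fit)), main = "Density of residuals")
qqnorm(resid(fit))   # sample quantiles against Gaussian quantiles
qqline(resid(fit))   # reference line through the first and third quartiles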

This says that the distribution of the residuals has thinner tails than if it were Gaussian while behaving like a Gaussian near the mean. The thin tails say that the fitted values are closer to the response than they would be if the residuals were Gaussian.

Statisticians call the fat-tailed distributions leptokurtic and the thin-tailed ones platykurtic. Both sound like diseases, but, to the extent that it is accurate, platykurtic residuals give you better estimates than Gaussians.

There are several actions you can take when the Q-Q plot shows that your residuals are not Gaussian:

  1. Ignore it. This has been, historically, the most common response, with statisticians using Gaussian models because they were mathematically tractable with the information technology of their day. It is Don Wheeler’s advice in the different context of Control Charts.
  2. Trim the data. There is a fine line between cleaning and doctoring data.
  3. Fix the model. Change the list of explanatory variables and try again.

The Coefficients

Linear regression software usually gives you a summary table for the coefficients, of the following form:

Or, in a more readable form:

Coefficient    Estimate    Standard Error    t-value    p-value      Stars
α              427.71      4.125             101.26     <2×10⁻¹⁶     ***
β1             0.62        0.005             106.89     <2×10⁻¹⁶     ***
…
β7             0.23        0.132             1.743      0.08         .

 

When we see such a table, our first reaction is to eliminate the coefficients with low star ratings and rerun the regression to estimate only the few, highly starred coefficients. This screens out irrelevant parameters and gives us a simpler model that performs as well on the testing set as the full model. This is called “feature selection,” and lasso regression is an automated way to do it.

While it often works, you should be wary of accepting stars out of a black box. In particular, you might want to know what they mean and what assumptions are needed about the data for them to be valid. The software usually provides a legend mapping the p-value in the preceding column to the Stars through thresholds.

p-value          Stars
< 0.001          ***
0.001 to 0.01    **
0.01 to 0.05     *
0.05 to 0.1      .
0.1 and above    (none)
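This is the same mapping that R applies internally through its symnum function; a sketch with a few hypothetical p-values:

# Star ratings for hypothetical p-values, using the same cutpoints as the legend above.
p <- c(1e-17, 0.0004, 0.03, 0.08, 0.3)
symnum(p, corr = FALSE, na = FALSE,
       cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
       symbols   = c("***", "**", "*", ".", " "))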

 

The p-value, in turn, is based on the preceding column, t-value. It is the probability of observing | \text{ t-value}| or above if the true value of the coefficient is 0. This says that the software awards Stars based on Null Hypothesis Significance Tests (NHST) at varying significance levels.

The Standard Error of the Coefficients

The Standard Error is the standard deviation of the estimator \hat{\beta}_N and

\text{ t-value}=\frac{\text{Estimate}}{\text{Standard Error}}

 

expresses the estimate as a multiple of its standard deviation. The term is confusing because this t-value is not related to the Student t-test. The conversion of the t-value to a p-value then assumes that the estimate follows a Gaussian distribution.
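You can check this relationship on the coefficient matrix of the hypothetical model fit; note that R actually uses a Student t reference distribution, which is indistinguishable from a Gaussian with this many degrees of freedom:

# Columns of the coefficient matrix: Estimate, Std. Error, t value, Pr(>|t|).
ctab    <- coef(summary(fit))
t_check <- ctab[, "Estimate"] / ctab[, "Std. Error"]   # reproduces the t-value column
p_check <- 2 * pt(-abs(t_check), df.residual(fit))     # reproduces the p-value column
# With ~29,000 degrees of freedom, 2 * pnorm(-abs(t_check)) gives the same numbers.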

There is a simple formula for the Standard Error when fitting the model by least squares, but not when using other fitting methods.

Overall Performance Metrics

The last paragraph of the summary gives metrics of overall regression performance that are not necessarily easy to interpret:

Let’s walk through them:

  1. Residual standard error: 26.03 on 29176 degrees of freedom. If you have N points and k+1 coefficients \alpha, \beta_1, \dots, \beta_k, the number of degrees of freedom is N-(k+1), and the residual standard error is shown below.
  2. Multiple R-squared: 0.6838. In simple linear regression, R^2 is the estimate of the square of the correlation coefficient between X and Y – that is, the \rho introduced in the discussion of the residuals. In general, it is the proportion of the variance of Y that is explained by the model.
  3. Adjusted R-squared: 0.6837. The adjusted R-squared, noted \bar{R}^2, takes into account the number of parameters estimated when calculating both the numerator and the denominator of the fraction in R^2.
  4. F-statistic: 9012 on 7 coefficients and 29176 DF. Ronald Fisher developed the F-statistic in the 1920s to study the differences in variance within and between groups of data. Here it is applied to the variances of the residuals \epsilon_i and of the fitted values \hat{y}_i.
  5. P-value: <2.2\times 10^{-16}. If the residuals are Gaussian, this is the probability of observing an F-statistic at least this high if all the coefficients are 0.

The residual standard error is

s_e = \sqrt{ \frac{1}{N-\left ( k+1 \right )}\sum_{i=1}^{N}\epsilon_i^2}

 

If the \epsilon_i are noise, then it estimates the volume of this noise. The term “Residual standard error” is confusing for two reasons:

  • “Standard error” usually designates the standard deviation of a coefficient estimator, but not here.
  • “Residual” and “error” are competing names for the \epsilon_i. The most common distinction between the two is that the residuals are the differences between the estimates and the response in the data, while the error is the corresponding random variable in the model. Here, “error” is used for the data.

The explanation commonly given for dividing by N-(k+1) is that it is the size of the sample minus the number of estimated coefficients. Calling it the number of degrees of freedom makes it easy to remember but does not justify it.

Where there is just one coefficient, \alpha, it reduces to the formula for estimating the variance of a variable when its mean is estimated by the sample average, and its denominator is N-1.

Demystifying sample variance formula, a recent blog post by biostatistician Amr Mahmoud, does not include the simple math that shows why you divide the sum of squares by N-1 rather than N. See the appendix for proofs of both this simple case and the general one.

The Multiple R-squared is

R^2= 1-\frac{\sum_{i=1}^{N}\left ( y_i -\hat{y}_i \right )^2}{\sum_{i=1}^{N}\left ( y_i -\bar{y} \right )^2} = 1-\frac{\sum_{i=1}^{N}\epsilon_i^2}{\sum_{i=1}^{N}\left ( y_i -\bar{y} \right )^2}

 

The summations in the numerator and denominator look alike. The key difference is the fitted values \hat{y}_i in the numerator versus the average \bar{y} in the denominator. R^2 is sometimes called coefficient of determination.

The Adjusted R-squared is

\bar{R}^2= 1- \left ( 1-R^2 \right )\frac{N-1}{N-\left ( k+1 \right )}

 

With a large dataset and a small number of coefficients, the difference between the R-squared and the Adjusted R-squared is negligible.

The F-statistic is

F = \frac{\sum_{i=1}^{N}\left ( \hat{y}_i -\bar{y} \right )^2/k}{\sum_{i=1}^{N}\left ( y_i - \hat{y}_i \right )^2/\left ( N -(k+1) \right )}

 

where N-(k+1) is the number of degrees of freedom. This is used to calculate the p-value.
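All of these numbers can be pulled out of the summary object rather than read off the printout; a sketch, again assuming the hypothetical fitted model fit:

s <- summary(fit)   # summary of the hypothetical model `fit`
s$sigma             # residual standard error
s$df[2]             # residual degrees of freedom, N - (k+1)
s$r.squared         # multiple R-squared
s$adj.r.squared     # adjusted R-squared
s$fstatistic        # F value with its numerator and denominator degrees of freedom
pf(s$fstatistic[1], s$fstatistic[2], s$fstatistic[3], lower.tail = FALSE)   # p-value of the F-statistic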

Standard Errors in Simple Linear Regression

For \alpha and \beta, we have

\sigma_{\hat{\alpha}_N} = \frac{\sigma_\mathcal{E}}{\sqrt{N}}\times \sqrt{1 + \frac{N}{N-1}\times \frac{\bar{x}^2}{s_x^2}}

 

and

\sigma_{\hat{\beta}_N} = \frac{\sigma_\mathcal{E}}{\sqrt{\sum_{i=1}^{N}\left ( x_i -\bar{x}_N \right )^2}} = \frac{\sigma_\mathcal{E}}{\sqrt{N-1}\times s_x}

 

where

s^2_x = \frac{1}{N-1}\sum_{i=1}^{N}\left ( x_i -\bar{x}_N \right )^2

 

is the classic, unbiased estimator of \textrm{Var}(X) from \left ( x_1,\dots,x_N \right ).

Unsurprisingly, they both decrease like 1/\sqrt{N}; more surprisingly, they also drop like 1/s_x. The more spread out the x_i are, the lower the standard error of both \hat{\alpha} and \hat{\beta}.

Consequently, we need to worry about unusual values of the explanatory variable, as they inflate the estimation of its standard deviation. They could lead us to believe that the coefficient estimates are more precise than they really are. See the appendix for justification of these formulas.
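A quick numerical check of the slope formula on a simple regression, with hypothetical vectors x and y:

# Hypothetical simple regression of y on x.
fit1 <- lm(y ~ x)
s_e  <- summary(fit1)$sigma                 # estimate of sigma_E from the residuals
s_e / (sqrt(length(x) - 1) * sd(x))         # sigma_E / (sqrt(N-1) * s_x)
coef(summary(fit1))["x", "Std. Error"]      # matches the standard error reported by lm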

Generalization

As above, matrix algebra extends this discussion to multiple and multivariate regression, but not to methods of model fitting other than least squares. Where you estimate coefficients numerically by successive approximations, you don’t have simple formulas for standard errors. In this case, you may use simulations, like bootstrapped samples, to study the distribution of coefficient estimates and residuals.
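A rough bootstrap sketch of this idea, reusing the hypothetical kiln data frame and formula from above:

# Resample rows with replacement, refit, and look at the spread of the coefficients.
B <- 1000
boot_coefs <- replicate(B, {
  idx <- sample(nrow(kiln), replace = TRUE)
  coef(lm(quality ~ T1 + T2 + T3 + T4 + T5 + height + humidity, data = kiln[idx, ]))
})
apply(boot_coefs, 1, sd)   # bootstrap standard errors, one per coefficient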

Gaussian Residuals and p-Values

In simple linear regression, we have

Y = \alpha + \beta X + \mathcal{E}

 

If the error \mathcal{E} is Gaussian, the estimates \hat{\beta}_N and \hat{\alpha}_N from a sample of size N are linear combinations of Gaussians and, therefore, Gaussian themselves. This is the basis for the p-values and Stars columns in the summary.

Conversely, if the residuals cannot reasonably be modeled as independent, identically distributed Gaussians, then we don’t know that the p-values and stars in the summary are valid. This means that we should analyze the residuals before acting on star ratings.

Collinearity

Unbeknownst to you, some explanatory variables may be functions of other explanatory variables. Collinearity is the special case where they are linear combinations of other variables. In this case, including them in a linear model doesn’t improve it.

Alexander Krannich recently posted on LinkedIn about the Variance Inflation Factor (VIF), which is based on using regression to estimate each explanatory variable from all others. If R_i^2 is the Multiple R-Squared of the model for variable i, then

\mathbf{VIF}_i = \frac{1}{1- R_i^2}

 

In our example, the VIF of the different variables are all near 1, indicating the absence of collinearity. Values above 5 warrant investigation.
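In R, the vif function from the car package computes this for every explanatory variable of a fitted model; a one-liner, assuming the car package is installed and fit is the hypothetical model above:

library(car)   # assumes the car package is installed
vif(fit)       # one VIF per explanatory variable; values near 1 indicate no collinearity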

Diagnostic Plots

In addition to the summary discussed above, the lm regression package in R provides several plots that reveal features that you would not guess from the numbers in the summary.

Residuals versus Fitted Values

This applies to multiple regression, where the fitted values are scalars, but not to multivariate regression, where they are vectors. It simply plots the residuals as a function of the fitted values.
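In R, this is the first of the default diagnostic plots for an lm object; a one-line sketch, assuming the hypothetical model fit:

plot(fit, which = 1)   # residuals versus fitted values, with a LOWESS trend line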

Plot from lm

 

As seen above, almost none of the Quality Scores are below 250 or above 500, but the linear model doesn’t know it and assigns values outside of this range based on the explanatory variables. This is why the low fitted values correspond to large, positive Residuals and the large fitted values to large, negative Residuals.

The extreme values are marked with the row number of the data points to help analyze what may be special about them.

When drawing this chart for 30,000 data points, the dots in the center merge into a blob, which makes it difficult to gauge how frequent the excursions seen on the left and right actually are. The red line is obtained by locally weighted scatterplot smoothing (LOWESS), and you would expect it to follow the cloud of points. It doesn’t here because so many points are in the black blob.

Heat Map Alternative

We can see the pattern more clearly in a heat map, with the red indicating a high concentration of points. The yellow line is a trend line drawn with settings different from the default plot, which make it follow the cloud of points more closely.

Converting a scatterplot to a heat map is one of the techniques proposed by Winston Chang in his R Graphics Cookbook to deal with overplotting (pp. 86-87 in the 1st edition). This chart applies the hexbin package that divides the plot area into tiny hexagons and assigns each a color based on the number of points inside.
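A sketch of such a heat map with ggplot2’s geom_hex, which relies on hexbin, assuming both packages are installed and fit is the hypothetical model above; the colour scale and smoother settings are illustrative choices, not the exact ones behind the figure:

library(ggplot2)                                           # assumes ggplot2 and hexbin are installed
d <- data.frame(fitted = fitted(fit), residual = resid(fit))
ggplot(d, aes(fitted, residual)) +
  geom_hex(bins = 60) +                                    # hexagonal bins instead of individual points
  scale_fill_gradient(low = "lightyellow", high = "red") +
  geom_smooth(se = FALSE, colour = "yellow") +             # trend line through the cloud
  labs(x = "Fitted values", y = "Residuals")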

This plot tells us something we didn’t already know: The model’s Fitted Values are not good predictors of the Quality Score when they are below 250 or above 500. Within the 250 to 500 range, the distribution is flat enough to assume homoscedasticity—that is, that the residuals’ variance is constant.

In simple linear regression, we can visualize it as a function of the explanatory variable because there is only one. Here we have multiple regression, and the domain over which this variance is constant is the set of values of the explanatory variables that produce a Fitted Value within the range.

Scale-Location

This plot is intended to highlight heteroscedasticity – differences in the distribution of residuals as a function of the fitted values. There shouldn’t be any.

The y-axis label on this plot references standardized residuals, which usually – but not here – are ratios of the residuals to their standard deviation estimate:

r_i = \frac{y_i- \hat{y}_i}{\sqrt{\frac{1}{N-(k+1)}\times\sum_{j=1}^{N}\left ( y_j-\hat{y}_j \right )^2}}

 

In multiple regression, the vector of fitted values \mathbf{\hat{y}} is the product of the response data \mathbf{y} with a matrix \mathbf{H} = \left ( h_{ij} \right )_{i,j = 1,\dots, N}, called the hat matrix, which is a function of the explanatory variables x_i, i = 1,\dots, N. It was identified in 1972 by John Tukey and has many useful characteristics, described in the appendix.

The Scale-Location plot from the lm package in R does not, in fact, show the standardized residuals but instead the studentized residuals:

t_i = \frac{r_i}{\sqrt{1-h_{ii}}}

 

The denominator is intended to account for the dependence of the conditional variance given \mathbf{X} of the residuals on the leverage h_{ii} of point i.

On the other hand, the justification for plotting specifically the square roots of the absolute values of the studentized residuals is not clear. The only explanation I could find is that “it helps stabilize the variance of the residuals.”

Scale-Location plot from lm

Scale-Location Heat Map

As before, a heat map lets us see inside the black blob:

In case it’s not visually obvious, Breusch and Pagan developed a test of heteroscedasticity in 1979.
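Both the plot and the test are one-liners in R, assuming the fitted model fit and an installed lmtest package:

plot(fit, which = 3)   # Scale-Location: sqrt(|standardized residuals|) versus fitted values
library(lmtest)        # assumes the lmtest package is installed
bptest(fit)            # Breusch-Pagan test of heteroscedasticity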

Leverage and Influence

These are two characteristics of the data that sound alike but are different. On p. 190 of his Introduction to Linear Regression Analysis, Douglas Montgomery uses the following chart to explain them:

Leverage

Leverage matters because, as discussed above, the standard errors of the coefficients in simple linear regression are inversely proportional to the standard deviation estimate s_x calculated from the x_i, i = 1,\dots, N. The larger s_x, the more precise the estimates \hat{\alpha} and \hat{\beta} appear. A high-leverage point inflates s_x and misleads us about the precision of the coefficient estimates.

A high-leverage point may be an accurate observation or the result of a data collection error. Either way, it warrants special attention.

In simple linear regression, you could use the ratio of the square (x_i -\bar{x})^2 to the sum of squares Ns^2 as a measure of how far x_i is from the mainstream. Multiple regression is more complex, and you use the result of matrix calculations, explained in the appendix for interested readers.

The diagonal element h_{ii} of the hat matrix \mathbf{H} = \left ( h_{ij} \right )_{i,j = 1,\dots, N} is defined as the leverage of the ith point. For the ith fitted value \mathbf{\hat{Y}}_i we have:

Var\left (\hat{Y}_i | \mathbf{X}\right ) = \sigma_{\mathcal{E}}^2 h_{ii}

 

and the variance of the ith residual, by orthogonality, is

Var(\mathcal{E}_i|\mathbf{X}) = Var\left (Y_i - \hat{Y}_i \right|\mathbf{X} ) = \sigma_{\mathcal{E}}^2 \left ( 1 - h_{ii} \right )

 

Influence

The influence of a data point is a measure of the change in fitted values caused by its removal from the training set. Successively removing each point from the training set and recalculating the model sounds computationally intensive but isn’t, at least when fitting by least squares. The measure of influence is called Cook’s Distance, even though it’s not a distance in the usual sense of the word.

D_i=\frac{\sum_{j=1}^N\left(\hat{y}_j-\hat{y}_{j(i)}\right)^2}{(1+k) s^2_e}

 

where

  • \hat{y}_j, j= 1,\dots,N are the fitted values of the model
  • \hat{y}_{j(i)}, j= 1,\dots,N are the fitted values obtained when the model is regenerated with x_i omitted.
  • k is the number of explanatory variables
  • s^2_e is the estimate of the variance of the residuals. s^2_e = \frac{1}{N - (k+1)}\sum_{i=1}^{N}\left ( y_i - \hat{y}_i \right )^2

Using the same matrix \mathbf{H} as in the leverage calculation, you can calculate Cook’s Distance for every point in the sample without refitting a model for each one:

D_i=\frac{\left ( y_i -\hat{y}_i \right )^2}{\left ( 1+k \right ) s^2_e}\times\frac{h_{i i}}{\left ( 1-h_{i i} \right )^2 }

 

Even though the literature describes this as a “simple reformulation” of the previous formula, it is anything but obvious. It shows Cook’s Distance as a function of both the residual and the leverage. It also provides a way to calculate it without explicitly regenerating the model after removing one value.

Studentized Residuals versus Leverage from lm

R’s lm also gives you a scatterplot of Standardized Residuals versus Leverage that is not straightforward to interpret.
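The quantities behind this plot are directly available in R; a sketch, assuming the fitted model fit:

h  <- hatvalues(fit)        # leverage: diagonal elements of the hat matrix
cd <- cooks.distance(fit)   # Cook’s Distance, computed without refitting N models
plot(fit, which = 5)        # standardized residuals versus leverage, with Cook’s Distance contours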

Heat Map of Studentized Residuals versus Leverage

As in other plots, the 30,000 points form a black blob within which we see nothing, but a heat map shows everything more clearly. Because Cook’s Distance is a function of both residuals and leverage, it is shown on this chart as contour lines.

Testing Cook’s Distance

The point of calculating Cook’s Distance for every point in the training set is, obviously, to identify high influence points. The inclusion of the estimated variance of the residuals in the denominator is to make the distance testable with the same threshold regardless of the sizes of \mathbf{X} and \mathbf{Y}.

The values of the Cook’s Distance in our example are quite low, with no indication of high-leverage or high-influence points. Values of 2 or 3 would be cause for concern. The default plot from lm shows contour lines for Cook’s Distance at multiples of 0.5, as in this example from RStudio’s RPubs:

Plotting studentized rather than standardized residuals

The lm package in R does not, in fact, show standardized residuals but instead the studentized residuals:

t_i=\frac{r_i}{\sqrt{1-h_{i i}}}

 

The denominator is intended to account for the dependence of the variance of the residuals on the leverage h_{ii} of point i.

Conclusions

Vetting a regression model is more complex than generating one. The tools provided with lm – the most basic linear regression tool in R – are powerful but require study. The plots provided are useful for datasets of at most a few hundred points, but the points merge into black blobs with tens of thousands of points. Regenerating them as heat maps makes them readable. There is abundant literature on the topic, and its recent vintage suggests that it is an active area of research in data science.

Appendix: The Math Behind

This is an introduction to the math of linear regression in the form of matrix algebra, which is needed for multiple regression – more than one explanatory variable – and multivariate regression – multidimensional response. We start by showing how matrix algebra tells us what we already know about simple linear regression. All the formulas in the post are explained here, except for the simplification of the Cook’s Distance formula which, while described as a “simple reformulation,” stumped me. 

Estimating a Variance

Starting with something basic, if X is a random variable with mean \mu and standard deviation \sigma, then, for a sample X_1,\dots,X_N, the sum of squared deviations from the sample average \bar{X} =\frac{1}{N}\sum_{i=1}^{N}X_i is

\sum_{i=1}^{N}\left ( X_i - \bar{X} \right )^2 = \sum_{i=1}^{N}X_i^2 + N\bar{X}^2 - 2\bar{X}\sum_{i=1}^{N}X_i = \sum_{i=1}^{N}X_i^2 -N\bar{X}^2

 

The expected values of these two terms are

E\left [ \sum_{i=1}^{N} X_i^2\right ]= N\left ( \mu^2 + \sigma^2 \right )

 

and

E\left [ N\bar{X}^2\right ]= \frac{1}{N}E\left [\left ( \sum_{i=1}^{N} X_i\right )^2\right ] = \mu^2 + \sigma^2 + \left ( N-1\right )\mu^2

 

Taking the difference between the two, we get

E\left [ \sum_{i=1}^{N}\left ( X_i - \bar{X} \right )^2 \right ] = \left ( N-1 \right )\sigma^2

 

which means that we need to divide the sum of squares by (N-1) and not N to have an unbiased estimator of the variance.

With multiple coefficients, it isn’t quite so obvious. It requires matrix calculations, shown below.

Standard Errors in Simple Linear Regression

The simple linear regression model is

\mathbf{Y}= \alpha + \beta \mathbf{X} +\mathcal{E}

 

Let us go back to simple linear regression, and the formula for the estimate of the slope \beta:

\hat{\beta}_N = \frac{\mathbf{x}\cdot \mathbf{y} - N\bar{x}_N\bar{y}_N}{\mathbf{x}\cdot \mathbf{x} - N\bar{x}_N^2}

 

Noting that

\mathbf{x}\cdot \mathbf{y} - N\bar{x}_N\bar{y}_N = \sum_{i=1}^{N}\left ( x_i - \bar{x}_N \right )y_i - \bar{y}\times \sum_{i=1}^{N} \left ( x_i - \bar{x}_N \right )= \sum_{i=1}^{N}\left ( x_i - \bar{x}_N \right )y_i

 

We can express \hat{\beta}_N as

\hat{\beta}_N = \sum_{i=1}^{N}c_iy_i = \mathbf{c}\cdot \mathbf{y}

 

where

c_i = \frac{x_i - \bar{x}}{\mathbf{x}\cdot \mathbf{x} - N\bar{x}_N^2}, i =1\dots N

 

is only a function of the x_i.

We are interested in the way \hat{\beta}_N varies with Y, given the x_i. Since the y_i are independent instances of Y, their variances add up and are all equal to \mathrm{Var}(Y|\mathbf{X}) = \sigma_\mathcal{E}^2. Therefore

\mathrm{Var}\left ( \hat{\beta}_N \right ) = \mathrm{Var}\left ( Y|\mathbf{X} \right )\sum_{i = 1}^{N}c_i^2= \frac{\sigma_\mathcal{E}^2}{\sum_{i=1}^{N}\left ( x_i -\bar{x}_N \right )^2}

 

and the standard error of \beta is the square root of this expression:

\sigma_{\hat{\beta}_N} = \frac{\sigma_\mathcal{E}}{\sqrt{\sum_{i=1}^{N}\left ( x_i -\bar{x}_N \right )^2}} = \frac{\sigma_\mathcal{E}}{\sqrt{N-1}\times s_x}

 

where

s^2_x = \frac{1}{N-1}\sum_{i=1}^{N}\left ( x_i -\bar{x}_N \right )^2

 

is the classic estimator of \textrm{Var}(X) from \left ( x_1,\dots,x_N \right ).

 

For the estimator \hat{\alpha}_N = \bar{Y} - \hat{\beta}_N\bar{x} of the intercept \alpha, the covariance between \bar{Y} and \hat{\beta}_N vanishes because \sum_{i=1}^{N}c_i = 0, so that

\mathrm{Var}\left ( \hat{\alpha}_N \right )= \mathrm{Var}\left ( \bar{Y}|\mathbf{X}\right ) + \bar{x}^2\times \mathrm{Var}\left ( \hat{\beta}_N \right ) = \sigma_\mathcal{E}^2\left ( \frac{1}{N} + \frac{\bar{x}^2}{\left ( N-1 \right )s_x^2}\right )

 

and

\sigma_{\hat{\alpha}_N} = \frac{\sigma_\mathcal{E}}{\sqrt{N}}\times \sqrt{1 + \frac{N}{N-1}\times \frac{\bar{x}^2}{s_x^2}}

 

Unsurprisingly, standard errors should decrease like 1/\sqrt{N}, as this applies to estimating an expected value from a sample average. What is more remarkable is that they also decrease like 1/s_x, which means that the more spread out the x_i, the more precise the estimates, in a quantifiable way.

From data, the variance \sigma_\mathcal{E}^2 is estimated from the residuals by:

s_\mathcal{E}^2 = \frac{1}{N-2}\sum_{i=1}^{N}\left ( y_i -\hat{y}_i \right )^2

 

See below the explanation for the division by N-2.

 

Matrix View of Simple Linear Regression

Within the training set, the fitted values \hat{y}_i, i = 1, \dots, N are calculated as follows:

\hat{y}_i = \hat{\alpha} + \hat{\beta} \times x_i

 

Or, in terms of matrices,

\mathbf{\hat{y}}= \hat{\alpha} + \mathbf{x}\hat{\beta} = \begin{bmatrix}\hat{\alpha} + \hat{\beta}x_1\\\vdots\\\hat{\alpha} + \hat{\beta}x_N\end{bmatrix} = \begin{bmatrix}1 & x_1 \\\vdots & \vdots \\ 1 & x_N\\\end{bmatrix}\times\begin{bmatrix}\hat{\alpha} \\\hat{\beta}\end{bmatrix}= \mathbf{U}\times \mathbf{\hat{\Gamma}}

 

The orthogonality of the vector of fitted values \mathbf{\hat{y}} with the vector of residuals \mathbf{y} - \mathbf{\hat{y}} means that their inner product is 0:

\left ( \mathbf{y} - \mathbf{\hat{y}} \right )\cdot \mathbf{\hat{y}} = 0

 

If we write \mathbf{A}^T for the transpose of matrix \mathbf{A} then, in terms of matrices, the above becomes:

\mathbf{ \hat{\Gamma}}^T\mathbf{U}^T\left ( \mathbf{y} - \mathbf{U}\mathbf{\hat{\Gamma}}\right )= 0

 

This is true for:

\mathbf{ \hat{\Gamma}} = \left ( \mathbf{U}^T\mathbf{U} \right )^{-1}\mathbf{U}^T\mathbf{y}

 

provided the inverse \left ( \mathbf{U}^T\mathbf{U} \right )^{-1} exists.

Match with Simple Linear Regression

While it is not immediately obvious, in Simple Linear Regression the matrix formula for

\mathbf{ \hat{\Gamma}} =\begin{bmatrix}\hat{\alpha}\\ \hat{\beta} \end{bmatrix}

 

produces the same results as the formulas we calculated before. The point of switching to this matrix representation is that it works verbatim for multiple and multivariate linear regression, with simple linear regression just the special case where there is only one explanatory variable.

\mathbf{U}^T\mathbf{U}= \begin{bmatrix}1 & \dots & 1 \\x_1 &\dots & x_N \\\end{bmatrix}\begin{bmatrix} 1 & x_1 \\\vdots &\vdots \\1 & x_N \\\end{bmatrix} = \begin{bmatrix}N & \sum_{i=1}^{N}x_i\\\sum_{i=1}^{N}x_i & \sum_{i=1}^{N}x^2_i \\\end{bmatrix} = N\begin{bmatrix}1 & \bar{x} \\\bar{x} & s^2 + \bar{x}^2 \\\end{bmatrix}

 

with

s^2 = \frac{1}{N}\sum_{i=1}^{N}\left ( x_i - \bar{x} \right )^2

 

The determinant of this matrix is

Det\left ( \mathbf{ U}^T\mathbf{U}\right )= N^2\begin{vmatrix}1 & \bar{x} \\\bar{x} & s^2 + \bar{x}^2 \end{vmatrix} =N^2s^2

 

and its inverse:

\left (\mathbf{ U}^T\mathbf{U}\right )^{-1}= \frac{1}{Ns^2}\begin{bmatrix}s^2 + \bar{x}^2 & -\bar{x} \\-\bar{x} & 1 \\\end{bmatrix}

 

Then we have:

\mathbf{ U}^T\mathbf{y}= \begin{bmatrix}1 & \dots & 1 \\x_1 &\dots & x_N \\\end{bmatrix}\begin{bmatrix} y_1 \\\vdots \\ y_N \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^{N}y_i \\\sum_{i=1}^{N}x_iy_i \end{bmatrix}=\begin{bmatrix} N\bar{y} \\ \mathbf{x}\cdot\mathbf{y} \end{bmatrix}

 

and

\begin{bmatrix}\hat{\alpha} \\\hat{\beta}\end{bmatrix}= \mathbf{\hat{\Gamma}}= \frac{1}{Ns^2}\begin{bmatrix}s^2 + \bar{x}^2 & -\bar{x} \\-\bar{x} & 1 \\\end{bmatrix}\begin{bmatrix}N\bar{y} \\ \mathbf{x}\cdot\mathbf{y}\end{bmatrix}

 

which works out to the formulas we already had for \hat{\alpha} and \hat{\beta}.

 

Extension to Multiple Regression

In multiple regression, we have p explanatory variables rather than 1,

\textbf{U} =\begin{bmatrix}1 &x_{11} &\dots & x_{p1} \\\vdots& \vdots &\vdots &\vdots \\1 &x_{1N} &\dots & x_{pN} \\\end{bmatrix}

 

and

\hat{y}_i = \hat{\alpha} + \hat{\beta}_1 \times x_{1i} + \dots + \hat{\beta}_p \times x_{pi}

 

The matrix formulation of the results is otherwise unchanged.

In multivariate regression, you have a vector \mathbf{Y} of responses rather than one number, but the matrix formulation is otherwise unchanged.

Its conciseness and generality are precisely the point.

The Hat Matrix

The hat matrix was introduced by John Tukey in 1972. The matrix representation also allows us to express the fitted values \mathbf{\hat{y}} as the product of the “hat matrix” \mathbf{H}=\mathbf{U}\left ( \mathbf{U}^T\mathbf{U} \right )^{-1}\mathbf{U}^T with the observations \mathbf{y}:

\mathbf{\hat{y}}= \mathbf{U}\mathbf{ \hat{\Gamma}} = \mathbf{U}\left ( \mathbf{U}^T\mathbf{U} \right )^{-1}\mathbf{U}^T\mathbf{y}= \mathbf{H}\mathbf{y}

 

As \hat{y} is an orthogonal projection, \hat{\hat{y}} = \hat{y}, meaning that \mathbf{H}^2 = \mathbf{H}, and \mathbf{H} is idempotent. It is easily verified:

\mathbf{H}^2=\mathbf{U}\left ( \mathbf{U}^T\mathbf{U} \right )^{-1}\left ( \mathbf{U}^T\mathbf{U} \right )\left ( \mathbf{U}^T\mathbf{U} \right )^{-1}\mathbf{U}^T = \mathbf{U}\left ( \mathbf{U}^T\mathbf{U} \right )^{-1}\mathbf{U}^T = \mathbf{H}

 

Because \mathbf{H} is symmetric, it can be transformed into a diagonal matrix, with diagonal elements called eigenvalues. Because it is idempotent, its eigenvalues can only be 0 and 1, and the number of eigenvalues that are 1 is the rank of \mathbf{H}.

This number is also equal to the sum of the diagonal elements of \mathbf{H}, its trace Tr(\mathbf{H}). If the k explanatory variables are independent, \mathbf{H} projects \mathbf{Y} onto a space of dimension 1+k, which is also the rank of \mathbf{H} and therefore equal to Tr(\mathbf{H}).

\mathbf{H} actually projects \mathbf{Y} onto the space generated by the columns of \mathbf{U}, which are therefore its eigenvectors:

\mathbf{H}\mathbf{U} =\mathbf{U}\left ( \mathbf{U}^T\mathbf{U} \right )^{-1}\mathbf{U}^T\mathbf{U} = \mathbf{U}

 

Let h_{ij} for i,j = 1,\dots,N be the general element of \mathbf{H}.

As the first column of \mathbf{U} is all 1s, \mathbf{HU} = \mathbf{U} gives us:

\sum_{j=1}^{N} h_{ij} = 1 for i = 1,\dots,N

The rows of \mathbf{H} add up to 1 and, by symmetry, so do its columns.

Because \mathbf{H}^2 = \mathbf{H} and \mathbf{H}^T = \mathbf{H}, we also have

\sum_{j=1}^{N} h_{ij}^2 = h_{ii} for i = 1,\dots,N

 

and

Tr( \mathbf{H}) =\sum_{i = 1}^{N} h_{ii} = 1 + k

 

The Hat Matrix in Simple Linear Regression

In the simple linear regression case,

\mathbf{H} = \frac{1}{Ns^2}\begin{bmatrix}1 & x_1 \\\vdots &\vdots \\1 & x_N \\\end{bmatrix}\begin{bmatrix}s^2 + \bar{x}^2 & -\bar{x} \\-\bar{x} & 1 \\\end{bmatrix}\begin{bmatrix}1 & \dots & 1 \\x_1 &\dots & x_N \\\end{bmatrix}

 

\mathbf{H} is a square N\times N matrix and, therefore, in the example we used, would have about 30,000 rows and 30,000 columns. If h_{ij} is the element in the i-th row and j-th column, we have

h_{ij} = \frac{1}{N}\times \left [ 1+\frac{\left ( x_i - \bar{x} \right ) \left ( x_j - \bar{x} \right )}{s^2} \right ]

 

and, in particular, on the diagonal of H, for i=1,\dots,N

 

h_{ii} = \frac{1}{N}\times \left [ 1+\frac{\left ( x_i - \bar{x} \right )^2 }{s^2} \right ]

 

Estimating the Variance of the Residuals

We assume that the observations

\mathbf{y}= \begin{bmatrix}y_1\\\vdots \\ y_N \end{bmatrix}

 

are drawn from a random vector

\mathbf{Y}= \begin{bmatrix}Y_1\\\vdots \\ Y_N \end{bmatrix}

 

of variables Y_i that are independent and identically distributed with means \mu and standard deviations \sigma.

The variance of \mathbf{Y}, given \mathbf{X}, is Var(\mathcal{E})= \sigma_{\mathcal{E}}^2.

By independence of the occurrences of Y in the sample, for any i \neq j, Cov\left ( Y_i, Y_j | \mathbf{X}\right ) = 0 , and therefore the covariance matrix of \mathbf{Y} given \mathbf{X} is \sigma_{\mathcal{E}}^2\mathbf{I}_N, where \mathbf{I}_N is the identity N\times N matrix, with 1s on the diagonal and 0s elsewhere.

The fitted values then form the random vector \hat{\mathbf{Y}}= \mathbf{HY}. Because \mathbf{H} is symmetric, \mathbf{H}^T = \mathbf{H} and because it is idempotent, \mathbf{H}^2 = \mathbf{H}. Therefore

Cov\left ( \mathbf{HY} \right|\mathbf{X} )= \mathbf{H} Cov\left ( Y |\mathbf{X}\right )\mathbf{H}^T = \sigma_{\mathcal{E}}^2 \mathbf{H} \mathbf{H}^T = \sigma_{\mathcal{E}}^2 \mathbf{H}^2 = \sigma_{\mathcal{E}}^2 \mathbf{H}

 

This means, in particular, that if h_{ij} is the element of \mathbf{H} at row i and column j, then, for i=1, \dots, N,

Var\left (\hat{Y}_i | \mathbf{X}\right ) = \sigma_{\mathcal{E}}^2 h_{ii}

 

and the variance of the ith residual, by orthogonality, is

Var(\mathcal{E}_i|\mathbf{X}) = Var\left (Y_i - \hat{Y}_i \right|\mathbf{X} ) = \sigma_{\mathcal{E}}^2 \left ( 1 - h_{ii} \right )

 

The covariance matrix of \mathbf{Y} - \mathbf{\hat{Y}} given \mathbf{X} is \sigma_{\mathcal{E}}^2\left ( \mathbf{I}_N - \mathbf{H} \right ); in particular, its diagonal elements are \sigma_{\mathcal{E}}^2 \left ( 1 - h_{11} \right ),\dots, \sigma_{\mathcal{E}}^2 \left ( 1 - h_{NN} \right ):

Cov\left (\mathbf{Y}- \mathbf{\hat{Y}} \,|\,\mathbf{X}\right ) = \sigma_{\mathcal{E}}^2 \left ( \mathbf{I}_N - \mathbf{H} \right )

 

But, because the residuals have mean 0,

Var\left (Y_i - \hat{Y}_i\,|\,\mathbf{X} \right ) = E\left [ \left ( Y_i - \hat{Y}_i \right )^2 \,|\,\mathbf{X}\right ]

 

and therefore

E\left [ \sum_{i=1}^{N}\left ( Y_i - \hat{Y}_i \right )^2 \right ]= \mathbf{Tr}\left [ Cov\left ( \mathbf{Y}- \mathbf{\hat{Y}}\,|\, \mathbf{X}\right ) \right ] = \sigma_{\mathcal{E}}^2 \left [ N - \sum_{i=1}^{N}h_{ii}\right ] = \sigma_{\mathcal{E}}^2 \left [ N - \left ( 1 +k \right )\right ]

 

From this, we conclude that

s_e^2 = \frac{1}{N-(1+k)}\sum_{i =1}^{N}\left ( y_i - \hat{y}_i \right )^2

 

is an unbiased estimator of \sigma_{\mathcal{E}}^2.

Standardized and Studentized Residuals

We express the residuals as multiples of their standard deviations to generate plots where the absolute size is irrelevant.

For point i, the standardized residual is

r_i=\frac{\left ( y_i -\hat{y}_i \right )}{ s_e}

 

It does not take into account the differences in the variance of the residuals with i, based on the value of the explanatory variables \mathbf{X}. If we think conditionally on \mathbf{X}, we get the studentized residual:

t_i=\frac{\left ( y_i -\hat{y}_i \right )}{ s_e\sqrt{1-h_{i i}}}

 

The plots generated from the lm package in R have axes labeled Standardized Residuals when the calculations are actually for Studentized Residuals.
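In R, the quantity this post calls the studentized residual is what rstandard returns; a sketch, assuming the fitted model fit:

rstandard(fit)   # residual / (s_e * sqrt(1 - h_ii)): the t_i above, despite the name
rstudent(fit)    # externally studentized version, which also leaves point i out of s_e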


#quality, #regression, #validation, #linearmodel