Regression analysis is concerned with the approximation of a dependent variable Y_i (regressand, response variable) by a function f(X_i) of a vector of independent variables X_i (regressors, predictor variables): Y_i \approx f(X_i), \quad i=1, \ldots, n.
The least squares method selects the regression function that minimizes the squared approximation error: \min_{f} \sum_{i=1}^n (Y_i - f(X_i))^2. \tag{9.1}
In linear regression, the regression function f(\cdot) is a linear function, and the minimization problem of Equation 9.1 is called the ordinary least squares (OLS) problem.
We approximate the dependent variable Y_i by a linear function of k-1 independent variables X_{i2}, \ldots, X_{ik} plus an intercept: Y_i \approx b_1 + b_2 X_{i2} + \ldots + b_k X_{ik}. The regression function can be written as an inner product: f(X_i) = b_1 + b_2 X_{i2} + \ldots + b_k X_{ik} = X_i'b, with X_i = \begin{pmatrix} 1 \\ X_{i2} \\ \vdots \\ X_{ik} \end{pmatrix}, \quad b = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_k \end{pmatrix}, where X_i is the vector of k regressors and b is the vector of k regression coefficients.
For a given sample \{(Y_1, X_1'), \ldots, (Y_n, X_n')\} and a given coefficient vector b \in \mathbb R^k, the sum of squared errors is S_n(b) = \sum_{i=1}^n (Y_i - f(X_i))^2 = \sum_{i=1}^n (Y_i - X_i'b)^2. The least squares coefficient vector \widehat \beta is the minimizing argument: \widehat \beta = \argmin_{b \in \mathbb R^k} \sum_{i=1}^n (Y_i - X_i'b)^2 \tag{9.2}
Least squares coefficients
If the k \times k matrix (\sum_{i=1}^n X_i X_i') is invertible, the solution to Equation 9.2 is unique with \widehat \beta = \Big( \sum_{i=1}^n X_i X_i' \Big)^{-1} \sum_{i=1}^n X_i Y_i. \tag{9.3}
Proof. Write the sum of squared errors as \begin{align*} S_n(b) = \sum_{i=1}^n (Y_i - X_i'b)^2 = \sum_{i=1}^n Y_i^2 - 2 b' \sum_{i=1}^n X_i Y_i + b' \sum_{i=1}^n X_i X_i' b. \end{align*} The first-order condition for a minimum in b=\widehat \beta is that the gradient is zero, and the second-order condition is that the Hessian matrix is positive definite (see Matrix calculus).
The first-order condition \begin{align*} \frac{\partial S_n(b)}{\partial b} = -2 \sum_{i=1}^n X_i Y_i + 2 \sum_{i=1}^n X_i X_i' b \overset{!}{=} \boldsymbol 0_k \end{align*} implies \begin{align*} \sum_{i=1}^n X_i X_i' \widehat \beta = \sum_{i=1}^n X_i Y_i \quad \Leftrightarrow \quad \widehat \beta = \Big( \sum_{i=1}^n X_i X_i' \Big)^{-1} \sum_{i=1}^n X_i Y_i. \end{align*}
The second-order condition \begin{align*} \frac{\partial^2}{\partial b \partial b'} S_n(b) = 2 \sum_{i=1}^n X_i X_i' \overset{!}{>} 0 \end{align*} is satisfied for any b since
c'\bigg(\sum_{i=1}^n X_i X_i'\bigg)c = \sum_{i=1}^n c'X_i X_i'c = \sum_{i=1}^n \underbrace{(c'X_i)^2}_{\geq 0} > 0
for any nonzero vector c \in \mathbb R^k (see definite matrix). The strict positivity follows from the condition that \sum_{i=1}^n X_i X_i' is invertible, i.e., it is nonsingular and does not have any zero eigenvalues (see here). \square
OLS fitted values and residuals
The fitted values or predicted values are \widehat Y_i = \widehat \beta_1 + \widehat \beta_2 X_{i2} + \ldots + \widehat \beta_k X_{ik} = X_i'\widehat \beta, \quad i=1, \ldots, n. The residuals are \widehat u_i = Y_i - \widehat Y_i = Y_i - X_i'\widehat \beta, \quad i=1, \ldots, n.
A simple linear regression is a linear regression of a dependent variable on a constant and a single independent variable.
Let’s revisit the wage and education data from Table 9.1 and consider only the first 20 observations:
| Person | Wage | log(Wage) | Education | Education² | Education × log(Wage) |
|---|---|---|---|---|---|
| 1 | 12.91 | 2.56 | 18 | 324 | 46.08 |
| 2 | 11.49 | 2.44 | 14 | 196 | 34.16 |
| 3 | 10.22 | 2.32 | 14 | 196 | 32.48 |
| 4 | 11.49 | 2.44 | 16 | 256 | 39.04 |
| 5 | 9.20 | 2.22 | 16 | 256 | 35.52 |
| 6 | 14.94 | 2.70 | 14 | 196 | 37.80 |
| 7 | 11.75 | 2.46 | 16 | 256 | 39.36 |
| 8 | 15.09 | 2.71 | 16 | 256 | 43.36 |
| 9 | 23.95 | 3.18 | 18 | 324 | 57.24 |
| 10 | 8.62 | 2.15 | 12 | 144 | 25.80 |
| 11 | 25.54 | 3.24 | 18 | 324 | 58.32 |
| 12 | 15.84 | 2.76 | 14 | 196 | 38.64 |
| 13 | 5.17 | 1.64 | 12 | 144 | 19.68 |
| 14 | 28.74 | 3.36 | 21 | 441 | 70.56 |
| 15 | 6.44 | 1.86 | 14 | 196 | 26.04 |
| 16 | 12.92 | 2.56 | 12 | 144 | 30.72 |
| 17 | 9.20 | 2.22 | 13 | 169 | 28.86 |
| 18 | 13.65 | 2.61 | 21 | 441 | 54.81 |
| 19 | 12.64 | 2.54 | 12 | 144 | 30.48 |
| 20 | 18.11 | 2.90 | 21 | 441 | 60.90 |
| sum | 277.91 | 50.87 | 312 | 5044 | 809.85 |
We regress Y_i = \log(wage_i) on a constant and Z_i = education_i. The regressor vector is X_i = (1,Z_i)'.
The OLS coefficients are \begin{align*} \begin{pmatrix} \hat \beta_1 \\ \hat \beta_2 \end{pmatrix} &= \Big( \sum_{i=1}^n X_i X_i' \Big)^{-1} \sum_{i=1}^n X_i Y_i \\ &= \begin{pmatrix} n & \sum_{i=1}^n Z_i \\ \sum_{i=1}^n Z_i & \sum_{i=1}^n Z_i^2 \end{pmatrix}^{-1} \begin{pmatrix} \sum_{i=1}^n Y_i \\ \sum_{i=1}^n Z_i Y_i \end{pmatrix} \end{align*}
Evaluate sums: \begin{align*} \sum_{i=1}^n X_i Y_i = \begin{pmatrix} 50.87 \\ 809.85 \end{pmatrix}, \quad \sum_{i=1}^n X_i X_i' = \begin{pmatrix} 20 & 312 \\ 312 & 5044 \end{pmatrix} \end{align*}
OLS coefficients: \begin{align*} \begin{pmatrix} \hat \beta_1 \\ \hat \beta_2 \end{pmatrix} &= \begin{pmatrix} 20 & 312 \\ 312 & 5044 \end{pmatrix}^{-1} \begin{pmatrix} 50.87 \\ 809.85 \end{pmatrix} = \begin{pmatrix} 1.107 \\ 0.092 \end{pmatrix} \end{align*}
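These numbers can be double-checked with R's matrix tools; a small sketch that plugs in the sums from the table (solve() with a right-hand side solves the linear system directly):

XtX = matrix(c(20, 312, 312, 5044), nrow = 2) # matrix of sums of X_i X_i'
XtY = c(50.87, 809.85)                        # vector of sums of X_i Y_i
solve(XtX, XtY)                               # approximately (1.107, 0.092)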
In R, we can use the lm() function to compute the least squares coefficients:
fit1 = lm(logwage~education)
fit1
Call:
lm(formula = logwage ~ education)
Coefficients:
(Intercept) education
1.10721 0.09207
The OLS coefficient vector is \widehat \beta = (1.107, 0.092)' and the fitted regression line is f(education) = 1.107 + 0.092 \ education
par(mfrow = c(1,2))
## Fitted values and residuals
fitted = fit1$fitted.values
residuals = fit1$residuals
plot(education,fitted, main="Fitted values")
abline(fit1, col="blue")
plot(education,residuals, main="Residuals")
abline(h=0)
There is another, simpler formula for \widehat \beta_1 and \widehat \beta_2 in the simple linear regression model. It can be expressed in terms of sample moments:
Simple linear regression
The least squares coefficients in a simple linear regression can be written as \widehat \beta_2 = \frac{\widehat \sigma_{YZ}}{\widehat \sigma_Z^2} = \frac{s_{YZ}}{s_Z^2}, \quad \widehat \beta_1 = \overline Y - \widehat \beta_2 \overline Z \tag{9.4}
The results coincide with those from above:
beta2.hat = cov(logwage,education)/var(education)
beta1.hat = mean(logwage) - beta2.hat*mean(education)
c(beta1.hat, beta2.hat)
[1] 1.10720588 0.09207014
Here is an illustration of fitted values and residuals for an OLS regression line of another dataset:
The following code computes \widehat{\beta} for a given dataset with k=3 regressors X_i = (1,X_{i2},X_{i3})':
# Some given data
Y = c(0.7,-1.0,-0.2,-1.2,-0.1,3.4,0.0,0.8,3.7,2.0)
X_2 = c(1.9,0.8,1.25,0.1,-0.1,4.4,4.6,1.6,5.5,3.4)
X_3 = c(66, 62, 59, 61, 63, 70, 68, 62, 68, 66)
## Compute the OLS estimation
fit2 = lm(Y ~ X_2 + X_3)
## Plot sample regression surface
library("scatterplot3d") # library for 3d plots
plot3d <- scatterplot3d(x = X_2, y = X_3, z = Y,
angle = 33, scale.y = 0.8, pch = 16,
color ="red",
xlab = "X_2",
ylab = "X_3",
main ="OLS Regression Surface")
plot3d$plane3d(fit2, lty.box = "solid", col=gray(.5), draw_polygon=TRUE)
fit2$coefficients
(Intercept)         X_2         X_3 
 -8.4780709   0.4955995   0.1259828 
The fitted regression surface is f(X_i) = -8.478 + 0.496 X_{i2} + 0.126 X_{i3}.
To avoid the summation notation in Equation 9.3, we stack the n observations of the dependent variable into a vector and the n regressor vectors as 1\times k row vectors into an n \times k matrix: \begin{align*} \boldsymbol Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}, \quad \boldsymbol X = \begin{pmatrix} X_1' \\ X_2' \\ \vdots \\ X_n' \end{pmatrix} = \begin{pmatrix} 1 & X_{12} & \ldots & X_{1k} \\ \vdots & & & \vdots \\ 1 & X_{n2} &\ldots & X_{nk} \end{pmatrix} \end{align*}
We have \sum_{i=1}^n X_i X_i' = \boldsymbol X'\boldsymbol X and \sum_{i=1}^n X_i Y_i = \boldsymbol X' \boldsymbol Y.
The least squares coefficients become
\widehat \beta = \Big( \sum_{i=1}^n X_i X_i' \Big)^{-1} \sum_{i=1}^n X_i Y_i = (\boldsymbol X'\boldsymbol X)^{-1} \boldsymbol X'\boldsymbol Y.
By using matrix notation, the OLS coefficient formula can be quickly implemented in R (recall that %*% is matrix multiplication, solve() computes the inverse, and t() the transpose):
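A minimal sketch of this computation (assuming the vectors Y, X_2, and X_3 from the example above are still in the workspace):

X = cbind(1, X_2, X_3)                       # stack the regressors into the n x k matrix
beta_hat = solve(t(X) %*% X) %*% t(X) %*% Y  # OLS formula in matrix notation
beta_hat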
[,1]
-8.4780709
X_2 0.4955995
X_3 0.1259828
The vector of fitted values is \widehat{\boldsymbol Y} = \begin{pmatrix} \widehat Y_1 \\ \vdots \\ \widehat Y_n \end{pmatrix} = \boldsymbol X \widehat \beta = \underbrace{\boldsymbol X (\boldsymbol X'\boldsymbol X)^{-1} \boldsymbol X'}_{=\boldsymbol P}\boldsymbol Y = \boldsymbol P \boldsymbol Y. The vector of residuals is \widehat{\boldsymbol u} = \begin{pmatrix} \widehat u_1 \\ \vdots \\ \widehat u_n \end{pmatrix} = \boldsymbol Y - \widehat{\boldsymbol Y} = \boldsymbol Y - \boldsymbol P \boldsymbol Y = \underbrace{(\boldsymbol I_n - \boldsymbol P)}_{= \boldsymbol M}\boldsymbol Y = \boldsymbol M \boldsymbol Y
The matrix \boldsymbol P = \boldsymbol X (\boldsymbol X'\boldsymbol X)^{-1} \boldsymbol X' is the n\times n projection matrix that projects any vector from \mathbb{R}^n into the column space spanned by the column vectors of \boldsymbol X. It is also called influence matrix or hat-matrix because it maps the vector of response values \boldsymbol Y on the vector of fitted values (hat values): \boldsymbol P \boldsymbol Y = \widehat{\boldsymbol Y}.
Its diagonal entries h_{ii} = X_i'(\boldsymbol X' \boldsymbol X)^{-1} X_i are called leverage values, which measure how far away the regressor values of the i-th observation X_i are from those of the other observations.
Properties of leverage values: 0 \leq h_{ii} \leq 1, \quad \sum_{i=1}^n h_{ii} = k.
A large h_{ii} occurs when the observation i has a big influence on the regression line, e.g., the last observation in the following dataset:
X=c(10,20,30,40,50,60,70,500)
Y=c(1000,2200,2300,4200,4900,5500,7500,10000)
plot(X,Y, main="OLS regression line with and without last observation")
abline(lm(Y~X), col="blue")
abline(lm(Y[1:7]~X[1:7]), col="red")
1 2 3 4 5 6 7 8
0.1657356 0.1569566 0.1492418 0.1425911 0.1370045 0.1324820 0.1290237 0.9869646
The wage and education data is quite balanced and has moderate leverage values:
hatvalues(fit1)
1          2          3          4          5          6          7 
0.08257919 0.06447964 0.06447964 0.05090498 0.05090498 0.06447964 0.05090498
8 9 10 11 12 13 14
0.05090498 0.08257919 0.12330317 0.08257919 0.06447964 0.12330317 0.21493213
15 16 17 18 19 20
0.06447964 0.12330317 0.08823529 0.21493213 0.12330317 0.21493213
The matrix \boldsymbol M = \boldsymbol I_n - \boldsymbol P = \boldsymbol I_n - \boldsymbol X (\boldsymbol X'\boldsymbol X)^{-1} \boldsymbol X' is the n\times n orthogonal projection matrix that projects any vector from \mathbb{R}^n into the vector space orthogonal to that spanned by \boldsymbol X. It is also called annihilator matrix or residual maker matrix because it maps the vector of response values \boldsymbol Y on the vector of residuals: \boldsymbol M \boldsymbol Y = \widehat{\boldsymbol u}.
The projection matrices \boldsymbol P and \boldsymbol M have some nice properties: (a) they are symmetric, \boldsymbol P' = \boldsymbol P and \boldsymbol M' = \boldsymbol M; (b) they are idempotent, \boldsymbol P \boldsymbol P = \boldsymbol P and \boldsymbol M \boldsymbol M = \boldsymbol M; (c) \boldsymbol P \boldsymbol X = \boldsymbol X and \boldsymbol M \boldsymbol X = \boldsymbol 0_{n \times k}.
The properties (a)-(c) follow directly from the definitions of \boldsymbol P and \boldsymbol M (try it as an exercise). Using these properties one can show that the residual vector \widehat{\boldsymbol u} is orthogonal to each of the column vectors in \boldsymbol X: \boldsymbol X'\widehat{\boldsymbol u} = \underbrace{\boldsymbol X' \boldsymbol M}_{= \boldsymbol 0_{k \times n}} \boldsymbol Y = \boldsymbol 0_{k \times 1}. \tag{9.5}
Equation 9.5 implies that the residual vector \widehat{\boldsymbol u} is orthogonal to the predicted values vector \boldsymbol{\widehat Y} since \boldsymbol X'\widehat{\boldsymbol u}=\boldsymbol 0_{k \times 1} \quad \Rightarrow\quad \underbrace{\widehat\beta'\boldsymbol X'}_{=\boldsymbol{\widehat Y}'}\widehat{\boldsymbol u}=\underbrace{\widehat\beta'\boldsymbol 0_{k \times 1}}_{=0}.
Hence, OLS produces a decomposition of the dependent variable values into the orthogonal vectors of fitted values and residuals, \boldsymbol Y= \boldsymbol{\widehat Y} + \boldsymbol{\widehat u} = \boldsymbol P \boldsymbol Y + \boldsymbol M \boldsymbol Y, with \boldsymbol{\widehat Y}'\boldsymbol{\widehat u}=0. \tag{9.6}
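These orthogonality relations are easy to check numerically. A brief sketch using the three-regressor fit fit2 from above; model.matrix() returns the X matrix that lm() used internally:

u_hat = residuals(fit2)              # residual vector
crossprod(model.matrix(fit2), u_hat) # X'u_hat: zero up to floating-point error (Equation 9.5)
sum(fitted(fit2) * u_hat)            # Y_hat'u_hat: zero up to floating-point error (Equation 9.6)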
The first and second sample moments of \boldsymbol Y = (Y_1, \ldots, Y_n)' can be expressed in terms of the sample moments of the fitted values \boldsymbol{\widehat Y} = (\widehat Y_1, \ldots, \widehat Y_n)' and the residuals \boldsymbol{\widehat u} = (\widehat u_1, \ldots, \widehat u_n)'.
The first sample moment is \overline Y = \frac{1}{n} \sum_{i=1}^n Y_i = \frac{1}{n} \bigg( \sum_{i=1}^n \widehat Y_i + \underbrace{\sum_{i=1}^n \widehat u_i}_{= 0} \bigg) = \overline{\widehat Y}, which follows from Equation 9.5 and the fact that the linear regression function contains an intercept. The reason is that, due to the intercept, the first column of \boldsymbol X consists of 1’s, and the first entry of \boldsymbol X'\widehat{\boldsymbol u} is \sum_{i=1}^n \widehat u_i.
The second sample moment is \overline{Y^2} = \frac{1}{n} \sum_{i=1}^n Y_i^2 = \frac{1}{n} \sum_{i=1}^n \widehat Y_i^2 + \frac{2}{n} \underbrace{\sum_{i=1}^n \widehat Y_i \widehat u_i}_{= \boldsymbol{\widehat Y}' \boldsymbol{\widehat u} = 0} + \frac{1}{n} \sum_{i=1}^n \widehat u_i^2 = \overline{\widehat Y^2} + \overline{\widehat u^2}, which follows from Equation 9.6.
The sample variance is \widehat \sigma_Y^2 = \overline{Y^2} - \overline Y^2 = \overline{\widehat Y^2} + \overline{\widehat u^2} - \overline{\widehat Y}^2 = \widehat \sigma_{\widehat Y}^2 + \widehat \sigma_{\widehat u}^2, where \widehat \sigma_{\widehat u}^2 = \overline{\widehat u^2} - \overline{\widehat u}^2 = \overline{\widehat u^2} since \sum_{i=1}^n \widehat u_i = 0 due to the intercept in the regression function.
We obtain the well-known variance decomposition result for OLS regressions:
Analysis-of-variance formula \begin{align*} \underset{\text{total sample variance}}{\frac{1}{n}\sum_{i=1}^n\left(Y_i-\overline{Y}\right)^2}&=&\underset{\text{explained sample variance}}{\frac{1}{n}\sum_{i=1}^n\left(\widehat{Y}_i-\overline{\widehat{Y}}\right)^2}+\underset{\text{unexplained sample variance}}{\frac{1}{n}\sum_{i=1}^n\widehat u_i^2.} \end{align*} Alternatively, we can write \widehat \sigma_Y^2 = \widehat \sigma_{\widehat Y}^2 + \widehat \sigma_{\widehat u}^2 or s_Y^2 = s_{\widehat Y}^2 + s_{\widehat u}^2.
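The decomposition can be verified numerically; a small sketch for the simple wage regression fit1 from above (logwage is the dependent variable):

TSS = sum((logwage - mean(logwage))^2)           # total sum of squares
ESS = sum((fitted(fit1) - mean(fitted(fit1)))^2) # explained sum of squares
RSS = sum(residuals(fit1)^2)                     # residual (unexplained) sum of squares
all.equal(TSS, ESS + RSS)                        # TRUE up to numerical precision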
The larger the proportion of the explained sample variance, the better the fit of the OLS regression. This motivates the definition of the so-called R^2 coefficient of determination:
Coefficient of determination
R^2= \frac{\sum_{i=1}^n (\widehat Y_i - \overline{\widehat Y})^2}{\sum_{i=1}^n (Y_i - \overline Y)^2} = 1 - \frac{\sum_{i=1}^n \widehat u_i^2}{\sum_{i=1}^n (Y_i - \overline Y)^2}
R^2 describes the proportion of sample variation in \boldsymbol Y explained by \widehat{\boldsymbol Y}. Obviously, we have that 0\leq R^2\leq 1.
If R^2 = 0, the sample variation in \widehat{\boldsymbol Y} is zero (e.g. if the regression line/surface is horizontally flat). The closer R^2 lies to 1, the better the fit of the OLS regression to the data. If R^2 = 1, the variation in \widehat{\boldsymbol u} is zero, so the OLS regression explains the entire variation in \boldsymbol Y (perfect fit).
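In R, the coefficient of determination is reported by summary(); a quick sketch confirming it against the definition, using the sums of squares computed above for fit1:

c(ESS/TSS, 1 - RSS/TSS, summary(fit1)$r.squared) # all three values coincide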
A low R^2 does not necessarily mean the regression specification is bad. It just means that there is a high share of unobserved heterogeneity in \boldsymbol Y that is not captured by the regressors \boldsymbol X. A high R^2 does not necessarily mean a good regression specification. It just means that the regression fits the sample well. Too many unnecessary regressors lead to overfitting. If k=n, we have R^2 = 1 even if none of the regressors has an actual influence on the dependent variable.
This brings us to the most commonly criticized drawback of R^2: additional regressors (relevant or not) never decrease the R^2 and typically increase it. Here is an example of this problem.
set.seed(123)
n <- 100 # Sample size
X <- runif(n, 0, 10) # Relevant X variable
X_ir <- runif(n, 5, 20) # Irrelevant X variable
extra <- rnorm(n, 0, 10) # Unobserved additional variable
Y <- 1 + 5 * X + extra # Y variable
fit1 <- lm(Y~X) # Correct OLS regression
fit2 <- lm(Y~X+X_ir) # OLS regression with X_ir
c(summary(fit1)$r.squared, summary(fit2)$r.squared)
[1] 0.6933407 0.6933606
So, R^2 increases here even though X_ir is a completely irrelevant explanatory variable.
## Simulate n-2 additional irrelevant regressors:
X_ir2 <- matrix(runif(n*(n-2), 5, 20),nrow=n)
## use 50 irrelevant regressors:
fit3 <- lm(Y~X+X_ir2[,1:50])
## use all 98 irrelevant regressors:
fit4 <- lm(Y~X+X_ir2)
c(summary(fit3)$r.squared, summary(fit4)$r.squared)
[1] 0.8353157 1.0000000
If we add 50 irrelevant regressors, the R^2 increases even further. If we add 98 irrelevant regressors (k=100=n), we obtain a perfect fit with R^2 = 1.
Possible solutions are given by penalized criteria such as the so-called adjusted R-squared or R-bar-squared:
Adjusted R-squared
\overline{R}^2 = 1 - \frac{\frac{1}{n-k}\sum_{i=1}^n \widehat u_i^2}{\frac{1}{n-1}\sum_{i=1}^n (Y_i - \overline Y)^2}
The adjustment is in terms of the degrees of freedom. We lose k degrees of freedom in the OLS regression since we have k regressors (k linear restrictions). We lose one degree of freedom in computing the sample variance due to the sample mean (one linear restriction).
The square root of the adjusted sample variance in the numerator of the adjusted R-squared formula is called the standard error of the regression (SER): SER := \sqrt{ \frac{1}{n-k}\sum_{i=1}^n \widehat u_i^2 } = \sqrt{\frac{n}{n-k} \widehat \sigma_{\widehat u}^2}.
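Both quantities are reported by summary(); a quick sketch checking them by hand for the simulated regression fit1 from above (n = 100 observations, k = 2 coefficients):

adjR2_hand = 1 - (sum(residuals(fit1)^2)/(n - 2)) / (sum((Y - mean(Y))^2)/(n - 1))
SER_hand   = sqrt(sum(residuals(fit1)^2)/(n - 2))
c(adjR2_hand, summary(fit1)$adj.r.squared) # identical values
c(SER_hand, summary(fit1)$sigma)           # summary() calls the SER "residual standard error"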
An alternative for comparing different regression specifications is the leave-one-out R-squared or R-tilde-squared, which is based on the leave-one-out cross-validation (LOOCV) principle. The i-th leave-one-out OLS coefficient is the OLS coefficient using the sample without the i-th observation: \widehat \beta_{(-i)} = \Big( \sum_{j \neq i} X_j X_j' \Big)^{-1} \sum_{j \neq i} X_j Y_j = (\boldsymbol X' \boldsymbol X - X_i X_i')^{-1} (\boldsymbol X' \boldsymbol Y - X_i Y_i) The i-th leave-one-out predicted value \widetilde Y_i = X_i'\widehat \beta_{(-i)} is an authentic prediction for Y_i since observation Y_i is not used to construct \widetilde Y_i. The leave-one-out residual is \widetilde u_i = Y_i - \widetilde Y_i and obeys the relation \widetilde u_i = \frac{\widehat u_i}{1- h_{ii}}, where h_{ii} = X_i'(\boldsymbol X'\boldsymbol X)^{-1} X_i is the i-th leverage value.
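This shortcut can be verified by actually refitting the model without an observation; a small sketch for observation i = 1 of the simulated data:

fit_loo  = lm(Y[-1] ~ X[-1])                         # refit without the first observation
beta_loo = coef(fit_loo)                             # leave-one-out coefficients
u_tilde1 = Y[1] - (beta_loo[1] + beta_loo[2] * X[1]) # leave-one-out residual via refitting
c(u_tilde1, residuals(fit1)[1] / (1 - hatvalues(fit1)[1])) # both approaches give the same value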
The leave-one-out R-squared is the R-squared using leave-one-out residuals:
Leave-one-out R-squared
\widetilde R^2 = 1 - \frac{\sum_{i=1}^n \widetilde u_i^2}{\sum_{i=1}^n (Y_i - \overline Y)^2} = 1 - \frac{\sum_{i=1}^n \frac{\widehat u_i^2}{(1- h_{ii})^2}}{\sum_{i=1}^n (Y_i - \overline Y)^2}.
Let’s compare the three R-squared versions for various numbers of included regressors.
library(tidyverse)
R2 = function(fit){
summary(fit)$r.squared
}
AdjR2 = function(fit){
summary(fit)$adj.r.squared
}
LoocvR2 = function(fit, Y){
LoocvResid = residuals(fit)/(1-hatvalues(fit))
1-sum(LoocvResid^2)/sum((Y - mean(Y))^2)
}
## use 90 irrelevant regressors:
fit5 <- lm(Y~X+X_ir2[,1:90])
## use 95 irrelevant regressors:
fit6 <- lm(Y~X+X_ir2[,1:95])
specification = c(0, 1, 50, 90, 95, 98)
Rsquared = sapply(list(fit1, fit2, fit3, fit5, fit6, fit4), R2)
adjRsquared = sapply(list(fit1, fit2, fit3, fit5, fit6, fit4), AdjR2)
LoocvRsquared = sapply(list(fit1, fit2, fit3, fit5, fit6, fit4), LoocvR2, Y=Y)
library("kableExtra") # provides kbl() for formatting the table
round(tibble("#irrelevant regressors" = specification, Rsquared, adjRsquared, LoocvRsquared), 4) %>%
  kbl(align = 'c')

| #irrelevant regressors | Rsquared | adjRsquared | LoocvRsquared |
|---|---|---|---|
| 0 | 0.6933 | 0.6902 | 0.6799 |
| 1 | 0.6934 | 0.6870 | 0.6738 |
| 50 | 0.8353 | 0.6603 | 0.2681 |
| 90 | 0.9669 | 0.5900 | -5.7026 |
| 95 | 0.9912 | 0.7103 | -30.7321 |
| 98 | 1.0000 | NaN | NaN |
For comparing different OLS regression specifications, the leave-one-out R-squared should be preferred.
The illustrative simulation above shows that the problem of overfitting increases as the number of regressors increases. If k=n we obtain a perfect fit.
par(mfrow=c(1,2))
## k=n=2
Y = c(0.7,-1.0)
X = c(1.9,0.8)
fit1 = lm(Y~X)
plot(X,Y, main="OLS with k=n=2")
abline(fit1)
## k=n=3
# Some given data
Y = c(0.7,-1.0,-0.2)
X_2 = c(1.9,0.8,1.25)
X_3 = c(66, 62, 59)
fit2 = lm(Y ~ X_2 + X_3)
plot3d <- scatterplot3d(x = X_2, y = X_3, z = Y,
angle = 33, scale.y = 0.8, pch = 16,
color ="red",
xlab = "X_2",
ylab = "X_3",
main ="OLS with k=n=3")
plot3d$plane3d(fit2, lty.box = "solid", col=gray(.5), draw_polygon=TRUE)
If the number of regressors and observations k=n is larger, we can no longer visualize the OLS fit, but the problem still exists: we have Y_i=\widehat Y_i for all i=1, \ldots, n.
For k>n, we cannot compute the OLS regression because \sum_{i=1}^n X_i X_i' = \boldsymbol X' \boldsymbol X is not invertible.
Regression problems with k \approx n or k>n are called high-dimensional regressions. In this case, regularization techniques such as Ridge, Lasso, Elastic-net, or dimension-reduction techniques such as principal components or partial least squares should be used.
OLS should be considered for regression problems with k \ll n (small k and large n).
The only condition for computing the OLS coefficients is that \sum_{i=1}^n X_i X_i' = \boldsymbol X' \boldsymbol X is invertible.
A necessary condition is that k \leq n, as discussed above.
Another reason the matrix may not be invertible is that two or more regressors are perfectly collinear. Two variables are perfectly collinear if their sample correlation is 1 or -1. More generally, we have perfect multicollinearity if one regressor is a linear combination of the other regressors.
This can happen when the same regressor is included twice or in two different units (e.g., GDP in EUR and USD).
Suppose we include a variable that has the same value for all observations (e.g., a dummy variable where all individuals in the sample belong to the same group). In this case, the variable is collinear with the intercept variable. Including too many dummy variables can lead to the dummy variable trap (see below).
The regressor variables are strictly multicollinear (or perfectly multicollinear) if the regressor matrix does not have full column rank: \text{rank}(\boldsymbol X) < k. It implies \text{rank}(\boldsymbol X' \boldsymbol X) < k, so that the matrix is singular and \widehat \beta cannot be computed.
A related situation is near multicollinearity (often just called multicollinearity). It occurs when two columns of \boldsymbol X have a sample correlation very close to 1 or -1. Then, (\boldsymbol X' \boldsymbol X) is “near singular”: its smallest eigenvalue is very close to zero, the entries of (\boldsymbol X' \boldsymbol X)^{-1} become very large, and numerical problems arise.
Multicollinearity means that at least one regressor is redundant and can be dropped.
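A small sketch with hypothetical data illustrates near multicollinearity: the second regressor is almost identical to the first, so X'X is nearly singular and the coefficient estimates become unstable:

set.seed(1)
x1 = rnorm(50)                     # hypothetical regressor
x2 = x1 + rnorm(50, sd = 0.001)    # nearly identical to x1
y  = 1 + x1 + rnorm(50)
cor(x1, x2)                        # sample correlation very close to 1
kappa(crossprod(cbind(1, x1, x2))) # huge condition number: X'X is near singular
coef(lm(y ~ x1 + x2))              # x1 and x2 coefficients are typically large and offsetting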
In the following, we consider a simple linear regression model that aims to predict wages in 2015 using gender as the only predictor. We use data provided in the accompanying materials of Stock and Watson’s Introduction to Econometrics textbook. You can download the data stored as an xlsx-file cps_ch3.xlsx HERE.
Let us first prepare the dataset:
## load the 'tidyverse' package
library("tidyverse")
## load the 'readxl' package
library("readxl")
## import the data into R
cps = read_excel(path = "cps_ch3.xlsx")
## Data wrangling
cps_2015 = cps %>% mutate(
wage = ahe15, # rename "ahe15" as "wage"
gender = fct_recode( # rename factor "a_sex" as "gender"
as_factor(a_sex),
"male" = "1", # rename factor level "1" to "male"
"female" = "2" # rename factor level "2" to "female"
)
) %>%
filter(year == 2015) %>% # Only data from year 2015
select(wage, gender) # Select only the variables "wage" and "gender"

The first six lines of the dataset look as follows:
| wage | gender |
|---|---|
| 25.64103 | female |
| 16.02564 | female |
| 33.65817 | male |
| 23.07692 | female |
| 12.98077 | female |
| 11.53846 | male |
Computing the estimation results:
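A sketch of the estimation call, consistent with the object name lm_obj used below and the cps_2015 data prepared above:

lm_obj = lm(wage ~ gender, data = cps_2015)
lm_obj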
gives the coefficient estimates \widehat\beta_1 = 28.055 (intercept) and \widehat\beta_2 = -5.016 (genderfemale).
To compute these estimation results, one must assign a numeric 0/1 coding to the factor levels male and female of the factor variable gender. To see the numeric values used by R, one can take a look at model.matrix(lm_obj):
X = model.matrix(lm_obj) # this is the internally used X-matrix
X[1:6,]
  (Intercept) genderfemale
1           1            1
2           1            1
3           1            0
4           1            1
5           1            1
6           1            0
cps_2015$gender[1:6]
[1] female female male   female female male  
Levels: male female
Thus, R internally codes female subjects by 1 and male subjects by 0, such that
\widehat\beta_1 + \widehat\beta_2 X_{i,gender}=\left\{\begin{array}{ll}\widehat\beta_1&\text{ if }X_{i,gender}=\text{male}\\\widehat\beta_1 + \widehat\beta_2&\text{ if }X_{i,gender}=\text{female}\end{array} \right.
Interpretation: the fitted value for men is \widehat\beta_1 = 28.055, which equals the sample mean wage of the male observations; the fitted value for women is \widehat\beta_1 + \widehat\beta_2 = 23.039, the sample mean wage of the female observations. The coefficient \widehat\beta_2 = -5.016 is therefore the difference in sample mean wages between women and men.
Above, we used R’s internal handling of factor variables, which you should always do. However, if you construct a dummy variable for each of the levels of a factor, you may fall into the dummy variable trap:
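One way to construct the dependent variable and the dummy variables (the names Y, X_1, X_female, and X_male are those used in the code below) from cps_2015:

Y        = cps_2015$wage                            # dependent variable
X_1      = rep(1, length(Y))                        # intercept: a column of ones
X_female = as.numeric(cps_2015$gender == "female")  # dummy: 1 if female, 0 otherwise
X_male   = as.numeric(cps_2015$gender == "male")    # dummy: 1 if male, 0 otherwise
X        = cbind(X_1, X_female, X_male)             # intercept plus BOTH gender dummies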
Computing the OLS coefficients “by hand” yields
Error in solve.default(X %*% t(X)) : 
  system is exactly singular

An error message! We fell into the dummy variable trap!
The estimation result is not computable since (\boldsymbol X' \boldsymbol X) is not invertible due to the perfect multicollinearity between the intercept and the two dummy variables X_female and X_male:
X_1 = X_female + X_male,
which violates the invertibility condition (no perfect multicollinearity).
Solution: Use one dummy variable fewer than the number of factor levels; i.e., in this example, drop either X_female or X_male.
## New model matrix after dropping X_male:
X <- cbind(X_1, X_female)
## Computing the estimator
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% Y
beta_hat
              [,1]
X_1      28.05536
X_female -5.01638
This gives the same result as computed by R’s lm() function using the factor variable gender:
f(X_i) = 28.055 -5.016 X_{i,female}.
An intercept is not necessarily required for the OLS method. In this case, the first column of the regressor matrix \boldsymbol X does not contain 1’s. An alternative way to avoid the dummy variable trap is to drop the intercept:
## Write -1 to drop the intercept
lm_obj2 = lm(Y~X_female+X_male-1)
coefficients(lm_obj2)
X_female   X_male 
23.03898 28.05536 
The fitted regression function is f(X_i) = 23.039 X_{i,female} + 28.055 X_{i,male}.
Notice the different interpretations of the OLS coefficients!