11  Classical Inference

In this section, we study confidence intervals and hypothesis tests for linear regression coefficients under normality and homoskedasticity.

Of course, normality and homoskedasticity are restrictive assumptions. It is nevertheless helpful to consider this special case first because it yields exact inferential methods, i.e., exact confidence intervals and exact size-\alpha tests for any fixed sample size n. For more general settings, the inferential methods can easily be adapted using large-n approximations. We will discuss this in the final section.

Recall that the normal linear regression model is defined by the following assumptions:

Normal Linear Regression Assumptions

The variables Y_i and X_i = (1, X_{i2}, \ldots, X_{ik})' satisfy the linear regression equation Y_i = X_i' \beta + u_i, \quad i=1, \ldots, n, \tag{11.1} which has the matrix representation \boldsymbol Y = \boldsymbol X \beta + \boldsymbol u.

For each i=1, \ldots, n we assume

  • (A1) conditional mean independence: E[u_i \mid X_i] = 0

  • (A2) random sampling: (Y_i, X_i') are i.i.d. draws from their joint population distribution

  • (A3) large outliers unlikely: 0 < E[Y_i^4] < \infty, 0 < E[X_{ij}^4] < \infty for all j=1, \ldots, k

  • (A4) no perfect multicollinearity: \boldsymbol X has full column rank

  • (A5) homoskedasticity: Var[u_i \mid X_i] = \sigma^2

  • (A6) normal errors: u_i \mid X_i \sim \mathcal N(0, \sigma^2)

Useful results from the previous section:

  1. \widehat \beta = \beta + (\boldsymbol X' \boldsymbol X)^{-1} \boldsymbol X' \boldsymbol u
  2. E[\widehat \beta \mid \boldsymbol X] = \beta under (A1)–(A4)
  3. \boldsymbol D = Var[\boldsymbol u \mid \boldsymbol X] = diag(\sigma_1^2, \ldots , \sigma_n^2) under (A1)–(A4)
  4. Var[\widehat \beta \mid \boldsymbol X] = (\boldsymbol X' \boldsymbol X)^{-1} (\boldsymbol X' \boldsymbol D \boldsymbol X) (\boldsymbol X' \boldsymbol X)^{-1} under (A1)–(A4)
  5. \boldsymbol D = Var[\boldsymbol u \mid \boldsymbol X] = \sigma^2 \boldsymbol I_n under (A1)–(A5)
  6. Var[\widehat \beta \mid \boldsymbol X] = \sigma^2 (\boldsymbol X' \boldsymbol X)^{-1} under (A1)–(A5)
  7. \widehat \beta\mid \boldsymbol X is normally distributed under (A1)–(A6)

11.1 Standardized OLS coefficients

Under (A1)–(A6) we have \widehat \beta\mid \boldsymbol X \sim \mathcal N(\beta, \sigma^2 (\boldsymbol X' \boldsymbol X)^{-1}), which is a k-variate random variable. For each component j=1, \ldots, k, we have \widehat \beta_j \mid \boldsymbol X \sim \mathcal N(\beta_j, \sigma^2 [(\boldsymbol X' \boldsymbol X)^{-1}]_{jj}), which is a univariate random variable. Note that [(\boldsymbol X' \boldsymbol X)^{-1}]_{jj} is the j-th diagonal element of the k \times k matrix (\boldsymbol X' \boldsymbol X)^{-1}.

In Section 6.2 we studied how to obtain a confidence interval for normally distributed estimators.

Analogously to the sample mean case, let’s standardize our estimator using its conditional standard deviation: Z_j := \frac{\widehat \beta_j - \beta_j}{sd(\widehat \beta_j \mid \boldsymbol X)} = \frac{\widehat \beta_j - \beta_j}{\sqrt{\sigma^2 [(\boldsymbol X' \boldsymbol X)^{-1}]_{jj}}}. It is standard normal: Z_j \mid \boldsymbol X \sim \mathcal N(0,1), \quad Z_j \sim \mathcal N(0,1). Conditional on \boldsymbol X, Z_j is standard normal because \widehat \beta_j \mid \boldsymbol X is normal; the unconditional distribution of Z_j is standard normal as well because the conditional distribution is the same for every \boldsymbol X. This is a consequence of standardizing conditionally on \boldsymbol X.
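
As a quick sanity check, here is a minimal simulation sketch (all object names are chosen for this illustration): for a fixed design, new normal errors are drawn repeatedly, and the standardized slope coefficient should behave like a standard normal variable.

## Monte Carlo sketch: the standardized coefficient Z_2 is N(0,1)
set.seed(123)
n <- 50; sigma <- 2
X <- cbind(1, rnorm(n))                  # fixed design: constant and one regressor
beta <- c(1, 0.5)
V22 <- sigma^2 * solve(t(X) %*% X)[2, 2] # conditional variance of the slope estimator
Z <- replicate(5000, {
  y <- drop(X %*% beta) + rnorm(n, sd = sigma)  # normal, homoskedastic errors
  b <- solve(t(X) %*% X, t(X) %*% y)            # OLS coefficients
  (b[2] - beta[2]) / sqrt(V22)                  # standardized slope Z_2
})
c(mean(Z), sd(Z))                        # should be close to 0 and 1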

11.2 Exact confidence intervals

The fact that Z_j is standard normal directly implies that the infeasible confidence interval derived in Section 6.2 can be used (set \widehat \theta = \widehat \beta_j and \theta = \beta_j): I_{1-\alpha} = \big[ \widehat \beta_j - z_{(1-\frac{\alpha}{2})} \cdot sd(\hat \beta_j \mid \boldsymbol X); \ \widehat \beta_j + z_{(1-\frac{\alpha}{2})} \cdot sd(\hat \beta_j \mid \boldsymbol X) \big] The interval satisfies P(\beta_j \in I_{1-\alpha}) = 1-\alpha for any fixed sample size n.

The interval is infeasible since sd(\widehat \beta_j \mid \boldsymbol X) = \sqrt{\sigma^2 [(\boldsymbol X' \boldsymbol X)^{-1}]_{jj}} is unknown. The error variance \sigma^2 = Var[u_i \mid X_i] must be estimated.

An unbiased and consistent estimator for \sigma^2 is s^2 = \frac{1}{n-k} \sum_{i=1}^n \widehat u_i^2. We correct for the k degrees of freedom we lose because the residuals \widehat u_i = Y_i - X_i'\widehat \beta depend on the k estimated coefficients. Similarly to the sample variance case in Section 6.5, it satisfies \frac{(n-k)s^2}{\sigma^2} \sim \chi^2_{n-k}.

Its square root s is the standard error of the regression, s = SER = \sqrt{\frac{1}{n-k} \sum_{i=1}^n \widehat u_i^2}, also called the residual standard error. The classical standard error is se(\widehat \beta_j) = \sqrt{s^2 [(\boldsymbol X' \boldsymbol X)^{-1}]_{jj}}, and the t-ratio for the true coefficient \beta_j is T = \frac{\widehat \beta_j - \beta_j}{se(\widehat \beta_j)} \sim t_{n-k}. Then, following Section 6.5, a feasible and exact (1-\alpha)-confidence interval for \beta_j is I_{1-\alpha} = \big[ \widehat \beta_j - t_{(n-k; 1-\frac{\alpha}{2})} \cdot se(\widehat \beta_j); \ \widehat \beta_j + t_{(n-k; 1-\frac{\alpha}{2})} \cdot se(\widehat \beta_j) \big]
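
To see the pieces at work, here is a small sketch on simulated data (all names are chosen for this example) that builds the classical 95% interval by hand and compares it with R’s confint():

## Manual classical 95% confidence interval in a simulated normal regression model
set.seed(1)
n <- 100; k <- 3
X <- cbind(1, rnorm(n), rnorm(n))          # design: constant plus two regressors
beta <- c(1, 2, 0.5)
y <- drop(X %*% beta) + rnorm(n)           # normal, homoskedastic errors (sigma = 1)
fit_sim <- lm(y ~ X - 1)                   # "- 1" because X already contains the constant
s2 <- sum(residuals(fit_sim)^2) / (n - k)  # unbiased estimate of sigma^2
se2 <- sqrt(s2 * solve(t(X) %*% X)[2, 2])  # classical standard error of the 2nd coefficient
tcrit <- qt(1 - 0.05 / 2, df = n - k)      # t-quantile t_{(n-k; 0.975)}
coef(fit_sim)[2] + c(-1, 1) * tcrit * se2  # manual interval
confint(fit_sim)[2, ]                      # should coincide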

Instead of s^2, we could also use the sample variance \widehat \sigma_{\widehat u}^2 (divided by n instead of n-k) as an estimate of \sigma^2. Note that s^2 is unbiased and \widehat \sigma_{\widehat u}^2 is biased, but \widehat \sigma_{\widehat u}^2 has a lower variance than s^2 (bias-variance tradeoff). We use s^2 in practice because otherwise the exact t-distribution would not hold.

11.3 Exact t-tests

The t-ratio for the hypothesis H_0: \beta_j = \beta_j^0 is T_0 = \frac{\widehat \beta_j - \beta_j^0}{se(\widehat \beta_j)}. Under H_0, it is t_{n-k}-distributed. Hence, the two-sided t-test for H_0 is given by the test decision \begin{align*} \text{do not reject} \ H_0 \quad \text{if} \ &|T_0| \leq t_{(n-k;1-\frac{\alpha}{2})}, \\ \text{reject} \ H_0 \quad \text{if} \ &|T_0| > t_{(n-k;1-\frac{\alpha}{2})}. \end{align*}

Let F_{0} be the CDF of the null-distribution t_{n-k}, i.e. F_{0}(t_{(n-k;p)}) = p. The p-value for the two-sided t-test is p\text{-value} = 2(1-F_{0}(|T_0|)). The test decision can be equivalently reached by the rule \begin{align*} \text{reject} \ H_0 \quad &\text{if p-value} < \alpha \\ \text{do not reject} \ H_0 \quad &\text{if p-value} \geq \alpha \end{align*}
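
In R, the p-value and the critical value follow from the t CDF pt() and the quantile function qt(); a short sketch with made-up numbers for the t-ratio and the degrees of freedom:

## Two-sided p-value and critical value for an illustrative t-ratio (numbers made up)
T0 <- 2.1; df <- 30; alpha <- 0.05
2 * (1 - pt(abs(T0), df = df))   # p-value
qt(1 - alpha / 2, df = df)       # critical value t_{(30; 0.975)}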

The most commonly tested hypotheses are H_0: \beta_j = 0 vs. H_1: \beta_j \neq 0 for j=1, \ldots, k, which is the t-test for individual significance of the j-th variable.

11.4 Regression outputs

Let’s consider the dataset CPSdata from the statisticsdata package and regress log-wages on education, experience, squared experience, and a randomly generated variable.

library(statisticsdata)
set.seed(42)
randomdata = rnorm(dim(CPSdata)[1]) ## generate a random regressor
fit = lm(
  log(wage) ~ education + experience + I(experience^2) + randomdata, 
  data = CPSdata)

The lm-summary output

The summary output of R shows a typical regression output table. It consists of

  1. the OLS coefficients \widehat \beta_j in column 1,
  2. the classical standard errors se(\widehat \beta_j) in column 2,
  3. the t-ratios for the hypothesis H_0: \beta_j = 0 in column 3, computed as T_0 = \widehat \beta_j/se(\widehat \beta_j),
  4. the p-values for the two-sided t-tests H_0: \beta_j = 0 vs. H_1: \beta_j \neq 0 in column 4.
summary(fit)

Call:
lm(formula = log(wage) ~ education + experience + I(experience^2) + 
    randomdata, data = CPSdata)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.0782  -0.3134   0.0095   0.3391   2.8193 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      0.9371641  0.0164574  56.945   <2e-16 ***
education        0.1125763  0.0009752 115.436   <2e-16 ***
experience       0.0362323  0.0008074  44.874   <2e-16 ***
I(experience^2) -0.0005772  0.0000167 -34.556   <2e-16 ***
randomdata      -0.0004240  0.0026086  -0.163    0.871    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5905 on 50737 degrees of freedom
Multiple R-squared:  0.2368,    Adjusted R-squared:  0.2367 
F-statistic:  3936 on 4 and 50737 DF,  p-value: < 2.2e-16

It also returns the residual standard error s = SER, the degrees of freedom n-k, the R-squared and the adjusted R-squared, and the overall F-statistic along with the p-value of the overall F-test (we will discuss the F-test below).
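
The t-ratios and p-values in columns 3 and 4 can be reproduced directly from the estimates and standard errors; a short sketch:

## Reproduce the 't value' and 'Pr(>|t|)' columns of the summary table by hand
ct   <- coef(summary(fit))                        # estimate, std. error, t value, p-value
T0   <- ct[, "Estimate"] / ct[, "Std. Error"]     # t-ratios for H_0: beta_j = 0
pval <- 2 * (1 - pt(abs(T0), df = fit$df.residual))
cbind(T0, pval)                                   # should match columns 3 and 4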

Regression outputs in economic journals

Beautiful LaTeX and HTML output tables can be produced with the stargazer() function from the stargazer package:

library(stargazer)
stargazer(fit, header=FALSE, type='html')
                       Dependent variable: log(wage)
-----------------------------------------------------
education                      0.113***
                               (0.001)
experience                     0.036***
                               (0.001)
I(experience^2)               -0.001***
                               (0.00002)
randomdata                    -0.0004
                               (0.003)
Constant                       0.937***
                               (0.016)
-----------------------------------------------------
Observations                   50,742
R^2                            0.237
Adjusted R^2                   0.237
Residual Std. Error            0.590 (df = 50737)
F Statistic                    3,935.634*** (df = 4; 50737)
-----------------------------------------------------
Note: *p<0.1; **p<0.05; ***p<0.01

This is the most common style of regression output found in econometric journals. The constant is reported last. T-ratios and p-values are usually not shown because they can be computed from the estimate and standard error if needed (T_0 = \widehat \beta_j/se(\widehat \beta_j)).

The test decision for the two-sided t-test for H_0: \beta_j = 0 (test for individual significance) can be reached by inspecting the number of stars shown. Three stars \phantom{}^{***} indicate that the t-test rejects at \alpha=0.01, two stars \phantom{}^{**} indicate a rejection at \alpha=0.05 but not at \alpha=0.01, and one star \phantom{}^{*} indicates a rejection at \alpha=0.1 but not at \alpha=0.05.

As expected, H_0 cannot be rejected for the variable randomdata, since it was generated independently of log(wage). All other variables have three stars. We say that these variables are highly significant, i.e. they show a strong statistical association with log-wages conditional on the other regressors.

Note that summary uses a different style: three stars for \alpha = 0.001, two stars for \alpha = 0.01, one star for \alpha = 0.05, and a dot for \alpha = 0.1 (always check the table notes).

Stargazer can display different regression outputs in a nice side-by-side view:

fit2 = lm(log(wage) ~ education + experience + I(experience^2), data=CPSdata)
fit3 = lm(log(wage) ~ education, data=CPSdata)
fit4 = lm(log(wage) ~ 1, data=CPSdata)
stargazer(fit, fit2, fit3, fit4, header=FALSE, type='html')
                                    Dependent variable: log(wage)
                         (1)              (2)              (3)              (4)
--------------------------------------------------------------------------------
education              0.113***         0.113***         0.108***
                       (0.001)          (0.001)          (0.001)
experience             0.036***         0.036***
                       (0.001)          (0.001)
I(experience^2)       -0.001***        -0.001***
                       (0.00002)        (0.00002)
randomdata            -0.0004
                       (0.003)
Constant               0.937***         0.937***         1.440***         2.946***
                       (0.016)          (0.016)          (0.014)          (0.003)
--------------------------------------------------------------------------------
Observations           50,742           50,742           50,742           50,742
R^2                    0.237            0.237            0.193            0.000
Adjusted R^2           0.237            0.237            0.193            0.000
Residual Std. Error    0.590            0.590            0.607            0.676
                       (df = 50737)     (df = 50738)     (df = 50740)     (df = 50741)
F Statistic            3,935.634***     5,247.604***     12,134.720***
                       (df = 4; 50737)  (df = 3; 50738)  (df = 1; 50740)
--------------------------------------------------------------------------------
Note: *p<0.1; **p<0.05; ***p<0.01

To customize the table to your needs, stargazer offers many options (have a look at the documentation by typing ?stargazer).

11.5 Multiple testing

Consider the usual two-sided t-tests for the hypotheses H_0: \beta_1 = 0 (test1) and H_0: \beta_2 = 0 (test2).

Each test on its own is a valid hypothesis test of size \alpha. However, applying these tests one after the other leads to a multiple testing problem: the probability of falsely rejecting the joint hypothesis H_0: \beta_1 = 0 \ \text{and} \ \beta_2 = 0 \quad \text{vs.} \quad H_1: \text{not} \ H_0 exceeds the nominal level \alpha. “Not H_0” means “\beta_1 \neq 0 \ \text{or} \ \beta_2 \neq 0 \ \text{or both}”.

To see this, suppose that, for simplicity, the t-statistics \widehat \beta_1/se(\widehat \beta_1) and \widehat \beta_2/se(\widehat \beta_2) are independent, which implies that the test decisions of the two tests are independent. \begin{align*} &P(\text{both tests do not reject} \mid H_0 \ \text{is true}) \\ &=P(\{\text{test1 does not reject}\} \cap \{\text{test2 does not reject}\} \mid H_0 \ \text{is true}) \\ &= P(\text{test1 does not reject} \mid H_0) \cdot P(\text{test2 does not reject} \mid H_0 ) \\ &= (1-\alpha)^2 = \alpha^2 - 2\alpha + 1 \end{align*}

The size of the combined test is larger than \alpha: \begin{align*} &P(\text{at least one test rejects} \mid H_0 \ \text{is true}) \\ &= 1 - P(\text{both tests do not reject} \mid H_0 \ \text{is true}) \\ &= 1 - (\alpha^2 - 2\alpha + 1) = 2\alpha - \alpha^2 = \alpha(2-\alpha) > \alpha \end{align*}

If the two test statistics are dependent, the probability that at least one of the tests falsely rejects depends on their correlation and will generally still exceed \alpha.

Therefore, another rejection rule must be found for repeated t-tests.
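
To get a sense of the magnitude, the size of the naive rule “reject if any individual t-test rejects” under independence is 1-(1-\alpha)^m for m tests:

## Probability of at least one false rejection among m independent size-0.05 tests
alpha <- 0.05
m <- 1:10
round(1 - (1 - alpha)^m, 3)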

11.6 Exact F-tests

Suppose we want to test that the last q regression coefficients equal zero. We have q conditions to test:

\begin{align*} H_0: \begin{pmatrix} \beta_{k-q+1} \\ \vdots \\ \beta_k \end{pmatrix} = \begin{pmatrix} 0 \\ \vdots \\ 0 \end{pmatrix} \end{align*}

For instance, in our first regression we may want to test whether all coefficients except the constant are zero (q=4), i.e. H_0: \beta_2 = \beta_3 = \beta_4 = \beta_5 = 0.

\widehat{\boldsymbol u}' \widehat{\boldsymbol u} = \sum_{i=1}^n \widehat u_i^2 is the residual sum of squares when using all regressors (unrestricted LS estimation).

sum(residuals(fit)^2)
[1] 17690.81

Let \tilde{\boldsymbol u}' \tilde{\boldsymbol u} be the residual sum of squares when using only the first k-q regressors (restricted LS estimation). In our case, we use only the constant.

sum(residuals(fit4)^2)
[1] 23179.86

The F-statistic is \begin{align*} F = \frac{ ( \tilde{\boldsymbol u}' \tilde{\boldsymbol u} - \widehat{\boldsymbol u}' \widehat{\boldsymbol u})/q }{( \widehat{\boldsymbol u}' \widehat{\boldsymbol u})/(n-k)} \end{align*}

n = length(fit$fitted.values)
k = fit$rank
q = fit$rank - fit4$rank
## n, k, and q
c(n,k,q)
[1] 50742     5     4
numerator = (sum(residuals(fit4)^2) -  sum(residuals(fit)^2))/q
denominator = sum(residuals(fit)^2)/(n-k)
## F-statistic
numerator/denominator
[1] 3935.634

Note that the number coincides with the F-statistic in the regression output. We call this F-statistic the overall F-statistic or the F-statistic for overall significance because we compare our model with the intercept-only model.

The null-distribution of the F-statistic in the normal regression model is the F-distribution with q degrees of freedom in the numerator and n-k degrees of freedom in the denominator: F \sim \frac{\chi^2_q/q}{\chi^2_{n-k}/(n-k)} = F_{q,n-k}

F-test decision rule:

Reject H_0 if the F-statistic exceeds the (1-\alpha) quantile of the F_{q;n-k} distribution.

0.95-quantiles of the F_{m,r}-distribution (columns: numerator degrees of freedom m; rows: denominator degrees of freedom r)
r \ m      1      2      3      4      5      6      7      8
1 161.5 199.5 215.7 224.6 230.2 234.0 236.8 238.9
2 18.51 19.00 19.16 19.25 19.30 19.33 19.35 19.37
3 10.13 9.55 9.28 9.12 9.01 8.94 8.89 8.85
4 7.71 6.94 6.59 6.39 6.26 6.16 6.09 6.04
5 6.61 5.79 5.41 5.19 5.05 4.95 4.88 4.82
6 5.99 5.14 4.76 4.53 4.39 4.28 4.21 4.15
8 5.32 4.46 4.07 3.84 3.69 3.58 3.50 3.44
10 4.96 4.10 3.71 3.48 3.33 3.22 3.14 3.07
15 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64
20 4.35 3.49 3.10 2.87 2.71 2.60 2.51 2.45
25 4.24 3.39 2.99 2.76 2.60 2.49 2.40 2.34
30 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27
40 4.08 3.23 2.84 2.61 2.45 2.34 2.25 2.18
50 4.03 3.18 2.79 2.56 2.40 2.29 2.20 2.13
70 3.98 3.13 2.74 2.50 2.35 2.23 2.14 2.07
100 3.94 3.09 2.70 2.46 2.31 2.19 2.10 2.03
1000 3.85 3.00 2.61 2.38 2.22 2.11 2.02 1.95
\to \infty 3.84 3.00 2.60 2.37 2.21 2.10 2.01 1.94
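
Using the quantities n, k, q, and the F-statistic computed above, the critical value and p-value follow from qf() and pf(); the built-in comparison of the restricted and unrestricted fits via anova() should yield the same test:

## Critical value and p-value of the overall F-test
Fstat <- numerator / denominator
qf(0.95, df1 = q, df2 = n - k)       # 5% critical value of F_{q, n-k}
1 - pf(Fstat, df1 = q, df2 = n - k)  # p-value
anova(fit4, fit)                     # same F-test via R's model comparison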

The F-test also allows for a general hypothesis form H_0: \boldsymbol R' \beta = \boldsymbol c, where \boldsymbol R is a k \times q matrix with \text{rank}(\boldsymbol R) = q and \boldsymbol c is a q \times 1 vector. Then, the F-statistic has the alternative direct representation \begin{align*} F = \frac{1}{q} (\boldsymbol R' \widehat \beta - \boldsymbol c)' (\boldsymbol R' \widehat{\boldsymbol V} \boldsymbol R)^{-1} (\boldsymbol R' \widehat \beta - \boldsymbol c), \end{align*} where \widehat{\boldsymbol V} = s^2(\boldsymbol X' \boldsymbol X)^{-1} is the classical covariance matrix estimator.

Consider the linear regression with k=3: \begin{align*} Y_i = \beta_1 + \beta_2 X_{i2} + \beta_3 X_{i3} + u_i \end{align*}

Example with q=2: The hypothesis H_0 : (\beta_2 = 0 and \beta_3 = 0) is translated into H_0: \boldsymbol R' \beta = \boldsymbol c with \boldsymbol R = \begin{pmatrix} 0 & 0 \\ 1 & 0 \\ 0 & 1 \end{pmatrix} , \quad \boldsymbol c = \begin{pmatrix} 0 \\ 0 \end{pmatrix}

Example with q=1: The hypothesis H_0: \beta_2 + \beta_3 = 1 is translated into H_0: \boldsymbol R' \beta = \boldsymbol c with \boldsymbol R = \begin{pmatrix} 0 \\ 1 \\ 1 \end{pmatrix} , \quad \boldsymbol c = \begin{pmatrix} 1 \end{pmatrix}
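
Using the representation above, the F-statistic for the hypothesis tested below with linearHypothesis() (H_0: the education coefficient equals 0.11 and the randomdata coefficient equals 0) can also be computed by hand. A sketch for our fitted model fit, with \boldsymbol R and \boldsymbol c set up manually:

## Direct computation of the F-statistic for H_0: R'beta = c
betahat <- coef(fit)
Vhat    <- vcov(fit)                 # classical estimator s^2 (X'X)^{-1}
R <- matrix(0, nrow = 5, ncol = 2)
R[2, 1] <- 1                         # selects the education coefficient
R[5, 2] <- 1                         # selects the randomdata coefficient
c0 <- c(0.11, 0)
q  <- ncol(R)                        # number of restrictions (here 2)
rdiff <- t(R) %*% betahat - c0
Fstat <- drop(t(rdiff) %*% solve(t(R) %*% Vhat %*% R) %*% rdiff) / q
Fstat                                # should match the F value reported by linearHypothesis()
1 - pf(Fstat, df1 = q, df2 = fit$df.residual)  # p-value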

The car package provides the linearHypothesis function, which performs such F-tests directly:

library(car)
linearHypothesis(fit, c("education = 0.11", "randomdata = 0"))
Linear hypothesis test

Hypothesis:
education = 0.11
randomdata = 0

Model 1: restricted model
Model 2: log(wage) ~ education + experience + I(experience^2) + randomdata

  Res.Df   RSS Df Sum of Sq      F Pr(>F)  
1  50739 17693                             
2  50737 17691  2    2.4433 3.5036 0.0301 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The hypothesis H_0: \beta_2 = 0.11, \beta_5 = 0 cannot be rejected at the 0.01 significance level, but it can be rejected at the 0.05 significance level.

11.7 Additional reading

  • Stock and Watson (2019), Sections 5, 7, 18, 19
  • Hansen (2022b), Section 5
  • Davidson and MacKinnon (2004), Sections 4 and 5

11.8 R-codes

statistics-sec11.R