6  Confidence Intervals


6.1 Estimation uncertainty

An estimator approximates an unknown population parameter by a single real number, called a point estimate. The sample mean of the variable wage from the dataset of the previous section is 17.06; it is a point estimate for the population mean of wage. The sample mean is unbiased and consistent, but the point estimate alone does not tell us how accurate the estimate is for a given sample size n.

A point estimate does not indicate the uncertainty inherent in the estimation process. Consistency results, such as the law of large numbers, state that as the sample size n approaches infinity, the estimate becomes increasingly accurate, i.e., the uncertainty of the estimate diminishes for large n. In practice, however, we are always faced with finite sample sizes and must understand the estimation uncertainty for a fixed sample size n.

We already learned that the MSE of the sample mean is mse(\overline Y) = \frac{\sigma^2}{n}, where 0 < Var[Y] = \sigma^2 < \infty. A quantity with better interpretability than the MSE is its square root, analogous to taking the standard deviation instead of the variance. The root mean squared error (RMSE) of an estimator \widehat \theta for \theta is rmse(\widehat \theta) = \sqrt{mse(\widehat \theta)} = \sqrt{E[(\widehat \theta - \theta)^2]}. The RMSE measures how much an estimate differs on average from the true parameter value for a given sample size n. The RMSE of the sample mean is rmse(\overline Y) = \frac{\sigma}{\sqrt n}. Since the RMSE is proportional to 1/\sqrt n, we say that the sample mean has the rate of convergence \sqrt n. We have \lim_{n \to \infty} \sqrt n \cdot rmse(\overline Y) = \sigma.

Rate of convergence

An estimator \widehat \theta with \lim_{n \to \infty} mse(\widehat \theta) = 0 has convergence rate \sqrt n if 0 < \lim_{n \to \infty} \Big( \sqrt n \cdot rmse(\widehat \theta) \Big) < \infty. More generally, the rate of convergence is g(n) if 0 < \lim_{n \to \infty} \Big( g(n) \cdot rmse(\widehat \theta) \Big) < \infty.

The rate \sqrt n is the standard convergence rate and holds, under mild conditions, for most estimators used in practice. If the rate of convergence is \sqrt n, we say that the estimator has a parametric convergence rate. There are exceptions where estimators converge more slowly or more quickly (nonparametric estimators, bootstrap, cointegration, long-memory time series).

The rate of convergence gives a first indication of how fast the estimation uncertainty decreases as we gather more observations. Consider the \sqrt n rate of the sample mean. To halve the average deviation of the estimate from the true parameter value, we need to increase the sample size by a factor of 4, since \sqrt 4 = 2. To reduce the RMSE by a factor of 4, we already need 16 times as many observations.
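This scaling is easy to verify by simulation. The following Monte Carlo sketch (the normal population and the values of mu, sigma, and the sample sizes are arbitrary illustration choices, not taken from the data above) estimates rmse(\overline Y) for n and 4n:

## Monte Carlo estimate of rmse(Ybar): quadrupling n halves it
set.seed(1)
mu = 2; sigma = 3
rmse = function(n, R = 10000) {
  est = replicate(R, mean(rnorm(n, mean = mu, sd = sigma)))
  sqrt(mean((est - mu)^2))
}
rmse(100)   # close to sigma/sqrt(100) = 0.30
rmse(400)   # close to sigma/sqrt(400) = 0.15, i.e., half as large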

6.2 Interval estimates

The convergence rate indicates the relative estimation uncertainty, i.e., how much more accurate an estimate gets if we increase the sample size by a certain factor. However, it does not offer a way to quantify the uncertainty precisely.

One of the most common methods of incorporating estimation uncertainty into estimation results is through interval estimates, often referred to as confidence intervals. A confidence interval defines a range of values within which the true parameter is expected to fall, with a specified coverage probability, denoted as 1-\alpha.

More precisely, if \theta is the parameter of interest and \widehat \theta is a point estimator for \theta, a symmetric confidence interval I with coverage probability 1-\alpha can be expressed as I_{1-\alpha} = [\widehat \theta - c_{1-\alpha}; \ \widehat \theta + c_{1-\alpha}] with the property that P(\theta \in I_{1-\alpha}) = 1-\alpha. \tag{6.1} Common coverage probabilities are 0.95, 0.99, and 0.90.

To derive the value c_{1-\alpha} for a given sample size, we need to solve Equation 6.1 for c_{1-\alpha}. Note that the value c_{1-\alpha} will depend on the distribution of \widehat \theta. Also, note that \widehat \theta is a consistent estimator with a sampling variance that converges to 0. It is useful to consider the standardized estimator Z = \frac{\widehat \theta - E[\widehat \theta]}{sd(\widehat \theta)}, which satisfies E[Z] = 0 and Var[Z] = 1 for any sample size n.

Let’s reformulate Equation 6.1 with respect to Z. For simplicity, let’s focus on the case where \widehat \theta is unbiased, i.e., E[\widehat \theta] = \theta. The interval event can be rearranged as \begin{align*} &\phantom{\Leftrightarrow} \quad \theta \in I_{1-\alpha} \\ &\Leftrightarrow \quad \widehat \theta - c_{1-\alpha} \leq \theta \leq \widehat \theta + c_{1-\alpha} \\ &\Leftrightarrow \quad - c_{1-\alpha} \leq \theta - \widehat \theta \leq c_{1-\alpha} \\ &\Leftrightarrow \quad c_{1-\alpha} \geq \widehat \theta - \theta \geq - c_{1-\alpha} \\ &\Leftrightarrow \quad \frac{c_{1-\alpha}}{sd(\widehat \theta)} \geq Z \geq - \frac{c_{1-\alpha}}{sd(\widehat \theta)} \\ \end{align*} Hence, Equation 6.1 becomes P\bigg(\frac{- c_{1-\alpha}}{sd(\widehat \theta)} \leq Z \leq \frac{c_{1-\alpha}}{sd(\widehat \theta)}\bigg) = 1-\alpha. \tag{6.2} The next step would be to apply the CDF of Z to solve for c_{1-\alpha}. Suppose, for instance, \widehat \theta has a normal distribution. Then, Z is standard normal and has the CDF \Phi and quantile function \Phi^{-1}, which implies that the equation above becomes \begin{align*} 1-\alpha &= \Phi\bigg( \frac{c_{1-\alpha}}{sd(\widehat \theta)} \bigg) - \Phi\bigg( \frac{-c_{1-\alpha}}{sd(\widehat \theta)} \bigg) \\ &= \Phi\bigg( \frac{c_{1-\alpha}}{sd(\widehat \theta)} \bigg) - \bigg( 1 - \Phi\bigg( \frac{c_{1-\alpha}}{sd(\widehat \theta)} \bigg) \bigg) \\ &= 2\Phi\bigg( \frac{c_{1-\alpha}}{sd(\widehat \theta)} \bigg) - 1, \end{align*} which is equivalent to \begin{align*} &\phantom{\Leftrightarrow}& \quad \frac{2-\alpha}{2} &= \Phi\bigg( \frac{c_{1-\alpha}}{sd(\widehat \theta)} \bigg) \\ &\Leftrightarrow& \quad \Phi^{-1}\bigg(1 - \frac{\alpha}{2} \bigg) &= \frac{c_{1-\alpha}}{sd(\widehat \theta)} \\ &\Leftrightarrow& \quad z_{(1-\frac{\alpha}{2})} \cdot sd(\widehat \theta) &= c_{1-\alpha}, \end{align*} where z_{(1-\frac{\alpha}{2})} is the (1-\frac{\alpha}{2})-quantile of \mathcal N(0,1). The confidence interval is I_{1-\alpha} = [\widehat \theta - z_{(1-\frac{\alpha}{2})} \cdot sd(\widehat \theta); \ \widehat \theta + z_{(1-\frac{\alpha}{2})} \cdot sd(\widehat \theta)].

Standard normal quantiles can be obtained using the R command qnorm or by using statistical tables:

Quantiles of the standard normal distribution
p      0.9    0.95   0.975   0.99   0.995
z_p    1.28   1.64   1.96    2.33   2.58
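The tabulated values can be reproduced with qnorm:

qnorm(c(0.9, 0.95, 0.975, 0.99, 0.995))
[1] 1.281552 1.644854 1.959964 2.326348 2.575829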

Therefore, 90%, 95%, and 99% confidence intervals for \theta are given by \begin{align*} I_{0.9} &= [\widehat \theta - 1.64 \cdot sd(\widehat \theta); \ \widehat \theta + 1.64 \cdot sd(\widehat \theta)] \\ I_{0.95} &= [\widehat \theta - 1.96 \cdot sd(\widehat \theta); \ \widehat \theta + 1.96 \cdot sd(\widehat \theta)] \\ I_{0.99} &= [\widehat \theta - 2.58 \cdot sd(\widehat \theta); \ \widehat \theta + 2.58 \cdot sd(\widehat \theta)] \end{align*}

In the case of the sample mean \widehat \theta = \overline Y as an estimator for the population mean \theta = \mu, we have sd(\widehat \theta) = \sigma/\sqrt n in the i.i.d. sampling case and sd(\widehat \theta) = \omega/\sqrt n in the case of a stationary short-memory time series.

In any case, Equation 6.1 is satisfied, so the true parameter lies inside the confidence interval with probability 1-\alpha, and outside it with probability \alpha. The more certain we want to be that the true parameter is in the interval, the smaller we have to choose \alpha and the larger the interval becomes. If we chose \alpha = 0, the interval would have to be infinitely wide, which is not informative. A certain amount of uncertainty always remains; we control it by choosing the value of \alpha.

Notice that we made two restrictive assumptions in the derivations above. First, we assumed that the estimator is unbiased. The assumption is, in fact, unproblematic if the estimator is asymptotically unbiased. Then, instead of Equation 6.1, the confidence interval is only asymptotically valid such that \lim_{n \to \infty} P( \theta \in I_{1-\alpha}) = 1-\alpha. \tag{6.3} If Equation 6.3 is satisfied, we say that I_{1-\alpha} is an asymptotic confidence interval for \theta.

Our second restrictive assumption is that \widehat \theta follows a normal distribution for any given sample size n. As we will see in the next section, this assumption is less restrictive than one might think. Many estimators are asymptotically normal under general conditions (for instance, maximum likelihood estimators), so the distribution of \widehat \theta approaches a normal distribution as the sample size increases. Therefore, Equation 6.3 is satisfied in many cases even under non-normality.

6.3 Central limit theorem

Consider the sample mean \widehat \theta = \overline Y as an estimator for the population mean \theta = \mu, which is unbiased with E[\overline Y] = \mu.

If the sample is i.i.d., Var[\overline Y] = \sigma^2/n. Under the additional assumption that the sample \{Y_1, \ldots, Y_n\} is \mathcal N(\mu, \sigma^2) distributed, it follows that the sample mean is also normal since it is a linear combination of the sample values. Therefore, \overline Y = \frac{1}{n} \sum_{i=1}^n Y_i \sim\mathcal N\Big(\mu, \frac{\sigma^2}{n}\Big). For this case, the 1-\alpha confidence interval is I_{1-\alpha} = \Big[\overline Y - z_{(1-\frac{\alpha}{2})} \cdot \frac{\sigma}{\sqrt n}; \ \overline Y + z_{(1-\frac{\alpha}{2})} \cdot \frac{\sigma}{\sqrt n}\Big]. For stationary short-memory time series, the approximate distribution is \mathcal N(\mu, \omega^2/n), and we have I_{1-\alpha} = \Big[\overline Y - z_{(1-\frac{\alpha}{2})} \cdot \frac{\omega}{\sqrt n}; \ \overline Y + z_{(1-\frac{\alpha}{2})} \cdot \frac{\omega}{\sqrt n}\Big] \tag{6.4} instead.
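The coverage property in Equation 6.1 can be illustrated by simulation. Below is a minimal sketch for the known-\sigma interval, assuming i.i.d. normal data; the values of mu, sigma, n, and the number of replications are arbitrary choices:

## empirical coverage of the known-sigma 95% interval
set.seed(1)
mu = 5; sigma = 2; n = 25
covered = replicate(10000, {
  Y = rnorm(n, mean = mu, sd = sigma)
  cw = qnorm(0.975) * sigma/sqrt(n)   # half-width of the interval
  (mean(Y) - cw <= mu) & (mu <= mean(Y) + cw)
})
mean(covered)   # should be close to 0.95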

Fortunately, the central limit theorem tells us that we can drop the normality assumption and still obtain a valid asymptotic confidence interval in the sense of Equation 6.3.

Convergence in distribution

A statistic S_n converges in distribution to the random variable S if \lim_{n \to \infty} P(S_n \leq c) = P(S\leq c) for all c \in \mathbb R at which the distribution function F(c) = P(S \leq c) is continuous. We write S_n \overset{D}{\rightarrow} S.

If S has the distribution \mathcal N(\mu, \sigma^2), we write S_n \overset{D}{\rightarrow} \mathcal N(\mu, \sigma^2).

To formulate the central limit theorem, consider the standardized sample mean in the i.i.d. case, Z_n = \frac{\overline Y - E[\overline Y]}{sd(\overline Y)} = \frac{\overline Y - \mu}{\sigma/\sqrt n} = \frac{\sqrt n (\overline Y - \mu)}{\sigma} where \sigma can be replaced by \omega in the stationary short-memory time series case.

Central Limit Theorem (CLT)

  1. Let \{Y_1, \ldots, Y_n\} be an i.i.d. sample with E[Y_i] = \mu and 0 < Var[Y_i] = \sigma^2 < \infty. Then, the sample mean satisfies \frac{\sqrt n (\overline Y - \mu)}{\sigma} \overset{D}{\longrightarrow} \mathcal N(0,1), or, equivalently, \sqrt n (\overline Y - \mu) \overset{D}{\longrightarrow} \mathcal N(0,\sigma^2).

  2. Let \{Y_1, \ldots, Y_n\} be a stationary short-memory time series with mean \mu and long-run variance 0 < \omega^2 < \infty. Moreover, let E[Y_t^4] < \infty and let Y_t and Y_{t-\tau} become independent as \tau gets large. Then, \frac{\sqrt n (\overline Y - \mu)}{\omega} \overset{D}{\longrightarrow} \mathcal N(0,1), or, equivalently, \sqrt n (\overline Y - \mu) \overset{D}{\longrightarrow} \mathcal N(0,\omega^2).

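The central limit theorem is easy to visualize by simulation. The following sketch (exponential data with rate 1, so that \mu = 1 and \sigma = 1; the choices of n and the number of replications are arbitrary) plots the distribution of the standardized sample mean of a clearly non-normal sample against the standard normal density:

## CLT illustration with skewed (exponential) data
set.seed(1)
n = 200
Z = replicate(10000, sqrt(n) * (mean(rexp(n, rate = 1)) - 1) / 1)
hist(Z, breaks = 50, freq = FALSE, main = "Standardized sample means")
curve(dnorm(x), add = TRUE, lwd = 2)   # standard normal density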

Note that for an asymptotic confidence interval (Equation 6.3), Equation 6.2 becomes \lim_{n \to \infty} P\bigg(\frac{- c_{1-\alpha}}{sd(\widehat \theta)} \leq Z_n \leq \frac{c_{1-\alpha}}{sd(\widehat \theta)}\bigg) = 1-\alpha, and the CLT implies that Z_n \overset{D}{\longrightarrow} \mathcal N(0,1). Therefore, all derivations from the previous subsection to obtain c_{1-\alpha} are still valid, and I_{1-\alpha} = \Big[\overline Y - z_{(1-\frac{\alpha}{2})} \cdot sd(\overline Y); \ \overline Y + z_{(1-\frac{\alpha}{2})} \cdot sd(\overline Y)\Big] \tag{6.5} is an asymptotic confidence interval for \mu.

The condition of “becoming independent as \tau gets large” in the CLT is a weak dependence condition. Note that the random variables Y_t and Y_{t-\tau} are independent if P(Y_t \leq a, Y_{t-\tau}\leq b) - P(Y_t\leq a) P(Y_{t-\tau} \leq b) = 0 \tag{6.6} for all a and b. Intuitively, weak dependence means that the left-hand side of Equation 6.6 might be nonzero but must converge to 0 as \tau tends to infinity. That is, the amount of dependence must decline as \tau grows.
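To see what declining dependence looks like, the following sketch simulates an AR(1) series, a stationary short-memory process (the coefficient 0.7 and the sample size are arbitrary choices), and plots its sample autocorrelations, which decay geometrically in \tau:

## weak dependence: autocorrelations of an AR(1) series die out
set.seed(1)
y = arima.sim(model = list(ar = 0.7), n = 1000)
acf(y, lag.max = 20)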

6.4 Standard errors

The standard deviation of the sample mean and the confidence intervals depend on the unknown parameter \sigma^2 (i.i.d. case) or \omega^2 (short-memory time series case).

To obtain feasible confidence intervals, we have to estimate the unknown parameters; that is, we need an estimator for the standard deviation of the sample mean.

Standard error

A standard error se(\widehat \theta) for an estimator \widehat \theta is an estimator for the estimator’s standard deviation sd(\widehat \theta) = \sqrt{Var[\widehat \theta]}. The standard error is called consistent if \frac{se(\widehat \theta)}{sd(\widehat \theta)} \overset{p}{\to} 1.
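For the sample mean, the ratio se(\overline Y)/sd(\overline Y) reduces to s_Y/\sigma, so consistency of the classical standard error follows from the consistency of s_Y. A quick sketch (i.i.d. normal data and the chosen sample sizes are arbitrary):

## ratio of standard error to true standard deviation approaches 1
set.seed(1)
sigma = 2
for (n in c(10, 100, 10000)) {
  Y = rnorm(n, sd = sigma)
  cat("n =", n, " se/sd =", sd(Y)/sigma, "\n")
}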

For the i.i.d. case, we can replace \sigma^2 by the sample variance \widehat \sigma_Y^2 or the bias-corrected sample variance s_Y^2. The classical standard error for the sample mean is se(\overline Y) = \frac{s_Y}{\sqrt n}.

## Classical standard error
se = sd(wg)/sqrt(length(wg))
se
[1] 1.049272
t.test(wg)$stderr
[1] 1.049272
## 95% confidence interval
I = mean(wg) + c(-qnorm(0.975)*se, +qnorm(0.975)*se)
I
[1] 14.99907 19.11213

For short-memory stationary time series, classical standard errors are not valid. We need an estimator for the long-run variance \omega^2 = \gamma(0) + 2 \sum_{\tau=1}^\infty \gamma(\tau). A commonly applied estimator, developed by Whitney K. Newey and Kenneth D. West in 1987, is \widehat \omega^2_{nw} = \widehat \gamma(0) + 2 \sum_{\tau=1}^{\ell_n-1} \frac{\ell_n - \tau}{\ell_n} \widehat \gamma(\tau), where \widehat \gamma(\tau) is the sample autocovariance function and \ell_n is a data-dependent truncation parameter. The corresponding autocorrelation-robust standard error is se_{nw}(\overline Y) = \frac{\widehat \omega_{nw}}{\sqrt n}. In R, you can compute this robust standard error with the following commands:

library(sandwich)
## Robust standard error
seNW = sqrt(NeweyWest(lm(gdp~1)))
seNW
            (Intercept)
(Intercept)   0.4988509
## Robust confidence interval
mean(gdp) + c(-qnorm(0.975)*seNW, +qnorm(0.975)*seNW)
[1] 1.899616 3.855075
## Classical confidence intervals are too small:
se = sd(gdp)/sqrt(length(gdp))
mean(gdp) + c(-qnorm(0.975)*se, +qnorm(0.975)*se)
[1] 2.422138 3.332553
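To connect the Newey-West formula to the code, here is a hand-rolled sketch of \widehat \omega^2_{nw} based on the sample autocovariances. The truncation lag l is fixed ad hoc here (a hypothetical choice), whereas NeweyWest() selects \ell_n data-dependently and applies further finite-sample adjustments, so the numbers will not match exactly:

## Newey-West long-run variance by hand (fixed truncation lag)
nw_se = function(y, l) {
  n = length(y)
  g = acf(y, type = "covariance", lag.max = l - 1, plot = FALSE)$acf[, 1, 1]
  w2 = g[1] + 2 * sum((l - 1:(l - 1)) / l * g[-1])   # Bartlett weights
  sqrt(w2 / n)
}
nw_se(gdp, l = 5)   # compare with seNW above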

Confidence interval

Let \widehat \theta be a consistent estimator for the parameter \theta with mse(\widehat \theta) \to 0, and let the standardized estimator satisfy \frac{\widehat \theta - E[\widehat \theta]}{sd(\widehat \theta)} \overset{D}{\rightarrow} \mathcal N(0,1). Moreover, let se(\widehat \theta) be a consistent standard error for \widehat \theta. Then, I_{1-\alpha} = \Big[\widehat \theta - z_{(1-\frac{\alpha}{2})} \cdot se(\widehat \theta); \ \widehat \theta + z_{(1-\frac{\alpha}{2})} \cdot se(\widehat \theta)\Big] is an asymptotic 1-\alpha confidence interval for \theta.
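A minimal helper implementing the boxed interval (the function name ci is a hypothetical choice, not a built-in R function), applied to the wage data wg from above:

## generic asymptotic confidence interval
ci = function(est, se, level = 0.95) {
  z = qnorm(1 - (1 - level)/2)
  c(lower = est - z * se, upper = est + z * se)
}
ci(mean(wg), sd(wg)/sqrt(length(wg)))   # reproduces the interval from above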

6.5 Exact confidence intervals under normality

Consider again the sample mean \overline Y of an i.i.d. sample \{Y_1, \ldots, Y_n\} from some distribution with mean \mu and variance 0 < \sigma^2 < \infty.

The CLT implies that the sample mean is asymptotically normal, which theoretically justifies that Equation 6.5 is an asymptotic confidence interval for \mu. In practice, Equation 6.5 is not feasible because \sigma and thus sd(\overline Y) are unknown.

We can use classical standard errors, replacing \sigma with s_Y, and still get an asymptotically valid confidence interval: \lim_{n \to \infty} P\bigg( \mu \in \Big[\overline Y - z_{(1-\frac{\alpha}{2})} \cdot \frac{s_Y}{\sqrt n}; \ \overline Y + z_{(1-\frac{\alpha}{2})} \cdot \frac{s_Y}{\sqrt n}\Big] \bigg) = 1-\alpha. \tag{6.7} The approximation is quite accurate for large samples, but for small samples the confidence interval may be imprecise. Therefore, it may be helpful to see if we can derive an exact confidence interval I_{1-\alpha}^* such that P(\mu \in I_{1-\alpha}^*) = 1-\alpha for any small sample size n.

Under the restrictive assumption that the population distribution is normal, i.e., Y_i \sim \mathcal N(\mu, \sigma^2) for all i=1, \ldots, n, the sample mean is also normal since it is a weighted average of normal variables: \overline Y = \frac{1}{n} \sum_{i=1}^n Y_i \sim \mathcal N\Big(\mu, \frac{\sigma^2}{n}\Big). \tag{6.8} In this case, the infeasible confidence interval in Equation 6.5 is indeed an exact confidence interval for \mu. However, the feasible counterpart in Equation 6.7 is still only asymptotically valid, even if the underlying sample is normal.

Fortunately, the additional layer of uncertainty introduced by replacing \sigma with its estimator s_Y can be precisely quantified. Under the same conditions as in Equation 6.8, the scaled bias-corrected sample variance has a \chi^2-distribution and is independent of \overline Y: \frac{(n-1)s_Y^2}{\sigma^2} \sim \chi^2_{n-1}. Consequently, standardizing with s_Y/\sqrt n instead of \sigma/\sqrt n yields a t-distributed statistic: \frac{\overline Y - \mu}{s_Y/\sqrt n} = \frac{\overline Y - \mu}{\sigma/\sqrt n} \cdot \frac{\sigma}{s_Y} \sim \frac{\mathcal N(0,1)}{\sqrt{\chi_{n-1}^2/(n-1)}} = t_{n-1}. Therefore, to obtain a feasible exact confidence interval for \mu, we replace the standard normal quantile z_{(1-\frac{\alpha}{2})} by the t-quantile t_{(n-1;1-\frac{\alpha}{2})} with n-1 degrees of freedom: P\bigg( \mu \in \Big[\overline Y - t_{(n-1;1-\frac{\alpha}{2})} \cdot \frac{s_Y}{\sqrt n}; \ \overline Y + t_{(n-1;1-\frac{\alpha}{2})} \cdot \frac{s_Y}{\sqrt n}\Big] \bigg) = 1-\alpha. \tag{6.9}
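A small simulation illustrates the difference between Equation 6.7 and Equation 6.9 in small samples (i.i.d. normal data; the choices of n, mu, and the number of replications are arbitrary): the normal-quantile interval undercovers, while the t-interval attains its nominal level:

## small-sample coverage: normal quantile vs. t quantile
set.seed(1)
n = 10; mu = 0
cover = replicate(10000, {
  Y = rnorm(n, mean = mu)
  se = sd(Y)/sqrt(n)
  c(z = abs(mean(Y) - mu) <= qnorm(0.975) * se,
    t = abs(mean(Y) - mu) <= qt(0.975, n - 1) * se)
})
rowMeans(cover)   # z-interval covers less than 0.95; t-interval is exact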

Student’s t-distribution quantiles
df 0.9 0.95 0.975 0.99 0.995
1 3.08 6.31 12.71 31.82 63.66
2 1.89 2.92 4.30 6.96 9.92
3 1.64 2.35 3.18 4.54 5.84
4 1.53 2.13 2.78 3.75 4.60
5 1.48 2.02 2.57 3.36 4.03
6 1.44 1.94 2.45 3.14 3.71
8 1.40 1.86 2.31 2.90 3.36
10 1.37 1.81 2.23 2.76 3.17
15 1.34 1.75 2.13 2.60 2.95
20 1.33 1.72 2.09 2.53 2.85
25 1.32 1.71 2.06 2.49 2.79
30 1.31 1.70 2.04 2.46 2.75
40 1.30 1.68 2.02 2.42 2.70
50 1.30 1.68 2.01 2.40 2.68
60 1.30 1.67 2.00 2.39 2.66
80 1.29 1.66 1.99 2.37 2.64
100 1.29 1.66 1.98 2.36 2.63
\to \infty 1.28 1.64 1.96 2.33 2.58

Note that Equation 6.9 is only valid if \{Y_1, \ldots, Y_n\} is normally distributed. Even if \{Y_1, \ldots, Y_n\} is non-normal, we still have \lim_{n \to \infty} P\bigg( \mu \in \Big[\overline Y - t_{(n-1;1-\frac{\alpha}{2})} \cdot \frac{s_Y}{\sqrt n}; \ \overline Y + t_{(n-1;1-\frac{\alpha}{2})} \cdot \frac{s_Y}{\sqrt n}\Big] \bigg) = 1-\alpha. \tag{6.10}

Hence, Equation 6.7 and Equation 6.10 yield asymptotic confidence intervals for the mean for any distribution with finite variance, but only the latter is an exact confidence interval if the sample is normal.

In statistical software packages, Equation 6.10 is typically implemented. It is also a little more conservative than Equation 6.7 since t_{(n-1;1-\frac{\alpha}{2})} > z_{(1-\frac{\alpha}{2})}.

## confidence interval with normal quantiles
n = length(wg)
mean(wg) + c(-qnorm(0.975),+qnorm(0.975))*sd(wg)/sqrt(n)
[1] 14.99907 19.11213
## confidence interval with t-quantiles
mean(wg) + c(-qt(0.975,n-1),+qt(0.975,n-1))*sd(wg)/sqrt(n)
[1] 14.97362 19.13758
## built-in confidence interval using the t-test function
t.test(wg, conf.level = 0.95)$conf.int
[1] 14.97362 19.13758
attr(,"conf.level")
[1] 0.95

Since the CDF of a t-distribution with n-1 degrees of freedom converges to the CDF of \mathcal N(0,1), the confidence intervals in Equation 6.7 and Equation 6.10 are close to each other for large samples.
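This convergence is easy to check numerically:

## t-quantiles approach the normal quantile as df grows
qt(0.975, df = c(10, 30, 100, 1000))
[1] 2.228139 2.042272 1.983972 1.962339
qnorm(0.975)
[1] 1.959964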

6.6 Additional reading

  • Stock and Watson (2019), Section 3
  • Hansen (2022a), Section 14

6.7 R-codes

statistics-sec6.R