Linear Models - Inference

We first describe the properties of the OLS estimator \(\hat{\boldsymbol{\beta}}\) and the corresponding residuals \(\hat{\boldsymbol{\varepsilon} }\). Then we introduce sums of squares, \(R\)-squared, hypothesis testing, and confidence intervals. All these methods assume normality of the error terms \(\varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)\) unless otherwise specified.

Coefficients \(\boldsymbol{\beta}\)

Unbiasedness

The OLS estimators are unbiased since

\[\begin{split}\begin{align} \operatorname{E}\left( \hat{\boldsymbol{\beta} } \right) &= \operatorname{E}\left( (\boldsymbol{X} ^\top \boldsymbol{X} ) ^{-1} \boldsymbol{X}^\top \boldsymbol{y} \right) \\ &= (\boldsymbol{X} ^\top \boldsymbol{X} ) ^{-1} \boldsymbol{X} ^\top \operatorname{E}\left( \boldsymbol{y} \right) \\ &= (\boldsymbol{X} ^\top \boldsymbol{X} ) ^{-1} \boldsymbol{X} ^\top \boldsymbol{X} \boldsymbol{\beta} \\ &= \boldsymbol{\beta} \end{align}\end{split}\]

When \(p=2\) (simple linear regression with an intercept), the slope estimator is

\[\begin{split}\begin{align} \hat{\beta}_{1} &=\frac{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}} \\ \end{align}\end{split}\]

To prove unbiasedness, we use the fact that for any constant \(c\),

\[ \sum_i (x_i - \bar{x})c = 0 \]

Then, applying this fact with \(c = \bar{y}\) and substituting \(y_{i}=\beta_{0}+\beta_{1} x_{i}+u_{i}\), the numerator becomes

\[\begin{split}\begin{align} \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right) &=\sum\left(x_{i}-\bar{x}\right)\left(\beta_{0}+\beta_{1} x_{i}+u_{i}\right) \\ &=\sum\left(x_{i}-\bar{x}\right) \beta_{0}+\sum\left(x_{i}-\bar{x}\right) \beta_{1} x_{i} +\sum\left(x_{i}-\bar{x}\right) u_{i} \\ &=\beta_{0} \sum\left(x_{i}-\bar{x}\right)+\beta_{1} \sum\left(x_{i}-\bar{x}\right) x_{i} +\sum\left(x_{i}-\bar{x}\right) u_{i} \\ &=\beta_{1} \sum\left(x_{i}-\bar{x}\right)^2 +\sum\left(x_{i}-\bar{x}\right) u_{i} \\ \end{align}\end{split}\]

Hence

\[ \begin{equation} \hat{\beta}_{1}=\beta_{1}+\frac{\sum\left(x_{i}-\bar{x}\right) u_{i}}{\sum \left(x_{i}-\bar{x}\right)^{2}} \end{equation} \]

and, since \(\operatorname{E}\left( u_i \right)=0\) and the \(x_i\) are treated as fixed,

\[ \operatorname{E}\left( \hat{\beta}_1 \right) = \beta_1 \]
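This can also be illustrated by a short Monte Carlo simulation. The sketch below uses made-up true values \(\beta_0 = 1\), \(\beta_1 = 2\), \(\sigma = 1\) (assumptions for illustration only) and averages \(\hat{\beta}_1\) over repeated samples.

import numpy as np

rng = np.random.default_rng(0)
n, n_rep = 50, 5000
beta0, beta1, sigma = 1.0, 2.0, 1.0      # made-up true values for the simulation
x = rng.uniform(0, 10, n)                # fixed design across replications

estimates = np.empty(n_rep)
for r in range(n_rep):
    u = rng.normal(0, sigma, n)
    y = beta0 + beta1 * x + u
    estimates[r] = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)

print("average of beta1_hat over replications:", estimates.mean())   # close to 2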

Variance

The variance (covariance matrix) of the coefficients is

\[\begin{split}\begin{align} \operatorname{Var}\left( \hat{\boldsymbol{\beta}} \right) &= \operatorname{Var}\left( (\boldsymbol{X} ^\top \boldsymbol{X} ) ^{-1} \boldsymbol{X}^\top \boldsymbol{y} \right) \\ &= (\boldsymbol{X} ^\top \boldsymbol{X} ) ^{-1} \boldsymbol{X}^\top \operatorname{Var}\left( \boldsymbol{y} \right) \boldsymbol{X} (\boldsymbol{X} ^\top \boldsymbol{X} ) ^{-1} \\ &= \sigma^2 (\boldsymbol{X} ^\top \boldsymbol{X} ) ^{-1}\\ \end{align}\end{split}\]

What is the \((j,j)\)-th entry \(\operatorname{Var}\left( \hat{\beta}_j \right)\)?

More specifically, for the \(j\)-th coefficient estimator \(\hat{\beta}_j\), its variance is,

\[\begin{split}\begin{align} \operatorname{Var}\left( \hat{\beta}_j \right) &= \sigma^2 \left[ (\boldsymbol{X} ^\top \boldsymbol{X} )^{-1} \right]_{[j,j]} \\ &= \sigma^2 \frac{1}{1- R^2_{j}} \frac{1}{\sum_i (x_{ij} - \bar{x}_j)^2} \\ &= \sigma^2 \frac{TSS_j}{RSS_j} \frac{1}{TSS_j} \\ &= \sigma^2 \frac{1}{\sum_i(\hat{x}_{ij} - x_{ij})^2} \\ \end{align}\end{split}\]

where \(R_j^2\), \(RSS_j\), \(TSS_j\), and \(\hat{x}_{ij}\) are the corresponding quantities obtained when we regress \(X_j\) on all other explanatory variables.

Note that the \(R^2\) obtained when regressing \(X_1\) on a constant intercept alone is 0, so we recover the particular result for simple linear regression.
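The identity \(\operatorname{Var}(\hat{\beta}_j) = \sigma^2 / RSS_j\) can be checked numerically. A minimal sketch with made-up data and \(\sigma^2 = 1\): it compares the \((j,j)\) entry of \(\sigma^2 (\boldsymbol{X}^\top\boldsymbol{X})^{-1}\) with \(\sigma^2\) divided by the residual sum of squares from regressing the \(j\)-th column on the remaining columns.

import numpy as np

rng = np.random.default_rng(1)
n, sigma2 = 100, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
j = 1                                          # look at the coefficient on column 1

# direct formula: sigma^2 [(X^T X)^{-1}]_{jj}
direct = sigma2 * np.linalg.inv(X.T @ X)[j, j]

# auxiliary regression of X_j on the remaining columns
X_other = np.delete(X, j, axis=1)
g = np.linalg.solve(X_other.T @ X_other, X_other.T @ X[:, j])
u_hat = X[:, j] - X_other @ g                  # auxiliary residuals
aux = sigma2 / np.sum(u_hat**2)                # sigma^2 / RSS_j

print(direct, aux)                             # the two values agree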

More generally, for the heteroskedastic case, let \(\hat{u}_i = x_{ij} - \hat{x}_{ij}\) be the residuals from that auxiliary regression, so that \(RSS_j = \sum_i \hat{u}_i^2\). Then

\[\begin{split}\begin{aligned} \operatorname{Var}\left( \hat{\beta}_j \right) &= \operatorname{Var}\left( \frac{\sum \hat{u}_{i}y_i}{\sum \hat{u}_{i}^2} \right)\\ &= \frac{\sum \operatorname{Var} \left( \hat{u}_{i} y_i \right)}{RSS_j^2}\\ &= \frac{\sum \hat{u}_{i}^2 \operatorname{Var}\left( \varepsilon_i \right)}{RSS_j^2}\\ \end{aligned}\end{split}\]

For derivation, see partialling out.

In particular, if homoskedasticity is assumed, \(\operatorname{Var}\left( \varepsilon_i \right) = \sigma^2\), then this reduces to the formula above. In general, if heteroskedasticity exists but the errors have zero mean \(\operatorname{E}\left( \varepsilon_i \right)=0\), one can estimate \(\operatorname{Var}\left( \varepsilon_i \right)\) by \(\hat{\varepsilon}_i^2\) and obtain a heteroskedasticity-robust standard error.
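This plug-in can be computed directly. Below is a minimal numpy sketch with simulated heteroskedastic data and made-up true coefficients: it forms the standard HC0-type sandwich covariance \((\boldsymbol{X}^\top\boldsymbol{X})^{-1}\boldsymbol{X}^\top \operatorname{diag}(\hat{\varepsilon}_i^2)\boldsymbol{X}(\boldsymbol{X}^\top\boldsymbol{X})^{-1}\), whose \((j,j)\) entry equals the partialled-out expression above.

import numpy as np

rng = np.random.default_rng(2)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta = np.array([1.0, 2.0, -1.0])                   # made-up true coefficients
eps = rng.normal(size=n) * (0.5 + np.abs(X[:, 1]))  # heteroskedastic errors
y = X @ beta + eps

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b                                       # residuals; estimate Var(eps_i) by e_i^2

# HC0 sandwich estimator: (X'X)^{-1} X' diag(e^2) X (X'X)^{-1}
V_robust = XtX_inv @ (X.T * e**2) @ X @ XtX_inv
print("robust s.e.:", np.sqrt(np.diag(V_robust)))

# the (j,j) entry agrees with the partialled-out form above
j = 1
X_other = np.delete(X, j, axis=1)
u = X[:, j] - X_other @ np.linalg.solve(X_other.T @ X_other, X_other.T @ X[:, j])
print("robust s.e. of beta_1:", np.sqrt(np.sum(u**2 * e**2) / np.sum(u**2)**2))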

When \(p=2\), the inverse \((\boldsymbol{X} ^\top \boldsymbol{X} )^{-1}\) is

\[\begin{split} \begin{array}{c} \left(\boldsymbol{X} ^\top \boldsymbol{X} \right)^{-1} =\frac{1}{\sum_{i=1}^{n} \left(x_{i}-\bar{x}\right)^{2}}\left[\begin{array}{cc} \bar{x^2} & - \bar{x} \\ - \bar{x} & 1 \end{array}\right] \end{array} \end{split}\]

the variance of \(\hat{\beta}_1\) is

\[\begin{split}\begin{align} \operatorname{Var}\left( \hat{\beta}_1 \right) &= \operatorname{Var}\left( \beta_{1}+\frac{\sum\left(x_{i}-\bar{x}\right) u_{i}}{\sum \left(x_{i}-\bar{x}\right)^{2}} \right)\\ &= \frac{\operatorname{Var}\left( \sum\left(x_{i}-\bar{x}\right) u_{i} \right)}{\left[ \sum \left(x_{i}-\bar{x}\right)^{2} \right]^2}\\ &= \frac{\sum\left(x_{i}-\bar{x}\right)^2 \operatorname{Var}\left( u_{i} \right)}{\left[ \sum \left(x_{i}-\bar{x}\right)^{2} \right]^2}\\ &= \sigma^2 \frac{\sum\left(x_{i}-\bar{x}\right)^2 }{\left[ \sum \left(x_{i}-\bar{x}\right)^{2} \right]^2}\\ &= \frac{\sigma^2}{\sum_{i=1}^n \left(x_{i}-\bar{x}\right)^{2}}\\ \end{align}\end{split}\]

We conclude that

  • The larger the error variance, \(\sigma^2\), the larger the variance of the coefficient estimates.

  • The larger the variability in the \(x_i\), the smaller the variance.

  • A larger sample size should decrease the variance.

  • In multiple regression, reducing the relation between \(X_j\) and the other covariates (e.g. by an orthogonal design) decreases \(R^2_{j}\), and hence decreases the variance.

A problem is that the error variance \(\sigma^2\) is unknown. In practice, we can estimate \(\sigma^2\) by its unbiased estimator \(\hat{\sigma}^2=\frac{\sum_i \hat{\varepsilon}_i^2}{n-2}\) (to be shown [link]), and substitute it into \(\operatorname{Var}\left( \hat{\beta}_1 \right)\). Since the error variance \(\hat{\sigma}^2\) is estimated, the slope variance \(\operatorname{Var}\left( \hat{\beta}_1 \right)\) is estimated too, and hence its square root is called the standard error of \(\hat{\beta}_1\), instead of the standard deviation.

\[\begin{split}\begin{align} \operatorname{se}\left(\hat{\beta}_{1}\right) &= \sqrt{\widehat{\operatorname{Var}}\left( \hat{\beta}_1 \right)}\\ &= \frac{\hat{\sigma}}{\sqrt{\sum \left(x_{i}-\bar{x}\right)^{2}}} \end{align}\end{split}\]
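For concreteness, a minimal sketch with simulated data (made-up true values) computes \(\hat{\sigma}\) and \(\operatorname{se}(\hat{\beta}_1)\), and checks the simple-regression formula against the corresponding diagonal entry of \(\hat{\sigma}^2 (\boldsymbol{X}^\top\boldsymbol{X})^{-1}\).

import numpy as np

rng = np.random.default_rng(3)
n = 60
x = rng.uniform(0, 5, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1.5, n)      # made-up true model

X = np.column_stack([np.ones(n), x])
b = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ b
sigma2_hat = np.sum(resid**2) / (n - 2)        # unbiased estimate of the error variance

se_simple = np.sqrt(sigma2_hat / np.sum((x - x.mean())**2))
se_matrix = np.sqrt(sigma2_hat * np.linalg.inv(X.T @ X)[1, 1])
print(se_simple, se_matrix)                    # identical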

Efficiency (BLUE)

Theorem (Gauss–Markov)

The ordinary least squares (OLS) estimator has the lowest sampling variance within the class of linear unbiased estimators, if the errors in the linear regression model are uncorrelated, have equal variances and expectation value of zero. In abbreviation, the OLS estimator is BLUE: Best (lowest variance) Linear Unbiased Estimator.

Moreover,

  • If the error term is normally distributed, then OLS is the most efficient estimator among all consistent estimators (not just linear ones).

  • When the distribution of the error term is non-normal, other estimators may have lower variance than OLS, such as least absolute deviations (median regression).

Consistency

The OLS estimator is consistent,

\[ \hat{\boldsymbol{\beta}}_{OLS} \stackrel{P}{\rightarrow} \boldsymbol{\beta} \]

since

\[\begin{split}\begin{aligned} \operatorname{plim} \hat{\boldsymbol{\beta}} &= \operatorname{plim} \left( \boldsymbol{\beta} + (\boldsymbol{X} ^\top \boldsymbol{X} )^{-1} \boldsymbol{X} ^\top \boldsymbol{\varepsilon} \right) \\ &= \boldsymbol{\beta} + \operatorname{plim} \left( \frac{1}{n} \boldsymbol{X} ^\top \boldsymbol{X} \right)^{-1} \underbrace{\operatorname{plim} \left( \frac{1}{n} \boldsymbol{X} ^\top \boldsymbol{\varepsilon} \right) }_{=\boldsymbol{0} \text{ by LLN and } \operatorname{E}(\boldsymbol{x}_i \varepsilon_i) = \boldsymbol{0} }\\ &= \boldsymbol{\beta} \\ \end{aligned}\end{split}\]
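A small simulation sketch illustrating consistency: with made-up true coefficients and non-normal (Student-\(t\)) errors with mean zero, \(\hat{\boldsymbol{\beta}}\) concentrates around \(\boldsymbol{\beta}\) as \(n\) grows.

import numpy as np

rng = np.random.default_rng(4)
beta = np.array([1.0, 2.0])                        # made-up true coefficients

for n in [50, 500, 5000, 50000]:
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    y = X @ beta + rng.standard_t(df=5, size=n)    # non-normal errors with mean zero
    b = np.linalg.solve(X.T @ X, X.T @ y)
    print(n, b)                                    # approaches [1, 2] as n grows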

Large Sample Distribution

If we assume \(\varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)\), i.e. \(\boldsymbol{\varepsilon} \sim N_n(\boldsymbol{0} , \sigma^2 \boldsymbol{I} _n)\), then

\[ \boldsymbol{y} \sim N(\boldsymbol{X} \boldsymbol{\beta} , \sigma^2 \boldsymbol{I} ) \]

Hence, the distribution of the coefficients estimator is

\[\begin{split}\begin{aligned} \hat{\boldsymbol{\beta}} &= (\boldsymbol{X} ^\top \boldsymbol{X} ) ^{-1} \boldsymbol{X} ^\top \boldsymbol{y} \\ &\sim N\left(\boldsymbol{\beta} , (\boldsymbol{X} ^\top \boldsymbol{X} ) ^{-1} \boldsymbol{X} ^\top \operatorname{Var}\left( \boldsymbol{y} \right) \boldsymbol{X} (\boldsymbol{X} ^\top \boldsymbol{X} ) ^{-1} \right) \\ &\sim N(\boldsymbol{\beta} , \sigma^2 (\boldsymbol{X} ^\top \boldsymbol{X} )^{-1} ) \\ \end{aligned}\end{split}\]

The assumption may fail when the response variable \(y\) is

  • right skewed, e.g. wages, savings

  • non-negative, e.g. counts, arrests

When the normality assumption of the error term fails, the OLS estimator is asymptotically normal,

\[ \hat{\boldsymbol{\beta}} \overset{\mathcal{D}}{\rightarrow} N(\boldsymbol{\beta},\sigma^2 (\boldsymbol{X} ^\top \boldsymbol{X} )^{-1} ) \]

Therefore, in a large sample, even if the normality assumption fails, we can still carry out the hypothesis tests below that assume normality.

Residuals and Error Variance

Residuals \(\hat{\boldsymbol{\varepsilon}}\)

Definition

The residual is defined as the difference between the true response value \(y\) and our fitted response value \(\hat{y}\).

\[\hat\varepsilon_i = y_i - \hat{y}_i = y_i - \boldsymbol{x}_i ^\top \hat{\boldsymbol{\beta}}\]

It is an estimate of the error term \(\varepsilon_i\).

Properties

  1. The sum of the residuals is zero (when the model contains an intercept): \(\sum_i \hat{\varepsilon}_i = 0\)

  2. The sum of the products of the residuals with any covariate is zero, i.e. they are “uncorrelated”: \(\sum_i x_{ij} \hat{\varepsilon}_i = 0\) for all \(j\).

  3. The sum of squared residuals: \(\left\| \boldsymbol{\hat{\varepsilon}} \right\|^2 = \left\| \boldsymbol{y} - \boldsymbol{H} \boldsymbol{y} \right\|^2 = \boldsymbol{y} ^\top (\boldsymbol{I} - \boldsymbol{H} ) \boldsymbol{y}\)
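These properties can be verified numerically. A minimal sketch with arbitrary simulated data (properties 1 and 2 are just the normal equations \(\boldsymbol{X}^\top \hat{\boldsymbol{\varepsilon}} = \boldsymbol{0}\) when an intercept column is included):

import numpy as np

rng = np.random.default_rng(5)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix
e = y - H @ y                              # residuals

print("sum of residuals:", e.sum())                              # ~ 0
print("X^T residuals:   ", X.T @ e)                               # ~ 0 vector
print("||e||^2 vs y'(I-H)y:", e @ e, y @ (np.eye(n) - H) @ y)     # equal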

Estimation of Error Variance

In the estimation section we mentioned the estimator for the error variance \(\sigma^2\) is

\[ \hat{\sigma}^{2}=\frac{\|\boldsymbol{y}-\boldsymbol{X} \hat{\boldsymbol{\beta}}\|^{2}}{n-p} \]

This is because

\[\begin{split}\begin{aligned} \left\| \hat{\boldsymbol{\varepsilon}} \right\| ^2 &\sim \sigma^2\chi ^2 _{n-p} \\ \Rightarrow \quad \sigma^2 &=\operatorname{E}\left( \frac{\|\boldsymbol{y}-\boldsymbol{X} \hat{\boldsymbol{\beta}}\|^{2}}{n-p} \right) \end{aligned}\end{split}\]

and we use the method of moments. The derivation of the above expectation is a little involved.

Can we find \(\operatorname{Var}\left( \hat{\sigma}^2 \right)\) like we did for \(\operatorname{Var}\left( \hat{\boldsymbol{\beta}} \right)\)? No, not unless we make assumptions about higher-order moments of \(\varepsilon_i\).
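A small simulation sketch checking that \(\hat{\sigma}^2 = \|\hat{\boldsymbol{\varepsilon}}\|^2/(n-p)\) is unbiased, with an assumed true error variance \(\sigma^2 = 4\) (all values made up for illustration):

import numpy as np

rng = np.random.default_rng(6)
n, p, sigma2 = 40, 3, 4.0                      # assumed true error variance 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, -1.0])
H = X @ np.linalg.inv(X.T @ X) @ X.T

draws = []
for _ in range(20000):
    y = X @ beta + rng.normal(0, np.sqrt(sigma2), n)
    e = y - H @ y
    draws.append(e @ e / (n - p))              # sigma^2_hat for this sample

print(np.mean(draws))                          # close to 4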

Independence of \(\hat{\boldsymbol{\beta}}\) and \(\hat{\sigma}^2\)

To prove the independence between the coefficients estimator \(\hat{\boldsymbol{\beta} }\) and the error variance estimator \(\hat{\sigma}^2\), we need the Lemma below.

Lemma

Suppose a random vector \(\boldsymbol{y}\) follows multivariate normal distribution \(\boldsymbol{y} \sim N_m(\boldsymbol{\mu} , \sigma^2 I_m)\) and \(S, T\) are orthogonal subspaces of \(\mathbb{R} ^m\), then the two projected random vectors are independent

\[ \boldsymbol{P}_S (\boldsymbol{y}) \perp\!\!\!\perp \boldsymbol{P}_T (\boldsymbol{y}) \]

Note that \(\hat{\boldsymbol{\beta}}\) is a function of \(\boldsymbol{P} _{\operatorname{im}(\boldsymbol{X}) } (\boldsymbol{y})\) since

\[\begin{split}\begin{aligned} \hat{\boldsymbol{\beta}} &= (\boldsymbol{X} ^\top \boldsymbol{X} ) ^{-1} \boldsymbol{X} ^\top \boldsymbol{y} \\ &= (\boldsymbol{X} ^\top \boldsymbol{X} ) ^{-1} \boldsymbol{X} ^\top \boldsymbol{X} (\boldsymbol{X} ^\top \boldsymbol{X} ) ^{-1} \boldsymbol{X} ^\top \boldsymbol{y} \\ &= (\boldsymbol{X} ^\top \boldsymbol{X} ) ^{-1} \boldsymbol{X} ^\top \boldsymbol{P}_{\operatorname{im}(\boldsymbol{X} ) } (\boldsymbol{y}) \\ \end{aligned}\end{split}\]

and note that \(\hat{\sigma}^2\) is a function of \(\boldsymbol{P} _{\operatorname{im}(\boldsymbol{X} )^\bot } (\boldsymbol{y})\) since

\[ \hat{\sigma}^{2}=\frac{\|\hat{\boldsymbol{\varepsilon}} \|^{2}}{n-p} = \frac{\left\| \boldsymbol{P} _{\operatorname{im}(\boldsymbol{X})^\bot } (\boldsymbol{y}) \right\| ^2 }{n-p} \]

Since \(\operatorname{im}(\boldsymbol{X})\) and \(\operatorname{im}(\boldsymbol{X})^\bot\) are orthogonal subspaces of \(\mathbb{R} ^n\), by the lemma, \(\boldsymbol{P} _{\operatorname{im}(\boldsymbol{X}) } (\boldsymbol{y})\) and \(\boldsymbol{P} _{\operatorname{im}(\boldsymbol{X})^\bot } (\boldsymbol{y})\) are independent. Hence, \(\hat{\boldsymbol{\beta}}\) and \(\hat{\sigma}^2\) are independent.

Sum of Squares

We can think of each observation as being made up of an explained part, and an unexplained part.

  • Total sum of squares: \(TSS = \sum\left(y_{i}-\bar{y}\right)^{2}\)

  • Explained sum of squares: \(ESS = \sum\left(\hat{y}_{i}-\bar{y}\right)^{2}\)

  • Residual sum of squares: \(RSS = \sum (y_i - \hat{y}_i)^2\)

Decomposition of TSS

We have the decomposition identity

\[\begin{split}\begin{align} TSS &=\sum\left(y_{i}-\bar{y}\right)^{2} \\ &=\sum\left[\left(y_{i}-\hat{y}_{i}\right)+\left(\hat{y}_{i}-\bar{y}\right)\right]^{2} \\ &=\sum\left[\hat{\varepsilon}_{i}+\left(\hat{y}_{i}-\bar{y}\right)\right]^{2} \\ &=\sum \hat{\varepsilon}_{i}^{2}+2 \sum \hat{\varepsilon}_{i}\left(\hat{y}_{i}-\bar{y}\right)+\sum\left(\hat{y}_{i}-\bar{y}\right)^{2} \\ &= RSS + 2 \sum \hat{\varepsilon}_{i}\left(\hat{\beta}_0 + \hat{\beta}_1 x_{i}-\bar{y}\right)+ ESS \\ &= RSS + ESS \end{align}\end{split}\]

where we use the facts \(\sum_i \hat{\varepsilon}_i = 0\) and \(\sum_i \hat{\varepsilon}_i x_i = 0\) shown [above].
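A quick numerical check of the decomposition, using arbitrary simulated data (the model must contain an intercept for the cross term to vanish):

import numpy as np

rng = np.random.default_rng(7)
n = 25
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept included
y = 1.0 + 0.5 * X[:, 1] + rng.normal(size=n)

y_hat = X @ np.linalg.solve(X.T @ X, X.T @ y)
TSS = np.sum((y - y.mean())**2)
ESS = np.sum((y_hat - y.mean())**2)
RSS = np.sum((y - y_hat)**2)
print(TSS, ESS + RSS)                                    # equal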

Warning

Some statistics courses use the letters \(R\) and \(E\) to denote the opposite quantities:

  • Sum of squares due to regression: \(SSR = \sum\left(\hat{y}_{i}-\bar{y}\right)^{2}\)

  • Sum of squared errors: \(SSE = \sum (y_i - \hat{y}_i)^2\)

Non-increasing RSS

Given a data set, when we add a new explanatory variable into a regression model, \(RSS\) is non-increasing.

This can be better understood if we view the linear regression problem as an approximation problem: Given a target vector \(\boldsymbol{y} \in \mathbb{R} ^n\) and candidate vectors \(\boldsymbol{x} _1, \boldsymbol{x} _2, \ldots, \boldsymbol{x} _p \in \mathbb{R} ^n\), where \(\boldsymbol{x} _1 = \boldsymbol{1}_n\) and \(\boldsymbol{x} _1, \ldots, \boldsymbol{x} _{p}\) are linearly independent, we want to find a linear combination of \(\boldsymbol{x}\)’s to approximate \(\boldsymbol{y}\).

\[ \min \left\| \boldsymbol{y} - (\beta_1 \boldsymbol{x} _1 + \ldots + \beta_p \boldsymbol{x} _p) \right\| ^2 \]

The question is: what happens to the approximation performance if we add another candidate vector \(\boldsymbol{x} _{p+1} \in \mathbb{R} ^n\), where \(\boldsymbol{x} _{p+1} \notin \operatorname{span}(\boldsymbol{x} _1, \ldots, \boldsymbol{x} _p)\)?

For any \(\boldsymbol{x} _{p+1}\), we can decompose it into three parts

\[ \boldsymbol{x} _{p+1} = (a_1 \boldsymbol{x} _1 + \ldots + a _p \boldsymbol{x} _p) + a_y \boldsymbol{y}+ a_w\boldsymbol{w} \]

where

  • \(\boldsymbol{w}\) is orthogonal to target \(\boldsymbol{y}\) and candidates \(\boldsymbol{x} _1, \ldots, \boldsymbol{x} _p\)

  • \(a_y, a_w\) are not simultaneously zero to ensure \(\boldsymbol{x} _{p+1}\notin \operatorname{span}(\boldsymbol{x} _1, \ldots, \boldsymbol{x} _p)\)

Now consider different values of \(a_y\) and \(a_w\)

  1. If \(a_y \ne 0\) and \(a_w=0\), then we should obtain a perfect fit with \(RSS=0\), \(\hat{\beta}_j = - \frac{a_j}{a_y}\) and \(\hat{\beta}_{p+1} = \frac{1}{a_y}\).

  2. If \(a_y = 0\) and \(a_w\ne 0\), then adding \(\boldsymbol{x}_{p+1}\) brings no improvement in approximating \(\boldsymbol{y}\), since the existing candidates already capture the relevant variation in \(\boldsymbol{y}\) and \(\boldsymbol{w}\) is irrelevant (orthogonal, “uncorrelated”) to \(\boldsymbol{y}\). We obtain \(\hat{\beta}_{p+1}=0\) and the same \(RSS\). Note that this does NOT imply the new candidate and the target are orthogonal, i.e. \(\boldsymbol{x} _{p+1}^\top \boldsymbol{y} = 0\) need not hold, as shown in the experiment below. We can only say there is a part of \(\boldsymbol{x}_{p+1}\) that is uncorrelated with \(\boldsymbol{y}\); sometimes people interpret this as “\(X_{p+1}\) has no explanatory power for \(Y\) given the other covariates.”

  3. If \(a_y \ne 0\) and \(a_w\ne 0\): this is the general case, where \(\boldsymbol{x}_{p+1}\) contains some relevant information about \(\boldsymbol{y}\) (measured by \(a_y\)) beyond the other candidates’, and also some irrelevant information \(\boldsymbol{w}\) (measured by \(a_w\)). \(RSS\) decreases since we have an additional relevant candidate \(\boldsymbol{x} _{p+1}\), and we obtain \(\hat{\beta}_{p+1}\ne0\). Also note that a nonzero coefficient does NOT require the new candidate and \(\boldsymbol{y}\) to be correlated: \(\boldsymbol{x} _{p+1} ^\top \boldsymbol{y}\) can still be \(0\). See the example below.

To sum up, \(RSS\) either decreases or stays unchanged. It is unchanged iff \(\boldsymbol{x} _{p+1} = (a_1 \boldsymbol{x} _1 + \ldots + a _p \boldsymbol{x} _p) + a_w\boldsymbol{w}\), where \(a_w \ne 0\) and \(\boldsymbol{w}\) is orthogonal to \(\boldsymbol{y} , \boldsymbol{x}_1, \ldots, \boldsymbol{x} _p\). To check whether this condition holds, we can regress \(X_{p+1}\) on the existing \(p\) candidates and obtain the residual vector \(\hat{\boldsymbol{u}}\). If \(\hat{\boldsymbol{u}}\ne \boldsymbol{0}\), then \(a_y\) and \(a_w\) are not both zero; if in addition \(\hat{\boldsymbol{u}}^\top \boldsymbol{y} =0\), then \(a_y=0\), \(a_w\ne 0\), and \(\boldsymbol{w} \propto \hat{\boldsymbol{u}}\).

print("Experiment: \nDifferent values of a_y and a_w in x_3\n")
import numpy as np

y = np.vstack((np.array([1,1,1,1]).reshape(4,1), np.zeros((1,1))))
x1 = np.ones((5,1))
x2 = np.vstack((np.array([1,2,1]).reshape(3,1), np.zeros((2,1))))
w = np.vstack((np.array([1,-1,1,-1]).reshape(4,1), np.zeros((1,1))))

print(f"y: {y.T.flatten()}")
print(f"x1: {x1.T.flatten()}")
print(f"x2: {x2.T.flatten()}")
print(f"w: {w.T.flatten()}")
print(f"a1 = 0, a2 = 2")

def reg(ay=0, aw=0):
    # Fit OLS of y on (x1, x2) in the base case, or on (x1, x2, x3) where
    # x3 = 2*x2 + ay*y + aw*w, and report X^T y, the coefficients and the RSS.
    if ay != 0 or aw != 0:
        print(f"\nay = {ay}, aw = {aw} " + "-"*50)
        x3 = 2*x2 + ay*y + aw*w
        X = np.hstack((x1, x2, x3))
    else:
        print("\nBase case " + "-"*50)
        X = np.hstack((x1, x2))
    XXinv = np.linalg.inv(np.dot(X.T, X))
    print("X^T y: ", np.dot(X.T, y).T.flatten())
    b = np.dot(XXinv, np.dot(X.T, y))   # OLS coefficients (X^T X)^{-1} X^T y
    print("coefficients: ", b.flatten())
    r = y - X.dot(b)                    # residual vector
    print("RSS: ", r.T.dot(r).flatten())

reg()
reg(ay=0, aw=1)
reg(ay=1, aw=0)
reg(ay=1, aw=1)
Experiment: 
Different values of a_y and a_w in x_3

y: [1. 1. 1. 1. 0.]
x1: [1. 1. 1. 1. 1.]
x2: [1. 2. 1. 0. 0.]
w: [ 1. -1.  1. -1.  0.]
a1 = 0, a2 = 2

Base case --------------------------------------------------
X^T y:  [4. 4.]
coefficients:  [0.57142857 0.28571429]
RSS:  [0.57142857]

ay = 0, aw = 1 --------------------------------------------------
X^T y:  [4. 4. 8.]
coefficients:  [0.57142857 0.28571429 0.        ]
RSS:  [0.57142857]

ay = 1, aw = 0 --------------------------------------------------
X^T y:  [ 4.  4. 12.]
coefficients:  [-1.0658141e-14 -2.0000000e+00  1.0000000e+00]
RSS:  [5.01715536e-28]

ay = 1, aw = 1 --------------------------------------------------
X^T y:  [ 4.  4. 12.]
coefficients:  [0.5   0.    0.125]
RSS:  [0.5]
print("Example: \nAdding a variable that is uncorrelated with y, \nbut gives non-zero coefficient estimate.")

import numpy as np

y = np.array([[1,2,3]]).T
x0 = np.array([[1,1,1]])
x1 = np.array([[1,2,4]])

# reduced model
print("\nREDUCED MODEL " + "-"*50)
X = np.vstack((x0, x1)).T
XXinv = np.linalg.inv(np.dot(X.T, X))
b = np.dot(XXinv, np.dot(X.T, y))
print("coefficients: ", b.flatten())
r = y - X.dot(b)
print("RSS: ", r.T.dot(r).flatten())

# full model
print("\nFULL MODEl " + "-"*50)
x2 = np.array([[1,-2,1]])
print("x2 and y are uncorrelated, dot product is:", np.dot(x2, y).flatten())
X = np.vstack((x0, x1, x2)).T
XXinv = np.linalg.inv(np.dot(X.T, X))
b = np.dot(XXinv, np.dot(X.T, y))
print("coefficients: ", b.flatten())
r = y - X.dot(b)
print("RSS: ", r.T.dot(r).flatten())
Example: 
Adding a variable that is uncorrelated with y, 
but gives non-zero coefficient estimate.

REDUCED MODEL --------------------------------------------------
coefficients:  [0.5        0.64285714]
RSS:  [0.07142857]

FULL MODEL --------------------------------------------------
x2 and y are uncorrelated, dot product is: [0]
coefficients:  [ 0.44444444  0.66666667 -0.11111111]
RSS:  [6.27390939e-30]

\(R^2\) and Adjusted \(R^2\)

\(R\)-squared

Assuming the decomposition identity of \(TSS\) holds, we can define \(R\)-squared.

Definition (\(R\)-squared)

\(R\)-squared is a statistical measure that represents the proportion of the variance for a dependent variable that’s explained by an independent variable or variables in a regression model.

\[ R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2} \]

Properties

  1. \(R\)-squared can never decrease when an additional explanatory variable is added to the model.

    As long as \(\operatorname{Cov}(Y, X_j) \ne 0\), \(X_j\) has some explanatory power for \(Y\), and thus \(RSS\) decreases; see the section on \(RSS\) above for details. As a result, \(R\)-squared is not a good measure for model selection, since maximizing it can cause overfitting.

  2. \(R\)-squared equals the squared correlation coefficient between the actual value of the response and the fitted value \(\operatorname{Corr}\left( Y, \hat{Y} \right)^2\).

    In particular, in simple linear regression, \(R^2 = \rho_{X,Y}^2\).
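Property 2 can be verified numerically. A minimal sketch with simulated data; the last line checks the simple-regression special case \(R^2 = \rho_{X,Y}^2\):

import numpy as np

rng = np.random.default_rng(8)
n = 80
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
y_hat = X @ np.linalg.solve(X.T @ X, X.T @ y)
R2 = 1 - np.sum((y - y_hat)**2) / np.sum((y - y.mean())**2)

print(R2)
print(np.corrcoef(y, y_hat)[0, 1]**2)    # equals R2
print(np.corrcoef(x, y)[0, 1]**2)        # equals R2 in simple linear regression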

Adjusted \(R\)-squared

Due to the non-decreasing property of \(R\)-squared, we define the adjusted \(R\)-squared, which is a better measure of goodness of fit.

Definition

Adjusted \(R\)-squared, denoted by \(\bar{R}^2\), is defined as

\[ \bar{R}^2 = 1-\frac{RSS / (n-p)}{ TSS / (n-1)} \]

Properties

  • \(\bar{R}^2\) can increase or decrease. When a new variable is included, \(RSS\) decreases, but \((n-p)\) also decreases.

  • Relation to \(R\)-squared is

    \[ \bar{R}^2 = 1-\frac{n-1}{ n-p}(1 - R^2) < R^2 \]
  • Relation to estimated variance of random error and variance of response

    \[ \bar{R}^2 = 1-\frac{\hat{\sigma}^2}{\operatorname{Var}\left( y \right)} \]
  • Can be negative when

    \[ R^2 < \frac{p-1}{n-1} \]

    Equivalently, \(\bar{R}^2\) is negative exactly when the overall \(F\)-statistic for the regression (full model vs. intercept only, see the \(F\)-test section below) is less than 1.

\(t\)-test

We can use \(t\)-test to conduct a hypothesis testing on \(\boldsymbol{\beta}\), which has a general form

\[\begin{split}\begin{aligned} H_0 &: \boldsymbol{v} ^\top \boldsymbol{\beta} = c \\ H_1 &: \boldsymbol{v} ^\top \boldsymbol{\beta} \ne c \ (\text{two-sided} )\\ \end{aligned}\end{split}\]

Usually \(c=0\).

  • If \(c=0, \boldsymbol{v} = \boldsymbol{e} _j\), then this is equivalent to testing \(\beta_j=0\), i.e. the variable \(X_j\) has no effect on \(Y\) given all other variables.

  • If \(c=0, v_i=1, v_j=-1\) and \(v_k=0\) for \(k\ne i, j\), then this is equivalent to testing \(\beta_i = \beta_j\), i.e. the two variables \(X_i\) and \(X_j\) have the same effect on \(Y\) given all other variables.

The test statistic, under \(H_0\), is

\[ \frac{\boldsymbol{v} ^\top \hat{\boldsymbol{\beta}} - c}{\hat{\sigma}\sqrt{\boldsymbol{v} ^\top (\boldsymbol{X} ^\top \boldsymbol{X} ) ^{-1} \boldsymbol{v} }} \sim t_{n-p} \]

In particular, when \(p=2\), to test \(\beta_1 = c\), we use

\[ \frac{\hat{\beta}_1 - c}{\hat{\sigma}/ \sqrt{\sum_i \left(x_{i}-\bar{x}\right)^{2}}} \sim t_{n-2} \]

Following this analysis, we can find the \((1-\alpha)\%\) confidence interval for a scalar \(\boldsymbol{v} ^\top \boldsymbol{\beta}\) as

\[ \boldsymbol{v} ^\top \hat{\boldsymbol{\beta}} \pm t_{n-p}^{(1-\alpha/2)}\cdot \hat{\sigma} \sqrt{\boldsymbol{v} ^\top (\boldsymbol{X} ^\top \boldsymbol{X} )^{-1} \boldsymbol{v} } \]

In particular,

  • If \(\boldsymbol{v} = \boldsymbol{e}_j\), then this is the confidence interval for the coefficient \(\beta _j\).

  • If \(\boldsymbol{v} = \boldsymbol{x}_i\) where \(\boldsymbol{x}_i\) is in the data set, then this is the confidence interval for the in-sample fit at \(\boldsymbol{x}_i\), i.e. for the mean value \(\operatorname{E}\left( y _i \right) = \boldsymbol{x}_i ^\top \boldsymbol{\beta}\). If \(\boldsymbol{x}\) is not in the design matrix, then we are doing out-of-sample prediction, and a prediction interval should be used instead (see the last section).
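A minimal sketch of this \(t\)-test and confidence interval for \(\boldsymbol{v}^\top \boldsymbol{\beta}\), with simulated data and made-up true values (here \(\boldsymbol{v} = \boldsymbol{e}_2\) and \(c = 0\)); scipy.stats.t supplies the quantiles:

import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=n)   # made-up true model

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
sigma_hat = np.sqrt(np.sum((y - X @ b)**2) / (n - p))

v, c = np.array([0.0, 0.0, 1.0]), 0.0          # H0: beta_2 = 0
se = sigma_hat * np.sqrt(v @ XtX_inv @ v)
t_stat = (v @ b - c) / se
p_value = 2 * stats.t.sf(abs(t_stat), df=n - p)

t_crit = stats.t.ppf(0.975, df=n - p)          # alpha = 0.05
print("t =", t_stat, ", p =", p_value)
print("95% CI:", (v @ b - t_crit * se, v @ b + t_crit * se))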

\(F\)-test

Compare Two Nested Models

\(F\)-test can be used to compare two nested models

\[\begin{split}\begin{aligned} \text{Full model: } Y &\sim \left\{ X_j, j=1, \ldots, p-k-1, p-k, \ldots, p-1 \right\} \\ \text{Reduced model: } Y &\sim \left\{ X_j, j=1, \ldots, p-k-1 \right\} \end{aligned}\end{split}\]

The test statistic is

\[ \frac{(RSS_{\text{reduced} } - RSS_{\text{full} })/k}{RSS_{\text{full}}/(n-p)} \sim F_{k, n-p} \]

which can also be computed from \(R^2\), since \(TSS\) is the same for the two models

\[ F = \frac{(R^2 _{\text{full}} - R^2 _{\text{reduced}})/k}{(1 - R^2 _{\text{full}})/(n-p)} \]

In particular,

  • When \(k=p-1\), we are comparing a full model vs. intercept only, i.e.,

    \[ \beta_1 = \ldots = \beta_{p-1} = 0 \]

    In this case,

    \[ RSS_{\text{reduced}} = \left\| \boldsymbol{y} - \bar{y} \boldsymbol{1} _n \right\| ^2 = TSS \]

    and

    \[ F = \frac{(TSS - RSS_{\text{full}})/(p-1)}{RSS_{\text{full}}/(n-p)} = \frac{R^2 _{\text{full}}/k}{(1 - R^2 _{\text{full}})/(n-p)} \]
  • When \(k=1\), we are testing \(\beta_{p-1} = 0\). In this case, the \(F\)-test is equivalent to the \(t\)-test, and the two test statistics satisfy \(F_{1, n-p}=t^2_{n-p}\). The alternative hypothesis is two-sided in both cases; the rejection region of the \(t\)-test is two-sided while that of the \(F\)-test is one-sided.
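A minimal numerical sketch of the nested-model \(F\)-test, using made-up simulated data in which the last \(k=2\) slopes are truly zero:

import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
n, p, k = 120, 4, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, 0.0, 0.0]) + rng.normal(size=n)   # last two slopes are 0

def rss(X, y):
    b = np.linalg.solve(X.T @ X, X.T @ y)
    r = y - X @ b
    return r @ r

rss_full = rss(X, y)                 # all p columns
rss_reduced = rss(X[:, :p - k], y)   # drop the last k columns

F = ((rss_reduced - rss_full) / k) / (rss_full / (n - p))
print("F =", F, ", p-value =", stats.f.sf(F, k, n - p))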

Caveats

  • An \(F\)-test of \(\beta_1=\beta_2=0\) is different from two individual univariate \(t\)-tests of \(\beta_1=0\) and \(\beta_2=0\). A group of \(t\)-tests may be misleading if the regressors are highly correlated.

  • When there are multiple covariates, we can conduct sequential \(F\)-tests using the anova() function in R. The \(F\)-test in each row compares the existing model with the model that adds one more covariate. Hence, the order of the input covariates matters: exchanging the order of terms usually changes the sums of squares and the \(F\)-test \(p\)-values.

Compare Two Groups of Data

If the data set contains data from two groups, then the regression line can be different. For instance,

  • In time series analysis, there may be a structural break at a period which can be assumed to be known a priori (for instance, a major historical event such as a war).

  • In program evaluation, the explanatory variables may have different impacts on different subgroups of the population.

Fig. 61 Illustration of two groups of data [Wikipedia]

To test whether the regression function is different for one group \((i=1,\ldots,m)\) versus another \((i=m+1,\ldots,n)\), we can use an \(F\)-test. The full model is

\[\begin{split}\begin{aligned} y_{i} &= \left( \beta_{0}+\beta_{1} x_{1 i}+\beta_{2} x_{2 i}+\ldots+\beta_{k} x_{k i} \right) \\ &\ + \left( \gamma_{0} d_i+\gamma_{1} (d_i x_{1 i})+\gamma_{2} (d_i x_{2 i})+\ldots+\gamma_{k} (d_i x_{k i}) \right) \\ &\ +\varepsilon_{i} \end{aligned}\end{split}\]

where dummy variable \(d_i = 1\) if \(i=m+1, \ldots, n\) and \(0\) otherwise.

The null and alternative hypotheses are

\[\begin{split}\begin{aligned} H_{0}:&\ \gamma_{0}=0, \gamma_{1}=0, \ldots, \gamma_{k}=0 \\ H_{1}:&\ \text{otherwise} \end{aligned}\end{split}\]

The \(F\)-test statistic is

\[ F=\frac{\left(RSS_{\text{reduced}}-RSS_{\text{full}}\right) /(k+1)}{RSS_{\text{full}} / (n-2(k+1))} \]

Equivalently, there is a method called Chow test. It uses the fact that

\[ RSS_{\text{full}} = RSS_{1} + RSS_{2} \]

where \(RSS_1\) and \(RSS_2\) are the RSS obtained when we run the following regression model on two groups of data respectively,

\[ y_{i} = \beta_{0}+\beta_{1} x_{1 i}+\beta_{2} x_{2 i}+\ldots+\beta_{k} x_{k i} +\varepsilon_{i} \]

Hence, there is no need to create dummy variables \(d_i\). There are only three steps:

  1. Run the regression with group 1’s data to obtain \(RSS_1\)

  2. Run the regression with group 2’s data to obtain \(RSS_2\)

  3. Run the regression with all data to obtain \(RSS_{\text{reduced}}\)

And then use the \(F\)-test to test the hypothesis.
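Below is a minimal numpy/scipy sketch of these three steps for a simple regression with one slope (\(k=1\)), using simulated data with an assumed break between the two groups (all numbers made up for illustration):

import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
m, n, k = 60, 120, 1                      # group 1: observations 0..m-1; k slopes
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)    # made-up data-generating process
y[m:] += 1.5 * x[m:]                      # the slope changes in group 2

def rss(x, y):
    X = np.column_stack([np.ones(len(x)), x])
    b = np.linalg.solve(X.T @ X, X.T @ y)
    r = y - X @ b
    return r @ r

rss1 = rss(x[:m], y[:m])                  # step 1: group 1 only
rss2 = rss(x[m:], y[m:])                  # step 2: group 2 only
rss_reduced = rss(x, y)                   # step 3: pooled data
rss_full = rss1 + rss2

F = ((rss_reduced - rss_full) / (k + 1)) / (rss_full / (n - 2 * (k + 1)))
print("F =", F, ", p-value =", stats.f.sf(F, k + 1, n - 2 * (k + 1)))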

Confidence Region for \(\boldsymbol{\beta}\)

If we want to draw conclusions about multiple coefficients \(\beta_1, \beta_2, \ldots\) simultaneously, we need a confidence region and must take the multiple testing issue into account.

To find a \((1-\alpha)\%\) confidence region for \(\boldsymbol{\beta}\), one attempt is to use a cuboid whose \(j\)-th side is the \((1-\alpha/p)\) confidence interval for \(\beta_j\) (a Bonferroni-type correction). Namely, the confidence region is

\[ \left[ (1-\alpha/p) \text{ C.I. for } \beta_0 \right] \times \left[ (1-\alpha/p) \text{ C.I. for } \beta_1 \right] \times \ldots \times \left[ (1-\alpha/p) \text{ C.I. for } \beta_{p-1} \right] \]

In this way, we ensure the overall confidence of the confidence region is at least \((1-\alpha)\%\).

\[ \operatorname{P}\left( \text{every $\beta$ is in its C.I.} \right) \ge 1-\alpha \]

A more natural approach is using an ellipsoid. Recall that

\[ \hat{\boldsymbol{\beta}} \sim N(\boldsymbol{\beta} , \sigma^2 (\boldsymbol{X} ^\top \boldsymbol{X} )^{-1} ) \]

Hence a pivot quantity for \(\boldsymbol{\beta}\) can be constructed as follows,

\[\begin{split}\begin{aligned} &&\frac{(\boldsymbol{X} ^\top \boldsymbol{X} )^{1/2}(\hat{\boldsymbol{\beta}} -\boldsymbol{\beta} )}{\sigma } &\sim N(\boldsymbol{0} , \boldsymbol{I} _p) \\ &\Rightarrow& \ \frac{1}{\sigma^2} \left\| (\boldsymbol{X} ^\top \boldsymbol{X} )^{1/2}(\hat{\boldsymbol{\beta}} -\boldsymbol{\beta} ) \right\|^2 &\sim \chi ^2 _p \\ &\Rightarrow& \ \frac{\frac{1}{\sigma^2} \left\| (\boldsymbol{X} ^\top \boldsymbol{X} )^{1/2}(\hat{\boldsymbol{\beta}} -\boldsymbol{\beta} ) \right\|^2/p}{\frac{(n-p)\hat{\sigma}^2}{\sigma^2 }/(n-p)} &\sim F_{p, n-p}\\ &\Rightarrow& \ \frac{ (\hat{\boldsymbol{\beta}} -\boldsymbol{\beta} )^\top (\boldsymbol{X} ^\top \boldsymbol{X} )(\hat{\boldsymbol{\beta}} -\boldsymbol{\beta} )}{p \hat{\sigma}^2} &\sim F_{p, n-p}\\ \end{aligned}\end{split}\]

Therefore, we can obtain an \((1-\alpha)\%\) confidence region for \(\boldsymbol{\beta}\) from this distribution

\[ \boldsymbol{\beta} \in \left\{ \boldsymbol{v} \in \mathbb{R} ^p: \frac{ (\hat{\boldsymbol{\beta}} -\boldsymbol{v} )^\top (\boldsymbol{X} ^\top \boldsymbol{X} )(\hat{\boldsymbol{\beta}} -\boldsymbol{v} )}{p \hat{\sigma}^2} \le F_{p, n-p}^{(1-\alpha)} \right\} \]

which is an ellipsoid centered at \(\hat{\boldsymbol{\beta}}\), whose orientation and axis lengths are determined by \(\boldsymbol{X} ^\top \boldsymbol{X}\), and whose size is scaled by \(p \hat{\sigma}^2 F_{p, n-p}^{(1-\alpha)}\).
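To check whether a particular vector lies in this region, we only need the quadratic form and an \(F\) quantile. A minimal sketch with simulated data and made-up true coefficients:

import numpy as np
from scipy import stats

rng = np.random.default_rng(12)
n, p, alpha = 100, 3, 0.05
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -1.0])          # made-up true coefficients
y = X @ beta_true + rng.normal(size=n)

XtX = X.T @ X
b = np.linalg.solve(XtX, X.T @ y)
sigma2_hat = np.sum((y - X @ b)**2) / (n - p)

def in_region(v):
    # quadratic form (b - v)' X'X (b - v) / (p * sigma2_hat) vs the F quantile
    q = (b - v) @ XtX @ (b - v) / (p * sigma2_hat)
    return q <= stats.f.ppf(1 - alpha, p, n - p)

print(in_region(beta_true))     # True for most samples (coverage 1 - alpha)
print(in_region(np.zeros(p)))   # typically False for this design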

In general, for matrix \(\boldsymbol{A} \in \mathbb{R} ^{p \times k}, \operatorname{rank}\left( \boldsymbol{A} \right) = k\), the confidence region for \(\boldsymbol{A} ^\top \boldsymbol{\beta}\) can be found in a similar way

\[\begin{split}\begin{aligned} \boldsymbol{A} ^\top \hat{\boldsymbol{\beta}} &\sim N(\boldsymbol{A} ^\top \boldsymbol{\beta} , \sigma^2 \boldsymbol{A} ^\top (\boldsymbol{X} ^\top \boldsymbol{X} ) ^{-1} \boldsymbol{A} ) \\ \Rightarrow \quad\ldots &\sim F_{k, n-p} \\ \end{aligned}\end{split}\]

Prediction Interval for \(y_{new}\)

For a new \(\boldsymbol{x}\), the new response is

\[ y _{new} = \boldsymbol{x} ^\top \boldsymbol{\beta} + \varepsilon _{new} \]

where \(\varepsilon _{new} \perp\!\!\!\perp \hat{\boldsymbol{\beta}} , \hat{\sigma}\), since the latter are computed from the training set.

The prediction is

\[ \hat{y} _{new} = \boldsymbol{x} ^\top \hat{\boldsymbol{\beta}} \]

Thus, the prediction error is

\[\begin{split}\begin{aligned} y _{new} - \hat{y}_{new} &= \varepsilon _{new} + \boldsymbol{x} ^\top (\boldsymbol{\beta} - \hat{\boldsymbol{\beta}} )\\ &\sim N \left( 0 , \sigma^2 (1 + \boldsymbol{x} ^\top (\boldsymbol{X} ^\top \boldsymbol{X} )^{-1} \boldsymbol{x} ) \right) \\ \end{aligned}\end{split}\]

Hence, the \((1-\alpha)\%\) prediction interval for a new response value \(y _{new}\) at an out-of-sample \(\boldsymbol{x}\) is

\[ \boldsymbol{x} ^\top \hat{\boldsymbol{\beta}} \pm t_{n-p}^{(1-\alpha/2)}\cdot \hat{\sigma} \sqrt{1 + \boldsymbol{x} ^\top (\boldsymbol{X} ^\top \boldsymbol{X} )^{-1} \boldsymbol{x} } \]

There is an additional “\(1\)” under the square root, compared to the confidence interval for an in-sample fit,

\[ \boldsymbol{x} ^\top \hat{\boldsymbol{\beta}} \pm t_{n-p}^{(1-\alpha/2)}\cdot \hat{\sigma} \sqrt{ \boldsymbol{x} ^\top (\boldsymbol{X} ^\top \boldsymbol{X} )^{-1} \boldsymbol{x} } \]
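A minimal sketch computing both intervals at a new \(\boldsymbol{x}\), using simulated data with made-up true values; the prediction interval is visibly wider because of the extra \(1\):

import numpy as np
from scipy import stats

rng = np.random.default_rng(13)
n, p, alpha = 80, 2, 0.05
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, n)       # made-up true model

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
sigma_hat = np.sqrt(np.sum((y - X @ b)**2) / (n - p))
t_crit = stats.t.ppf(1 - alpha / 2, n - p)

x_new = np.array([1.0, 12.0])                   # an out-of-sample point
center = x_new @ b
leverage = x_new @ XtX_inv @ x_new              # x' (X'X)^{-1} x

half_ci = t_crit * sigma_hat * np.sqrt(leverage)
half_pi = t_crit * sigma_hat * np.sqrt(1 + leverage)
print("CI for the mean response:", (center - half_ci, center + half_ci))
print("PI for a new response:  ", (center - half_pi, center + half_pi))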