Linear Models - Inference¶
We first describe the properties of the OLS estimator \(\hat{\boldsymbol{\beta}}\) and the corresponding residuals \(\hat{\boldsymbol{\varepsilon} }\). Then we introduce the sums of squares, \(R\)-squared, hypothesis testing, and confidence intervals. All these methods assume normality of the error terms \(\varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)\) unless otherwise specified.
Coefficients \(\boldsymbol{\beta}\)¶
Unbiasedness¶
The OLS estimators are unbiased since
when \(p=2\),
To prove unbiasedness, we use the fact that for any constant \(c\),
Then, the numerator becomes
Hence
and
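As a quick sanity check (not part of the derivation above), here is a minimal NumPy simulation sketch with a made-up design and true \(\boldsymbol{\beta}\): averaging the OLS estimates over many error draws should recover \(\boldsymbol{\beta}\).

import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack((np.ones(n), rng.normal(size=(n, p - 1))))  # fixed design with intercept
beta = np.array([1.0, 2.0, -0.5])                               # true coefficients (made up)
XtXinv_Xt = np.linalg.inv(X.T @ X) @ X.T

# average the OLS estimates over many error draws
b_hats = []
for _ in range(5000):
    y = X @ beta + rng.normal(scale=1.0, size=n)   # iid N(0, 1) errors
    b_hats.append(XtXinv_Xt @ y)
print("true beta:        ", beta)
print("mean of estimates:", np.mean(b_hats, axis=0))  # close to beta, i.e. unbiased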
Variance¶
The variance (covariance matrix) of the coefficients is
What is the \((j,j)\)-th entry \(\operatorname{Var}\left( \hat{\beta}_j \right)\)?
More specifically, for the \(j\)-th coefficient estimator \(\hat{\beta}_j\), its variance is,
where \(R_j^2\), \(RSS_j\), \(TSS_j\), and \(\hat{x}_{ij}\) are the corresponding representatives when we regress \(X_j\) over all other explanatory variables.
Note that \(R_1^2 = 0\) when we regress \(X_1\) on a constant intercept alone. So we have the particular result for simple linear regression.
More generally, for the heteroskedastic case, let \(\hat{u}_i = x_{ij} - \hat{x}_{ij}\)
For derivation, see partialling out.
In particular, if homoskedasticity is assumed \(\operatorname{Var}\left( \varepsilon_i \right) = \sigma^2\), then it reduces to the formula above. In general, if heteroskedasticity exists but zero mean \(\operatorname{E}\left( \varepsilon_i \right)=0\) is assumed, one can estimate \(\operatorname{Var}\left( \varepsilon_i \right)\) by \(\hat{\varepsilon}_i^2\), and obtain standard error.
When \(p=2\), the inverse \((\boldsymbol{X} ^\top \boldsymbol{X} )^{-1}\) is
the variance of \(\hat{\beta}_1\) is
We conclude that
The larger the error variance, \(\sigma^2\), the larger the variance of the coefficient estimates.
The larger the variability in the \(x_i\), the smaller the variance.
A larger sample size should decrease the variance.
In multiple regression, reducing the correlation between \(X_j\) and the other covariates (e.g. by an orthogonal design) decreases \(R^2_{j}\), and hence decreases the variance.
A problem is that the error variance \(\sigma^2\) is unknown. In practice, we can estimate \(\sigma^2\) by its unbiased estimator \(\hat{\sigma}^2=\frac{\sum_i \hat{\varepsilon}_i^2}{n-2}\) (to be shown [link]), and substitute it into \(\operatorname{Var}\left( \hat{\beta}_1 \right)\). Since the error variance \(\hat{\sigma}^2\) is estimated, the slope variance \(\operatorname{Var}\left( \hat{\beta}_1 \right)\) is estimated too, and hence its square root is called the standard error of \(\hat{\beta}_1\), instead of the standard deviation.
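A minimal sketch of these formulas in practice on simulated data (the design and true coefficients below are made up): estimate \(\sigma^2\) by \(RSS/(n-p)\), plug it into \(\hat{\sigma}^2(\boldsymbol{X}^\top \boldsymbol{X})^{-1}\), and take square roots of the diagonal to get standard errors.

import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X = np.column_stack((np.ones(n), rng.normal(size=(n, p - 1))))
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)   # made-up true model

b = np.linalg.inv(X.T @ X) @ X.T @ y            # OLS estimate
resid = y - X @ b
sigma2_hat = resid @ resid / (n - p)            # unbiased estimate of sigma^2
cov_b = sigma2_hat * np.linalg.inv(X.T @ X)     # estimated Var(beta_hat)
se = np.sqrt(np.diag(cov_b))                    # standard errors
print("coefficients:   ", b)
print("standard errors:", se)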
Efficiency (BLUE)¶
- Theorem (Gauss–Markov)
The ordinary least squares (OLS) estimator has the lowest sampling variance within the class of linear unbiased estimators, if the errors in the linear regression model are uncorrelated, have equal variances, and have expectation zero. In abbreviation, the OLS estimator is BLUE: Best (lowest variance) Linear Unbiased Estimator.
Proof
Let \(\tilde{\boldsymbol{\beta}} = \boldsymbol{C} \boldsymbol{y}\) be another linear estimator of \(\boldsymbol{\beta}\). We can write \(\boldsymbol{C} = \left( \boldsymbol{X} ^\top \boldsymbol{X} \right)^{-1} \boldsymbol{X} ^\top + \boldsymbol{D}\) where \(\boldsymbol{D} \ne \boldsymbol{0}\). Then
Hence, \(\tilde{\boldsymbol{\beta}}\) is unbiased iff \(\boldsymbol{D} \boldsymbol{X} = 0\).
The variance is
Since \(\sigma^2 \boldsymbol{D} \boldsymbol{D} ^\top \in \mathrm{PSD}\), we have
The equality holds iff \(\boldsymbol{D} \boldsymbol{D} ^\top = \boldsymbol{0}\), which implies \(\operatorname{tr}\left( \boldsymbol{D} \boldsymbol{D} ^\top \right) = 0\), so \(\left\Vert \boldsymbol{D} \right\Vert _F^2 = 0\), so \(\boldsymbol{D} = \boldsymbol{0}\), i.e. \(\tilde{\boldsymbol{\beta} } = \hat{\boldsymbol{\beta} }\). Therefore, the BLUE is unique.
Moreover,
If the error term is normally distributed, then OLS is the most efficient among all consistent estimators (not just linear ones).
When the distribution of the error term is non-normal, other estimators, such as least absolute deviations (median regression), may have lower variance than OLS.
Consistency¶
The OLS estimator is consistent,
since
Large Sample Distribution¶
If we assume \(\varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)\), or equivalently \(\boldsymbol{\varepsilon} \sim N_n(\boldsymbol{0} , \sigma^2 \boldsymbol{I} _n)\), then
Hence, the distribution of the coefficients estimator is
The assumption may fail when the response variable \(y\) is
right skewed, e.g. wages, savings
non-negative, e.g. counts, arrests
When the normality assumption of the error term fails, the OLS estimator is asymptotically normal,
Therefore, in a large sample, even if the normality assumption fails, we can still do hypothesis testing which assumes normality.
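A simulation sketch of this asymptotic normality (the centered exponential errors below are an arbitrary non-normal choice): the standardized slope estimates behave approximately like \(N(0,1)\) for moderately large \(n\).

import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
X = np.column_stack((np.ones(n), x))
beta = np.array([1.0, 0.5])

z = []
for _ in range(2000):
    eps = rng.exponential(scale=1.0, size=n) - 1.0      # non-normal errors with mean 0, variance 1
    y = X @ beta + eps
    b = np.linalg.inv(X.T @ X) @ X.T @ y
    se1 = np.sqrt(np.linalg.inv(X.T @ X)[1, 1])         # true sigma = 1 here, so this is sd(beta_1_hat)
    z.append((b[1] - beta[1]) / se1)
z = np.array(z)
# for N(0,1), these should be close to 0, 1, and 0.95 respectively
print("mean:", z.mean().round(3), " sd:", z.std().round(3),
      " P(|Z|<1.96):", np.mean(np.abs(z) < 1.96).round(3))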
Sketch of derivation
Note that
By CLT, the second term
By Slutsky’s Theorem, the product
or equivalently,
Residuals and Error Variance¶
Residuals \(\hat{\boldsymbol{\varepsilon}}\)¶
- Definition
The residual is defined as the difference between the true response value \(y\) and our fitted response value \(\hat{y}\).
\[\hat\varepsilon_i = y_i - \hat{y}_i = y_i - \boldsymbol{x}_i ^\top \hat{\boldsymbol{\beta}}\]It is an estimate of the error term \(\varepsilon_i\).
Properties
The sum of the residual is zero: \(\sum_i \hat{\varepsilon}_i = 0\)
The sum of the product of residual and any covariate is zero, or they are “uncorrelated”: \(\sum_i x_{ij} \hat{\varepsilon}_i = 0\) for all \(j\).
The sum of squared residuals: \(\left\| \boldsymbol{\hat{\varepsilon}} \right\|^2 = \left\| \boldsymbol{y} - \boldsymbol{H} \boldsymbol{y} \right\|^2 = \boldsymbol{y} ^\top (\boldsymbol{I} - \boldsymbol{H} ) \boldsymbol{y}\)
Proof
Recall the normal equation
We obtain
Since the first column of \(\boldsymbol{X}\) is \(\boldsymbol{1}\) , we have
For other columns \(\boldsymbol{x}_j\) in \(\boldsymbol{X}\), we have
The 3rd equality holds since \(\boldsymbol{I} - \boldsymbol{H}\) is a projection matrix if \(\boldsymbol{H}\) is a projection matrix, i.e.,
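These properties are easy to check numerically; here is a minimal sketch on made-up data (note the first two properties rely on the intercept column being included):

import numpy as np

rng = np.random.default_rng(3)
n = 20
X = np.column_stack((np.ones(n), rng.normal(size=(n, 2))))   # includes an intercept column
y = rng.normal(size=n)
H = X @ np.linalg.inv(X.T @ X) @ X.T                         # hat matrix
resid = y - H @ y

print("sum of residuals:", resid.sum().round(10))                      # ~ 0
print("X^T residuals:   ", (X.T @ resid).round(10))                    # ~ 0 vector
print("||resid||^2:     ", (resid @ resid).round(6))
print("y^T (I - H) y:   ", (y @ (np.eye(n) - H) @ y).round(6))         # same value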
Estimation of Error Variance¶
In the estimation section we mentioned the estimator for the error variance \(\sigma^2\) is
This is because
and we used the method of moment estimator. The derivation of the above expectation is a little involved.
Derivation
Let
\(\boldsymbol{U} = [\boldsymbol{U} _ \boldsymbol{X} , \boldsymbol{U} _\bot] \in \mathbb{R}^{n \times n}\) be an orthogonal matrix whose columns form an orthonormal basis of \(\mathbb{R} ^{n}\), where
\(\boldsymbol{U} _ \boldsymbol{X} = [\boldsymbol{u} _1, \ldots, \boldsymbol{u} _p]\) is an orthonormal basis of the column space (image) of \(\boldsymbol{X}\), denoted \(\operatorname{col}(\boldsymbol{X} )\)
\(\boldsymbol{U} _ \bot = [\boldsymbol{u} _{p+1}, \ldots, \boldsymbol{u} _n]\) is an orthonormal basis of the orthogonal complement of the column space of \(\boldsymbol{X}\) (the kernel of \(\boldsymbol{X}^\top\)), denoted \(\operatorname{col}(\boldsymbol{X} ) ^\bot\).
Recall
which is
Note that assuming \(\boldsymbol{\varepsilon} \sim N(\boldsymbol{0} , \sigma^2 \boldsymbol{I} )\), we have
and hence the sum of squared normal variables follows
Thus,
The first moment is
or equivalently
Therefore, the method of moment estimator for \(\sigma^2\) is
which is unbiased.
Can we find \(\operatorname{Var}\left( \hat{\sigma}^2 \right)\) like we did for \(\operatorname{Var}\left( \hat{\boldsymbol{\beta}} \right)\)? No, unless we make assumptions on higher-order moments of \(\varepsilon_i\).
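As a quick simulation check of unbiasedness (simulated data with a made-up true \(\sigma^2 = 4\)), comparing \(RSS/(n-p)\) with the naive \(RSS/n\):

import numpy as np

rng = np.random.default_rng(4)
n, p, sigma2 = 30, 4, 4.0
X = np.column_stack((np.ones(n), rng.normal(size=(n, p - 1))))
beta = np.array([1.0, -1.0, 2.0, 0.5])
H = X @ np.linalg.inv(X.T @ X) @ X.T

est_np, est_n = [], []
for _ in range(5000):
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    rss = y @ (np.eye(n) - H) @ y
    est_np.append(rss / (n - p))   # unbiased estimator
    est_n.append(rss / n)          # naive estimator, biased downwards
print("true sigma^2:     ", sigma2)
print("mean of RSS/(n-p):", np.mean(est_np).round(3))
print("mean of RSS/n:    ", np.mean(est_n).round(3))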
Independence of \(\hat{\boldsymbol{\beta}}\) and \(\hat{\sigma}^2\)¶
To prove the independence between the coefficients estimator \(\hat{\boldsymbol{\beta} }\) and the error variance estimator \(\hat{\sigma}^2\), we need the lemma below.
- Lemma
Suppose a random vector \(\boldsymbol{y}\) follows multivariate normal distribution \(\boldsymbol{y} \sim N_m(\boldsymbol{\mu} , \sigma^2 I_m)\) and \(S, T\) are orthogonal subspaces of \(\mathbb{R} ^m\), then the two projected random vectors are independent
Note that \(\hat{\boldsymbol{\beta}}\) is a function of \(\boldsymbol{P} _{\operatorname{im}(\boldsymbol{X}) } (\boldsymbol{y})\) since
and note that \(\hat{\sigma}^2\) is a function of \(\boldsymbol{P} _{\operatorname{im}(\boldsymbol{X} )^\bot } (\boldsymbol{y})\) since
Since \(\operatorname{im}(\boldsymbol{X})\) and \(\operatorname{im}(\boldsymbol{X})^\bot\) are orthogonal subspaces of \(\mathbb{R} ^n\), by the lemma, \(\boldsymbol{P} _{\operatorname{im}(\boldsymbol{X}) } (\boldsymbol{y})\) and \(\boldsymbol{P} _{\operatorname{im}(\boldsymbol{X})^\bot } (\boldsymbol{y})\) are independent. Hence, \(\hat{\boldsymbol{\beta}}\) and \(\hat{\sigma}^2\) are independent.
Proof of Lemma
Let \(\boldsymbol{z} \sim N(\boldsymbol{0} , \boldsymbol{I} _m)\) be a standard multivariate normal random vector, and \(\boldsymbol{U} = [\boldsymbol{U} _S, \boldsymbol{U} _T]\) be an orthogonal basis of \(\mathbb{R} ^m\). Then
Let \(\boldsymbol{y} = \boldsymbol{\mu} + \sigma \boldsymbol{z}\), then
\(\square\)
Sum of Squares¶
We can think of each observation as being made up of an explained part, and an unexplained part.
Total sum of squares: \(TSS = \sum\left(y_{i}-\bar{y}\right)^{2}\)
Explained sum of squares: \(ESS = \sum\left(\hat{y}_{i}-\bar{y}\right)^{2}\)
Residual sum of squares: \(RSS = \sum (y_i - \hat{y}_i)^2\)
Decomposition of TSS¶
We have the decomposition identity
where we use the facts that \(\sum_i \hat{\varepsilon}_i = 0\) and \(\sum_i \hat{\varepsilon}_i x_{ij} = 0\) shown [above].
Warning
Some statistics courses use the letters \(R\) and \(E\) to denote the opposite quantities:
Sum of squares due to regression: \(SSR = \sum\left(\hat{y}_{i}-\bar{y}\right)^{2}\)
Sum of squared errors: \(SSE = \sum (y_i - \hat{y}_i)^2\)
Decomposition due to orthogonality
In vector form, the decomposition identity is equivalent to
From linear algebra’s perspective, it holds because the LHS vector \(\boldsymbol{y} - \bar{y}\boldsymbol{1} _n\) is the sum of the two RHS vectors, and the two vectors are orthogonal
More specifically, they are orthogonal because
since \(\boldsymbol{1} _n \in \operatorname{col}(\boldsymbol{X})\), if an intercept term is included in the model.
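A quick numerical check of the identity \(TSS = ESS + RSS\) on simulated data (with an intercept column included):

import numpy as np

rng = np.random.default_rng(5)
n = 25
X = np.column_stack((np.ones(n), rng.normal(size=(n, 2))))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

b = np.linalg.inv(X.T @ X) @ X.T @ y
y_hat = X @ b
tss = np.sum((y - y.mean()) ** 2)
ess = np.sum((y_hat - y.mean()) ** 2)
rss = np.sum((y - y_hat) ** 2)
print("TSS:      ", tss.round(6))
print("ESS + RSS:", (ess + rss).round(6))   # equal when an intercept is included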
Non-increasing RSS¶
Given a data set, when we add a new explanatory variable into a regression model, \(RSS\) is non-increasing.
This can be better understood if we view the linear regression problem as an approximation problem: Given a target vector \(\boldsymbol{y} \in \mathbb{R} ^n\) and candidate vectors \(\boldsymbol{x} _1, \boldsymbol{x} _2, \ldots, \boldsymbol{x} _p \in \mathbb{R} ^n\), where \(\boldsymbol{x} _1 = \boldsymbol{1}_n\) and \(\boldsymbol{x} _1, \ldots, \boldsymbol{x} _{p}\) are linearly independent, we want to find a linear combination of \(\boldsymbol{x}\)’s to approximate \(\boldsymbol{y}\).
The question is, what happens to the approximation performance if we add another candidate vector \(\boldsymbol{x} _{p+1} \in \mathbb{R} ^n\), where \(\boldsymbol{x} _{p+1} \notin \operatorname{span}(\boldsymbol{x} _1, \ldots, \boldsymbol{x} _p)\)?
For any \(\boldsymbol{x} _{p+1}\), we can decompose it into three parts
\[\boldsymbol{x} _{p+1} = (a_1 \boldsymbol{x} _1 + \ldots + a _p \boldsymbol{x} _p) + a_y \boldsymbol{y} + a_w \boldsymbol{w}\]
where
\(\boldsymbol{w}\) is orthogonal to target \(\boldsymbol{y}\) and candidates \(\boldsymbol{x} _1, \ldots, \boldsymbol{x} _p\)
\(a_y, a_w\) are not simultaneously zero to ensure \(\boldsymbol{x} _{p+1}\notin \operatorname{span}(\boldsymbol{x} _1, \ldots, \boldsymbol{x} _p)\)
Now consider different values of \(a_y\) and \(a_w\)
If \(a_y \ne 0\) and \(a_w=0\), then we should obtain a perfect fit with \(RSS=0\), \(\hat{\beta}_j = - \frac{a_j}{a_y}\) and \(\hat{\beta}_{p+1} = \frac{1}{a_y}\).
If \(a_y = 0\) and \(a_w\ne 0\), then adding \(\boldsymbol{x} _{p+1}\) brings no improvement in approximating \(\boldsymbol{y}\), since the existing candidates already capture the relevant variation in \(\boldsymbol{y}\) and \(\boldsymbol{w}\) is irrelevant/orthogonal/uncorrelated to \(\boldsymbol{y}\). We should obtain \(\hat{\beta}_{p+1}=0\) and the same \(RSS\). Note that this does NOT imply the new candidate \(\boldsymbol{x} _{p+1}\) and the target \(\boldsymbol{y}\) are uncorrelated (i.e. \(\boldsymbol{x} _{p+1}^\top \boldsymbol{y} = 0\)), as shown in the experiment below. We can only say there is a part of \(\boldsymbol{x} _{p+1}\) that is uncorrelated with \(\boldsymbol{y}\), or, as it is sometimes interpreted, “\(X_{p+1}\) has no explanatory power for \(Y\) given the other covariates.”
If \(a_y \ne 0\) and \(a_w\ne 0\), this is the general case, where \(\boldsymbol{x} _{p+1}\) contains some relevant information about \(\boldsymbol{y}\) (measured by \(a_y\)) beyond the other candidates’, and also some irrelevant information \(\boldsymbol{w}\) (measured by \(a_w\)). \(RSS\) decreases since we have an additional relevant candidate \(\boldsymbol{x} _{p+1}\), and we obtain \(\hat{\beta}_{p+1}\ne0\). Also note that this does NOT imply the new candidate \(\boldsymbol{x} _{p+1}\) and \(\boldsymbol{y}\) must be correlated (i.e. \(\boldsymbol{x} _{p+1} ^\top \boldsymbol{y} \ne 0\)). See the example below.
To sum up, \(RSS\) either decreases or stays unchanged. It is unchanged iff \(\boldsymbol{x} _{p+1} = (a_1 \boldsymbol{x} _1 + \ldots + a _p \boldsymbol{x} _p) + a_w\boldsymbol{w}\), where \(a_w \ne 0\) and \(\boldsymbol{w}\) is orthogonal to \(\boldsymbol{y} , \boldsymbol{x}_1, \ldots, \boldsymbol{x} _p\). To check whether this condition holds, we can regress \(X_{p+1}\) over the existing \(p\) candidates and obtain the residual \(\hat{\boldsymbol{u}}\). If \(\hat{\boldsymbol{u}}\ne \boldsymbol{0}\), then \(a_y\) and \(a_w\) are not both zero; further, if \(\hat{\boldsymbol{u}}^\top \boldsymbol{y} =0\), then \(a_y=0\), \(a_w\ne 0\), and \(a_w \boldsymbol{w} = \hat{\boldsymbol{u}}\).
print("Experiment: \nDifferent values of a_y and a_w in x_3\n")
import numpy as np
y = np.vstack((np.array([1,1,1,1]).reshape(4,1), np.zeros((1,1))))
x1 = np.ones((5,1))
x2 = np.vstack((np.array([1,2,1]).reshape(3,1), np.zeros((2,1))))
w = np.vstack((np.array([1,-1,1,-1]).reshape(4,1), np.zeros((1,1))))
print(f"y: {y.T.flatten()}")
print(f"x1: {x1.T.flatten()}")
print(f"x2: {x2.T.flatten()}")
print(f"w: {w.T.flatten()}")
print(f"a1 = 0, a2 = 2")
def reg(ay=0, aw=0):
    if ay != 0 or aw != 0:
        print(f"\nay = {ay}, aw = {aw} " + "-"*50)
        x3 = 2*x2 + ay*y + aw*w
        X = np.hstack((x1, x2, x3))
    else:
        print("\nBase case " + "-"*50)
        X = np.hstack((x1, x2))
    XXinv = np.linalg.inv(np.dot(X.T, X))
    print("X^T y: ", np.dot(X.T, y).T.flatten())
    b = np.dot(XXinv, np.dot(X.T, y))
    print("coefficients: ", b.flatten())
    r = y - X.dot(b)
    print("RSS: ", r.T.dot(r).flatten())
reg()
reg(ay=0, aw=1)
reg(ay=1, aw=0)
reg(ay=1, aw=1)
Experiment:
Different values of a_y and a_w in x_3
y: [1. 1. 1. 1. 0.]
x1: [1. 1. 1. 1. 1.]
x2: [1. 2. 1. 0. 0.]
w: [ 1. -1. 1. -1. 0.]
a1 = 0, a2 = 2
Base case --------------------------------------------------
X^T y: [4. 4.]
coefficients: [0.57142857 0.28571429]
RSS: [0.57142857]
ay = 0, aw = 1 --------------------------------------------------
X^T y: [4. 4. 8.]
coefficients: [0.57142857 0.28571429 0. ]
RSS: [0.57142857]
ay = 1, aw = 0 --------------------------------------------------
X^T y: [ 4. 4. 12.]
coefficients: [-1.0658141e-14 -2.0000000e+00 1.0000000e+00]
RSS: [5.01715536e-28]
ay = 1, aw = 1 --------------------------------------------------
X^T y: [ 4. 4. 12.]
coefficients: [0.5 0. 0.125]
RSS: [0.5]
print("Example: \nAdding a variable that is uncorrelated with y, \nbut gives non-zero coefficient estimate.")
import numpy as np
y = np.array([[1,2,3]]).T
x0 = np.array([[1,1,1]])
x1 = np.array([[1,2,4]])
# reduced model
print("\nREDUCED MODEL " + "-"*50)
X = np.vstack((x0, x1)).T
XXinv = np.linalg.inv(np.dot(X.T, X))
b = np.dot(XXinv, np.dot(X.T, y))
print("coefficients: ", b.flatten())
r = y - X.dot(b)
print("RSS: ", r.T.dot(r).flatten())
# full model
print("\nFULL MODEl " + "-"*50)
x2 = np.array([[1,-2,1]])
print("x2 and y are uncorrelated, dot product is:", np.dot(x2, y).flatten())
X = np.vstack((x0, x1, x2)).T
XXinv = np.linalg.inv(np.dot(X.T, X))
b = np.dot(XXinv, np.dot(X.T, y))
print("coefficients: ", b.flatten())
r = y - X.dot(b)
X.dot(b)
print("RSS: ", r.T.dot(r).flatten())
Example:
Adding a variable that is uncorrelated with y,
but gives non-zero coefficient estimate.
REDUCED MODEL --------------------------------------------------
coefficients: [0.5 0.64285714]
RSS: [0.07142857]
FULL MODEl --------------------------------------------------
x2 and y are uncorrelated, dot product is: [0]
coefficients: [ 0.44444444 0.66666667 -0.11111111]
RSS: [6.27390939e-30]
\(R^2\) and Adjusted \(R^2\)¶
\(R\)-squared¶
Assuming the decomposition identity of \(TSS\) holds, we can define \(R\)-squared.
- Definition (\(R\)-squared)
\(R\)-squared is a statistical measure that represents the proportion of the variance for a dependent variable that’s explained by an independent variable or variables in a regression model.
Properties
\(R\)-squared can never decrease when an additional explanatory variable is added to the model.
As long as \(\operatorname{Cov}(Y, X_j) \ne 0\), \(X_j\) has some explanatory power for \(Y\), and thus \(RSS\) decreases; see the section on \(RSS\) above for details. As a result, \(R\)-squared is not a good measure for model selection, since maximizing it can cause overfitting.
\(R\)-squared equals the squared correlation coefficient between the actual value of the response and the fitted value \(\operatorname{Corr}\left( Y, \hat{Y} \right)^2\).
In particular, in simple linear regression, \(R^2 = \rho_{X,Y}^2\).
Proof
By the definition of correlation,
\[\begin{split}\begin{align} \operatorname{Corr}\left( y, \hat{y} \right)^2 &= \frac{\operatorname{Cov}\left( y, \hat{y} \right)^2}{\operatorname{Var}\left( y \right)\operatorname{Var}\left( \hat{y} \right)} \\ &= \frac{\operatorname{Cov}\left( \hat{y} + \hat{\varepsilon}, \hat{y} \right)^2}{\operatorname{Var}\left( y \right)\operatorname{Var}\left( \hat{y} \right)} \\ &= \frac{\operatorname{Cov}\left( \hat{y} , \hat{y} \right)^2}{\operatorname{Var}\left( y \right)\operatorname{Var}\left( \hat{y} \right)} \\ &= \frac{\operatorname{Var}\left( \hat{y} \right)^2}{\operatorname{Var}\left( y \right)\operatorname{Var}\left( \hat{y} \right)} \\ &= \frac{\operatorname{Var}\left( \hat{y} \right)}{\operatorname{Var}\left( y \right)} \\ &= \frac{SSR}{SST} \\ &= R^2 \\ \end{align}\end{split}\]The third equality holds since
\[ \operatorname{Cov}\left( \hat{\varepsilon}, \hat{y} \right) = \operatorname{Cov}\left( \hat{\varepsilon}, \sum_j x_j \hat{\beta}_j \right) = \sum_j \hat{\beta}_j \operatorname{Cov}\left( \hat{\varepsilon}, x_j \right) = 0 \]When \(p=2\), since \(\hat{y_i} = \hat{\beta}_0 + \hat{\beta}_1 x\), we have
\[ R^2 = \operatorname{Corr}\left(y, \hat{y} \right)^2 = \operatorname{Corr}\left(y, x \right)^2 \]
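A quick numerical check of the equivalence on simulated data:

import numpy as np

rng = np.random.default_rng(6)
n = 40
X = np.column_stack((np.ones(n), rng.normal(size=(n, 2))))
y = X @ np.array([1.0, 1.5, -0.5]) + rng.normal(size=n)

b = np.linalg.inv(X.T @ X) @ X.T @ y
y_hat = X @ b
rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)
print("1 - RSS/TSS:      ", (1 - rss / tss).round(6))
print("Corr(y, y_hat)^2: ", (np.corrcoef(y, y_hat)[0, 1] ** 2).round(6))   # same value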
What is \(R^2\) if there is no intercept?
When there is no intercept, then \(\bar{y} \boldsymbol{1} _n \notin \operatorname{col}(\boldsymbol{X})\) and hence \(\hat{\boldsymbol{y} } - \bar{y} \boldsymbol{1} _n \notin \operatorname{col}(\boldsymbol{X})\) in general. The decomposition identity may not hold. Thus, the value of \(R\)-squared may no longer be in \([0,1]\), and its interpretation is no longer valid. What actually happens to the value of \(R\)-squared depends on whether we define it using \(TSS\) with \(RSS\) or with \(ESS\).
If we define \(R^2 = \frac{ESS}{TSS}\), then when
we will have \(ESS > TSS\), i.e., \(R^2 > 1\).
On the other hand, if we define \(R^2 = 1 - \frac{RSS}{TSS}\), then when
we will have \(RSS > TSS\), i.e. \(R^2 < 0\).
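A small sketch of the second case: the data below are made up so that the regression through the origin fits worse than the constant \(\bar{y}\), making \(1 - RSS/TSS\) negative.

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([10.0, 9.0, 8.0])          # made-up data where the no-intercept fit is poor

# regression through the origin: y ~ b * x
b = (x @ y) / (x @ x)
rss = np.sum((y - b * x) ** 2)
tss = np.sum((y - y.mean()) ** 2)
print("slope:", b.round(4))
print("RSS:", rss.round(4), " TSS:", tss.round(4))
print("1 - RSS/TSS:", (1 - rss / tss).round(4))   # negative: the usual interpretation breaks down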
Adjusted \(R\)-squared¶
Due to the non-decreasing property of \(R\)-squared, we define the adjusted \(R\)-squared, which is a better measure of goodness of fit.
- Definition
Adjusted \(R\)-squared, denoted by \(\bar{R}^2\), is defined as
Properties
\(\bar{R}^2\) can increase or decrease. When a new variable is included, \(RSS\) decreases, but \((n-p)\) also decreases.
Relation to \(R\)-squared is
Relation to estimated variance of random error and variance of response
\[ \bar{R}^2 = 1-\frac{\hat{\sigma}^2}{\operatorname{Var}\left( y \right)} \]Can be negative when
\[ R^2 < 1 - \frac{n-p}{n-1} = \frac{p-1}{n-1} \]That is, when the fit is poor relative to the number of covariates, adjusted \(R\)-squared is negative.
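A sketch on simulated data (the extra column below is pure noise, so it adds no real explanatory power): \(R^2\) cannot decrease when it is added, while adjusted \(R^2\) applies a penalty for the extra parameter and need not increase.

import numpy as np

rng = np.random.default_rng(7)
n = 30
x1 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

def r2_and_adj(X, y):
    n, p = X.shape
    b = np.linalg.inv(X.T @ X) @ X.T @ y
    rss = np.sum((y - X @ b) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1 - rss / tss
    adj = 1 - (rss / (n - p)) / (tss / (n - 1))   # adjusted R-squared
    return r2, adj

X_small = np.column_stack((np.ones(n), x1))
X_big = np.column_stack((X_small, rng.normal(size=n)))   # add an irrelevant noise column
print("small model (R2, adj R2):", [round(v, 4) for v in r2_and_adj(X_small, y)])
print("big model   (R2, adj R2):", [round(v, 4) for v in r2_and_adj(X_big, y)])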
\(t\)-test¶
We can use \(t\)-test to conduct a hypothesis testing on \(\boldsymbol{\beta}\), which has a general form
Usually \(c=0\).
If \(c=0, \boldsymbol{v} = \boldsymbol{e} _j\), then this is equivalent to testing \(\beta_j=0\), i.e. the variable \(X_j\) has no effect on \(Y\) given all other variables.
If \(c=0, v_i=1, v_j=-1\) and \(v_k=0\) for \(k\ne i, j\), then this is equivalent to testing \(\beta_i = \beta_j\), i.e. the two variables \(X_i\) and \(X_j\) have the same effect on \(Y\) given all other variables.
The test statistic is
In particular, when \(p=2\), to test \(\beta_1 = c\), we use
Following this analysis, we can find the \((1-\alpha)\%\) confidence interval for a scalar \(\boldsymbol{v} ^\top \boldsymbol{\beta}\) as
In particular,
If \(\boldsymbol{v} = \boldsymbol{e}_j\), then this is the confidence interval for a coefficient \(\beta _j\).
If \(\boldsymbol{v} = \boldsymbol{x}_i\) where \(\boldsymbol{x}_i\) is in the data set, then this is the confidence interval for the in-sample fit of \(y_i\), i.e. for the mean value \(\operatorname{E}\left( y _i \right) = \boldsymbol{x}_i ^\top \boldsymbol{\beta}\). If \(\boldsymbol{x}\) is not in the design matrix, then we are doing out-of-sample prediction, and a prediction interval should be used instead.
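A minimal sketch of the \(t\)-test and confidence interval for a single coefficient on simulated data (SciPy is assumed to be available for the \(t\) distribution):

import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n, p = 50, 3
X = np.column_stack((np.ones(n), rng.normal(size=(n, p - 1))))
y = X @ np.array([1.0, 0.8, 0.0]) + rng.normal(size=n)   # made-up true model

XtXinv = np.linalg.inv(X.T @ X)
b = XtXinv @ X.T @ y
sigma2_hat = np.sum((y - X @ b) ** 2) / (n - p)
se = np.sqrt(sigma2_hat * np.diag(XtXinv))

j = 1                                            # test beta_1 = 0
t_stat = b[j] / se[j]
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df=n - p))
t_crit = stats.t.ppf(0.975, df=n - p)
ci = (b[j] - t_crit * se[j], b[j] + t_crit * se[j])
print("t statistic:", t_stat.round(3), " p-value:", p_value.round(4))
print("95% CI for beta_1:", np.round(ci, 3))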
Derivation
First, we need to find the distribution of \(\boldsymbol{v} ^\top \hat{\boldsymbol{\beta}}\). Recall that
Hence,
or
Also recall that the RSS has the distribution
and the two quantities \(\boldsymbol{v} ^\top \hat{\boldsymbol{\beta}}\) and \((n-p)\frac{\hat{\sigma}^2}{\sigma^2 }\) are independent. Therefore, with a standard normal and a Chi-squared that are independent, we can construct a \(t\)-test statistic
i.e.,
Note that, compared with
the RHS changes from \(N(0,1)\) to \(t_{n-p}\) because we are estimating \(\sigma\) by \(\hat{\sigma}\), similar to the case of the univariate mean test.
Permutation test
Another way to conduct a test without the normality assumption is a permutation test. For instance, to test \(\beta_2=0\), we fix \(y\) and \(x_1\), permute the \(n\) values in the column of \(X_2\), and compute the \(t\)-statistic. Repeat the permutation many times and compute the proportion of permutations for which
which is the \(p\)-value.
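A minimal sketch of this permutation test on simulated data (here the true \(\beta_2\) is \(0\), so the \(p\)-value should typically be large):

import numpy as np

rng = np.random.default_rng(9)
n = 40
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 0.5 * x1 + 0.0 * x2 + rng.normal(size=n)   # beta_2 = 0 is true here

def t_stat_x2(x2_col):
    # t-statistic for the coefficient on the (possibly permuted) x2 column
    X = np.column_stack((np.ones(n), x1, x2_col))
    XtXinv = np.linalg.inv(X.T @ X)
    b = XtXinv @ X.T @ y
    sigma2_hat = np.sum((y - X @ b) ** 2) / (n - X.shape[1])
    return b[2] / np.sqrt(sigma2_hat * XtXinv[2, 2])

t_obs = t_stat_x2(x2)
t_perm = np.array([t_stat_x2(rng.permutation(x2)) for _ in range(2000)])
p_value = np.mean(np.abs(t_perm) >= np.abs(t_obs))   # proportion at least as extreme
print("observed t:", t_obs.round(3), " permutation p-value:", p_value.round(4))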
Social science’s trick to test \(\beta_1 = \beta_2\)
Some social science courses introduce a trick to test \(\beta_1 = \beta_2\) by rearranging the explanatory variables. For instance, if our model is,
Then they define \(\gamma = \beta_1 - \beta_2\) and \(x_{i3} = x_{i1} + x_{i2}\), rearrange RHS,
Finally, they run the regression of the last line and check the \(p\)-value of \(\gamma\). Other parts of the model (\(R\)-squared, \(p\)-value of \(\beta_0\), etc) remain the same.
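A minimal sketch of the trick on simulated data (here the true coefficients satisfy \(\beta_1 = \beta_2\); SciPy is assumed for the \(t\) distribution):

import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
n = 60
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 2.0 * x2 + rng.normal(size=n)   # beta_1 = beta_2 = 2

# reparameterized regression: y ~ 1 + x1 + (x1 + x2); the coefficient on x1 is gamma = beta_1 - beta_2
X = np.column_stack((np.ones(n), x1, x1 + x2))
XtXinv = np.linalg.inv(X.T @ X)
b = XtXinv @ X.T @ y
sigma2_hat = np.sum((y - X @ b) ** 2) / (n - X.shape[1])
se = np.sqrt(sigma2_hat * np.diag(XtXinv))
t_gamma = b[1] / se[1]
p_value = 2 * (1 - stats.t.cdf(abs(t_gamma), df=n - X.shape[1]))
print("gamma_hat:", b[1].round(3), " t:", t_gamma.round(3), " p-value:", p_value.round(4))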
\(F\)-test¶
Compare Two Nested Models¶
\(F\)-test can be used to compare two nested models
The test statistic is
which can also be computed from \(R^2\), since \(TSS\) is the same for the two models
In particular,
When \(k=p-1\), we are comparing a full model vs. intercept only, i.e.,
\[ \beta_1 = \ldots = \beta_{p-1} = 0 \]In this case,
\[ RSS_{\text{reduced}} = \left\| \boldsymbol{y} - \bar{y} \boldsymbol{1} _n \right\| ^2 = TSS \]and
\[ F = \frac{(TSS - RSS_{\text{full}})/(p-1)}{RSS_{\text{full}}/(n-p)} = \frac{R^2 _{\text{full}}/(p-1)}{(1 - R^2 _{\text{full}})/(n-p)} \]When \(k=1\), we are testing \(\beta_{p-1} = 0\). In this case, the \(F\)-test is equivalent to the \(t\)-test; the two test statistics satisfy \(F_{1, n-p}=t^2_{n-p}\). Note that the alternative hypothesis in both tests is two-sided (\(\beta_{p-1} \ne 0\)); the rejection region of the \(t\)-test is two-sided, while that of the \(F\)-test is one-sided (upper tail).
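A minimal sketch of the nested-model \(F\)-test on simulated data (SciPy is assumed for the \(F\) distribution; the reduced model here drops the last two covariates, i.e. it tests \(\beta_2 = \beta_3 = 0\)):

import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
n = 80
X_full = np.column_stack((np.ones(n), rng.normal(size=(n, 3))))
y = X_full @ np.array([1.0, 1.0, 0.3, 0.0]) + rng.normal(size=n)   # made-up true model

def rss(X, y):
    b = np.linalg.inv(X.T @ X) @ X.T @ y
    return np.sum((y - X @ b) ** 2)

X_red = X_full[:, :2]                       # reduced model: intercept + first covariate
k = X_full.shape[1] - X_red.shape[1]        # number of restrictions
p = X_full.shape[1]
F = ((rss(X_red, y) - rss(X_full, y)) / k) / (rss(X_full, y) / (n - p))
p_value = 1 - stats.f.cdf(F, k, n - p)
print("F statistic:", F.round(3), " p-value:", p_value.round(4))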
Derivation
We need to find the distribution of \(RSS_{\text{reduced} }\) and \(RSS_{\text{full}}\) and then construct a pivot quantity.
Let \(\boldsymbol{U}\) be an orthogonal basis of \(\mathbb{R} ^n\) with three orthogonal parts
Then
Thus, we have pairwise independence among \(\boldsymbol{U}_1 ^\top \boldsymbol{y} , \boldsymbol{U} _2 ^\top \boldsymbol{y}\) and \(\boldsymbol{U} _3 ^\top \boldsymbol{y}\).
Moreover, by the property of multivariate normal, we have
The RSSs have the relations
Hence
Therefore, we have the pivot quantity
Caveats
An \(F\)-test of \(\beta_1=\beta_2=0\) is different from two individual univariate \(t\)-tests of \(\beta_1=0\) and \(\beta_2=0\). A group of \(t\)-tests may be misleading if the regressors are highly correlated.
When there are multiple covariates, we can conduct sequential \(F\)-tests by using the anova() function in R. The \(F\)-test in each row compares the existing model with the model plus one additional covariate. Hence, the order of the input covariates matters: exchanging the order of terms usually changes the sums of squares and the \(F\)-test \(p\)-values.
Compare Two Groups of Data¶
If the data set contains data from two groups, then the regression line can be different. For instance,
In time series analysis, there may be a structural break at a period which can be assumed to be known a priori (for instance, a major historical event such as a war).
In program evaluation, the explanatory variables may have different impacts on different subgroups of the population.
To test whether the regression function is different for one group \((i=1,\ldots,m)\) versus another \((i=m+1,\ldots,n)\), we can use an \(F\)-test. The full model is
where dummy variable \(d_i = 1\) if \(i=m+1, \ldots, n\) and \(0\) otherwise.
The null and alternative hypotheses are
The \(F\)-test statistic is
Equivalently, there is a method called the Chow test, which uses the fact that
where \(RSS_1\) and \(RSS_2\) are the RSS obtained when we run the following regression model on two groups of data respectively,
Hence, there is no need to create dummy variables \(d_i\). There are only three steps:
Run the regression with group 1’s data to obtain \(RSS_1\)
Run the regression with group 2’s data to obtain \(RSS_2\)
Run the regression with all data to obtain \(RSS_{\text{reduced}}\)
And then use the \(F\)-test to test the hypothesis.
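A minimal sketch of these three steps on simulated two-group data (SciPy assumed for the \(F\) distribution; group sizes and coefficients below are made up):

import numpy as np
from scipy import stats

rng = np.random.default_rng(12)
m, n2 = 50, 50                       # group sizes
x1g = rng.normal(size=m);  y1g = 1.0 + 1.0 * x1g + rng.normal(size=m)     # group 1
x2g = rng.normal(size=n2); y2g = 2.0 + 0.5 * x2g + rng.normal(size=n2)    # group 2, different line

def rss(x, y):
    X = np.column_stack((np.ones(len(x)), x))
    b = np.linalg.inv(X.T @ X) @ X.T @ y
    return np.sum((y - X @ b) ** 2)

rss1 = rss(x1g, y1g)                                # step 1: group 1 only
rss2 = rss(x2g, y2g)                                # step 2: group 2 only
rss_reduced = rss(np.concatenate((x1g, x2g)),       # step 3: pooled data, common coefficients
                  np.concatenate((y1g, y2g)))

k = 2                                               # restrictions: same intercept and slope
n = m + n2
p_full = 4                                          # full model: 2 coefficients per group
F = ((rss_reduced - (rss1 + rss2)) / k) / ((rss1 + rss2) / (n - p_full))
p_value = 1 - stats.f.cdf(F, k, n - p_full)
print("Chow F statistic:", F.round(3), " p-value:", p_value.round(4))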
Confidence Region for \(\boldsymbol{\beta}\)¶
If we want to draw conclusions about multiple coefficients \(\beta_1, \beta_2, \ldots\) simultaneously, we need a confidence region, and we must consider the multiple testing issue.
To find a \((1-\alpha)\%\) confidence region for \(\boldsymbol{\beta}\), one attempt is to use a cuboid, whose \(j\)-th side is the \((1-\alpha/p)\%\) confidence interval for \(\beta_j\). Namely, the confidence region is
In this way, we ensure the overall confidence of the confidence region is at least \((1-\alpha)\%\).
A more natural approach is using an ellipsoid. Recall that
Hence a pivot quantity for \(\boldsymbol{\beta}\) can be constructed as follows,
Therefore, we can obtain an \((1-\alpha)\%\) confidence region for \(\boldsymbol{\beta}\) from this distribution
which is an ellipsoid centered at \(\hat{\boldsymbol{\beta}}\), whose axes are determined by the eigenvectors of \(\boldsymbol{X} ^\top \boldsymbol{X}\) and whose size is scaled by \(p \hat{\sigma}^2\) and the \(F\) quantile.
In general, for matrix \(\boldsymbol{A} \in \mathbb{R} ^{p \times k}, \operatorname{rank}\left( \boldsymbol{A} \right) = k\), the confidence region for \(\boldsymbol{A} ^\top \boldsymbol{\beta}\) can be found in a similar way
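A sketch of using the ellipsoidal region in practice on simulated data (SciPy assumed for the \(F\) quantile): check whether a candidate vector \(\boldsymbol{\beta}_0\) falls inside the \(95\%\) region.

import numpy as np
from scipy import stats

rng = np.random.default_rng(13)
n, p = 60, 3
X = np.column_stack((np.ones(n), rng.normal(size=(n, p - 1))))
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(size=n)

b = np.linalg.inv(X.T @ X) @ X.T @ y
sigma2_hat = np.sum((y - X @ b) ** 2) / (n - p)

def in_region(beta0, alpha=0.05):
    # check (b - beta0)^T X^T X (b - beta0) / (p * sigma2_hat) <= F_{p, n-p}^{(1-alpha)}
    stat = (b - beta0) @ (X.T @ X) @ (b - beta0) / (p * sigma2_hat)
    return stat <= stats.f.ppf(1 - alpha, p, n - p)

print("true beta inside 95% region:   ", in_region(beta_true))
print("shifted beta inside 95% region:", in_region(beta_true + 2.0))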
Prediction Interval for \(y_{new}\)¶
For a new \(\boldsymbol{x}\), the new response is
where \(\boldsymbol{\varepsilon} _{new} \perp\!\!\!\perp \hat{\boldsymbol{\beta}} , \hat{\sigma}\), since the latter are computed from the training set.
The prediction is
Thus, the prediction error is
Hence, the \((1-\alpha)\%\) prediction interval for a new response value \(\boldsymbol{y} _{new}\) at an out-of-sample \(\boldsymbol{x}\) is
There is an additional “\(1\)” under the square root, compared to the confidence interval for an in-sample fit,
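A sketch computing both intervals at an out-of-sample point on simulated data (SciPy assumed for the \(t\) quantile); the only difference is the extra “\(1\)” under the square root.

import numpy as np
from scipy import stats

rng = np.random.default_rng(14)
n, p = 50, 2
x = rng.normal(size=n)
X = np.column_stack((np.ones(n), x))
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)   # made-up true model

XtXinv = np.linalg.inv(X.T @ X)
b = XtXinv @ X.T @ y
sigma_hat = np.sqrt(np.sum((y - X @ b) ** 2) / (n - p))
t_crit = stats.t.ppf(0.975, df=n - p)

x_new = np.array([1.0, 1.5])                     # out-of-sample point (intercept, x value)
fit = x_new @ b
q = x_new @ XtXinv @ x_new
ci = (fit - t_crit * sigma_hat * np.sqrt(q),     fit + t_crit * sigma_hat * np.sqrt(q))
pi = (fit - t_crit * sigma_hat * np.sqrt(1 + q), fit + t_crit * sigma_hat * np.sqrt(1 + q))
print("fitted mean:", fit.round(3))
print("95% CI for the mean:      ", np.round(ci, 3))
print("95% PI for a new response:", np.round(pi, 3))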
Width of an interval
When we build a confidence interval for \(\boldsymbol{y} _i\) or a prediction interval for \(\boldsymbol{y} _{new}\), the width depends on the sample size \(n\) and the choice of \(\boldsymbol{x}\).
As \(n \rightarrow \infty\), we have \(\boldsymbol{a} ^\top (\boldsymbol{X} ^\top \boldsymbol{X} ) ^{-1} \boldsymbol{a}\rightarrow 0\) for all \(\boldsymbol{a}\), hence the
CI for \(\boldsymbol{y} _i\): \(\operatorname{se} \rightarrow 0, \operatorname{width} \rightarrow 0\)
PI for \(\boldsymbol{y} _{new}\): \(\operatorname{se} \rightarrow \hat{\sigma}, \operatorname{width} \rightarrow 2 \times t_{n-p}^{(1-\alpha/2)} \hat{\sigma}\)
The width also depends on the choice of \(\boldsymbol{x}\).
If \(\boldsymbol{x}\) is aligned with an eigenvector of \(\boldsymbol{X} ^\top \boldsymbol{X}\) with a large eigenvalue, then \(\boldsymbol{x} ^\top (\boldsymbol{X} ^\top \boldsymbol{X} )^{-1} \boldsymbol{x}\) is small. Such an eigenvector indicates a direction of large variation in the data set, and hence more distinguishability.
If \(\boldsymbol{x}\) is aligned with an eigenvector of \(\boldsymbol{X} ^\top \boldsymbol{X}\) with a small eigenvalue, then \(\boldsymbol{x} ^\top (\boldsymbol{X} ^\top \boldsymbol{X} )^{-1} \boldsymbol{x}\) is large. Such a direction has little variation in the data set, and hence less distinguishability and more uncertainty.