Linear Models - Estimation

Starting from this section, we introduce linear models from a statistical perspective. There are three sections in total.

  1. The first section covers the model fundamentals, including assumptions, estimation, interpretation, and some exercises.

  2. The next section introduces statistical inference for linear models, such as the distribution of the estimated coefficients, \(t\)-tests, \(F\)-tests, etc. The machine learning community usually focuses more on prediction and less on inference, but inference does matter: it analyzes, in a rigorous way, how important each variable is in the model.

  3. The third section introduces some issues in linear models, e.g. omitted-variable bias, multicollinearity, and heteroskedasticity, as well as some alternative models, e.g. Lasso and ridge regression.

Objective

Linear models aim to model the relationship between a scalar response and one or more explanatory variables in a linear format:

\[Y_i = \beta_0 + \beta_1 x_{i,1} + \ldots + \beta_{p-1} x_{i,p-1} + \varepsilon_i \]

for observations \(i=1, 2, \ldots, n\).

In matrix form,

\[ \boldsymbol{y} = \boldsymbol{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}. \]

where

  • \(\boldsymbol{X}_{n\times p}\) is called the design matrix. The first column is usually set to \(\boldsymbol{1}\), i.e., the intercept. The remaining \(p-1\) columns contain the values \(x_{ij}\) where \(i = 1, 2, \ldots, n\) and \(j=1, \ldots, p-1\). These \(p-1\) columns are called explanatory/independent variables, or covariates.

  • \(\boldsymbol{y}_{n \times 1}\) is a vector of response/dependent variables \(Y_1, Y_2, \ldots, Y_n\).

  • \(\boldsymbol{\beta}_{p \times 1}\) is a vector of coefficients to be estimated.

  • \(\boldsymbol{\varepsilon}_{n \times 1}\) is a vector of unobserved random errors, which includes everything that we have not measured and included in the model.

When \(p=2\), we have

\[ Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \]

which is called simple linear regression.

When \(p>2\), it is called multiple linear regression. For instance, when \(p=3\)

\[ Y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \varepsilon_i \]

When there are multiple dependent variables, we call it multivariate regression, which will be introduced in another section.

When \(p=1\),

  • if we include an intercept, then the regression model \(y_i = \beta_0 + \varepsilon_i\) means that we use a single constant to predict \(y_i\). The ordinary least squares estimator \(\hat{\beta}_0\) is the sample mean \(\bar{y}\).

  • if we do not include an intercept, then the regression model \(y_i = \beta x_i + \varepsilon_i\) means that we expect \(y\) to be proportional to \(x\).
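
To make the setup concrete, here is a minimal numpy sketch (all numbers are made up) that builds a design matrix with an intercept column and generates responses from the model:

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 100, 3                        # n observations, p columns (intercept + 2 covariates)
X = np.column_stack([np.ones(n),     # first column of 1's: the intercept
                     rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, -0.5])    # hypothetical true coefficients
eps = rng.normal(scale=1.0, size=n)  # unobserved errors

y = X @ beta + eps                   # y = X beta + epsilon
print(X.shape, y.shape)              # (100, 3) (100,)
```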

Assumptions

Basic assumptions

  1. \(\operatorname{E}\left( y_i \right) = \boldsymbol{x}_i ^\top \boldsymbol{\beta}\) is linear in covariates \(X_j\).

  2. The values of explanatory variables \(\boldsymbol{x}_i\) are known and fixed. Randomness only comes from \(\varepsilon_i\).

  3. No \(X_j\) is constant for all observations. No exact linear relationships among the explanatory variables (aka no perfect multicollinearity, or the design matrix \(\boldsymbol{X}\) is of full rank).

  4. The error terms are uncorrelated \(\operatorname{Cov}\left( \varepsilon_i, \varepsilon_j \right)= 0\), with common mean \(\operatorname{E}\left( \varepsilon_i \right) = 0\) and variance \(\operatorname{Var}\left( \varepsilon_i \right) = \sigma^2\) (homoskedasticity).

    As a result, \(\operatorname{E}\left( \boldsymbol{y} \mid \boldsymbol{X} \right) = \boldsymbol{X} \boldsymbol{\beta}\), or \(\operatorname{E}\left( y_i \mid x_i \right) = \beta_0 + \beta_1 x_i\) when \(p=2\), which can be illustrated by the plots below.

    Fig. 58 Distributions of \(y\) given \(x\) [Meyer 2021]

    Fig. 59 Observations of \(y\) given \(x\) [Meyer 2021]

    To predict \(\hat{y}_i\), we just use \(\hat{y}_i = \boldsymbol{x}_i ^\top \hat{\boldsymbol{\beta}}\) .

  5. The error terms are independent and follow Gaussian distribution \(\varepsilon_i \overset{\text{iid}}{\sim}N(0, \sigma^2)\), or \(\boldsymbol{\varepsilon} \sim N_n (\boldsymbol{0} , \sigma^2 \boldsymbol{I} _n)\).

    As a result, we have \(Y_i \sim N(\boldsymbol{x}_i ^\top \boldsymbol{\beta} , \sigma^2 )\) or \(\boldsymbol{y} \sim N_n(\boldsymbol{X} \boldsymbol{\beta} , \sigma^2 \boldsymbol{I} _n)\)

These assumptions serve different objectives. The first 3 assumptions are the base, and in addition to them,

  • derivation of \(\hat{\boldsymbol{\beta}}\) by least squares uses no more assumptions.

  • derivation of \(\hat{\boldsymbol{\beta}}\) by maximum likelihood uses assumptions 4 and 5.

  • derivation of \(\operatorname{E}\left( \hat{\boldsymbol{\beta}} \right)\) uses \(\operatorname{E}\left( \varepsilon_i \right) = 0\) in 4.

  • derivation of \(\operatorname{Var}\left( \hat{\boldsymbol{\beta}} \right)\) uses 1, 2, \(\operatorname{Cov}\left( \varepsilon_i, \varepsilon_j \right) = 0\) and \(\operatorname{Var}\left( \varepsilon_i \right) = \sigma^2\) in 4.

  • proof of the Gauss-Markov Theorem (BLUE) uses 4.

  • derivation of the distribution of \(\hat{\boldsymbol{\beta} }\) uses 4 and 5.

Estimation

We introduce various methods to estimate the parameters \(\boldsymbol{\beta}\) and \(\sigma^2\).

Ordinary Least Squares

The most common way is to estimate the parameter \(\hat{\boldsymbol{\beta}}\) by minimizing the sum of squared errors \(\sum_i(y_i-\hat{y}_i)^2\).

\[\begin{split}\begin{align} \hat{\boldsymbol{\beta}} &= \underset{\boldsymbol{\beta} }{\mathrm{argmin}} \, \left\Vert \boldsymbol{y} - \hat{\boldsymbol{y}} \right\Vert ^2 \\ &= \underset{\boldsymbol{\beta} }{\mathrm{argmin}} \, \left\Vert \boldsymbol{y} - \boldsymbol{X} \boldsymbol{\beta} \right\Vert ^2 \\ \end{align}\end{split}\]

The gradient w.r.t. \(\boldsymbol{\beta}\) is

\[\begin{split}\begin{align} \nabla_{\boldsymbol{\beta}} &= -2 \boldsymbol{X} ^\top (\boldsymbol{y} - \boldsymbol{X} \boldsymbol{\beta} ) \\ &\overset{\text{set}}{=} \boldsymbol{0} \end{align}\end{split}\]

Hence, we have

\[ \boldsymbol{X} ^\top \boldsymbol{X} \boldsymbol{\beta} = \boldsymbol{X} ^\top \boldsymbol{y} \]

This linear system is called the normal equation.

The closed form solution is

\[\hat{\boldsymbol{\beta}} = \left( \boldsymbol{X} ^\top \boldsymbol{X} \right)^{-1}\boldsymbol{X} ^\top \boldsymbol{y} \]

Note that \(\hat{\boldsymbol{\beta}}=(\boldsymbol{X} ^\top \boldsymbol{X} ) ^{-1} \boldsymbol{X}^\top \boldsymbol{y}\) is a random vector, since it is a linear combination of the random vector \(\boldsymbol{y}\). This means that if we keep \(\boldsymbol{X}\) fixed and repeat the experiment, we will probably get different response values \(\boldsymbol{y}\), and hence a different \(\hat{\boldsymbol{\beta}}\). As a result, \(\hat{\boldsymbol{\beta}}\) has a sampling distribution, and we can find its mean and variance and conduct hypothesis tests.

An unbiased estimator of the error variance \(\sigma^2 = \operatorname{Var}\left( \varepsilon \right)\) is (to be discussed [later])

\[ \hat{\sigma}^2 = \frac{\left\Vert \boldsymbol{y} - \boldsymbol{X} \hat{\boldsymbol{\beta}} \right\Vert ^2}{n-p} \]
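
As a sketch, the closed-form solution and the variance estimate can be computed with numpy via the normal equations (the function name `ols` and the example data are just for illustration):

```python
import numpy as np

def ols(X, y):
    """OLS via the normal equations X'X beta = X'y."""
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # avoids forming the inverse explicitly
    resid = y - X @ beta_hat
    n, p = X.shape
    sigma2_hat = resid @ resid / (n - p)           # unbiased estimator of sigma^2
    return beta_hat, sigma2_hat

# e.g. with the simulated X, y from the earlier sketch:
# beta_hat, sigma2_hat = ols(X, y)
```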

When \(p=2\), we have

\[\hat{\beta}_0, \hat{\beta}_1 = \underset{\beta_0, \beta_1 }{\mathrm{argmin}} \, \sum_i \left( y_i - \beta_0 - \beta_1 x_i \right)^2\]

Differentiation w.r.t. \(\beta_1\) gives

\[ - 2\sum_i (y_i - \beta_0 - \beta_1 x_i) x_i = 0 \]

Differentiation w.r.t. \(\beta_0\) gives

\[ - 2\sum_i (y_i - \beta_0 - \beta_1 x_i) = 0 \]

Solving this system of equations, we obtain

\[\begin{split}\begin{align} \hat{\beta}_{1} &=\frac{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}} \\ \hat{\beta}_{0} &=\bar{y}-\hat{\beta}_{1} \bar{x} \end{align}\end{split}\]

The expression for \(\hat{\beta}_0\) implies that the fitted line passes through the sample mean point \((\bar{x}, \bar{y})\).

Moreover,

\[ \hat{\sigma}^2 = \frac{1}{n-2} \sum_i \hat\varepsilon_i^2 \]

where \(\hat\varepsilon_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\).
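
These scalar formulas can be checked numerically; a small sketch with made-up data, comparing against numpy's polyfit:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 1.0 + 2.0 * x + rng.normal(size=50)            # hypothetical SLR data

beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()

resid = y - beta0_hat - beta1_hat * x
sigma2_hat = np.sum(resid ** 2) / (len(x) - 2)     # n - 2 degrees of freedom

# np.polyfit(x, y, 1) returns [slope, intercept]; it should match the formulas above
print(beta1_hat, beta0_hat, sigma2_hat, np.polyfit(x, y, 1))
```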

Minimizing mean squared error

The objective function, sum of squared errors,

\[ \left\Vert \boldsymbol{y} - \boldsymbol{X} \boldsymbol{\beta} \right\Vert ^2 = \sum_i \left( y_i - \boldsymbol{x}_i ^\top \boldsymbol{\beta} \right)^2 \]

can be replaced by mean squared error,

\[ \frac{1}{n} \sum_i \left( y_i - \boldsymbol{x}_i ^\top \boldsymbol{\beta} \right)^2 \]

and the results are the same.

By Assumptions

In some social science courses, the estimation is done by using the assumptions

  • \(\operatorname{E}\left( \varepsilon \right) = 0\)

  • \(\operatorname{E}\left( \varepsilon \mid X \right) = 0\)

The first one gives

\[ \frac{1}{n} \sum_{i=1}^{n}\left(y_{i}-\hat{\beta}_{0}-\hat{\beta}_{1} x_{i}\right)=0 \]

The second one gives

\[\begin{split}\begin{align} \operatorname{Cov}\left( X, \varepsilon \right) &= \operatorname{E}\left( X \varepsilon \right) - \operatorname{E}\left( X \right) \operatorname{E}\left( \varepsilon \right) \\ &= \operatorname{E}\left[ \operatorname{E}\left( X \varepsilon \mid X \right) \right] - \operatorname{E}\left( X \right)\operatorname{E}\left[ \operatorname{E}\left( \varepsilon \mid X\right) \right]\\ &= \operatorname{E}\left[ X \operatorname{E}\left( \varepsilon \mid X \right) \right] - \operatorname{E}\left( X \right)\operatorname{E}\left[ \operatorname{E}\left( \varepsilon \mid X\right) \right]\\ &= 0 \end{align}\end{split}\]

which gives

\[ \frac{1}{n} \sum_{i=1}^{n} x_{i}\left(y_{i}-\hat{\beta}_{0}-\hat{\beta}_{1} x_{i}\right)=0 \]

Therefore, we have the same normal equations to solve for \(\hat{\beta}_0\) and \(\hat{\beta}_1\).

Warning

Estimation by the two assumptions derived from the zero conditional mean assumption can be problematic. Consider a model without intercept \(y_i = \beta x_i + \varepsilon_i\). Fitting by OLS, we have only ONE first order condition

\[ \sum_{i=1}^{n} x_{i}\left(y_{i}-\hat{\beta}_{1} x_{i}\right)=0 \]

If we fit by assumptions, then in addition to the condition above, the first assumption \(\operatorname{E}\left( \varepsilon \right) = 0\) also gives

\[ \sum_{i=1}^{n}\left(y_{i}-\hat{\beta}_{1} x_{i}\right)=0 \]

These two conditions may not hold at the same time.
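
A quick numerical illustration of this point, with arbitrary simulated data: the OLS first-order condition holds by construction, while the extra condition implied by \(\operatorname{E}\left( \varepsilon \right) = 0\) generally does not.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(1.0, 3.0, size=30)
y = 0.5 + 2.0 * x + rng.normal(size=30)       # data generated WITH an intercept

# OLS for the no-intercept model y = beta * x: only one first-order condition
beta_tilde = np.sum(x * y) / np.sum(x ** 2)
resid = y - beta_tilde * x

print(np.sum(x * resid))   # ~ 0 by construction (the OLS first-order condition)
print(np.sum(resid))       # generally NOT ~ 0: the extra condition need not hold
```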

Maximum Likelihood

Under assumption 5, maximizing the Gaussian likelihood gives the same \(\hat{\boldsymbol{\beta}}\) as OLS, while the maximum likelihood estimator of \(\sigma^2\), namely \(\left\Vert \boldsymbol{y} - \boldsymbol{X} \hat{\boldsymbol{\beta}} \right\Vert ^2 / n\), is biased downward. Details TBD.
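
A small simulation sketch (sample size and true parameters are arbitrary) illustrating the downward bias of the ML variance estimate relative to the \(n-p\) version:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma2 = 30, 3, 4.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, -0.5])

mle, unbiased = [], []
for _ in range(5000):
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    rss = np.sum((y - X @ beta_hat) ** 2)
    mle.append(rss / n)              # ML estimate of sigma^2
    unbiased.append(rss / (n - p))   # divides by n - p instead

# the first average sits near (n - p) / n * sigma2, the second near sigma2
print(np.mean(mle), np.mean(unbiased), sigma2)
```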

Gradient Descent

TBD.
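
While this subsection is TBD, a minimal sketch of gradient descent on the least-squares objective derived above might look as follows (the step size and iteration count are arbitrary choices, not recommendations):

```python
import numpy as np

def ols_gd(X, y, lr=0.1, n_iter=5000):
    """Minimize ||y - X beta||^2 by gradient descent."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = -2 * X.T @ (y - X @ beta) / n   # gradient of the mean squared error
        beta -= lr * grad
    return beta

# With a well-conditioned X and a suitable step size, this approaches
# the closed-form solution (X'X)^{-1} X'y.
```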

Interpretation

Value of Estimated Coefficients

  1. For a model in a standard form,

    • \(\beta_j\) is the expected change in the value of the response variable \(y\) if the value of the covariate \(x_j\) increases by 1, holding other covariates fixed, aka ceteris paribus.

      Warning

      Sometimes the other covariates are unlikely to stay fixed as we increase \(x_j\), and in these cases the ceteris paribus interpretation is not appropriate. So for interpretation purposes, don’t include

      • multiple measures of the same economic concept,

      • intermediate outcomes or alternative forms of the dependent variable

    • \(\beta_0\) is the expected value of the response variable \(y\) if all covariates have values of zero.

  2. If the covariate is an interaction term, for instance,

    \[ Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2 + \varepsilon \]

    Then \((\hat{\beta}_1 + \hat{\beta}_{12}x_2)\) can be interpreted as the estimated effect of a one-unit change in \(x_1\) on \(Y\) given a fixed value of \(x_2\). Usually \(x_2\) is a dummy variable, say gender.

    In short, we build this model because we believe the effect of \(x_1\) on \(Y\) depends on \(x_2\).

    Higher-order interactions are usually added only after the lower-order terms are included.

  3. For polynomial covariates, say,

    \[ Y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \beta_{3} x_1^3 + \varepsilon \]

    the interpretation of the marginal effect of \(x_1\) is simply the partial derivative \(\beta_1 + 2\beta_2 x_1 + 3 \beta_3 x_1^2\). We build such model because the plot suggests a non-linear relation between \(Y\) and \(x_1\), or we believe the effect of \(x_1\) on \(Y\) depends on the value of \(x_1\). Note in this case the effect can change sign.

  4. For a model that involves logs, we need some approximations (a quick numerical check of these follows the list).

    • \(Y = \beta_0 + \beta_1 \ln(x) + \mu\)

      \[\begin{split}\begin{aligned} \ln(1+0.01)&\approx 0.01\\ \Rightarrow \quad \ln(x + 0.01x) &\approx \ln(x) + 0.01 \quad \forall x\\ \end{aligned}\end{split}\]

      Hence, \(0.01x\) change in \(x\), or \(1\%\) change in \(x\), is associated with \(0.01\beta_1\) change in value of \(Y\).

    • \(\ln(Y) = \beta_0 + \beta_1 x + \mu\)

      \[\begin{split}\begin{aligned} Y ^\prime &= \exp(\beta_0 + \beta_1 (x + 1) + \varepsilon) \\ &= \exp(\beta_0 + \beta_1 x + \varepsilon) \exp(\beta_1) \\ &= Y \exp(\beta_1) \\ &\approx Y (1 + \beta_1) \quad \text{if $\beta_1$ is close to 0} \\ \end{aligned}\end{split}\]

      Hence, \(1\) unit change in \(x\) is associated with \(100\beta_1 \%\) change in \(Y\).

    • \(\ln(Y) = \beta_0 + \beta_1 \ln(x) + \mu\)

      \[\begin{split}\begin{aligned} Y ^\prime &= \exp(\beta_0 + \beta_1 \ln(x + 0.01x) + \varepsilon) \\ &\approx \exp(\beta_0 + \beta_1 (\ln(x) + 0.01) + \varepsilon) \\ &= \exp(\beta_0 + \beta_1 \ln(x) + \varepsilon)\exp(0.01\beta_1) \\ &\approx Y (1 + 0.01\beta_1) \quad \text{if $0.01\beta_1$ is close to 0} \\ \end{aligned}\end{split}\]

      Hence, \(1\%\) change in \(x\) is associated with \(\beta_1 \%\) change in \(Y\), i.e. \(\beta_1\) measures elasticity.

      When to use log?

      Log is often used

      • when the variable has a right skewed distribution, e.g. wages, prices

      • to reduce heteroskedasticity

      not used when

      • the variable has negative values

      • the variable is in percentages or proportions (hard to interpret)

      Also note that logging can change significance tests.
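
As referenced above, here is a quick numerical check of the log-log (elasticity) approximation, with arbitrary numbers and the error term set to zero:

```python
import numpy as np

beta0, beta1 = 1.0, 0.05
x = 2.0

# log-log model: increase x by 1%
y_old = np.exp(beta0 + beta1 * np.log(x))
y_new = np.exp(beta0 + beta1 * np.log(1.01 * x))

print((y_new / y_old - 1) * 100)   # about 0.05, i.e. roughly a beta1 percent change in Y
```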

Warning

Linear regression models only reveal linear associations between the response variable and the independent variables, and association does not imply causation. A simple example: in SLR, if we regress \(X\) on \(Y\) instead, the coefficient has the same sign and significance, but the causal direction cannot simply be reversed.

Only when the data come from a randomized controlled trial does correlation imply causation.

We can test whether a coefficient is statistically significant with a \(t\)-test.

\(R\)-squared

We will introduce \(R\)-squared in detail in the next section.

Definition (\(R\)-squared)

\(R\)-squared is a statistical measure that represents the proportion of the variance for a dependent variable that’s explained by an independent variable or variables in a regression model.

\[ R^2 = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2} \]
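
A minimal numpy sketch of this formula (note that equating it with the usual \(1 - \mathrm{RSS}/\mathrm{TSS}\) form relies on the fitted model containing an intercept):

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = explained variation / total variation."""
    ess = np.sum((y_hat - np.mean(y)) ** 2)   # explained sum of squares
    tss = np.sum((y - np.mean(y)) ** 2)       # total sum of squares
    return ess / tss
```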

Partialling Out Explanation for MLR

We can interpret the coefficients in multiple linear regression from “partialling out” perspective.

When \(p=3\), i.e.,

\[ \hat{y}=\hat{\beta}_{0}+\hat{\beta}_{1} x_{1}+\hat{\beta}_{2} x_{2} \]

We can obtain \(\hat{\beta}_1\) by the following three steps

  1. regress \(x_1\) on \(x_2\) and obtain

    \[\hat{x}_{1}=\hat{\gamma}_{0}+\hat{\gamma}_{1} x_{2}\]
  2. compute the residuals \(\hat{u}_{i}\) in the above regression

    \[ \hat{u}_{i} = x_{1i} - \hat{x}_{1i} \]
  3. regress \(y\) on the residuals \(\hat{u}_{i}\), and the estimated coefficient equals the required coefficient.

    \[\begin{split}\begin{align} \text{Regress}\quad y_i &\sim \alpha_{0}+\alpha_{1} \hat{u}_i \\ \text{Obtain}\quad\hat{\alpha}_{1} &= \frac{\sum (\hat{u}_i - \bar{\hat{u}}_i)(y_i - \bar{y})}{\sum (\hat{u}_i - \bar{\hat{u}}_i)^2} \\ &= \frac{\sum \hat{u}_{i}y_i}{\sum \hat{u}_{i}^2} \qquad \because \bar{\hat{u}}_i = 0\\ &\overset{\text{claimed}}{=} \hat{\beta}_1 \end{align}\end{split}\]

In this approach, \(\hat{u}\) is interpreted as the part in \(x_1\) that cannot be predicted by \(x_2\), or is uncorrelated with \(x_2\). We then regress \(y\) on \(\hat{u}\), to get the effect of \(x_1\) on \(y\) after \(x_2\) has been “partialled out”.

It can be proved that the above method holds for any \(p\).
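
A numerical check of this partialling-out claim, with arbitrary simulated data:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
x2 = rng.normal(size=n)
x1 = 0.6 * x2 + rng.normal(size=n)                 # x1 correlated with x2
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)

# Full regression y ~ 1 + x1 + x2
X = np.column_stack([np.ones(n), x1, x2])
beta_full = np.linalg.solve(X.T @ X, X.T @ y)

# Steps 1-2: regress x1 on x2 (with intercept) and take residuals
Z = np.column_stack([np.ones(n), x2])
gamma = np.linalg.solve(Z.T @ Z, Z.T @ x1)
u = x1 - Z @ gamma

# Step 3: regress y on the residuals u
alpha1 = np.sum(u * y) / np.sum(u ** 2)

print(beta_full[1], alpha1)    # the two estimates of the x1 coefficient agree
```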

Exercise

SLR stands for simple linear regression \(y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \)

  1. In SLR, can you compute \(\hat{\beta}_1\) from correlation \(r_{X,Y}\) and standard deviations \(s_X\) and \(s_Y\)?

  2. In SLR, can you compute \(\bar{y}\) given \(\hat{\beta}_0,\hat{\beta}_1\) and \(\bar{x}\)?

  3. What if the mean of the error term is not zero? Can you write down an equivalent model?

  4. Assume the intercept \(\beta_0\) in the model \(y=\beta_0 + \beta_1 x + \varepsilon\) is zero. Find the OLS estimate for \(\beta_1\), denoted \(\tilde{\beta}\). Find its mean, variance, and compare them with those of the OLS estimate for \(\beta_1\) when there is an intercept term.

  5. What happens to \(\beta\), its standard error, and its \(p\)-value if we scale the \(j\)-th covariate \(x_j\), or add a constant to \(x_j\)? How about if we change \(Y\)?

  6. True or False: In SLR, if we exchange \(X\) and \(Y\), the new slope estimate equals the reciprocal of the original one.

  7. True or False: if \(\operatorname{Cov}\left( Y, X_j \right) = 0\) then \(\beta_j= 0\)?

  8. What affects estimation precision?

  9. To compare the effects of two variables \(X_j, X_k\), can we say they have the same effect because the confidence intervals of \(\beta_j\) and \(\beta_k\) overlap?

  10. Does the partialling-out method hold for \(p \ge 3\)? Yes.

  11. How do you compare two linear models?

  12. What happens if you exclude a relevant regressor \(X_j\)?

  13. What happens if you include an irrelevant regressor \(X_j\)?

  14. Describe missing data problems in linear regression

  15. Does \(\boldsymbol{x}_{p} ^\top \boldsymbol{y} = 0 \Leftrightarrow \hat{\beta}_{p}=0\)?

  16. Given \(R^2=0.3\) for \(Y\sim X_1\), and \(R^2 = 0.4\) for \(Y \sim X_2\), what is \(R^2\) for \(Y \sim X_1 + X_2\)?

  17. Causal?

    313.qz1.q2

    TBD.

  18. Add/Remove an observation

    E(b), Var(b), RSS, TSS, R^2

  19. More

    https://www.1point3acres.com/bbs/thread-703302-1-1.html