Generalized Linear Models

From LM to GLM

Limitations of Linear Models

Recall a linear model is

\[Y_i = \boldsymbol{x} _i^\top \boldsymbol{\beta} + \varepsilon_i\]

and the linear predictor \(\boldsymbol{x}_i ^\top \boldsymbol{\beta}\) aims to predict the mean response

\[\operatorname{E}\left( Y_i \right) = \boldsymbol{x}_i ^\top \boldsymbol{\beta}\]

Clearly, the range of the RHS is \(\mathbb{R}\), while the LHS response may not be: it can be binary, a non-negative count, or take some other restricted form. Thus, we need more capable models for these particular data types.

Model Form

A generalized linear model has the form

\[ g\left( \operatorname{E}\left( Y_{i} \right) \right) =\boldsymbol{x}_i ^\top \boldsymbol{\beta} \]

where

  • \(Y_{i}\) is the response, a.k.a. the random component.

    We assume the \(Y_i\) are independent, each following a distribution \(F\) from a common family, such as the normal, binomial, or Poisson. Thus, we generalize the response \(y_i\) from the continuous real values of ordinary linear models to binary responses, counts, categories, etc. Usually \(F\) is from an exponential family.

  • \(\boldsymbol{x}_i ^\top \boldsymbol{\beta}\) is a linear predictor, often denoted \(\eta_i\).

    As in ordinary least squares models, \(\boldsymbol{x}_{i}\) can include interactions, non-linear transformations of the observed covariates, and the constant term.

  • \(g(\cdot)\) is the link function.

    Unlike a linear model, which uses the linear predictor \(\boldsymbol{x}_i ^\top \boldsymbol{\beta}\) to predict the mean response \(\operatorname{E}\left( Y_i \right)\) directly, a GLM uses the linear predictor to predict a function \(g(\cdot)\) of the mean response. The link function connects the linear predictor with the random component: \(g(\mu_i) = \eta_i\), where \(\mu_i = \operatorname{E}\left( Y_i \right)\).

There are many ways to choose the link function. Note that if \(Y_i\) has an exponential-family density \(f(y_i; \theta) = \exp\left( y_i \theta - b(\theta) \right) f_0(y_i)\), then \(b ^\prime (\theta) = \operatorname{E}\left( Y_i \right) = \mu\), i.e. \(\theta\) can be rewritten as a function of \(\mu\): \(\theta = (b ^\prime ) ^{-1} (\mu)\) (this form can usually be read off from the pdf). If we choose this \((b ^\prime ) ^{-1}\) as the link function \(g = (b ^\prime ) ^{-1}\), then it is called the canonical link function, and we will have \(\theta_i = \boldsymbol{x}_i ^\top \boldsymbol{\beta}\).
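For instance, for the Poisson distribution the canonical link can be read off directly by writing the pmf in the exponential-family form above:

\[ f(y; \mu) = \frac{e^{-\mu}\mu^{y}}{y!} = \exp\left( y \log \mu - \mu \right) \frac{1}{y!} \]

so \(\theta = \log \mu\), \(b(\theta) = e^{\theta}\), \(f_0(y) = 1/y!\), and indeed \(b ^\prime (\theta) = e^{\theta} = \mu\); the canonical link is therefore \(g(\mu) = (b ^\prime )^{-1}(\mu) = \log \mu\).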

Some examples at a glance

  • For linear models, \(g(\mu)=\mu\), such that \(\operatorname{E}\left( Y_i \right) = \boldsymbol{x} _i ^\top \boldsymbol{\beta}\)

  • For logistic regression, \(Y_i \sim \operatorname{Ber}(p)\) where \(p=\mu\), then \(\theta = \ln \frac{\mu}{1-\mu}\), hence \(g(\mu) = \ln \frac{\mu}{1-\mu}\), and \(\ln \frac{\operatorname{P}\left( Y_i=1\vert \boldsymbol{x} _i \right) }{\operatorname{P}\left( Y_i=0 \vert \boldsymbol{x} _i\right) } = \boldsymbol{x} _i^\top \boldsymbol{\beta}\)

  • For Poisson regression, \(\theta = \ln(\mu)\), hence \(g(\mu) = \ln \mu\), and \(\ln \operatorname{E}\left( Y_i \right) = \boldsymbol{x} _i^\top \boldsymbol{\beta}\)
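To make this concrete, here is a minimal sketch of fitting such a model with statsmodels. The simulated data, sample size, seed, and true coefficients are arbitrary illustrative choices, not from these notes.

```python
import numpy as np
import statsmodels.api as sm

# Simulated data for illustration only: two covariates plus an intercept
rng = np.random.default_rng(0)
n = 200
X = sm.add_constant(rng.normal(size=(n, 2)))   # design matrix with constant term
beta_true = np.array([0.5, 0.3, -0.2])
mu = np.exp(X @ beta_true)                     # mean via the inverse log link
y = rng.poisson(mu)                            # Poisson response

# Poisson log-linear model: log E(Y_i) = x_i' beta (canonical link)
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(fit.params)        # estimated beta
print(fit.deviance)      # deviance of the fit (used later in this section)

# A logistic regression would be fit the same way with
# family=sm.families.Binomial() and a binary response.
```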

Estimation

Usually, we estimate the parameters by maximum likelihood.

First, we consider using the canonical link, such that \(\theta_i = \boldsymbol{x}_i ^\top \boldsymbol{\beta}\).

Score Function
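With the canonical link (so \(\theta_i = \boldsymbol{x}_i ^\top \boldsymbol{\beta}\)) and unit dispersion, the log-likelihood is \(\ell(\boldsymbol{\beta}) = \sum_i \left[ y_i \theta_i - b(\theta_i) + \log f_0(y_i) \right]\), and differentiating with respect to \(\boldsymbol{\beta}\) gives the score function

\[ \frac{\partial \ell}{\partial \boldsymbol{\beta}} = \sum_i \left( y_i - b ^\prime (\theta_i) \right) \boldsymbol{x}_i = \sum_i \left( y_i - \mu_i \right) \boldsymbol{x}_i = \boldsymbol{X}^\top (\boldsymbol{y} - \boldsymbol{\mu}) \]

The MLE \(\widehat{\boldsymbol{\beta}} _{ML}\) solves the score equations \(\boldsymbol{X}^\top (\boldsymbol{y} - \hat{\boldsymbol{\mu}}) = \boldsymbol{0}\), which in general have no closed form and are solved iteratively.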

Computation

There are several ways to solve the score equations for \(\widehat{\boldsymbol{\beta}} _{ML}\): Newton’s method, Fisher scoring, and iteratively reweighted least squares (IRLS).
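As a sketch of how Fisher scoring / IRLS looks in practice, the snippet below implements IRLS for logistic regression with the canonical logit link (a minimal NumPy illustration, not a production routine; the iteration cap and tolerance are arbitrary). With a canonical link, Newton's method and Fisher scoring coincide, and each iteration is a weighted least squares solve.

```python
import numpy as np

def irls_logistic(X, y, n_iter=25, tol=1e-8):
    """Fit logistic regression by iteratively reweighted least squares (IRLS)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta                        # linear predictor eta_i = x_i' beta
        p = 1.0 / (1.0 + np.exp(-eta))        # fitted means mu_i (inverse logit)
        w = p * (1.0 - p)                     # IRLS weights = Var(Y_i) for Bernoulli
        z = eta + (y - p) / w                 # working response
        XtW = X.T * w                         # X' W with W = diag(w)
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)   # (X' W X)^{-1} X' W z
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```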

Inference

Asymptotic Distribution of MLE
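For reference, the standard large-sample result (stated here for the canonical link with unit dispersion) is

\[ \widehat{\boldsymbol{\beta}} _{ML} \sim N\left( \boldsymbol{\beta},\ \mathcal{I}(\boldsymbol{\beta})^{-1} \right) \text{ approximately}, \qquad \mathcal{I}(\boldsymbol{\beta}) = \boldsymbol{X}^\top \boldsymbol{W} \boldsymbol{X},\quad \boldsymbol{W} = \operatorname{diag}\left( v(\mu_1), \ldots, v(\mu_n) \right) \]

In practice \(\mathcal{I}\) is evaluated at \(\widehat{\boldsymbol{\beta}} _{ML}\), and the square roots of the diagonal entries of \(\mathcal{I}(\widehat{\boldsymbol{\beta}})^{-1}\) give the standard errors reported by software.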

Test \(\boldsymbol{\beta} = \boldsymbol{\beta} _0\)

Wald Test
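The Wald statistic uses only the fit of the unrestricted model; under \(H_0\) it is approximately

\[ W = \left( \widehat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0 \right)^\top \mathcal{I}(\widehat{\boldsymbol{\beta}}) \left( \widehat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0 \right) \sim \chi^2_p \]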

Likelihood Ratio Test
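The likelihood ratio statistic compares the log-likelihood at the restriction with the maximized log-likelihood; under \(H_0\) it is approximately

\[ \Lambda = -2\left[ \ell(\boldsymbol{\beta}_0) - \ell(\widehat{\boldsymbol{\beta}}) \right] \sim \chi^2_p \]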

Score Test
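The score statistic needs only quantities evaluated at \(\boldsymbol{\beta}_0\), so the unrestricted model never has to be fitted; under \(H_0\) it is approximately

\[ S = U(\boldsymbol{\beta}_0)^\top \mathcal{I}(\boldsymbol{\beta}_0)^{-1} U(\boldsymbol{\beta}_0) \sim \chi^2_p, \qquad U(\boldsymbol{\beta}) = \frac{\partial \ell}{\partial \boldsymbol{\beta}} \]

All three statistics are asymptotically equivalent under \(H_0\).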

Deviance

Saturated Models

First we define saturated models.

Definition (Saturated models)

For a particular GLM with its three components specified, we want to fit the means \(\mu_{i}\). A saturated model fits \(\mu_{i}\) by the observation itself, i.e.,

\[\tilde{\mu}_{i}=y_{i}\]

which yields a perfect fit without parsimony. The number of parameters is \(p=n\). It is often used as a baseline for comparison with other models.

Note

In principle, if two observations \(i_{1}\) and \(i_{2}\) share the same covariate values \(\boldsymbol{x}_{i_{1}}=\boldsymbol{x}_{i_{2}}\), then their predicted values \(\hat{y}_{i_{1}},\hat{y}_{i_{2}}\) should be the same. But in the saturated model for a GLM, we allow them to be different, i.e. \(\hat{y}_{i_{1}} = y_{i_1},\hat{y}_{i_{2}} = y_{i_{2}}\).

Recall that \(g(\mu_i) = \boldsymbol{x}_i ^\top \boldsymbol{\beta}\). If we write the log-likelihood in terms of \(\boldsymbol{\mu}\), we get

\[ \ell(\boldsymbol{\beta}\vert\boldsymbol{y})=\ell(\boldsymbol{\mu}\vert\boldsymbol{y}) \]

In a saturated model, \(\boldsymbol{\mu}\) is estimated by the data itself, \(\tilde{\boldsymbol{\mu} } = \boldsymbol{y}\). The log-likelihood for the saturated model is

\[\begin{split}\begin{aligned} \ell(\tilde{\boldsymbol{\mu} } \vert\boldsymbol{y}) & =\sum_{i}\ell(\tilde{\mu}_i \vert y_{i})\\ & =\sum_{i}\left[y_{i}\tilde{\theta}_{i}-b(\tilde{\theta}_{i})+\log f_{0}(y_{i})\right] \end{aligned}\end{split}\]

where \(\tilde{\theta}_{i}\) is a function of \(\tilde{\mu}_{i}\).

The saturated model attains the largest achievable log-likelihood. So for any other fit \(\hat{\boldsymbol{\mu}}\), we have

\[ \ell(\tilde{\boldsymbol{\mu} } \vert\boldsymbol{y})\ge\ell( \hat{\boldsymbol{\mu}}\vert\boldsymbol{y}) \]

Goodness-of-fit by Deviance

As its name suggests, a goodness-of-fit test measures how well a chosen model fits the data. In the test, the null hypothesis is

\[ H_{0}:\text{ the chosen model truly holds} \]

The alternative hypothesis is

\[ H_{1}:\text{ other models are better} \]

With the alternative defined this broadly, the test is sometimes said to be against global alternatives.

Definition (Deviance)

We want to measure the difference between a chosen model with fit \(\hat{\boldsymbol{\mu}}\), and the saturated model with fit \(\tilde{\boldsymbol{\mu}}=\boldsymbol{y}\). Deviance is a statistic defined as

\[\begin{split}\begin{aligned} D(\tilde{\boldsymbol{\mu}},\hat{\boldsymbol{\mu}}) & =-2\left[\ell(\hat{\boldsymbol{\mu}}\vert\boldsymbol{y})-\ell(\tilde{\boldsymbol{\mu}}\vert\boldsymbol{y})\right]\\ & =-2\sum_{i}\log\frac{f(y_{i}\vert\hat{\mu}_{i})}{f(y_{i}\vert\tilde{\mu}_{i})}\\ & \sim\chi_{n-p}^2 \end{aligned}\end{split}\]

where \(p\) is the number of parameters in the fit \(\hat{\boldsymbol{\mu}}\). Note that in general \(D(\boldsymbol{\mu}_{1},\boldsymbol{\mu}_{2})\ne D(\boldsymbol{\mu}_{2},\boldsymbol{\mu}_{1})\).

We can denote deviance by \(G^{2}\). In this course, it has the form

\[ G^2 := D(\boldsymbol{y},\hat{\boldsymbol{\mu}}) = 2\sum \text{observed} \times \log \left( \frac{\text{observed}}{\text{fitted}} \right) \]

Since the saturated model is the perfect fit, the larger the deviance \(D(\boldsymbol{y},\hat{\boldsymbol{\mu}})\), the poorer the fit \(\hat{\boldsymbol{\mu}}\).

Maximizing the log-likelihood \(\ell(\boldsymbol{\beta})\) is equivalent to minimizing the deviance \(D(\boldsymbol{y},\hat{\boldsymbol{\mu}})\).
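For example, for Poisson responses the deviance works out to

\[ D(\boldsymbol{y}, \hat{\boldsymbol{\mu}}) = 2\sum_i \left[ y_i \log \frac{y_i}{\hat{\mu}_i} - (y_i - \hat{\mu}_i) \right] \]

and when the model contains an intercept and uses the canonical log link, the score equations force \(\sum_i (y_i - \hat{\mu}_i) = 0\), leaving exactly the \(2\sum \text{observed} \times \log(\text{observed}/\text{fitted})\) form above.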

Definition (Null deviance)

Null deviance is defined as

\[ D(\tilde{\boldsymbol{\mu}},\bar{\boldsymbol{y}})=-2\left[\ell(\bar{\boldsymbol{y}}\vert\boldsymbol{y})-\ell(\tilde{\boldsymbol{\mu} }\vert\boldsymbol{y})\right] \]

That is, the fit is the intercept-only model, in which every mean estimate has the same value \(\hat{\mu}_i = \bar{y}\).

In sum, the deviance is \(2\) times the drop in log-likelihood from the saturated estimate \(\tilde{\boldsymbol{\mu} } = \boldsymbol{y}\) to some estimate \(\hat{\boldsymbol{\mu}}\), as illustrated below.

Fig. 71 Likelihood curve and deviance computation for saturated model \(\boldsymbol{\mu}_{\text{saturated} } = \tilde{\boldsymbol{\mu}} = \boldsymbol{y}\), some model \(\hat{\boldsymbol{\mu} }\), and null model \(\boldsymbol{\mu}_{\text{null} } = \bar{\boldsymbol{y}}\)

Example (Deviance for Normal with \(\sigma=1\))

The deviance for the normal distribution with \(\sigma=1\) is

\[ D(\boldsymbol{y},\hat{\boldsymbol{\mu}})=\sum_{i}(y_{i}-\hat{\mu}_{i})^{2} \]

So for an OLS model, where \(\hat{\mu}_i = \hat{y}_i\), the deviance is

\[ D(\boldsymbol{y}, \hat{\boldsymbol{\mu}})=\sum_{i}(y_{i}-\hat{\mu}_{i})^{2}=\sum_{i}(y_{i}-\hat{y}_{i})^{2}=RSS \]

Thus we see that deviance can be interpreted as a generalization of the residual sum of squares in OLS.

Moreover, the null deviance is

\[ D(\boldsymbol{y},\bar{\boldsymbol{y} })=\sum_{i}(y_{i}-\bar{y})^{2}=TSS \]

As suggested by the example, we can define the generalized \(R^{2}\) in GLM as

\[ R^{2}=1-\frac{RSS}{TSS}=1-\frac{D(\boldsymbol{y},\hat{\boldsymbol{\mu}})}{D(\boldsymbol{y},\bar{\boldsymbol{y}})} \]

Compare Two Nested Models by Deviance

To compare two nested models \(M_{\text{reduced} }:\boldsymbol{\beta}=\boldsymbol{\beta}_{\text{reduced} }\) vs. \(M_{\text{full} }:\boldsymbol{\beta}=\boldsymbol{\beta}_{\text{full} }\), in linear models we use the \(F\)-test; in GLMs we can compare their deviances.

Suppose the corresponding fits are \(\hat{\boldsymbol{\mu}}_{r}\) and \(\hat{\boldsymbol{\mu}}_{f}\), and the numbers of parameters are \(p_{r}\) and \(p_{f}\), then the test statistic is essentially a likelihood-ratio statistic, defined as

\[\begin{split}\begin{aligned} G^{2}(M_{r}\vert M_f) & =D(\boldsymbol{y},\hat{\boldsymbol{\mu}}_{r})-D(\boldsymbol{y},\hat{\boldsymbol{\mu}}_f)\\ & =-2\left[\ell(\hat{\boldsymbol{\mu}}_r\vert\boldsymbol{y})-\ell(\hat{\boldsymbol{\mu}}_f\vert\boldsymbol{y})\right]\\ & =-2\sum_{i}\log\frac{f(y_{i}\vert\hat{\mu}_{r ,i})}{f(y_{i}\vert\hat{\mu}_{f,i})}\\ & \sim\chi_{p_f-p_r}^2 \end{aligned}\end{split}\]

Note the reduced model has a larger deviance so \(D(\boldsymbol{y},\hat{\boldsymbol{\mu}}_{r})>D(\boldsymbol{y},\hat{\boldsymbol{\mu}}_{f})\). This is analogous to the fact that a reduced model has a larger \(RSS\) in linear models.

Fig. 72 The likelihood test statistic \(G^{2}(M_{r}\vert M_f)\) is the difference in log-likelihood, or deviance.

When we want to compare multiple models, in linear models we use ANOVA; in GLMs, we can use an analysis of deviance table.
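As a sketch, the nested comparison above is straightforward with statsmodels and scipy; the objects `y`, `X_reduced`, and `X_full` are hypothetical placeholders, with `X_reduced` using a subset of the columns of `X_full` and both designs including an intercept.

```python
import statsmodels.api as sm
from scipy import stats

# Hypothetical nested designs: X_reduced uses a subset of X_full's columns
fit_reduced = sm.GLM(y, X_reduced, family=sm.families.Poisson()).fit()
fit_full = sm.GLM(y, X_full, family=sm.families.Poisson()).fit()

G2 = fit_reduced.deviance - fit_full.deviance   # G^2(M_r | M_f)
df = fit_full.df_model - fit_reduced.df_model   # p_f - p_r (intercepts cancel)
p_value = stats.chi2.sf(G2, df)                 # small p-value favors the full model
```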

Generalized Pearson Statistic

In addition to the deviance, the generalized Pearson statistic is another quantity used to check goodness-of-fit.

Definition (Generalized Pearson statistic)

A useful statistic to check goodness-of-fit is the generalized Pearson statistic, which is defined as

\[X^{2}:=\sum_{i}\frac{(y_{i}-\hat{\mu}_{i})^{2}}{v(\hat{\mu}_{i})}\]

where \(v(\cdot)\) is the variance function, i.e. \(\operatorname{Var}\left( Y_i \right) = v(\mu_i)\).

In some GLMs, e.g. for Poisson counts where \(v(\mu)=\mu\), it has the form

\[ X^2 = \sum\frac{(\text{observed}-\text{fitted})^{2}}{\text{fitted}} \]

It is an alternative to the deviance for testing the fit of certain GLMs.

Residuals

To detect a model’s lack of fit, any particular type of residual \(\hat{\boldsymbol{\varepsilon}}\) below can be plotted against the fitted values \(\hat{\boldsymbol{\mu}}\) and against each explanatory variable.

Asymptotically Uncorrelated with \(\hat{\boldsymbol{\mu} }\)

In LM, residuals \(\hat{\boldsymbol{\varepsilon}} =\boldsymbol{y}-\hat{\boldsymbol{y}}\) and fitted values \(\hat{\boldsymbol{y}}\) are orthogonal, or uncorrelated, regardless of sample size \(n\).

But in GLM, \(\hat{\boldsymbol{\varepsilon}} = \boldsymbol{y}-\hat{\boldsymbol{\mu}}\) and \(\hat{\boldsymbol{\mu}}\) are only asymptotically uncorrelated. See Section 4.4.5 for details.

Pearson Residuals

The Pearson residual for observation \(i\) is defined as

\[e_{i}=\frac{y_{i}-\hat{\mu}_{i}}{\sqrt{v(\hat{\mu}_{i})}}\]

Their squared values sum to the generalized Pearson statistic.

Deviance Residuals

Let \(D(\boldsymbol{y},\boldsymbol{\hat{\boldsymbol{\mu}}})=\sum_{i}d_{i}\). The deviance residual for observation \(i\) is defined as

\[e_{i}=\sqrt{d_{i}} \operatorname{sign} (y_{i}-\hat{\mu}_{i})\]

Their squared values sum to the deviance.

Standardized Residuals

The standardized residual divides each raw residual \((y_{i}-\hat{\mu}_{i})\) by its standard error. Let \(\hat{h}_{ii}\) be the \(i\)-th diagonal element of the estimated generalized hat matrix, called the leverage of observation \(i\). Then, the standardized residual for observation \(i\) is defined as

\[ r_{i}=\frac{y_{i}-\hat{\mu}_{i}}{\sqrt{v(\hat{\mu}_{i})(1-\hat{h}_{ii})}}=\frac{e_{i}}{\sqrt{1-\hat{h}_{ii}}} \]

When the model holds, standardized residuals behave approximately like standard normal variables, which makes unusually large values easier to flag.
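In practice, fitted statsmodels results expose these residuals directly; a brief sketch, reusing the hypothetical `fit_full` from the nested-model comparison above:

```python
# Pearson and deviance residuals from a fitted GLM (statsmodels GLMResults)
pearson_resid = fit_full.resid_pearson      # (y_i - mu_hat_i) / sqrt(v(mu_hat_i))
deviance_resid = fit_full.resid_deviance    # sign(y_i - mu_hat_i) * sqrt(d_i)

# Sanity checks: squared residuals recover X^2 and the deviance
print((pearson_resid ** 2).sum(), fit_full.pearson_chi2)
print((deviance_resid ** 2).sum(), fit_full.deviance)
```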

Summary

Recall definitions

  • Link function

    \[ g\left(\mu_i\right)=\boldsymbol{x}_{i}^{\top} \boldsymbol{\beta} \]
  • Deviance

    \[D(\tilde{\boldsymbol{\mu}}, \hat{\boldsymbol{\mu}})=-2[\ell(\hat{\boldsymbol{\mu}} \mid \boldsymbol{y})-\ell(\tilde{\boldsymbol{\mu}} \mid \boldsymbol{y})] \sim \chi ^2 _{n-p}\]

    Usual form

    \[ G^{2}:=D(\boldsymbol{y}, \hat{\boldsymbol{\mu}})=2 \sum \text { observed } \times \log \left(\frac{\text { observed }}{\text { fitted }}\right) \]
  • Generalized Pearson Statistic (for some GLM)

    \[ X^{2}:=\sum_{i} \frac{\left(y_{i}-\hat{\mu}_{i}\right)^{2}}{v\left(\hat{\mu}_{i}\right)} \]

    Usual form

    \[ X^{2}=\sum \frac{(\text { observed }-\text { fitted })^{2}}{\text { fitted }} \]

Summary of common models

| Data and assumption | Model | Form | #Parm | Interpretation when \(x_{ik}\) increases by 1 | Remarks |
| --- | --- | --- | --- | --- | --- |
| Normal with \(\sigma^2 =1\) | OLS | \(\mu_i = \boldsymbol{x} ^{\top} _i \boldsymbol{\beta}\) | \(p\) | \(\operatorname{E}\left( Y_i \right)\) increases by \(\beta_k\) | Deviance \(=RSS\) |
| Binary, \(Y_{i} \sim \frac{1}{n_i} \operatorname{Bin}\left(n_{i}, \pi_{i}\right)\) | Logistic | \(\log \left(\frac{\pi_{i}}{1-\pi_{i}}\right)=\boldsymbol{x}_{i}^{\top} \boldsymbol{\beta}\) | \(p\) | Odds of \(Y_{i}=1\) are multiplied by \(e^{\beta_k}\) | \(\operatorname{logit}(\pi_i) = \log \left(\frac{\pi_{i}}{1-\pi_{i}}\right)\); more generally \(\pi_i = \operatorname{CDF} (\boldsymbol{x}_i ^{\top} \boldsymbol{\beta} )\) |
| Multinomial, \(Y_{ij} \sim \operatorname{M}(\pi_{i1}, \ldots, \pi_{ic})\) | Baseline-category logit | \(\log \left( \frac{\pi_{i j}}{\pi_{i c}} \right) =\boldsymbol{x}_{i}^{\top} \boldsymbol{\beta}_{j}\), \(j=1, \ldots, c-1\) | \((c-1)\times p\) | Ratio of probabilities \(\frac{\operatorname{P}\left( Y_i = j \right)}{\operatorname{P}\left( Y_i = c \right)}\) is multiplied by \(e^{\beta_{jk}}\) | When \(c=2\), reduces to logistic |
| Ordinal, \(\operatorname{P}\left(Y_{i} \leq j\right)=\pi_{i 1}+\cdots+\pi_{i j}\) | Cumulative logit | \(\log \left( \frac{\operatorname{P} \left(Y_{i} \leq j\right)}{1- \operatorname{P} \left(Y_{i} \leq j\right)} \right) = \alpha_j + \boldsymbol{x}_i ^{\top} \boldsymbol{\beta}\), \(j=1, \ldots, c-1\) | \((c-1) + p\) | Odds of \(Y_{i} \leq j\) are multiplied by \(e^{\beta_k}\) | No intercept in \(\boldsymbol{\beta}\); \(\alpha_j\) acts as a category-specific intercept |
| Count, \(Y_i \sim \operatorname{Poi}(\mu_i)\) | Poisson log-linear | \(\log \left( \mu_i \right) = \boldsymbol{x}_i ^{\top} \boldsymbol{\beta}\) | \(p\) | \(\mu_i\) is multiplied by \(e^{\beta_k}\) | Multiplicative effect; mean \(=\) variance |