Generalized Linear Models¶
From LM to GLM¶
Limitation of Linear Models¶
Recall a linear model is
and the linear predictor \(\boldsymbol{x}_i ^\top \boldsymbol{\beta}\) aims to predict the mean response
Clearly, the range of the RHS is \(\mathbb{R}\), while the response on the LHS may not be: it can be binary, a non-negative count, or of some other form. Thus, we need more flexible models for these data types.
Model Form¶
A generalized linear model has the form
where
\(Y_{i}\) is the response, aka the random component.
We assume \(Y_i \overset{ \text{iid}}{\sim} F\) where \(F\) is some distribution, such as normal, binomial, or Poisson. Thus, we generalize the response \(y_i\) from continuous real values in ordinary linear models to binary responses, counts, categories, etc. Usually \(F\) is from an exponential family.
\(\boldsymbol{x}_i ^\top \boldsymbol{\beta}\) is a linear predictor, often denoted \(\eta_i\).
As in ordinary least squares models, \(\boldsymbol{x}_{i}\) can include interactions, non-linear transformations of the observed covariates, and the constant term.
\(g(\cdot)\) is a link function
Unlike linear models, which use the linear predictor \(\boldsymbol{x}_i ^\top \boldsymbol{\beta}\) to predict the mean response \(\operatorname{E}\left( Y_i \right)\) directly, a GLM uses the linear predictor to predict a function \(g(\cdot)\) of the mean response. The link function connects the linear predictor with the random component: \(g(\mu_i) = \eta_i\).
There are many ways to choose the link function. Note that if \(Y_i \sim f(y_i; \theta)\) then \(b ^\prime (\theta) = \operatorname{E}\left( Y \right) = \mu\), i.e. \(\theta\) can be rewritten as a function of \(\mu\): \(\theta = (b ^\prime ) ^{-1} (\mu)\) (this form can usually be read off from the PDF). If we choose \(g = (b ^\prime ) ^{-1}\) as the link function, it is called the canonical link function, and then \(\theta_i = \boldsymbol{x}_i ^\top \boldsymbol{\beta}\) (a worked Bernoulli derivation follows the examples below).
Some examples at a glance
For linear models, \(g(\mu)=\mu\), such that \(\operatorname{E}\left( Y_i \right) = \boldsymbol{x} _i ^\top \boldsymbol{\beta}\)
For logistic regression, \(Y_i \sim \operatorname{Ber}(p)\) where \(p=\mu\); then \(\theta = \ln \frac{\mu}{1-\mu}\), hence \(g(\mu) = \ln \frac{\mu}{1-\mu}\), and \(\ln \frac{\operatorname{P}\left( Y_i=1\vert \boldsymbol{x} _i \right) }{\operatorname{P}\left( Y_i=0 \vert \boldsymbol{x} _i\right) } = \boldsymbol{x} _i^\top \boldsymbol{\beta}\)
For Poisson regression, \(\theta = \ln(\mu)\), hence \(g(\mu) = \ln \mu\), and \(\ln \operatorname{E}\left( Y_i \right) = \boldsymbol{x} _i^\top \boldsymbol{\beta}\)
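To see where these canonical links come from, it helps to rewrite the density in exponential-family form and read off the natural parameter. Here is the standard Bernoulli computation behind the logistic case, included for concreteness:

\[
f(y; p) = p^{y}(1-p)^{1-y} = \exp\left\{ y \log\frac{p}{1-p} + \log(1-p) \right\},
\qquad \text{so} \qquad \theta = \log\frac{\mu}{1-\mu}, \quad b(\theta) = \log(1+e^{\theta}).
\]

Indeed \(b'(\theta) = \frac{e^{\theta}}{1+e^{\theta}} = \mu\), so \((b')^{-1}(\mu) = \log\frac{\mu}{1-\mu}\) is exactly the logit link listed above.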
Estimation¶
Usually, we estimate the parameters by maximum likelihood.
First, we consider using the canonical link, such that \(\theta_i = \boldsymbol{x}_i ^\top \boldsymbol{\beta}\).
Score Function¶
Canonical Link¶
As discussed above, the log-likelihood of parameter \(\theta\) is
Since \(\theta_i = \boldsymbol{x}_i ^\top \boldsymbol{\beta}\) and the \(\boldsymbol{x} _i\)’s are known, the log-likelihood of the coefficients \(\boldsymbol{\beta}\) can be written in the same way by substituting \(\theta_i = \boldsymbol{x}_i ^\top \boldsymbol{\beta}\).
Taking the derivative w.r.t. \(\beta_j\) gives
In matrix form, we have
The score equation is, therefore,
Note
Note that
\[ \frac{\partial^{2}\ell(\boldsymbol{\beta})}{\partial\beta_{h}\partial\beta_{j}}=-\sum_{i}b^{\prime\prime}\left(\boldsymbol{x}_{i}^\top\boldsymbol{\beta}\right)x_{ih}x_{ij}=-\sum_{i}\operatorname{Var}(Y_{i})\,x_{ih}x_{ij} \]So the Hessian is
\[ \frac{\partial^{2}\ell(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}\partial\boldsymbol{\beta}^{\top}}=-\boldsymbol{X} ^\top \boldsymbol{V} \boldsymbol{X} \preceq \boldsymbol{0} \quad \text{where } \boldsymbol{V} =\operatorname{diag}(\operatorname{Var}\left(Y_{i}\right)) \]and the Hessian is negative definite (\(\prec \boldsymbol{0}\)) iff \(\boldsymbol{X}\) has full column rank.
That is, the log-likelihood function \(\ell(\boldsymbol{\beta})\) is concave in \(\boldsymbol{\beta}\).
There is no randomness in the Hessian matrix since it does not involve \(y\); this holds for all exponential-family models with the canonical link.
For the log-likelihood, when \(\boldsymbol{\beta}\) is the parameter, we see the part involving both data and parameter is
\[ \sum_{j}\left(\sum_{i}y_{i}x_{ij}\right)\beta_{j} \]So the sufficient statistic for \(\beta_{j}\) is \(\sum_{i}y_{i}x_{ij}=\boldsymbol{x}_{j}^\top\boldsymbol{y}\).
The overall score equations \(\boldsymbol{X} ^\top\boldsymbol{y}= \boldsymbol{X} ^\top\boldsymbol{\mu}\) equate the sufficient statistics to their expected values.
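As a numerical sanity check of the canonical-link score \(\boldsymbol{X} ^\top(\boldsymbol{y}-\boldsymbol{\mu})\), the following minimal Python sketch compares it with a finite-difference gradient of the Poisson log-likelihood. NumPy and SciPy are assumed available, and the simulated data and chosen tolerances are illustrative only.

```python
import numpy as np
from scipy.optimize import approx_fprime

rng = np.random.default_rng(0)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # design matrix with intercept
y = rng.poisson(np.exp(X @ np.array([0.5, 0.3, -0.2])))         # Poisson responses, log (canonical) link

def loglik(beta):
    """Poisson log-likelihood in beta, dropping the constant -log(y_i!) terms."""
    eta = X @ beta
    return np.sum(y * eta - np.exp(eta))

beta0 = np.zeros(p)                                  # evaluate the score at an arbitrary point
score_closed_form = X.T @ (y - np.exp(X @ beta0))    # X^T (y - mu), with mu = exp(X beta)
score_numeric = approx_fprime(beta0, loglik, 1e-6)   # finite-difference gradient

print(np.allclose(score_closed_form, score_numeric, rtol=1e-3))  # expect True
```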
General Link¶
If we use a general link, then the score function is
where
Derivation
By the chain rule, we have
Since
Then
where
\[ \frac{\partial\mu_{i}}{\partial\eta_{i}}=\frac{\partial\mu_{i}}{\partial g\left(\mu_{i}\right)}=\frac{1}{g^{\prime}\left(\mu_{i}\right)} \]
Hence in matrix form, the score equations are
and
where
Note that if \(g\) is the canonical link then
hence \(\boldsymbol{D} =\boldsymbol{V}\) and the score equations reduce to
More on \(\operatorname{Var}\left( Y_i \right)\)
That is, the variance is a function of \(\mu\). The relation \(v(\cdot)\) uniquely characterizes an exponential-family distribution. For instance, \(v(\mu_{i})=\mu_i\) for the Poisson and \(v(\mu_{i})=\sigma^{2}\) (constant) for the normal. If we do not assume \(Y_{i}\) is from an exponential family, can we specify our own mean-variance relation? We will see the answer in the Quasi-Likelihood section.
Computation¶
There are several ways to solve the score equations for \(\widehat{\boldsymbol{\beta}} _{ML}\): Newton’s method, Fisher scoring, and iteratively reweighted least squares (IRLS).
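Below is a minimal sketch of IRLS for logistic regression (canonical logit link, where Newton’s method and Fisher scoring coincide). Only NumPy is assumed; the simulated data, iteration cap, and tolerance are illustrative, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([-0.5, 1.0, 0.5])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))    # binary responses

beta = np.zeros(p)
for _ in range(25):
    eta = X @ beta                                        # linear predictor
    mu = 1 / (1 + np.exp(-eta))                           # mean, inverse logit
    w = mu * (1 - mu)                                     # working weights = Var(Y_i) under canonical link
    z = eta + (y - mu) / w                                # working response
    beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))  # weighted LS step
    if np.max(np.abs(beta_new - beta)) < 1e-10:
        beta = beta_new
        break
    beta = beta_new

print(beta)   # should be close to beta_true for moderately large n
```

Each update is a weighted least-squares fit with weights equal to \(\operatorname{Var}(Y_i)\), which is where the name iteratively reweighted least squares comes from.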
Inference¶
Asymptotic Distribution of MLE¶
Deviance¶
Saturated Models¶
First we define saturated models.
- Definition (Saturated models)
For a particular GLM with three components specified, we want to fit means \(\mu_{i}\). A saturated model fits \(\mu_{i}\) by the observation itself, i.e.,
\[\tilde{\mu}_{i}=y_{i}\]which yields a perfect fit without parsimony. The number of parameters is \(p=n\). It is often used as a baseline for comparison with other models.
Note
In principle, if two observations \(i_{1}\) and \(i_{2}\) share the same covariate values \(\boldsymbol{x}_{i_{1}}=\boldsymbol{x}_{i_{2}}\), then their predicted values \(\hat{y}_{i_{1}},\hat{y}_{i_{2}}\) should be the same. But in the saturated model for a GLM, we allow them to differ, i.e. \(\hat{y}_{i_{1}} = y_{i_1},\hat{y}_{i_{2}} = y_{i_{2}}\).
Recall that \(g(\mu_i) = \boldsymbol{x}_i ^\top \boldsymbol{\beta}\). If we write the log-likelihood in terms of \(\boldsymbol{\mu}\), we get
In a saturated model, the value \(\boldsymbol{\mu}\) is estimated by data \(\tilde{\boldsymbol{\mu} } = \boldsymbol{y}\). The likelihood for the saturated model is
where \(\tilde{\theta}_{i}\) is a function of \(\tilde{\mu}_{i}\).
The saturated model achieves the maximum achievable log-likelihood. So for any other fit \(\hat{\boldsymbol{\mu}}\), we have
Goodness-of-fit by Deviance¶
As its name suggests, a goodness-of-fit test measures how well a chosen model fits the data. In the test, the null hypothesis is
The alternative hypothesis is
Defined in this way, sometimes it’s called global alternatives.
- Definition (Deviance)
We want to measure the difference between a chosen model with fit \(\hat{\boldsymbol{\mu}}\) and the saturated model with fit \(\tilde{\boldsymbol{\mu}}=\boldsymbol{y}\). The deviance is a statistic defined as
\[\begin{split} \begin{aligned} D(\tilde{\boldsymbol{\mu}},\hat{\boldsymbol{\mu}}) & =-2\left[\ell(\hat{\boldsymbol{\mu}}\vert\boldsymbol{y})-\ell(\tilde{\boldsymbol{\mu}}\vert\boldsymbol{y})\right]\\ & =-2\sum_{i}\log\frac{f(y_{i}\vert\hat{\mu}_{i})}{f(y_{i}\vert\tilde{\mu}_{i})}\\ & \sim\chi_{n-p}^2 \end{aligned}\end{split}\]where \(p\) is the number of parameters in the fit \(\hat{\boldsymbol{\mu}}\). Note that in general \(D(\boldsymbol{\mu}_{1},\boldsymbol{\mu}_{2})\ne D(\boldsymbol{\mu}_{2},\boldsymbol{\mu}_{1})\).
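For instance, plugging the Poisson log-likelihood \(\ell(\boldsymbol{\mu}\vert\boldsymbol{y})=\sum_{i}\left[y_{i}\log\mu_{i}-\mu_{i}-\log(y_{i}!)\right]\) into this definition gives, after a standard computation,

\[
D(\boldsymbol{y},\hat{\boldsymbol{\mu}})
= 2\sum_{i}\left[ y_{i}\log\frac{y_{i}}{\hat{\mu}_{i}} - (y_{i}-\hat{\mu}_{i}) \right],
\]

and when the model contains an intercept and uses the canonical log link, the score equations force \(\sum_{i}(y_{i}-\hat{\mu}_{i})=0\), leaving only the observed \(\times\) log(observed/fitted) term.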
We can denote deviance by \(G^{2}\). In this course, it has the form
Since the saturated model is a perfect fit, the larger the deviance \(D(\boldsymbol{y},\hat{\boldsymbol{\mu}})\), the poorer the fit \(\hat{\boldsymbol{\mu}}\).
Maximizing the log-likelihood \(\ell(\boldsymbol{\beta})\) is equivalent to minimizing the deviance \(D(\boldsymbol{y},\hat{\boldsymbol{\mu}})\).
- Definition (Null deviance)
Null deviance is defined as
\[ D(\tilde{\boldsymbol{\mu}},\bar{\boldsymbol{y}})=-2\left[\ell(\bar{\boldsymbol{y}}\vert\boldsymbol{y})-\ell(\tilde{\boldsymbol{\mu} }\vert\boldsymbol{y})\right] \]That is, all mean estimates have the same value \(\hat{\mu}_i = \bar{y}\).
In sum, the deviance is \(2\) times the distance from the log-likelihood of some estimate \(\hat{\boldsymbol{\mu}}\) to the log-likelihood of the saturated estimate \(\tilde{\boldsymbol{\mu} } = \boldsymbol{y}\), as illustrated below.
- Example (Deviance for Normal with \(\sigma=1\))
The deviance for normal distribution is
\[ D(\boldsymbol{\mu}_{1},\boldsymbol{\mu}_{2})=\sum_{i}(\mu_{1i}-\mu_{2i})^{2} \]So for an OLS model, the deviance is
\[ D(\boldsymbol{y}, \hat{\boldsymbol{\mu}})=\sum_{i}(y_{i}-\hat{\mu}_{i})^{2}=\sum_{i}(y_{i}-\hat{y}_{i})^{2}=RSS \]Thus we see that the deviance can be interpreted as a generalization of the residual sum of squares in OLS.
Moreover, the null deviance is
\[ D(\boldsymbol{y},\bar{\boldsymbol{y} })=\sum_{i}(y_{i}-\bar{y})^{2}=TSS \]
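These identities are easy to verify numerically. Here is a small sketch using statsmodels (assumed installed) on simulated data: the Gaussian-family GLM deviance should equal the RSS, and its null deviance should equal the TSS.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 100
X = sm.add_constant(rng.normal(size=(n, 2)))             # design matrix with intercept
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)

fit = sm.GLM(y, X, family=sm.families.Gaussian()).fit()  # identity link

rss = np.sum((y - fit.fittedvalues) ** 2)
tss = np.sum((y - y.mean()) ** 2)
print(np.isclose(fit.deviance, rss))        # expect True: deviance = RSS
print(np.isclose(fit.null_deviance, tss))   # expect True: null deviance = TSS
```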
As suggested by the example, we can define the generalized \(R^{2}\) in GLM as
Compare Two Nested Models by Deviance¶
To compare two nested models \(M_{\text{reduced} }:\boldsymbol{\beta}=\boldsymbol{\beta}_{\text{reduced} }\) vs. \(M_{\text{full} }:\boldsymbol{\beta}=\boldsymbol{\beta}_{\text{full} }\), we use an \(F\)-test in linear models; in GLMs, we can compare their deviances.
Suppose the corresponding fits are \(\hat{\boldsymbol{\mu}}_{r}\) and \(\hat{\boldsymbol{\mu}}_{f}\), and the numbers of parameters are \(p_{r}\) and \(p_{f}\), then the test statistic is essentially a likelihood-ratio statistic, defined as
Note that the reduced model has a larger deviance, so \(D(\boldsymbol{y},\hat{\boldsymbol{\mu}}_{r})>D(\boldsymbol{y},\hat{\boldsymbol{\mu}}_{f})\). This is analogous to the fact that a reduced model has a larger \(RSS\) in linear models.
When we want to compare multiple models, in linear models we use an ANOVA table; in GLMs, we can use an analysis-of-deviance table.
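A short illustrative sketch of the nested-model comparison with statsmodels (assumed installed): fit a Poisson GLM with and without one covariate, take the difference in deviances, and refer it to a \(\chi^2_{p_f - p_r}\) distribution. The data below are simulated only to show the mechanics.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(3)
n = 300
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = rng.poisson(np.exp(0.2 + 0.4 * x1 + 0.3 * x2))

X_full = sm.add_constant(np.column_stack([x1, x2]))     # p_f = 3 parameters
X_red = sm.add_constant(x1)                             # p_r = 2 parameters (drops x2)

fit_full = sm.GLM(y, X_full, family=sm.families.Poisson()).fit()
fit_red = sm.GLM(y, X_red, family=sm.families.Poisson()).fit()

lr_stat = fit_red.deviance - fit_full.deviance          # D(y, mu_r) - D(y, mu_f) >= 0
p_value = stats.chi2.sf(lr_stat, df=X_full.shape[1] - X_red.shape[1])
print(lr_stat, p_value)                                 # a small p-value favours the full model
```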
Generalized Pearson Statistic¶
In addition to the deviance, the generalized Pearson statistic is another quantity used to check goodness-of-fit.
- Definition (Generalized Pearson statistic)
A useful statistic to check goodness-of-fit is the generalized Pearson statistic, which is defined as
\[X^{2}:=\sum_{i}\frac{(y_{i}-\hat{\mu}_{i})^{2}}{v(\hat{\mu}_{i})}\]In some GLMs, it has the form
\[ X^2 = \sum\frac{(\text{observed}-\text{fitted})^{2}}{\text{fitted}} \]It is an alternative to the deviance for testing the fit of certain GLMs.
Residuals¶
To detect a model’s lack of fit, any particular type of residual \(\hat{\boldsymbol{\varepsilon}}\) below can be plotted against the fitted values \(\hat{\boldsymbol{\mu}}\) and against each explanatory variable.
Pearson Residuals¶
The Pearson residual for observation \(i\) is defined as
Their squared values sum to the generalized Pearson statistic.
Deviance Residual¶
Let \(D(\boldsymbol{y},\boldsymbol{\hat{\boldsymbol{\mu}}})=\sum_{i}d_{i}\). The deviance residual for observation \(i\) is defined as
Their squared values sum to the deviance.
Standardized Residuals¶
The standardized residual divides each raw residual \((y_{i}-\hat{\mu}_{i})\) by its standard error. Let \(h_{ii}\) be the diagonal element of the generalized hat matrix for observation \(i\), which is called its leverage. Then, the standardized residual for observation \(i\) is defined as
Dividing by the standard error accounts for leverage, so under the model the standardized residuals have approximately unit variance.
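The two sum-of-squares identities above are easy to confirm numerically; here is a minimal sketch with statsmodels (assumed installed) on a simulated Poisson fit.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 200
X = sm.add_constant(rng.normal(size=(n, 2)))
y = rng.poisson(np.exp(X @ np.array([0.3, 0.5, -0.2])))

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()

# Pearson residuals: squares sum to the generalized Pearson statistic X^2
print(np.isclose(np.sum(fit.resid_pearson ** 2), fit.pearson_chi2))   # expect True
# Deviance residuals: squares sum to the deviance D(y, mu_hat)
print(np.isclose(np.sum(fit.resid_deviance ** 2), fit.deviance))      # expect True
```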
Summary¶
Recall definitions
Link function
\[ g\left(\mu_i\right)=\boldsymbol{x}_{i}^{\top} \boldsymbol{\beta} \]Deviance
\[D(\tilde{\boldsymbol{\mu}}, \hat{\boldsymbol{\mu}})=-2[\ell(\hat{\boldsymbol{\mu}} \mid \boldsymbol{y})-\ell(\tilde{\boldsymbol{\mu}} \mid \boldsymbol{y})] \sim \chi ^2 _{n-p}\]Usual form
\[ G^{2}:=D(\boldsymbol{y}, \hat{\boldsymbol{\mu}})=2 \sum \text { observed } \times \log \left(\frac{\text { observed }}{\text { fitted }}\right) \]Generalized Pearson Statistic (for some GLM)
\[ X^{2}:=\sum_{i} \frac{\left(y_{i}-\hat{\mu}_{i}\right)^{2}}{v\left(\hat{\mu}_{i}\right)} \]Usual form
\[ X^{2}=\sum \frac{(\text { observed }-\text { fitted })^{2}}{\text { fitted }} \]
Summary of common models
| Data and assumption | Model | Form | #Parm | Interpretation | Remarks |
| --- | --- | --- | --- | --- | --- |
| Normal with \(\sigma^2 =1\) | OLS | \(\mu_i = \boldsymbol{x} ^{\top} _i \boldsymbol{\beta}\) | \(p\) | \(\operatorname{E}(Y_i)\) increases by \(\beta_j\) per unit increase in \(x_{ij}\) | Deviance \(=RSS\) |
| Binary | Logistic | \(\log \left(\frac{\pi_{i}}{1-\pi_{i}}\right)=\boldsymbol{x}_{i}^{\top} \boldsymbol{\beta}\) | \(p\) | Odds of \(Y_{i}=1\) is multiplied by \(e^{\beta_k}\) | \(\operatorname{logit}(\pi_i) = \log \left(\frac{\pi_{i}}{1-\pi_{i}}\right)\) |
| Multinomial | Baseline-category logit | \(\log \left( \frac{\pi_{i j}}{\pi_{i c}} \right) =\boldsymbol{x}_{i}^{\top} \boldsymbol{\beta}_{j}\) | \((c-1)\times p\) | Ratio of probabilities \(\frac{\operatorname{P}\left( Y_i = j \right)}{\operatorname{P}\left( Y_i = c \right)}\) is multiplied by \(e^{\beta_{jk}}\) | When \(c=2\), reduces to logistic. |
| Ordinal | Cumulative logit | \(\log \left( \frac{\operatorname{P} \left(Y_{i} \leq j\right)}{1- \operatorname{P} \left(Y_{i} \leq j\right)} \right) = \alpha_j + \boldsymbol{x}_i ^{\top} \boldsymbol{\beta}\) | \((c-1) + p\) | Odds of \(Y_{i} \leq j\) is multiplied by \(e^{\beta_k}\) | No intercept in \(\boldsymbol{\beta}\). |
| Count | Poisson log-linear | \(\log \left( \mu_i \right) = \boldsymbol{x}_i ^{\top} \boldsymbol{\beta}\) | \(p\) | \(\mu_i\) is multiplied by \(e^{\beta_k}\) | Multiplicative effect. |
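To connect the "multiplied by \(e^{\beta_k}\)" interpretations in the table to actual output, here is a brief sketch (statsmodels assumed installed, data simulated): exponentiated logistic coefficients act as odds multipliers, and exponentiated Poisson coefficients act as mean multipliers.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 2000
x = rng.normal(size=n)
X = sm.add_constant(x)

# Binary response: a one-unit increase in x multiplies the odds of Y=1 by exp(beta_1)
y_bin = rng.binomial(1, 1 / (1 + np.exp(-(-0.2 + 0.7 * x))))
logit_fit = sm.GLM(y_bin, X, family=sm.families.Binomial()).fit()
print(np.exp(logit_fit.params))   # slope entry should be near exp(0.7) ~ 2.01

# Count response: a one-unit increase in x multiplies the mean mu by exp(beta_1)
y_cnt = rng.poisson(np.exp(0.1 + 0.4 * x))
pois_fit = sm.GLM(y_cnt, X, family=sm.families.Poisson()).fit()
print(np.exp(pois_fit.params))    # slope entry should be near exp(0.4) ~ 1.49
```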