Multinomial Logistic Regression¶
Also known as the multinomial GLM, or the multinomial logit for short.
The multinomial GLM generalizes the binomial GLM to responses with \(c\) categories. In particular,
if the categories have an internal order, we use ordinal logistic regression;
if the categories are nested (e.g. train/car/bus, and then red bus/blue bus), we use nested logit models.
Data Structure¶
In this chapter we express models in terms of ungrouped data, with one multicategory trial per subject. As with binary data, however, when the explanatory variables are discrete it is better to group the \(N\) observations according to their multicategory trial indices \(\left\{ n_{i}\right\}\) before forming the deviance, other goodness-of-fit statistics, and residuals.
For subject \(i\), let \(\pi_{ij}\) denote the probability of response in category \(j\), with \({\sum_{j=1}^{c}\pi_{ij}=1}\). The observation is a vector of binary entries \({\boldsymbol{y}_{i}=\left(y_{i1},\ldots,y_{ic}\right)}\), where \(y_{ij}=1\) when the response is in category \(j\) and \(y_{ij}=0\) otherwise, so \({\sum_{j}y_{ij}=1}\). The probability distribution is
\[
p\left(\boldsymbol{y}_{i}\right)=\prod_{j=1}^{c}\pi_{ij}^{y_{ij}}.
\]
If there is no ordering among the categories, the response is called a nominal response; otherwise, it is called an ordinal response.
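To make this concrete, here is a minimal NumPy sketch (with made-up category labels) that converts ungrouped responses into the indicator vectors \(\boldsymbol{y}_{i}\) described above; the variable names are illustrative.

```python
import numpy as np

# Hypothetical ungrouped responses: N subjects, each falling in one of c = 3 categories.
labels = np.array([0, 2, 1, 1, 0, 2, 2, 0])   # category index of each subject
N, c = labels.size, 3

# Indicator representation: Y[i, j] = 1 if subject i's response is in category j, else 0.
Y = np.zeros((N, c))
Y[np.arange(N), labels] = 1

assert np.all(Y.sum(axis=1) == 1)   # each row sums to 1, as required
```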
Baseline-Category Logit Model¶
Link Function¶
We construct a multinomial logistic model by pairing each response category with a baseline category, such as the last category \(c\). As in logistic regression, we use a linear predictor to model the log odds:
\[
\log\left(\frac{\pi_{ij}}{\pi_{ic}}\right)=\boldsymbol{x}_{i}^{\top}\boldsymbol{\beta}_{j},\quad j=1,\ldots,c-1.
\]
Note for each category \(j\), there is a separate \(\boldsymbol{\beta}_{j}\), so there are \((c-1)\times p\) parameters in total.
Suppose \(\boldsymbol{\beta} _c = \boldsymbol{0}\). We can obtain
\[
\pi_{ij}=\frac{\exp\left(\boldsymbol{x}_{i}^{\top}\boldsymbol{\beta}_{j}\right)}{1+\sum_{h=1}^{c-1}\exp\left(\boldsymbol{x}_{i}^{\top}\boldsymbol{\beta}_{h}\right)},\quad j=1,\ldots,c-1.
\]
If \(c=2\), this formula simplifies to the probability for logistic regression.
By convention \(\boldsymbol{\beta}_{c}=\boldsymbol{0}\) and \(\pi_{ic}=\frac{1}{1+\sum_{h=1}^{c-1}\exp(\boldsymbol{x}_{i}^{\top}\boldsymbol{\beta}_{h})}\).
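The probability formula above can be evaluated directly. Below is a small NumPy sketch: given a design matrix and a \(p\times(c-1)\) coefficient matrix (with the baseline category's coefficients fixed at zero), it returns the matrix of \(\pi_{ij}\). The function name and array layout are illustrative choices, not a fixed API.

```python
import numpy as np

def baseline_category_probs(X, B):
    """Response probabilities of the baseline-category logit model.

    X : (N, p) design matrix (first column typically the intercept).
    B : (p, c-1) coefficients, one column per non-baseline category;
        the baseline category c has its coefficients fixed at zero.
    Returns an (N, c) matrix with entries pi[i, j].
    """
    eta = X @ B                                        # linear predictors x_i' beta_j
    eta = np.column_stack([eta, np.zeros(len(X))])     # append eta_ic = 0 for the baseline
    eta -= eta.max(axis=1, keepdims=True)              # stabilize exp() numerically
    expo = np.exp(eta)
    return expo / expo.sum(axis=1, keepdims=True)      # pi_ij = exp(eta_ij) / sum_h exp(eta_ih)
```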
Interpretation¶
Recall that there are \((c-1)\times p\) coefficients \(\beta_{jk}\). Interpreting the overall effect of \(\boldsymbol{\beta}_{j}\) is not simple, since in the above formula \(\pi_{ij}\) also depends on the other \(\boldsymbol{\beta}_{h}\). The derivative of \(\pi_{ij}\) w.r.t. the \(k\)-th covariate \(x_{ik}\) is
\[
\frac{\partial\pi_{ij}}{\partial x_{ik}}=\pi_{ij}\left(\beta_{jk}-\sum_{h=1}^{c-1}\pi_{ih}\beta_{hk}\right),
\]
which depends on the \(\beta_{hk}\) from the other \(\boldsymbol{\beta}_{h}\). So the derivative need not have the same sign as \(\beta_{jk}\).
Note that \(\pi_{ij}\) is a function of \(\boldsymbol{x}_{i}\). In fact, the derivative may change sign as \(x_{ik}\) increases (Exercise 6.4).
However, we can interpret the effect of \(\beta_{jk}\) on the odds \(\frac{\pi_{ij}}{\pi_{ic}}\) for setting \(i\). By the link function,
\[
\frac{\pi_{ij}}{\pi_{ic}}=\exp\left(\boldsymbol{x}_{i}^{\top}\boldsymbol{\beta}_{j}\right)=\exp\left(\beta_{j1}x_{i1}+\cdots+\beta_{jp}x_{ip}\right).
\]
If \(x_{ik}\) increases by 1 unit, holding the other covariates fixed, the odds are multiplied by \(\exp(\beta_{jk})\), analogous to the interpretation in logistic regression.
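A tiny numerical check of this interpretation, with made-up coefficients for \(c=3\) categories and one covariate \(x\): increasing \(x\) by one unit multiplies the odds of category 1 versus the baseline by \(\exp(\beta_{1k})\).

```python
import numpy as np

# Hypothetical coefficients: columns are categories 1 and 2; the baseline has beta = 0.
B = np.array([[0.5, -1.0],    # intercepts
              [0.4,  0.7]])   # slopes beta_{1k}, beta_{2k} for the covariate x

def probs(x):
    eta = np.array([B[0, 0] + B[1, 0] * x, B[0, 1] + B[1, 1] * x, 0.0])  # baseline eta = 0
    e = np.exp(eta)
    return e / e.sum()

p0, p1 = probs(x=1.0), probs(x=2.0)
# The odds of category 1 vs. the baseline are multiplied by exp(beta_{1k}) = exp(0.4).
print((p1[0] / p1[2]) / (p0[0] / p0[2]), np.exp(0.4))   # both are approximately 1.4918
```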
Multivariate Exponential Families¶
Multivariate here means \(\boldsymbol{y}\) is a vector. The PDF is
Let \(\boldsymbol{g}\) be a vector of link functions; the multivariate GLM has the form
or equivalently,
Note the \(c\)-th category is redundant.
The baseline-category logit model is a multivariate GLM with
and canonical link functions
Estimation¶
Score Equations¶
The log-likelihood for subject \(i\) is
\[
\log p\left(\boldsymbol{y}_{i}\right)=\sum_{j=1}^{c}y_{ij}\log\pi_{ij}.
\]
Substituting \(\log\left(\frac{\pi_{ij}}{\pi_{ic}}\right)=\boldsymbol{x}_i ^\top \boldsymbol{\beta} _{j}\), the total log-likelihood is
\[
L(\boldsymbol{\beta})=\sum_{i=1}^{N}\left[\sum_{j=1}^{c-1}y_{ij}\left(\boldsymbol{x}_{i}^{\top}\boldsymbol{\beta}_{j}\right)-\log\left(1+\sum_{j=1}^{c-1}\exp\left(\boldsymbol{x}_{i}^{\top}\boldsymbol{\beta}_{j}\right)\right)\right].
\]
The sufficient statistic for \(\beta_{jk}\) is \(\sum_{i=1}^{N}x_{ik}y_{ij}\). In particular, for the intercept \(\beta_{j1}\), it is \(\sum_{i=1}^{N}y_{ij}\), which is the total number of observations in category \(j\).
The partial derivatives are
\[
\frac{\partial L(\boldsymbol{\beta})}{\partial\beta_{jk}}=\sum_{i=1}^{N}\left(y_{ij}-\pi_{ij}\right)x_{ik}
\]
for \(j=1,\ldots,c-1\) and \(k=1,\ldots,p\).
So the score equations are
\[
\sum_{i=1}^{N}x_{ik}y_{ij}=\sum_{i=1}^{N}x_{ik}\hat{\pi}_{ij}
\]
for \(j=1,\ldots,c-1\) and \(k=1,\ldots,p\).
Again, the score equations equate the sufficient statistics to their expected value.
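In matrix form, stacking the terms \((y_{ij}-\pi_{ij})x_{ik}\) gives \(\boldsymbol{X}^{\top}(\boldsymbol{Y}-\boldsymbol{\Pi})\) restricted to the first \(c-1\) columns. The helper below is a sketch of this gradient computation; the names and array shapes are illustrative.

```python
import numpy as np

def score(B, X, Y):
    """Score (gradient) of the baseline-category logit log-likelihood.

    B : (p, c-1) coefficients; X : (N, p) design matrix; Y : (N, c) indicator matrix.
    Returns a (p, c-1) matrix with entries sum_i x_ik (y_ij - pi_ij), i.e. the
    sufficient statistics minus their expected values.
    """
    eta = np.column_stack([X @ B, np.zeros(len(X))])   # eta_ic = 0 for the baseline
    eta -= eta.max(axis=1, keepdims=True)
    P = np.exp(eta)
    P /= P.sum(axis=1, keepdims=True)
    return X.T @ (Y[:, :-1] - P[:, :-1])
```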
Computation¶
We derive the Hessian matrix.
Note that for two coefficients \(\beta_{jk}\) and \(\beta_{jh}\) in the same \(\boldsymbol{\beta}_{j}\),
\[
\frac{\partial^{2}L(\boldsymbol{\beta})}{\partial\beta_{jk}\,\partial\beta_{jh}}=-\sum_{i=1}^{N}x_{ik}x_{ih}\,\pi_{ij}\left(1-\pi_{ij}\right),
\]
while for two coefficients \(\beta_{jk}\) and \(\beta_{hl}\) from different categories \(j\ne h\),
\[
\frac{\partial^{2}L(\boldsymbol{\beta})}{\partial\beta_{jk}\,\partial\beta_{hl}}=\sum_{i=1}^{N}x_{ik}x_{il}\,\pi_{ij}\pi_{ih}.
\]
So the Fisher information matrix consists of \((c-1)^{2}\) blocks of size \(p\times p\), with block \((j,h)\) equal to
\[
\sum_{i=1}^{N}\pi_{ij}\left(\delta_{jh}-\pi_{ih}\right)\boldsymbol{x}_{i}\boldsymbol{x}_{i}^{\top},
\]
where \(\delta_{jh}=1\) if \(j=h\) and \(0\) otherwise.
The Hessian is negative-definite, so the log-likelihood function is concave and has a unique maximum. The observed and expected information are identical, so the Newton-Raphson method is equivalent to Fisher scoring for finding the ML parameter estimates, a consequence of the link function being the canonical one. Convergence is usually fast unless at least one estimate is infinite or does not exist (see Note 6.2).
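Below is a minimal sketch of Newton-Raphson (equivalently Fisher scoring, since the link is canonical) for this model, built around the block structure of the information matrix described above. It assumes ungrouped data and is written for clarity rather than efficiency; the function name and defaults are illustrative.

```python
import numpy as np

def fit_bcl_newton(X, Y, n_iter=25, tol=1e-10):
    """Newton-Raphson for the baseline-category logit model (a teaching sketch).

    X : (N, p) design matrix; Y : (N, c) indicator matrix.
    Returns B, a (p, c-1) coefficient matrix (baseline category c fixed at zero).
    """
    N, p = X.shape
    c = Y.shape[1]
    B = np.zeros((p, c - 1))
    for _ in range(n_iter):
        eta = np.column_stack([X @ B, np.zeros(N)])
        P = np.exp(eta - eta.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)

        # Score vector, stacked over categories j = 1, ..., c-1.
        score = (X.T @ (Y[:, :-1] - P[:, :-1])).T.reshape(-1)

        # Fisher information: (c-1) x (c-1) blocks of size p x p,
        # block (j, h) = X' diag( pi_j * (delta_jh - pi_h) ) X.
        info = np.zeros(((c - 1) * p, (c - 1) * p))
        for j in range(c - 1):
            for h in range(c - 1):
                w = P[:, j] * ((j == h) - P[:, h])
                info[j * p:(j + 1) * p, h * p:(h + 1) * p] = X.T @ (w[:, None] * X)

        step = np.linalg.solve(info, score)            # Newton step
        B = B + step.reshape(c - 1, p).T
        if np.max(np.abs(step)) < tol:                 # stop when the update is negligible
            break
    return B
```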
Hypothesis Testing¶
Test the Effect of \(x_{k}\)¶
Testing the effect of the \(k\)-th covariate means testing all of its coefficients simultaneously, namely \(\beta_{jk}\) for categories \(j=1,2,\ldots,c-1\). There are \(c-1\) constraints
\[
H_{0}:\ \beta_{1k}=\beta_{2k}=\cdots=\beta_{c-1,k}=0.
\]
The likelihood ratio test can be applied, with \(df=c-1\). The likelihood-ratio test statistic equals the difference in the deviance values for comparing the models.
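As an illustration, the test can be carried out with statsmodels' `MNLogit` on simulated data (the data-generating coefficients below are arbitrary). The likelihood-ratio statistic is twice the difference in maximized log-likelihoods of the full and reduced models, referred to a \(\chi^{2}_{c-1}\) distribution.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

# Simulate data with c = 3 categories and two covariates (purely illustrative).
rng = np.random.default_rng(0)
N = 500
x1, x2 = rng.normal(size=N), rng.normal(size=N)
eta = np.column_stack([0.5 + 0.8 * x1 - 0.3 * x2,   # category 0 vs. baseline
                       -0.2 + 0.6 * x2,             # category 1 vs. baseline
                       np.zeros(N)])                 # baseline category
P = np.exp(eta) / np.exp(eta).sum(axis=1, keepdims=True)
y = np.array([rng.choice(3, p=row) for row in P])

# Full model vs. the model that drops x1.
fit_full = sm.MNLogit(y, sm.add_constant(np.column_stack([x1, x2]))).fit(disp=0)
fit_red = sm.MNLogit(y, sm.add_constant(x2)).fit(disp=0)

lr = 2 * (fit_full.llf - fit_red.llf)    # equals the difference in deviances
p_value = stats.chi2.sf(lr, df=2)        # df = c - 1 = 2
print(lr, p_value)
```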
Deviance¶
We use grouped data to summarize goodness of fit. Now \(\boldsymbol{y}_{i}\sim\frac{1}{n_{i}} \operatorname{Multinomial} (n_{i},\boldsymbol{\pi}_{i})\); that is, \(y_{ij}\) is the proportion of the \(n_{i}\) observations at setting \(i\) that fall in category \(j\).
The deviance compares the log-likelihood of a fitted model \(\left\{ \hat{\pi}_{ij}\right\}\) with that of the saturated model \(\left\{ \tilde{\pi}_{ij}=y_{ij}\right\}\), which is
\[
D=2\sum_{i=1}^{N}\sum_{j=1}^{c}n_{i}y_{ij}\log\left(\frac{y_{ij}}{\hat{\pi}_{ij}}\right),
\]
where \(df=\#\text{probabilities}-\#\text{parameters}\). Usually \(\#\text{probabilities}=N(c-1)\) and \(\#\text{parameters}=p(c-1)\).
It satisfies the general form
\[
D=2\sum\text{observed}\times\log\left(\frac{\text{observed}}{\text{fitted}}\right).
\]
For ungrouped data, the above formula remains valid and is used to compare nested unsaturated models.
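A short sketch of the grouped-data deviance, taking observed counts \(n_{i}y_{ij}\) and fitted probabilities \(\hat{\pi}_{ij}\) as inputs; array shapes and names are illustrative.

```python
import numpy as np

def multinomial_deviance(counts, pi_hat):
    """Deviance  D = 2 * sum_ij  n_i y_ij * log( y_ij / pi_hat_ij )  for grouped data.

    counts : (N, c) observed counts n_i * y_ij at each setting i.
    pi_hat : (N, c) fitted probabilities.
    Cells with zero counts contribute 0 (the 0 * log 0 = 0 convention).
    """
    n = counts.sum(axis=1, keepdims=True)
    fitted = n * pi_hat                                   # fitted counts n_i * pi_hat_ij
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(counts > 0, counts * np.log(counts / fitted), 0.0)
    return 2 * terms.sum()
```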
Pearson Statistic¶
The Pearson statistic is
\[
X^{2}=\sum_{i=1}^{N}\sum_{j=1}^{c}\frac{\left(n_{i}y_{ij}-n_{i}\hat{\pi}_{ij}\right)^{2}}{n_{i}\hat{\pi}_{ij}},
\]
where \(df=\#\text{probabilities}-\#\text{parameters}\). Normally \(\#\text{probabilities}=N(c-1)\) and \(\#\text{parameters}=p(c-1)\).
It satisfies the general form
\[
X^{2}=\sum\frac{\left(\text{observed}-\text{fitted}\right)^{2}}{\text{fitted}}.
\]
Note
For both the deviance and the Pearson statistic:
they have approximate chi-squared null distributions when the expected cell counts mostly exceed about 5.
the sums are taken over all \(N\times c\) observed counts \({\left\{ n_{i}y_{ij}\right\} }\). This explains why the sums run over the \(2N\) successes and failures in the deviance and Pearson statistics for logistic regression.
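For completeness, a matching sketch of the Pearson statistic, summing \((\text{observed}-\text{fitted})^{2}/\text{fitted}\) over all \(N\times c\) cells; the inputs are the same illustrative arrays as in the deviance sketch above.

```python
import numpy as np

def pearson_statistic(counts, pi_hat):
    """Pearson X^2 over all N * c cells for grouped multinomial data.

    counts : (N, c) observed counts n_i * y_ij; pi_hat : (N, c) fitted probabilities.
    """
    n = counts.sum(axis=1, keepdims=True)
    fitted = n * pi_hat                      # fitted counts n_i * pi_hat_ij
    return ((counts - fitted) ** 2 / fitted).sum()
```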