Maximum Likelihood Estimator¶
We consider a distribution function \(p(x; \theta)\) of \(x\) parameterized by \(\theta\). Our goal is to construct an estimator for the parameter. The maximum likelihood estimator (MLE), as its name suggests, is an estimator for the parameter constructed by maximizing the likelihood function.
Good example of likelihood principle: 245.ps1.q3
Likelihood Function¶
- Definition (Likelihood Function)
Given \(n\) i.i.d. observations \(\boldsymbol{x} = (x_1, x_2, \ldots, x_n)\), the likelihood function for parameter \(\theta\), denoted \(L(\theta ; \boldsymbol{x})\), is defined as
\[ L(\theta ; \boldsymbol{x})=p(\boldsymbol{x} ; \theta)=\prod_{x \in \boldsymbol{x}} p(x ; \theta) \]
- Definition (Maximum Likelihood Estimator)
The MLE for \(\theta\), denoted \(\theta _ {MLE}\), is defined as
\[\begin{split} \begin{aligned} \theta_{MLE} &=\underset{\theta}{\arg \max } L(\theta ; \boldsymbol{x}) \\ &=\underset{\theta}{\arg \max } \prod_{x \in \boldsymbol{x}} p(x ; \theta) \end{aligned} \end{split}\]
However, it is typically hard to compute the derivative of a product. We instead maximize the log-likelihood.
- Definition (Log-likelihood)
Given \(n\) observations \(\boldsymbol{x} = (x_1, x_2, \ldots, x_n)\), the log-likelihood function for parameter \(\theta\), denoted \(\ell(\theta ; \boldsymbol{x})\), is defined as
\[ \ell(\theta ; \boldsymbol{x})=\log \left( L(\theta ; \boldsymbol{x}) \right) \]
Therefore, since \(\log(\cdot)\) is strictly increasing, the MLE can equivalently be defined as

\[\begin{split} \begin{aligned} \theta_{MLE} &=\underset{\theta}{\arg \max } \, \ell(\theta ; \boldsymbol{x}) \\ &=\underset{\theta}{\arg \max } \sum_{x \in \boldsymbol{x}} \log p(x ; \theta) \end{aligned} \end{split}\]

Equating the derivative with respect to \(\theta\) to zero, we have

\[ \frac{\partial \ell(\theta ; \boldsymbol{x})}{\partial \theta}=\sum_{x \in \boldsymbol{x}} \frac{\partial \log p(x ; \theta)}{\partial \theta}=0 \]

and we can solve for \(\theta_{MLE}\).
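
As a concrete illustration of this recipe, here is a minimal Python sketch (the exponential model, sample size, and rate are my own hypothetical choices, and `scipy` is assumed to be available): it maximizes the log-likelihood numerically and compares the result with the closed form \(\hat{\lambda} = 1/\bar{x}\) obtained by setting the derivative to zero.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical data: n i.i.d. draws from an Exponential distribution with rate 2.0.
rng = np.random.default_rng(0)
x = rng.exponential(scale=1 / 2.0, size=1000)

def neg_log_likelihood(lam):
    # ell(lambda; x) = n*log(lambda) - lambda * sum(x); we minimize its negative.
    return -(len(x) * np.log(lam) - lam * x.sum())

# Numerical maximization of the log-likelihood over lambda > 0.
res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 100.0), method="bounded")

# Closed form from d ell / d lambda = n/lambda - sum(x) = 0  =>  lambda_MLE = 1 / xbar.
print(res.x, 1 / x.mean())   # the two estimates agree
```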
Properties¶
Invariance: if the MLE of \(\theta\) is \(\hat{\theta}\), then the MLE of \(\phi=h(\theta)\) is \(\hat{\phi} = h(\hat{\theta})\), provided that \(h(\cdot)\) is a one-to-one function.
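
A quick numerical check of invariance, as a sketch under my own illustrative assumptions (a Bernoulli model and the odds reparameterization \(\phi = p/(1-p)\), which is one-to-one on \((0,1)\)): maximizing the likelihood directly over \(\phi\) gives the same answer as plugging \(\hat{p}\) into \(h\).

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical Bernoulli sample; the MLE of p is the sample mean.
rng = np.random.default_rng(1)
x = rng.binomial(1, 0.3, size=500)
p_hat = x.mean()

# Reparameterize by the odds phi = p / (1 - p) and maximize the likelihood in phi.
def neg_log_likelihood_odds(phi):
    p = phi / (1 + phi)
    return -(x.sum() * np.log(p) + (len(x) - x.sum()) * np.log(1 - p))

res = minimize_scalar(neg_log_likelihood_odds, bounds=(1e-6, 100.0), method="bounded")

# Invariance: the MLE of the odds equals h(p_hat) = p_hat / (1 - p_hat).
print(res.x, p_hat / (1 - p_hat))
```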
Examples¶
Gaussian¶
The log-likelihood function for a multivariate Gaussian, given \(n\) observations \(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n\), is

\[ \ell(\boldsymbol{\mu}, \boldsymbol{\Sigma} ; \boldsymbol{x})=-\frac{n}{2} \log |2 \pi \boldsymbol{\Sigma}|-\frac{1}{2} \sum_{i=1}^{n}\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}\right)^{\prime} \boldsymbol{\Sigma}^{-1}\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}\right) \]

The MLE for \(\boldsymbol{\mu}\) and \(\boldsymbol{\Sigma}\) are

\[ \hat{\boldsymbol{\mu}}=\bar{\boldsymbol{x}}=\frac{1}{n} \sum_{i=1}^{n} \boldsymbol{x}_{i}, \qquad \hat{\boldsymbol{\Sigma}}=\frac{\boldsymbol{W}}{n} \quad \text{where } \boldsymbol{W}=\sum_{i=1}^{n}\left(\boldsymbol{x}_{i}-\bar{\boldsymbol{x}}\right)\left(\boldsymbol{x}_{i}-\bar{\boldsymbol{x}}\right)^{\prime} \]
Derivation
By the method of completing squares,

\[\begin{split} \begin{aligned} \sum_{i=1}^{n}\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}\right)^{\prime} \boldsymbol{\Sigma}^{-1}\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}\right) &=\sum_{i=1}^{n}\left(\boldsymbol{x}_{i}-\bar{\boldsymbol{x}}\right)^{\prime} \boldsymbol{\Sigma}^{-1}\left(\boldsymbol{x}_{i}-\bar{\boldsymbol{x}}\right)+n(\bar{\boldsymbol{x}}-\boldsymbol{\mu})^{\prime} \boldsymbol{\Sigma}^{-1}(\bar{\boldsymbol{x}}-\boldsymbol{\mu}) \\ &=\operatorname{tr}\left(\boldsymbol{\Sigma}^{-1} \boldsymbol{W}\right)+n(\bar{\boldsymbol{x}}-\boldsymbol{\mu})^{\prime} \boldsymbol{\Sigma}^{-1}(\bar{\boldsymbol{x}}-\boldsymbol{\mu}) \end{aligned} \end{split}\]

since the cross terms sum to zero. We therefore have

\[ \ell(\boldsymbol{\mu}, \boldsymbol{\Sigma} ; \boldsymbol{x})=-\frac{n}{2} \log |2 \pi \boldsymbol{\Sigma}|-\frac{1}{2} \operatorname{tr}\left(\boldsymbol{\Sigma}^{-1} \boldsymbol{W}\right)-\frac{n}{2}(\bar{\boldsymbol{x}}-\boldsymbol{\mu})^{\prime} \boldsymbol{\Sigma}^{-1}(\bar{\boldsymbol{x}}-\boldsymbol{\mu}) \]

Now only the last term involves \(\boldsymbol{\mu}\). Since \((\overline{\boldsymbol{x}}-\boldsymbol{\mu})^{\prime} \boldsymbol{\Sigma}^{-1}(\overline{\boldsymbol{x}}-\boldsymbol{\mu}) \geq 0\) with equality iff \(\boldsymbol{\mu} = \bar{\boldsymbol{x}}\), we have \(\hat{\boldsymbol{\mu}} = \bar{\boldsymbol{x}}\). Now maximizing \(\ell(\bar{\boldsymbol{x}}, \boldsymbol{\Sigma} ; \boldsymbol{x})\) over \(\boldsymbol{\Sigma}\) is equivalent to minimizing

\[ g(\boldsymbol{\Sigma})=n \log |\boldsymbol{\Sigma}|+\operatorname{tr}\left(\boldsymbol{\Sigma}^{-1} \boldsymbol{W}\right) \]
Since \(\boldsymbol{\Sigma}\) and \(\boldsymbol{W} /n\) are p.d., the function \(g(\boldsymbol{\Sigma})\) attains its minimum at \(\boldsymbol{\Sigma} = \boldsymbol{W} /n\).
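
To make the Gaussian result concrete, here is a small NumPy sketch (the dimension, true parameters, and sample size are arbitrary illustrative choices): it forms the sample mean, the scatter matrix \(\boldsymbol{W}\), and \(\hat{\boldsymbol{\Sigma}} = \boldsymbol{W}/n\), and checks that this matches NumPy's biased covariance (division by \(n\) rather than \(n-1\)).

```python
import numpy as np

# Hypothetical 2-dimensional Gaussian sample.
rng = np.random.default_rng(2)
mu_true = np.array([1.0, -1.0])
Sigma_true = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=2000)   # shape (n, d)

n = X.shape[0]
mu_hat = X.mean(axis=0)            # sample mean = MLE of mu

centered = X - mu_hat
W = centered.T @ centered          # scatter matrix W
Sigma_hat = W / n                  # MLE of Sigma (divides by n, not n - 1)

# np.cov with bias=True also divides by n, so it matches the MLE.
print(np.allclose(Sigma_hat, np.cov(X, rowvar=False, bias=True)))
```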