Computation Issues¶
Problem¶
Given a probability distribution \(p(x_1, \ldots, x_n)\), two natural questions are to
compute marginal distributions \(p(x_i)\) (inference).
compute mean \(\mathbb{E} [\boldsymbol{x}]\).
Both computations involve integration in high dimensions. For an arbitrary distribution \(p\), these are difficult; MCMC is one approximation method.
Alternatively, we can convert these integration problems into optimization problems, and then solve them with optimization algorithms. Instead of doing integration, where we ‘enumerate’ every single possible \(x\), after converting to an optimization problem we may have an efficient greedy or dynamic-programming algorithm.
Exponential Random Graphs¶
In exponential random graphs, the joint probability comes from an exponential family
\[ p_{\boldsymbol{\theta}}(\boldsymbol{x}) = \exp \left( \boldsymbol{\theta} ^{\top} \boldsymbol{\phi} (\boldsymbol{x}) - b(\boldsymbol{\theta}) \right) \]
where
\(\boldsymbol{\theta}\) are the parameters of the distribution, in parameter space \(\Omega\)
\(\boldsymbol{\phi} (\boldsymbol{x})\) is a function of \(\boldsymbol{x} = (x_1, \ldots, x_n)\). Its entries can be some functions of \((x_1, \ldots, x_n)\), e.g.
\[ \boldsymbol{\phi} (\boldsymbol{x} ) = [f_1(x_1), f_{23}(x_2, x_3), \ldots] \]
Usually \(f\) corresponds to some object in the graph, e.g. \(f_1\) is node-level, \(f_{23}\) is edge-level, and so on.
\(b(\boldsymbol{\theta})\) is a normalizing constant/function such that
\[ \int p_{\theta}(\boldsymbol{x})\mathrm{~d} \boldsymbol{x}= \frac{1}{\exp(b(\boldsymbol{\theta}))} \int \exp (\boldsymbol{\theta} ^{\top} \boldsymbol{\phi} (\boldsymbol{x} )) \mathrm{~d} \boldsymbol{x} = 1 \]
The denominator \(\exp(b(\boldsymbol{\theta}))\) is also called the partition function, and \(b(\boldsymbol{\theta})\) is often called the log-partition function.
To compute \(b(\boldsymbol{\theta})\), we need integration, which is sometimes intractable.
Equivalently, we can write the distribution \(p_\theta(\boldsymbol{x})\) as
\[ p_{\theta}(\boldsymbol{x}) = \frac{1}{\exp(b(\boldsymbol{\theta}))} \exp \left( \boldsymbol{\theta} ^{\top} \boldsymbol{\phi} (\boldsymbol{x}) \right) \]
Examples
Multivariate Gaussian \(\boldsymbol{x} \in \mathbb{R} ^{n}\)
\[ p _\theta (\boldsymbol{x} ) \propto \exp (- \boldsymbol{x} ^{\top} \Theta \boldsymbol{x}) \]
1-d Bernoulli \(x \in \left\{ 0, 1 \right\}\) parameterized by \(\lambda\).
\[ p_{\theta }(x) = \exp ( x\underbrace{\ln \frac{\lambda}{1-\lambda}}_{\theta}+\ln (1-\lambda)) = \frac{1}{\exp (b(\theta ))} \exp (\theta x) \]
The mean is \(\lambda_\theta=\frac{e^\theta}{e^\theta + 1}\).
The normalizing function is \(b(\theta) = -\ln (1-\lambda) = \ln ( 1 + e ^\theta)\).
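As a quick numerical sanity check (a minimal sketch; the helper name `bernoulli_exp_family` is our own, not from the notes), we can verify that \(b(\theta) = \ln(1 + e^\theta)\) normalizes the distribution and that the mean is \(e^\theta / (1 + e^\theta)\):

```python
import math

def bernoulli_exp_family(theta):
    """Return (p(0), p(1)) from the exponential-family form p(x) = exp(theta*x - b(theta))."""
    b = math.log(1 + math.exp(theta))      # log-partition b(theta) = ln(1 + e^theta)
    return math.exp(-b), math.exp(theta - b)

lam = 0.3                                  # an arbitrary success probability
theta = math.log(lam / (1 - lam))          # natural parameter theta = ln(lam / (1 - lam))
p0, p1 = bernoulli_exp_family(theta)
assert abs(p0 + p1 - 1) < 1e-12            # b(theta) indeed normalizes
assert abs(p1 - lam) < 1e-12               # mean e^theta / (1 + e^theta) recovers lam
```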
For other distributions, computing mean via computing \(b(\boldsymbol{\theta})\) is difficult. Can we turn computing \(b(\boldsymbol{\theta})\) to an optimization problem?
Conjugate Functions¶
- Definition (conjugate function)
The conjugate function \(f^* (\lambda)\) of a function \(f(\theta)\) is defined as
\[ f^*(\lambda) = \sup_\theta \left\{ \langle \theta, \lambda \rangle - f(\theta) \right\} \]
where \(\lambda \in \mathbb{R}\).
- Properties
\(f^*(\lambda)\) is convex.
Proof
\[\begin{split}\begin{aligned} f^*(t \lambda_1 + (1-t) \lambda _2) &= \sup_\theta \left\{ t \langle \theta, \lambda_1 \rangle - t f(\theta) + (1 - t) \langle \theta, \lambda_2 \rangle - (1-t) f(\theta) \right\}\\ &\le t \sup_\theta \left\{ \langle \theta, \lambda_1 \rangle - f(\theta) \right\} + (1-t) \sup_\theta \left\{ \langle \theta, \lambda_2 \rangle - f(\theta) \right\}\\ &= t f ^* (\lambda_1) + (1 - t)f ^* (\lambda_2)\\ \end{aligned}\end{split}\]
If \(f\) is convex, then \(f=(f^*)^*\). In other words,
\[ f(\theta) = \sup_\lambda \left\{ \langle \theta, \lambda \rangle - f^* (\lambda)\right\} \]
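To make the definition concrete, here is a grid-based numerical sketch (the helper `conjugate` and the grid resolution are our own choices) using \(f(\theta) = \theta^2\), whose conjugate has the closed form \(f^*(\lambda) = \lambda^2 / 4\):

```python
import numpy as np

def conjugate(f, grid):
    """f*(lam) = sup_theta { theta*lam - f(theta) }, approximated by a max over a grid."""
    return lambda lam: np.max(grid * lam - f(grid))

theta_grid = np.linspace(-10, 10, 200001)
f = lambda t: t**2                       # convex, so also f = (f*)*
f_star = conjugate(f, theta_grid)
assert abs(f_star(3.0) - 9 / 4) < 1e-6   # closed form: f*(lam) = lam^2 / 4
```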
It can be shown that the normalizing function \(b(\boldsymbol{\theta})\) is convex. Hence,
\[ b(\boldsymbol{\theta}) = \sup_{\boldsymbol{\lambda}} \left\{ \langle \boldsymbol{\theta}, \boldsymbol{\lambda} \rangle - b^*(\boldsymbol{\lambda}) \right\} \]
Thus, we turn computing \(b(\boldsymbol{\theta})\) to an optimization problem over \(\boldsymbol{\lambda}\).
Example: Bernoulli
In the Bernoulli case, the normalizing function is \(b(\theta) = \ln ( 1 + e ^\theta)\), so its conjugate function is
\[ b^*(\lambda) = \sup_\theta \left\{ \theta \lambda - \ln (1 + e^\theta) \right\} \]
Taking the derivative and setting it to 0 gives
\[ \lambda = \frac{e^\theta}{1 + e^\theta} \quad \Rightarrow \quad \theta^+ = \ln \frac{\lambda}{1 - \lambda} \]
Substituting the optimizer \(\theta^+\) gives the conjugate function
\[ b^*(\lambda) = \lambda \ln \lambda + (1 - \lambda) \ln (1 - \lambda) \]
On the other hand, since \(b(\theta)\) is convex, we have \(b = (b^*)^*\), i.e.
\[ b(\theta) = \sup_\lambda \left\{ \theta \lambda - b^*(\lambda) \right\} \]
Taking the derivative and setting it to 0 gives
\[ \theta = \ln \frac{\lambda}{1 - \lambda} \quad \Rightarrow \quad \lambda^+ = \frac{e^\theta}{1 + e^\theta} \]
Substituting the optimizer gives \(b(\theta) = \ln ( 1 + e^\theta)\).
The above proof-by-example shows how to compute \(b(\theta)\) via an optimization problem. We can also observe that
The optimizer \(\lambda ^+= \frac{e ^\theta}{1 + e^\theta}\) is the mean of the Bernoulli distribution
The conjugate function \(b^*(\lambda)\) is the negative entropy of the Bernoulli distribution
\[b^*(\lambda) = -\operatorname{H} (p_\lambda) = - \sum_x p_\lambda(x) \ln p_\lambda(x)\]
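This can be checked numerically (a grid-based sketch, our own code, not from the notes): the sup-based conjugate of \(b(\theta) = \ln(1 + e^\theta)\) matches the negative Bernoulli entropy:

```python
import numpy as np

theta_grid = np.linspace(-30, 30, 600001)
b_grid = np.log1p(np.exp(theta_grid))            # b(theta) = ln(1 + e^theta) on a grid

def b_star(lam):
    """Conjugate b*(lam) = sup_theta { theta*lam - b(theta) }, grid-approximated."""
    return np.max(theta_grid * lam - b_grid)

lam = 0.3
neg_entropy = lam * np.log(lam) + (1 - lam) * np.log(1 - lam)
assert abs(b_star(lam) - neg_entropy) < 1e-6     # b* is the negative Bernoulli entropy
```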
In general, these two statements always hold:
The optimizer \(\lambda ^+\) is the mean of the distribution
\[ \lambda^+ = \mathbb{E} [\boldsymbol{x}] \]
The conjugate function \(b^*(\lambda)\) is the negative entropy of the distribution
\[b^*(\lambda) = -\operatorname{H} (p_\lambda)\]
Proof
Recall the definition
\[ b^*(\boldsymbol{\lambda}) = \sup_{\boldsymbol{\theta}} \left\{ \langle \boldsymbol{\theta}, \boldsymbol{\lambda} \rangle - b(\boldsymbol{\theta}) \right\} \]
Hence setting the first-order derivative to 0 gives
\[ \boldsymbol{\lambda} = \nabla b(\boldsymbol{\theta}) = \mathbb{E}_{\boldsymbol{\theta}} [\boldsymbol{\phi} (\boldsymbol{x})] \]
Hence \(\boldsymbol{\lambda}\) is the mean of \(\boldsymbol{\phi} (\boldsymbol{x})\) w.r.t. distribution \(\boldsymbol{x} \sim p_{\boldsymbol{\theta}}\). This equation gives a relation between \(\boldsymbol{\lambda}\) and \(\boldsymbol{\theta}\).
Suppose (as is indeed the case) this relation is one-to-one; then we can solve for \(\boldsymbol{\theta} = \boldsymbol{\theta} (\boldsymbol{\lambda})\). The maximum is then
\[ b^*(\boldsymbol{\lambda}) = \langle \boldsymbol{\theta}(\boldsymbol{\lambda}), \boldsymbol{\lambda} \rangle - b(\boldsymbol{\theta}(\boldsymbol{\lambda})) = \mathbb{E}_{\boldsymbol{\theta}(\boldsymbol{\lambda})} [\ln p_{\boldsymbol{\theta}(\boldsymbol{\lambda})}(\boldsymbol{x})] \]
which is the expected log-likelihood, i.e. the negative entropy.
Then substituting this result into the optimization problem for \(b(\boldsymbol{\theta})\)
Claim:
\[ b(\boldsymbol{\theta}) = \sup_{p \in \mathcal{P}} \left\{ \langle \boldsymbol{\theta}, \boldsymbol{\lambda} \rangle - b^*(\boldsymbol{\lambda}) \right\} \]
where \(\mathcal{P}\) is the set of distributions over \((x_1, \ldots, x_n)\) and \(\boldsymbol{\lambda} = \mathbb{E}_p [\boldsymbol{\phi} (\boldsymbol{x})]\).
Two claims to justify.
\(\boldsymbol{\lambda} = \mathbb{E} _{\boldsymbol{\theta}}[\boldsymbol{\phi} (\boldsymbol{x} )]\) defines a 1-1 map between \(\boldsymbol{\theta}\) and \(\boldsymbol{\lambda}\) \(\Leftrightarrow\) Exponential family is minimal
The image of \(\boldsymbol{\lambda} = \mathbb{E} _{p _\theta}[\boldsymbol{\phi} (\boldsymbol{x})]\) over exponential-family distributions \(p_{\theta}\) is the interior of \(M:=\left\{ \boldsymbol{\lambda} \mid \boldsymbol{\lambda} = \mathbb{E} _p[ \boldsymbol{\phi} (\boldsymbol{x} )], p \in \mathcal{P} \right\}\)
Proposition:
\[ \langle \boldsymbol{\theta}, \boldsymbol{\lambda} \rangle - b^*(\boldsymbol{\lambda}) \le b(\boldsymbol{\theta}) \]
for all \(\boldsymbol{\lambda} \in \operatorname{Int}(M)\), with equality when \(\boldsymbol{\lambda} = \mathbb{E}_{\theta} [\boldsymbol{\phi} (\boldsymbol{x} )]\)
Key: when \(f\) is strictly convex, \(y = \partial f(x) \Leftrightarrow x = \partial f ^* (y)\). Thus the supremum is attained at \(\boldsymbol{\lambda} = \nabla b(\boldsymbol{\theta}) = \mathbb{E}_{\theta}[\boldsymbol{\phi}(\boldsymbol{x})]\), where equality holds.
Any mean parameter \(\mathbb{E}_{p} [\boldsymbol{\phi} (\boldsymbol{x} )]\) for \(p \in \mathcal{P}\) gives a lower-bound of \(b(\boldsymbol{\theta})\). But \(\mathcal{P}\) has exponential size. Can we find a smaller set \(\tilde{\mathcal{P}} \subset \mathcal{P}\)?
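The lower-bound property is easy to see numerically for the Bernoulli case (a sketch with an arbitrary \(\theta\), our own code): every \(\lambda \in (0,1)\) gives \(\theta\lambda - b^*(\lambda) \le b(\theta)\), with the maximum attained at the sigmoid mean:

```python
import numpy as np

theta = 1.3
b = np.log1p(np.exp(theta))                           # b(theta) = ln(1 + e^theta)
lam = np.linspace(1e-6, 1 - 1e-6, 10001)
b_star = lam * np.log(lam) + (1 - lam) * np.log(1 - lam)  # negative Bernoulli entropy
bound = theta * lam - b_star
assert np.all(bound <= b + 1e-12)                     # every mean parameter lower-bounds b
best = lam[np.argmax(bound)]
assert abs(best - 1 / (1 + np.exp(-theta))) < 1e-3    # maximizer is the sigmoid mean
```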
Mean-field Approximation¶
Assume mutual independence among the variables, then
\[ p(\boldsymbol{x}) = \prod_i p_i(x_i) \]
Thus, the entropy decomposes as \(\operatorname{H}(p) = \sum_i \operatorname{H}(p_i)\).
Consider an Ising model
\[ p(\boldsymbol{x}) \propto \exp \Big( \sum_i \theta_i x_i + \sum_{(i,j)} \theta_{ij} x_i x_j \Big) \]
where \(x_i \in \left\{ 0,1 \right\}\). We have \(\mu_i := \mathbb{E}[x_i]\), and under the independence assumption \(\mathbb{E}[x_i x_j] = \mu_i \mu_j\). Hence
\[ b(\boldsymbol{\theta}) \ge \sum_i \theta_i \mu_i + \sum_{(i,j)} \theta_{ij} \mu_i \mu_j + \sum_i \operatorname{H}(p_i) \]
where \(\operatorname{H} (p_i) = -\left[ \mu _i \log \mu _i + (1 - \mu_i) \log (1- \mu _i) \right]\).
Let \(A_{MF}(\boldsymbol{\mu})\) denote this objective function. Setting \(\partial A_{MF} / \partial \mu_i = 0\) gives the coordinate update
\[ \mu_i \leftarrow \sigma \Big( \theta_i + \sum_j \theta_{ij} \mu_j \Big) \]
where \(\sigma(z) = \frac{1}{1 + \exp(-z)}\) is the sigmoid function.
Repeat until convergence.
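The updates above can be sketched as coordinate-ascent code (a minimal illustration with made-up parameters; `mean_field_ising` is our own helper, not from the notes):

```python
import numpy as np

def mean_field_ising(theta, Theta, iters=200):
    """Naive mean field for p(x) ∝ exp(sum_i theta_i x_i + sum_{i<j} Theta_ij x_i x_j),
    x_i in {0,1}. Theta is symmetric with zero diagonal."""
    sigmoid = lambda z: 1 / (1 + np.exp(-z))
    mu = np.full(len(theta), 0.5)                    # initialize all means at 1/2
    for _ in range(iters):
        for i in range(len(theta)):
            mu[i] = sigmoid(theta[i] + Theta[i] @ mu)   # mu_i <- sigma(theta_i + sum_j Theta_ij mu_j)
    return mu

theta = np.array([0.5, -0.2, 0.1])
Theta = np.array([[0, 1.0, 0], [1.0, 0, 0.5], [0, 0.5, 0]])
mu = mean_field_ising(theta, Theta)                  # approximate marginals E[x_i]
```

After convergence `mu` is a fixed point of the sigmoid update, i.e. an approximate vector of marginals \(\mathbb{E}[x_i]\).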
Another Look¶
Consider some distribution \(p(\boldsymbol{x})\). We want to approximate \(p(\boldsymbol{x} )\) with a tractable distribution \(b(\boldsymbol{x})\). A measure of goodness of approximation is the KL divergence
Write \(p(\boldsymbol{x}) = \frac{1}{Z} \exp(\boldsymbol{\theta} ^{\top} \boldsymbol{\phi} (\boldsymbol{x}))\). Define the energy function
\[ E(\boldsymbol{x}) = - \boldsymbol{\theta} ^{\top} \boldsymbol{\phi} (\boldsymbol{x}) \]
Then we can write
\[ p(\boldsymbol{x}) = \frac{1}{Z} \exp(-E(\boldsymbol{x})) \]
Hence the KL divergence is
\[ \operatorname{KL}(b \,\|\, p) = \mathbb{E}_b [E(\boldsymbol{x})] - \operatorname{H}(b) + \ln Z \]
Let \(\mathcal{B}\) be the space of distributions. Our objective is then
\[ \min_{b \in \mathcal{B}} \operatorname{KL}(b \,\|\, p) = \min_{b \in \mathcal{B}} \left\{ - \boldsymbol{\theta} ^{\top} \boldsymbol{\mu} - \operatorname{H}(b) \right\} + \ln Z \]
where \(\boldsymbol{\mu} = \mathbb{E} _b[\boldsymbol{\phi} (\boldsymbol{x})]\).
The objective can also be written as
\[ \min_{b \in \mathcal{B}} \; \mathbb{E}_b [E(\boldsymbol{x})] - \operatorname{H}(b) \]
Hence we want to minimize average energy, and maximize entropy.
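The decomposition of KL into average energy, entropy, and \(\ln Z\) can be verified by brute-force enumeration on a tiny model (a sketch with made-up parameters, our own code):

```python
import itertools, math

# Tiny 2-variable model p(x) ∝ exp(theta^T phi(x)) with phi(x) = (x1, x2, x1*x2)
theta = [0.3, -0.4, 0.8]
phi = lambda x: (x[0], x[1], x[0] * x[1])
states = list(itertools.product([0, 1], repeat=2))
energy = lambda x: -sum(t * f for t, f in zip(theta, phi(x)))   # E(x) = -theta^T phi(x)
Z = sum(math.exp(-energy(x)) for x in states)
p = {x: math.exp(-energy(x)) / Z for x in states}

# For any trial distribution b: KL(b||p) = E_b[E] - H(b) + ln Z
b = {(0, 0): .1, (0, 1): .2, (1, 0): .3, (1, 1): .4}
kl = sum(b[x] * math.log(b[x] / p[x]) for x in states)
avg_E = sum(b[x] * energy(x) for x in states)
H_b = -sum(b[x] * math.log(b[x]) for x in states)
assert abs(kl - (avg_E - H_b + math.log(Z))) < 1e-12
```

Since \(\ln Z\) does not depend on \(b\), minimizing the KL is the same as minimizing average energy minus entropy.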
Consider the mean-field family \(b(\boldsymbol{x}) = \prod_i b_i(x_i)\), with a pairwise model for \(p\):
\(p(\boldsymbol{x}) = \prod_{i,j} \psi_{ij}(x_i, x_j)/Z\)
\(E(\boldsymbol{x}) = -\sum_{i,j} \ln \psi_{ij}(x_i, x_j) = \sum_{ij}E_{ij}(x_i, x_j)\) where \(E_{ij} := -\ln \psi_{ij}\)
The objective is then
\[ \min \; \sum_{(i,j)} \sum_{x_i, x_j} b_{ij}(x_i, x_j) E_{ij}(x_i, x_j) - \operatorname{H}(b) \]
where \(b_{ij}(x_i, x_j)\) is the pairwise marginal of \(b\).
Relaxation:
Change the objective of optimization from \(b\) to \(\left\{ b_{ij} \right\}_{ij}\) and \(\left\{ b_i \right\}_{i}\).
Enforce a smaller set of constraints
New constraints
Distribution constraint
\(b_{ij} \ge 0, \boldsymbol{1} ^{\top} b_{ij} \boldsymbol{1} = 1\)
\(b_{i} \ge 0, \boldsymbol{1} ^{\top} b_{i} = 1\)
Consistency constraint
\(b_{ij}\boldsymbol{1} = b_i\)
\(b_{ij} ^{\top} \boldsymbol{1} = b_j\)
The relaxed problem is convex.
Bethe approximation:
\[ \operatorname{H}_{\text{Bethe}}(b) = \sum_{(i,j)} \operatorname{H}(b_{ij}) - \sum_i (q_i - 1) \operatorname{H}(b_i) \]
where \(q_i\) is the number of neighbors of node \(i\). The second term corrects the over-counting of node entropies.
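On a tree, the Bethe entropy is exact. Here is a small brute-force check on a 3-node chain (our own example, with arbitrary potentials):

```python
import itertools, math

# Chain x1 - x2 - x3 (a tree): p(x) ∝ psi12(x1,x2) * psi23(x2,x3)
psi12 = {(a, b): math.exp(0.7 * a * b + 0.1 * a) for a in (0, 1) for b in (0, 1)}
psi23 = {(b, c): math.exp(-0.4 * b * c + 0.2 * c) for b in (0, 1) for c in (0, 1)}
states = list(itertools.product((0, 1), repeat=3))
w = {x: psi12[x[0], x[1]] * psi23[x[1], x[2]] for x in states}
Z = sum(w.values())
p = {x: w[x] / Z for x in states}

def H(dist):
    """Entropy of a dict of probabilities."""
    return -sum(q * math.log(q) for q in dist.values() if q > 0)

# Exact marginals
b12 = {(a, b): sum(p[a, b, c] for c in (0, 1)) for a in (0, 1) for b in (0, 1)}
b23 = {(b, c): sum(p[a, b, c] for a in (0, 1)) for b in (0, 1) for c in (0, 1)}
b2 = {b: sum(p[a, b, c] for a in (0, 1) for c in (0, 1)) for b in (0, 1)}

# Bethe entropy: sum over edges H(b_ij) minus sum_i (q_i - 1) H(b_i); only node 2 has q_i = 2
H_bethe = H(b12) + H(b23) - H(b2)
assert abs(H_bethe - H(p)) < 1e-12    # exact on a tree
```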
Belief Propagation¶
Define the Gibbs free energy
\[ G(b) = \sum_{(i,j)} \sum_{x_i, x_j} b_{ij}(x_i, x_j) E_{ij}(x_i, x_j) - \operatorname{H}_{\text{Bethe}}(b) \]
Then minimize
\[ \min_{\{b_{ij}\}, \{b_i\}} G(b) \]
subject to the distribution constraints and consistency constraints.
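The section is left unfinished in the notes; as a hedged illustration of what the resulting sum-product message passing computes (the 3-node chain and all names are our own), BP beliefs on a tree recover exact marginals:

```python
import itertools
import numpy as np

# Sum-product BP on a 3-node chain of binary variables with pairwise potentials.
rng = np.random.default_rng(0)
psi = [np.exp(rng.normal(size=(2, 2))) for _ in range(2)]  # psi[0]: (x1,x2), psi[1]: (x2,x3)

# Forward messages (left to right)
m_fwd = [np.ones(2)]                        # trivial message into node 1
for k in range(2):
    m_fwd.append(psi[k].T @ m_fwd[-1])      # m(x_{k+1}) = sum_{x_k} psi(x_k, x_{k+1}) m(x_k)

# Backward messages (right to left)
m_bwd = [np.ones(2)]                        # trivial message into node 3
for k in (1, 0):
    m_bwd.insert(0, psi[k] @ m_bwd[0])

marg2 = m_fwd[1] * m_bwd[1]                 # belief at node 2: product of incoming messages
marg2 /= marg2.sum()

# Brute-force check against the exact marginal
joint = np.zeros((2, 2, 2))
for a, b, c in itertools.product((0, 1), repeat=3):
    joint[a, b, c] = psi[0][a, b] * psi[1][b, c]
joint /= joint.sum()
assert np.allclose(marg2, joint.sum(axis=(0, 2)))
```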
…
Tensor Network¶
In graphical models, we use nodes to represent variables and edges to represent functions. There is another representation: the matrix product state tensor network (c. 1990), aka tensor train (2011). Advantage: we can perform algebraic operations on functions.
Note that in a tensor network it is more convenient to represent functions by nodes and variables by edges.
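As a small illustration of this advantage (our own example, not from the notes): if a function \(f(x_1, x_2, x_3)\) is stored as a matrix product (tensor train), summing it over all assignments reduces to a few small matrix products instead of enumerating all states:

```python
import itertools
import numpy as np

# Matrix product state / tensor train: f(x1,x2,x3) = A1[x1] @ A2[x2] @ A3[x3]
rng = np.random.default_rng(1)
A1 = rng.random((2, 1, 3))   # A1[x] is a 1x3 matrix (left boundary core)
A2 = rng.random((2, 3, 3))   # A2[x] is a 3x3 matrix
A3 = rng.random((2, 3, 1))   # A3[x] is a 3x1 matrix (right boundary core)

f = lambda x: (A1[x[0]] @ A2[x[1]] @ A3[x[2]])[0, 0]

# Summing f over all 2^3 assignments: contract each core's physical index first
M1, M2, M3 = A1.sum(axis=0), A2.sum(axis=0), A3.sum(axis=0)
total = (M1 @ M2 @ M3)[0, 0]
brute = sum(f(x) for x in itertools.product((0, 1), repeat=3))
assert abs(total - brute) < 1e-10
```

The same contraction idea is what makes marginalization cheap for functions in tensor-train form.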