Application to Density Fitting¶
Discriminators and Generators¶
Discriminator¶
Consider two \(p\)-dimensional random variables \(X \sim P_X\) and \(Y \sim P_Y\). We want to find a discriminator function \(D\) that separates the two distributions, via the objective

\[
\max_D \; \mathbb{E}_{X \sim P_X}[D(X)] - \mathbb{E}_{Y \sim P_Y}[D(Y)]
\]

where \(D\) satisfies certain constraints, e.g. such that the objective above is bounded.
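As an illustration, here is a minimal sketch (not from the notes) of estimating this objective by Monte Carlo for a fixed discriminator; the two Gaussian sample sets and the two-layer network standing in for \(D_\phi\) are assumptions made only for the example.

```python
import torch

torch.manual_seed(0)

# Two p-dimensional sample sets standing in for X ~ P_X and Y ~ P_Y
# (hypothetical Gaussians with shifted means, chosen only for illustration).
p, n = 2, 1000
X = torch.randn(n, p) + 1.0
Y = torch.randn(n, p) - 1.0

# A small neural-network discriminator D_phi : R^p -> R.
D_phi = torch.nn.Sequential(
    torch.nn.Linear(p, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1)
)

# Monte Carlo estimate of E_{P_X}[D(X)] - E_{P_Y}[D(Y)]; training would
# maximize this over phi subject to a boundedness/Lipschitz constraint on D.
objective = D_phi(X).mean() - D_phi(Y).mean()
print(objective.item())
```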
Generators¶
Consider a random variable \(X \sim P_X\), where \(P_X\) is the push-forward measure of a base measure \(P_0\) on \(\mathcal{Z}\) under a transformation \(T: \mathcal{Z} \rightarrow \mathcal{X}\).

In this case, \(T\) is called a generator. Usually the spaces \(\mathcal{Z}\) and \(\mathcal{X}\) are high-dimensional. A benefit of representing \(P_X\) as a pushforward measure is that we can compute expectations of \(X\) efficiently.

We can represent \(D\) and \(T\) as neural networks \(D_\phi\), \(T_\psi\) with parameters \(\phi, \psi\).
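A minimal sketch of the pushforward construction (the standard-normal base measure \(P_0\) and the small untrained network standing in for \(T_\psi\) are illustrative assumptions): sampling \(X \sim P_X\) amounts to sampling \(Z \sim P_0\) and applying \(T_\psi\), and expectations become cheap Monte Carlo averages.

```python
import torch

torch.manual_seed(0)

# Base measure P_0 on Z: a standard normal in latent dimension d.
d, p, n = 4, 2, 10_000
Z = torch.randn(n, d)

# A generator T_psi : Z -> X, here a small untrained network (illustrative only).
T_psi = torch.nn.Sequential(
    torch.nn.Linear(d, 32), torch.nn.Tanh(), torch.nn.Linear(32, p)
)

# Samples from the pushforward measure P_X (the law of T_psi(Z)).
with torch.no_grad():
    X = T_psi(Z)

# Expectations under P_X are Monte Carlo averages over base samples:
# E[f(X)] ~= (1/n) sum_i f(T_psi(z_i)), here with f(x) = ||x||^2.
print(X.pow(2).sum(dim=1).mean().item())
```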
Normalizing Flow¶
If \(T\) is invertible, then

\[
P_X(x) = P_0\big(T^{-1}(x)\big)\,\big|\det J_{T^{-1}}(x)\big|
\]

where \(J_{T^{-1}}\) is the Jacobian of \(T^{-1}\).
To sample \(X \sim P_X\), we can sample \(Z \sim P_0\) on \(\mathcal{Z}\) and set \(X = T(Z)\). To compute the expectation of a function \(f(X)\), we can use the approximation

\[
\mathbb{E}_{X \sim P_X}[f(X)] \approx \frac{1}{n} \sum_{i=1}^{n} f\big(T(z_i)\big), \qquad z_i \overset{\text{iid}}{\sim} P_0 .
\]
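For concreteness, here is a sketch with an invertible affine map \(T(z) = Az + b\) (chosen because its inverse and Jacobian are available in closed form; the specific \(A\), \(b\) and base measure are arbitrary choices): the change-of-variables formula gives the exact density of \(X = T(Z)\), and expectations are approximated by Monte Carlo over base samples.

```python
import torch

torch.manual_seed(0)
p = 2

# Invertible affine transformation T(z) = A z + b with A nonsingular.
A = torch.tensor([[2.0, 0.5], [0.0, 1.5]])
b = torch.tensor([1.0, -1.0])
base = torch.distributions.MultivariateNormal(torch.zeros(p), torch.eye(p))

def T(z):            # generator
    return z @ A.T + b

def T_inv(x):        # inverse map T^{-1}
    return (x - b) @ torch.inverse(A).T

def log_density_X(x):
    # log P_X(x) = log P_0(T^{-1}(x)) + log |det J_{T^{-1}}(x)|
    # For an affine map, J_{T^{-1}} = A^{-1}, so log|det| = -log|det A|.
    return base.log_prob(T_inv(x)) - torch.logdet(A)

# Sampling from P_X and approximating E[f(X)] with f(x) = ||x||:
z = base.sample((10_000,))
x = T(z)
print(x.norm(dim=1).mean().item())   # Monte Carlo estimate of E_{P_X}[||X||]
print(log_density_X(x[:3]))          # exact log-densities via the flow
```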
Density fitting¶
Given a sample distribution \(P_X\), we want to fit it with a pushforward distribution \(P_T\) by minimizing a distance between the two. The distance measure can be the KL divergence or the 1-Wasserstein distance.
Note that we can write \(P_X(x)\) as a (generalized) density using delta functions and the observations \(x_1, x_2, \ldots, x_n\):

\[
P_X(x) = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}(x)
\]

where \(\delta_{x_i}(x) = \delta(x - x_i)\). A useful property of the delta function is that \(\int \delta_{x_i}(x)\, g(x) \,\mathrm{d} x = g(x_i)\).
By KL Divergence¶
The objective is

\[
\min_T \; \mathrm{KL}(P_X \,\Vert\, P_T)
\]

where

\[
\mathrm{KL}(P_X \,\Vert\, P_T) = \mathbb{E}_{X \sim P_X}[\log P_X(X)] - \mathbb{E}_{X \sim P_X}[\log P_T(X)] .
\]

The first term is a constant (it does not depend on \(T\)), hence the problem is to minimize the second term, which is

\[
- \mathbb{E}_{X \sim P_X}[\log P_T(X)]
= - \frac{1}{n} \sum_{i=1}^{n} \log P_T(x_i)
= - \frac{1}{n} \sum_{i=1}^{n} \Big[ \log P_0\big(T^{-1}(x_i)\big) + \log \big|\det J_{T^{-1}}(x_i)\big| \Big] .
\]

The last equality follows from the normalizing-flow change of variables. We can then represent the function \(T^{-1}\) by a neural network and solve this optimization problem by SGD.
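A minimal sketch of the resulting maximum-likelihood fit (the Gaussian toy data and the affine parameterization \(T^{-1}(x) = Wx + c\) are assumptions made for illustration; in practice \(T^{-1}\) would be a deeper invertible network):

```python
import torch

torch.manual_seed(0)
p, n = 2, 2000

# Observations x_1, ..., x_n (a hypothetical scaled and shifted Gaussian sample).
data = torch.randn(n, p) * torch.tensor([2.0, 0.5]) + torch.tensor([3.0, -1.0])

# Parameterize the inverse map T^{-1}(x) = W x + c (an affine "flow" for simplicity).
W = torch.nn.Parameter(torch.eye(p))
c = torch.nn.Parameter(torch.zeros(p))
base = torch.distributions.MultivariateNormal(torch.zeros(p), torch.eye(p))
opt = torch.optim.Adam([W, c], lr=1e-2)

for step in range(2000):
    # Negative log-likelihood from the change of variables:
    # -(1/n) sum_i [ log P_0(T^{-1}(x_i)) + log |det J_{T^{-1}}(x_i)| ]
    z = data @ W.T + c
    logabsdet = torch.linalg.slogdet(W).logabsdet
    loss = -(base.log_prob(z) + logabsdet).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(loss.item())   # the fitted affine map should roughly whiten the data
```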
By 1-Wasserstein Distance¶
To approximate \(P\) by \(P_T\), the objective is

\[
\min_T \; W_1(P, P_T)
\]

where

\[
W_1(P, P_T) = \inf_{\gamma \in \Pi(P, P_T)} \mathbb{E}_{(X, Y) \sim \gamma}\big[ \Vert X - Y \Vert \big]
\]

and \(\Pi(P, P_T)\) is the set of couplings of \(P\) and \(P_T\). It can be shown (Kantorovich–Rubinstein duality) that the dual form is

\[
W_1(P, P_T) = \sup_{\Vert D \Vert_L \le 1} \; \mathbb{E}_{X \sim P}[D(X)] - \mathbb{E}_{X \sim P_T}[D(X)] ,
\]

where the supremum is over 1-Lipschitz discriminators \(D\).
We can then parameterize \(D \leftarrow D_\phi\), \(T \leftarrow T_\psi\) by neural networks. Substituting the generalized densities \(P(x) = \frac{1}{n}\sum_{i=1}^{n} \delta_{x_i}(x)\) and \(P_{T_\psi}(x) = \frac{1}{n}\sum_{i=1}^{n} \delta_{T_\psi(z_i)}(x)\) with \(z_i \sim P_0\), the problem becomes

\[
\min_\psi \max_\phi \; \frac{1}{n} \sum_{i=1}^{n} D_\phi(x_i) - \frac{1}{n} \sum_{i=1}^{n} D_\phi\big(T_\psi(z_i)\big) ,
\]

subject to \(D_\phi\) being 1-Lipschitz.
The parameters \(\phi\) and \(\psi\) can then be updated in alternation.
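A minimal sketch of these alternating updates in the WGAN style (the toy data, network sizes, and the crude weight-clipping heuristic used to keep \(D_\phi\) approximately 1-Lipschitz are all assumptions, not prescribed by the notes):

```python
import torch

torch.manual_seed(0)
p, d, n = 2, 2, 1000
data = torch.randn(n, p) + torch.tensor([4.0, 0.0])   # observations x_1..x_n (toy)

T_psi = torch.nn.Sequential(torch.nn.Linear(d, 64), torch.nn.ReLU(), torch.nn.Linear(64, p))
D_phi = torch.nn.Sequential(torch.nn.Linear(p, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
opt_T = torch.optim.RMSprop(T_psi.parameters(), lr=5e-4)
opt_D = torch.optim.RMSprop(D_phi.parameters(), lr=5e-4)

for step in range(1000):
    # --- discriminator steps: maximize (1/n) sum_i D(x_i) - (1/n) sum_i D(T(z_i)) ---
    for _ in range(5):
        z = torch.randn(n, d)
        loss_D = -(D_phi(data).mean() - D_phi(T_psi(z)).mean())
        opt_D.zero_grad(); loss_D.backward(); opt_D.step()
        # crude 1-Lipschitz constraint via weight clipping
        with torch.no_grad():
            for w in D_phi.parameters():
                w.clamp_(-0.1, 0.1)
    # --- generator step: minimize the same objective over psi ---
    z = torch.randn(n, d)
    loss_T = -D_phi(T_psi(z)).mean()
    opt_T.zero_grad(); loss_T.backward(); opt_T.step()

print(T_psi(torch.randn(n, d)).mean(dim=0))   # should drift toward the data mean
```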
vs MCMC
If we instead parameterize the density \(P_X(x)\) directly by some \(P_\theta(x)\), e.g. \(P_\theta(x) \propto \sigma(\boldsymbol{\theta}^{\top} x + \theta_0)\), then sampling from \(P_\theta(x)\) requires MCMC, and it is hard to know when the chain has converged.
Using a generator to approximate \(P_X\) has recently become a popular alternative: we can sample efficiently from \(\mathcal{Z}\) and use the normalizing flow to compute expectations such as \(\mathbb{E}_{X \sim P_X}[X]\).
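To make the contrast concrete, here is a sketch of random-walk Metropolis–Hastings for a toy unnormalized 1-D target (a stand-in for \(P_\theta\); the mixture target, step size, and burn-in length are arbitrary choices), compared with the one-pass sampling a trained generator would provide:

```python
import torch

torch.manual_seed(0)

def log_p_unnorm(x):
    # Unnormalized log-density (a toy 1-D Gaussian mixture standing in for P_theta).
    return torch.logsumexp(torch.stack([-0.5 * (x - 2.0) ** 2,
                                        -0.5 * (x + 2.0) ** 2]), dim=0)

# Random-walk Metropolis-Hastings: needs many steps plus convergence diagnostics.
x, chain = torch.tensor(0.0), []
for _ in range(20_000):
    prop = x + 0.5 * torch.randn(())
    if torch.rand(()) < torch.exp(log_p_unnorm(prop) - log_p_unnorm(x)):
        x = prop
    chain.append(x)
mcmc_mean = torch.stack(chain[5000:]).mean()   # discard burn-in (heuristically chosen)

# Generator alternative: with a trained pushforward map T_psi, sampling is a
# single forward pass with no burn-in or mixing diagnostics, e.g.
#   x_samples = T_psi(torch.randn(n, d))
print(mcmc_mean.item())
```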
Scientific Applications¶
Learning¶
Suppose there is a physical process modeled by a PDE with some parameter \(a\). Given \(a\), the PDE has a solution \(u_a\). We want to learn a forward mapping \(F: a \rightarrow u_a\), or a backward (inverse) mapping \(B: u_a \rightarrow a\).

Using neural networks, we can parameterize \(F \leftarrow F_\phi\) and \(B \leftarrow B_\psi\).
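A minimal sketch of learning the forward surrogate by regression (the flattened parameter/solution vectors and the linear stand-in "solver" that generates the training pairs are purely illustrative assumptions; real data would come from a numerical PDE solver):

```python
import torch

torch.manual_seed(0)

# Hypothetical training pairs (a_i, u_{a_i}): parameter vectors and discretized
# solutions, here produced by a placeholder linear map standing in for the PDE.
n, dim_a, dim_u = 500, 8, 64
A_pairs = torch.randn(n, dim_a)
solver = torch.randn(dim_a, dim_u)                 # placeholder for the true PDE map
U_pairs = A_pairs @ solver + 0.01 * torch.randn(n, dim_u)

# Forward surrogate F_phi : a -> u_a, trained by regression on the pairs.
F_phi = torch.nn.Sequential(
    torch.nn.Linear(dim_a, 128), torch.nn.ReLU(), torch.nn.Linear(128, dim_u)
)
opt = torch.optim.Adam(F_phi.parameters(), lr=1e-3)

for step in range(2000):
    loss = torch.nn.functional.mse_loss(F_phi(A_pairs), U_pairs)
    opt.zero_grad(); loss.backward(); opt.step()

print(loss.item())   # a backward surrogate B_psi : u_a -> a can be trained the same way
```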
Current research problems:
how to specify the NN architecture?
generalizable? scalable?
Solving¶
Suppose there is some high-dimensional PDE, e.g. the Fokker–Planck equation or the many-body Schrödinger equation. We can parameterize some of the functions therein (e.g. the unknown solution) as neural networks.
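As a toy illustration of this idea (a sketch only: the notes mention high-dimensional PDEs, while this example uses a simple 1-D boundary-value problem \(u''(x) = -\pi^2 \sin(\pi x)\), \(u(0)=u(1)=0\), so the result can be checked against the exact solution \(\sin(\pi x)\)), we parameterize the unknown solution as a neural network and minimize the PDE residual plus boundary penalties:

```python
import torch

torch.manual_seed(0)

# Unknown solution u(x) parameterized as a small network.
u_theta = torch.nn.Sequential(
    torch.nn.Linear(1, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1)
)
opt = torch.optim.Adam(u_theta.parameters(), lr=1e-3)

for step in range(3000):
    x = torch.rand(256, 1, requires_grad=True)           # collocation points in (0, 1)
    u = u_theta(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
    residual = d2u + torch.pi ** 2 * torch.sin(torch.pi * x)   # u'' + pi^2 sin(pi x)
    bc = torch.cat([u_theta(torch.zeros(1, 1)), u_theta(torch.ones(1, 1))])
    loss = residual.pow(2).mean() + bc.pow(2).mean()      # PDE residual + boundary terms
    opt.zero_grad(); loss.backward(); opt.step()

# u_theta(0.5) should approach sin(pi/2) = 1 if training succeeded.
print(u_theta(torch.tensor([[0.5]])).item())
```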