Dimensionality Reduction¶
Aka representation learning.
Motivation¶
Although the data may appear high dimensional, there may only be a small number of degrees of variability, corresponding to latent factors. Low dimensional representations are useful for enabling fast nearest neighbor searches, and 2-d projections are useful for visualization. Dimensionality reduction is also useful for compression and for downstream learning tasks.
A question to ponder before applying dimensionality reduction: if I take away some information, can I reconstruct it?
Objective¶
We want to reduce the dimensionality by projecting the data to a lower dimensional subspace which captures the essence of the data, i.e. find a projection \(\mathcal{X} \rightarrow \mathcal{Z}\) such that \(\mathrm{dim} \left( \mathcal{Z} \right) \ll \mathrm{dim} \left( \mathcal{X} \right)\)
More specifically, we seek a projection \(P(\boldsymbol{x}; \boldsymbol{\theta})\), found by optimizing some goodness-of-fit criterion \(J\).
For linear dimensionality reduction models, the projection \(P(\boldsymbol{x}; \boldsymbol{\theta} )\) can be represented by a \(d \times k\) matrix \(\boldsymbol{W}\). The low dimensional representation \(\boldsymbol{z}\) is computed by
\[ \boldsymbol{z} = \boldsymbol{W} ^\top \boldsymbol{x} \]
and the reconstruction of \(\boldsymbol{x}\) is a weighted combination of the columns of \(\boldsymbol{W}\):
\[ \hat{\boldsymbol{x}} = \boldsymbol{W} \boldsymbol{z} = \boldsymbol{W} \boldsymbol{W} ^\top \boldsymbol{x} \]
For the whole data matrix,
\[\begin{split}\begin{aligned} \boldsymbol{Z} _{n \times k} &= \boldsymbol{X} \boldsymbol{W} \\ \widehat{\boldsymbol{X}} &= \boldsymbol{X} \boldsymbol{W} \boldsymbol{W} ^\top \\ \end{aligned}\end{split}\]
For non-linear dimensionality reduction models, the projection \(P\) can be quite different.
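To make the linear case concrete, here is a minimal NumPy sketch (synthetic data, illustrative names) that builds \(\boldsymbol{W}\) from the top-\(k\) eigenvectors of \(\boldsymbol{X}^\top \boldsymbol{X}\), projects to \(\boldsymbol{Z} = \boldsymbol{X}\boldsymbol{W}\), and reconstructs \(\widehat{\boldsymbol{X}} = \boldsymbol{Z}\boldsymbol{W}^\top\):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                    # n x d data matrix
X = X - X.mean(axis=0)                           # center each column

k = 2
# top-k eigenvectors of X^T X form an orthonormal d x k matrix W
eigvals, eigvecs = np.linalg.eigh(X.T @ X)       # ascending eigenvalues
W = eigvecs[:, np.argsort(eigvals)[::-1][:k]]    # d x k

Z = X @ W                                        # n x k low-dimensional representation
X_hat = Z @ W.T                                  # n x d reconstruction
print("reconstruction MSE:", np.mean((X - X_hat) ** 2))
```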
Metrics¶
The performance of a dimensionality reduction method can be measured by

- qualitative evaluation: does the visualization look better?
- evaluation on supervised downstream tasks
- intrinsic evaluation
  - cluster purity (evaluated on labeled data)
  - degree of compression
  - reconstruction error: needs to be evaluated on hold-out data, but some models do not handle out-of-sample data (see the sketch after this list)
  - KL divergence between kernel density estimates fit on the training set and on the embedded data set
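As a concrete example of the reconstruction-error metric, a minimal sketch (scikit-learn assumed, synthetic data) that fits PCA on a training split and measures reconstruction error on held-out data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

pca = PCA(n_components=5).fit(X_train)
# project the held-out data, then map it back to the original space
X_test_hat = pca.inverse_transform(pca.transform(X_test))
print("hold-out reconstruction MSE:", np.mean((X_test - X_test_hat) ** 2))
```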
Multi-view Dimensionality Reduction¶
Unsupervised dimensionality reduction is very challenging. If we happen to have multiple views of the data, the problem becomes easier.

A view is one of the ways we describe a data object. Examples of multi-view data:

- pictures: pixel values + captions
- pictures: one copy with rotation noise, another with pixel noise
- wiki webpages: page text + hyperlink structure
- articles: the same article in two languages
- speech: the sound wave + a video of the mouth
For instance, in two-view representation learning, training data consists of samples of a \(d\)-dimensional random vector that has some natural split into two sub-vectors.
The task is to learn useful features/subspaces from such two-view data. Typically this involves learning representations of one view that are predictive of the other. The idea is that if some transformation makes the two views “similar”, then that transformation is meaningful, and the transformed vectors serve as representations.
One example of linear multi-view representation learning is canonical correlation analysis (CCA). Other variants include kernel CCA, variational CCA, etc.
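As a minimal sketch of linear two-view learning (using scikit-learn's `CCA` on synthetic two-view data; names and sizes are illustrative), we can fit canonical directions that maximize correlation between projections of the two views:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))                                        # shared latent factors
X = latent @ rng.normal(size=(2, 10)) + 0.1 * rng.normal(size=(200, 10))  # view 1
Y = latent @ rng.normal(size=(2, 8)) + 0.1 * rng.normal(size=(200, 8))    # view 2

cca = CCA(n_components=2).fit(X, Y)
X_c, Y_c = cca.transform(X, Y)            # low-dimensional projections of each view
# correlation between the paired canonical variates
corrs = [np.corrcoef(X_c[:, i], Y_c[:, i])[0, 1] for i in range(2)]
print("canonical correlations:", np.round(corrs, 3))
```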
Summary¶
Linear Models¶
| Model | Input | Objective | Solution | Remarks |
|---|---|---|---|---|
| PCA | \(\boldsymbol{X} ^\top \boldsymbol{X}\) | \(\underset{\boldsymbol{w}}{\max}\ \boldsymbol{w}^{\top} \boldsymbol{\Sigma} \boldsymbol{w}\) | \(\boldsymbol{X} ^\top \boldsymbol{X} = \boldsymbol{U} \boldsymbol{D} \boldsymbol{U} ^\top\) | Standardize \(\boldsymbol{X}\) before running |
| Kernel PCA | \(\boldsymbol{X}\), kernel | \(\underset{\boldsymbol{\alpha}}{\max}\ \boldsymbol{\alpha}^{\top} \boldsymbol{K}^2 \boldsymbol{\alpha}\) | \(\boldsymbol{K} \boldsymbol{\alpha}_{j}=n \lambda_{j} \boldsymbol{\alpha}_{j}\) | 1. Center \(\boldsymbol{K}\) |
| Probabilistic PCA | \(\boldsymbol{X} ^\top \boldsymbol{X}\) | \(\boldsymbol{z} \sim \mathcal{N}( \boldsymbol{0}, \boldsymbol{I})\), maximize likelihood | \(\boldsymbol{W}_{ML} =\boldsymbol{U}_{d \times k}\left(\boldsymbol{\Lambda} _{k}-\sigma^{2} \boldsymbol{I}_k\right)^{1 / 2} \boldsymbol{R}_k\) | 1. \(\boldsymbol{W} _{ML}\) not unique |
| CCA | \(\boldsymbol{X} , \boldsymbol{Y}\) | \(\max _{\boldsymbol{\alpha}, \boldsymbol{\beta} } \operatorname{Corr}\left(\boldsymbol{\alpha}^{\prime} \boldsymbol{x} , \boldsymbol{\beta}^{\prime} \boldsymbol{y} \right)\) | \(\boldsymbol{\Sigma}_{x x}^{-1} \boldsymbol{\Sigma}_{x y} \boldsymbol{\Sigma}_{y y}^{-1} \boldsymbol{\Sigma}_{y x} \boldsymbol{\alpha} = \rho^2 \boldsymbol{\alpha}\) | \(\rho\) is invariant to scaling or invertible linear transformations of \(\boldsymbol{X}\) or \(\boldsymbol{Y}\) |
| Regularized CCA | '' | '' | \(\boldsymbol{\Sigma}_{x x} \leftarrow\boldsymbol{\Sigma}_{x x}+r_{x} \boldsymbol{I}\) | Add spherical noise \(r\boldsymbol{I}\) to the covariance matrices |
| (Regularized) Kernel CCA | \(\boldsymbol{X}, \boldsymbol{Y}\), \(r\), kernel | \(\max _{\boldsymbol{\alpha}, \boldsymbol{\beta}} \frac{\boldsymbol{\alpha} ^{\top} \boldsymbol{K} _{x} \boldsymbol{K} _{y} \boldsymbol{\beta} }{\sqrt{\boldsymbol{\alpha} ^{\top} \boldsymbol{K} _{x}^{2} \boldsymbol{\alpha} \cdot \boldsymbol{\beta} ^{\top} \boldsymbol{K} _{y}^{2} \boldsymbol{\beta} }}\) | \(\left(\boldsymbol{K}_{x}+r \boldsymbol{I}\right)^{-1} \boldsymbol{K}_{y}\left(\boldsymbol{K}_{y}+r \boldsymbol{I}\right)^{-1} \boldsymbol{K}_{x} \boldsymbol{\alpha}=\lambda^{2} \boldsymbol{\alpha}\) | Regularization \(r\) avoids trivial solutions |
| MDS | \(\boldsymbol{X} \boldsymbol{X} ^\top\) or distance matrix \(\boldsymbol{F}\) | \(\min \sum_{i, j}\left(\boldsymbol{x}_{i} ^\top \boldsymbol{x}_{j}-\boldsymbol{z}_{i} ^\top\boldsymbol{z}_{j}\right)^2\) | \(\boldsymbol{G} = \boldsymbol{X} \boldsymbol{X} ^\top = \boldsymbol{V} \boldsymbol{\Lambda} \boldsymbol{V} ^\top\) | 1. Retain inner products |
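To connect the Kernel PCA row above to code, a minimal NumPy/scikit-learn sketch (RBF kernel and hyperparameters are arbitrary choices) that centers \(\boldsymbol{K}\) in feature space and takes its top eigenvectors:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))

K = rbf_kernel(X, gamma=0.5)                          # n x n kernel matrix
n = K.shape[0]
one_n = np.full((n, n), 1.0 / n)
K_c = K - one_n @ K - K @ one_n + one_n @ K @ one_n   # center K in feature space

eigvals, eigvecs = np.linalg.eigh(K_c)                # ascending eigenvalues
idx = np.argsort(eigvals)[::-1][:2]                   # top-2 components
alphas = eigvecs[:, idx] / np.sqrt(eigvals[idx])      # rescale eigenvectors
Z = K_c @ alphas                                      # n x 2 kernel PCA embedding
```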
Factor Analysis vs PPCA
| Name | Model | Assumptions | Estimation |
|---|---|---|---|
| PCFA | \(\boldsymbol{x} = \boldsymbol{\mu} + \boldsymbol{L} \boldsymbol{f} + \boldsymbol{\varepsilon}\) | \(\mathbb{E} [\boldsymbol{f} ] = \boldsymbol{0} _k\), \(\operatorname{Cov}\left( \boldsymbol{f} \right) = \boldsymbol{I} _k\) | EVD |
| MLFA | same as above | \(\boldsymbol{f}\sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})\), \(\boldsymbol{\varepsilon}\sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Psi})\) | ML, up to rotation |
| PPCA | \(\boldsymbol{x} = \boldsymbol{\mu} + \boldsymbol{W} \boldsymbol{z} + \boldsymbol{\varepsilon}\) | \(\boldsymbol{z}\sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})\), \(\boldsymbol{\varepsilon}\sim \mathcal{N}(\boldsymbol{0}, \sigma^2 \boldsymbol{I})\) | ML, up to rotation |
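For a quick comparison in code, a minimal sketch (scikit-learn assumed, synthetic data): `FactorAnalysis` estimates a per-feature diagonal noise covariance \(\boldsymbol{\Psi}\), while `PCA` exposes the single isotropic noise variance \(\sigma^2\) of the PPCA maximum likelihood solution:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis, PCA

rng = np.random.default_rng(0)
F = rng.normal(size=(300, 3))                        # latent factors
L = rng.normal(size=(3, 10))                         # loadings
X = F @ L + rng.normal(scale=0.3, size=(300, 10))    # x = Lf + noise

fa = FactorAnalysis(n_components=3).fit(X)           # diagonal noise covariance Psi
ppca = PCA(n_components=3).fit(X)                    # isotropic noise variance sigma^2

print("FA noise variances  :", np.round(fa.noise_variance_, 3))
print("PPCA noise variance :", np.round(ppca.noise_variance_, 3))
```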
Review
| Model | Pros | Cons |
|---|---|---|
| PCA | | 1. Sensitive to variable scale |
| Probabilistic PCA | 1. Fewer variance parameters | |
| Kernel PCA | 1. Enables non-linear features | Computational issues (solutions: use a subset, low-rank approximation \(\boldsymbol{K} \approx \boldsymbol{F} ^\top \boldsymbol{F}\), random Fourier features) |
| CCA | Has discriminative power in some cases where PCA doesn't | Easy to overfit to tiny signals |
| Regularized CCA | Avoids overfitting of CCA | |
| (Regularized) Kernel CCA | 1. Enables non-linear features | |
| MDS | Can be obtained from a Euclidean distance matrix \(\boldsymbol{F}\) alone, for instance from a survey: "how do you compare the two items?" | |
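To illustrate the MDS row, a minimal classical-MDS sketch (NumPy/SciPy, synthetic data) that recovers an embedding from a Euclidean distance matrix via double centering and an eigendecomposition:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
D = squareform(pdist(X))                     # n x n Euclidean distance matrix F

n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
G = -0.5 * J @ (D ** 2) @ J                  # recovered Gram matrix G = X X^T (up to centering)

eigvals, eigvecs = np.linalg.eigh(G)
idx = np.argsort(eigvals)[::-1][:2]          # top-2 eigenpairs
Z = eigvecs[:, idx] * np.sqrt(eigvals[idx])  # n x 2 embedding preserving inner products
```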
Non-linear Models¶
In the previous linear models, the low-dimensional linear projection preserves distances between all pairs of points. The non-linear models below only preserve local distance/neighborhood information along a nonlinear manifold.
In general, there are three steps:

1. Define some similarity/distance measure between data points \(d(\boldsymbol{x}_i ,\boldsymbol{x}_j)\).
2. Induce a graph from the \(n \times d\) data set \(\boldsymbol{X}\):
   - nodes are the data points
   - add edge \(e(i,j)\) if the distance \(d(\boldsymbol{x}_i,\boldsymbol{x}_j)\) satisfies certain criteria, e.g. \(d<\varepsilon\), \(k\)-NN, or mutual \(k\)-NN
   - edge weights are the distance measures, \(w_{ij} = d_{ij}\)
3. Perform spectral methods on some graph matrix, such as the adjacency matrix, weight matrix, or Laplacian matrix, to obtain the embedding \(\boldsymbol{Z} _{n \times k}\) that preserves some property of the original space (a sketch of this pipeline follows the list).
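Here is a minimal sketch of this three-step pipeline, a basic Laplacian-eigenmaps-style embedding (scikit-learn and SciPy assumed; hyperparameters are arbitrary):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import laplacian

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))                       # n x d data

# Steps 1-2: k-NN graph; turn edge distances into Gaussian kernel similarities
G = kneighbors_graph(X, n_neighbors=10, mode="distance", include_self=False)
G.data = np.exp(-G.data ** 2 / (2 * 1.0 ** 2))       # sigma = 1.0, arbitrary here
W = 0.5 * (G + G.T)                                  # symmetrize the graph

# Step 3: bottom eigenvectors of the normalized Laplacian L = I - D^{-1/2} W D^{-1/2}
L = laplacian(W, normed=True).toarray()
eigvals, eigvecs = np.linalg.eigh(L)                 # ascending eigenvalues
Z = eigvecs[:, 1:3]                                  # skip the trivial first eigenvector; n x 2 embedding
```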
| Model | Similarity/Distance Measure | Objective | Solution | Remarks |
|---|---|---|---|---|
| Isomap | geodesic distance along the manifold | retain geodesic distances \(\left\|\boldsymbol{z}_{i}-\boldsymbol{z}_{j}\right\|^{2} \approx \Delta_{i j}^{2}\) | Run MDS with the geodesic distance matrix \(\Delta\) | Unfolds the manifold |
| Laplacian Eigenmaps | Gaussian kernel similarity \(w_{ij}\) | \(\min \sum_{i,j} w_{ij}\left\|\boldsymbol{z}_{i}-\boldsymbol{z}_{j}\right\|^{2}\), where \(d_{ii}= \sum_j w_{ij}\) is the total weight of a node | \(k\) bottom eigenvectors of \(\boldsymbol{L}=\boldsymbol{I}-\boldsymbol{D}^{-\frac{1}{2}} \boldsymbol{W} \boldsymbol{D}^{-\frac{1}{2}}\) | Retain weighted distance |
| SNE | \(p_{j\mid i} = \frac{\exp \left(-\left\|\boldsymbol{x}_{i}-\boldsymbol{x}_{j}\right\|^{2} / 2 \sigma_{i}^{2}\right)}{\sum_{k \neq i} \exp \left(-\left\|\boldsymbol{x}_{i}-\boldsymbol{x}_{k}\right\|^{2} / 2 \sigma_{i}^{2}\right)}\) | retain neighborhood conditional probabilities \(\min \sum_{i} \operatorname{KL}(P_i \| Q_i)\) | gradient descent | |
| \(t\)-SNE | \(p_{i j}=\frac{p_{j \mid i}+p_{i \mid j}}{2 n}\) | Retain neighborhood joint probability \(\min \operatorname{KL}(P \| Q)=\sum_{i,j} p_{i j} \log \frac{p_{i j}}{q_{i j}}\) | '' | 1. Use joint probability |
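For usage, a minimal \(t\)-SNE sketch with scikit-learn (perplexity and data are arbitrary; \(t\)-SNE is typically used only for 2-d visualization, not as a general-purpose embedding):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# three synthetic clusters in 20 dimensions
X = np.vstack([rng.normal(loc=c, size=(100, 20)) for c in (0.0, 4.0, 8.0)])

Z = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(X)
print(Z.shape)   # (300, 2)
```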
Review
| Model | Pros | Cons |
|---|---|---|
| Isomap | Recovers the manifold well | 1. Sensitive to neighborhood size / noise |
| Laplacian Eigenmaps | | |
| SNE | | 1. Difficult to optimize |
| \(t\)-SNE | Works especially well for data with a clustering nature | 1. Perplexity hyperparameter is important |
| Parametric \(t\)-SNE | Enables out-of-sample projection | |