Dimensionality Reduction¶
Aka representation learning.
Motivation¶
Although the data may appear high dimensional, there may only be a small number of degrees of variability, corresponding to latent factors. Low dimensional representations are useful for enabling fast nearest neighbor searches, and 2-d projections are useful for visualization. Dimensionality reduction is also useful for compression and for downstream learning tasks.
A question to ponder before applying dimensionality reduction: if I take away some information, can I reconstruct it?
Objective¶
We want to reduce the dimensionality by projecting the data to a lower dimensional subspace which captures the essence of the data, i.e. find a projection \(\mathcal{X} \rightarrow \mathcal{Z}\) such that \(\mathrm{dim} \left( \mathcal{Z} \right) \ll \mathrm{dim} \left( \mathcal{X} \right)\)
More specifically, we seek a projection \(P(\boldsymbol{x}; \boldsymbol{\theta})\), found by optimizing some goodness-of-fit criterion \(J\).
For linear dimensionality reduction models, the projection \(P(\boldsymbol{x}; \boldsymbol{\theta} )\) can be represented by a \(d \times k\) matrix \(\boldsymbol{W}\). The low dimensional representation \(\boldsymbol{z}\) is computed by
\[ \boldsymbol{z} = \boldsymbol{W} ^\top \boldsymbol{x} \]
and the reconstruction of \(\boldsymbol{x}\) is a weighted combination of the columns of \(\boldsymbol{W}\):
\[ \hat{\boldsymbol{x}} = \boldsymbol{W} \boldsymbol{z} = \boldsymbol{W} \boldsymbol{W} ^\top \boldsymbol{x} \]
For the whole data matrix,
\[\begin{split}\begin{aligned} \boldsymbol{Z} _{n \times k} &= \boldsymbol{X} \boldsymbol{W} \\ \widehat{\boldsymbol{X}} &= \boldsymbol{X} \boldsymbol{W} \boldsymbol{W} ^\top \\ \end{aligned}\end{split}\]
For non-linear dimensionality reduction models, the projection \(P\) can be quite different.
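To make the linear case concrete, here is a minimal NumPy sketch (synthetic data, illustrative names) that builds \(\boldsymbol{W}\) from the top-\(k\) eigenvectors of \(\boldsymbol{X}^\top \boldsymbol{X}\), projects to \(\boldsymbol{Z} = \boldsymbol{X}\boldsymbol{W}\), and reconstructs \(\widehat{\boldsymbol{X}} = \boldsymbol{Z}\boldsymbol{W}^\top\):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                    # n x d data matrix
X = X - X.mean(axis=0)                           # center each column

k = 2
# top-k eigenvectors of X^T X form an orthonormal d x k matrix W
eigvals, eigvecs = np.linalg.eigh(X.T @ X)       # ascending eigenvalues
W = eigvecs[:, np.argsort(eigvals)[::-1][:k]]    # d x k

Z = X @ W                                        # n x k low-dimensional representation
X_hat = Z @ W.T                                  # n x d reconstruction
print("reconstruction MSE:", np.mean((X - X_hat) ** 2))
```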
Metrics¶
The performance of a dimensionality reduction method can be measured by

- qualitative evaluation: does the visualization look better?
- evaluation on supervised downstream tasks
- intrinsic evaluation
  - cluster purity (evaluated on labeled data)
  - degree of compression
  - reconstruction error: needs to be evaluated on hold-out data, but some models do not handle out-of-sample data (see the sketch after this list)
  - KL divergence between kernel density estimates fit on the training set and on the embedded data set
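As a concrete example of the reconstruction-error metric, a minimal sketch (scikit-learn assumed, synthetic data) that fits PCA on a training split and measures reconstruction error on held-out data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

pca = PCA(n_components=5).fit(X_train)
# project the held-out data, then map it back to the original space
X_test_hat = pca.inverse_transform(pca.transform(X_test))
print("hold-out reconstruction MSE:", np.mean((X_test - X_test_hat) ** 2))
```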
Multi-view Dimensionality Reduction¶
Unsupervised dimensionality reduction is very challenging. If we happen to have multiple views of the data, the problem becomes easier.

A view is one of the ways we describe a data object. Examples of multi-view data:

- pictures: pixel values + captions
- pictures: one copy with rotation noise, another with pixel noise
- wiki webpages: page text + hyperlink structure
- articles: the same article in two languages
- speech: the sound wave + a video of the mouth
For instance, in two-view representation learning, training data consists of samples of a \(d\)-dimensional random vector that has some natural split into two sub-vectors.
The task is to learn useful features/subspaces from such two-view data. Typically this involves learning representations of one view that are predictive of the other. The idea is that if some transformation makes the two views “similar”, then that transformation is meaningful, and the transformed vectors serve as representations.
One example of linear multi-view representation learning is canonical correlation analysis (CCA). Other variants include kernel CCA, variational CCA, etc.
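As a minimal sketch of linear two-view learning (using scikit-learn's `CCA` on synthetic two-view data; names and sizes are illustrative), we can fit canonical directions that maximize correlation between projections of the two views:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))                                        # shared latent factors
X = latent @ rng.normal(size=(2, 10)) + 0.1 * rng.normal(size=(200, 10))  # view 1
Y = latent @ rng.normal(size=(2, 8)) + 0.1 * rng.normal(size=(200, 8))    # view 2

cca = CCA(n_components=2).fit(X, Y)
X_c, Y_c = cca.transform(X, Y)            # low-dimensional projections of each view
# correlation between the paired canonical variates
corrs = [np.corrcoef(X_c[:, i], Y_c[:, i])[0, 1] for i in range(2)]
print("canonical correlations:", np.round(corrs, 3))
```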
Summary¶
Linear Models¶
| Model | Input | Objective | Solution | Remarks |
|---|---|---|---|---|
| PCA | \(\boldsymbol{X} ^\top \boldsymbol{X}\) | \(\underset{\boldsymbol{w}}{\max}\ \boldsymbol{w}^{\top} \boldsymbol{\Sigma} \boldsymbol{w}\) | \(\boldsymbol{X} ^\top \boldsymbol{X} = \boldsymbol{U} \boldsymbol{D} \boldsymbol{U} ^\top\) | Standardize \(\boldsymbol{X}\) before running |
| Kernel PCA | \(\boldsymbol{X}\), kernel | \(\underset{\boldsymbol{\alpha}}{\max}\ \boldsymbol{\alpha}^{\top} \boldsymbol{K}^2 \boldsymbol{\alpha}\) | \(\boldsymbol{K} \boldsymbol{\alpha}_{j}=n \lambda_{j} \boldsymbol{\alpha}_{j}\) | 1. Center \(\boldsymbol{K}\) |
| Probabilistic PCA | \(\boldsymbol{X} ^\top \boldsymbol{X}\) | \(\boldsymbol{z} \sim \mathcal{N}( \boldsymbol{0}, \boldsymbol{I})\), maximize likelihood | \(\boldsymbol{W}_{ML} =\boldsymbol{U}_{d \times k}\left(\boldsymbol{\Lambda} _{k}-\sigma^{2} \boldsymbol{I}_k\right)^{1 / 2} \boldsymbol{R}_k\) | 1. \(\boldsymbol{W} _{ML}\) not unique |
| CCA | \(\boldsymbol{X} , \boldsymbol{Y}\) | \(\max _{\boldsymbol{\alpha}, \boldsymbol{\beta} } \operatorname{Corr}\left(\boldsymbol{\alpha}^{\prime} \boldsymbol{x} , \boldsymbol{\beta}^{\prime} \boldsymbol{y} \right)\) | \(\boldsymbol{\Sigma}_{x x}^{-1} \boldsymbol{\Sigma}_{x y} \boldsymbol{\Sigma}_{y y}^{-1} \boldsymbol{\Sigma}_{y x} \boldsymbol{\alpha} = \rho^2 \boldsymbol{\alpha}\) | \(\rho\) is invariant to scaling or invertible linear transformations of \(\boldsymbol{X}\) or \(\boldsymbol{Y}\) |
| Regularized CCA | '' | '' | \(\boldsymbol{\Sigma}_{x x} \leftarrow\boldsymbol{\Sigma}_{x x}+r_{x} \boldsymbol{I}\) | Add spherical noise \(r\boldsymbol{I}\) to the covariance matrices |
| (Regularized) Kernel CCA | \(\boldsymbol{X}, \boldsymbol{Y}\), \(r\), kernel | \(\max _{\boldsymbol{\alpha}, \boldsymbol{\beta}} \frac{\boldsymbol{\alpha} ^{\top} \boldsymbol{K} _{x} \boldsymbol{K} _{y} \boldsymbol{\beta} }{\sqrt{\boldsymbol{\alpha} ^{\top} \boldsymbol{K} _{x}^{2} \boldsymbol{\alpha} \cdot \boldsymbol{\beta} ^{\top} \boldsymbol{K} _{y}^{2} \boldsymbol{\beta} }}\) | \(\left(\boldsymbol{K}_{x}+r \boldsymbol{I}\right)^{-1} \boldsymbol{K}_{y}\left(\boldsymbol{K}_{y}+r \boldsymbol{I}\right)^{-1} \boldsymbol{K}_{x} \boldsymbol{\alpha}=\lambda^{2} \boldsymbol{\alpha}\) | Regularization \(r\) avoids trivial solutions |
| MDS | \(\boldsymbol{X} \boldsymbol{X} ^\top\) or distance matrix \(\boldsymbol{F}\) | \(\min \sum_{i, j}\left(\boldsymbol{x}_{i} ^\top \boldsymbol{x}_{j}-\boldsymbol{z}_{i} ^\top\boldsymbol{z}_{j}\right)^2\) | \(\boldsymbol{G} = \boldsymbol{X} \boldsymbol{X} ^\top = \boldsymbol{V} \boldsymbol{\Lambda} \boldsymbol{V} ^\top\) | 1. Retain inner products |
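To connect the Kernel PCA row above to code, a minimal NumPy/scikit-learn sketch (RBF kernel and hyperparameters are arbitrary choices) that centers \(\boldsymbol{K}\) in feature space and takes its top eigenvectors:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))

K = rbf_kernel(X, gamma=0.5)                          # n x n kernel matrix
n = K.shape[0]
one_n = np.full((n, n), 1.0 / n)
K_c = K - one_n @ K - K @ one_n + one_n @ K @ one_n   # center K in feature space

eigvals, eigvecs = np.linalg.eigh(K_c)                # ascending eigenvalues
idx = np.argsort(eigvals)[::-1][:2]                   # top-2 components
alphas = eigvecs[:, idx] / np.sqrt(eigvals[idx])      # rescale eigenvectors
Z = K_c @ alphas                                      # n x 2 kernel PCA embedding
```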
Factor Analysis vs PPCA
| Name | Model | Assumptions | Estimation |
|---|---|---|---|
| PCFA | \(\boldsymbol{x} = \boldsymbol{\mu} + \boldsymbol{L} \boldsymbol{f} + \boldsymbol{\varepsilon}\) | \(\mathbb{E} [\boldsymbol{f} ] = \boldsymbol{0} _k\), \(\operatorname{Cov}\left( \boldsymbol{f} \right) = \boldsymbol{I} _k\) | EVD |
| MLFA | same as above | \(\boldsymbol{f}\sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})\), \(\boldsymbol{\varepsilon}\sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Psi})\) | ML, up to rotation |
| PPCA | \(\boldsymbol{x} = \boldsymbol{\mu} + \boldsymbol{W} \boldsymbol{z} + \boldsymbol{\varepsilon}\) | \(\boldsymbol{z}\sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})\), \(\boldsymbol{\varepsilon}\sim \mathcal{N}(\boldsymbol{0}, \sigma^2 \boldsymbol{I})\) | ML, up to rotation |
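For a quick comparison in code, a minimal sketch (scikit-learn assumed, synthetic data): `FactorAnalysis` estimates a per-feature diagonal noise covariance \(\boldsymbol{\Psi}\), while `PCA` exposes the single isotropic noise variance \(\sigma^2\) of the PPCA maximum likelihood solution:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis, PCA

rng = np.random.default_rng(0)
F = rng.normal(size=(300, 3))                        # latent factors
L = rng.normal(size=(3, 10))                         # loadings
X = F @ L + rng.normal(scale=0.3, size=(300, 10))    # x = Lf + noise

fa = FactorAnalysis(n_components=3).fit(X)           # diagonal noise covariance Psi
ppca = PCA(n_components=3).fit(X)                    # isotropic noise variance sigma^2

print("FA noise variances  :", np.round(fa.noise_variance_, 3))
print("PPCA noise variance :", np.round(ppca.noise_variance_, 3))
```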
Review
| Model | Pros | Cons |
|---|---|---|
| PCA | | 1. Sensitive to variable scale |
| Probabilistic PCA | 1. Fewer variance parameters | |
| Kernel PCA | 1. Enables non-linear features | Computational issues (solutions: use a subset, low-rank approximation \(\boldsymbol{K} \approx \boldsymbol{F} ^\top \boldsymbol{F}\), random Fourier features) |
| CCA | Has discriminative power in some cases where PCA doesn't | Easy to overfit to tiny signals |
| Regularized CCA | Avoids overfitting of CCA | |
| (Regularized) Kernel CCA | 1. Enables non-linear features | |
| MDS | Can be obtained from a Euclidean distance matrix \(\boldsymbol{F}\) alone, for instance from a survey: "how do you compare the two items?" | |
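To illustrate the MDS row, a minimal classical-MDS sketch (NumPy/SciPy, synthetic data) that recovers an embedding from a Euclidean distance matrix via double centering and an eigendecomposition:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
D = squareform(pdist(X))                     # n x n Euclidean distance matrix F

n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
G = -0.5 * J @ (D ** 2) @ J                  # recovered Gram matrix G = X X^T (up to centering)

eigvals, eigvecs = np.linalg.eigh(G)
idx = np.argsort(eigvals)[::-1][:2]          # top-2 eigenpairs
Z = eigvecs[:, idx] * np.sqrt(eigvals[idx])  # n x 2 embedding preserving inner products
```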
Non-linear Models¶
In the previous linear models, the low-dimensional linear projection preserves distances between all pairs of points. The non-linear models below only preserve local distance/neighborhood information along a nonlinear manifold.
In general, there are three steps:

1. Define some similarity/distance measure between data points \(d(\boldsymbol{x}_i ,\boldsymbol{x}_j)\).
2. Induce a graph from the \(n \times d\) data set \(\boldsymbol{X}\):
   - nodes are the data points
   - add edge \(e(i,j)\) if the distance \(d(\boldsymbol{x}_i,\boldsymbol{x}_j)\) satisfies certain criteria, e.g. \(d<\varepsilon\), \(k\)-NN, or mutual \(k\)-NN
   - edge weights are the distance measures, \(w_{ij} = d_{ij}\)
3. Perform spectral methods on some graph matrix, such as the adjacency matrix, weight matrix, or Laplacian matrix, to obtain the embedding \(\boldsymbol{Z} _{n \times k}\) that preserves some property of the original space (a sketch of this pipeline follows the list).
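Here is a minimal sketch of this three-step pipeline, a basic Laplacian-eigenmaps-style embedding (scikit-learn and SciPy assumed; hyperparameters are arbitrary):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import laplacian

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))                       # n x d data

# Steps 1-2: k-NN graph; turn edge distances into Gaussian kernel similarities
G = kneighbors_graph(X, n_neighbors=10, mode="distance", include_self=False)
G.data = np.exp(-G.data ** 2 / (2 * 1.0 ** 2))       # sigma = 1.0, arbitrary here
W = 0.5 * (G + G.T)                                  # symmetrize the graph

# Step 3: bottom eigenvectors of the normalized Laplacian L = I - D^{-1/2} W D^{-1/2}
L = laplacian(W, normed=True).toarray()
eigvals, eigvecs = np.linalg.eigh(L)                 # ascending eigenvalues
Z = eigvecs[:, 1:3]                                  # skip the trivial first eigenvector; n x 2 embedding
```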
| Model | Similarity/Distance Measure | Objective | Solution | Remarks |
|---|---|---|---|---|
| Isomap | geodesic distance along the manifold | retain geodesic distances \(\left\|\boldsymbol{z}_{i}-\boldsymbol{z}_{j}\right\|^{2} \approx \Delta_{i j}^{2}\) | Run MDS with the geodesic distance matrix \(\Delta\) | Unfolds the manifold |
| Laplacian Eigenmaps | Gaussian kernel similarity \(w_{ij}\) | \(\min \sum_{i,j} w_{ij}\left\|\boldsymbol{z}_{i}-\boldsymbol{z}_{j}\right\|^{2}\), where \(d_{ii}= \sum_j w_{ij}\) is the total weight of a node | \(k\) bottom eigenvectors of \(\boldsymbol{L}=\boldsymbol{I}-\boldsymbol{D}^{-\frac{1}{2}} \boldsymbol{W} \boldsymbol{D}^{-\frac{1}{2}}\) | Retain weighted distance |
| SNE | \(p_{j\mid i} = \frac{\exp \left(-\left\|\boldsymbol{x}_{i}-\boldsymbol{x}_{j}\right\|^{2} / 2 \sigma_{i}^{2}\right)}{\sum_{k \neq i} \exp \left(-\left\|\boldsymbol{x}_{i}-\boldsymbol{x}_{k}\right\|^{2} / 2 \sigma_{i}^{2}\right)}\) | retain neighborhood conditional probabilities \(\min \sum_{i} \operatorname{KL}(P_i \| Q_i)\) | gradient descent | |
| \(t\)-SNE | \(p_{i j}=\frac{p_{j \mid i}+p_{i \mid j}}{2 n}\) | Retain neighborhood joint probability \(\min \operatorname{KL}(P \| Q)=\sum_{i,j} p_{i j} \log \frac{p_{i j}}{q_{i j}}\) | '' | 1. Use joint probability |
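For usage, a minimal \(t\)-SNE sketch with scikit-learn (perplexity and data are arbitrary; \(t\)-SNE is typically used only for 2-d visualization, not as a general-purpose embedding):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# three synthetic clusters in 20 dimensions
X = np.vstack([rng.normal(loc=c, size=(100, 20)) for c in (0.0, 4.0, 8.0)])

Z = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(X)
print(Z.shape)   # (300, 2)
```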
Review
| Model | Pros | Cons |
|---|---|---|
| Isomap | Recovers the manifold well | 1. Sensitive to neighborhood size / noise |
| Laplacian Eigenmaps | | |
| SNE | | 1. Difficult to optimize |
| \(t\)-SNE | Works especially well for data with a clustering nature | 1. Perplexity hyperparameter is important |
| Parametric \(t\)-SNE | Enables out-of-sample projection | |