Graph-based Spectral Methods

PCA, CCA, and MDS are linear dimensionality reduction methods: the lower-dimensional linear projection aims to preserve distances between all pairs of points.

If our data lies on a nonlinear manifold (a topological space that locally resembles Euclidean space), then we need nonlinear dimensionality reduction methods. Many of them extend MDS, which accommodates a variety of distance/similarity measures, and they preserve only local distance/neighborhood information along the nonlinear manifold.

In general, there are three steps:

  1. Define some similarity/distance measure between data points \(d(\boldsymbol{x}_i ,\boldsymbol{x}_j)\).

  2. Induce a graph from the \(n \times d\) data set \(\boldsymbol{X}\)

    • nodes are the data points

    • add edge \(e(i,j)\) if the distance \(d(\boldsymbol{x}_i,\boldsymbol{x}_j)\) satisfies a certain criterion, e.g. \(d<\varepsilon\), \(k\)-NN, or mutual \(k\)-NN.

  3. Perform spectral methods on some graph matrix, such as the adjacency matrix, the weight matrix, or the Laplacian matrix, to obtain an embedding \(\boldsymbol{Z} _{n \times k}\) (with \(k < d\)) that preserves some property of the original space.

Examples: Isomap, maximum variance unfolding, locally linear embedding, Laplacian eigenmaps.

The obtained embeddings can then be used as input for clustering, e.g. \(k\)-means. Essentially, we then cluster on the manifold rather than in the original space, which often makes more sense.
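
As a concrete illustration of step 2, here is a minimal sketch that builds an \(\varepsilon\)-ball graph and a \(k\)-NN graph from a toy data matrix \(\boldsymbol{X}\). The scikit-learn helpers are standard, but the toy data, radius, and neighborhood size are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph, radius_neighbors_graph

# Toy data: n = 100 points near a circle embedded in d = 3 dimensions.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, size=100)
X = np.c_[np.cos(theta), np.sin(theta), 0.1 * rng.standard_normal(100)]

# epsilon-ball graph: add edge (i, j) if ||x_i - x_j|| < epsilon.
A_eps = radius_neighbors_graph(X, radius=0.5, mode="connectivity")

# k-NN graph: connect each point to its k nearest neighbors,
# with edge weights equal to the Euclidean distances.
A_knn = kneighbors_graph(X, n_neighbors=5, mode="distance")
A_knn = A_knn.maximum(A_knn.T)   # symmetrize: the k-NN relation is not symmetric
```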

Isomap

[Tenenbaum et al. 2000]

For a manifold in a high-dimensional space, it may NOT be reasonable to use Euclidean distance to measure the pairwise dissimilarity of points. Geodesic distance along the manifold may be more appropriate.

Isomap is a direct extension of MDS where Euclidean distances are replaced with estimated geodesic distances along the manifold, i.e. it tries to “unfold” a manifold and preserve the pairwise geodesic distances.

Learning

Isomap algorithm:

  1. Construct a graph where node \(i\) corresponds to input example \(\boldsymbol{x}_i\), and nodes \(i\) and \(j\) are connected by an edge if \(\boldsymbol{x}_i, \boldsymbol{x}_j\) are nearest neighbors (by some definition) with edge weight being the Euclidean distance \(d_{ij} = \left\| \boldsymbol{x}_i -\boldsymbol{x}_j \right\|\).

  2. For each \(i,j\) in the graph, compute pairwise distance \(\Delta_{i j}\) along the shortest path in the graph (e.g., using Dijkstra’s shortest path algorithm) as the estimated geodesic distance.

  3. Perform MDS for some dimensionality \(k\), taking the estimated geodesic distances \(\Delta_{i j}\) as input in place of Euclidean distances.

The output is the \(k\)-dimensional projections of all of the data points \(\boldsymbol{z}_i\) such that the Euclidean distances in the projected space approximate the geodesic distances on the manifold.

\[ \left\|\boldsymbol{z}_{i}-\boldsymbol{z}_{j}\right\|^{2} \approx \Delta_{i j}^{2} \]
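
A minimal sketch of these three steps, assuming a connected \(k\)-NN graph: SciPy's shortest-path routine supplies the geodesic estimates, and classical MDS is done via double centering. The function name and hyperparameter values are illustrative.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

def isomap(X, n_neighbors=10, k=2):
    # Step 1: k-NN graph with Euclidean edge weights.
    G = kneighbors_graph(X, n_neighbors=n_neighbors, mode="distance")
    G = G.maximum(G.T)                       # make the graph undirected

    # Step 2: estimated geodesic distances = shortest-path distances
    # (assumes the graph is connected; otherwise Delta contains inf).
    Delta = shortest_path(G, method="D", directed=False)

    # Step 3: classical MDS on the estimated geodesic distances.
    n = Delta.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (Delta ** 2) @ J          # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)     # ascending eigenvalues
    idx = np.argsort(eigvals)[::-1][:k]      # top-k eigenpairs
    return eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0))
```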

Fig. 103 Isomap with \(d=3,k=2\). The blue line is the real geodesic distance and the red line is estimated. [Livescu 2021]

Pros and Cons

Pros

  • As the data set size increases, Isomap is guaranteed to converge to the correct manifold that the data was drawn from, under certain conditions (e.g. no holes).

Cons

  • Can be sensitive to the neighborhood size used in graph construction, or equivalently to the noise in the data.

    Fig. 104 Isomap fails when there is noise [Livescu 2021]

  • Can’t handle holes in the manifold: the geodesic distance computation can break. Two points (even with the same color/label) that sit on opposite sides of the hole have a large geodesic distance along the manifold, which leads to a large Euclidean distance in the projected space.

    Fig. 105 Isomap fails when there are holes [Livescu 2021]

Laplacian Eigenmaps

[Belkin & Niyogi 2003]

Unlike Isomap, where the edge weights are local Euclidean distances, Laplacian eigenmaps define edge weights in another way.

Learning

  1. Construct a graph \(G = (V, E)\) from data \(\boldsymbol{X}\). Add edge \((i, j)\) if \(\boldsymbol{x}_i\) and \(\boldsymbol{x}_j\) are close, in the sense that

    • \(\left\| \boldsymbol{x}_i - \boldsymbol{x}_j \right\| < \epsilon\), or

    • \(n\)-nearest-neighbors

  2. Define edge weights as

    \[\begin{split} w_{i j}=\left\{\begin{array}{ll} \exp \left(-\left\|\boldsymbol{x}_{i}-\boldsymbol{x}_{j}\right\|^{2} / t\right), & (i, j) \in E \\ 0 & \text { otherwise } \end{array}\right. \end{split}\]

    where \(t\) is a temperature-like hyperparameter. As \(t \to \infty\), \(w_{ij} \to a_{ij}\), the 0/1 entries of the adjacency matrix.

  3. Define a diagonal matrix \(\boldsymbol{D}\) with \(d_{ii} = \sum_j w_{ij}\). This can be seen as the density around the node \(i\). The graph Laplacian is \(\boldsymbol{L} = \boldsymbol{D} - \boldsymbol{W}\). The \(k\)-dimensional representation \(\boldsymbol{Z}\) is given by the \(k\) bottom eigenvectors (excluding the smallest one, which is \(\boldsymbol{1}\)) for the generalized eigenvector problem

    \[ \boldsymbol{L} \boldsymbol{v} = \lambda \boldsymbol{D} \boldsymbol{v} \]

    If \(G\) is not connected, run this step for each connected component in \(G\).
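
A minimal sketch of steps 1–3, assuming a connected \(k\)-NN graph with no isolated nodes (so \(\boldsymbol{D}\) is positive definite); the function name, neighborhood size, temperature, and output dimension are illustrative.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.neighbors import kneighbors_graph

def laplacian_eigenmaps(X, n_neighbors=10, t=1.0, k=2):
    # Step 1: neighborhood graph (here: k-nearest neighbors).
    A = kneighbors_graph(X, n_neighbors=n_neighbors, mode="connectivity")
    A = A.maximum(A.T).toarray()                      # symmetric 0/1 adjacency

    # Step 2: heat-kernel edge weights w_ij = exp(-||x_i - x_j||^2 / t).
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = A * np.exp(-sq_dists / t)

    # Step 3: generalized eigenproblem L v = lambda D v.
    D = np.diag(W.sum(axis=1))
    L = D - W
    eigvals, eigvecs = eigh(L, D)                     # ascending eigenvalues
    # Drop the trivial constant eigenvector (eigenvalue 0), keep the next k.
    return eigvecs[:, 1:k + 1]
```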

Fig. 106 Laplacian eigenmap with varying \(N\)-nearest-neighbors and temperature \(t\) [Livescu 2021]

Relation to Spectral Clustering

For the normalized cut problem in spectral clustering, the cluster assignment is given by the eigenvector of the 2nd smallest eigenvalue of the generalized eigenproblem

\[ \boldsymbol{L} \boldsymbol{v} = \lambda \boldsymbol{D} \boldsymbol{v} \]

which is exactly the 1-dimensional representation of Laplacian eigenmaps. In this sense, the local approach to dimensionality reduction imposes a natural clustering of data.
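
For instance, reusing the hypothetical laplacian_eigenmaps sketch from above, the sign of the 1-dimensional embedding yields a two-way partition in the spirit of normalized cut:

```python
z = laplacian_eigenmaps(X, n_neighbors=10, t=1.0, k=1)  # 2nd smallest generalized eigenvector
labels = (z[:, 0] > 0).astype(int)                      # sign pattern ~ two-way cut
```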

Seriation Problem

Wikipedia. Fogel.

Suppose there are \(n\) archeological pieces \(i = 1, \ldots, n\). We want to order them in time by finding some latent ordering \(\pi(i)\), given some pairwise similarity measure \(w_{ij}\). The problem can be formulated as

\[ \min_{\pi: [n] \rightarrow [n] }\ \sum_{i,j=1}^n w_{ij} \left( \pi(i) - \pi(j) \right) ^2 \]

Solving for the permutation is hard. The spectral relaxation drops the permutation constraint and solves

\[ \min_{\boldsymbol{f} \in \mathbb{R} ^{n} }\ \sum_{i,j=1}^n w_{ij} \left( f_i - f_j \right) ^2 \quad \text{s.t. } \left\| \boldsymbol{f} \right\| = 1, \boldsymbol{f} ^{\top} \boldsymbol{1} = 0 \]

The hope is that the relative order of the entries of \(\boldsymbol{f} ^*\) carries information about \(\pi^*\).
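
Note that the relaxed objective equals \(2 \boldsymbol{f}^{\top} \boldsymbol{L} \boldsymbol{f}\) with \(\boldsymbol{L} = \boldsymbol{D} - \boldsymbol{W}\), so under the constraints the minimizer is the eigenvector of the 2nd smallest eigenvalue of \(\boldsymbol{L}\) (the Fiedler vector). A minimal sketch, taking the similarities as a dense symmetric matrix W (an assumed input); the function name is illustrative:

```python
import numpy as np

def spectral_seriation(W):
    """Recover a latent ordering from a pairwise similarity matrix W."""
    W = 0.5 * (W + W.T)                    # ensure symmetry
    L = np.diag(W.sum(axis=1)) - W         # unnormalized graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)   # ascending eigenvalues
    f = eigvecs[:, 1]                      # Fiedler vector: 2nd smallest eigenvalue
    return np.argsort(f)                   # read off the relative order as pi
```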

An example of applying MDS to the seriation problem is shown below. The data is the “distances” between 9 archaeological sites from different periods, based upon the frequencies of different types of potsherds found at the sites. The MDS algorithm successfully separates the later years (1137, 1131) from the rest (918~1062). Moreover, within the group (918~1062) there is roughly an upward line that aligns these seven sites chronologically. All these observations suggest that MDS successfully preserves the distances among the 9 sites.

Fig. 107 Applying MDS to similarity matrix of sites.

What if we have some information, say some \(i\) is known to be in some year? Can we use this information?

Locally Linear Embedding

Locally linear embedding learns a mapping in which each point is expressed as a linear combination of its nearest neighbors, and then finds a low-dimensional embedding that preserves these reconstruction weights.
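
A minimal usage sketch with scikit-learn's implementation, applied to some data matrix X; the neighborhood size and output dimension are illustrative.

```python
from sklearn.manifold import LocallyLinearEmbedding

# Reconstruct each point from its nearest neighbors, then find a low-dimensional
# embedding that preserves the reconstruction weights.
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2)
Z = lle.fit_transform(X)
```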

Maximum Variance Unfolding

Maximum variance unfolding tries to maximize the variance of the data (like PCA) while respecting neighborhood relationships.

ref: https://www.youtube.com/watch?v=DW3lSYltfzo
