Kernels

Definitions

Definition (Kernel)

A kernel is a function \(k: \mathcal{S} \times \mathcal{S} \rightarrow \mathbb{C}\), where \(\mathcal{S}\) is an arbitrary set.

Definition (Positive semi-definite kernels)

A kernel \(k\) is a positive semi-definite kernel if \(\forall x_{1}, \ldots, x_{n} \in \mathcal{S}\), the matrix \(\boldsymbol{C}\) defined below is positive semi-definite,

\[\begin{split} \boldsymbol{C}=\left(\begin{array}{ccc} k\left(x_{1}, x_{1}\right) & \cdots & k\left(x_{1}, x_{n}\right) \\ k\left(x_{2}, x_{1}\right) & \cdots & k\left(x_{2}, x_{n}\right) \\ \vdots & & \vdots \\ k\left(x_{n}, x_{1}\right) & \cdots & k\left(x_{n}, x_{n}\right) \end{array}\right) \in \mathbb{C}^{n \times n} \end{split}\]

Examples
  • Dot product \(k(\boldsymbol{x}, \boldsymbol{y})=\boldsymbol{x}^{*} \boldsymbol{y}\), where \(\boldsymbol{x}, \boldsymbol{y} \in \mathbb{C}^{d}\), is a PSD kernel: writing \(\boldsymbol{X} = (\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)\), we have \(\boldsymbol{C} = \boldsymbol{X}^* \boldsymbol{X}\), so \(\boldsymbol{u} ^* \boldsymbol{C} \boldsymbol{u} = \boldsymbol{u} ^* \boldsymbol{X} ^* \boldsymbol{X} \boldsymbol{u} = \lVert \boldsymbol{X} \boldsymbol{u} \rVert^2 \ge 0\). (A numerical check follows this list.)

  • In general, \(k(x, y) \triangleq \boldsymbol{\phi}(x) ^\top \boldsymbol{\phi}(y)\), with an arbitrary map \(\boldsymbol{\phi}: \mathcal{S} \rightarrow \mathbb{R}^{d}\), is a PSD kernel.

  • More generally, \(k(x, y) \triangleq\langle\boldsymbol{\phi}(x), \boldsymbol{\phi}(y)\rangle_{\mathcal{H}}\), with \(\boldsymbol{\phi}: \mathcal{S} \rightarrow \mathcal{H}\) for a Hilbert space \(\mathcal{H}\), is a PSD kernel.
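
To make the definition concrete, here is a minimal numerical check (assuming NumPy; the data and kernel below are purely illustrative) that the Gram matrix \(\boldsymbol{C}\) of the dot-product kernel is PSD:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))          # n = 5 illustrative points in R^3

def k(x, y):
    """Dot-product kernel k(x, y) = x^T y."""
    return x @ y

# Gram matrix C with C[i, j] = k(x_i, x_j)
C = np.array([[k(xi, xj) for xj in X] for xi in X])

# PSD check: all eigenvalues of the symmetric Gram matrix are >= 0
eigvals = np.linalg.eigvalsh(C)
print(np.all(eigvals >= -1e-10))         # True, up to numerical tolerance
```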

Construction of PSD Kernels

If \(k, k_1, k_2\) are real PSD kernels on some arbitrary set \(\mathcal{S}\), the following are also real PSD kernels (a worked example combining these rules follows the list).

  1. \(\alpha k(x,y)\), where \(\alpha \in \mathbb{R} _{\ge 0}\)

  2. \(k_1(x,y) + k_2 (x,y)\)

  3. \(k_1(x,y) k_2 (x,y)\)

  4. \(p(k(x,y))\) where \(p\) is a polynomial with non-negative coefficients

  5. \(\exp (k(x,y ))\)

  6. \(f(x)k(x,y)f(y), \forall \, f: \mathcal{S} \rightarrow \mathbb{R}\)

  7. \(k\left(\xi\left(x^{\prime}\right), \xi\left(y^{\prime}\right)\right), \forall \xi: \mathcal{S}^{\prime} \rightarrow \mathcal{S}\), which is a PSD kernel on \(\mathcal{S}^{\prime}\)
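
As a quick illustration of how these rules combine (a standard derivation, added here for concreteness), the Gaussian RBF kernel is PSD: writing

\[ \exp\left(-\frac{\lVert \boldsymbol{x}-\boldsymbol{y} \rVert^2}{2\sigma^2}\right) = f(\boldsymbol{x}) \, \exp\left(\frac{\boldsymbol{x}^\top \boldsymbol{y}}{\sigma^2}\right) f(\boldsymbol{y}), \qquad f(\boldsymbol{x}) = \exp\left(-\frac{\lVert \boldsymbol{x} \rVert^2}{2\sigma^2}\right), \]

the middle factor is PSD by rules 1 and 5 applied to the dot-product kernel, and wrapping it with \(f(\boldsymbol{x}) \cdots f(\boldsymbol{y})\) preserves PSD-ness by rule 6.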

Computation Problems

  • Compute the sum of the kernel matrix entries, \(\sum_{i,j=1}^n K_{ij}\). A \((1+\epsilon)\)-approximation algorithm exists.

  • Compute the kernel alignment of two kernel matrices \(\boldsymbol{K} _1\) and \(\boldsymbol{K} _2\):

    \[ \frac{\langle \boldsymbol{K} _1 , \boldsymbol{K} _2 \rangle}{ \sqrt{ \langle \boldsymbol{K} _1, \boldsymbol{K} _1 \rangle \times \langle \boldsymbol{K} _2, \boldsymbol{K} _2 \rangle}} \]

    A sub-linear-time algorithm exists for Gaussian or Laplacian \(\boldsymbol{K}\).

  • Compute the top eigenvalue of \(\boldsymbol{K}\). A \((1+\epsilon)\)-approximation algorithm exists.

  • Kernel density estimation: \(F(\boldsymbol{y}) = \frac{1}{m} \sum_{i=1}^m k(\boldsymbol{x} _i, \boldsymbol{y})\). A \((1+\epsilon)\)-approximation algorithm exists.

See Faster Kernel Matrix Algebra via Density Estimation.
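
For reference, here is a minimal brute-force sketch of the alignment computation (quadratic time; assuming NumPy and taking \(\langle \cdot, \cdot \rangle\) to be the Frobenius inner product; the Gram matrices below are illustrative). The cited paper is what provides the sub-linear-time approximations.

```python
import numpy as np

def alignment(K1, K2):
    """Kernel alignment <K1, K2> / sqrt(<K1, K1> <K2, K2>) with the Frobenius inner product."""
    inner = lambda A, B: np.sum(A * B)
    return inner(K1, K2) / np.sqrt(inner(K1, K1) * inner(K2, K2))

# illustrative Gram matrices from random data
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
K_lin = X @ X.T                                            # linear kernel
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1) # pairwise squared distances
K_rbf = np.exp(-sq / 2.0)                                  # Gaussian kernel, sigma = 1
print(alignment(K_lin, K_rbf))
```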

Why Use Kernels

Kernels are popular. Why? Let’s first look at two theorems.

Mercer’s Theorem

Given a kernel function \(k(x, y)\), can we find a map \(\boldsymbol{\phi}: \mathcal{S} \rightarrow \mathcal{H}\) that reproduces (or at least approximates) the kernel values, \(k(x, y)\approx \langle\boldsymbol{\phi}(x), \boldsymbol{\phi}(y)\rangle_{\mathcal{H}}\)?

Theorem (Mercer’s, 1909)

Under fairly general conditions, for any PSD kernel \(k: \mathcal{S} \times \mathcal{S} \rightarrow \mathbb{C}\), there exists a map \(\boldsymbol{\phi}: \mathcal{S} \rightarrow \mathcal{H}\) such that \(k(x, y)=\langle\boldsymbol{\phi}(x), \boldsymbol{\phi}(y)\rangle_{\mathcal{H}}\).

Note that \(\boldsymbol{\phi}\) can map into an infinite-dimensional space \(\mathcal{H}\).
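
For example (a standard expansion, included here to make the infinite-dimensional point concrete), the one-dimensional Gaussian kernel has an explicit infinite-dimensional feature map:

\[ k(x, y) = \exp\left(-\frac{(x-y)^2}{2}\right) = e^{-x^2/2} e^{-y^2/2} \sum_{m=0}^{\infty} \frac{(xy)^m}{m!} = \sum_{m=0}^{\infty} \underbrace{\frac{x^m e^{-x^2/2}}{\sqrt{m!}}}_{\phi_m(x)} \cdot \underbrace{\frac{y^m e^{-y^2/2}}{\sqrt{m!}}}_{\phi_m(y)} = \langle \boldsymbol{\phi}(x), \boldsymbol{\phi}(y) \rangle_{\ell^2} \]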

Representer Theorem

Representer Theorem (Simplified)

Consider an optimization problem on a data set \(\mathcal{D} = \left\{ \boldsymbol{x}_i ,y_i \right\} _{i=1}^n, \boldsymbol{x}_i \in \mathcal{S}\) of the following canonical form

\[ \boldsymbol{w}^{*}=\underset{\boldsymbol{w}}{\arg \min } \sum_{i=1}^{n} \mathcal{L}\left(\left\langle\boldsymbol{w}, \boldsymbol{\phi} \left(\boldsymbol{x}_i \right)\right\rangle, y_{i}\right) + \lambda \|\boldsymbol{w}\|^{2} \]

where

  • \(\boldsymbol{\phi} : \mathcal{S} \rightarrow \mathbb{R} ^d\) (or \(\mathcal{S} \rightarrow \mathcal{H}\) in general)

  • \(\boldsymbol{w} \in \mathbb{R} ^d\) is the coefficient vector of the model, to be solved for

  • \(\mathcal{L}(\cdot, \cdot): \mathbb{R} \times \mathbb{R} \rightarrow \mathbb{R}\) is an arbitrary loss function

  • \(\lambda > 0\)

Then there exist \(\left\{\alpha_{i} \in \mathbb{R}\right\}_{i=1}^{n}\) such that the solution \(\boldsymbol{w} ^*\) is a linear combination of the \(n\) transformed data vectors:

\[\boldsymbol{w}^{*}=\sum_{i=1}^{n} \alpha_{i} \boldsymbol{\phi} \left(\boldsymbol{x}_i \right) = \boldsymbol{X} ^\top \boldsymbol{\alpha}\]

where \(\boldsymbol{X}\) is the matrix whose \(i\)-th row is \(\boldsymbol{\phi}(\boldsymbol{x}_i)^\top\) and \(\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_n)^\top\).

Example (Linear regression OLS)

In linear regression, we can write \(\boldsymbol{\beta}\) as a linear combination of the data vectors \(\boldsymbol{x}_i\) (identity transformation \(\boldsymbol{\phi}\)).

\[\begin{split}\begin{aligned} && \boldsymbol{X} ^\top \boldsymbol{\alpha} &= (\boldsymbol{X} ^\top \boldsymbol{X} )^{-1} \boldsymbol{X} ^\top \boldsymbol{y} \\ &\Rightarrow& (\boldsymbol{X} \boldsymbol{X} ^\top )(\boldsymbol{X} \boldsymbol{X} ^\top) \boldsymbol{\alpha} &=\boldsymbol{X} \left( \boldsymbol{X} ^\top \boldsymbol{X} (\boldsymbol{X} ^\top \boldsymbol{X} )^{-1} \right)\boldsymbol{X} ^\top \boldsymbol{y} \quad \text{left-multiply by } \boldsymbol{X} \boldsymbol{X} ^\top \boldsymbol{X} \\ &\Rightarrow& \boldsymbol{K}^2 \boldsymbol{\alpha} &= \boldsymbol{K} \boldsymbol{y} \quad \text{let } \boldsymbol{K} = \boldsymbol{X} \boldsymbol{X} ^\top \\ &\Rightarrow& \boldsymbol{\alpha} &= \boldsymbol{K} ^{-1} \boldsymbol{y} \\ \end{aligned}\end{split}\]

Note that \(\boldsymbol{K}\) is usually not invertible (e.g. when \(n > d\), its rank is at most \(d\)), so \(\boldsymbol{K}^{-1}\) above should be read loosely, e.g. as a pseudoinverse. Nevertheless, a valid \(\boldsymbol{\alpha}\) always exists, which can be shown by gradient descent, see this note (though it remains unclear why \(\alpha_{i}^{(t)}=-s \sum_{r=0}^{t-1} \gamma_{i}^{(r)}\) converges).
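
Here is a minimal numerical check of this example (assuming NumPy, and using the pseudoinverse of \(\boldsymbol{K}\) since \(\boldsymbol{K}\) is singular when \(n > d\)):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 3
X = rng.standard_normal((n, d))           # rows are the data vectors x_i
y = rng.standard_normal(n)

beta = np.linalg.solve(X.T @ X, X.T @ y)  # OLS solution (X^T X)^{-1} X^T y

K = X @ X.T                               # Gram matrix, singular since rank <= d < n
alpha = np.linalg.pinv(K) @ y             # dual coefficients via pseudoinverse

print(np.allclose(X.T @ alpha, beta))     # True: beta is a linear combination of the x_i
```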

Implication

Substituting \(\boldsymbol{w} ^*\) back into the loss function, we have, for the loss of a data point \((\boldsymbol{x}, y)\),

\[\begin{split}\begin{aligned} \mathcal{L}\left(\left\langle\boldsymbol{w}^{*}, \boldsymbol{\phi}(\boldsymbol{x})\right\rangle, y\right) &=\mathcal{L}\left(\left\langle\sum_{i=1}^{n} \alpha_{i} \boldsymbol{\phi}\left(\boldsymbol{x}_{i}\right), \boldsymbol{\phi}(\boldsymbol{x})\right\rangle, y\right) \\ &=\mathcal{L}\left(\sum_{i=1}^{n} \alpha_{i}\left\langle\boldsymbol{\phi}\left(\boldsymbol{x}_{i}\right), \boldsymbol{\phi}(\boldsymbol{x})\right\rangle, y\right) \\ &=\mathcal{L}\left(\sum_{i=1}^{n} \alpha_{i} k\left(\boldsymbol{x}_{i}, \boldsymbol{x}\right), y\right) \end{aligned}\end{split}\]

The last line implies that

  • We don’t even need to know what \(\boldsymbol{\phi}\) is exactly; we just work with the kernel, and a corresponding \(\boldsymbol{\phi}\) exists under the hood.

  • We can convert the minimization over \(\boldsymbol{w}\) into a minimization over \(\boldsymbol{\alpha}\), as the sketch below illustrates.
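
To make both points concrete, here is a minimal kernel ridge regression sketch (squared loss in the canonical form above; plugging \(\boldsymbol{w} = \boldsymbol{X}^\top \boldsymbol{\alpha}\) into the objective gives the closed form \(\boldsymbol{\alpha} = (\boldsymbol{K} + \lambda \boldsymbol{I})^{-1} \boldsymbol{y}\)). Assuming NumPy; note that training and prediction only ever evaluate \(k(\cdot,\cdot)\), never \(\boldsymbol{\phi}\).

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    """Gaussian RBF kernel; its feature map is infinite-dimensional, but we never need it."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def fit(X, y, lam=0.1, kernel=rbf):
    """Kernel ridge regression: alpha = (K + lam I)^{-1} y."""
    n = len(X)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    return np.linalg.solve(K + lam * np.eye(n), y)

def predict(X_train, alpha, x_new, kernel=rbf):
    """f(x) = sum_i alpha_i k(x_i, x) -- only kernel evaluations, no phi."""
    return sum(a * kernel(xi, x_new) for a, xi in zip(alpha, X_train))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
alpha = fit(X, y)
print(predict(X, alpha, np.array([1.0])))   # roughly sin(1) ~ 0.84
```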

Logic

Now we can summarize the logic of using kernels.

  1. People find that feature transformations \(\boldsymbol{\phi}\) help improve model performance. But there are many choices of \(\boldsymbol{\phi}: \mathcal{S} \rightarrow \mathbb{R} ^d\), and it is hard to enumerate them or handcraft an optimal one.

  2. People find that using kernels can improve model performance too, and more importantly, kernels are related to feature transformations:

    • For every feature map \(\boldsymbol{\phi}\) into \(\mathcal{H}\), there is a corresponding kernel \(k(\boldsymbol{x}_i , \boldsymbol{x}_j) = \langle \boldsymbol{\phi}(\boldsymbol{x}_i ), \boldsymbol{\phi}(\boldsymbol{x}_j ) \rangle_{\mathcal{H}}\). In particular, if there is an optimal \(\boldsymbol{\phi} ^*\), then there is a corresponding kernel \(k^*(x, y) = \langle \boldsymbol{\phi} ^{*}(x), \boldsymbol{\phi} ^*(y) \rangle_{\mathcal{H}}\).

    • For every PSD kernel \(k(\cdot, \cdot)\), there exists a feature transformation \(\boldsymbol{\phi}(\cdot)\) by Mercer’s Theorem. Note that \(\boldsymbol{\phi}(\cdot)\) can be infinite-dimensional for some \(k(\cdot, \cdot)\).

    • An advantage of kernels over explicit \(\boldsymbol{\phi}\) is that we never need to compute \(\boldsymbol{\phi}\) or the inner product in the (possibly very high- or infinite-dimensional) feature space; evaluating \(k(x, y)\) directly is often much cheaper, as the sketch after this list illustrates.

  3. Therefore, instead of handcrafting feature transformations \(\boldsymbol{\phi}\), people switch to choosing (PSD) kernels to improve model performance.
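
A small illustration of the cost point above (the numbers below are just the standard monomial counts, added for concreteness): for the polynomial kernel \(k(\boldsymbol{x}, \boldsymbol{y}) = (\boldsymbol{x}^\top \boldsymbol{y})^p\), one kernel evaluation costs \(O(d)\), while the explicit feature map lives in a space of dimension \(\binom{d+p-1}{p}\).

```python
from math import comb

d, p = 100, 5                       # input dimension and polynomial degree
explicit_dim = comb(d + p - 1, p)   # number of degree-p monomials = dimension of phi(x)
print(explicit_dim)                 # 91962520: ~9.2e7 coordinates just to store phi(x)
# In contrast, evaluating k(x, y) = (x^T y)^p needs only O(d) = O(100) multiplications.
```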

Kernelized Model Classes

Some model classes have kernelized versions, e.g. kernel PCA, kernel CCA, kernel SVM. But it is not always easy to identify whether a model class has a corresponding kernelized version.