Fourier Transform-based Representations

Why study the Fourier transform in ML?

  • Staple of feature representation for many types of data (audio, images, others) for many decades, still useful in many settings

  • Also useful for deriving new algorithms in machine learning

Before we do any learning, we start with some data representation. Typically, this involves converting raw data into feature vectors. These days, much of feature engineering has been replaced by neural network-based feature learning, but not all of it: the choice of initial representation can still be important for

  • Visualization of raw data

  • Qualitative understanding of the data

  • Better input to a learning algorithm

  • When the number of observations is small, handcrafted features can be more useful than complicated models such as neural networks

Fourier methods include

  • Discrete-time and continuous-time Fourier series

  • Discrete-time and continuous-time Fourier transforms

  • Discrete Fourier transform (most common for digital signals)

Basics

Main idea

Decompose a signal \(\left\{ \boldsymbol{x} _t \right\}=x_0, x_1, \ldots, x_{N-1}\) as a sum of multiple sinusoids at different frequencies

\[x_t=\sum_{k} a_{k} f_{k}(t)\]

where

  • Signal = function of a discrete “time” variable \(t\)

  • Subscript notation reminds us that the time variable is discrete

  • \(f_k(t)\) sinusoid (sine/cosine function) at frequency indexed by \(k\)

  • \(a_k =\) “amount” of frequency \(k\) present in \(\left\{ \boldsymbol{x} _t \right\}\), aka “spectrum” of the signal (but this word has other meanings).

  • Fourier transform algorithms: ways of finding \(a_k\) given \(\left\{ \boldsymbol{x} _t \right\}\)
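
To make the decomposition concrete, here is a minimal sketch (assuming `numpy`; the frequency indices and amounts \(a_k\) are made up for illustration) that synthesizes a short signal from a few cosines and then peeks at the spectrum recovered by numpy's FFT:

```python
import numpy as np

N = 64
t = np.arange(N)
amounts = {1: 1.0, 4: 0.5, 9: 0.25}   # hypothetical a_k for a few frequency indices k

# x_t = sum_k a_k f_k(t) with f_k(t) = cos(2 pi k t / N)
x = sum(a * np.cos(2 * np.pi * k * t / N) for k, a in amounts.items())

# A Fourier transform algorithm recovers the a_k from {x_t}; numpy's FFT shows
# peaks of size ~a_k/2 at k = 1, 4, 9 (plus their mirror images at N - k).
print(np.round(np.abs(np.fft.fft(x)) / N, 3)[:12])
```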

Demo:

  • http://www.falstad.com/dfilter/

  • https://www.youtube.com/watch?v=spUNpyF58BY&ab_channel=3Blue1Brown

Motivation

The physiology of hearing suggests that the structure of the human ear performs something like a Fourier transform.

Fig. 56 Hairs in the cochlea (3) have different frequency responses [image link]

Examples and Applications

  • Speech

    Fig. 57 Separating components of a speech signal (apply the FT to rolling windows)

  • Financial market data

  • Weather data

  • Medical imaging, other scientific imaging

  • Image compression (e.g. JPEG)

Applications to machine learning

  • Feature extraction, compression and de-noising of speech/images: Can be an important precursor to unsupervised (or supervised) learning

  • Approximating kernels [Rahimi & Recht 2007]

  • Speeding up convolutional neural networks [Mathieu et al. 2013]

  • Analyzing and regularizing neural networks [Aghazadeh et al. 2020]

Discrete Fourier Transform (DFT)

We often start with a very long signal and compute its spectrum over sub-sequences (“windows”) of fixed length \(N\) starting at sample \(t\)

\[ \left\{ \boldsymbol{x}_t \right\} = x_t, x_{t+1}, \ldots x_{t+N-1} \]

Transformation

The discrete Fourier transform (DFT) transforms \(\left\{ \boldsymbol{x}_t \right\}\) into another sequence \(\left\{ \boldsymbol{X} _k \right\} = X_0, X_1, \ldots, X_{N-1}\) where

\[ X_k=\sum_{n=t}^{t+N-1} x_n e^{-j 2 \pi k n / N}, \quad k=0, \ldots, N-1 \]
  • The DFT \(\left\{ \boldsymbol{X} _k \right\}\) is also called the spectrum. \(X_k\) is the value of the spectrum at the \(k\)-th frequency

  • Equivalently, we can consider \(k = -N/2, \ldots, 0, \ldots, N/2\), i.e. the sequence is \(N\)-periodic. Sometimes people write \(\sum_{n=<N>}\) for simplicity

  • The fast Fourier transform (FFT) is an efficient algorithm for computing the DFT when the window length is \(M = 2^m\) for some integer \(m\).

  • After doing this for all frames (windows) of a signal, the result is a spectrogram

  • \(X_k\) is in general complex-valued

    • often only the real part of it is used. (Why complex numbers? Real-world signals are real… but complex signals are often much easier to analyze)

    • we often use only its magnitude or phase

  • units:

    • Each time sample \(n\) corresponds to a time \(n / R\) in seconds, where \(R\) is the sampling rate

    • Each frequency sample \(k\) corresponds to a frequency \(f(k) = \frac{k}{N} R\) in Hz

Important property: Spectra of real signals are conjugate-symmetric

  • Magnitude is symmetric about \(k = 0\) (equivalently about \(N/2\))

  • Phase is anti-symmetric about \(k = 0\) (equivalently about \(N/2\))

  • So we need only think about positive frequencies
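
A minimal sketch (assuming `numpy`; the sampling rate and the 1 kHz tone are made up) that computes the DFT of a real window, checks conjugate symmetry, and maps the frequency index \(k\) to Hz:

```python
import numpy as np

R = 8000                                     # hypothetical sampling rate in Hz
N = 256                                      # window length
n = np.arange(N)
x = np.cos(2 * np.pi * 1000 * n / R) + 0.1 * np.random.randn(N)   # 1 kHz tone + noise

X = np.fft.fft(x)                            # X_k = sum_n x_n e^{-j 2 pi k n / N}

# Real signal => conjugate symmetry: X_{N-k} = conj(X_k)
assert np.allclose(X[1:], np.conj(X[1:][::-1]))

# Frequency index k corresponds to f(k) = k/N * R Hz; the magnitude peaks near 1 kHz.
k_peak = np.argmax(np.abs(X[: N // 2]))
print("peak at", k_peak / N * R, "Hz")
```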

Relation between \(X_k\) and \(a_k\)

For historical reasons we will define \(X_k=N a_k\), i.e.

\[ a_{k}=\frac{1}{N} \sum_{n=<N>} x_n e^{-j k \omega_{0} n} \]

which is called the analysis equation.

The decomposition is called the synthesis equation

\[ x_n=\sum_{k=<N>} a_{k} e^{j k \omega_{0} n} \]

Note that

  • \(\omega_0 = 2\pi/N\)

  • Notation: Sometimes written \(x_n \leftrightarrow a_k\)

  • Convenient to think of \(a_k\) as being defined for all \(k\), although only \(N\) of them are needed: \(a_{k+N} = a_k\)

  • Since \(x_n\) is periodic, it is specified uniquely by only \(N\) numbers, in either the time or the frequency domain
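
A direct numerical check of the analysis/synthesis pair and of \(X_k = N a_k\) (a minimal sketch, assuming `numpy`):

```python
import numpy as np

N = 16
n = np.arange(N)
k = np.arange(N)
w0 = 2 * np.pi / N
x = np.random.randn(N)                              # arbitrary real signal, one period

E = np.exp(-1j * w0 * np.outer(k, n))               # E[k, n] = e^{-j k w0 n}
a = E @ x / N                                       # analysis equation
x_rec = np.exp(1j * w0 * np.outer(n, k)) @ a        # synthesis equation

print(np.allclose(x_rec.real, x))                   # True: the pair is invertible
print(np.allclose(a, np.fft.fft(x) / N))            # True: a_k = X_k / N
```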

Examples

Cosine function

\[\begin{split} \begin{aligned} x_n &=\cos \left(\frac{\pi}{4} n\right) \\ &=\frac{1}{2}\left(e^{j \pi n / 4}+e^{-j \pi n / 4}\right) \\ \Longrightarrow \omega_{0} &=\pi / 4, \quad N=8, \quad a_{1}=a_{-1}=1 / 2, \quad X_{1}=X_{-1}=8 \cdot \frac{1}{2}=4 \end{aligned} \end{split}\]

Sine function

\[\begin{split} \begin{aligned} x_n &=\sin \left(\frac{\pi}{4} n\right) \\ &=\frac{1}{2 j}\left(e^{j \pi n / 4}-e^{-j \pi n / 4}\right) \\ \Longrightarrow \omega_{0} &=\pi / 4, N=8, a_{1}=\frac{1}{2 j}, a_{-1}=-\frac{1}{2 j} \end{aligned} \end{split}\]
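
These worked values are easy to verify numerically (a minimal sketch, assuming `numpy`):

```python
import numpy as np

N = 8
n = np.arange(N)

Xc = np.fft.fft(np.cos(np.pi * n / 4))
print(np.round(Xc, 6))        # X_1 = X_7 = 4, i.e. a_1 = a_{-1} = 1/2

Xs = np.fft.fft(np.sin(np.pi * n / 4))
print(np.round(Xs, 6))        # X_1 = -4j, X_7 = +4j, i.e. a_1 = 1/(2j), a_{-1} = -1/(2j)
```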

Continuous-time Fourier Transform

Even though we mainly deal with digitized data, we sometimes wish to reason about continuous-time functions. The continuous-time Fourier transform describes signals as continuous “sums” (integrals) of sinusoids at arbitrary frequencies \(\omega\).

\[\begin{split}\begin{aligned} x(t) &=\frac{1}{2 \pi} \int_{-\infty}^{\infty} X(j \omega) e^{j \omega t} \mathrm{~d} \omega &\text { Synthesis equation } \\ X(j \omega) &=\int_{-\infty}^{\infty} x(t) e^{-j \omega t} \mathrm{~d} t & \text { Analysis equation } \end{aligned}\end{split}\]

2-D Discrete-“time” Fourier series/transforms

\[\begin{split} \begin{aligned} X_{k l}=& \frac{1}{M N} \sum_{<N>} \sum_{<M>} x[m, n] e^{-j 2 \pi(n k / N+m l / M)} \\ & \text{where } 0 \leq k \leq N-1,0 \leq l \leq M-1 \\ x[m, n]=& \sum_{k=0}^{N-1} \sum_{l=0}^{M-1} X_{k l} e^{j 2 \pi(n k / N+m l / M)} \end{aligned} \end{split}\]
  • Equivalent to 1-D transforms when one frequency dim is fixed.

  • The 2-D fast Fourier transform requires on the order of \(M N (\log _{2} M + \log _{2} N)\) operations (1-D FFTs along each row, then along each column).

2-D Discrete-“time” convolution

\[ y\left[n_{1}, n_{2}\right]=x\left[n_{1}, n_{2}\right] * h\left[n_{1}, n_{2}\right]=\sum_{k_{1}=-\infty}^{\infty} \sum_{k_{2}=-\infty}^{\infty} x\left[k_{1}, k_{2}\right] h\left[n_{1}-k_{1}, n_{2}-k_{2}\right] \]
  • This is the operation being done in convolutional neural networks, on the image \(x\) and the filter \(h\).

  • But in CNNs we typically don’t bother with flipping the filter (making it a cross-correlation), and state it as a dot product between the filter and each image patch

  • The properties of convolution tell us \(Y_{k l}=X_{k l} H_{k l}\)
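
A sketch of this property under numpy's (unnormalized) DFT convention, assuming `numpy` and `scipy`; the small image and filter are made up. Zero-padding makes the circular convolution implied by the DFT equal to linear convolution:

```python
import numpy as np
from scipy.signal import convolve2d          # direct 2-D convolution, for comparison

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))            # "image"
h = rng.standard_normal((5, 5))              # "filter"

# Zero-pad both arrays to the size of the full linear convolution.
shape = (x.shape[0] + h.shape[0] - 1, x.shape[1] + h.shape[1] - 1)
Y = np.fft.fft2(x, shape) * np.fft.fft2(h, shape)    # Y_{kl} = X_{kl} H_{kl}
y_fft = np.real(np.fft.ifft2(Y))

y_direct = convolve2d(x, h, mode="full")
print(np.allclose(y_fft, y_direct))          # True
```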

Applications to Machine Learning

Random Fourier Features

We saw earlier that computing the kernel matrix is computationally expensive. Instead, we can approximate a feature map \(\phi(\boldsymbol{x})\) corresponding to a given kernel.

Theorem (Bochner’s)

Positive definite shift-invariant kernels (kernels that depend only on the difference \(\boldsymbol{x} -\boldsymbol{y} \)), suitably normalized, are \(n\)-dimensional continuous Fourier transforms of probability distributions \(p(\boldsymbol{\omega} )\),

\[ k(\boldsymbol{x} , \boldsymbol{y} )=k(\boldsymbol{x} -\boldsymbol{y} )=\int p(\boldsymbol{\omega} ) e^{j \boldsymbol{\omega} ^{\top} (\boldsymbol{x} -\boldsymbol{y} )} \mathrm{~d} \boldsymbol{\omega} =\mathbb{E}_{\omega}\left[\xi_{\omega}(\boldsymbol{x} ) \xi_{\omega}(\boldsymbol{y} )^{*}\right] \]

where \(\xi_{\omega}(\boldsymbol{x} )=e^{j \boldsymbol{\omega} ^{\top} \boldsymbol{x} }\).

So, we can estimate the kernel by averaging a bunch of such products over randomly drawn values of \(\boldsymbol{\omega}\).

Idea: Use a feature map \(\phi(\boldsymbol{x})\) that is a concatenation of a bunch of \(\xi\)’s. Then we have an explicit nonlinear map, and no need to do large kernel computations!

For the Gaussian, Laplacian, and Cauchy kernels, \(p(\boldsymbol{\omega})\) is easy to find.

\[\begin{split} \begin{array}{lll} \text { Kernel Name } & k(\Delta) & p(\boldsymbol{\omega}) \\ \hline \text { Gaussian } & e^{-\frac{\|\Delta\|_{2}^{2}}{2}} & (2 \pi)^{-\frac{D}{2}} e^{-\frac{\|\omega\|_{2}^{2}}{2}} \\ \text { Laplacian } & e^{-\|\Delta\|_{1}} & \prod_{d} \frac{1}{\pi\left(1+\boldsymbol{\omega}_{d}^{2}\right)} \\ \text { Cauchy } & \prod_{d} \frac{2}{1+\Delta_{d}^{2}} & e^{-\|\boldsymbol{\omega}\|_{1}} \end{array} \end{split}\]
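
A quick numerical check of Bochner's theorem for the Gaussian kernel (a minimal sketch, assuming `numpy`; the test points are made up): sampling \(\boldsymbol{\omega} \sim \mathcal{N}(0, I)\) and averaging \(\cos(\boldsymbol{\omega}^{\top} \Delta)\) should approximate \(e^{-\|\Delta\|_2^2/2}\).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
x, y = 0.3 * rng.standard_normal(d), 0.3 * rng.standard_normal(d)
delta = x - y

# p(omega) for the Gaussian kernel is a standard normal; the imaginary part of
# E[e^{j omega^T delta}] vanishes by symmetry, so averaging cosines is enough.
omegas = rng.standard_normal((100_000, d))
estimate = np.mean(np.cos(omegas @ delta))

print(estimate, np.exp(-np.dot(delta, delta) / 2))   # close to each other
```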

Algorithm: Random Fourier Features.


  • Require: A positive definite shift-invariant kernel \(k(\boldsymbol{x}, \boldsymbol{y})=k(\boldsymbol{x}-\boldsymbol{y})\)

  • Ensure: A randomized feature map \(\boldsymbol{z}(\boldsymbol{x}): \mathcal{R}^{d} \rightarrow \mathcal{R}^{2 D}\) so that \(\boldsymbol{z}(\boldsymbol{x})^{\prime} \boldsymbol{z}(\boldsymbol{y}) \approx k(\boldsymbol{x}-\boldsymbol{y})\)

  • Compute the Fourier transform \(p\) of the kernel \(k: p(\boldsymbol{\omega} )=\frac{1}{2 \pi} \int e^{-j \boldsymbol{\omega} ^{\top} \Delta} k(\Delta) \mathrm{~d} \Delta\)

  • Draw \(D \text { iid samples } \boldsymbol{\omega}_{1}, \cdots, \boldsymbol{\omega}_{D} \in \mathcal{R}^{d} \text { from } p\)

  • Let \(\boldsymbol{z}(\boldsymbol{x}) \equiv \sqrt{\frac{1}{D}}\left[\cos \left( \boldsymbol{\omega} _{1}^{\prime} \boldsymbol{x}\right) \cdots \cos \left(\boldsymbol{\omega}_{D}^{\prime} \boldsymbol{x}\right) \sin \left(\boldsymbol{\omega}_{1}^{\prime} \boldsymbol{x}\right) \cdots \sin \left(\boldsymbol{\omega}_{D}^{\prime} \boldsymbol{x}\right)\right]^{\prime}\)
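
A minimal sketch of the algorithm for the Gaussian kernel (assuming `numpy`; \(D\) and the test points are arbitrary):

```python
import numpy as np

def random_fourier_features(X, D, rng):
    """Map rows of X to 2D-dimensional features for the Gaussian kernel,
    for which p(omega) is a standard normal."""
    omegas = rng.standard_normal((D, X.shape[1]))    # D iid samples from p
    proj = X @ omegas.T                              # omega_i' x for each row of X
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(D)

rng = np.random.default_rng(0)
x, y = 0.3 * rng.standard_normal(5), 0.3 * rng.standard_normal(5)

Z = random_fourier_features(np.vstack([x, y]), D=2000, rng=rng)
print(Z[0] @ Z[1])                                   # z(x)'z(y), approx k(x - y)
print(np.exp(-np.sum((x - y) ** 2) / 2))             # exact Gaussian kernel value
```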


Faster CNN Training

In between two convolutional layers with \(d_1\) and \(d_2\) channels, \(d_1 \times d_2\) convolutions are computed in forward propagation (one per input-output channel pair). Fourier transform-based convolution can speed up this process. [Mathieu, Henaff, LeCun, “Fast Training of Convolutional Networks through FFTs”, arXiv. 2013]
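
To illustrate the speedup on a single channel pair (a minimal sketch, assuming `numpy`/`scipy`; the sizes are arbitrary and not taken from the paper), FFT-based convolution avoids the per-pixel filter loop of direct convolution; roughly speaking, the paper's further savings come from reusing the FFTs of inputs and filters across the \(d_1 \times d_2\) channel pairs.

```python
import numpy as np
from scipy.signal import convolve2d, fftconvolve
from timeit import timeit

rng = np.random.default_rng(0)
fmap = rng.standard_normal((256, 256))       # one input feature map
filt = rng.standard_normal((15, 15))         # one (large-ish) filter

t_direct = timeit(lambda: convolve2d(fmap, filt, mode="same"), number=3)
t_fft = timeit(lambda: fftconvolve(fmap, filt, mode="same"), number=3)
print(f"direct: {t_direct:.3f}s   fft: {t_fft:.3f}s")
```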

Today there are other methods that speed up CNN training.