Streaming Algorithms

Algorithms for Big Data

Problems

  • Cannot even read the whole data. Sol: sub-linear time algorithms

  • Cannot afford storing the data. Sol: streaming algorithms

  • Single processor does not suffice. Sol: map-reduce algorithms

Tasks

  • Searching through the data, e.g. nearest neighbor

  • Efficient summary, e.g. summary statistics

Techniques

  • Sampling, importance sampling

  • Dimensionality reduction

Reference

  • https://www.sketchingbigdata.org/fall17/lec/lec1.pdf

  • CMPSCI 711: More Advanced Algorithms. link

Terms:

  • Sketching is a compression \(C(x)\) of a data set \(x\) that still allows us to answer a query \(f(x)\). We can understand the sketch as a compression of \(x\) that greatly reduces the dimension of the vector while retaining enough information to give a good estimate of the query we care about.

    • For instance, \(x\) is a sequence of observations, and we are interested in \(\operatorname{mean} (x)\). But \(x\) is too large, so we use \(C(x)\) to (sometimes approximately) compute \(\operatorname{mean} (x)\).

    • Linear sketch:

      • \(C(x + y) = C(x) + C(y)\)

      • \(C(ax) = aC(x)\)

    • In real life, if there are two data streams \(x\) and \(y\) in two different places, and the headquarters wants to estimate \(f(x + y)\), it’s not necessary to send \(x\) and \(y\) to the headquarters and do the calculation there. Instead, we can send \(C(x)\) and \(C(y)\) to the headquarters and output \(f(C(x) + C(y))\) (see the sketch after this list).

    • It is possible to have binary \(f\), i.e. compute \(f(x, y)\) from \(C(x)\) and \(C(y)\).
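
As a concrete illustration of why linearity helps in that distributed setting, here is a minimal Python sketch. The random-sign map \(C(v) = \boldsymbol{S} v / \sqrt{k}\), the dimensions, and the vectors are illustrative assumptions (this is essentially the AMS-style sketch discussed later).

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 10_000, 2_000                       # data dimension and sketch dimension (illustrative)
S = rng.choice([-1.0, 1.0], size=(k, m))   # shared random signs define the linear map C(v) = S v / sqrt(k)

def C(v):
    # Linearity: C(x + y) = C(x) + C(y) and C(a x) = a C(x).
    return S @ v / np.sqrt(k)

x = rng.normal(size=m)   # frequency vector held at site 1
y = rng.normal(size=m)   # frequency vector held at site 2

# Headquarters receives only the two short sketches, yet can estimate f(x + y) = ||x + y||_2^2.
estimate = np.sum((C(x) + C(y)) ** 2)
exact = np.sum((x + y) ** 2)
print(estimate, exact)   # close, with relative error roughly 1 / sqrt(k)
```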

Background

Problem

  • Given a sequence of input data \(y_1, \ldots, y_n\), where each \(y_j \in [m]\), let \(\boldsymbol{x} \in [n]^m\) be the vector of sample frequencies, where \(x_i\) counts the number of \(y_j\) that have value \(i\), for \(i \in [m]\). We are interested in

    • \(\left\| \boldsymbol{x} \right\| _0\): how many distinct \(y_i\) are there? Let it be \(\ell_0\), note that \(\ell_0\le m\).

    • \(\left\| \boldsymbol{x} \right\| _1\): how many \(y_i\) are there, i.e. how large is \(n\)?

    • \(\left\| \boldsymbol{x} \right\| _2\): what is the Euclidean norm \(\ell_2\) of \(\boldsymbol{x}\)?

    • \(x_i\): what is the frequency of \(y\) that has value \(i\)?

Limitation

  • the data set of interest is large, cannot fit into the main memory

  • we only have sequential access to the data (one pass or few passes), i.e. the data comes as a stream.

Goal

  • use little memory, e.g. sub-linear in input parameter or input size

  • we may only solve the problem of interest approximately, i.e. find an estimator \(\hat{\ell}_p\) for \(\ell_p = \left\| x \right\| _p\) that is

    • unbiased \(\mathbb{E}[\hat{\ell}_p] = \ell_p\)

    • good concentration

      • \((\epsilon, \delta)\) approximation (guarantee): given \(0 < \epsilon, \delta< 1\), want the probability of deviation from true value \(\ell_p\) by \(\epsilon \ell_p\) to be no larger than \(\delta\). This event can be viewed as a failure output, and its probability is called failure probability.

        \[\mathbb{P}[ \vert \hat{\ell}_p - \ell_p \vert> \epsilon \ell_p ] < \delta, \quad \text{equivalently} \quad \mathbb{P}[ \hat{\ell}_p \in ( 1 \pm \epsilon) \ell_p ] \ge 1 - \delta\]

        Often, given required \(\epsilon\), we want to find good \(\hat{\ell}_p\) that gives lower \(\delta\).

      • constant approximation

        \[ \mathbb{P} [\hat{\ell}_p \ge c \ell _p ] \le \delta \]

        we can then repeat the procedures to boost the success probability.

Overall algorithm design intuition:

  • maintain in memory some quantity \(Q\) (can be a scalar, vector, matrix)

  • upon receiving new \(y_i\), update \(Q\) probabilistically, rather than deterministically:

    • update \(Q\) with some probability

      • depending on \(Q\) (Morris for \(\ell_1\))

      • depending on \(y_i\) (FM for \(\ell_0\))

    • update \(Q \mathrel{+}= q\) with some random quantity \(q\) (AMS, CountMin, etc.)

  • output some transformation \(f(Q)\) of \(Q\), such that

    • unbiasedness: \(\mathbb{E} [f(Q)] = \ell_p\).

    • good concentration probability \(\mathbb{P}[ \vert f(Q) - \ell_p \vert> \epsilon \ell_p ] < \delta\).

  • improvement

    • repeat \(s\) instantiations to obtain \(Q_1, \ldots, Q_s\), report their average \(Q_+\) (FM+, Morris+, AMS)

    • repeat \(t\) instantiations of \(Q_+\), report their median (FM++, Morris++, AMS++).

Metrics for reviewing algorithms

  • memory usage

  • number of passes

  • approximation factor

  • (sometimes) query/update time

Types of streams

  • Insertion-only: only \(\texttt{Insert(i)}: x_i \mathrel{+}= 1\) is allowed.

  • Insertion and deletion (dynamic): also allow \(\texttt{Delete(i)}: x_i \mathrel{-}= 1\), e.g. numbers, edges of graphs, etc. Assume that \(\#\operatorname{deletions} (i) \le \#\operatorname{insertions} (i)\) such that \(x_i \ge 0\).

  • Turnstile: more general case for vectors and matrices, with operation \(\texttt{Add}(i,\Delta): x_i \mathrel{+}= \Delta\)


Note: section Estimate \(\ell_1\) might be a better appetizer than Estimate \(\ell_0\).

Estimate \(\ell_0\)

How many distinct \(y_i\) are there? Aka distinct element counting problem. For instance, \(m = 10\), input stream is \((3,6,9,3,4,5,4)\), output \(5\).

Reference: https://www.sketchingbigdata.org/fall17/lec/lec2.pdf

Trivial Solution

  • One trivial solution is to keep a bit vector \(\boldsymbol{c}\) of size \(m\): for each incoming number with value \(i \in [m]\), set \(c_i = 1\). The answer is the number of ones in \(\boldsymbol{c}\), and the space is \(m\) bits.

  • Another trivial solution is to just store the stream, which takes \(\min(m, n) \log m\) bits.

These methods are not efficient if \(m, n\) are large.

FM Algorithm

We introduce the FM algorithm.


Algorithms: Flajolet-Martin (FM)


  • Pick a random hash function \(h: [m] \rightarrow [0,1]\).

  • Maintain in memory the smallest hash we’ve seen so far: \(X = \min_{i \in \text{stream} } h(i)\)

  • Upon \(\texttt{query()}\), output \(\frac{1}{X} - 1\)
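
Below is a minimal Python sketch of this FM procedure. The idealized hash \(h: [m] \rightarrow [0,1]\) is simulated with a cached random number generator, purely for illustration (the cache itself would defeat the memory savings; a real implementation draws \(h\) from a small hash family).

```python
import random

class FM:
    """Flajolet-Martin estimator of the number of distinct elements (idealized hash)."""

    def __init__(self, seed=None):
        self.rng = random.Random(seed)
        self.cache = {}   # simulates a fixed random hash h: [m] -> [0, 1]
        self.X = 1.0      # smallest hash value seen so far

    def _h(self, i):
        if i not in self.cache:
            self.cache[i] = self.rng.random()
        return self.cache[i]

    def insert(self, i):
        self.X = min(self.X, self._h(i))

    def query(self):
        return 1.0 / self.X - 1.0

fm = FM(seed=42)
for y in (3, 6, 9, 3, 4, 5, 4):   # the example stream from above, 5 distinct values
    fm.insert(y)
print(fm.query())   # a noisy estimate of 5; FM+ and FM++ below reduce the error
```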


Intuition

  • Random hash result can be viewed as uniform random variable.

  • We are partitioning the interval \([0, 1]\) into \(\ell_0 + 1\) sub-intervals of expected size \(\frac{1}{\ell_0+1}\)

  • We take advantage of the properties of the minimum of \(\ell_0\) i.i.d. uniform r.v.s.

Correctness

  • Unbiasedness: \(\mathbb{E} [X] = \frac{1}{\ell_0+1}\), and we hope that \(\frac{1}{X} - 1\) is close to \(\ell_0\).

    The variance is small: \(\operatorname{Var}\left( X \right) = \frac{\ell_0}{(\ell_0+1)^{2}(\ell_0+2)} \le \frac{1}{(\ell_0+1)^{2}}\).

FM+

Similarly to Morris+, we can upgrade the basic algorithm to FM+ by running it \(s = \frac{1}{\epsilon^2\eta}\) times in parallel to obtain \(X_1, X_2, \ldots, X_s\). The output is

\[ \frac{1}{\frac{1}{s} \sum_{i=1}^{s} X_{i}}-1 \]

By Chebyshev’s inequality

\[\begin{split} \begin{aligned} \mathbb{P}\left(\left|\frac{1}{s} \sum_{i=1}^{s} X_{i}-\frac{1}{\ell_0+1}\right|>\frac{\varepsilon}{\ell_0+1}\right) &<\frac{\operatorname{Var}\left[\frac{1}{s} \sum_{i} X_{i}\right]}{\frac{\varepsilon^{2}}{(\ell_0+1)^{2}}} \\ &<\frac{1}{\varepsilon^{2} s}=\eta \end{aligned} \end{split}\]

Total space is \(\mathcal{O} (\frac{1}{\epsilon^2 \eta})\).

FM++

Run \(t = \Theta(\log \frac{1}{\delta} )\) independent copies of FM+, each with failure probability \(\eta = \frac{1}{3}\). Output the median of the \(t\) FM+ estimates.

Total space is \(\mathcal{O} (\log \frac{1}{\delta} \cdot \frac{1}{\epsilon^2} )\).

Estimate \(\ell_1\)

How many \(y_i\) are there, i.e. how large is \(n\)? Aka the counting problem. A trivial solution is to maintain a counter, which requires \(\mathcal{O}(\log n)\) bits of space. But when \(n\) is large, can we do better? We introduce the Morris counter for insertion-only streams. The Morris algorithm was developed in the 1970s, when memory was expensive.

Morris

The Morris algorithm is


Algorithms: Morris Counter


  • Initialize \(X = 0\).

  • Upon receiving a new number \(y\), increase \(X \mathrel{+}= 1\) with probability \(\frac{1}{2^X}\)

  • Return \(\hat{n} = 2^X - 1\)
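
A minimal Python sketch of the Morris counter as described in the box above; the number of updates in the demo is an arbitrary choice.

```python
import random

class Morris:
    """Morris approximate counter: stores X, which is roughly log2(n), instead of n."""

    def __init__(self, seed=None):
        self.rng = random.Random(seed)
        self.X = 0

    def update(self):
        # Increment X with probability 2^(-X).
        if self.rng.random() < 2.0 ** (-self.X):
            self.X += 1

    def query(self):
        return 2 ** self.X - 1

counter = Morris(seed=1)
for _ in range(100_000):
    counter.update()
print(counter.X, counter.query())   # X is typically near log2(n), about 17 here; 2^X - 1 is unbiased but noisy
```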


  • \(X\) is attempting to store a value that is roughly \(\log n\). Hence, the amount of space required to store \(X\) is \(O(\log \log n)\)

  • To achieve this, we update the counter \(X\) probabilistically

  • Since \(X \approx \log n \ll n\), the query outputs some much larger number \(\gg X\), e.g. \(2^X\)

Correctness

  • Unbiasedness: Let \(X_n\) denote \(X\) after \(n\) updates. Then \(\mathbb{E}\left[ 2 ^{X_n} \right] = n + 1\). Hence \(\hat{n} = 2^{X_n} - 1\) is an unbiased estimator of \(\ell_1 = n\).

  • Good concentration \(\mathbb{P}[ \left\vert \hat{n} - n \right\vert \ge \epsilon n] \le \frac{1}{2\epsilon^2}\)

However, the upper bound \(\delta = \frac{1}{2\epsilon^2}\) is not very meaningful: it is below 1 only when \(\epsilon > \frac{1}{\sqrt{2}}\), and for \(\epsilon \ge 1\) we could simply return 0 and still be within \(\epsilon n\). Is there any way to find a smaller upper bound? Currently \(\operatorname{Var}\left( \hat{n} \right) = \Theta(n^2)\); can we find another estimator that does better?

Morris+

Run in parallel: \(s\) independent instantiations of Morris estimator \(\hat{n}\), then return the average, denoted \(\hat{n}_+\). Hence

\[\operatorname{Var}\left( \hat{n}_+ \right) = \operatorname{Var}\left( \frac{1}{s} \sum_{k=1}^s \hat{n}_k \right) = \frac{1}{s} \operatorname{Var}\left( \hat{n} \right)\]

Note that \(\hat{n}_+\) is still unbiased. The upper bound now becomes \(\frac{1}{s}\cdot \frac{1}{2\epsilon^2}\). How to choose \(s\)? To ensure the failure probability is less than \(\delta\), we can set \(s \ge \frac{1}{2\epsilon ^2 \delta}\). Equivalently, we can set \(s = \Theta \left( \frac{1}{\epsilon^2 \delta} \right)\) for a suitable constant.

In total there are \(s\) copies of \(X\), and each \(X\) takes \(\mathcal{O} (\log \log n)\) bits. The total space for a Morris+ estimator \(\hat{n}_+\) with \((\epsilon, \delta)\) guarantee is then \(\Theta\left( \frac{1}{\epsilon^{2} \delta} \cdot \log \log n\right)\).

We can further improve the result by using Morris++.

Morris++

Run in parallel: \(t\) independent instantiations of Morris+ estimator \(\hat{n}_{+}\) with failure probability at most \(\frac{1}{3}\), by setting \(s = \Theta \left( \frac{1}{\epsilon^2\frac{1}{3} } \right) = \Theta \left( \frac{1}{\epsilon^2} \right)\). Then return their median, denoted as \(\hat{n}_{+ +}\).

What’s the failure probability of the median \(\hat{n}_{+ +}\)? If it fails, then at least half of the \(\hat{n}_{+}\) estimates fail, i.e. at most half succeed. Formally, define

\[\begin{split} Y_{i}=\left\{\begin{array}{ll} 1, & \text { if the } i \text {-th Morris+ instantiation succeeds. } \\ 0, & \text { otherwise } \end{array}\right. \end{split}\]

Let \(S = \sum_{i=1}^t Y_i\). Then, the failure probability is

\[\begin{split}\begin{aligned} \mathbb{P}\left( \hat{n}_{+ +} \text{ fails} \right) &= \mathbb{P}\left( S \le \frac{t}{2} \right) \\ &\le \mathbb{P}\left( S \le \frac{3}{4} \mu_S \right) \quad \because \mu_S \ge \frac{2t}{3} \\ &\le e^{-(1/4)^2 \mu_S/2} \quad \text{by Chernoff Bound} \\ &\le e^{- t/48} \\ \end{aligned}\end{split}\]

To get the failure probability at most \(\delta\), we can set \(t = \Theta(\log \frac{1}{\delta})\).

In total there are \(t \times s\) copies of \(X\). The total space of Morris++ with failure probability \(\delta\) is then \(\Theta\left(\log \frac{1}{\delta} \cdot \frac{1}{\epsilon^{2}} \cdot \log \log n\right)\). In particular, for constant \(\epsilon, \delta\) (say each 1/100), the total space is \(\mathcal{O} (\log \log n)\). This is exponentially better than the \(\mathcal{O} (\log n)\) space achieved by storing an exact counter.
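
The averaging and median steps of Morris+ and Morris++ can be written out directly. Here is a minimal sketch, assuming the `Morris` class from the snippet above (the parameters `s`, `t`, and the stream length are illustrative).

```python
import statistics

def morris_plus(n_updates, s, seed=0):
    """Morris+: average of s independent Morris counters run over the same stream."""
    counters = [Morris(seed=seed + i) for i in range(s)]
    for _ in range(n_updates):
        for c in counters:
            c.update()
    return sum(c.query() for c in counters) / s

def morris_plus_plus(n_updates, s, t, seed=0):
    """Morris++: median of t independent Morris+ estimates."""
    return statistics.median(morris_plus(n_updates, s, seed=seed + 1000 * j) for j in range(t))

# s ~ 1/eps^2 controls the variance of each Morris+; t ~ log(1/delta) boosts the success probability.
print(morris_plus_plus(n_updates=10_000, s=30, t=9))   # should land close to 10_000
```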

Estimate \(\ell_2\)

AMS algorithm: maintain a single number \(Z = \sum_{i=1}^m s_i x_i\), where the signs \(s_i = \pm 1\) are drawn from a limited-independence family (2-wise independence suffices for unbiasedness, 4-wise for the variance bound). A \(k\)-wise independent family requires \(\mathcal{O}(k \log m)\) bits of randomness.

Consider the turnstile model: a stream of \(n\) updates \((i, \Delta)\), each updating \(x_i \mathrel{+}=\Delta\). We want to approximate \(\ell_2 = \left\| \boldsymbol{x} \right\| _2\). In statistics, this is related to the second moment of the data, which describes how dispersed the distribution is.

We introduce the Alon-Matias-Szegedy’96 (AMS) algorithm to estimate \(\ell_2 ^2 = \left\| \boldsymbol{x} \right\| _2 ^2\). For another method, the Johnson-Lindenstrauss (JL) lemma, see here. For \(\ell_p\)-norm approximation, see here [https://www.sketchingbigdata.org/fall17/lec/lec4.pdf].

AMS

Let \(s_i \in \left\{ -1, 1 \right\}\) with equal probability.


Algorithms: (part of) Alon-Matias-Szegedy’96 (AMS)


  • Generate \(s_1, \ldots, s_m\), one for each coordinate. Initialize \(Z=0\).

  • Upon \(\texttt{add}(i, \Delta)\), update \(Z \mathrel{+}= s_i \cdot \Delta\)

  • Upon \(\texttt{query()}\), output \(Z^2\)
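
A minimal Python sketch of a single AMS estimator under turnstile updates; fully independent signs are used in place of a 4-wise independent family, purely for illustration.

```python
import random

class AMS:
    """Single AMS estimator of ||x||_2^2 under turnstile updates add(i, delta)."""

    def __init__(self, m, seed=None):
        rng = random.Random(seed)
        self.s = [rng.choice((-1, 1)) for _ in range(m)]  # one random sign per coordinate
        self.Z = 0

    def add(self, i, delta):
        self.Z += self.s[i] * delta

    def query(self):
        return self.Z ** 2   # E[Z^2] = ||x||_2^2

sketch = AMS(m=8, seed=0)
for i, delta in [(3, 5), (6, 2), (3, -1), (0, 7)]:   # builds x = (7, 0, 0, 4, 0, 0, 2, 0)
    sketch.add(i, delta)
print(sketch.query())   # a noisy single estimate of ||x||_2^2 = 49 + 16 + 4 = 69
```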


Intuition

  • \(Z = \sum_{i=1}^m s_i x_i = \boldsymbol{s} ^{\top} \boldsymbol{x}\) where \(s_i = \pm 1\) with equal probability.

Correctness

  • \(\mathbb{E} [Z^2] =\left\| \boldsymbol{x} \right\| _2 ^2\)

  • Good concentration \(\operatorname{Pr}\left[\left|Z^{2}-\ell_2^2\right| \geq \sqrt{2}c \ell_2^2\right] \le \frac{1}{c^{2}}\)

This bound is too loose to be informative. The overall AMS algorithm keeps multiple estimators \(Z_1, \ldots, Z_k\) and outputs the average of their squared values, \(Z ^\prime = \frac{1}{k} \sum_{i=1}^k Z_i^2\). It is easy to see that

\[ \mathbb{E}\left[Z^{\prime}\right]=\mathbb{E}\left[\frac{\sum_{i} Z_{i}^{2}}{k}\right]=\mathbb{E}\left[Z_{1}^{2}\right]=\|x\|_{2}^{2} \]

and

\[ \operatorname{Var}\left(Z^{\prime}\right)=\operatorname{Var}\left(\frac{\sum_{i} Z_{i}^{2}}{k}\right)=\frac{\sum \operatorname{Var}\left(Z_{i}^{2}\right)}{k^{2}}=\frac{\operatorname{Var}\left(Z_{1}^{2}\right)}{k} \le \frac{2\|x\|_{2}^{4}}{k} \]

Hence,

\[\operatorname{Pr}\left[\left|Z^{\prime}-\ell_{2}^{2}\right| \geq c \sqrt{2/k}\ell_{2}^{2}\right] \leq 1 / c ^{2}\]

If we set \(c=\Theta(1)\) and \(k=\Theta\left(1 / \varepsilon^{2}\right)\), we get a \((1 \pm \epsilon)\) approximation with constant probability.

Note that we need 4-wise independence of \(s_1, \ldots, s_m\) to bound \(\mathbb{E} \left[ \left( \sum_{i=1}^m s_i x_i \right)^4 \right]\), and hence \(\operatorname{Var}(Z^2)\); this requires \(\mathcal{O} (\log m)\) random bits (\(\mathcal{O}(p\log m)\) random bits for \(p\)-wise independence). Besides, there are \(\mathcal{O} (\frac{1}{\epsilon^2})\) estimators \(Z\), each with maximum absolute value \(nm\), so in total we need \(\mathcal{O} (\frac{1}{\epsilon^2} \log(mn))\) bits.

We can formulate the algorithm in vector form. Let \(\boldsymbol{S}\) be a \(k \times m\) matrix where \(s_{ij} = \pm 1\) with equal probability; then we are just maintaining the vector \(\boldsymbol{Z} = \boldsymbol{S} \boldsymbol{x}\) throughout the algorithm and answering a query by outputting \(\left\| \boldsymbol{Z} \right\|_2 ^2 / k\). Here \(\boldsymbol{Z}\) is a linear sketch \(C(\boldsymbol{x})\) of \(\boldsymbol{x}\), as shown in the sketch below.
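
A minimal NumPy sketch of this vector form; the dimensions \(m\), \(k\) and the frequency vector are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 1_000, 400                          # number of coordinates and number of estimators (illustrative)
S = rng.choice([-1.0, 1.0], size=(k, m))   # row i holds the signs of the i-th AMS estimator

x = rng.integers(-5, 6, size=m).astype(float)   # frequency vector built up by the stream
Z = S @ x                                        # the linear sketch maintained by the algorithm

print(np.sum(Z ** 2) / k, np.sum(x ** 2))   # averaged estimate vs. exact ||x||_2^2
```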

AMS++

To get an \((\epsilon, \delta)\) approximation, we run \(t = \mathcal{O} (\log \frac{1}{\delta})\) independent instantiations of AMS and take the median, similar to FM++ and Morris++.

There are \(\mathcal{O} (\log \frac{1}{\delta} \cdot \frac{1}{\epsilon^2} )\) estimators \(Z\) in total, plus \(\mathcal{O} (\log m)\) random bits for the signs \(s\).

Estimate \(x_i\)

(Review \(k\)-wise independent hash function here)
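
As a quick reminder of what such a family looks like, here is a minimal Python sketch of the standard pairwise independent construction \(h(x) = ((ax + b) \bmod p) \bmod k\); the prime `P` and the table size are illustrative choices, and \(k\)-wise independence would instead use a random degree-\((k-1)\) polynomial over the field.

```python
import random

P = 2_147_483_647   # a prime larger than the universe size (here the Mersenne prime 2^31 - 1)

def make_hash(k, seed=None):
    """Sample h(x) = ((a x + b) mod P) mod k from a pairwise independent family."""
    rng = random.Random(seed)
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda x: ((a * x + b) % P) % k

h = make_hash(k=10, seed=0)
print([h(i) for i in range(20)])   # storing h only needs the pair (a, b), i.e. O(log P) bits
```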

Now we want to estimate the frequency \(x_i\) of \(i\). Aka coordinate approximation.

Two algorithms:

  • CountMin: keep track of all coordinates with additive error, i.e., for each coordinate report \(\hat{x}_i\) that is within \(x_{i} \pm \frac{\|x\|_{1}}{k}\)

  • CountSketch: keep track of all coordinates with additive error, i.e., for each coordinate report \(\hat{x}_i\) that is within \(x_{i} \pm \frac{\|x\|_{2}}{\sqrt{k}}\)

For details see here.

Count-min is a linear sketch. Suppose the data vector is \(\boldsymbol{y} \in \mathbb{R} ^n\); then the sketch is \(C(\boldsymbol{y}) = \boldsymbol{M} \boldsymbol{y}\), where \(\boldsymbol{M} \in \mathbb{R} ^{(\log \frac{1}{\delta} \cdot \frac{1}{\epsilon}) \times n}\) is a stack of \(\log \frac{1}{\delta}\) blocks \(\boldsymbol{M}_i\).

Each column of a block \(\boldsymbol{M}_{i} \in \mathbb{R}^{\frac{1}{\epsilon} \times n}\) has a single 1 in a random row, e.g.

\[\begin{split} \left(\begin{array}{l} 1,0,0,0,1,0,0,0 \\ 0,0,1,1,0,0,0,1 \\ 0,1,0,0,0,1,1,0 \end{array}\right)\left(\begin{array}{l} y_{1} \\ y_{2} \\ \cdot \\ \cdot \\ y_{8} \end{array}\right)=\left(\begin{array}{l} y_{1}+y_{5} \\ y_{3}+y_{4}+y_{8} \\ y_{2}+y_{6}+y_{7} \end{array}\right) \end{split}\]
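
A minimal Python sketch of CountMin with the standard point query (take the minimum over the rows, which never underestimates when all updates are non-negative); the width `w`, depth `d`, and the hash family are illustrative choices.

```python
import random

class CountMin:
    """CountMin sketch: d rows of w counters; a point query returns the minimum over the rows."""

    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.P = 2_147_483_647
        self.ab = [(rng.randrange(1, self.P), rng.randrange(self.P)) for _ in range(d)]
        self.table = [[0] * w for _ in range(d)]

    def _h(self, r, i):
        a, b = self.ab[r]
        return ((a * i + b) % self.P) % self.w

    def add(self, i, delta=1):
        for r in range(self.d):
            self.table[r][self._h(r, i)] += delta

    def query(self, i):
        return min(self.table[r][self._h(r, i)] for r in range(self.d))

cm = CountMin(w=20, d=5)             # roughly w ~ 1/eps, d ~ log(1/delta)
for y in [3, 6, 9, 3, 4, 5, 4, 3]:
    cm.add(y)
print(cm.query(3), cm.query(7))      # ~3 and ~0, each within an additive ||x||_1 / w
```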

Count-sketch is a linear sketch

  • Sketch: \(C(\boldsymbol{y})= \boldsymbol{M} \boldsymbol{y}\) where \(\boldsymbol{M} \in \mathbb{R}^{\left(\log \frac{1}{\delta} \cdot \frac{1}{\epsilon^{2}}\right) \times n}\)

  • Each column of a block \(\boldsymbol{M}_{i} \in \mathbb{R}^{\frac{1}{\epsilon^{2}} \times n}\) has a single random \(\pm 1\) in a random row, e.g.

\[\begin{split} \begin{array}{l} \left(\begin{array}{l} 1,0,0,0,-1,0,0,0 \\ 0,0,1,-1,0,0,0,1 \\ 0,-1,0,0,0,1,-1,0 \end{array}\right)\left(\begin{array}{l} y_{1} \\ y_{2} \\ \cdot \\ \cdot \\ \cdot \\ y_{8} \end{array}\right)=\left(\begin{array}{l} y_{1}-y_{5} \\ y_{3}-y_{4}+y_{8} \\ -y_{2}+y_{6}-y_{7} \end{array}\right) \\ \end{array} \end{split}\]
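
And a corresponding minimal Python sketch of CountSketch, where each row pairs a bucket hash with a sign hash and the point query takes the median over rows; the lazily filled dictionaries only simulate shared hash functions and are an illustrative shortcut.

```python
import random, statistics

class CountSketch:
    """CountSketch: each row hashes i to a signed bucket; a point query is the median over rows."""

    def __init__(self, w, d, seed=0):
        self.rng = random.Random(seed)
        self.w, self.d = w, d
        self.bucket = [{} for _ in range(d)]   # simulated bucket hash per row
        self.sign = [{} for _ in range(d)]     # simulated sign hash per row
        self.table = [[0] * w for _ in range(d)]

    def _hs(self, r, i):
        if i not in self.bucket[r]:
            self.bucket[r][i] = self.rng.randrange(self.w)
            self.sign[r][i] = self.rng.choice((-1, 1))
        return self.bucket[r][i], self.sign[r][i]

    def add(self, i, delta=1):
        for r in range(self.d):
            b, s = self._hs(r, i)
            self.table[r][b] += s * delta

    def query(self, i):
        estimates = []
        for r in range(self.d):
            b, s = self._hs(r, i)
            estimates.append(s * self.table[r][b])
        return statistics.median(estimates)

cs = CountSketch(w=20, d=5)
for y in [3, 6, 9, 3, 4, 5, 4, 3]:
    cs.add(y)
print(cs.query(3))   # ~3, within an additive ||x||_2 / sqrt(w)
```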

\(L_p\)-sampler

Suppose we want to sample a coordinate \(i\) with probability (roughly) proportional to \(|x_i|^p\).

\(L_0\)-sampler

Want: sample one of the non-zero coordinates, each with probability \(\frac{1}{\|x\|_{0}} \pm n^{-c}\).

Assume each \(x_i \in [-W, W]\) where \(W = \log(n)\).

For \(0 \le i \le \log n\), let \(S_i \subseteq [n]\) be a random set where each coordinate is included independently w.p. \(\frac{1}{2^i}\). If \(S_i\) contains exactly one non-zero coordinate \(j\), we output \(j\). How do we test whether \(S_i\) has a single non-zero coordinate?

Compute

  • \(A_i = \sum_{j \in S_i} x_j\)

  • \(B_i = \sum_{j \in S_i} j \cdot x_j\)

  • \(C_i = \sum_{j \in S_i} x_j r^j \mod p\) where

    • \(p = \mathcal{O} (\operatorname{poly}(n))\) is a prime number

    • \(r\) is chosen uniformly at random from \(\left\{ 1, 2, \ldots, p-1 \right\}\)

Then we test

  • \(\frac{B_i}{A_i} \in [n]\)

  • \(C_i = A_i \cdot r^{B_i/A_i} \mod p\)

Clearly,

  • If \(S_i\) has one non-zero coordinate, it passes the test

  • else if it has more than one non-zero coordinate, it fails with high probability
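
A minimal Python check of this single-nonzero test on one level \(S_i\); the small prime `p`, the example vector, and the sets are illustrative (a real implementation uses \(p = \operatorname{poly}(n)\) and maintains \(A_i, B_i, C_i\) as linear sketches under the stream updates).

```python
import random

def single_nonzero_test(x, S, p=1_000_003, seed=0):
    """Return the unique nonzero coordinate of x restricted to S, or None if the test fails."""
    r = random.Random(seed).randrange(1, p)
    A = sum(x[j] for j in S)
    B = sum(j * x[j] for j in S)
    C = sum(x[j] * pow(r, j, p) for j in S) % p
    if A == 0 or B % A != 0 or not (0 <= B // A < len(x)):
        return None
    j = B // A                                   # candidate coordinate
    if C == (A * pow(r, j, p)) % p:
        return j                                 # passes: S almost surely has exactly one nonzero
    return None

x = [0, 0, 7, 0, 0, 0, -2, 0]
print(single_nonzero_test(x, S={2, 4, 5}))   # 2, since x_2 is the only nonzero coordinate in S
print(single_nonzero_test(x, S={2, 6}))      # None (two nonzero coordinates) with high probability
```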

How to maintain \(S_i\)? Use \(k\)-wise independent hash function.

Graph Streaming

Consider a graph arrival model: edge arrival or vertex arrival in a stream. It may be insertion-only or dynamic.

For instance, problems include:

  • find a connected component

  • decide: is the graph \(k\)-edge connected?

  • maximum matching

See link.

Other Problems

Set Cover

Subsets come in a stream, cannot store all of them.

Offline: \(\mathcal{O} (\log n)\)-approximation.

  • greedy: store all subsets and repeatedly pick the subset covering the largest number of uncovered elements

  • greedy: \(n\) passes, in each pass select ..

  • \(T = 2^i\)

  • sampling-based

    • Set sampling: un-sampled elements have low degree

    • Element sampling:

Multiplicative Weights Update Method

Random MST

Points over a grid coming in a stream

  • sol: randomly shifted grid

    • embed points into a collection of trees

SVM Cost

Points coming in a stream