Expectation and Variance

Definitions

We quickly review the definitions of expectation, variance and covariance.

\[\begin{split}\begin{align} &&&\text{Population} && \text{Sample} \\ & \text{Mean} & \mu &= \sum_{i=1}^n x_i p(x_i) \text{ or } \int_{\mathcal{X}} x f(x) \mathrm{~d}x & \bar x &= \frac{1}{n}\sum_i x_i \\ & \text{Variance} & \sigma^2 &= \operatorname{\mathbb{E}}\left[ \left( X-\mu \right)^2 \right] & s^2 &= \frac{1}{n}\sum_i(x_i - \bar x)^2\\ & \text{Standard deviation} & \sigma &= \sqrt{\operatorname{\mathbb{E}}\left[ \left( X-\mu \right)^2 \right]} & s &= \sqrt{\frac{1}{n}\sum_i(x_i - \bar x)^2} \\ & \text{Covariance} & \sigma_{X,Y} &= \operatorname{\mathbb{E}}\left[ (X-\mu_X)(Y-\mu_Y) \right] & s_{X,Y} &= \frac{1}{n}\sum_i \left[ (x_i - \bar x)(y_i - \bar y) \right] \end{align}\end{split}\]
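
As a quick numerical companion, here is a minimal NumPy sketch of the sample quantities above (the data are made up; note that the \(1/n\) convention used in the table corresponds to `ddof=0` in NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=1000)      # made-up sample
y = 0.5 * x + rng.normal(size=1000)                # a correlated companion sample

x_bar = x.mean()                                   # sample mean (1/n) * sum(x_i)
s2 = x.var(ddof=0)                                 # sample variance, 1/n convention
s = x.std(ddof=0)                                  # sample standard deviation
s_xy = np.mean((x - x.mean()) * (y - y.mean()))    # sample covariance, 1/n convention

print(x_bar, s2, s, s_xy)
# np.cov divides by n-1 by default; pass ddof=0 to match the formulas above
print(np.cov(x, y, ddof=0)[0, 1])
```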

Also recall the definitions of conditional expectation and conditional variance:

\[\begin{split}\begin{align} \operatorname{\mathbb{E}}(X \mid Y=y) &= \sum_{x} x P(X=x \mid Y=y) \\ &\text{or} \int_{-\infty}^{\infty} x f_{X \mid Y}(x \mid y) \mathrm{~d} x \\ \operatorname{Var}\left( X \mid Y=y \right) &= \operatorname{\mathbb{E}}\left[ (X-\mu_{X\mid Y=y})^{2} \mid Y=y \right] \\ \end{align}\end{split}\]

Notations

  • The notation \(X \mid Y=y\) means that \(Y=y\) is observed. In this case, the conditional expectation (variance) is a function of the observed value \(y\), i.e., \(\operatorname{\mathbb{E}}(X \mid Y=y) = g(y)\), which itself is a constant.

  • The notation \(X \mid Y\) means that \(Y\) is a random variable and has not been observed yet. In this case, the conditional expectation (variance) is a function of the random variable \(Y\), i.e., \(\operatorname{\mathbb{E}}(X \mid Y) = g(Y)\), which itself is a random variable.

More:

  • Kurtosis is a measure of the peakedness of the distribution of a random variable \(X\), defined by

    \[ \frac{\mu_{4}}{\mu_{2}^{2}}=\frac{\mathbb{E} [(X-\mu)^4]}{\mathbb{E} [(X-\mu)^{2}] ^{2}} \quad \in[1, \infty) \]

    The lower bound 1 follows from Jensen's inequality applied to \((X-\mu)^2\). The normal distribution has kurtosis 3.

  • Excess kurtosis is defined as the deviation from the normal kurtosis:

    \[ \frac{\mu_{4}}{\mu_{2}^{2}}-3 \quad \in[-2, \infty) \]
  • \(q\)-quantiles are values that partition a finite set of values into \(q\) subsets of (nearly) equal sizes. There are \(q-1\) of the \(q\)-quantiles, one for each integer \(k\) satisfying \(0 < k <q\).

    • \(x\) is a \(k\)-th \(q\)-quantile for a variable \(X\) if

      \[ \mathbb{P} (X < x)\leq k / q \quad \text{and} \quad \mathbb{P} (X \le x)\geq k / q \]
    • If, instead of using integers \(k\) and \(q\), the “\(\tau\)-quantile” is based on a real number \(\tau\) with \(0 < \tau < 1\), then \(\tau\) replaces \(k/q\) in the above formulas. This broader terminology is used when quantiles are used to parameterize continuous probability distributions.

    • Computing the sample \(\tau\)-quantile can be formulated as an optimization problem (see the sketch after this list)

      \[ \tau\text{-th sample quantile} = \arg\min_\xi \sum_{i=1}^n \rho_\tau (x_i - \xi) \]

      where

      \[\begin{split}\begin{aligned} \rho_\tau(u) &= u(\tau - \mathbb{I} \left\{ u < 0 \right\})\\ &= \mathbb{I} \left(u>0\right) \tau\left|u\right|+\mathbb{I} \left(u\le 0\right)(1-\tau)\left|u\right| \\ \end{aligned}\end{split}\]

      In particular, for \(\tau = 0.5\) we obtain the median \(= \arg\min_\xi \sum_{i=1}^n \left\vert x_i - \xi \right\vert\).
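
As a minimal sketch of the optimization view above (assuming NumPy; the exponential data and the values of \(\tau\) are made up), the check-function loss can be minimized by brute force over the observations, since the minimum of this piecewise-linear convex objective is always attained at a data point:

```python
import numpy as np

def pinball_loss(u, tau):
    """rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (u < 0).astype(float))

def sample_quantile(x, tau):
    """tau-th sample quantile as the minimizer of the total pinball loss."""
    losses = [pinball_loss(x - xi, tau).sum() for xi in x]
    return x[int(np.argmin(losses))]

rng = np.random.default_rng(0)
x = rng.exponential(size=1001)                        # made-up data
print(sample_quantile(x, 0.5), np.median(x))          # tau = 0.5 recovers the median
print(sample_quantile(x, 0.9), np.quantile(x, 0.9))   # close, up to interpolation convention
```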

Identities

Basics

In general, we have

\[\begin{split}\begin{align} \operatorname{\mathbb{E}}\left( aX + bY \right) &= a \operatorname{\mathbb{E}}\left( X \right) + b \operatorname{\mathbb{E}}\left( Y \right) \\ \operatorname{Var}\left( aX + bY \right) &= a^2\operatorname{Var}\left( X \right) + b^2\operatorname{Var}\left( Y \right) + 2ab\operatorname{Cov}\left( X, Y \right) \\ \operatorname{Var}\left( X \right) &= \operatorname{\mathbb{E}}\left( X^2 \right) - \left[ \operatorname{\mathbb{E}}\left( X \right) \right]^2\\ \operatorname{\mathbb{E}}\left( X^2 \right) &= \mu^2 + \sigma^2 \\ \operatorname{Cov}\left( X, X \right) &= \operatorname{Var}\left( X \right) \\ \operatorname{Cov}\left( X,Y \right) &= \operatorname{\mathbb{E}}\left( XY \right) - \operatorname{\mathbb{E}}\left( X \right)\operatorname{\mathbb{E}}\left( Y \right) \\ \operatorname{Cov}\left( X, a \right) &= 0 \\ \operatorname{Cov}\left( X, Y+Z \right) &= \operatorname{Cov}\left( X, Y \right) + \operatorname{Cov}\left( X, Z \right) \\ \operatorname{Cov}\left( aX, bY \right) &= ab \operatorname{Cov}\left( X, Y \right) \end{align}\end{split}\]

If \(X\) and \(Y\) are independent,

\[\begin{split}\begin{align} \operatorname{\mathbb{E}}\left( XY \right) &= \operatorname{\mathbb{E}}\left( X \right)\operatorname{\mathbb{E}}\left( Y \right) \\ \operatorname{Cov}\left( X, Y \right) &= 0 \\ \operatorname{Var}\left( aX + bY \right) &= a^2\operatorname{Var}\left( X \right) + b^2\operatorname{Var}\left( Y \right) \\ \end{align}\end{split}\]
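
A quick simulation check of a few of these identities (a sketch assuming NumPy; the constants \(a, b\) and the distributions are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
X = rng.normal(1.0, 2.0, size=n)
Z = rng.normal(0.0, 1.0, size=n)
Y = 0.3 * X + Z                        # correlated with X
a, b = 2.0, -3.0

cov = lambda u, v: np.mean((u - u.mean()) * (v - v.mean()))

# Var(aX + bY) = a^2 Var(X) + b^2 Var(Y) + 2ab Cov(X, Y)
lhs = np.var(a * X + b * Y)
rhs = a**2 * np.var(X) + b**2 * np.var(Y) + 2 * a * b * cov(X, Y)
print(lhs, rhs)                        # agree up to Monte Carlo error

# Cov(X, Y) = E(XY) - E(X) E(Y)
print(cov(X, Y), np.mean(X * Y) - X.mean() * Y.mean())
```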

Linear Combinations

For \(n\) random variables \(X_1, X_2, \ldots, X_n\), consider a linear combination \(\sum_i^n a_i X_i\). Even though we have no information about the dependence among the \(X_i\)’s, the expectation of the sum equals the sum of the expectations

\[ \operatorname{\mathbb{E}}\left( \sum_i^n a_i X_i \right) = \sum_i^n \operatorname{\mathbb{E}}\left(a_i X_i \right) = \sum_i^n a_i\operatorname{\mathbb{E}}\left( X_i \right) \]

In this sense, expectation is a linear operator.

In particular, for independent and identically distributed \(X_1, X_2, \ldots, X_n\) with common mean \(\operatorname{\mathbb{E}}\left( X_i \right)=\mu\), the expectation of the sample mean is

\[\begin{split}\begin{align} \operatorname{\mathbb{E}}\left( \bar X \right) &= \operatorname{\mathbb{E}}\left( \frac{1}{n}\sum_{i} X_i \right)\\ &= \frac{1}{n}\sum_i \operatorname{\mathbb{E}}\left( X_i \right)\\ &= \mu \\ \end{align}\end{split}\]

In general, the variance of a linear combination is

\[\begin{split} \begin{aligned} \operatorname{Var}\left(\sum_{i=1}^{n} a_{i} X_{i}\right) &= \operatorname{Cov}\left( \sum_i a_i X_i, \sum_i a_i X_i \right)\\ &=\sum_{i, j=1}^{n} \operatorname{Cov}\left(a_{i}X_{i}, a_{j}X_{j}\right) \\ &=\sum_{i=1}^{n} \operatorname{Var}\left(a_{i}X_{i}\right)+\sum_{i \neq j} \operatorname{Cov}\left(a_{i}X_{i}, a_{j}X_{j}\right) \\ &=\sum_{i=1}^{n} a_{i}^{2} \operatorname{Var}\left(X_{i}\right)+2 \sum_{1 \leq i<j \leq n} a_{i} a_{j} \operatorname{Cov}\left(X_{i}, X_{j}\right) \end{aligned} \end{split}\]

Tip

One can imagine an \(n \times n\) covariance table whose \(i,j\)-th entry is \(\operatorname{Cov}\left( a_i X_i, a_j X_j \right)\); the required variance is the sum of all the entries, which consists of

  • the sum of the diagonal entries, \(\sum_i\operatorname{Var}\left(a_{i}X_{i}\right)\)

  • the sum of the off-diagonal entries, \(\sum_{i\ne j}\operatorname{Cov}\left(a_{i}X_{i}, a_{j}X_{j}\right)\)

In particular, the variance of the sample mean of IID variables is

\[\begin{split}\begin{align} \operatorname{Var}\left( \bar X \right) &= \operatorname{Var}\left( \frac{1}{n}\sum_{i} X_i \right)\\ &= \frac{1}{n^2}\sum_i \operatorname{Var}\left( X_i \right)\\ &= \frac{1}{n} \sigma^2 \\ \end{align}\end{split}\]
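
A simulation sketch of \(\operatorname{\mathbb{E}}\left( \bar X \right) = \mu\) and \(\operatorname{Var}\left( \bar X \right) = \sigma^2 / n\) (assuming NumPy; the exponential distribution and the sample sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 50, 200_000
mu, sigma2 = 1.0, 1.0                     # Exp(1) has mean 1 and variance 1

samples = rng.exponential(scale=1.0, size=(reps, n))
xbar = samples.mean(axis=1)               # one sample mean per replication

print(xbar.mean(), mu)                    # ~ mu
print(xbar.var(), sigma2 / n)             # ~ sigma^2 / n
```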

Expectation of Nonnegative Random Variables

For nonnegative random variables, the expectation can be computed from the complementary cumulative distribution function \(1 - F(x) = \operatorname{\mathbb{P}}\left( X > x \right)\).

  • discrete case (\(X\) taking values in \(\{0, 1, 2, \ldots\}\))

\[ \operatorname{\mathbb{E}}\left( X \right)=\sum_{n=0}^{\infty} \operatorname{\mathbb{P}}\left( X>n \right) \]
  • continuous case

\[ \operatorname{\mathbb{E}}\left( X \right) = \int_{0}^{\infty} \operatorname{\mathbb{P}}(X \geq x) \mathrm{~d} x \]
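
A small numerical check of the discrete identity (a sketch assuming NumPy; the geometric distribution is an arbitrary choice, and NumPy's `geometric` counts trials, so it takes values \(1, 2, \ldots\) with mean \(1/p\)):

```python
import numpy as np

rng = np.random.default_rng(3)
p = 0.3
x = rng.geometric(p, size=1_000_000)      # support {1, 2, ...}, mean 1/p

# E(X) = sum_{n >= 0} P(X > n), estimated via the empirical survival function
n_grid = np.arange(0, x.max() + 1)
survival = np.array([(x > n).mean() for n in n_grid])
print(x.mean(), survival.sum(), 1 / p)    # all three should be close
```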

Law of Total Expectation

Also known as the law of iterated expectations, the tower rule, or the smoothing theorem.

Given the conditional expectation \(\operatorname{\mathbb{E}}\left( X \mid Y \right)\), the law of total expectation states that we can obtain the unconditional expectation \(\operatorname{\mathbb{E}}\left( X \right)\) by

\[ \operatorname{\mathbb{E}}\left( X \right)=\operatorname{\mathbb{E}}\left[ \operatorname{\mathbb{E}}(X \mid Y) \right] \]

Note

The inside expectation is taken w.r.t. \(X\) and the outside expectation is taken w.r.t. \(Y\), since the conditional expectation \(\operatorname{\mathbb{E}}\left(X \mid Y \right)\) is a function \(g(Y)\) that depends on the random variable \(Y\). To emphasize this we can write

\[ \operatorname{\mathbb{E}}_X\left( X \right)=\operatorname{\mathbb{E}}_Y\left[ \operatorname{\mathbb{E}}_X(X \mid Y) \right] \]

In general, we can partition the sample space into finitely or countably many sets \(A_i\); then

\[ \operatorname{\mathbb{E}}(X)=\sum_{i} \operatorname{\mathbb{E}}\left(X \mid A_{i}\right) \operatorname{\mathbb{P}}\left(A_{i}\right) \]

For instance, we can compute the expectation as the sum of the conditional expectations over the positive part and the negative part, weighted by the respective probabilities.

\[ \operatorname{\mathbb{E}}(X)=\operatorname{\mathbb{E}}\left(X \mid X>0\right) \operatorname{\mathbb{P}}\left(X>0\right) + \operatorname{\mathbb{E}}\left(X \mid X<0\right) \operatorname{\mathbb{P}}\left(X<0\right) \]
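
A simulation sketch of the tower rule for a simple made-up hierarchical model (assuming NumPy, with \(Y \sim \operatorname{Poisson}(4)\) and \(X \mid Y = y \sim \mathcal{N}(y, 1)\), so that \(\operatorname{\mathbb{E}}(X \mid Y) = Y\)):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000
Y = rng.poisson(4.0, size=n)              # E(Y) = 4
X = rng.normal(loc=Y, scale=1.0)          # X | Y = y ~ N(y, 1), so E(X | Y) = Y

print(X.mean())                           # direct estimate of E(X)
print(Y.mean())                           # E[E(X | Y)] = E(Y), also ~ 4
```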

Law of Total Variance

Also known as the law of iterated variances, the variance decomposition formula, or Eve’s law.

\[ \operatorname{Var}(X)=\operatorname{\mathbb{E}}[\operatorname{Var}(X \mid Y)]+\operatorname{Var}(\operatorname{\mathbb{E}}[X \mid Y]) \]

Note

Here both \(\operatorname{Var}\left( X \mid Y \right)\) and \(\operatorname{\mathbb{E}}\left( X \mid Y \right)\) are random. The outside expectation and variance are taken w.r.t. the conditioned variable, \(Y\).

The first and the second term can be interpreted as the unexplained and the explained components of the variance of \(X\) given knowledge of \(Y\). Imagine that there is a deterministic relation \(X=f(Y)\); then \(\operatorname{Var}\left( X \mid Y \right) = 0\), so the first term is 0, and the second term becomes \(\operatorname{Var}\left( f(Y) \right) = \operatorname{Var}\left( X \right)\).

Warning

From above we see that the identity that holds for expectation

\[ \operatorname{\mathbb{E}}(X)=\sum_{i} \operatorname{\mathbb{E}}\left(X \mid A_{i}\right) \operatorname{\mathbb{P}}\left(A_{i}\right) \]

does not hold for variance

\[ \operatorname{Var}(X) \ne \sum_{i} \operatorname{Var}\left(X \mid A_{i}\right) \operatorname{\mathbb{P}}\left(A_{i}\right) \]

unless \(\operatorname{Var}(\operatorname{\mathbb{E}}[X \mid A]) = 0\), i.e., \(\operatorname{\mathbb{E}}\left( X \mid A_i \right)\) is the same constant for every \(A_i\) (which holds, in particular, when \(X\) is independent of the partition \(\{A_i\}\)).
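
Continuing the same kind of hierarchical example, a sketch that checks the decomposition numerically (assuming NumPy, with \(Y \sim \operatorname{Poisson}(4)\) and \(X \mid Y = y \sim \mathcal{N}(y, 2^2)\), so \(\operatorname{\mathbb{E}}(X \mid Y) = Y\) and \(\operatorname{Var}(X \mid Y) = 4\)):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1_000_000
Y = rng.poisson(4.0, size=n)              # Var(Y) = 4
X = rng.normal(loc=Y, scale=2.0)          # X | Y ~ N(Y, 4)

lhs = X.var()                             # Var(X)
rhs = 4.0 + Y.var()                       # E[Var(X|Y)] + Var(E[X|Y]) = 4 + Var(Y)
print(lhs, rhs)                           # both ~ 8
```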

Inequalities

There are important inequalities that connect probability, expectation and variance. For more inequalities that also relate the sum/average of multiple variables, see here.

Markov’s Inequality

Markov’s inequality upper-bounds the right-tail probability \(\operatorname{\mathbb{P}}\left( X\ge \lambda \right)\) by \(\frac{\mu}{\lambda}\). For a nonnegative random variable \(X\) and \(\lambda>0\),

\[ \operatorname{\mathbb{P}}(X \geq \lambda) \leq \frac{\mu}{\lambda} \]

A more useful form is obtained by substituting \(\lambda \leftarrow \lambda \mu\):

\[ \operatorname{\mathbb{P}}(X \geq \lambda \mu) \leq \frac{1}{\lambda} \]

That is, the probability of exceeding the expectation by a factor of \(\lambda\) or more is at most \(\frac{1}{\lambda}\).
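
A quick empirical sanity check of Markov's inequality (a sketch assuming NumPy; the exponential distribution and the factor \(\lambda\) are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.exponential(scale=2.0, size=1_000_000)   # nonnegative, mean mu = 2
mu, lam = 2.0, 3.0

empirical = (x >= lam * mu).mean()               # P(X >= lambda * mu)
print(empirical, 1 / lam)                        # empirical tail <= 1/lambda
```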

Chebyshev’s Inequality

The probability of deviating from \(\mu\) by \(\lambda \sigma\) or more is at most \(\frac{1}{\lambda^2}\). For any \(\lambda>0\),

\[ \operatorname{\mathbb{P}}(|X-\mu| \geq \lambda \sigma) \leq \frac{1}{\lambda^{2}} \]

For instance, taking \(\lambda=\sqrt{2}\) gives

\[ \operatorname{\mathbb{P}}\left( \mu - \sqrt{2}\sigma < X < \mu + \sqrt{2}\sigma \right) \ge \frac{1}{2} \]

Some variations:

  • Substituting \(\lambda \leftarrow \frac{\epsilon\mu}{\sigma}\) gives

    \[ \operatorname{\mathbb{P}}(|X-\mu| \geq \epsilon \mu ) \leq \frac{\sigma ^2}{(\epsilon \mu)^{2}} \]

    which can be used to construct a \((\epsilon, \delta)\) estimator.

  • Substituting \(\lambda \leftarrow \frac{\lambda}{\sigma}\) gives

    \[ \operatorname{\mathbb{P}}(|X-\mu| \geq \lambda ) \leq \frac{\sigma ^2}{\lambda^{2}} \]

    In general,

    \[ \operatorname{\mathbb{P}}(|X-\mu| \geq \lambda) \leq \frac{\mathbb{E}\left[|X-\mu|^{p}\right]}{\lambda^{p}} \]
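
An empirical sanity check of Chebyshev's inequality for a standard normal variable (a sketch assuming NumPy; since Chebyshev is distribution-free, the bound is typically loose for any particular distribution):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(loc=0.0, scale=1.0, size=1_000_000)   # mu = 0, sigma = 1

for lam in (1.5, 2.0, 3.0):
    empirical = (np.abs(x) >= lam).mean()            # P(|X - mu| >= lambda * sigma)
    print(lam, empirical, 1 / lam**2)                # empirical tail <= 1/lambda^2
```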

Cauchy-Schwarz Inequality in Probability

For two random variables \(X, Y\) we have

\[ \left[ \operatorname{Cov}\left( X,Y \right) \right]^2 \le \operatorname{Var}\left( X \right) \operatorname{Var}\left( Y \right) \]

Equality holds iff there is a deterministic linear relation between \(X\) and \(Y\), i.e., \(Y = aX + b\) almost surely.
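
A numerical illustration (a sketch assuming NumPy; the linear model and noise levels are made up): the inequality is equivalent to \(\lvert\operatorname{corr}(X, Y)\rvert \le 1\), and equality is approached as the relation between \(X\) and \(Y\) becomes exactly linear.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 100_000
X = rng.normal(size=n)
noise = rng.normal(size=n)

for eps in (1.0, 0.1, 0.0):
    Y = 2.0 * X + 1.0 + eps * noise       # exactly linear when eps = 0
    cov2 = np.cov(X, Y, ddof=0)[0, 1] ** 2
    bound = np.var(X) * np.var(Y)
    print(eps, cov2, bound)               # cov^2 <= Var(X) * Var(Y); equal when eps = 0
```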

Exercise

Coin Flips

  • Count trials: What is the expected number of coin flips to get two heads in a row?

  • Count rows: What is the expected number of times we see \(k\) heads in a row, i.e., HH…HH, in \(n\) flips of a coin?

  • Count runs: A coin with probability \(p\) of heads is flipped \(n\) times. A “run” is a maximal sequence of consecutive flips that are all the same. For instance, HTHHHTTH has five runs with \(n=8\). What is the expected number of runs?

Incremental Update of Mean and Variance

Suppose you have \(n\) observations \(x_1, x_2, \ldots, x_n\). Now a new value \(x_{n+1}\) is observed. Write down recurrence relations for \(\bar{x}_n\) and \(s^2_n\), and use them to obtain \(\bar{x}_{n+1}\) and \(s^2_{n+1}\).

Miscellaneous

  1. Two groups of data: in group one the sample standard deviation is \(s_1\), in group two it is \(s_2\), and after merging the two groups it is \(s_3\). Do we always have \(s_3 > \max(s_1, s_2)\)?

  2. Find the distribution of the sum of two independent uniform random variables.

  3. Randomly and independently select two points in \([0, \ell]\), find their expected distance.

    • How about two uniform random points in a compact convex subset of \(\mathbb{R} ^n\), for example an interval, disk, square, or cube? See this paper.

  4. Randomly and independently select \(n\) points from an interval of length \(\ell\), let \(D\) be the minimum distance between any two points, \(D = \min_{i \ne j \in [n]} \left\vert x_i - x_j \right\vert\), and find \(\mathbb{E}\left( D \right)\).

  5. Randomly select three points \(A, B, C\) on a circle. What is the probability that the center \(O\) lies in the triangle \(ABC\)? Extension: what is the probability that four random points on a sphere form a tetrahedron \(ABCD\) that contains \(O\)?