Expectation and Variance¶
Definitions¶
We quickly review the definitions of expectation, variance and covariance.
Also recall the definitions of conditional expectation and conditional variance:
Notations
- The notation \(X \mid Y=y\) means that \(Y=y\) is observed. In this case, the conditional expectation (variance) is a function of the observed value \(y\), i.e., \(\operatorname{\mathbb{E}}(X \mid Y=y) = g(y)\), which itself is a constant. 
- The notation \(X \mid Y\) means that \(Y\) is a random variable and has not been observed yet. In this case, the conditional expectation (variance) is a function of the random variable \(Y\), i.e., \(\operatorname{\mathbb{E}}(X \mid Y) = g(Y)\), which itself is a random variable. 
More:
- Kurtosis is a measure of peakedness of the probability density function of a random variable X, defined by \[ \frac{\mu_{4}}{\mu_{2}^{2}}=\frac{\mathbb{E} [(X-\mu)^4]}{\mathbb{E} [(X-\mu)^{2}] ^{2}} \quad \in[0, \infty) \]- Normal distribution has kurtosis 3. 
- Excess Kurtosis is defined as the deviation from the normal kurtosis: \[ \frac{\mu_{4}}{\mu_{2}^{2}}-3 \quad \in[-3, \infty) \]
- \(q\)-quantiles are values that partition a finite set of values into \(q\) subsets of (nearly) equal sizes. There are \(q-1\) of the \(q\)-quantiles, one for each integer \(k\) satisfying \(0 < k <q\). - \(x\) is a \(k\)-th \(q\)-quantile for a variable \(X\) if \[ \mathbb{P} (X < x)\leq k / q \]
- If, instead of using integers \(k\) and \(q\), the “\(\tau\)-quantile” is based on a real number \(p\) with \(0 < \tau < 1\) then \(\tau\) replaces \(k/q\) in the above formulas. This broader terminology is used when quantiles are used to parameterize continuous probability distributions. 
- computing sample \(\tau\)-quantile can be formulated as an optimization problem \[ \tau\text{-th sample quantile} = \arg\min_\xi \sum_{i=1}^n \rho_\tau (x_i - \xi) \]- where \[\begin{split}\begin{aligned} \rho_\tau(u) &= u(\tau - \mathbb{I} \left\{ u < 0 \right\})\\ &= \mathbb{I} \left(u>0\right) \tau\left|u\right|+\mathbb{I} \left(u\le 0\right)(1-\tau)\left|u\right| \\ \end{aligned}\end{split}\]- In particular, \(\tau = 0.5\) we obtain median = \(\arg\min_\xi \sum_{i=1}^n \left\vert x_i - \xi \right\vert\). 
 
Identities¶
Basics¶
In general, we have
If \(X\) and \(Y\) are independent,
Linear Combinations¶
For \(n\) random variables \(X_1, X_2, \ldots, X_n\), consider a linear combination \(\sum_i^n a_i X_i\). Though we have no information about the dependence between \(X_i\)’s, the expectation of the sum equals to the sum of the expectations
In this sense, expectation is a linear operator.
In particular, for independently and identically distributed \(X_1, X_2, \ldots, X_n\) with common mean \(\operatorname{\mathbb{E}}\left( X_i \right)=\mu\), the expectation of the average value is
In general, the variance of a sum of a linear combination is
Tip
One can imagine that there is a \(n \times n\) covariance table with the \(i,j\)-th entry being \(\operatorname{Cov}\left( a_i X_i, a_j X_j \right)\), and the required variance is the sum of all the entries, which consists of
- the sum of the diagonal entries as \(\sum_i\operatorname{Var}\left(a_{i}X_{i}\right)\) 
- the sum of the off-diagonal entries as \(\sum_{i\ne j}\operatorname{Cov}\left(a_{i}X_{i}, a_{j}X_{j}\right)\) 
In particular, the variance of the average value of the IID sum is
Expectation of Nonnegative Random Variables¶
For nonnegative random variables, the expectation can be computed from the complementary cumulative distribution function \(1 - F(x) = \operatorname{\mathbb{P}}\left( X > x \right)\).
- discrete case 
- continuous case 
Proof by changing the order of summation/integral
- discrete case \[\begin{split}\begin{align} \sum_{n=0}^{\infty} \operatorname{\mathbb{P}}\left( X>n \right) &= \sum_{n=0}^{\infty} \sum_{k=n+1}^{\infty} \operatorname{\mathbb{P}}\left( X=k \right) \\ &= \sum_{k=1}^{\infty} \sum_{n=1}^k \operatorname{\mathbb{P}}\left( X=k \right) \\ &= \sum_{k=1}^{\infty} k \operatorname{\mathbb{P}}\left( X=k \right) \\ &= \sum_{k=0}^{\infty} k \operatorname{\mathbb{P}}\left( X=k \right) \\ &= \operatorname{\mathbb{E}}\left( X \right) \end{align}\end{split}\]
- continuous case \[\begin{split} \begin{aligned} \int_{0}^{\infty} \operatorname{\mathbb{P}}(X \ge x) \mathrm{~d} x &=\int_{0}^{\infty} \int_{x}^{\infty} f_{X}(y) \mathrm{~d} y \mathrm{~d} x \\ &=\int_{0}^{\infty} \int_{0}^{y} f_{X}(y) \mathrm{~d} x \mathrm{~d} y \\ &=\int_{0}^{\infty} f_{X}(y) \int_{0}^{y} 1 \mathrm{~d} x \mathrm{~d} y \\ &=\int_{0}^{\infty} y f_{X}(y) \mathrm{~d} y \\ &=\operatorname{\mathbb{E}}\left( X \right) \end{aligned} \end{split}\]- \(\square\) 
Law of Total Expectation¶
Aka law of iterated expectations, tower rule, smoothing theorem.
Given the conditional expectation \(\operatorname{\mathbb{E}}\left( X \mid Y \right)\), the law of total expectation states that we can obtained the unconditional expectation \(\operatorname{\mathbb{E}}\left( X \right)\) by
Note
The inside expectation is taken w.r.t. \(X\) and the outside expectation is taken w.r.t. \(Y\), since the conditional expectation \(\operatorname{\mathbb{E}}\left(X \mid Y \right)\) is a function \(g(Y)\) that depends on the random variables \(Y\). To emphasize this we can write
In general, we can partition the sample space into finite or countably infinite sets \(A_i\), then
For instance, we can compute the expectation as a weighted sum of the expectation of the positive part and the expectation of the negative part on respective probabilities.
Proof
By definition
\(\square\)
Law of Total Variance¶
Aka law of iterated variances, variance decomposition formula, Eve’s law.
Note
Here both \(\operatorname{Var}\left( X \mid Y \right)\) and \(\operatorname{\mathbb{E}}\left( X \mid Y \right)\) are random. The outside expectation and variance are taken w.r.t. the conditioned variable, \(Y\).
The first and the second term can be interpreted as the unexplained and the explained components of the variance of \(X\) by knowing \(Y\). Imagine that there is a deterministic relation \(X=f(Y)\), then \(\operatorname{Var}\left( X \mid Y \right) = 0\) so that the first term is 0, and the second term becomes \(\operatorname{Var}\left( f(Y) \right) = \operatorname{Var}\left( X \right)\).
Proof
Note that the relation \(\operatorname{Var}\left( X \right) = \operatorname{\mathbb{E}}\left( X^2 \right) - \left[ \operatorname{\mathbb{E}}\left( X \right) \right]^2\) holds in a similar fashion when conditioning on \(Y\)
By the law of total expectation, we can compute \(\operatorname{\mathbb{E}}\left( X^2 \right)\) by
and compute \(\operatorname{\mathbb{E}}\left( X \right)\) by \(\operatorname{\mathbb{E}}\left[ \operatorname{\mathbb{E}}\left( X\mid Y \right) \right]\). Hence,
\(\square\)
Warning
From above we see that the identity that holds for expectation
does not hold for variance
unless \(\operatorname{Var}(\operatorname{\mathbb{E}}[X \mid A]) = 0\), which implies that \(\operatorname{\mathbb{E}}\left( X \mid A \right) = \text{constant}\), i.e., \(X\) and the partitioning \(A_i\) are independent.
Inequalities¶
There are important inequalities that connect probability, expectation and variance. For more inequalities that also relates the sum/average of multiple variables, see here.
Markov’s Inequality¶
Markov’s inequality upper bounds right-tail probability \(\operatorname{\mathbb{P}}\left( X\ge \lambda \right)\) by \(\frac{\mu}{\lambda}\). For a nonnegative random variable \(X\) and \(\lambda>0\),
A more useful form is to substitute \(\lambda \leftarrow \lambda \mu\), then
That is, the probability of exceeding expectation by more than a factor of \(\lambda\) is at most \(\frac{1}{\lambda}\).
Proof
- By the law of total expectation, and since \(\operatorname{\mathbb{E}}(X \mid X<\lambda)\ge 0\) and \(\operatorname{\mathbb{E}}(X \mid X \geq \lambda)\ge \lambda\), we have \[\begin{split}\begin{aligned} \operatorname{\mathbb{E}}(X) & = \operatorname{\mathbb{E}}(X \mid X<\lambda) \cdot \operatorname{\mathbb{P}}(X<\lambda) + \operatorname{\mathbb{E}}(X \mid X \geq \lambda) \cdot \operatorname{\mathbb{P}}(X \geq \lambda)\\ & \ge 0 \cdot \operatorname{\mathbb{P}}(X<\lambda) +\operatorname{\mathbb{E}}(X \mid X \geq \lambda) \cdot \operatorname{\mathbb{P}}(X \geq \lambda) \\ & \geq \lambda \cdot \operatorname{\mathbb{P}}(X \geq \lambda) \end{aligned}\end{split}\]- \(\square\) 
- By the definition of expectation, \[\begin{split}\begin{aligned} \operatorname{\mathbb{E}}(X) &= \int_{0}^{\lambda} x f(x) \mathrm{~d} x+\int_{\lambda}^{\infty} x f(x) \mathrm{~d} x \\ & \geq \int_{\lambda}^{\infty} x f(x) \mathrm{~d} x \\ & \geq \int_{\lambda}^{\infty} \lambda f(x) \mathrm{~d} x \\ & =\lambda \int_{\lambda}^{\infty} f(x) \mathrm{~d} x \\ &=\lambda \operatorname{\mathbb{P}}(X \geq \lambda) \end{aligned}\end{split}\]- \(\square\) 
Chebyshev’s Inequality¶
The probability of deviating from \(\mu\) by more than \(\lambda \sigma\) is at most \(\frac{1}{\lambda^2}\). For any \(\lambda>0\),
For instance, taking \(\lambda=\sqrt{2}\) gives
Proof by the law of total expectation
\(\square\)
Some variation:
- Substituting \(\lambda \leftarrow \frac{\epsilon\mu}{\sigma}\) gives \[ \operatorname{\mathbb{P}}(|X-\mu| \geq \epsilon \mu ) \leq \frac{\sigma ^2}{(\epsilon \mu)^{2}} \]- which can be used to construct a \((\epsilon, \delta)\) estimator. 
- Substituting \(\lambda \leftarrow \frac{\lambda}{\sigma}\) gives \[ \operatorname{\mathbb{P}}(|X-\mu| \geq \lambda ) \leq \frac{\sigma ^2}{\lambda^{2}} \]- In general, \[ \operatorname{\mathbb{P}}(|\mathrm{X}-\mu|>\lambda) \leq \frac{\mathbb{E}\left[(X-\mu)^{p}\right]}{\lambda^{p}} \]
Cauchy-Schewarz Inequality in Probability¶
For two random variables \(X, Y\) we have
The equality holds iff there is a deterministic linear relation between \(X\) and \(Y\), \(Y = aX + b\).
Proof
Recall the Cauchy-Schewarz inequality for vectors \(\boldsymbol{u}, \boldsymbol{v}\) of an innver product space,
Define an inner product on the set of random variables using the expectation of their product
Then the Cauchy-Schewrz inequality becomes
Substituting \(X\) by \(X-\mu_X\) and \(Y\) by \(Y-\mu_Y\) gives
\(\square\)
Exercise¶
Coin Flips¶
- Count trials: What is the expected number of coin flips to get two heads in a row? - Solution 1: Law of Total Expectation - Denote the required number of flips by \(X\). We can partition the sample space into three parts: - \(A_T\): the first flip is a tail 
- \(A_{HT}\): the first two flips are head, tail 
- \(A_{HH}\): the first two flips are head, head 
 - It’s easy to see \[ \operatorname{\mathbb{P}}\left( A_T \right) = \frac{1}{2}, \operatorname{\mathbb{P}}\left( A_{HT} \right) = \operatorname{\mathbb{P}}\left( A_{HH} \right) = \frac{1}{4} \]- But what are \(\operatorname{\mathbb{E}}\left( X \mid A_T \right), \operatorname{\mathbb{E}}\left( X \mid A_{HT} \right), \operatorname{\mathbb{E}}\left( X \mid A_{HH} \right)\)? - If the first flip is T, then we start over, and waste 1 flip 
- If the first two flips are HT, then we start over, and waste 2 flips 
- If the first two flips are HH, then done! We use 2 flips 
 - As a result, we have - \(\operatorname{\mathbb{E}}\left( X \mid A_T \right) = \operatorname{\mathbb{E}}\left( X \right) + 1\) 
- \(\operatorname{\mathbb{E}}\left( X \mid A_{HT} \right) = \operatorname{\mathbb{E}}\left( X \right)+ 2\) 
- \(\operatorname{\mathbb{E}}\left( X \mid A_{HH} \right) = 2\) 
 - Then by the law of total expectation \[\begin{split} \begin{align} \operatorname{\mathbb{E}}\left( X \right) &= \operatorname{\mathbb{E}}\left( X \mid A_T \right) \operatorname{\mathbb{P}}\left( A_T \right) + \operatorname{\mathbb{E}}\left( X \mid A_{HT} \right) \operatorname{\mathbb{P}}\left( A_{HT} \right) + \operatorname{\mathbb{E}}\left( X \mid A_{HH} \right) \operatorname{\mathbb{P}}\left( A_{HH} \right) \\ &= \left[ \operatorname{\mathbb{E}}\left( X \right) + 1 \right]\cdot \frac{1}{2} + \left[ \operatorname{\mathbb{E}}\left( X \right) + 2\right] \cdot \frac{1}{4} + 2 \cdot \frac{1}{4} \\ \end{align} \end{split}\]- Solving the equation gives \(\operatorname{\mathbb{E}}\left( X \right) = 6\) - Note - One may also partition the sample space to two parts \({A_H}\) and \(A_T\), but to compute \(\operatorname{\mathbb{E}}\left( X \mid A_H \right)\), it requires to partition \(A_H\) into \(A_{HT}\) and \(A_{HH}\), and then use the law of total expectation again, which is complicated and easy to make mistakes. So it would be better to partition \(A\) to three parts at the beginning. - In general, what is the expected number of coin flips to get \(n\) heads in a row? In fact, we just need to continue to partition \(A_{HH}\) into \(A_{HHT}\) and \(A_{HHH}\), and so on. By the law of total expectation the equation becomes \[\begin{split} \operatorname{\mathbb{E}}\left( X_n \right) = \left[ \operatorname{\mathbb{E}}\left( X_n \right) + 1 \right]\cdot \frac{1}{2} + \left[ \operatorname{\mathbb{E}}\left( X_n \right) + 2\right] \cdot \frac{1}{4} + \ldots + \left[ \operatorname{\mathbb{E}}\left( X_n \right) + n\right] \cdot \frac{1}{2^n} + n \cdot \frac{1}{2^n} \\ \end{split}\]- The solution is \[ \operatorname{\mathbb{E}}\left( X_n \right) = 2 \left( 2^n-1 \right) \]- Solution 2: Recurrence Relation - One can also derive the solution from a recurrence relation between \(\operatorname{\mathbb{E}}\left( X_n \right)\) and \(\operatorname{\mathbb{E}}\left( X_{n-1} \right)\). - Let \(Y_{n} = X_n - X_{n-1}\) be the number of additional flips required to get \(n\) heads in a row, given that we already got \(n-1\) heads in a row. Then by the law of total expectation, \[\begin{split} \begin{align} \operatorname{\mathbb{E}}\left( Y_{n} \right) &= \operatorname{\mathbb{E}}\left( Y_n \mid \text{the $n$-th flip is H} \right) \operatorname{\mathbb{P}}\left( \text{the $n$-th flip is H} \right) \\ &\ + \operatorname{\mathbb{E}}\left( Y_n \mid \text{the $n$-th flip is T} \right) \operatorname{\mathbb{P}}\left( \text{the $n$-th flip is T} \right) \\ &= 1 \cdot \frac{1}{2} + \left[ 1 + \operatorname{\mathbb{E}}\left( X_n \right) \right] \cdot \frac{1}{2} \end{align} \end{split}\]- Hence, we have the recurrence relation \[ \operatorname{\mathbb{E}}\left( X_n \right) = 2 \operatorname{\mathbb{E}}\left( X_{n-1} \right) + 2 \]- Let \(f(n) = \operatorname{\mathbb{E}}\left( X_n\right) + 2\) then we have \(f(n) = 2f(n-1)\). Since \(f(1) = \operatorname{\mathbb{E}}\left( X_1 \right)+2 = 4\), we have \(f(n) = 2^{n+1}\). Therefore, \[\operatorname{\mathbb{E}}\left( X_n \right) = 2^{n+1}-2\]
- Count rows: what is the expected number of times to see \(k\) heads in a row, i.e., HH…HH, in \(n\) flips of a coin? - Solution - In \(n\) flips of a coin, there are \(n-k+1\) places where the string HH…HH can start to appear, each with a (non-independent) probability \(\frac{1}{2^k} \) of happening. Let \(X\) be the number of times to see the string HH…HH, and \(X_i\) be the indicator variable that is \(1\) if the string starts to appear at the \(i\)-th flip, then \[ X = \sum_{i=1}^{n-k+1} X_i \]- and hence \[\begin{split}\begin{align} \operatorname{\mathbb{E}}\left( X \right) &= \operatorname{\mathbb{E}}\left( \sum_{i=1}^{n-k+1} X_i \right)\\ &= \sum_{i=1}^{n-k+1} \operatorname{\mathbb{E}}\left( X_i \right)\\ &= \frac{n-k+1}{2^k} \\ \end{align}\end{split}\]- The first second last line holds even if \(X_i\)’s are not independent. 
- Count runs: a coin with a probability \(p\) to get a head is flipped \(n\) times. A “run” is a maximal sequence of consecutive flips that are all the same. For instance, HTHHHTTH has five runs and \(n=8\). What is the expected number of runs? - Solution - Let \(X_i\) be the indicator for the event that a run starts at the \(i-th\) toss. Let \(X = \sum_i X_i\) be the total number of runs. It is easy to see \(\operatorname{\mathbb{E}}\left( X_1 \right) = 1\). For \(i>1\), \[\begin{split} \begin{aligned} \operatorname{\mathbb{E}}\left(X_{i}\right)=& \operatorname{\mathbb{P}}\left(X_{i}=1\right) \\ =& \operatorname{\mathbb{P}}\left(i \text { -th toss is } \mathrm{H} \mid(i-1) \text { -th toss is } \mathrm{T}\right) \times \operatorname{\mathbb{P}}\left((i-1) \text { -th toss is } \mathrm{T}\right) \\ &+\operatorname{\mathbb{P}}\left(i \text { -th toss is } \mathrm{T} \mid(i-1)\text {-th} \text { toss is } \mathrm{H}\right) \times \operatorname{\mathbb{P}}\left((i-1)\text {-th } \text { toss is } \mathrm{H}\right) \\ =& p(1-p)+(1-p) p \\ =& 2 p(1-p) \end{aligned} \end{split}\]- Therefore, \[\begin{split} \begin{aligned} \operatorname{\mathbb{E}}(X) &=\operatorname{\mathbb{E}}\left(X_{1}+X_{2}+\cdots+X_{n}\right) \\ &=\operatorname{\mathbb{E}}\left(X_{1}\right)+\operatorname{\mathbb{E}}\left(X_{2}\right)+\cdots+\operatorname{\mathbb{E}}\left(X_{n}\right) \\ &=\operatorname{\mathbb{E}}\left(X_{1}\right)+\left[\operatorname{\mathbb{E}}\left(X_{2}\right)+\cdots+\operatorname{\mathbb{E}}\left(X_{n}\right)\right] \\ &=1+(n-1) \times 2 p(1-p) \\ &=1+2(n-1) p(1-p) \end{aligned} \end{split}\]
Incremental Update of Mean and Variance¶
Suppose you have \(n\) observations \(x_1, x_2, \ldots, x_n\). Now a new value \(x_{n+1}\) is observed. Write down recurrence functions for \(\bar{x}_n\) and \(s^2_n\), and use them to obtain \(\bar{x}_{n+1}\) and \(s^2_{n+1}\).
Solution
To update mean,
The last line is to avoid computing a large number \(n \bar{x}_n\).
The second last line implies that the new sample mean \(\bar{x}_{n+1}\) is a weighted average of the current sample mean \(\bar{x}_{n+1}\) and the new observed value \(x_{n+1}\).
To update variance, we first use the above method to obtain \(\bar{x}_{n+1}\), and let \(S_n = ns_{n}^2\), then
Finally \(s_{n+1}^2 = \frac{1}{n+1} S_{n+1}\).
Tip
The substitution \(S_n = ns_n^2\) avoids the computation that involves \(\frac{1}{n} \) and \(\frac{1}{n+1} \). And the update equation of the mean is also quite useful.
Miscellaneous¶
- Two groups of data. In group one, sample standard deviation is \(s_1\), in group two it is \(s_2\). After merging them, it is \(s_3\). Do we always have \(s_3 > \max(s_1, s_2)\)? - Solution - Let \(\lambda = \frac{n_1}{n_1 + n_2} \in (0,1)\), then \(\bar{x}_3 = \lambda \bar{x}_1 + (1-\lambda)\bar{x}_2\). WLOG assume \(\bar{x}_1 - \bar{x}_2 = d \ge 0\). \[\begin{split}\begin{aligned} s^2 _1 &= \frac{TSS_1}{n_1} \\ s^2 _2 &= \frac{TSS_2}{n_2} \\ s^2 _3 &= \frac{TSS_3}{n_3}\\ s^2 _3 &= \frac{TSS_1 + TTS_2 + n_1 (\bar{x}_1 - \bar{x}_3)^2 + n_2 (\bar{x}_3 - \bar{x}_2)^2 }{n_1 + n_2}\\ &= \lambda s^2 _1 + (1-\lambda) s^2 _2 + \lambda ((1-\lambda)d )^2 + (1-\lambda) (\lambda d)^2 \\ &= \underbrace{\lambda s^2 _1 + (1-\lambda) s^2 _2}_{a} + \underbrace{\lambda(1-\lambda)d^2}_{b} \\ \end{aligned}\end{split}\]- where - \(\lambda = \frac{n_1}{n_1 + n_2} \in (0,1)\) 
- \(d_1 = \bar{x}_1 - \bar{x}_3 = \bar{x}_1 - (\lambda \bar{x}_1 + (1-\lambda)\bar{x}_2) = (1-\lambda)(\bar{x}_1 - \bar{x}_2) = (1-\lambda)c \in [0, c)\) 
- \(d_2 = \bar{x}_3 - \bar{x}_2 = \lambda \bar{x}_1 + (1-\lambda)\bar{x}_2 - \bar{x}_2 = \lambda(\bar{x}_1 - \bar{x}_2) = \lambda c \in [0, c)\) 
- \(d_1 + d_2 = c\) 
 - We have - \(\min(s^2_1, s^2_2) \le a \le \max(s^2_1, s^2_2)\) with equalities iff \(s_2^2 = s_2^2\). 
- \(0 \le b < d^2\) with equality iff \(d=0\). 
 - Since \(a\) and \(b\) are independent, we can know for sure that \(\min(s^2_1, s^2_2) \le s_3^2\). The other comparison \(\max(s^2_1, s^2_2) \text{ vs } s_3^2\) is uncertain, depending on \(d\). 
- Find the distribution of the sum of two independent uniform random variable. - Proof - Let \(X, Y \overset{\text{iid}}{\sim} \mathcal{U} (0,1)\) and their sum be \(Z = X + Y\). Then \[\begin{split}\begin{aligned} f_{Z}(z) &=\int_{-\infty}^{\infty} f_{X}(x) f_{Y}(z-x) \mathrm{~d} x \\ &=\int_{-\infty}^{\infty} \boldsymbol{1} _{x \in (0,1)} \boldsymbol{1} _{z-x \in (0,1)} \mathrm{~d} x \\ &=\int_{0}^{1} \boldsymbol{1} _{z-x \in (0,1) } \mathrm{~d} x \\ &= \left\{\begin{array}{ll} \int_{0}^{z} 1 \mathrm{~d} x = z, & \text { if } 0<z<1 \\ \int_{z-1}^{1} 1 \mathrm{~d} x = 2 - z, & \text { if } 1\le z<2 \\ \end{array}\right.\\ \end{aligned}\end{split}\]- Hence, \(Z\) follows a triangular distribution with lower limit \(0\), upper limit \(2\), and mode \(1\). That is, it’s more likely to see \(Z\) around \(1\), which equals the sum of two expected values. In real life, the sum of two dices is probably around 3.5. 
- Randomly and independently select two points in \([0, \ell]\), find their expected distance. - Solution 1: integration - \ellet \(X_1, X_2 \overset{\text{iid}}{\sim} \mathcal{U} [0, \ell]\), their joint density function is \[\begin{split} f_{X_{1} X_{2}}\left(x_{1}, x_{2}\right)=f_{X_{1}}\left(x_{1}\right) f_{X_{2}}\left(x_{2}\right)=\left\{\begin{array}{ll} \frac{1}{\ell^2} & \text { if } \quad x_1, x_2 \in[0, \ell] \\ 0 & \text { otherwise } \end{array}\right. \end{split}\]- Define the distance as \[\begin{split} g\left(x_{1}, x_{2}\right)=\left|x_{1}-x_{2}\right|=\left\{\begin{array}{lll} x_{1}-x_{2} & \text { if } x_{1} \geq x_{2} \\ x_{2}-x_{1} & \text { otherwise } \end{array}\right. \end{split}\]- Hence the expectation is \[\begin{split} \begin{aligned} \mathbb{E}(g(X_1, X_2)) &=\int_{0}^{\ell} \int_{0}^{\ell} g\left(x_{1}, x_{2}\right) f_{X_{1} X_{2}}\left(x_{1}, x_{2}\right) \mathrm{~d} x_{1} \mathrm{~d} x_{2} \\ &=\frac{1}{\ell^{2}} \int_{0}^{\ell} \int_{0}^{\ell}\left|x_{1}-x_{2}\right| \mathrm{~d} x_{1} \mathrm{~d} x_{2} \\ &=\frac{1}{\ell^{2}} \int_{0}^{\ell} \int_{0}^{x_{1}}\left(x_{1}-x_{2}\right) \mathrm{~d} x_{2} \mathrm{~d} x_{1}+\frac{1}{\ell^{2}} \int_{0}^{\ell} \int_{x_{1}}^{\ell}\left(x_{2}-x_{1}\right) \mathrm{~d} x_2 \mathrm{~d} x_1 \\ &=\frac{\ell^3}{6\ell^2} + \frac{\ell^3}{6\ell^2} \\ &=\frac{\ell}{3} \end{aligned} \end{split}\]- Solution 2: random procedure - If we randomly select two points, then we cut the interval of length \(\ell\) into 3 segments, of length \(D_1, D_2, D_3\) respectively. They should be “exchangeable”, so \(\mathbb{E}\left( D_1 \right) = \mathbb{E}\left( D_2 \right) = \mathbb{E}\left( D_3 \right)\). Since \(\mathbb{E}\left( D_1 + D_2 + D_3 \right) = \ell\), we have \(\mathbb{E}\left( D_2 \right) = \frac{\ell}{3}\). - Formally, \((D_1, D_2, D_3) = (\min (X, Y), \max (X, Y)-\min (X, Y), \ell-\max (X, Y))\). One can show that \((D_1, D_2, D_3)\) is an exchangeable sequence, i.e., whose joint probability distribution does not change when the positions in the sequence in which finitely many of them appear are altered. - How about two uniform random points in a compact convex subset in \(\mathbb{R} ^n\)? For example, interval, disk, square, cube? See this paper. 
 
- Randomly and independent select \(n\) points from an interval of length \(\ell\), let \(D\) be the minimum distance between two points: \(d = \min_{i \ne j \in [n]} \left\vert x_i - x_j \right\vert\), find \(\mathbb{E}\left( D \right)\). - Solution - Consider two events: - the minimum distance between two points equals \(d\) 
- no points are in the left section of length \((n-1)d\) 
 - The two events have the same probability: imagine that we take \(d\) from each segment \([x_{i}, x_{i+1}]\), and aggregate all \((n-1)\) of them to the left. Hence, \(\mathbb{P}\left( D = d \right) = \left( 1 - \frac{(n-1)d}{\ell} \right)^n\). Integration over \(0 \le d \le \frac{\ell}{n-1}\) gives \(\frac{\ell}{n^2 - 1}\). - Note that if \(n=2\) then this problem reduces to the previous one. 
- Randomly select three points \(A, B, C\) from a circle, what’s the probability that the center \(O\) is in the triangle \(ABC\)? Extension: probability of four random points on a sphere such that the tetrahedron \(ABCD\) contains \(O\)? - Solution: change of random procedure - Here are some hints. For details, see the great youtube video. - Suppose we have already select two points \(A, B\). What’s the range of \(C\) such that \(ABC\) covers \(O\)? 
- Two random points can generated as follows: generate two random diameter lines \(\ell_1, \ell_2\) that pass center \(O\), then randomly choose one endpoint of \(\ell_1\) to be \(B\), and one endpoint of \(\ell_2\) to be \(C\), their are 4 combinations. Now, suppose we already have \(A\), and use this random procedure to generate \(B\) and \(C\), which of the 4 combinations of \((B,C)\) gives a required triangle \(ABC\)?