Multivariate Notations¶
In machine learning models, we often deal with more than one variable at a time. Below are the notations for the multivariate case and their properties.
Data Matrix¶
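Suppose there are \(p\) random variables \(X_1, X_2, \ldots, X_p\) and we have \(n\) observed values for each of them. The data matrix is
\[
\boldsymbol{X}_{n \times p}
= \begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np}
\end{bmatrix}
= \begin{bmatrix}
\boldsymbol{x}_1^{\top} \\
\boldsymbol{x}_2^{\top} \\
\vdots \\
\boldsymbol{x}_n^{\top}
\end{bmatrix}
\]
where
- Column \(j\) contains the \(n\) observations of variable \(j\).
- Row \(i\) is an observed vector \(\boldsymbol{x}_i\) (written as the row \(\boldsymbol{x}_i^{\top}\)).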
Mean Vector¶
Population Mean Vector¶
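The mean vector of a random vector \(\boldsymbol{x}\) is defined as
\[
\boldsymbol{\mu} = \operatorname{\mathbb{E}}\left( \boldsymbol{x} \right)
= \begin{bmatrix}
\operatorname{\mathbb{E}}\left( x_1 \right) \\
\vdots \\
\operatorname{\mathbb{E}}\left( x_p \right)
\end{bmatrix}
= \begin{bmatrix}
\mu_1 \\
\vdots \\
\mu_p
\end{bmatrix}
\]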
Properties
- \(\operatorname{\mathbb{E}}\left( \boldsymbol{a}^{\boldsymbol{\top}} \boldsymbol{x} \right)=\boldsymbol{a}^{\boldsymbol{\top}} \boldsymbol{\mu}\) 
- \(\operatorname{\mathbb{E}}\left( \boldsymbol{A x} \right)=\boldsymbol{A} \boldsymbol{\mu}\) 
Sample Mean Vector¶
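Let \(\bar{x}_j\) be the sample mean of variable \(X_j\). The sample mean vector is
\[
\bar{\boldsymbol{x}}
= \begin{bmatrix}
\bar{x}_1 \\
\vdots \\
\bar{x}_p
\end{bmatrix}
= \frac{1}{n} \sum_{i=1}^{n} \boldsymbol{x}_i
= \frac{1}{n} \boldsymbol{X}^{\top} \boldsymbol{1}_n
\]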
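As a small illustration of the data matrix layout and the sample mean vector, the sketch below uses NumPy with a made-up \(4 \times 3\) data matrix. The array `X` and its values are hypothetical and only serve to show that averaging the rows, i.e. computing \(\frac{1}{n} \boldsymbol{X}^{\top} \boldsymbol{1}_n\), agrees with the column-wise mean.

```python
import numpy as np

# Hypothetical toy data: n = 4 observations (rows) of p = 3 variables (columns).
X = np.array([
    [1.0, 2.0, 0.5],
    [2.0, 0.0, 1.5],
    [3.0, 4.0, 2.5],
    [4.0, 2.0, 3.5],
])
n, p = X.shape

x_2 = X[1]        # row i = 2: one observed vector (all p variables)
col_1 = X[:, 0]   # column j = 1: all n observations of variable 1

# Sample mean vector: average of the rows, i.e. (1/n) X^T 1_n.
x_bar = X.T @ np.ones(n) / n

# Same result from NumPy's column-wise mean.
assert np.allclose(x_bar, X.mean(axis=0))
print(x_bar)  # sample means of the three variables: 2.5, 2.0, 2.0
```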
Covariance Matrix¶
Population Covariance Matrix¶
Also known as the variance-covariance matrix.
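The covariance matrix of a random vector \(\boldsymbol{x}\) summarizes the pairwise covariances,
\[
\boldsymbol{\Sigma} = \operatorname{Var}\left( \boldsymbol{x} \right)
= \begin{bmatrix}
\sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\
\sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp}
\end{bmatrix}
\]
where
- \(\sigma_{ii} = \operatorname{Cov}\left( x_i, x_i \right) = \operatorname{Var}\left( x_i \right)\)
- \(\sigma_{ij} = \operatorname{Cov}\left( x_i, x_j \right) = \sigma_{ji}\)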
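In matrix form,
\[
\boldsymbol{\Sigma}
= \mathbb{E} \left[ (\boldsymbol{x} - \boldsymbol{\mu})(\boldsymbol{x} - \boldsymbol{\mu})^{\top} \right]
= \mathbb{E} \left[ \boldsymbol{x} \boldsymbol{x}^{\top} \right] - \boldsymbol{\mu} \boldsymbol{\mu}^{\top}
\]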
which is a multivariate extension of \(\mathbb{V} [X] = \mathbb{E} [X^2] - \mathbb{E} [X] ^2\).
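One way to verify the identity \(\boldsymbol{\Sigma} = \mathbb{E} [\boldsymbol{x} \boldsymbol{x}^{\top}] - \boldsymbol{\mu} \boldsymbol{\mu}^{\top}\) is to expand the outer product and use linearity of expectation together with \(\mathbb{E}[\boldsymbol{x}] = \boldsymbol{\mu}\). A short sketch of the steps:

\[
\begin{aligned}
\mathbb{E} \left[ (\boldsymbol{x} - \boldsymbol{\mu})(\boldsymbol{x} - \boldsymbol{\mu})^{\top} \right]
&= \mathbb{E} \left[ \boldsymbol{x} \boldsymbol{x}^{\top} - \boldsymbol{x} \boldsymbol{\mu}^{\top} - \boldsymbol{\mu} \boldsymbol{x}^{\top} + \boldsymbol{\mu} \boldsymbol{\mu}^{\top} \right] \\
&= \mathbb{E} \left[ \boldsymbol{x} \boldsymbol{x}^{\top} \right] - \mathbb{E}[\boldsymbol{x}] \boldsymbol{\mu}^{\top} - \boldsymbol{\mu} \mathbb{E}[\boldsymbol{x}]^{\top} + \boldsymbol{\mu} \boldsymbol{\mu}^{\top} \\
&= \mathbb{E} \left[ \boldsymbol{x} \boldsymbol{x}^{\top} \right] - \boldsymbol{\mu} \boldsymbol{\mu}^{\top}
\end{aligned}
\]

With \(p = 1\) this reduces to the familiar scalar formula.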
Properties
- \(\boldsymbol{\Sigma}\) is symmetric and positive semi-definite. It is positive definite, and hence \(\boldsymbol{\Sigma} ^{-1}\) exists, unless \(x_1, x_2, \ldots, x_p\) are linearly related.
  - In that case we say that \(\boldsymbol{x}\) is a degenerate random vector, i.e. its effective dimension is less than \(p\); in other words, its joint distribution is concentrated in a subspace of lower dimension.
- Transformation
  - \(\operatorname{Var}\left( \boldsymbol{a}^\top \boldsymbol{x} + b \right) = \operatorname{Var}\left( \boldsymbol{a} ^\top \boldsymbol{x} \right) = \boldsymbol{a}^\top \boldsymbol{\Sigma} \boldsymbol{a} \ge 0\)
    - Equality holds iff \(\boldsymbol{a} ^\top \boldsymbol{x} = c\) for some constant \(c\) (with probability 1).
  - \(\operatorname{Var}\left( \boldsymbol{A} \boldsymbol{x} + \boldsymbol{b} \right) = \boldsymbol{A} \boldsymbol{\Sigma} \boldsymbol{A} ^\top\)
 
- Expectation of quadratic form: \(\mathbb{E} [\boldsymbol{x} ^{\top} \boldsymbol{A} \boldsymbol{x} ] = \boldsymbol{\mu} ^{\top} \boldsymbol{A} \boldsymbol{\mu} + \operatorname{tr}(\boldsymbol{A} \boldsymbol{\Sigma} )\). 
- \(\sum_{j=1}^p \lambda_j = \sum_{j=1}^p \sigma_{jj} = \operatorname{tr}(\boldsymbol{\Sigma})\): the sum of the eigenvalues of \(\boldsymbol{\Sigma}\) equals the sum of the variances.
- The determinant of the covariance matrix \(\left\vert \boldsymbol{\Sigma} \right\vert = \operatorname{det} (\boldsymbol{\Sigma} )\) is called the generalized variance. Like the univariate variance, it changes when the variables are rescaled. Suppose \(\boldsymbol{x}\) follows a multivariate Gaussian distribution \(\boldsymbol{x} \sim \mathcal{N}_p(\boldsymbol{\mu} , \boldsymbol{\Sigma})\); then \(\operatorname{det} (\boldsymbol{\Sigma})\) has the following interpretations:
  - \(\operatorname{det}(\boldsymbol{\Sigma})\) is an (indirect) measure of the entropy of the Gaussian density
    \[ H(\mathcal{N} _p)=\frac{p}{2}(1+\ln (2 \pi))+\frac{1}{2} \ln |\boldsymbol{\Sigma}| \]
  - \(\operatorname{det} (\boldsymbol{\Sigma})\) is proportional to the square of the volume of the ellipsoid \(E(\boldsymbol{\mu} , \boldsymbol{\Sigma}, c) = \left\{\boldsymbol{x} \in \mathbb{R} ^p: (\boldsymbol{x}-\boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu}) \le c \right\}\), which measures the dispersion of the “data cloud”, i.e. uncertainty.
  - \(\operatorname{det} (\boldsymbol{\Sigma}) = \prod_{j=1}^p \lambda_j = 0\) if at least one variable is degenerate.
  - The interpretation for other distributions is analogous.
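The properties above can be checked numerically. The following is a minimal sketch, not a proof: the values of `mu`, `Sigma`, `A`, `b`, `a` and `M` are made up for illustration, and the transformation and quadratic-form identities are compared against Monte Carlo estimates, so the agreement is only approximate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population parameters (p = 3), chosen only for illustration.
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([
    [4.0, 1.2, 0.6],
    [1.2, 2.0, -0.4],
    [0.6, -0.4, 1.0],
])

# Sum of eigenvalues equals the trace; their product equals the determinant.
eigvals = np.linalg.eigvalsh(Sigma)
assert np.isclose(eigvals.sum(), np.trace(Sigma))
assert np.isclose(eigvals.prod(), np.linalg.det(Sigma))
assert np.all(eigvals > 0)  # this particular Sigma is positive definite

# Var(a^T x + b) = a^T Sigma a is nonnegative for any a.
a = np.array([1.0, -1.0, 2.0])
var_ax = a @ Sigma @ a

# Simulate Gaussian draws to compare against the theoretical formulas.
n = 200_000
x = rng.multivariate_normal(mu, Sigma, size=n)  # shape (n, p)

# Var(Ax + b) = A Sigma A^T
A = np.array([[1.0, 0.0, -1.0],
              [0.5, 2.0, 0.0]])
b = np.array([3.0, -1.0])
y = x @ A.T + b
cov_y_mc = np.cov(y, rowvar=False)
cov_y_theory = A @ Sigma @ A.T

# E[x^T M x] = mu^T M mu + tr(M Sigma)
M = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 1.5]])
quad_mc = np.einsum("ij,jk,ik->i", x, M, x).mean()  # row-wise x_i^T M x_i
quad_theory = mu @ M @ mu + np.trace(M @ Sigma)

print(np.round(cov_y_mc, 2), np.round(cov_y_theory, 2))
print(round(quad_mc, 2), round(quad_theory, 2))
```

Increasing `n` should tighten the agreement between the simulated and theoretical values.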
Sample Covariance Matrix¶
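The following sample covariance matrix \(\boldsymbol{S}\) is an unbiased estimator of the population covariance matrix \(\boldsymbol{\Sigma}\):
\[
\boldsymbol{S}
= \begin{bmatrix}
s_{11} & s_{12} & \cdots & s_{1p} \\
s_{21} & s_{22} & \cdots & s_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
s_{p1} & s_{p2} & \cdots & s_{pp}
\end{bmatrix}
\]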
where
- \(s_{jj} = s_j ^2\) is the sample variance of \(x_j\) 
- \(s_{kj} = \frac{1}{n-1} \sum_{i=1}^{n}\left(x_{i k}-\bar{x}_{k}\right)\left(x_{i j}-\bar{x}_{j}\right)\) is the sample covariance between \(x_k\) and \(x_j\) 
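In matrix form,
\[
\boldsymbol{S} = \frac{1}{n-1} \left( \boldsymbol{X}^{\top} \boldsymbol{X} - n \bar{\boldsymbol{x}} \bar{\boldsymbol{x}}^{\top} \right) = \frac{1}{n-1} \boldsymbol{W}
\]
where
\[
\boldsymbol{W} = \sum_{i=1}^{n} \left( \boldsymbol{x}_i - \bar{\boldsymbol{x}} \right) \left( \boldsymbol{x}_i - \bar{\boldsymbol{x}} \right)^{\top}
\]
is called the corrected (centered) sums of squares and products matrix (CSSP) or scatter matrix. One can view it as a multivariate generalization of the corrected (centered) sum of squares \(\sum_i \left( x_i - \bar{x} \right)^2\) in the univariate case.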
The determinant of the sample covariance matrix \(\left\vert \boldsymbol{S} \right\vert = \operatorname{det} (\boldsymbol{S} )\) is called the generalized sample variance. Like the univariate sample variance, it changes when the variables are rescaled. Since \(\boldsymbol{S}\) is an estimator of \(\boldsymbol{\Sigma}\), the interpretations of \(\operatorname{det}(\boldsymbol{S} )\) and \(\operatorname{det}(\boldsymbol{\Sigma} )\) are similar. See the above section for \(\operatorname{det} (\boldsymbol{\Sigma})\).
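To make the sample covariance construction concrete, here is a small NumPy sketch on randomly generated data. The names `Xc`, `W`, `W_alt` and `S` are illustrative choices, not fixed notation. It builds the scatter matrix in two equivalent ways, compares the result with `np.cov` (which divides by \(n-1\) by default), and shows how the generalized sample variance reacts to rescaling one variable.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))  # hypothetical data matrix, rows are observations

x_bar = X.mean(axis=0)
Xc = X - x_bar               # centered data

# Scatter (CSSP) matrix, computed two equivalent ways.
W = Xc.T @ Xc
W_alt = X.T @ X - n * np.outer(x_bar, x_bar)
assert np.allclose(W, W_alt)

# Unbiased sample covariance matrix.
S = W / (n - 1)
assert np.allclose(S, np.cov(X, rowvar=False))

# Generalized sample variance and its behavior under rescaling:
# multiplying one variable by 10 multiplies det(S) by 10^2.
gen_var = np.linalg.det(S)
X_scaled = X * np.array([10.0, 1.0, 1.0])
gen_var_scaled = np.linalg.det(np.cov(X_scaled, rowvar=False))
assert np.isclose(gen_var_scaled, 100.0 * gen_var)
```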
Covariance Matrix of Two Vectors¶
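The covariance matrix of two random vectors \(\boldsymbol{x} _{p\times 1}, \boldsymbol{y} _{q \times 1}\) is defined as
\[
\operatorname{Cov}\left( \boldsymbol{x} , \boldsymbol{y} \right)
= \mathbb{E} \left[ \left( \boldsymbol{x} - \mathbb{E} [\boldsymbol{x}] \right) \left( \boldsymbol{y} - \mathbb{E} [\boldsymbol{y}] \right)^{\top} \right]
= \left[ \operatorname{Cov}\left( x_i, y_j \right) \right]_{p \times q}
\]
Note that the shape is \(p \times q\), so in general this matrix is not symmetric (it need not even be square). Instead, \(\operatorname{Cov}\left( \boldsymbol{x} , \boldsymbol{y} \right) = \operatorname{Cov}\left( \boldsymbol{y} , \boldsymbol{x} \right)^{\top}\).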
Properties
- \(\operatorname{Var}\left( \boldsymbol{x} \right) = \operatorname{Cov}\left( \boldsymbol{x} , \boldsymbol{x} \right)\) 
- If \(\boldsymbol{x} _1, \boldsymbol{x} _2\) are \(p \times 1\) vectors and \(\boldsymbol{y}\) is a \(q \times 1\) vector, then \(\operatorname{Cov}\left( \boldsymbol{x} _1 + \boldsymbol{x} _2, \boldsymbol{y} \right) = \operatorname{Cov}\left( \boldsymbol{x} _1, \boldsymbol{y} \right) + \operatorname{Cov}\left( \boldsymbol{x} _2, \boldsymbol{y} \right)\)
- If \(\boldsymbol{x}\) and \(\boldsymbol{y}\) are \(p \times 1\) vectors, then \(\operatorname{Var}\left( \boldsymbol{x} +\boldsymbol{y} \right) = \operatorname{Var}\left( \boldsymbol{x} \right) + \operatorname{Var}\left( \boldsymbol{y} \right) + \operatorname{Cov}\left( \boldsymbol{y} ,\boldsymbol{x} \right) + \operatorname{Cov}\left( \boldsymbol{x} , \boldsymbol{y} \right)\)
- \(\operatorname{Cov}\left( \boldsymbol{A} \boldsymbol{x} , \boldsymbol{B} \boldsymbol{y} \right) = \boldsymbol{A} \operatorname{Cov}\left( \boldsymbol{x} , \boldsymbol{y} \right) \boldsymbol{B} ^\top\) 
- If \(\boldsymbol{x}\) and \(\boldsymbol{y}\) are independent, then \(\operatorname{Cov}\left( \boldsymbol{x} , \boldsymbol{y} \right) = \boldsymbol{0}\). Note that the converse is not always true.
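The sample analogue of the cross-covariance satisfies the same algebraic identities exactly, which gives a quick way to sanity-check them. The sketch below assumes paired observations stored row-wise; `cross_cov` is a hypothetical helper written for this example, not a NumPy function.

```python
import numpy as np

def cross_cov(X, Y):
    """Sample cross-covariance (p x q) of paired rows of X (n x p) and Y (n x q)."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    return Xc.T @ Yc / (n - 1)

rng = np.random.default_rng(2)
n, p, q = 200, 3, 2
X = rng.normal(size=(n, p))
Y = X[:, :q] + 0.5 * rng.normal(size=(n, q))  # make y correlated with x

C_xy = cross_cov(X, Y)
assert C_xy.shape == (p, q)                     # p x q, not square here
assert np.allclose(C_xy, cross_cov(Y, X).T)     # Cov(x, y) = Cov(y, x)^T
assert np.allclose(cross_cov(X, X), np.cov(X, rowvar=False))  # Var(x) = Cov(x, x)

# Cov(Ax, By) = A Cov(x, y) B^T
A = rng.normal(size=(4, p))
B = rng.normal(size=(5, q))
assert np.allclose(cross_cov(X @ A.T, Y @ B.T), A @ C_xy @ B.T)

# Var(x + z) = Var(x) + Var(z) + Cov(z, x) + Cov(x, z) for same-dimension vectors
Z = X + rng.normal(size=(n, p))
lhs = cross_cov(X + Z, X + Z)
rhs = cross_cov(X, X) + cross_cov(Z, Z) + cross_cov(Z, X) + cross_cov(X, Z)
assert np.allclose(lhs, rhs)
```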
Correlation Matrix¶
Population Correlation Matrix¶
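The correlation matrix of \(p\) random variables \(x_1, x_2, \ldots, x_p\) is
\[
\boldsymbol{\rho}
= \begin{bmatrix}
1 & \rho_{12} & \cdots & \rho_{1p} \\
\rho_{21} & 1 & \cdots & \rho_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
\rho_{p1} & \rho_{p2} & \cdots & 1
\end{bmatrix}
\]
where \(\rho_{ij} = \operatorname{Corr}\left( x_i, x_j \right) = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii} \sigma_{jj}}}\).
In matrix form,
\[
\boldsymbol{\rho} = \boldsymbol{D}^{-1} \boldsymbol{\Sigma} \boldsymbol{D}^{-1}
\]
where
\[
\boldsymbol{D} = \operatorname{diag}\left( \sqrt{\sigma_{11}}, \sqrt{\sigma_{22}}, \ldots, \sqrt{\sigma_{pp}} \right)
\]
In short we will write \(\boldsymbol{D} ^{-1} = \left( \operatorname{diag}\left( \boldsymbol{\Sigma} \right) \right) ^{-\frac{1}{2}}\).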
Properties
- \(\rho_{ii} = 1\). \(\rho_{ij} = \rho_{ji}\). \(\rho_{ij} = 0\) iff \(\sigma_{ij} = 0\) 
- Each \(\rho_{ij}\) is invariant to shifting \(x_i\) and \(x_j\) by constants and to rescaling them by positive constants.
- \(\operatorname{tr}(\boldsymbol{\rho} )= \sum_{j=1}^p \lambda_j = p\), where \(\lambda_j\) are the eigenvalues of \(\boldsymbol{\rho}\).
- \(\operatorname{det}(\boldsymbol{\rho} ) = \prod_{j=1}^p \lambda_j \in [0, 1]\): it is 1 if all variables are pairwise uncorrelated (in particular, if they are independent), and 0 if \(\boldsymbol{x}\) is degenerate, i.e. the variables are exactly linearly related. The larger the value, the weaker the linear dependence among the variables and the higher the level of uncertainty.
- \(\operatorname{det}(\boldsymbol{\rho} ) = \operatorname{det} (\boldsymbol{D} ^{-1} \boldsymbol{\Sigma} \boldsymbol{D} ^{-1} ) = \operatorname{det}(\boldsymbol{D} ^{-1}) \operatorname{det} (\boldsymbol{\Sigma} ) \operatorname{det} (\boldsymbol{D} ^{-1} ) = \frac{\operatorname{det} (\boldsymbol{\Sigma})}{\prod_i {\sigma}_{ii}}\)
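The correlation-matrix properties can be illustrated with a few lines of NumPy. In this sketch `cov_to_corr` is a hypothetical helper and `Sigma` is a made-up positive definite matrix. Because \(\operatorname{det}(\boldsymbol{D}^{-1}) = \prod_i \sigma_{ii}^{-1/2}\), the determinant check divides \(\operatorname{det}(\boldsymbol{\Sigma})\) by the product of the variances.

```python
import numpy as np

def cov_to_corr(Sigma):
    """Standardize a covariance matrix: rho = D^{-1} Sigma D^{-1}."""
    d = np.sqrt(np.diag(Sigma))      # standard deviations
    return Sigma / np.outer(d, d)    # divides entry (i, j) by d_i * d_j

# Hypothetical covariance matrix, chosen only for illustration.
Sigma = np.array([
    [4.0, 1.2, 0.6],
    [1.2, 2.0, -0.4],
    [0.6, -0.4, 1.0],
])
p = Sigma.shape[0]
rho = cov_to_corr(Sigma)

# Same result from the explicit matrix form.
D_inv = np.diag(1.0 / np.sqrt(np.diag(Sigma)))
assert np.allclose(rho, D_inv @ Sigma @ D_inv)

# Unit diagonal, and the eigenvalues sum to p (the trace).
assert np.allclose(np.diag(rho), 1.0)
eig = np.linalg.eigvalsh(rho)
assert np.isclose(eig.sum(), p)

# det(rho) = det(Sigma) / prod(sigma_ii), and it lies in [0, 1].
det_rho = np.linalg.det(rho)
assert np.isclose(det_rho, np.linalg.det(Sigma) / np.prod(np.diag(Sigma)))
assert 0.0 <= det_rho <= 1.0

# Positive rescaling x_i -> c_i x_i turns Sigma into C Sigma C but leaves rho unchanged.
C = np.diag([10.0, 0.1, 3.0])
assert np.allclose(cov_to_corr(C @ Sigma @ C), rho)
```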
Sample Correlation Matrix¶
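From the sample covariance matrix, we can obtain the sample correlation matrix as an estimate of the population correlation matrix:
\[
\boldsymbol{R}
= \begin{bmatrix}
1 & r_{12} & \cdots & r_{1p} \\
r_{21} & 1 & \cdots & r_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
r_{p1} & r_{p2} & \cdots & 1
\end{bmatrix}
\]
where \(r_{kj} = \frac{s_{kj}}{\sqrt{s_{kk} s_{jj}}}\), which is the sample correlation coefficient between \(x_k\) and \(x_j\).
In matrix form,
\[
\boldsymbol{R} = \boldsymbol{D}^{-1} \boldsymbol{S} \boldsymbol{D}^{-1}
\]
where
\[
\boldsymbol{D}^{-1} = \left( \operatorname{diag}\left( \boldsymbol{S} \right) \right)^{-\frac{1}{2}}
\]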
Note
The transform from covariance matrix to correlation matrix
\[
\boldsymbol{\rho} = \boldsymbol{D}^{-1} \boldsymbol{\Sigma} \boldsymbol{D}^{-1}
\]
is just a particular application of standardizing a positive definite matrix into a standard form where the diagonal elements are all equal to one. The original positive definite matrix can be recovered from its diagonal elements and the standardized matrix.
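The round trip described in the note can be sketched as follows, using sample quantities computed from randomly generated data. The mixing matrix used to create correlated columns is arbitrary, and the variable names are illustrative. The standardized matrix is compared with `np.corrcoef`, and the covariance matrix is then recovered from the standard deviations and the correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data with correlated columns (the mixing matrix is arbitrary).
mix = np.array([[2.0, 0.0, 0.0],
                [0.5, 1.0, 0.0],
                [0.0, -0.3, 0.7]])
X = rng.normal(size=(100, 3)) @ mix

S = np.cov(X, rowvar=False)   # sample covariance matrix
d = np.sqrt(np.diag(S))       # sample standard deviations (diagonal of D)

# Standardize: R = D^{-1} S D^{-1}
R = S / np.outer(d, d)
assert np.allclose(R, np.corrcoef(X, rowvar=False))

# Recover the original matrix from its diagonal and the standardized form: S = D R D
S_back = R * np.outer(d, d)
assert np.allclose(S_back, S)
```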