Multivariate Notations

In machine learning models, we often deal with more than one variable at a time. Below are the notations for the multivariate case, together with their properties.

Data Matrix

Suppose there are \(p\) random variables \(X_1, X_2, \ldots, X_p\) and we have \(n\) observed values for each of them. The data matrix is

\[\begin{split} \boldsymbol{X}=\left(x_{i j}\right)_{n \times p}=\left[\begin{array}{c} \boldsymbol{x}_{1}^{\top} \\ \vdots \\ \boldsymbol{x}_{i}^{\top} \\ \vdots \\ \boldsymbol{x}_{n}^{\top} \end{array}\right]=\left[\begin{array}{ccccc} x_{11} & \cdots & x_{1 j} & \cdots & x_{1 p} \\ \vdots & & \vdots & & \vdots \\ x_{i 1} & \cdots & x_{i j} & \cdots & x_{i p} \\ \vdots & & \vdots & & \vdots \\ x_{n 1} & \cdots & x_{n j} & \cdots & x_{n p} \end{array}\right] \end{split}\]


  • Column \(j\) contains observations of variable \(j\).

  • Row \(i\) is the \(i\)-th observed vector \(\boldsymbol{x}_i\), stored as the row \(\boldsymbol{x}_i^{\top}\).

Mean Vector

Population Mean Vector

The mean vector of a random vector \(\boldsymbol{x}\) is defined as

\[\begin{split} \operatorname{\mathbb{E}}(\boldsymbol{x})=\left[\begin{array}{c} \operatorname{\mathbb{E}}\left(x_{1}\right) \\ \vdots \\ \operatorname{\mathbb{E}}\left(x_{p}\right) \end{array}\right]=\left[\begin{array}{c} \mu_{1} \\ \vdots \\ \mu_{p} \end{array}\right]=\boldsymbol{\mu} \end{split}\]


  1. \(\operatorname{\mathbb{E}}\left( \boldsymbol{a}^{\boldsymbol{\top}} \boldsymbol{x} \right)=\boldsymbol{a}^{\boldsymbol{\top}} \boldsymbol{\mu}\)

  2. \(\operatorname{\mathbb{E}}\left( \boldsymbol{A x} \right)=\boldsymbol{A} \boldsymbol{\mu}\)

Sample Mean Vector

Let \(\bar{x}_j\) be the sample mean of variable \(X_j\), \(j = 1, \ldots, p\). The sample mean vector is

\[\begin{split} \overline{\boldsymbol{x}}_{p \times 1}=\left[\begin{array}{c} \bar{x}_{1} \\ \vdots \\ \bar{x}_{p} \end{array}\right]=\frac{1}{n} \boldsymbol{X}^{\top} \boldsymbol{1} \end{split}\]
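The matrix expression for the sample mean vector can be checked numerically. The following is a minimal NumPy sketch; the data matrix `X`, its shape, and the random seed are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)    # fixed seed; the data are illustrative only
n, p = 6, 3
X = rng.normal(size=(n, p))       # hypothetical n x p data matrix

ones = np.ones(n)                 # the vector of ones, length n
x_bar = X.T @ ones / n            # (1/n) X^T 1, a vector of length p

# Same result as averaging each column (variable) separately.
assert np.allclose(x_bar, X.mean(axis=0))
```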

Covariance Matrix

Population Covariance Matrix

Also known as the variance-covariance matrix.

The covariance matrix of a random vector \(\boldsymbol{x}\) summarizes the pairwise covariances:

\[\begin{split} \operatorname{Var}(\boldsymbol{x}) \text { or } \boldsymbol{\Sigma}=\left[\begin{array}{cccc} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1 p} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2 p} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{p 1} & \sigma_{p 2} & \cdots & \sigma_{p p} \end{array}\right] \end{split}\]


  • \(\sigma_{ii} = \operatorname{Cov}\left( x_i, x_i \right) = \operatorname{Var}\left( x_i \right)\)

  • \(\sigma_{ij} = \operatorname{Cov}\left( x_i, x_j \right) = \sigma_{ji}\)

In matrix form,

\[\begin{split} \begin{align} \boldsymbol{\Sigma} &=\operatorname{\mathbb{E}}\left[(\boldsymbol{x}-\operatorname{\mathbb{E}}(\boldsymbol{x}))(\boldsymbol{x}-\operatorname{\mathbb{E}}(\boldsymbol{x}))^{\top}\right] \\ &=\operatorname{\mathbb{E}}\left(\boldsymbol{x} \boldsymbol{x}^{\top}\right)-\operatorname{\mathbb{E}}(\boldsymbol{x}) \operatorname{\mathbb{E}}(\boldsymbol{x})^{\top} \\ &=\operatorname{\mathbb{E}}\left(\boldsymbol{x} \boldsymbol{x}^{\top}\right)- \boldsymbol{\mu} \boldsymbol{\mu} ^\top \end{align} \end{split}\]

which is a multivariate extension of \(\mathbb{V} [X] = \mathbb{E} [X^2] - \mathbb{E} [X] ^2\).
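For readers who want the intermediate step, the identity follows from expanding the outer product and using linearity of expectation, with \(\boldsymbol{\mu} = \operatorname{\mathbb{E}}(\boldsymbol{x})\) treated as a constant vector. A short sketch of the expansion:

\[\begin{split} \begin{align} \operatorname{\mathbb{E}}\left[(\boldsymbol{x}-\boldsymbol{\mu})(\boldsymbol{x}-\boldsymbol{\mu})^{\top}\right] &=\operatorname{\mathbb{E}}\left[\boldsymbol{x} \boldsymbol{x}^{\top}-\boldsymbol{x} \boldsymbol{\mu}^{\top}-\boldsymbol{\mu} \boldsymbol{x}^{\top}+\boldsymbol{\mu} \boldsymbol{\mu}^{\top}\right] \\ &=\operatorname{\mathbb{E}}\left(\boldsymbol{x} \boldsymbol{x}^{\top}\right)-\operatorname{\mathbb{E}}(\boldsymbol{x}) \boldsymbol{\mu}^{\top}-\boldsymbol{\mu} \operatorname{\mathbb{E}}(\boldsymbol{x})^{\top}+\boldsymbol{\mu} \boldsymbol{\mu}^{\top} \\ &=\operatorname{\mathbb{E}}\left(\boldsymbol{x} \boldsymbol{x}^{\top}\right)-\boldsymbol{\mu} \boldsymbol{\mu}^{\top} \end{align} \end{split}\]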


  1. \(\boldsymbol{\Sigma}\) is always positive semi-definite. It is positive definite, and hence \(\boldsymbol{\Sigma} ^{-1}\) exists, in the non-degenerate case.

    Positive definiteness holds unless \(x_1, x_2, \ldots, x_p\) are linearly related, in which case we say that \(\boldsymbol{x}\) is a degenerate random vector, i.e. its effective dimension is less than \(p\); in other words, its joint distribution is concentrated on a subspace of lower dimension.

  2. Transformation

    • \(\operatorname{Var}\left( \boldsymbol{a}^\top \boldsymbol{x} + b \right) = \operatorname{Var}\left( \boldsymbol{a} ^\top \boldsymbol{x} \right) = \boldsymbol{a}^\top \boldsymbol{\Sigma} \boldsymbol{a} \ge 0\)

      Equality holds iff \(\boldsymbol{a} ^\top \boldsymbol{x} = c\) for some constant \(c\) (with probability 1).

    • \(\operatorname{Var}\left( \boldsymbol{A} \boldsymbol{x} + \boldsymbol{b} \right) = \boldsymbol{A} \boldsymbol{\Sigma} \boldsymbol{A} ^\top\)

  3. Expectation of quadratic form: \(\mathbb{E} [\boldsymbol{x} ^{\top} \boldsymbol{A} \boldsymbol{x} ] = \boldsymbol{\mu} ^{\top} \boldsymbol{A} \boldsymbol{\mu} + \operatorname{tr}(\boldsymbol{A} \boldsymbol{\Sigma} )\).

  4. \(\sum_{j=1}^p \lambda_j = \sum_{j=1}^p \sigma_{jj}\): the sum of the eigenvalues of \(\boldsymbol{\Sigma}\) equals the sum of the variances, i.e. \(\operatorname{tr}(\boldsymbol{\Sigma})\).

  5. The determinant of the covariance matrix \(\left\vert \boldsymbol{\Sigma} \right\vert = \operatorname{det} (\boldsymbol{\Sigma} )\) is called the generalized variance. Like the univariate variance, it changes when the variables are rescaled. Suppose \(\boldsymbol{x}\) follows a multivariate Gaussian distribution \(\boldsymbol{x} \sim \mathcal{N}_p(\boldsymbol{\mu} , \boldsymbol{\Sigma})\); then we have the following interpretations of \(\operatorname{det} (\boldsymbol{\Sigma})\):

    • \(\operatorname{det}(\boldsymbol{\Sigma})\) is an (indirect) measure of the entropy of the Gaussian density

    \[ H(\mathcal{N} _p)=\frac{p}{2}(1+\ln (2 \pi))+\frac{1}{2} \ln |\boldsymbol{\Sigma}| \]
    • \(\operatorname{det} (\boldsymbol{\Sigma})\) is proportional to the square of the volume of the ellipsoid \(E(\boldsymbol{\mu} , \boldsymbol{\Sigma}, c) = \left\{\boldsymbol{x} \in \mathbb{R} ^p: (\boldsymbol{x}-\boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu}) \le c \right\}\), which measures the dispersion of the “data cloud”, i.e. uncertainty.

    • \(\operatorname{det} (\boldsymbol{\Sigma}) = \prod_{j=1}^p \lambda_j = 0\) if at least one variable is degenerate or, more generally, if the variables are linearly related.

    The interpretation for other distributions is analogous.
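Several of the properties above can be checked numerically. The sketch below uses NumPy with a made-up covariance matrix `Sigma`, mean `mu`, and matrices `A`, `B`, `b`; none of these come from the text. The transformation and quadratic-form identities are verified on their sample analogues, for which they hold exactly, rather than on the population quantities.

```python
import numpy as np

rng = np.random.default_rng(1)            # fixed seed; data are illustrative only
p, n = 3, 500
L = rng.normal(size=(p, p))
Sigma = L @ L.T + 0.1 * np.eye(p)         # hypothetical positive definite covariance
mu = np.array([1.0, -2.0, 0.5])           # hypothetical mean vector
X = rng.multivariate_normal(mu, Sigma, size=n)   # n x p sample

# Transformation: the sample covariance of A x + b equals A S A^T exactly.
A = rng.normal(size=(2, p))
b = np.array([3.0, -1.0])
S = np.cov(X, rowvar=False)
S_y = np.cov(X @ A.T + b, rowvar=False)
assert np.allclose(S_y, A @ S @ A.T)

# Quadratic form: sample analogue of E[x^T B x] = mu^T B mu + tr(B Sigma),
# using the divide-by-n covariance so that the identity is exact.
B = rng.normal(size=(p, p))
x_bar = X.mean(axis=0)
S_n = np.cov(X, rowvar=False, bias=True)
quad_mean = np.einsum("ij,jk,ik->", X, B, X) / n   # average of x_i^T B x_i
quad_rhs = x_bar @ B @ x_bar + np.trace(B @ S_n)
assert np.isclose(quad_mean, quad_rhs)

# Eigenvalues: their sum is the trace, their product is the determinant.
eigvals = np.linalg.eigvalsh(Sigma)
assert np.isclose(eigvals.sum(), np.trace(Sigma))
assert np.isclose(eigvals.prod(), np.linalg.det(Sigma))

# Gaussian entropy computed from the log-determinant.
sign, logdet = np.linalg.slogdet(Sigma)
entropy = 0.5 * p * (1 + np.log(2 * np.pi)) + 0.5 * logdet
```

Using `slogdet` rather than `det` is a design choice for numerical stability when \(p\) is large; for small matrices either works.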

Sample Covariance Matrix

The following sample covariance matrix \(\boldsymbol{S}\) is an unbiased estimator of the population covariance matrix \(\boldsymbol{\Sigma}\):

\[\begin{split} \boldsymbol{S}_{p \times p}=\left[\begin{array}{cccc} s_{11} & s_{12} & \cdots & s_{1 p} \\ s_{21} & s_{22} & \cdots & s_{2 p} \\ \vdots & \vdots & \ddots & \vdots \\ s_{p 1} & s_{p 2} & \cdots & s_{p p} \end{array}\right] \end{split}\]


  • \(s_{jj} = s_j ^2\) is the sample variance of \(x_j\)

  • \(s_{kj} = \frac{1}{n-1} \sum_{i=1}^{n}\left(x_{i k}-\bar{x}_{k}\right)\left(x_{i j}-\bar{x}_{j}\right)\) is the sample covariance between \(x_k\) and \(x_j\)

In matrix form,

\[\begin{split} \begin{aligned} \boldsymbol{S} &=\frac{1}{n-1}\left[\boldsymbol{X}^{\top} \boldsymbol{X}-n \overline{\boldsymbol{x}} \overline{\boldsymbol{x}}^{\top}\right] \\ &=\frac{1}{n-1} \sum_{i=1}^{n}\left(\boldsymbol{x}_{i}-\overline{\boldsymbol{x}}\right)\left(\boldsymbol{x}_{i}-\overline{\boldsymbol{x}}\right)^{\top} \\ &=\frac{1}{n-1} \boldsymbol{W} \end{aligned} \end{split}\]


\[ \boldsymbol{W}_{p \times p}=\boldsymbol{X}^{\top} \boldsymbol{X}-n \overline{\boldsymbol{x}} \overline{\boldsymbol{x}}^{\top}=\sum_{i=1}^{n}\left(\boldsymbol{x}_{i}-\overline{\boldsymbol{x}}\right)\left(\boldsymbol{x}_{i}-\overline{\boldsymbol{x}}\right)^{\top} \]

is called the corrected (centered) sums of squares and products matrix (CSSP) or scatter matrix. One can view it as a multivariate generalization of the corrected (centered) sum of squares \(\sum_i \left( x_i - \bar{x} \right)^2\) in the univariate case.
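The equivalent matrix forms of \(\boldsymbol{W}\) and \(\boldsymbol{S}\) can be confirmed with a few lines of NumPy. This is a minimal sketch on made-up data; the comparison against `np.cov` relies on its default divisor of \(n-1\).

```python
import numpy as np

rng = np.random.default_rng(2)    # fixed seed; the data are illustrative only
n, p = 8, 3
X = rng.normal(size=(n, p))       # hypothetical n x p data matrix
x_bar = X.mean(axis=0)

# Form 1: X^T X - n * x_bar x_bar^T
W1 = X.T @ X - n * np.outer(x_bar, x_bar)

# Form 2: sum of outer products of the centered observations
Xc = X - x_bar
W2 = Xc.T @ Xc

S = W1 / (n - 1)                  # unbiased sample covariance matrix

assert np.allclose(W1, W2)
assert np.allclose(S, np.cov(X, rowvar=False))   # columns are variables
```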

The determinant of the sample covariance matrix \(\left\vert \boldsymbol{S} \right\vert = \operatorname{det} (\boldsymbol{S} )\) is called the generalized sample variance. Like the univariate sample variance, it changes when the variables are rescaled. Since \(\boldsymbol{S}\) is an estimator of \(\boldsymbol{\Sigma}\), the interpretations of \(\operatorname{det}(\boldsymbol{S} )\) and \(\operatorname{det}(\boldsymbol{\Sigma} )\) are similar; see the discussion of \(\operatorname{det} (\boldsymbol{\Sigma})\) in the section above.

Covariance Matrix of Two Vectors

The covariance matrix of two random vectors \(\boldsymbol{x} _{p\times 1}, \boldsymbol{y} _{q \times 1}\) is defined as

\[ \operatorname{Cov}\left( \boldsymbol{x} _{p \times 1}, \boldsymbol{y} _ {q \times 1} \right) = \operatorname{\mathbb{E}}\left[(\boldsymbol{x}-\boldsymbol{\mu} _x)(\boldsymbol{y}-\boldsymbol{\mu} _y)^{\top}\right]_{p\times q} \]

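Note that the shape is \(p \times q\), so this matrix is in general neither square nor symmetric. Swapping the arguments transposes it, \(\operatorname{Cov}\left( \boldsymbol{y} , \boldsymbol{x} \right) = \operatorname{Cov}\left( \boldsymbol{x} , \boldsymbol{y} \right)^\top\), and hence in general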

\[ \operatorname{Cov}\left( \boldsymbol{x} , \boldsymbol{y} \right) \ne \operatorname{Cov}\left( \boldsymbol{y} , \boldsymbol{x} \right) \]


  1. \(\operatorname{Var}\left( \boldsymbol{x} \right) = \operatorname{Cov}\left( \boldsymbol{x} , \boldsymbol{x} \right)\)

  2. If \(\boldsymbol{x} _1, \boldsymbol{x} _2, \boldsymbol{y}\) are \(p \times 1\) vectors, then \(\operatorname{Cov}\left( \boldsymbol{x} _1 + \boldsymbol{x} _2, \boldsymbol{y} \right) = \operatorname{Cov}\left( \boldsymbol{x} _1, \boldsymbol{y} \right) + \operatorname{Cov}\left( \boldsymbol{x} _2 , \boldsymbol{y} \right)\)

  3. If \(\boldsymbol{x}\) and \(\boldsymbol{y}\) are \(p \times 1\) vectors, then \(\operatorname{Var}\left( \boldsymbol{x} +\boldsymbol{y} \right) = \operatorname{Var}\left( \boldsymbol{x} \right) + \operatorname{Var}\left( \boldsymbol{y} \right) + \operatorname{Cov}\left( \boldsymbol{y} ,\boldsymbol{x} \right) + \operatorname{Cov}\left( \boldsymbol{x} , \boldsymbol{y} \right)\)

  4. \(\operatorname{Cov}\left( \boldsymbol{A} \boldsymbol{x} , \boldsymbol{B} \boldsymbol{y} \right) = \boldsymbol{A} \operatorname{Cov}\left( \boldsymbol{x} , \boldsymbol{y} \right) \boldsymbol{B} ^\top\)

  5. If \(\boldsymbol{x}\) and \(\boldsymbol{y}\) are independent, then \(\operatorname{Cov}\left( \boldsymbol{x} , \boldsymbol{y} \right) = \boldsymbol{0}\). Note that the converse is not always true.
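The algebraic properties above also hold exactly for the sample cross-covariance, which makes them easy to check. The sketch below defines a hypothetical helper `sample_cross_cov` (not part of NumPy) and uses made-up paired data `X`, `Y` of the same dimension so that the sum \(\boldsymbol{x} + \boldsymbol{y}\) is defined.

```python
import numpy as np

def sample_cross_cov(X, Y):
    """Sample analogue of Cov(x, y); rows of X and Y are paired observations."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    return Xc.T @ Yc / (n - 1)    # shape (p, q)

rng = np.random.default_rng(3)    # fixed seed; the data are illustrative only
n, p = 50, 3
X = rng.normal(size=(n, p))
Y = 0.5 * X + rng.normal(size=(n, p))   # correlated with X by construction

C_xy = sample_cross_cov(X, Y)
C_yx = sample_cross_cov(Y, X)
S_x = np.cov(X, rowvar=False)
S_y = np.cov(Y, rowvar=False)

# Swapping the arguments transposes the matrix.
assert np.allclose(C_yx, C_xy.T)
# Property 1: Var(x) = Cov(x, x).
assert np.allclose(sample_cross_cov(X, X), S_x)
# Property 3: Var(x + y) = Var(x) + Var(y) + Cov(y, x) + Cov(x, y).
assert np.allclose(np.cov(X + Y, rowvar=False), S_x + S_y + C_yx + C_xy)
# Property 4: Cov(Ax, By) = A Cov(x, y) B^T.
A = rng.normal(size=(2, p))
B = rng.normal(size=(4, p))
C_ab = sample_cross_cov(X @ A.T, Y @ B.T)
assert np.allclose(C_ab, A @ C_xy @ B.T)
```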

Correlation Matrix

Population Correlation Matrix

The correlation matrix of \(p\) random variables \(x_1, x_2, \ldots, x_p\) is

\[\begin{split} \boldsymbol{\rho}=\left[\begin{array}{cccc} 1 & \rho_{12} & \cdots & \rho_{1 p} \\ \rho_{21} & 1 & \cdots & \rho_{2 p} \\ \vdots & & \ddots & \vdots \\ \rho_{p 1} & \rho_{p 2} & \cdots & 1 \end{array}\right] \end{split}\]

where \(\rho_{ij} = \operatorname{Corr}\left( x_i, x_j \right) = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii} \sigma_{jj}}}\)

In matrix form,

\[\begin{split} \begin{align} \boldsymbol{\rho} &=\left[\begin{array}{cccc} \frac{1}{\sqrt{\sigma_{11}}} & 0 & \cdots & 0 \\ 0 & \frac{1}{\sqrt{\sigma_{22}}} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{\sqrt{\sigma_{pp}}} \end{array}\right]\left[\begin{array}{cccc} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1 p} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2 p} \\ \vdots & & \ddots & \vdots \\ \sigma_{p 1} & \sigma_{p 2} & \cdots & \sigma_{p p} \end{array}\right]\left[\begin{array}{cccc} \frac{1}{\sqrt{\sigma_{11}}} & 0 & \cdots & 0 \\ 0 & \frac{1}{\sqrt{\sigma_{22}}} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{\sqrt{\sigma_{pp}}} \end{array}\right] \\ &= \boldsymbol{D} ^{-1} \boldsymbol{\Sigma} \boldsymbol{D} ^{-1} \end{align} \end{split}\]


\[\begin{split} \boldsymbol{D} = \left[\begin{array}{cccc} \sqrt{\sigma_{11}} & 0 & \cdots & 0 \\ 0 & \sqrt{\sigma_{22}} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sqrt{\sigma_{p p}} \end{array}\right] = \left( \operatorname{diag}\left( \boldsymbol{\Sigma} \right) \right) ^{\frac{1}{2}} \end{split}\]

In short, we will write \(\boldsymbol{D} ^{-1} = \left( \operatorname{diag}\left( \boldsymbol{\Sigma} \right) \right) ^{-\frac{1}{2}}\).


  • \(\rho_{ii} = 1\). \(\rho_{ij} = \rho_{ji}\). \(\rho_{ij} = 0\) iff \(\sigma_{ij} = 0\)

  • Each \(\rho_{ij}\) does not change under relocation or positive rescaling of \(x_i\) and \(x_j\)

  • \(\operatorname{tr}(\boldsymbol{\rho} )= \sum_{j=1}^p \lambda_j = p\)

  • \(\operatorname{det}(\boldsymbol{\rho} ) = \prod_{j=1}^p \lambda_j \in [0, 1]\): it is 1 if all variables are pairwise uncorrelated (in particular, if they are independent), and 0 if the variables are linearly related. The larger the value, the weaker the linear dependence among the variables and the higher the level of uncertainty.

  • \(\operatorname{det}(\boldsymbol{\rho} ) = \operatorname{det} (\boldsymbol{D} ^{-1} \boldsymbol{\Sigma} \boldsymbol{D} ^{-1} ) = \operatorname{det}(\boldsymbol{D} ^{-1}) \operatorname{det} (\boldsymbol{\Sigma} ) \operatorname{det} (\boldsymbol{D} ^{-1} ) = \operatorname{det} (\boldsymbol{\Sigma}) \Big/ \prod_{i=1}^p {\sigma}_{ii}\)
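As a concrete illustration of the matrix form, the sketch below converts a small covariance matrix into a correlation matrix with NumPy and checks the trace and determinant. The entries of `Sigma` are invented for the example.

```python
import numpy as np

# Hypothetical 3 x 3 covariance matrix (symmetric, positive definite).
Sigma = np.array([[4.0,  1.2,  0.6],
                  [1.2,  9.0, -1.5],
                  [0.6, -1.5,  1.0]])
p = Sigma.shape[0]

variances = np.diag(Sigma)                    # sigma_11, ..., sigma_pp
D_inv = np.diag(1.0 / np.sqrt(variances))     # (diag Sigma)^(-1/2)
rho = D_inv @ Sigma @ D_inv                   # correlation matrix

eigvals = np.linalg.eigvalsh(rho)
det_rho = np.linalg.det(rho)

assert np.allclose(np.diag(rho), 1.0)         # unit diagonal
assert np.isclose(eigvals.sum(), p)           # trace = sum of eigenvalues = p
# Each D^{-1} contributes a factor 1 / sqrt(prod sigma_ii) to the determinant.
assert np.isclose(det_rho, np.linalg.det(Sigma) / variances.prod())
assert 0.0 <= det_rho <= 1.0
```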

Sample Correlation Matrix

From the sample covariance matrix, we can obtain the sample correlation matrix as an estimate of the population correlation matrix:

\[\begin{split} \boldsymbol{R}_{p \times p}=\left[\begin{array}{cccc} 1 & r_{12} & \cdots & r_{1 p} \\ r_{21} & 1 & \cdots & r_{2 p} \\ \vdots & & \ddots & \vdots \\ r_{p 1} & r_{p 2} & \cdots & 1 \end{array}\right] \end{split}\]

where \(r_{kj} = \frac{s_{kj}}{\sqrt{s_{kk} s_{jj}}}\), which is the sample correlation coefficient between \(x_k\) and \(x_j\).

In matrix form,

\[ \boldsymbol{R} = \boldsymbol{D} ^{-1} \boldsymbol{S} \boldsymbol{D} ^{-1} \]


\[\begin{split} \boldsymbol{D}=\left[\begin{array}{cccc} \sqrt{s_{11}} & 0 & \cdots & 0 \\ 0 & \sqrt{s_{22}} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sqrt{s_{p p}} \end{array}\right] \end{split}\]


The transformation from a covariance matrix to a correlation matrix

\[\boldsymbol{\rho} = \left( \operatorname{diag}\left( \boldsymbol{\Sigma} \right) \right) ^{-\frac{1}{2}} \boldsymbol{\Sigma} \left( \operatorname{diag}\left( \boldsymbol{\Sigma} \right) \right) ^{-\frac{1}{2}} \]

is just a particular application of standardizing a positive definite matrix into a standard form in which the diagonal elements all equal one. The original positive definite matrix can be recovered from its diagonal elements together with the standardized matrix.
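The same standardization applied to a sample covariance matrix, and the recovery step in the other direction, can be sketched in NumPy as follows. The data are made up, and `np.corrcoef` is used only as an independent check.

```python
import numpy as np

rng = np.random.default_rng(4)    # fixed seed; the data are illustrative only
n, p = 30, 4
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))   # correlated columns

S = np.cov(X, rowvar=False)       # sample covariance matrix
s = np.sqrt(np.diag(S))           # sample standard deviations
D = np.diag(s)
D_inv = np.diag(1.0 / s)

R = D_inv @ S @ D_inv             # sample correlation matrix
S_back = D @ R @ D                # recover S from R and the diagonal of S

assert np.allclose(R, np.corrcoef(X, rowvar=False))
assert np.allclose(np.diag(R), 1.0)
assert np.allclose(S_back, S)
```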