# Sample Covariance Matrix

Suppose $$X_1, \dots, X_p$$ are random variables and $$X:=(X_1, \dots, X_p)$$ is the random vector of said random variables. Given a sample $$\mathbf x_1, \dots \mathbf x_N$$ of $$X$$, how do we obtain an estimate for the covariance matrix $$\text{Cov}(X)$$?

To this end, let $$\mathbf X=( \mathbf X_{ij})$$ denote the $$N \times p$$ data matrix whose $$k$$th row is $$\mathbf x_k$$ and let $$\bar x_i$$ denote the sample mean of $$X_i$$. That is, $$\bar x_i$$ is the mean of the $$i$$th column of $$\mathbf X$$. Then the sample covariance matrix $$Q = (Q_{ij})$$ is defined by

$\begin{equation*} Q_{ij} := \frac{1}{N-1} \sum_{k=1}^{N} (\mathbf X_{ki} - \bar x_i)(\mathbf X_{kj} - \bar x_j). \end{equation*}$

We can write this a bit more compactly if we introduce yet more notation. Let $$\bar{\mathbf x}:=(\bar x_1, \dots, \bar x_p)$$ denote the sample mean vector and $$M$$ the $$N \times p$$ matrix with all rows equal to $$\bar{\mathbf x}$$. The matrix $$N:=\mathbf X - M$$ is the data matrix that has been centered about the sample means, and we quickly see that

$\begin{equation*} Q = N^T N. \end{equation*}$