Dimensionality Reduction with Principal Component Analysis
A deep dive into the mathematical foundations of Principal Component Analysis (PCA) for dimensionality reduction.
Working directly with high-dimensional data, such as images, comes with some difficulties: It is hard to analyze, interpretation is difficult, visualization is nearly impossible, and storage of the data vectors can be expensive. However, high-dimensional data often has properties that we can exploit. For example, high-dimensional data is often overcomplete, i.e., many dimensions are redundant and can be explained by a combination of other dimensions. Furthermore, dimensions in high-dimensional data are often correlated so that the data possesses an intrinsic lower-dimensional structure. Dimensionality reduction exploits this structure and correlation and allows us to work with a more compact representation of the data, ideally without losing information. We can think of dimensionality reduction as a compression technique, similar to JPEG or MP3, which are compression algorithms for images and music, respectively.
In this post, we will discuss principal component analysis (PCA), an algorithm for linear dimensionality reduction. PCA has been around for more than 100 years and is still one of the most commonly used techniques for data compression and data visualization.
Problem Setting
In PCA, we are interested in finding projections (\tilde{x}_n) of data points (x_n) that are as similar to the original data points as possible, but which have a significantly lower intrinsic dimensionality.
More concretely, we consider an i.i.d. dataset (X = {x_1, \dots, x_N}), (x_n \in \mathbb{R}^D), with mean 0 that possesses the data covariance matrix [S = \frac{1}{N} \sum_{n=1}^N x_n x_n^\top]
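To make this concrete, here is a minimal NumPy sketch that builds a small synthetic dataset (all sizes and values are illustrative), centers it so the zero-mean assumption holds, and computes the data covariance matrix (S):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: N = 500 points in D = 5 dimensions (purely illustrative).
X = rng.normal(size=(500, 5))

# Center the data so that the zero-mean assumption holds.
X = X - X.mean(axis=0)

# Data covariance matrix S = (1/N) * sum_n x_n x_n^T, written as one matrix product.
N, D = X.shape
S = (X.T @ X) / N
print(S.shape)  # (5, 5)
```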
Furthermore, we assume there exists a low-dimensional compressed representation (code) [z_n = B^\top x_n \in \mathbb{R}^M] of (x_n), where we define the projection matrix [B := [b_1, \dots, b_M] \in \mathbb{R}^{D \times M}]
We assume that the columns of (B) are orthonormal, i.e., (b_i^\top b_j = 0) for (i \neq j) and (b_i^\top b_i = 1). We seek an (M)-dimensional subspace (U \subseteq \mathbb{R}^D, \dim(U) = M < D) onto which we project the data. We denote the projected data by (\tilde{x}_n \in U), and their coordinates (with respect to the basis vectors (b_1, \dots, b_M) of (U)) by (z_n). Our aim is to find projections (\tilde{x}_n \in \mathbb{R}^D) (or equivalently the codes (z_n) and the basis vectors (b_1, \dots, b_M)) that are as similar to the original data (x_n) as possible, i.e., that minimize the loss due to compression.
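Continuing the toy example, the sketch below computes the codes (z_n = B^\top x_n) and the projections (\tilde{x}_n = B z_n) for an arbitrary orthonormal basis (B). This basis is purely illustrative and not yet the PCA solution, which we derive next.

```python
# Choose an M-dimensional subspace and compute codes and projections.
M = 2

# Arbitrary D x M matrix with orthonormal columns (for illustration only).
B, _ = np.linalg.qr(rng.normal(size=(D, M)))

Z = X @ B          # codes z_n = B^T x_n, stacked as rows (N x M)
X_tilde = Z @ B.T  # projections x_tilde_n = B z_n, back in R^D (N x D)
```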
Maximum Variance Perspective
If we interpret information content in the data as how "space filling" the dataset is, then we can describe the information contained in the data by looking at the spread of the data. We know that the variance is an indicator of the spread of the data, and we can derive PCA as a dimensionality reduction algorithm that maximizes the variance in the low-dimensional representation of the data to retain as much information as possible.
We maximize the variance of the low-dimensional code using a sequential approach. We start by seeking a single vector (b_1 \in \mathbb{R}^D) that maximizes the variance of the projected data, i.e., we aim to maximize the variance of the first coordinate (z_1) of (z \in \mathbb{R}^M) so that [V_1 := \mathbb{V}[z_1] = \frac{1}{N} \sum_{n=1}^N z_{1n}^2] is maximized, where we exploited the i.i.d. assumption of the data and the fact that the data has mean 0, so that (\mathbb{E}[z_1] = b_1^\top \mathbb{E}[x] = 0) and the variance is simply the average of the squared coordinates. Here, (z_{1n}) denotes the first coordinate of the low-dimensional representation (z_n \in \mathbb{R}^M) of (x_n \in \mathbb{R}^D). Note that the first component of (z_n) is given by [z_{1n} = b_1^\top x_n] i.e., it is the coordinate of the orthogonal projection of (x_n) onto the one-dimensional subspace spanned by (b_1). We substitute this into the variance equation, which yields [V_1 = \frac{1}{N} \sum_{n=1}^N (b_1^\top x_n)^2 = \frac{1}{N} \sum_{n=1}^N b_1^\top x_n x_n^\top b_1 = b_1^\top \left( \frac{1}{N} \sum_{n=1}^N x_n x_n^\top \right) b_1 = b_1^\top S b_1] where (S) is the data covariance matrix.
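The identity (V_1 = b_1^\top S b_1) is easy to verify numerically. Reusing the toy data and the illustrative basis from above:

```python
# Check V_1 = b_1^T S b_1 for a candidate unit vector b_1
# (here simply the first column of the illustrative basis above).
b1 = B[:, 0]

z1 = X @ b1                      # first coordinates z_{1n} = b_1^T x_n
V1_empirical = np.mean(z1 ** 2)  # (1/N) * sum_n z_{1n}^2
V1_quadratic = b1 @ S @ b1       # b_1^T S b_1

print(np.isclose(V1_empirical, V1_quadratic))  # True
```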
Notice that arbitrarily increasing the magnitude of the vector (b_1) increases (V_1): scaling (b_1) by a factor (\alpha) scales the variance by (\alpha^2). Therefore, we restrict all solutions to (\|b_1\|^2 = 1), which results in a constrained optimization problem in which we seek the direction along which the data varies most.
With the restriction of the solution space to unit vectors, the vector (b_1) that points in the direction of maximum variance can be found by solving the constrained optimization problem [\max_{b_1} b_1^\top S b_1 \quad \text{subject to} \quad \|b_1\|^2 = 1]
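Before deriving the closed-form solution, it is instructive to hand this problem to a generic constrained optimizer. The sketch below (assuming SciPy is available, and reusing the toy covariance matrix (S) from above) does exactly that:

```python
from scipy.optimize import minimize

def objective(b):
    return -(b @ S @ b)  # negative projected variance (minimizing this maximizes V_1)

constraints = [{"type": "eq", "fun": lambda b: b @ b - 1.0}]  # unit-norm constraint

b0 = rng.normal(size=D)
result = minimize(objective, b0 / np.linalg.norm(b0), constraints=constraints)

b1_numeric = result.x
print(-result.fun)  # maximal variance of the projected data
```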
Following the method of Lagrange multipliers, we obtain the Lagrangian [\mathcal{L}(b_1, \lambda_1) = b_1^\top S b_1 + \lambda_1 (1 - b_1^\top b_1)] Setting the partial derivative with respect to (b_1) to 0 gives the relation [S b_1 = \lambda_1 b_1] and setting the partial derivative with respect to (\lambda_1) to 0 recovers the constraint [b_1^\top b_1 = 1]
By comparing this with the definition of an eigenvalue decomposition, we see that (b_1) is an eigenvector of the data covariance matrix (S), and the Lagrange multiplier (\lambda_1) plays the role of the corresponding eigenvalue. This eigenvector property allows us to rewrite our variance objective as [V_1 = b_1^\top S b_1 = \lambda_1 b_1^\top b_1 = \lambda_1] i.e., the variance of the data projected onto a one-dimensional subspace equals the eigenvalue that is associated with the basis vector (b_1) that spans this subspace. Therefore, to maximize the variance of the low-dimensional code, we choose the basis vector associated with the largest eigenvalue of the data covariance matrix. This eigenvector is called the first principal component.
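Putting it all together, a minimal sketch of the closed-form solution extracts the first principal component from the eigendecomposition of the toy covariance matrix (S) and checks that the projected variance equals the largest eigenvalue:

```python
# The first principal component is the eigenvector of S associated with the
# largest eigenvalue, and that eigenvalue is the variance of the projected data.
eigenvalues, eigenvectors = np.linalg.eigh(S)  # eigh: S is symmetric; eigenvalues ascending

b1_pca = eigenvectors[:, -1]  # first principal component
lambda1 = eigenvalues[-1]     # largest eigenvalue

print(np.isclose(b1_pca @ S @ b1_pca, lambda1))  # projected variance equals lambda_1
print(np.isclose(-result.fun, lambda1))          # should match the variance found numerically above
```

Since eigenvectors are only defined up to sign, the numerical optimizer above may return (-b_1) instead of (b_1); the achieved variance, however, is the same.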