Principal component analysis (PCA)

In this post I will explain the basics of Principal Component Analysis (PCA), when and why to use it, and how to implement a basic example. This post assumes the reader has taken at least one semester of statistics and linear algebra.

What is Principal Component Analysis?

Envision a scenario where you and I are scientists conducting an experiment, with the goal of understanding some phenomenon. In order to run our experiment, we measure some quantities within our system. After collecting the data, we realize the complexity of the data is too great to come to any meaningful conclusion. This is a significant issue affecting modern experimental science every day. There are numerous instances in fields like neuroscience, meteorology, and oceanography where the sheer quantity of variables to be measured can become unmanageable. This is where Principal Component Analysis (PCA) comes in. PCA is a powerful technique that can help reveal information that may be hidden within the complex dynamics of large datasets.



The goal at the end of PCA is to have minimized the number of variables in the dataset while maximizing the amount of information we can extract from it. You can think of PCA as the Sparknotes version of a lengthy novel. For some, finding the time to read a lengthy novel may be a luxury. Instead, wouldn't it be nice to have a summary of the book that covers the most important aspects of the story in just a couple of pages? The novel is our dataset, and the goal of PCA is to find the most important aspects of it.

How does it work?

Unfortunately for us, data does not speak English. So in order to find meaning in our data, we will have to use mathematics. A couple of questions you might be asking yourself: how does PCA even understand what is important in our data? And how do we even quantify the information that lives within our data? The key to PCA is variance.

The greater the variance in our data, the more information we can extract. Why? Let's play a game for a moment. The game is simple: I have picked out three books and painted over the covers of each, so you do not know the titles. I present you with a list that states the title and number of pages of each book. Your job is to try and match the titles to the books, based on the number of pages.

| Title | # of pages |
| --- | --- |
| A Game of Thrones | 694 |
| The Cat in the Hat | 61 |
| The Lord of the Rings | 1178 |

Now that we have our list, here are the books:

[Image: the three mystery books, labeled A, B, and C]

It is easy to see that [A] is The Lord of the Rings, [B] is A Game of Thrones, and [C] is The Cat in the Hat.

Now let's try a different set of books. Here is your list:

| Title | # of pages |
| --- | --- |
| The Epic of Gilgamesh | 128 |
| The Old Man and the Sea | 127 |
| Animal Farm | 130 |

And here are the books.

[Image: the second set of three mystery books]

Can you guess which book is which? It is a lot harder when the books are similar in length. In the first example we had no issues identifying the books, because the book lengths varied so much.

This is what I meant earlier when I said that data with more variance has more information. Through PCA, we can quantify the information that lives within our data by way of its variance.

Let's take the first example one step further. Suppose I also give you the weight of each book.

| Title | # of pages | Weight (oz) |
| --- | --- | --- |
| A Game of Thrones | 694 | 11.4 |
| The Cat in the Hat | 61 | 10.9 |
| The Lord of the Rings | 1178 | 11.3 |

Now that we have added weight, does your guessing strategy change? Because the variance of the weights is so small, we would still rely mostly on the number of pages to make our guess. We intrinsically know that we need to place more emphasis on the number of pages than on the weight when making our guess, so in theory we could eliminate the weight column and hardly lose any useful information. This is the essence of PCA: minimizing the complexity of the data while maximizing the amount of information. Ok, but how do we quantify it?
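To make the intuition concrete, here is a tiny sketch in Python with NumPy (my own illustration, not part of the book example itself) that compares the spread of the two columns in the table above:

```python
import numpy as np

# Page counts and weights (oz) from the table above
pages = np.array([694.0, 61.0, 1178.0])
weights = np.array([11.4, 10.9, 11.3])

# Sample variance (ddof=1) of each feature
print(np.var(pages, ddof=1))    # ~313782 -- huge spread, pages carry almost all the information
print(np.var(weights, ddof=1))  # ~0.07   -- tiny spread, weight adds very little
```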

The Algorithm

Let's dig into the math. I will write out the algorithm first (a short code sketch follows the list), and then we will go through it piece by piece with an example.

  1. Consider the data set $X = [\vec{x}_1, \vec{x}_2, \vec{x}_3, ..., \vec{x}_k]$, where $k$ represents the total sample size and $\vec{x}_i$ is the $i^{th}$ sample. Each $\vec{x}_i$ is an $m$-dimensional vector, so $X$ is an $m \times k$ matrix.
  2. For each feature (each row of $X$), define the sample mean:

    $$\overline{x}_i = \frac{1}{k}\sum_{j=1}^{k} x_{ij} \, , \quad i = 1, ..., m$$

  3. Collect these feature means into an $m \times 1$ vector $\Omega$ where:

    $$\Omega_p = \overline{x}_p \, , \quad p = 1, ..., m$$

  4. Subtract from each value in each sample the corresponding feature mean and record the new vectors $\vec{d}_r$ in a new $m \times k$ matrix $\Delta$ such that:

    $$\Delta = [\vec{d}_1, ..., \vec{d}_k] \quad \text{where} \quad \vec{d}_r = \begin{bmatrix} x_{1r} - \Omega_1 \\ \vdots \\ x_{mr} - \Omega_m \end{bmatrix}$$

  5. Compute the covariance matrix $C$ such that:

    $$C = \frac{1}{k-1}\Delta \Delta^{T}$$

  6. Compute the eigenvectors $\vec{v}_i$ and corresponding eigenvalues $\lambda_i$ of the covariance matrix $C$.
  7. Sort the eigenvalues from largest to smallest and choose the $\beta$ eigenvectors corresponding to the $\beta$ largest eigenvalues, arranged in the same order.
  8. Compile these sorted eigenvectors into an $m \times \beta$ matrix $D = [\vec{v}_1, \vec{v}_2, \vec{v}_3, ..., \vec{v}_\beta]$. This new matrix $D$ represents our projection space.
  9. Transform the samples onto the new subspace $Y$ such that:

    $$Y = D^{T}\Delta$$
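Before working through the example, here is a minimal NumPy sketch of steps 1 through 9. The function name `pca`, the NumPy dependency, and the variable names are my own choices rather than something the algorithm prescribes, so treat this as one possible translation, not the implementation:

```python
import numpy as np

def pca(X, beta):
    """X is an m x k matrix whose columns are samples; beta is the
    number of principal components to keep."""
    m, k = X.shape

    # Steps 2-3: mean of each feature (row), collected in an m x 1 vector Omega
    Omega = X.mean(axis=1, keepdims=True)

    # Step 4: subtract the feature means from every sample
    Delta = X - Omega

    # Step 5: covariance matrix C = (1 / (k - 1)) * Delta * Delta^T  (m x m)
    C = (Delta @ Delta.T) / (k - 1)

    # Step 6: eigenvalues and eigenvectors of the symmetric matrix C
    eigvals, eigvecs = np.linalg.eigh(C)

    # Step 7: sort eigenvalues (and their eigenvectors) from largest to smallest
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Step 8: keep the first beta eigenvectors as the projection matrix D (m x beta)
    D = eigvecs[:, :beta]

    # Step 9: project the centered samples onto the new subspace (beta x k)
    Y = D.T @ Delta
    return Y, D, eigvals
```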

An Example

Let's start with an easy example: a two-dimensional data set. This means each data point will have two features (like # of pages and book weight from our previous example). Suppose we are given eight bottles of wine, each one unique, along with a dataset describing them. The two features are the color of the wine and the sugar content. In order to quantify color, consider a spectrum of color like the one below, where a red wine such as a Cabernet Sauvignon would have a color value around 0.5 and a Chardonnay would have a color value around 3.2.

[Image: the wine color spectrum]

Consider the data set $X$:

| Feature | Bottle 1 | Bottle 2 | Bottle 3 | Bottle 4 | Bottle 5 | Bottle 6 | Bottle 7 | Bottle 8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Color | 0.9 | 3.5 | 3.1 | 1.2 | 0.5 | 2.9 | 1.1 | 3.2 |
| Sugar (mg) | 2.12 | 2.02 | 2.44 | 2.34 | 2.11 | 2.33 | 2.12 | 2.75 |
Let's begin by finding the sample mean of each feature. We can use step (2) in our algorithm to get:

$$\Omega = \begin{bmatrix} 2.05 \\ 2.28 \end{bmatrix}$$


So from this we can see that the average color of the eight bottles of wine is 2.05 (which is a light pink on the color spectrum) and the average sugar content is 2.28 mg. Now subtract the corresponding feature mean from each sample to form our matrix:

$$\Delta = \begin{bmatrix} -1.15 & 1.45 & 1.05 & -0.85 & -1.55 & 0.85 & -0.95 & 1.15 \\ -0.16 & -0.26 & 0.16 & 0.06 & -0.17 & 0.05 & -0.16 & 0.47 \end{bmatrix}$$
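If you want to check these numbers yourself, here is a small NumPy sketch (my own scaffolding; the variable names `X`, `Omega`, and `Delta` simply mirror the symbols above) that reproduces $\Omega$ and $\Delta$ for the wine data:

```python
import numpy as np

# Rows are features (color, sugar in mg); columns are the eight bottles
X = np.array([[0.9, 3.5, 3.1, 1.2, 0.5, 2.9, 1.1, 3.2],
              [2.12, 2.02, 2.44, 2.34, 2.11, 2.33, 2.12, 2.75]])

Omega = X.mean(axis=1, keepdims=True)  # feature means: approx. [[2.05], [2.28]]
Delta = X - Omega                      # mean-centered data, matching the matrix above
print(Omega)
print(np.round(Delta, 2))
```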

Next we will compute the covariance matrix using step (5) to get:

$$C = \begin{bmatrix} 1.514 & 0.132 \\ 0.132 & 0.057 \end{bmatrix}$$

Let's take a minute to understand what our covariance matrix is telling us. The covariance matrix has the following form:

$$\begin{bmatrix} \text{Variance of First Feature} & \text{Covariance of Both Features} \\ \text{Covariance of Both Features} & \text{Variance of Second Feature} \end{bmatrix}$$

Because $1.514 > 0.057$, there is more variance in the first feature (color) than in the second feature (sugar content). We can also look at the covariance between the two features, $0.132$, and note that it is positive, though it is so small that it would be inaccurate to conclude there really exists a strong relationship between color and sugar content.
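Continuing the same hypothetical sketch (reusing the `Delta` and `X` variables defined above), the covariance matrix from step (5) is a one-liner, and NumPy's built-in `np.cov` gives the same answer:

```python
k = Delta.shape[1]               # number of samples (8 bottles)
C = (Delta @ Delta.T) / (k - 1)  # step (5): 2 x 2 covariance matrix
print(np.round(C, 3))            # approx. [[1.514, 0.132], [0.132, 0.057]]

# Equivalent built-in: np.cov treats each row as a variable by default
print(np.round(np.cov(X), 3))
```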

Now that we have our covariance matrix, we can find its eigenvalues and corresponding eigenvectors:

$$\lambda_1 = 1.526 \, , \quad \lambda_2 = 0.0454 \, , \quad \vec{v}_1 = \begin{bmatrix} 0.996 \\ 0.0894 \end{bmatrix} \, , \quad \vec{v}_2 = \begin{bmatrix} -0.0894 \\ 0.996 \end{bmatrix}$$

Now we can extract some underlying information from our dataset. Note that $\lambda_1 = 1.526$ and $\lambda_2 = 0.0454$, so the first eigenvalue accounts for $\frac{1.526}{1.526 + 0.0454} \approx 97.11\%$ of the sum of the eigenvalues. In other words, about 97.11% of the variance in our dataset lies along the direction of the first eigenvector, which is dominated by the first feature (color), and only about 2.89% lies along the second. Thus the first eigenvector points in the direction of greatest variance and is the first principal component of the PCA space.

These results make sense if we look back at our dataset $X$. Note that every wine has a similar sugar content, regardless of its type. This means that the sugar content of a wine does not tell us very much about its type. On the other hand, the color of a wine does reveal a lot about the possible type. Suppose we are given a ninth bottle of wine and again told only its color and sugar content. From our PCA, we know that to make an optimal guess, we should place about 97.11% of our emphasis on the color of the wine. This process is easily repeatable for datasets with more than two features, and you can extract the same kind of information.
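Extending the sketch once more, `np.linalg.eigh` recovers the eigenvalues and eigenvectors of the symmetric matrix `C`, and the explained-variance ratio falls out directly (eigenvector signs may be flipped, which does not change the analysis):

```python
eigvals, eigvecs = np.linalg.eigh(C)  # eigh returns ascending order for symmetric C
order = np.argsort(eigvals)[::-1]     # re-sort from largest to smallest
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(np.round(eigvals, 4))           # approx. [1.526, 0.0454]
print(np.round(eigvecs, 3))           # columns approx. [0.996, 0.089] and [-0.089, 0.996], up to sign
print(eigvals / eigvals.sum())        # approx. [0.9711, 0.0289]
```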


Now to form our projection space. Because the second eigenvector captures so little of the variance, we will drop it when constructing our projection space (steps 7 & 8 in the algorithm) to get:

$$D = \begin{bmatrix} \vec{v}_1 \end{bmatrix} = \begin{bmatrix} 0.996 \\ 0.0894 \end{bmatrix}$$

The last step is to transform our samples onto the new subspace to get:

$$Y = D^{T}\Delta = \begin{bmatrix} -1.160 & 1.421 & 1.060 & -0.841 & -1.559 & 0.851 & -0.961 & 1.187 \end{bmatrix}$$

Note that $Y$ is a one-dimensional dataset; the dimensionality of our data has thus been reduced from two to one, while retaining about 97% of the variance.
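Finishing the sketch, the projection in step (9) is a single matrix product, and the result matches the one-dimensional $Y$ above up to rounding and a possible sign flip:

```python
D = eigvecs[:, :1]     # keep only the first principal component (2 x 1)
Y = D.T @ Delta        # project the centered samples: Y is 1 x 8
print(np.round(Y, 3))  # approx. [[-1.160, 1.421, 1.060, -0.841, -1.559, 0.851, -0.961, 1.187]]
print(Y.shape)         # (1, 8): two dimensions reduced to one
```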