Principal component analysis (PCA)
In this post I will explain the basics of Principal Component Analysis (PCA), when and why to use it, and how to implement a basic example. This post assumes the reader has taken at least one semester of statistics and linear algebra.
What is Principal Component Analysis?
Envision a scenario where you and I are scientists conducting an experiment, with the goal of understanding some phenomenon. In order to run our experiment, we measure some quantities within our system. After collecting the data, we realize it is too complex to draw any meaningful conclusion. This is a common problem in modern experimental science. In fields that study intricate systems, such as neuroscience, meteorology, and oceanography, the sheer number of variables to be measured can become unmanageable. This is where Principal Component Analysis (PCA) comes in. PCA is a powerful technique that can help reveal information hidden within the complex dynamics of large datasets.
The goal of PCA is to minimize the number of variables in the dataset while maximizing the amount of information we can extract from it. You can think of PCA as the SparkNotes version of a lengthy novel. For some, finding the time to read a lengthy novel may be a luxury. Instead, wouldn't it be nice to have a summary that covers the most important aspects of the story in just a couple of pages? The novel is our dataset, and the goal of PCA is to find the most important aspects of it.
How does it work?
Unfortunately for us, data does not speak English, so in order to find meaning in our data we will have to use mathematics. A couple of questions you might be asking yourself: how does PCA even understand what is important in our data? And how do we quantify the information that lives within our data? The key to PCA is variance.
The greater the variance in our data, the more information we can extract. Why? Let's play a game for a moment. The game is simple: I have picked out three books and painted over the covers of each, so you do not know the titles. I present you with a list that states the title and number of pages of each book. Your job is to match the titles to the books, based only on the number of pages.
Title | # of pages |
---|---|
A Game of Thrones | 694 |
The Cat in the Hat | 61 |
The Lord of the Rings | 1178 |
Now that we have our list, imagine the three books laid out in front of you, labeled [A], [B], and [C] from thickest to thinnest. It is easy to see that [A] is The Lord of the Rings, [B] is A Game of Thrones, and [C] is The Cat in the Hat.
Now let's try a different set of books. Here is your list:
Title | # of pages |
---|---|
The Epic of Gilgamesh | 128 |
The Old Man and the Sea | 127 |
Animal Farm | 130 |
Again, imagine the three books laid out in front of you. Can you guess which book is which? It is a lot harder when the books are similar in length. In the first example we had no trouble identifying the books, because the book lengths varied a lot.
This is what I meant earlier when I said that data with more variance has more information. Variance is how PCA quantifies the information that lives within our data.
Let's take the first example one step further. Suppose I also give you the weight of each book.
Title | # of pages | Weight (oz) |
---|---|---|
A Game of Thrones | 694 | 11.4 |
The Cat in the Hat | 61 | 10.9 |
The Lord of the Rings | 1178 | 11.3 |
Now that we have added weight, does your guessing strategy change? Because the variance of the weights is so small, we would still rely mostly on the number of pages to make our guess. We intrinsically know that we need to place more emphasis on the number of pages than on the weight, so in theory we could eliminate the weight column and hardly lose any useful information. This is the essence of PCA: minimizing the complexity of the data while maximizing the amount of information. Before formalizing that, let's check the intuition with a quick computation.
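Here is a minimal NumPy sketch that computes the sample variance of each column of the table above (the numbers are the page counts and weights from our three books):

```python
import numpy as np

# Page counts and weights (oz) for the three books in the table above
pages = np.array([694.0, 61.0, 1178.0])
weights = np.array([11.4, 10.9, 11.3])

# Sample variance of each feature (dividing by n - 1)
print(np.var(pages, ddof=1))    # ~313,772 -> page counts vary enormously
print(np.var(weights, ddof=1))  # ~0.07    -> weights are nearly identical
```

The page counts carry several orders of magnitude more variance than the weights, which is exactly why they dominate our guesses. Ok, but how do we quantify this in general?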
The Algorithm
Let's dig into the math. I will write out the algorithm first, followed by a short code sketch, and then we will go through it piece by piece with an example.
- Consider the data set $X = \{x_1, x_2, \ldots, x_k\}$, where $k$ represents the total sample size and $x_i$ is the $i$-th sample. (Note that each $x_i$ is an $m$-dimensional column vector, with one entry per measured feature, so $X$ is an $m \times k$ matrix.)
- For each feature $j$ (each row of $X$), define the sample mean: $\mu_j = \frac{1}{k} \sum_{i=1}^{k} X_{j,i}$
- Record the means of the features into a vector $\mu$ where: $\mu = (\mu_1, \mu_2, \ldots, \mu_m)^{T}$
- Subtract from each value in each sample the corresponding mean and record the new vectors into a new matrix $B$ such that: $B_{j,i} = X_{j,i} - \mu_j$
- Compute the covariance matrix $C$ such that: $C = \frac{1}{k-1} B B^{T}$
- Compute the eigenvectors $v_1, v_2, \ldots, v_m$ and corresponding eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_m$ of the covariance matrix
- Sort the eigenvalues from largest to smallest, choose the top $\ell$ eigenvectors, and arrange them in the same order as their corresponding eigenvalues
- Compile these sorted eigenvectors into a matrix $W$ (an $m \times \ell$ matrix). This new matrix represents our projection space
- Transform the samples onto the new subspace such that: $Y = W^{T} B$
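Here is how the whole recipe looks in code. This is a minimal NumPy sketch under the conventions above (one sample per column); the function name `pca` and its return values are my own choices, not a fixed API:

```python
import numpy as np

def pca(X, n_components):
    """Project the m x k data matrix X (one sample per column)
    onto its top n_components principal components."""
    k = X.shape[1]
    # Mean of each feature (row), collected into the vector mu
    mu = X.mean(axis=1, keepdims=True)
    # Center the data: B[j, i] = X[j, i] - mu[j]
    B = X - mu
    # Covariance matrix C = (1 / (k - 1)) * B B^T
    C = (B @ B.T) / (k - 1)
    # Eigendecomposition; eigh applies because C is symmetric
    eigenvalues, eigenvectors = np.linalg.eigh(C)
    # eigh returns eigenvalues in ascending order, so reverse to descending
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    # Keep the top eigenvectors as the projection matrix W
    W = eigenvectors[:, :n_components]
    # Project the centered samples onto the new subspace
    Y = W.T @ B
    return Y, W, eigenvalues
```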
An Example
Suppose we have eight bottles of wine, each a different type, and for each bottle we are told only its color (measured on a numeric scale) and its sugar content. Our dataset $X$ is the following $2 \times 8$ matrix:

Feature | Bottle 1 | Bottle 2 | Bottle 3 | Bottle 4 | Bottle 5 | Bottle 6 | Bottle 7 | Bottle 8 |
---|---|---|---|---|---|---|---|---|
Color | 0.9 | 3.5 | 3.1 | 1.2 | 0.5 | 2.9 | 1.1 | 3.2 |
Sugar (mg) | 2.12 | 2.02 | 2.44 | 2.34 | 2.11 | 2.33 | 2.12 | 2.75 |
From this we can see that the average color of the eight bottles of wine is $\mu_1 = 2.05$ (which is a light pink on the color spectrum) and the average sugar content is $\mu_2 \approx 2.28$ mg. Now subtract the corresponding mean from each value in each sample to form our matrix $B$:

$$B \approx \begin{bmatrix} -1.15 & 1.45 & 1.05 & -0.85 & -1.55 & 0.85 & -0.95 & 1.15 \\ -0.16 & -0.26 & 0.16 & 0.06 & -0.17 & 0.05 & -0.16 & 0.47 \end{bmatrix}$$
Computing the covariance matrix gives:

$$C = \frac{1}{7} B B^{T} \approx \begin{bmatrix} 1.51 & 0.13 \\ 0.13 & 0.06 \end{bmatrix}$$

The first diagonal entry is much larger than the second because there is more variance in the first feature (color) than in the second feature (sugar content). We can also look at the off-diagonal entry, the covariance between the two features: it is positive, but so small relative to the variances that it would be inaccurate to conclude there really exists a correlation between color and sugar content.
These results make sense if we look back at our dataset $X$. Note that every wine has a similar sugar content, regardless of its type, so the sugar content of a wine does not tell us very much about it. On the other hand, the color of a wine does reveal a lot about its possible type. The eigenvalues of $C$ come out to roughly $\lambda_1 \approx 1.53$ and $\lambda_2 \approx 0.05$, so the first principal component accounts for about $97\%$ of the total variance. Suppose we are given a ninth bottle of wine and again told only its color and sugar content. From our PCA, we know that to make an optimal guess we should place about $97\%$ of our emphasis on the color of the wine. This process is easily repeatable for datasets with more than two features, and you can extract the same kind of information.
Projecting the centered data onto the top eigenvector gives our transformed data $Y = W^{T} B$. Note that $Y$ is a one-dimensional dataset; thus the dimension of our dataset has been reduced from two to one.
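If you want to check these numbers yourself, here is a short NumPy sketch that reproduces the example end to end (the variable names are my own choices):

```python
import numpy as np

# The wine data: row 0 is color, row 1 is sugar content (mg)
X = np.array([
    [0.9, 3.5, 3.1, 1.2, 0.5, 2.9, 1.1, 3.2],
    [2.12, 2.02, 2.44, 2.34, 2.11, 2.33, 2.12, 2.75],
])

mu = X.mean(axis=1, keepdims=True)    # ~[2.05, 2.28]
B = X - mu                            # centered data
C = (B @ B.T) / (X.shape[1] - 1)      # ~[[1.51, 0.13], [0.13, 0.06]]

eigenvalues, eigenvectors = np.linalg.eigh(C)
print(eigenvalues.max() / eigenvalues.sum())  # ~0.97

# Project onto the top eigenvector: Y is our one-dimensional dataset
w = eigenvectors[:, np.argmax(eigenvalues)]
Y = w @ B
print(Y)
```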