# Understanding Principal Component Analysis


No one can master every algorithm. However, there is a basic toolbox that every Data Scientist understands and uses. One of the algorithms in this toolbox is Principal Component Analysis, or P.C.A for short. It is an unsupervised learning technique used to handle high dimensional data in many different fields, genomics and quantitative finance to name two.

P.C.A identifies patterns in data by noting correlations between features. It then eliminates structural redundancies through a process known as dimensionality reduction. The algorithm builds a lower dimensional subspace onto which the data is projected. P.C.A constructs the axes of this subspace by calculating the orthogonal directions of greatest variance in the data. The chosen size of the subspace is a balance between the total amount of variance explained and the number of principal components kept.

These principal components are a set of linear combinations of the original features. Given features \( X_1, X_2, \dots, X_p \), the first principal component is:

\( PC_1 = \beta_{11} X_1 + \beta_{12} X_2 + \dots + \beta_{1p} X_p \)

where the vector \( \beta_1 = (\beta_{11}, \beta_{12}, \dots, \beta_{1p}) \) is called the direction of \( PC_1 \). The goal is to find the \( \beta_1 \) that maximizes the variance of \( PC_1 \) under the constraint \( \sum_{i=1}^{p} \beta_{1i}^2 = 1 \). Without this constraint, the variance could be made arbitrarily large simply by scaling up the coefficients. The second principal component, \( PC_2 \), is represented similarly:

\( PC_2 = \beta_{21} X_1 + \beta_{22} X_2 + \dots + \beta_{2p} X_p \)

The constraint \( \sum_{i=1}^{p} \beta_{2i}^2 = 1 \) applies again when maximizing the variance of this component, but there is also a second constraint: the directions \( \beta_1 \) and \( \beta_2 \) must be orthogonal to each other, which makes \( PC_1 \) and \( PC_2 \) uncorrelated. Subsequent principal components are likewise orthogonal to all earlier ones. Each new component explains no more variance than those which precede it.
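The link between this constrained maximization and the eigenvectors used later can be sketched with a Lagrange multiplier. The symbols \( \Sigma \) (the covariance matrix of the features) and \( \lambda \) (the multiplier) are notation assumed here for illustration; they do not appear elsewhere in the article.

```latex
% Assumed notation: \Sigma is the covariance matrix of X_1, ..., X_p,
% and \lambda is a Lagrange multiplier for the unit-length constraint.
\begin{align*}
\operatorname{Var}(PC_1) &= \beta_1^{\top} \Sigma \, \beta_1 \\
\mathcal{L}(\beta_1, \lambda) &= \beta_1^{\top} \Sigma \, \beta_1
    - \lambda \left( \beta_1^{\top} \beta_1 - 1 \right) \\
\frac{\partial \mathcal{L}}{\partial \beta_1} = 0
    \;&\Longrightarrow\; \Sigma \, \beta_1 = \lambda \, \beta_1 \\
\operatorname{Var}(PC_1) &= \beta_1^{\top} (\lambda \, \beta_1) = \lambda
\end{align*}
```

Under these assumptions, any maximizing direction is an eigenvector of \( \Sigma \), and the variance it captures equals its eigenvalue, so the first principal component corresponds to the largest eigenvalue.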

Constructing these components starts with normalizing the data so that every feature has zero mean and unit variance. Without this step, features on larger scales would dominate the process. The eigenvectors and corresponding eigenvalues of the data’s covariance matrix are then calculated. (Covariance is a measure of how two variables change together.) The largest eigenvalue is the variance captured by the first principal component, and its partnering eigenvector is the direction of that component. The algorithm constructs the projection matrix from the eigenvectors with the largest eigenvalues, taken in descending order. (The number of eigenvectors kept, and hence the size of this matrix, is user-supplied.) Finally, P.C.A uses the projection matrix to project the data into the new subspace.
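To illustrate why the normalization step matters, here is a small sketch on synthetic data. The two features, their scales, the seed, and the helper `first_direction` are all made up for this example and are not part of the original article.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
n = 500

# Two correlated synthetic features: one in small units, one in large units.
small = rng.normal(size=n)
large = 1000.0 * (0.8 * small + 0.6 * rng.normal(size=n))
data = np.column_stack([small, large])


def first_direction(a):
    """Return the eigenvector of the covariance matrix with the largest eigenvalue."""
    cov = np.cov(a, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order
    return eigvecs[:, -1]


# Without normalization the large-scale feature dominates the first direction.
raw_direction = first_direction(data)

# After standardizing, both features receive weights of equal magnitude.
standardized = (data - data.mean(axis=0)) / data.std(axis=0)
std_direction = first_direction(standardized)

print("raw data:         ", np.abs(raw_direction))
print("standardized data:", np.abs(std_direction))
```

On the raw data the first direction points almost entirely along the large-scale feature, even though the two features are strongly related; after standardization the weights balance out.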

Although this process reduces complexity and structural redundancy and is useful for visualizing high dimensional data, it can be computationally intensive on large datasets, and the resulting components are hard to interpret because each one mixes all of the original features.

To cement what has come before, let’s consider a small 3×3 data matrix X, whose three columns are the features:

\( X = \begin{bmatrix}
-1 & -1 & -1 \\
-2 & -1 & -1.5 \\
-3 & -2 & -2
\end{bmatrix} \)

P.C.A can transform this data into a two dimensional subspace of orthogonal axes and maximized variance. Again, here are the steps:

1. Normalize the data

1. Calculate the data’s covariance matrix and find its eigenvalues and eigenvectors.

1. Select the k eigenvectors that correspond to the k largest eigenvalues, where k is the dimensionality of the new feature subspace.

1. Construct a projection matrix from the k eigenvectors collected, and use it to transform the original data into its representation in the new k-dimensional space.

Here is a manual implementation of P.C.A in Python:
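The following is a minimal NumPy sketch of the four steps applied to the matrix X above. It assumes that rows are samples and columns are features, and that normalization means z-scoring each column; the variable names are chosen for illustration.

```python
import numpy as np

# The example matrix: rows are treated as samples, columns as features (an assumption).
X = np.array([[-1.0, -1.0, -1.0],
              [-2.0, -1.0, -1.5],
              [-3.0, -2.0, -2.0]])
k = 2  # dimensionality of the new feature subspace

# 1. Normalize: zero mean and unit variance per feature.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix and its eigen-decomposition.
#    eigh is meant for symmetric matrices and returns ascending eigenvalues.
cov = np.cov(X_std, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# 3. Sort the eigenpairs by decreasing eigenvalue and keep the top k eigenvectors.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
W = eigvecs[:, :k]  # projection matrix, shape (3, k)

# 4. Project the normalized data into the k-dimensional subspace.
X_pca = X_std @ W

print(X_pca)
```

The sign of each column of `X_pca` is arbitrary, because an eigenvector multiplied by -1 is still an eigenvector.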

Python’s popular Machine Learning library scikit-learn also contains Principal Component Analysis. Leveraging its A.P.I simplifies the process.
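A sketch of the same transformation with scikit-learn might look like this. Because `PCA` centers the data but does not rescale it, the example standardizes with `StandardScaler` first; whether the original workflow did the same is an assumption.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.array([[-1.0, -1.0, -1.0],
              [-2.0, -1.0, -1.5],
              [-3.0, -2.0, -2.0]])

# Standardize each feature, then project onto the first two principal components.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_sklearn = pca.fit_transform(X_std)

print(X_sklearn)
print(pca.explained_variance_ratio_)  # share of variance captured by each component
```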

Checking that the two procedures give the same result is straightforward, bearing in mind that each principal component is only defined up to its sign.
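One way such a check could be written is shown below. It repeats both procedures so that it runs on its own, and compares absolute values because the two routes may pick opposite signs for a component. The variable names are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.array([[-1.0, -1.0, -1.0],
              [-2.0, -1.0, -1.5],
              [-3.0, -2.0, -2.0]])

# Manual route: standardize, eigen-decompose the covariance matrix, project.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(X_std, rowvar=False))
order = np.argsort(eigvals)[::-1]
X_manual = X_std @ eigvecs[:, order[:2]]

# scikit-learn route on the same standardized data.
X_sklearn = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

# Compare magnitudes, since the sign of each component is arbitrary.
same = np.allclose(np.abs(X_manual), np.abs(X_sklearn))
print("Projections agree up to sign:", same)
```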

It’s time to use P.C.A on real data. The UCI Machine Learning Repository provides many free datasets. One of these relates to wines grown in a region of Italy. The goal is to use chemical data to predict which of three different types of wine each sample belongs to.

The data has 178 samples and 13 dimensions. If the most common type of wine were always predicted, the accuracy would be about 40%. Let’s use Logistic Regression as the classification algorithm of choice in this example. A naive model, fitted on the raw features, produces above 98% accuracy on the training data.
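A baseline along these lines can be sketched as follows. It assumes that scikit-learn's bundled `load_wine` dataset is the same UCI wine data, and the split size, `random_state`, and `max_iter` are arbitrary choices rather than the article's settings, so the accuracy it prints need not match the figure quoted above.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's copy of the wine data stands in for the UCI download.
X, y = load_wine(return_X_y=True)
print("shape:", X.shape)

# Accuracy of always predicting the most common class.
majority_share = np.bincount(y).max() / len(y)
print("majority-class accuracy:", round(majority_share, 3))

# Hypothetical split; the article's actual split is not known.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Naive model on the raw, unscaled features; a generous max_iter helps the solver converge.
baseline = LogisticRegression(max_iter=10000)
baseline.fit(X_train, y_train)
train_acc = baseline.score(X_train, y_train)
print("training accuracy:", round(train_acc, 4))
```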

Projecting the data into two dimensions with P.C.A explains 56% of its variance. A Logistic Regression model on this pre-processed training data ‘only’ has 96.77% accuracy.
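The two dimensional workflow could be sketched as a scikit-learn pipeline. As before, `load_wine`, the split, and the `random_state` are assumptions made for illustration, so the printed numbers may differ from those in the text.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y  # hypothetical split
)

# Standardize, project onto two principal components, then classify.
pipe = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression())
pipe.fit(X_train, y_train)

# make_pipeline names each step after its lowercased class name.
explained = pipe.named_steps["pca"].explained_variance_ratio_.sum()
train_acc_2d = pipe.score(X_train, y_train)

print("variance explained by 2 components:", round(explained, 3))
print("training accuracy:", round(train_acc_2d, 4))
```

Fitting the scaler and P.C.A inside the pipeline means both are learned from the training data only, which keeps information from the test set out of the pre-processing.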

What then is the purpose of going through the trouble of pre-processing the data? The issue here is that the use of P.C.A lacked nuance. Two dimensions is good for visualizing high dimensional data. However, it is not always the optimal choice when it comes to building an accurate model. One should choose the size of the new feature space with care.

One method is to analyze a graph of the cumulative variance captured by the components.
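The numbers behind such a graph can be computed directly. This sketch fits P.C.A on the whole standardized dataset from `load_wine`, which is an assumption; the article may have used only its training split, so the component count found here may not match the one quoted in the text.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# Keep every component so the full cumulative curve is available.
pca = PCA().fit(X_std)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches the target.
target = 0.80
k = int(np.searchsorted(cumulative, target) + 1)

for i, share in enumerate(cumulative, start=1):
    print(f"{i:2d} components: {share:.3f}")
print("components needed for", target, "->", k)
```

Plotting `cumulative` against the component index gives the graph described above. Passing a fraction such as `PCA(n_components=0.80)` asks scikit-learn to make the same selection itself.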

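Here one might choose to explain 80% of the data’s variance. That corresponds to four principal components. There are other methods available. One is the Kaiser-Harris Criterion. This metric says that only eigenvectors corresponding to eigenvalues greater than one are worth keeping. With standardized features, an eigenvalue of one is the variance of a single original feature, so a component below that threshold explains less than one feature does on its own.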
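A sketch of the Kaiser-Harris Criterion on the wine data follows. It works from the correlation matrix of the full `load_wine` dataset, which is an assumption: the count depends on which samples are used, so it may differ from a count obtained on a training split.

```python
import numpy as np
from sklearn.datasets import load_wine

X, _ = load_wine(return_X_y=True)

# The correlation matrix is the covariance matrix of the standardized features,
# so its eigenvalues sum to the number of features.
corr = np.corrcoef(X, rowvar=False)
eigvals = np.linalg.eigvalsh(corr)[::-1]  # sorted in descending order

# Kaiser-Harris: keep the components whose eigenvalue exceeds one.
n_keep = int((eigvals > 1.0).sum())

print("eigenvalues:", np.round(eigvals, 3))
print("components kept:", n_keep)
```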

This criterion chooses four principal components. Now we apply P.C.A again.

The accuracy is above 99%, beating that of the naive Logistic Regression model.

Finally, the real worth of any model comes from its performance on unseen data. The difference between these two workflows on the testing data is stark.

Logistic Regression on the raw testing data has 94% accuracy. That’s a significant fall from its 98% accuracy on the training set. The other model’s accuracy is over 99% on both the training and testing sets.
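The comparison between the two workflows could be reproduced along these lines. The four components follow the choice made in the text; `load_wine`, the split, and the `random_state` are assumptions, so the accuracies printed may not equal those reported above.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y  # hypothetical split
)

# Workflow 1: Logistic Regression on the raw features.
raw_model = LogisticRegression(max_iter=10000).fit(X_train, y_train)

# Workflow 2: standardize, keep four principal components, then classify.
pca_model = make_pipeline(
    StandardScaler(), PCA(n_components=4), LogisticRegression()
).fit(X_train, y_train)

# Collect (training accuracy, testing accuracy) for each workflow.
results = {
    "raw": (raw_model.score(X_train, y_train), raw_model.score(X_test, y_test)),
    "pca": (pca_model.score(X_train, y_train), pca_model.score(X_test, y_test)),
}

for name, (train_acc, test_acc) in results.items():
    print(f"{name}: train={train_acc:.4f} test={test_acc:.4f}")
```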

This is the power of Principal Component Analysis.