Bigger isn’t always better. High-dimensional data can quickly become unmanageable, depending on your resource constraints and what you plan to use it for. Data with 150 columns isn’t going to get you very far if you don’t have the computational resources to analyze it, or if you don’t even know what you’re looking for. As a remedy, we use a technique called dimensionality reduction to shrink our data down to a manageable size while still capturing its most salient information.
This gives us two benefits. First, the data is smaller, which is incredibly important when dealing with costly algorithmic approaches. Second, and arguably more valuable, we can collapse our columns down into as few features as we want, using heuristics to measure how much information we lose each time we compress the feature space. One hundred columns suddenly become two, which is incredibly valuable when we want to train a model but aren’t sure which features to feed it.
This can be baffling, and that’s completely understandable. It’s truly brain-bending to say that we can capture most of our information with a fraction of the data – but we do have principled ways of doing this, the most popular of which is principal component analysis (PCA). Let’s see if we can figure out why that is.
The Mallet Example
When I was in undergrad, I was speaking with a professor I was collaborating with on research and began asking about principal component analysis. I’ve never seen a more concise or clear explanation of PCA than what Dr. Kristin Bennett at Rensselaer Polytechnic Institute showed me as a junior in college, so all due credit to her for this.
When asked, she picked up a croquet mallet sitting in the corner of the office, opened the blinds and killed the overhead lights. The question was how to rotate the mallet so that the shadow it cast on the wall looked the most like the real thing.
We fiddled with the angle and position until we got a profile view of the mallet: the full length of the handle, both sides of the head. It was as if we had laid the mallet on the ground and were looking at it straight down.
This is the intuition behind PCA. All we’re doing is rotating and transforming the data so that we find the picture of it that is closest to the real thing. However, we’re not limited to going from three dimensions to two; we can go from 100 to 99 to 98, all the way down to three or two (reducing to a single dimension is rarely recommended). Conceptually, we drop the least informative direction, then repeat until we have the number of features we want.
The Mathier Part
But you can’t physically pick up your data and spin it around. Even if you could, how are you supposed to do that in 100 dimensions?
We need mathematical formalizations of the idea expressed by the mallet example. Namely, we’ll rely on the variance in the data to guide our intuition on how to transform it such that we maximize our available information when in a lower dimension.
The basic idea is this: we want to rotate the data such that we can find an orthogonal basis for it, then use only the orthogonal vectors that have the most variance.
It’s easiest to conceptualize when working in three dimensions. Why not use our mallet example? Say we have a shape that looks something like this.
We want to find an orthogonal basis that best describes the shape in terms of its variance. Given that objective, we might find a constrained solution that looks like this.
How is this new orthogonal basis any different from the one that already existed? All we did was translate the origin and rotate the axes a bit, right?
Well, exactly. That little transformation gives us a new orthogonal basis that actually better describes the variance of our data! Look at what happens when we use it as our new basis.
See? This new coordinate system captures our data’s variance far better than the original orthogonal basis.
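One standard way to find such a basis numerically is to center the data and take the eigenvectors of its covariance matrix. The sketch below assumes NumPy and a made-up, mallet-like cloud of 3D points; the names `principal_axes`, `cloud` and `points` are illustrative and not taken from any library.

```python
import numpy as np

def principal_axes(points):
    """Return (variances, axes): the variance along each principal axis,
    largest first, and the matching orthonormal axes as columns."""
    centered = points - points.mean(axis=0)        # translate the origin to the data's mean
    covariance = np.cov(centered, rowvar=False)    # 3x3 matrix for 3D data
    variances, axes = np.linalg.eigh(covariance)   # eigh handles symmetric matrices, ascending order
    order = np.argsort(variances)[::-1]            # reorder so the largest variance comes first
    return variances[order], axes[:, order]

rng = np.random.default_rng(42)
# Hypothetical elongated cloud: long in one direction, thin in the other two,
# then rotated and shifted so it does not line up with the original axes.
cloud = rng.normal(size=(1000, 3)) * np.array([5.0, 1.0, 0.3])
angle = np.pi / 6
rotation = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                     [np.sin(angle),  np.cos(angle), 0.0],
                     [0.0,            0.0,           1.0]])
points = cloud @ rotation.T + np.array([10.0, -4.0, 2.0])

variances, axes = principal_axes(points)
print(points.var(axis=0, ddof=1))   # variance along the original x, y, z axes
print(variances)                    # variance along the new axes, in decreasing order
print(axes.T @ axes)                # approximately the identity: the new axes are orthonormal
```

The total variance is the same in both coordinate systems; the new basis simply concentrates as much of it as possible in the first axis, then the second, and so on.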
Now the question is which orthogonal direction to delete in order to reduce the dimensionality. We want to remove the one that has the least variance, where variance is the average squared difference from the mean. For our purposes, we can think of it as the direction along which our data has the least spread, which comports with the idea that certain directions will carry less information than others.
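To make "spread" concrete, here is a small sketch that computes variance as the average squared difference from the mean for two invented lists of numbers. The values and the helper name `variance` are made up for illustration.

```python
import numpy as np

def variance(values):
    """Average squared difference from the mean (population variance)."""
    values = np.asarray(values, dtype=float)
    return np.mean((values - values.mean()) ** 2)

wide = [2.0, 4.0, 6.0, 8.0, 10.0]     # spread out around a mean of 6
narrow = [5.8, 5.9, 6.0, 6.1, 6.2]    # tightly packed around the same mean

print(variance(wide))     # 8.0
print(variance(narrow))   # about 0.02
```

Many PCA implementations divide by n - 1 rather than n; that changes the numbers slightly but not the ranking of directions.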
Let’s look at the different possible orientations of the data, minus one direction.
It should be plain that the Z direction is the most important for understanding our data based on its spread. This will be our first principal component.
The second direction is trickier. Does X or Y tell us more about the data? Given our variance-based measure, we want to select the Y direction. This is because the width of the mallet’s head is less than its depth. In other words, the variance in the size of the mallet head is larger along the Y direction than the X direction, so we select Y and Z as our principal components and drop our X feature. Our data now looks like the rightmost image above.
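The selection step can be sketched in a few lines, assuming the data has already been rotated onto its new basis so that each column is one of the X, Y and Z directions discussed above. The point cloud is synthetic and its spreads are invented to mimic the mallet.

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical mallet-like points already expressed in the rotated basis:
# column 0 = X (least spread), column 1 = Y, column 2 = Z (most spread, along the handle).
rotated = rng.normal(size=(2000, 3)) * np.array([0.5, 1.5, 6.0])

spread = rotated.var(axis=0)               # variance along X, Y and Z
keep = np.sort(np.argsort(spread)[-2:])    # indices of the two highest-variance directions
reduced = rotated[:, keep]                 # 3D -> 2D by dropping the remaining column

labels = np.array(["X", "Y", "Z"])
print(labels[keep])                        # the directions we keep
print(reduced.shape)                       # (2000, 2)
print(spread[keep].sum() / spread.sum())   # share of total variance retained
```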
Voila! We have data in two dimensions from three. This idea applies no matter how many features your data has, be it three or three thousand. Principal component analysis is great for exploratory data analysis, especially when you want to visualize your examples on a simple X-Y coordinate plane. You can also use your reduced data as features when training a model – a much, much better solution than dumping five hundred variables into a neural network and hoping for the best.
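The same recipe scales to wider data. Below is a minimal sketch, assuming NumPy, that reduces a synthetic 100-column dataset to two features via the singular value decomposition and reports the fraction of variance retained, one common heuristic for information loss. The function name `pca_reduce` and the data are hypothetical; in practice you would more likely reach for a library implementation such as scikit-learn’s `PCA` class.

```python
import numpy as np

def pca_reduce(data, n_components=2):
    """Project data onto its top principal components using the SVD.
    Returns the reduced data and the fraction of total variance retained."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    reduced = centered @ vt[:n_components].T
    variances = singular_values ** 2       # proportional to the variance along each direction
    retained = variances[:n_components].sum() / variances.sum()
    return reduced, retained

rng = np.random.default_rng(0)
# Synthetic stand-in for a wide dataset: 100 columns driven by 2 hidden signals plus noise.
hidden = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 100))
data = hidden @ mixing + 0.1 * rng.normal(size=(500, 100))

reduced, retained = pca_reduce(data, n_components=2)
print(reduced.shape)   # (500, 2)
print(retained)        # close to 1.0 here, because the data was built from two signals
```

Real datasets rarely compress this cleanly, so it is worth checking the retained fraction before trusting a two-dimensional picture.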
PCA will save your skin when your data is too big to understand and you’re not sure where to go with it. It’s great whenever you want to throw something at the wall and see what sticks, so give it a shot.