Linear models will take you far as a machine learning practitioner — much further than most rookies would expect. I previously wrote that complex approaches like neural nets are a great way to shoot yourself in the foot. That’s especially true if you lack the data to justify nonlinear techniques or don’t understand the finesse behind these methods.
That being said, you won’t always have linearly-separable data. For every dataset you can draw a straight line through, there will be ten more with crazy nonlinear solutions. So what gives with my post arguing for linear models?
It turns out there’s a third option: you can execute linear models on transformed data to find nonlinear solutions. Our ability to find nonlinear solutions isn’t constrained by the selection of our model. Rather, it’s constrained by how we choose to pre-process our data before using that model. We’ll discuss how you can leverage vanilla linear models with data projections to find ways to carve nonlinear paths.
What’s a data projection anyways?
Assume you have a dataset of dimensionality d that you want to run a support vector machine against. You randomly sample the data, split it into training and testing sets, run it with cross-validation against a number of different hyperparameters for the SVM — and get atrocious performance.
It’s possible that your data is inherently nonlinear in this case. Maybe you could give it another crack with a neural net. But given the amount of data you have available — say, 1,000 examples — this seems unwise.
Instead, you can project your data into a higher-dimensional space and run your SVM there. The linear boundary the model finds in that space, when mapped back to the original low-dimensional space, is nonlinear. It works because, depending on the transformation you apply, data in the higher-dimensional space will not have the same spatial relationships as in the lower dimension: classes that no straight line could separate may become separable by a hyperplane. This makes the algorithm behave nonlinearly in practice, even though you haven't changed anything about how the model itself works.
There are many mapping functions that take a d-dimensional input and project it into a (d + k)-dimensional space (k can be arbitrarily large, and in some applications even infinite), where a linear solution may exist; that linear solution corresponds to a nonlinear solution in the original low-dimensional space. The functions that compute inner products under such mappings are called kernels.
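To make the idea concrete, here is a minimal sketch with a hypothetical 1D dataset. On the number line, one class sits between the two clusters of the other, so no single threshold separates them; after projecting each point with the mapping x → (x, x²), a horizontal line in the new 2D space does the job:

```python
import numpy as np

# Hypothetical 1D dataset: class 0 sits between the two clusters of class 1,
# so no single threshold on x can separate them.
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([1, 1, 0, 0, 0, 1, 1])

# Project each point into 2D with the mapping x -> (x, x^2).
X_proj = np.column_stack([x, x**2])

# In the projected space, the line x^2 = 2 separates the classes perfectly.
pred = (X_proj[:, 1] > 2.0).astype(int)
print(np.array_equal(pred, y))  # True
```

A linear separator in the projected space corresponds to the nonlinear rule |x| > √2 back in the original space.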
The Kernel Trick
Computing the kernel on every point in your dataset can become quite expensive at scale. However, many linear approaches — most notably the support vector machine — can leverage what is called the kernel trick.
To understand this, let’s take a look at the optimization problem that defines SVM training.
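For reference, the standard dual formulation of the hard-margin SVM training problem (for examples x_i with labels y_i and Lagrange multipliers α_i) is:

```latex
\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, (x_i \cdot x_j)
\quad \text{subject to} \quad
\alpha_i \ge 0, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0
```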
The exact mechanics of how this optimization problem trains a support vector machine aren't important here. What is important is the x_i · x_j term, the pairwise dot product computed for every pair of examples in the dataset. In other words, training depends on the data only through the inner products of every point with every other point.
The intuition is that certain kernels are defined such that computing the kernel function on two examples is equivalent to taking the dot product of some other function applied to each point. To formalize this, we say:
K(x_i, x_j) = F(x_i) · F(x_j)
In other words, we don’t have to actually place our points in a higher dimension at all. We can get the same results by applying our kernel pairwise to all points. This is often less expensive than actually solving the SVM in a higher-dimensional space.
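This identity is easy to verify numerically. A small sketch with the degree-2 polynomial kernel in 2D: for this kernel, the explicit feature map is known to be (x₁², √2·x₁x₂, x₂²), and evaluating the kernel directly gives the same number as mapping both points and taking their dot product:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on 2D inputs."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly_kernel(x, z):
    """Degree-2 polynomial kernel: (x . z)^2, computed without building phi."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

print(poly_kernel(x, z))           # 16.0
print(np.dot(phi(x), phi(z)))      # 16.0
```

The kernel evaluates one dot product in 2D; the explicit route builds two 3D vectors first. The gap widens quickly as the degree and input dimension grow, which is the whole appeal of the trick.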
Kernel Models in Practice
The only difference between finding a linear and a nonlinear solution is whether you choose to apply a kernel before training. Everything else is essentially the same: regularization, cross-validation, training, and testing all follow the exact same process; the model simply operates on kernel values instead of raw dot products.
Kernels won’t always get better results, and you must use intuition for what your data actually looks like under the surface. If you expect to find one class to be clustered together and surrounded by examples for the other class, you might find a radial basis function most useful since it will use exponentially decaying distances between points. Get a feel for your data and figure out which function you think will work best. Just make sure you don’t go data snooping!
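The surrounded-cluster scenario above is exactly what scikit-learn's make_circles generates, which makes it a convenient sketch of the workflow: the only change between the two models below is the kernel argument, yet the RBF kernel handles the concentric classes far better than the linear one (exact scores depend on the random seed and noise level):

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Concentric-circle data: one class clustered inside, the other surrounding it.
X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Identical pipeline; only the kernel differs.
linear_acc = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)
rbf_acc = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)

print(f"linear: {linear_acc:.2f}, rbf: {rbf_acc:.2f}")
```

On this geometry the linear SVM hovers near chance while the RBF SVM separates the rings almost perfectly.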
César Souza’s website has an impressive list of different kernels you can try out, along with explanations for when to use each one and another explainer on the kernel trick.
Kernels are an awesome way to put some shine on linear models. Combined with a kernel, a linear model like the SVM can match or even outperform more complex techniques such as neural networks, often finding good solutions in less time. Most machine learning libraries make it easy to incorporate a kernel into your learning workflow, so give it a shot and see what you can do with one on a sample project.
Ready to learn more data science skills and techniques in-person? Register for ODSC West this October 31 – November 3 now and hear from world-renowned names in data science and artificial intelligence!