Anyone who got their computer science degree in the past five years is probably familiar with this presentation: “We trained a neural network on this dataset over a two-day span to separate our data. Unfortunately, the results weren’t very good; we got around 60 percent accuracy. Next time we’d consider using a different model.”
It’s a mistake everyone seems to insist on making for themselves rather than learning from others. The temptation to go for gold and use state-of-the-art techniques like neural nets is ever-present among machine learning rookies, and it is a constant source of consternation among managers and more experienced data scientists.
Simply put, more advanced techniques give you a lot more rope to hang yourself with. The reason is that as the number of parameters that must be learned increases, so does the complexity of the model, and thus the variance in its solutions.
That increase in variance has a cost. The expected test error of a model g trained on dataset D is the sum of the bias and the variance of the model (plus an irreducible noise term when the target itself is noisy). This is called the bias-variance decomposition.
E_D[E_out(g^(D))] = bias + variance
In practice, selecting models with a higher number of parameters will dramatically increase the variance, making it much harder to land on a good approximation without a large amount of data.
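To make the decomposition concrete, here is a small simulation of a classic textbook setup (the target function, two-point training sets, and the two hypothesis classes are illustrative choices, not from the article): fitting tiny datasets drawn from a sine curve with a one-parameter constant versus a two-parameter line. The extra parameter buys lower bias but much higher variance.

```python
import math
import random

random.seed(42)

def f(x):
    """Target function: f(x) = sin(pi * x) on [-1, 1]."""
    return math.sin(math.pi * x)

def sample_dataset():
    """Draw a tiny two-point training set from the target."""
    while True:
        x1, x2 = random.uniform(-1, 1), random.uniform(-1, 1)
        if abs(x1 - x2) > 1e-3:          # avoid near-vertical line fits
            return [(x1, f(x1)), (x2, f(x2))]

def fit_constant(data):
    """One-parameter model: h(x) = b, the mean of the labels."""
    b = sum(y for _, y in data) / len(data)
    return lambda x: b

def fit_line(data):
    """Two-parameter model: h(x) = a*x + b through the two points."""
    (x1, y1), (x2, y2) = data
    a = (y2 - y1) / (x2 - x1)
    b = y1 - a * x1
    return lambda x: a * x + b

def bias_variance(fit, trials=5000):
    grid = [i / 50 for i in range(-50, 51)]
    preds = [[fit(sample_dataset())(x) for x in grid] for _ in range(trials)]
    bias = var = 0.0
    for j, x in enumerate(grid):
        gbar = sum(p[j] for p in preds) / trials          # average hypothesis
        bias += (gbar - f(x)) ** 2
        var += sum((p[j] - gbar) ** 2 for p in preds) / trials
    return bias / len(grid), var / len(grid)

bias_c, var_c = bias_variance(fit_constant)
bias_l, var_l = bias_variance(fit_line)
print(f"constant: bias={bias_c:.2f}  variance={var_c:.2f}")
print(f"line:     bias={bias_l:.2f}  variance={var_l:.2f}")
```

The line’s average hypothesis tracks the sine more closely (lower bias), but individual lines swing wildly from one two-point sample to the next, so its variance dominates the expected error.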
When training a neural network, your parameters are the weights of each connection between nodes. Consider a neural network composed of a two-node input layer with a bias term, a five-node hidden layer with another bias term, and a one-node output layer. Assuming the network is fully connected, you will have 21 weights to adjust, or 21 parameters. Compare that against a linear support vector machine on the same two-dimensional inputs, which has just three parameters to optimize (one weight per dimension plus a bias term). It’s easy to see why a neural network has so much more variance in its output models, and thus why the expected error is so much higher.
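The 21-weight count above is easy to verify: each layer contributes (inputs + bias) × outputs weights. A quick helper (an illustrative sketch, not from any particular library) makes the comparison explicit:

```python
def fully_connected_params(layer_sizes):
    """Count weights in a fully connected network where every
    non-output layer also feeds a bias term into the next layer."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# The network from the text: 2 inputs (+ bias), 5 hidden nodes (+ bias), 1 output.
nn_params = fully_connected_params([2, 5, 1])   # (2+1)*5 + (5+1)*1 = 21
print(nn_params)

# A linear SVM on the same 2-D inputs: one weight per dimension plus a bias.
svm_params = 2 + 1
print(svm_params)
```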
There’s a second reason to steer clear of neural networks in the general case. Remember our friends in the undergraduate presentation: not only did they throw their data into a fully-connected neural net with too few examples, but they had to wait two days just to find out that their model was bad.
That’s the second advantage of using simple models: they tend to have very efficient optimization methods that allow the data scientist to iterate quickly. Instead of waiting two days to test the performance, how about two minutes? That speedup is not an exaggeration; support vector machines train quickly because finding the optimal weights reduces to a well-studied convex optimization problem called quadratic programming.
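Even without a QP solver, a linear SVM is fast to fit. The sketch below (a toy setup with made-up data; production libraries use the quadratic programming formulation instead) trains one by plain subgradient descent on the L2-regularized hinge loss, and finishes in a fraction of a second:

```python
import random

random.seed(0)

# Toy 2-D data: two well-separated clusters labeled -1 and +1.
X, y = [], []
for _ in range(200):
    label = random.choice([-1, 1])
    X.append([2.0 * label + random.gauss(0, 0.5),
              2.0 * label + random.gauss(0, 0.5)])
    y.append(label)

# Linear SVM via full-batch subgradient descent on the regularized
# hinge loss: max(0, 1 - y * (w.x + b)) + (lam/2) * |w|^2.
w, b = [0.0, 0.0], 0.0
eta, lam = 0.1, 0.01
for _ in range(200):
    gw, gb = [lam * w[0], lam * w[1]], 0.0
    for xi, yi in zip(X, y):
        if yi * (w[0] * xi[0] + w[1] * xi[1] + b) < 1:   # margin violation
            gw[0] -= yi * xi[0] / len(X)
            gw[1] -= yi * xi[1] / len(X)
            gb -= yi / len(X)
    w = [w[0] - eta * gw[0], w[1] - eta * gw[1]]
    b -= eta * gb

acc = sum((w[0] * xi[0] + w[1] * xi[1] + b) * yi > 0
          for xi, yi in zip(X, y)) / len(X)
print(f"training accuracy: {acc:.2f}")
```

Two hundred full passes over two hundred points is trivial work; compare that against two days of backpropagation, and the appeal of quick iteration is obvious.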
That’s not to say that you should never use a neural network for a classification task. There are lots of cases where neural nets are better-suited to the task at hand. For example, if it is difficult to engineer new features for inputs, neural networks excel at discovering features on their own. That’s why the latest and greatest methods in machine vision lean on them: it’s too hard for a human to tell a machine what to look for, so the neural net figures out what to consider. In other words, it discovers new features itself.
However, if you plan to use a neural network and want it to actually stand a chance, then make sure you have the data to train on. More training data reduces the overall expected out-of-sample error, making the increased number of parameters more manageable. There’s no way of beating the system when you have neural networks – you just have to feed the beast.
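The effect of dataset size on out-of-sample error is easy to demonstrate even with a very simple model. In this illustrative simulation (the linear target and noise level are hypothetical choices), the same least-squares fit is trained on 5 versus 50 points, and the average test error shrinks with more data:

```python
import random

random.seed(1)

def make_data(n, noise=0.5):
    """Noisy linear target: y = 3x + 1 + Gaussian noise (hypothetical)."""
    pts = []
    for _ in range(n):
        x = random.uniform(-1, 1)
        pts.append((x, 3 * x + 1 + random.gauss(0, noise)))
    return pts

def fit_least_squares(data):
    """Closed-form simple linear regression: returns (slope, intercept)."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((x - mx) ** 2 for x, _ in data)
    sxy = sum((x - mx) * (y - my) for x, y in data)
    a = sxy / sxx
    return a, my - a * mx

def avg_test_mse(n_train, trials=500):
    """Average test MSE of a model trained on n_train points."""
    total = 0.0
    for _ in range(trials):
        a, b = fit_least_squares(make_data(n_train))
        test = make_data(100)
        total += sum((a * x + b - y) ** 2 for x, y in test) / len(test)
    return total / trials

small_mse, large_mse = avg_test_mse(5), avg_test_mse(50)
print(f"avg test MSE with  5 points: {small_mse:.3f}")
print(f"avg test MSE with 50 points: {large_mse:.3f}")
```

A neural network’s many parameters make it far hungrier than this two-parameter line, but the direction of the effect is the same: more examples, lower expected error.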
With all that said, there’s still no guarantee that your neural network will outperform an SVM on the same task – or vice versa. The only way to know for sure is to test multiple solutions. However, when you’re short on time and data, let your undergrad classmates learn the hard lesson for you: the neural network can break your heart. Trust and depend on the traditional linear models you learned in your first machine learning class.