What Model Should I Choose for My Data Science Project?

What to ask yourself when you’re balancing model performance, interpretability, and other costs

It might seem silly to bother doing anything other than build the best black box machine learning model possible, as long as it gets good performance. That makes perfect sense on personal projects and Kaggle competitions. But it’s an unwise approach for problems that demand explainability or involve extensive training time.

It is usually computationally prohibitive — or even impossible — to tune a model so that it finds a guaranteed global optimum. Worse still, those models might not tell you anything about why they work in the first place. Neural networks are notorious for both of these reasons.

At ODSC West, Marc Fridson will give an in-depth talk on code-driven ways to incorporate these considerations as part of your pipeline. We won’t go into that level of detail here. For now, we can get a sense of what we should think about when we choose an algorithm.

What is My Model Doing?

If you’ve heard of neural networks, you’re already familiar with this popular refrain: you’ll never understand what’s happening under the surface. And in truth, you won’t.

Neural networks are effectively a series of matrix multiplications, interleaved with simple nonlinear functions, applied layer by layer until a final value or vector is output. If you think you’ll wrap your mind around how multiplying twenty matrices together will tell you whether or not a picture depicts a cat, well, let the rest of us know when you figure it out.
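To make that concrete, here is a minimal sketch of a forward pass through a tiny two-layer network, written in plain Python. The weights, the layer sizes, and the choice of ReLU and sigmoid activations are all made up for illustration; a real image model would have millions of learned weights.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    """Element-wise nonlinearity applied between the matrix multiplications."""
    return [max(0.0, a) for a in v]

def sigmoid(a):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))

# Hypothetical weights: 3 inputs -> 2 hidden units -> 1 output.
W1 = [[0.5, -1.2, 0.3],
      [0.8, 0.1, -0.7]]
W2 = [[1.5, -2.0]]

def forward(x):
    hidden = relu(matvec(W1, x))    # first matrix multiplication + nonlinearity
    logit = matvec(W2, hidden)[0]   # second matrix multiplication
    return sigmoid(logit)           # final score between 0 and 1

score = forward([1.0, 0.5, 2.0])
print(score)
```

Even at this size, the individual numbers in W1 and W2 say little about why the score comes out the way it does, and the problem only grows with more layers.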

Some business applications have no problem with a lack of understandability, but that won’t fly in many situations. Stakeholders will want to know how an algorithm works if its decisions put millions of dollars on the line. Additionally, a neural network shouldn’t be acceptable in situations where people’s lives could be affected dramatically, such as in deciding whether or not they’ll be approved for insurance.

If you need to understand what’s happening under the hood, you’ll want to make sure you pick a model that you can easily explain. Linear support vector machines, for example, can perform very well and are comparatively easy to understand. All they do is draw a line (or, with more than two features, a hyperplane) through the data to separate it. That should be easier to explain to your stakeholders than how your neural network arrived at a particular decision, right?
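As a rough illustration of why a linear decision boundary is easy to explain, the sketch below scores one example with a linear rule and breaks the score down feature by feature. The feature names, weights, and bias are hypothetical stand-ins for what a fitted linear model might produce; they do not come from any real dataset.

```python
# Hypothetical weights and bias, standing in for a fitted linear classifier.
weights = {"prior_claims": 1.5, "vehicle_age": 0.8, "years_insured": -0.6}
bias = -1.0

def explain(example):
    """Return the decision score and each feature's contribution to it."""
    contributions = {name: weights[name] * example[name] for name in weights}
    score = bias + sum(contributions.values())
    return score, contributions

# Hypothetical (already scaled) feature values for one applicant.
applicant = {"prior_claims": 2.0, "vehicle_age": 0.5, "years_insured": 1.0}
score, contributions = explain(applicant)

# The sign of the score says which side of the line the example falls on.
decision = "flag" if score > 0 else "approve"

for name, value in sorted(contributions.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {value:+.2f}")
print(f"score = {score:+.2f} -> {decision}")
```

Each line of the breakdown is something you can say out loud to a stakeholder: this feature pushed the score up by this much, that one pulled it down.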

How Much Tuning is Too Much Tuning?

We also have no way to guarantee that a neural network will arrive at a good solution at all. Imagine the horror of training a neural network for a month — a realistic timeframe in many industrial and academic applications — only to discover that it performs worse than random guessing. Seriously, it happens all the time.

Truthfully, there is no way to ensure that a neural network will converge to a global optimum under any circumstances. You just have to start somewhere and hope for the best. We can use various techniques to determine a local optimum, but it’s extremely difficult to find the global optimum. Even if you do, it’s impossible to prove you have it.
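The "start somewhere and hope for the best" problem shows up even in one dimension. The sketch below runs plain gradient descent on a made-up non-convex function from two different starting points; the function, learning rate, and step count are arbitrary choices for illustration, not anything specific to neural networks.

```python
def f(x):
    """A made-up non-convex function with two separate valleys."""
    return x**4 - 3 * x**2 + x

def grad_f(x):
    """Derivative of f."""
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x, learning_rate=0.01, steps=2000):
    """Repeatedly step downhill from the starting point x."""
    for _ in range(steps):
        x -= learning_rate * grad_f(x)
    return x

from_left = gradient_descent(-2.0)
from_right = gradient_descent(2.0)

print(f"start -2.0 -> x = {from_left:.3f}, f(x) = {f(from_left):.3f}")
print(f"start +2.0 -> x = {from_right:.3f}, f(x) = {f(from_right):.3f}")
```

Both runs settle where the gradient is zero, but they end up in different valleys, and only one of them is the lower one. Nothing in the second run signals that a better solution exists elsewhere.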

For many other models, this isn’t a problem. For example, linear regression has an analytical solution, meaning it can be solved extremely quickly with no guesswork involved. We never have to worry whether we have a good solution since we’ll always get the best possible fit in the least-squares sense. Even better, we can start testing it immediately to see whether it’s feasible to push into production. This minimizes our downtime between model iterations.
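For the single-feature case, the analytical solution fits in a few lines. This is a minimal sketch using the standard closed-form formulas for ordinary least squares; the data points are invented for the example.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-style and variance-style sums (the 1/n factors cancel).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(xs, ys)
print(f"y = {slope:.2f} * x + {intercept:.2f}")
```

There is no learning rate, no random initialization, and no iteration: the same data always gives the same line, and no other line has a smaller sum of squared errors on that data.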

Furthermore, it might make sense to use an imperfect model rather than train the same model longer for a marginal performance gain. For example, cross-validation can be an expensive operation. We have to be careful how many different values we test for a particular hyperparameter. If we are cross-validating over multiple hyperparameters, say multiple values of both alpha and gamma when tuning a regression model with an RBF kernel, the number of combinations to evaluate is equal to the number of values we test for each hyperparameter, multiplied together. If we test 10 values of alpha and five values of gamma, we must evaluate 50 combinations to find the best one, and each combination is itself retrained once per cross-validation fold.
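The arithmetic is easy to check by enumerating the grid. In the sketch below, the specific alpha and gamma values and the fold count are assumptions chosen for illustration; only the 10-by-5 grid size comes from the example above.

```python
from itertools import product

# Hypothetical candidate values: 10 for alpha, 5 for gamma.
alphas = [10 ** e for e in range(-5, 5)]
gammas = [0.001, 0.01, 0.1, 1.0, 10.0]

# Assumed number of cross-validation folds (not specified in the text).
n_folds = 5

# Every (alpha, gamma) pair is one candidate model configuration.
combinations = list(product(alphas, gammas))

# Each configuration is trained once per fold.
total_fits = len(combinations) * n_folds

print(f"{len(combinations)} combinations x {n_folds} folds = {total_fits} model fits")
```

Adding a third hyperparameter with five candidate values would multiply the total by five again, which is why grids get expensive so quickly.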

That might not be a big deal with regression since it has an analytical solution. But imagine doing that with random forests or other approaches with expensive training phases. Sometimes it’s better in practice to take something that’s just good enough.

These are just some of the questions we have to deal with when building a model. Consider attending Fridson’s talk at ODSC West 2018 for a more detailed discussion on the business case for ‘imperfect’ models.

Spencer Norris, ODSC

Spencer Norris is a data scientist and freelance journalist. He currently works as a contractor and publishes on his blog on Medium: https://medium.com/@spencernorris