Over-Optimising: A Story about Kaggle

I recently took a stab at a Kaggle competition. The premise was simple: given some information about insurance quotes, predict whether or not the customer who requested the quote will follow through and buy the insurance. It was a straightforward classification problem, with the data already clean and in one place and a clear scoring metric (area under the ROC curve).

I took a starter script to do the bare minimum of formatting and trained a few big random forests on it with slightly different parameters, nothing too serious. Total human time invested was probably less than 30 minutes; runtime was a few hours.

I came in 1113th place.

Out of 1762.

Not so hot.

You can check the results here.

But let’s dig into what these results mean. The area under the ROC curve summarizes the overall quality of a binary classifier (it is easiest to interpret when the class distribution is roughly even). The curve plots the false positive rate on the x-axis against the true positive rate on the y-axis, with one point per available threshold. A totally random model traces the straight diagonal line and scores about 0.5, whether or not the classes are split exactly 50/50. A perfect model would have a score of 1 (its curve passes through the point at 100% TPR and 0% FPR).
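One way to make the score concrete: ROC AUC equals the probability that a randomly chosen positive sample gets a higher score than a randomly chosen negative one. The sketch below computes it by brute-force pair counting, using only the standard library. `auc_by_pairs` is a hypothetical helper written for illustration, not something from scikit-learn, and the 10% positive rate is an arbitrary assumption chosen to show that the random baseline does not depend on class balance.

```python
import random


def auc_by_pairs(truth, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(truth, scores) if t == 1]
    neg = [s for t, s in zip(truth, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


rng = random.Random(0)  # fixed seed so the run is repeatable

# Deliberately imbalanced labels: roughly 10% positives (illustrative choice).
truth = [1 if rng.random() < 0.1 else 0 for _ in range(2000)]

random_scores = [rng.random() for _ in truth]  # ignores the labels entirely
perfect_scores = [float(t) for t in truth]     # every positive outranks every negative

random_auc = auc_by_pairs(truth, random_scores)
perfect_auc = auc_by_pairs(truth, perfect_scores)
print(random_auc, perfect_auc)
```

The random scores should land close to 0.5 and the perfect scores at exactly 1. For real work, `roc_auc_score` from scikit-learn computes the same quantity far more efficiently.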

In the Kaggle competition linked, my illustrious contribution had an ROC AUC of 0.96290, while the winner had 0.97024. This particular competition had a prize of $20,000, and it’s not uncommon for teams to spend man-weeks or months on a given contest. So while my 30 minutes wasn’t nearly enough to win, how close was it in practical terms?

By simple percentage, it would have taken an improvement of about 0.76% in my score (0.00734 in absolute terms) to win. But that’s not really all that informative. Recall that a totally random model will achieve an ROC AUC score of 0.5, and let’s do an experiment.

Take a random set of true values: [0, 1, 1, 0, …, 0, 1, 0]

Then take some random variable and use it to produce noisy predictions based on that truth variable: [0.03, 0.98, 0.97, 0.02, …, 0.10, 0.99, 0.01]

As this random variable increases in spread, the predictions get noisier, the model gets worse and the ROC AUC drops. More generically: \(pred = abs(truth - random() * entropy)\)

where random() is uniform between 0 and 1, and entropy ranges between 0 and 1. Predictions for the false class fall between 0 and entropy, and predictions for the true class fall between 1 - entropy and 1. Any time entropy is less than 0.5, the two ranges don't overlap and the model still perfectly classifies the problem, but between 0.5 and 1, it becomes increasingly inaccurate. Using a simple Python script, we can plot this behavior:

import random
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt
plt.style.use('ggplot')

__author__ = 'willmcginnis'


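def create_preds(entropy=0.2, n_samples=10000):
    # Random binary labels, then predictions pulled away from each label
    # by uniform noise scaled by `entropy`.
    truth = [random.randint(0, 1) for _ in range(n_samples)]
    preds = [abs(t - (random.random() * entropy)) for t in truth]

    return truth, preds


def score(truth, preds):
    # Area under the ROC curve for the noisy predictions.
    return roc_auc_score(truth, preds)


def score_n(n=10):
    # Sweep entropy from 0.01 to 0.99 and record the ROC AUC at each step.
    data = []
    for entropy in np.linspace(0.01, 0.99, n):
        t, p = create_preds(entropy)
        data.append([entropy, score(t, p)])
    return pd.DataFrame(data, columns=['entropy', 'ROC AUC'])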

if __name__ == '__main__':
    df = score_n(1000)
    df.plot(kind='line', x='entropy', y='ROC AUC')
    plt.xlabel('Entropy of Predictions')
    plt.ylabel('ROC AUC Score')
    plt.title('ROC Scores of Random Models')
    plt.show()

Which gives the plot:

[Figure: ROC AUC score versus entropy of predictions]

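In a perfect simulation you would see the line hold at 1 until an entropy of 0.5, then slope smoothly down to (1, 0.5): at an entropy of 1, both classes’ predictions are spread across the whole range from 0 to 1, so the model is no better than random. The slope of this curve represents the differential change in ROC AUC score given random noise in predictions. Treating the drop as roughly linear (a fall of 0.5 in ROC AUC over 0.5 of entropy), we can see that the difference between my ROC AUC score and the winner’s amounts to:

\(entropy \approx \frac{0.00734}{\frac{0.5}{0.5}} = 0.00734\)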
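As a cross-check on the linear estimate, the expected score for this particular simulation can be worked out in closed form. This is my own derivation from the script, not something stated in the original analysis, and the function names are hypothetical. It assumes predictions for the false class are uniform on [0, entropy] and predictions for the true class are uniform on [1 - entropy, 1]. The ROC AUC is then the probability that the sum of two uniform draws stays below 1/entropy, which for entropy between 0.5 and 1 is \(1 - \frac{(2 - 1/entropy)^2}{2}\).

```python
import math


def expected_auc(entropy):
    """Expected ROC AUC of the simulated predictions for 0 < entropy <= 1."""
    if entropy <= 0.5:
        return 1.0  # the two classes' predictions cannot overlap
    return 1.0 - (2.0 - 1.0 / entropy) ** 2 / 2.0


def entropy_for_auc(auc):
    """Invert expected_auc for 0.5 <= auc < 1."""
    return 1.0 / (2.0 - math.sqrt(2.0 * (1.0 - auc)))


mine = entropy_for_auc(0.96290)    # my leaderboard score
winner = entropy_for_auc(0.97024)  # the winning score
gap = mine - winner
print(round(mine, 4), round(winner, 4), round(gap, 4))
```

Under these assumptions the gap comes out at roughly 0.009 of entropy. The curve is not exactly a straight line, so this differs a little from the linear estimate, but it is still only a few thousandths of the noise scale and the conclusion is unchanged.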

Really not that much. In practice, and in industry, if you can save a week and deliver a model within that margin of the best, the faster model is generally better. So understand your scoring metrics, and be careful not to over-optimize and spend weeks chasing marginal gains if they aren’t needed. In many cases, a big random forest or linear model will get you where you need to be quickly.


Originally posted at www.willmcginnis.com