In the 2015 hackathon organized by Singapore’s Ministry of Defence, one of the tasks was to predict resignation rates in the military using anonymized data on 23,000 personnel, including their age, military rank, and years in service, as well as performance indicators such as salary increments and promotions. Our team placed third overall. In this post, we elaborate on our methodology.
We first used a random forest to identify the 20 most important features. Next, we trained 7 models: gradient boosting, extreme gradient boosting, random forest, AdaBoost, neural network, support vector machine, and bagged CART. Each model produced a cross-validated prediction for every person, giving 7 predictions per person. These predictions were then added to the training dataset as meta-features. With the expanded dataset, we trained 2 models, extreme gradient boosting and random forest, and averaged their outputs to generate the final predictions.
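The pipeline above can be sketched in a few lines. This is only an illustration under stated assumptions, not our competition code: it uses scikit-learn, synthetic data from `make_classification` in place of the personnel data, two base models instead of seven, and scikit-learn's `GradientBoostingClassifier` as a stand-in for extreme gradient boosting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the personnel data (assumption: the real data is not public).
X, y = make_classification(n_samples=300, n_features=30, n_informative=8, random_state=0)

# Step 1: keep the 20 features a random forest ranks as most important.
ranker = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
top20 = np.argsort(ranker.feature_importances_)[::-1][:20]
X_top = X[:, top20]

# Step 2: out-of-fold probabilities from each base model become meta-features.
base_models = [
    GradientBoostingClassifier(random_state=0),
    RandomForestClassifier(n_estimators=100, random_state=1),
]
meta = np.column_stack([
    cross_val_predict(model, X_top, y, cv=5, method="predict_proba")[:, 1]
    for model in base_models
])
X_stacked = np.hstack([X_top, meta])

# Step 3: fit the second-stage models on the expanded data and average them.
final_models = [
    GradientBoostingClassifier(random_state=2),
    RandomForestClassifier(n_estimators=100, random_state=3),
]
final_proba = np.mean(
    [model.fit(X_stacked, y).predict_proba(X_stacked)[:, 1] for model in final_models],
    axis=0,
)
```

The meta-features are out-of-fold predictions, so no model scores a row it was trained on. For brevity the last step scores the training rows themselves; on a real test set, the base models would first be refit on all training data to produce the test rows' meta-features.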
While examining features, we noticed that personnel ID was a significant predictor of resignation. This led us to postulate that the data order was non-random. When we plotted personnel ID against age, it was clear that personnel who had resigned were clustered at the bottom. We engineered new features to identify these bottom clusters of resigned personnel. The data appeared to have been sorted in a particular order during preparation. Hence, to recover the resigned clusters, we reverse sorted the data.
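One quick way to check whether row order carries signal is to compare the resignation rate across consecutive blocks of rows. The sketch below is a simplified stand-in for the plot we described, not the analysis we ran; `rate_by_position` and the toy data are hypothetical.

```python
import numpy as np

def rate_by_position(resigned, n_bins=10):
    """Resignation rate within consecutive blocks of rows, in file order."""
    resigned = np.asarray(resigned, dtype=float)
    blocks = np.array_split(resigned, n_bins)
    return np.array([block.mean() for block in blocks])

# Toy data: 80 stayers followed by 20 leavers, mimicking a sorted file.
toy = np.r_[np.zeros(80), np.ones(20)]
rates = rate_by_position(toy, n_bins=5)
```

If the rows were in random order, every block would show roughly the same rate. A sharp jump in one block, as in the toy data, suggests the file was sorted on something related to the target.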
Visualizing the dataset in its entirety, we noticed that some features were balanced across the test and training sets (see heatmap below). Each feature’s balance score was computed from the L1 norm of its distribution over the test and training sets. We then combined this score with the feature importance given by the classifier to derive a sort key, and sorted the data by that key.
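A balance score of this kind could be computed as follows. This is a minimal sketch under one assumed reading of the description: the score is the L1 distance between the binned training and test distributions of a feature. The function name `balance_score` and the choice of 10 bins are illustrative.

```python
import numpy as np

def balance_score(train_values, test_values, bins=10):
    """L1 distance between the binned train and test distributions of one feature.

    0 means identical histograms; 2 means the histograms do not overlap at all.
    """
    train_values = np.asarray(train_values, dtype=float)
    test_values = np.asarray(test_values, dtype=float)
    # Shared bin edges so that the two histograms are directly comparable.
    edges = np.histogram_bin_edges(np.concatenate([train_values, test_values]), bins=bins)
    p = np.histogram(train_values, bins=edges)[0] / len(train_values)
    q = np.histogram(test_values, bins=edges)[0] / len(test_values)
    return float(np.abs(p - q).sum())

rng = np.random.default_rng(0)
same = balance_score(rng.normal(size=1000), rng.normal(size=1000))
shifted = balance_score(rng.normal(size=1000), rng.normal(loc=3.0, size=1000))
```

Here `same` should come out small and `shifted` close to 2, since the second pair of samples barely overlaps. Under this reading, a low score marks a feature as balanced across the two sets.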
Heatmap of data before reverse sorting. The first column indicates whether the data was in the test (blue) or training (red) set.
Homogeneous clusters revealed after reverse sorting.
Even though the sort key was not a perfect oracle, we were able to use it to improve predictions. Specifically, we could now estimate a person’s probability of resignation from that of their neighbors in the sorted order. We also derived the size of the public test dataset and used it to calculate how many of these predictions were correct. Predictions verified as correct against the public test dataset were then used to expand our training dataset, improving our classifier.
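The neighbor-based estimate can be illustrated with a small sketch. It assumes the rows are already in sorted order, with known training labels as 0 or 1 and test rows marked as NaN; the function name `neighbor_resign_rate` and the window size are hypothetical choices, not the values we used.

```python
import numpy as np

def neighbor_resign_rate(labels_in_sort_order, window=2):
    """Fill unknown labels (NaN) with the mean of known labels within `window` rows."""
    labels = np.asarray(labels_in_sort_order, dtype=float)
    estimates = labels.copy()
    for i in np.flatnonzero(np.isnan(labels)):
        lo, hi = max(0, i - window), min(len(labels), i + window + 1)
        known = labels[lo:hi]
        known = known[~np.isnan(known)]
        if known.size:  # leave NaN when no labeled neighbor is in range
            estimates[i] = known.mean()
    return estimates

# 0 = stayed, 1 = resigned, NaN = test row with an unknown outcome.
labels = [0, 0, np.nan, 0, 1, np.nan, 1, 1]
estimates = neighbor_resign_rate(labels, window=2)
```

A test row surrounded mostly by stayers gets a low estimate, and one surrounded mostly by resigned personnel gets a high one. The more homogeneous the clusters in the sorted data, the closer these estimates get to 0 or 1.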
Fun fact: During the reverse sort process, we deduced the size of the private (n = 4600) and public (n = 3488) datasets. Coincidentally or not, the numbers “46” and “88” signify good fortune in Chinese tradition.