Classification tasks in Data Science come frequently, but the hardest are those with unbalanced classes. From biology to finance, the real-life situations are numerous.

Before worrying about balancing error types, note that simply predicting the most frequent class can give you over 90% accuracy right off the bat on a heavily skewed data set. Whether a Type 1 or Type 2 error is worse really comes down to the domain where the problem lives.

The German credit data set is not an extreme case of this type of problem, but it illustrates the point well. The data comes via Dr. Hans Hofman of the University of Hamburg. It consists of a complete package of information – credit history, job status, and age, to name a few – collected from 1,000 credit applicants to a bank, with each account labelled as a good risk or a bad risk. The goal is to build a model that predicts whether an applicant is a good or bad credit risk.

Initial exploration of these features unearthed some interesting points:

* Almost 40% have no checking account.

* More than 50% have paid back all their credit at the point where the data was collected.

* A little over 60% have less than 100 Deutsche Marks in their account.

* Most are single males.

* 70% own their own homes.

* Over 90% of the data points are foreign workers.

* The average credit amount requested is 3,200 DM.
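Summary statistics like the ones above fall out of a couple of pandas one-liners. The frame below is a tiny synthetic stand-in, and the column names are assumptions; the real data set uses coded attribute names documented in its code book.

```python
import pandas as pd

# Tiny synthetic stand-in for the German credit data. Column names and
# category codes here are illustrative assumptions, not the real schema.
df = pd.DataFrame({
    "checking_status": ["none", "none", "<100DM", ">=100DM", "none"],
    "credit_amount": [1200, 4500, 3000, 2500, 4800],
    "risk": ["good", "good", "good", "bad", "good"],
})

# Share of applicants in each category, e.g. how many have no checking account
checking_shares = df["checking_status"].value_counts(normalize=True)

# Average credit amount requested
mean_amount = df["credit_amount"].mean()
```

On the real data, the same `value_counts(normalize=True)` and `mean()` calls produce the percentages quoted in the bullets.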

The most salient point of all is that 70% of the people in the data set have good credit. Despite these observations, no single feature showed a strong relationship with a person's credit risk, and exploring pairwise correlations exposed similar dead ends. There were, however, a few strong relationships among the features themselves, such as those between credit amount, loan duration, and installment rate as a percentage of disposable income.
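That 70/30 split is exactly the majority-class baseline any model has to beat. One way to make it explicit is scikit-learn's `DummyClassifier`; the labels below are synthetic, constructed only to mimic the class balance.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic labels mimicking the 70/30 good/bad split (not the real data)
y = np.array(["good"] * 700 + ["bad"] * 300)
X = np.zeros((1000, 1))  # features are irrelevant to a majority-class baseline

# Always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
acc = accuracy_score(y, baseline.predict(X))  # 0.7 by construction
```

Any classifier scoring near 70% accuracy on this data has learned nothing beyond the class distribution.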




The presence of a number of categorical variables meant that encoding them was a natural pre-processing step.  
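A minimal sketch of that encoding step, using `pandas.get_dummies` on hypothetical columns (the real data set's categorical attributes are coded differently):

```python
import pandas as pd

# Hypothetical categorical columns standing in for the coded attributes
df = pd.DataFrame({
    "housing": ["own", "rent", "own", "free"],
    "job": ["skilled", "unskilled", "skilled", "management"],
})

# One-hot encode every listed categorical column; each category becomes
# its own indicator column, and numeric columns would pass through untouched
encoded = pd.get_dummies(df, columns=["housing", "job"])
```

`sklearn.preprocessing.OneHotEncoder` does the same job inside a pipeline, which is preferable when the encoding has to be fit on training data only.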

With the initial exploration done, it was time to move on to modeling. The code book includes a detail that matches one's expectations: it is worse to classify someone with bad credit as a good risk than vice versa. Treating bad credit as the positive class, the goal from the outset is to reduce the number of false negatives.
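With bad credit as the positive class, the costly mistake is a bad-risk applicant predicted as good. A confusion matrix makes that count explicit; the labels below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative predictions; "bad" is the positive class, so a false
# negative is a bad-risk applicant labelled good -- the costly mistake
y_true = np.array(["bad", "bad", "good", "good", "bad", "good"])
y_pred = np.array(["bad", "good", "good", "good", "bad", "bad"])

# Fixing the label order makes ravel() unpack as (tn, fp, fn, tp)
tn, fp, fn, tp = confusion_matrix(
    y_true, y_pred, labels=["good", "bad"]
).ravel()
```

Tracking `fn` alongside accuracy, or optimizing recall on the "bad" class directly, keeps the model honest about the error that matters.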

A Random Forest Classifier, Logistic Regression, and Linear Discriminant Analysis were each evaluated with cross validation, and the accuracy scores all landed around 75%. Though this isn't a huge lift from the 70% baseline, it's a start. Repeating this modeling procedure after using Principal Component Analysis (PCA) in tandem with the Kaiser-Harris criterion for dimensionality reduction doesn't change these scores much, which is unsurprising given the lack of strong correlations between features. Projecting the data onto two principal components also gives a glimpse of how poorly separable the two classes are in that low-dimensional space.
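The comparison can be sketched as a scikit-learn cross-validation loop. The data here is synthetic with the same 70/30 imbalance, and the PCA component count is an illustrative choice rather than the Kaiser-Harris cut-off.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same 70/30 class imbalance (not the real data)
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.7, 0.3], random_state=0)

models = {
    "rf": RandomForestClassifier(random_state=0),
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "lda": LinearDiscriminantAnalysis(),
}
# Mean 5-fold cross-validated accuracy per model
scores = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in models.items()}

# Same evaluation after PCA; n_components=10 is an illustrative choice,
# not the Kaiser-Harris criterion used in the post
pca_pipe = make_pipeline(StandardScaler(), PCA(n_components=10),
                         LogisticRegression(max_iter=1000))
pca_score = cross_val_score(pca_pipe, X, y, cv=5).mean()
```

Swapping `n_components=2` into the PCA step and scattering the two components, colored by class, reproduces the separability plot described above.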


©ODSC 2016

Gordon Fleetwood

Gordon studied Math before immersing himself in Data Science. Originally a die-hard Python user, he gradually saw R's tidyverse ecosystem subsume his workflow until only scikit-learn remained untouched. He is fascinated by the elegance of robust data-driven decision making in all areas of life, and is currently involved in applying these techniques to the EdTech space.