Editor’s note: Aric LaBarr, PhD is a speaker for ODSC East 2023 this May 9th-11th! Be sure to check out his full-day training session, “Advanced Fraud Modeling & Anomaly Detection with Python & R,” there!
When it comes to model building, there are some general approaches that apply to pretty much every field. For example, splitting data into two (or even three) pieces to validate your model results is useful. However, in the field of fraud detection, things typically work a little differently. If you have ever modeled fraud, have you considered creating a model to predict fraud as well as a model to predict not-fraud? You might be thinking this is pointless! Let me explain some more…
Fraud modeling can be tough. It is one of the only times you are modeling something that is actively trying to hide from you. If you are modeling churn at a company, whether a person would prefer website A or website B, or even whether someone will default on a loan, that binary target variable (one value for success and one for failure) isn’t actively trying to thwart you at every step. Fraudsters are trying to appear like everyone else. They are trying to “hide” in the dataset.
This act of hiding from the modeler makes fraud modeling very difficult to begin with, let alone when new types of fraud arise. The ideal fraud modeling solution can do two things:
- Detection – able to identify current fraud in the system
- Prevention – able to flag potentially new fraud in the system
The first one is relatively straightforward if you have some notion of previous instances of fraud. Take the plot you see. Imagine that the lower three observations are fraudulent while the remaining ones are not. Building a fraud model to predict a binary target of 1 (fraud) vs. 0 (not-fraud) would be a rather straightforward exercise. However, I am proposing that you should also build a not-fraud model to predict a binary target of 1 (not-fraud) vs. 0 (fraud). You might think this is pointless and will provide the same information, just reversed. A model that predicts a binary target variable will provide the probability that an observation takes the value of 1. In the fraud model, that would be the probability that each observation was fraudulent. In the not-fraud model, that would be the probability that each observation was not fraudulent.
In this simple example, you would think these two models would provide exactly opposite probabilities and therefore the same conclusions. If an observation has a probability of fraud of 0.3, then it would have a probability of not-fraud of 0.7, so the second model isn’t helpful. For detecting previous instances of fraud, that is probably true. But what about preventing new types of fraud?
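To make the detection half of that concrete, here is a minimal sketch of the "exact opposite" case. It assumes both models are plain logistic regressions fit to the same data with the labels swapped, which is only one possible reading of the setup. The toy data, `fit_logistic`, and every other name are hypothetical and used purely for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, steps=2000):
    """Fit a one-feature logistic regression by plain gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b


# Hypothetical one-feature toy data; the classes overlap on purpose.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
is_fraud = [0, 0, 0, 1, 0, 1, 1, 1]
is_not_fraud = [1 - y for y in is_fraud]  # same data, labels swapped

w_f, b_f = fit_logistic(xs, is_fraud)      # "fraud" model
w_n, b_n = fit_logistic(xs, is_not_fraud)  # "not-fraud" model

p_fraud = [sigmoid(w_f * x + b_f) for x in xs]
p_not_fraud = [sigmoid(w_n * x + b_n) for x in xs]

for x, pf, pn in zip(xs, p_fraud, p_not_fraud):
    print(f"x={x:.1f}  P(fraud)={pf:.3f}  P(not-fraud)={pn:.3f}  sum={pf + pn:.3f}")
```

Under this assumption the two fitted models are mirror images of each other, so each pair of probabilities sums to 1 and the second model adds nothing new.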
Take a look at the updated picture with a new observation that has been added. If all you had was a fraud model based solely on previous cases of fraud, you would rank this new observation in the upper left corner as having a LOW probability of fraud. That is because it doesn’t look like any instance of fraud you have seen in the past. Your model would rank this observation as low risk, and you would move on and approve it without thinking twice. However, if you had a model that predicted not-fraud, the observation would also have a low probability in that model, since it doesn’t look like any regular or non-fraudulent observations you have seen in the past either.
This would signal to you that you have something new on your hands! Does this mean it is fraudulent necessarily? No. However, you have never seen this type of data point before, so it is definitely worth investigating further! Maybe you have just identified a new customer type that isn’t fraudulent. On the other hand, you could have just identified a new pattern of fraud in your data before it becomes a bigger problem!
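The article does not say how the two models should be built, so the following is only a rough sketch under a stated assumption: each "model" is fit to its own class alone and scores how typical an observation is of that class. That way the two scores are not forced to sum to 1. The toy coordinates, the `ClassProfile` class, the independent-Gaussian scoring, and the rule of using the least typical training observation as a threshold are all hypothetical choices made for illustration.

```python
from statistics import NormalDist

# Hypothetical toy data (x, y), loosely mirroring the plots described above.
NOT_FRAUD = [(5.0, 6.0), (6.0, 7.0), (7.0, 6.5), (6.5, 5.5), (5.5, 7.5), (7.5, 7.0)]
FRAUD = [(4.0, 1.0), (5.5, 1.5), (7.0, 0.8)]  # the "lower three" observations


class ClassProfile:
    """Assumed stand-in for one model: independent Gaussians fit to one class."""

    def __init__(self, rows):
        # One normal distribution per feature, estimated from this class only.
        self.dists = [NormalDist.from_samples(col) for col in zip(*rows)]
        # The least typical training observation sets the familiarity bar.
        self.floor = min(self.density(row) for row in rows)

    def density(self, row):
        d = 1.0
        for dist, value in zip(self.dists, row):
            d *= dist.pdf(value)
        return d

    def looks_familiar(self, row):
        return self.density(row) >= self.floor


fraud_model = ClassProfile(FRAUD)
not_fraud_model = ClassProfile(NOT_FRAUD)


def triage(row):
    """Combine both scores; low on both means we have not seen this before."""
    like_fraud = fraud_model.looks_familiar(row)
    like_not_fraud = not_fraud_model.looks_familiar(row)
    if like_fraud and not like_not_fraud:
        return "resembles known fraud"
    if like_not_fraud and not like_fraud:
        return "resembles known not-fraud"
    if not like_fraud and not like_not_fraud:
        return "new pattern: investigate"
    return "ambiguous"


# The last point plays the role of the new upper-left observation.
for point in [(5.5, 1.2), (6.2, 6.6), (0.5, 9.5)]:
    print(point, "->", triage(point))
```

In practice the per-class scoring could be swapped for any anomaly or novelty detection method. What matters in the sketch is that an observation scoring low on both sides gets routed to investigation instead of being approved by default.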
This is only a small piece of the things we need to consider when modeling fraud with data science. What are some other things you need to consider when modeling fraud? What if you don’t even have known cases of fraud to model? How do you even get to the modeling stage? I will talk about all these things and more in my training on “Advanced Fraud Modeling & Anomaly Detection with Python & R” at ODSC East 2023 in Boston, MA. Looking forward to seeing you there!
About the author:
A Teaching Associate Professor at the Institute for Advanced Analytics, Dr. Aric LaBarr is passionate about helping people solve challenges using their data. There he helps design the innovative program to prepare a modern workforce to wisely communicate and handle a data-driven future at the nation’s first Master of Science in Analytics degree program. He teaches courses in predictive modeling, forecasting, simulation, financial analytics, and risk management. Previously, he was Director and Senior Scientist at Elder Research, where he mentored and led a team of data scientists and software engineers. As director of the Raleigh, NC office he worked closely with clients and partners to solve problems in the fields of banking, consumer product goods, healthcare, and government. Dr. LaBarr holds a B.S. in economics, as well as a B.S., M.S., and Ph.D. in statistics — all from NC State University.