ODSC East Interviews: Shir Meir Lador
posted by George McIntire, ODSC | September 26, 2017
The following Q&A is part of a series of interviews conducted with speakers at the 2017 ODSC East conference in Boston. This interview is with Shir Meir Lador, Lead Data Scientist at BlueVine, whose talk was entitled “Fraud Detection Challenges and Data Skepticism.” The transcript has been condensed and edited for clarity.
Give us a recap of your talk
My talk was about fraud detection challenges and model explanation and interpretation. I described the model I worked on at my job at BlueVine. BlueVine provides funding for small and medium businesses: invoice factoring and loans. We write models that automate the lending process, models that decide automatically who should get money, how much, what the risk is, and all sorts of factors.
I built a model whose goal was to approve deals for returning clients and debtors. In my talk, I described how I built the model. I got really good results that did not seem logical to me; it was too good to be true. I showed the process of how I understood that I actually had a problem in my dataset, that the model was not learning the right thing.
Instead of understanding the pattern of a good deal, it learned to recognize specific clients, because I had multiple deals per client in my dataset. Then I described the solution, which was to create several datasets with only one deal per client, build a model on each dataset, and then create an ensemble of these models. Since each model sees only data with one deal per client, it doesn’t have the ability to recognize specific clients. That was the first part of my talk.
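The resampling-plus-ensemble idea described above can be sketched roughly as follows. This is a minimal, hypothetical illustration: the record format, the trivial `train` stand-in, and the averaging step are all assumptions, not BlueVine’s actual code.

```python
import random
from collections import defaultdict

def one_deal_per_client(deals, seed):
    """Sample one random deal per client (hypothetical record format)."""
    rng = random.Random(seed)
    by_client = defaultdict(list)
    for deal in deals:
        by_client[deal["client_id"]].append(deal)
    return [rng.choice(client_deals) for client_deals in by_client.values()]

def train(dataset):
    # Trivial stand-in for a real fit (e.g. a tree-based classifier):
    # predict the base rate of good deals in the training sample.
    rate = sum(d["label"] for d in dataset) / len(dataset)
    return lambda features: rate

def ensemble_predict(models, features):
    # Average the member models' predictions.
    return sum(m(features) for m in models) / len(models)

# Toy data: 10 clients, 4 deals each.
deals = [{"client_id": c, "features": [c % 3], "label": c % 2}
         for c in range(10) for _ in range(4)]

datasets = [one_deal_per_client(deals, seed) for seed in range(5)]
models = [train(ds) for ds in datasets]
score = ensemble_predict(models, [1])
```

Because each resampled dataset contains every client exactly once, no member model can benefit from memorizing a client who appears many times.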
I also talked a bit about continuous model training: how we don’t just train models offline, but retrain them automatically in production all the time, in order to use the most up-to-date data and stay current with our policies. The second part of my talk was about a tool called LIME, developed by a research group at the University of Washington, which helps data scientists and people who use machine learning models understand the models’ decisions and predictions.
It gives, per prediction, an explanation for that specific prediction, in the form of which features were most important for this particular sample. It does that by approximating the model locally around each sample. It’s very nice, in my opinion. I showed how I applied it to my initial model, which seemed good but in fact was not, and also to my final model. The explanations are more reasonable for the second model. It’s another way to understand whether a model is performing correctly or not.
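The core idea behind LIME, approximating the black-box model locally around one sample, can be illustrated with a toy sketch. This is not the real `lime` package (which fits a weighted linear surrogate over jointly perturbed samples); here each feature is perturbed on its own, and the model itself is a made-up stand-in.

```python
import random

def model(x):
    # A toy nonlinear "black box": near x0 = [2, 1], feature 0
    # has local slope 2 * x0[0] = 4, feature 1 has slope 0.5.
    return x[0] ** 2 + 0.5 * x[1]

def local_explanation(f, x0, n_samples=2000, scale=0.1, seed=0):
    """Estimate per-feature local slopes by regressing f's output change
    on small perturbations of one feature at a time (a simplified,
    LIME-like local surrogate)."""
    rng = random.Random(seed)
    weights = []
    for i in range(len(x0)):
        num = den = 0.0
        for _ in range(n_samples):
            d = rng.gauss(0, scale)
            xp = list(x0)
            xp[i] += d
            num += d * (f(xp) - f(x0))
            den += d * d
        weights.append(num / den)  # least-squares slope through the origin
    return weights

w = local_explanation(model, [2.0, 1.0])
```

The resulting weights say which feature mattered most for this one prediction, which is exactly the kind of per-sample explanation described above.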
What was the reception from the crowd? What questions did they ask?
There were a lot of questions. I’m trying to think which were especially relevant. Some people asked me whether, instead of taking one random deal per client to create each new dataset, I should just average per client: average all the features over a client’s deals and take something that represents that client. I thought this wasn’t right, because if I just averaged all the features, I would get something that was not real. It’s not a real deal, just the average of all the features.
Since I use a non-linear model, averaging would ruin the relationships between the features. Maybe it’s fine for a linear model. That was an example of an interesting question. They also asked whether I had thought about building different models for customers at different stages: one model for people on their first deal, another for their second deal, and so on.
The problem with that is that I don’t have enough data to do it. Instead, we just incorporate the number of deals the client has had as one of the features in our model. It’s interesting to think about whether it’s the same thing, putting it as a feature in one model versus training separate models. It’s not actually the same thing. [laughs]
Can you elaborate more on the type of fraud that you deal with at your work?
Maybe the name of my lecture is not very accurate. I wanted to change it, but it was a bit late. I think it’s more correct to call it credit analysis, not fraud detection. The model is less about fraud and more about recognizing which people will be able to repay their loans, and whether they’re working with credible debtors.
Talk more about the modeling. With the machine learning, what false positive or false negative rates are you getting? Is it really hard to increase the accuracy by .002% or so? And what unorthodox features are you looking at?
I’m not really allowed to talk about the specific features we use, but I can talk about the type of model. I didn’t mention it earlier, but in the beginning I used Random Forest, which is really good for these kinds of problems (not fraud exactly but, as I said, credit), where you analyze different types of features, numerical and categorical.
For this kind of mixture of features, it’s good to use tree-based models, because they don’t make assumptions about the distribution of the features. They treat numerical, binary, and categorical features in the same way; they just do splits. They also have the very good property of capturing relations between features in the data, which is very important in our case, in which we have many client scenarios that we want to capture in the rows.
We like tree-based models; we used to use Random Forest. But I didn’t get great results with Random Forest for this model after I moved to the one-deal-per-client datasets. I had heard about XGBoost, which is an implementation of gradient boosting and is really strong; it has played a big part in the Kaggle competitions of the last three years. I decided to try it on my dataset because I knew it is also based on trees and could also fit our case, and it gave me much better results. From about 70% ROC AUC, which is my metric, I got to 77%, which is a nice lift. In the end, we just use XGBoost.
Do you find that the ROC is the best metric when it comes to this sort of problem set?
No, it just gives me a good way to compare models. In the end, when we apply a rule that defines whether we should approve or reject a deal, it’s always about precision and recall. The head of risk will declare, “Okay, I want this level of precision when we approve deals, because that determines the resulting loss in dollars.” That’s how I set the final threshold. So I use ROC to compare between models; for the final threshold, I always use precision and recall.
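Picking the operating threshold from a precision target, as described above, might look something like this sketch. The scores, labels, and target value are invented for illustration; this is not BlueVine’s actual rule.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall if every deal scoring >= threshold is approved."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def pick_threshold(scores, labels, min_precision):
    """Lowest threshold whose approval precision meets the risk target.
    Recall only shrinks as the threshold rises, so the lowest qualifying
    threshold gives the best recall among those meeting the target."""
    for t in sorted(set(scores)):
        precision, _ = precision_recall_at(scores, labels, t)
        if precision >= min_precision:
            return t
    return None  # no threshold reaches the target precision

# Invented example: model scores and true outcomes for eight past deals.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
threshold = pick_threshold(scores, labels, min_precision=0.75)
```

This separation matches the workflow in the answer: ROC AUC to rank candidate models, then a precision constraint to fix the business threshold.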
Moving on from the fraud detection, I want to talk about some of your experiences as a data scientist in Israel. I’ve read that you’re very active in your local data science community. Can you talk more about that? What has it been like fostering the data science community over there?
I have worked at BlueVine for a year. Before that, I worked at a company called TaKaDu, which does monitoring for water systems; very nice, applying machine learning and statistical models to predict anomalies in water networks. That’s basically my experience besides university. When I got to BlueVine, I was very enthusiastic about learning more machine learning, because we did less of it at TaKaDu.
I understood that the best way to learn would be to meet more people who do the same thing. I wanted to build a community. With a friend of mine from work, we started running PyData Tel Aviv. We reached out to the global PyData organization and decided to run meetups under the PyData label. We’ve already had four successful meetups in Tel Aviv. Organizing it is very nice, because I get to choose the lectures, find interesting topics, meet so many people, and learn so many things.
The most important thing is to understand what people are doing at different companies. How data scientists work varies so much from company to company. You can learn a lot from talking with different people and letting them present their work. I really try to make the talks at our meetups very practical, so that people actually talk about projects they did and their process.
What are some interesting things going on? Is there a sizable startup community of data scientists and machine learning startups in Tel Aviv?
Yes, I would say so. Israel is a startup nation; we have many startups. In almost every new startup that rises, there is a need for data scientists.
What are some of these machine learning startups that are happening in Tel Aviv?
I need to think about it; there are so many. I can tell you specifically about our world, the FinTech world: we have Simplex, which works with Bitcoin. There’s Riskified; they’re similar to us and also do fraud detection work. There’s Payoneer, which is already a big company, also in the FinTech field. And, well, it’s not a startup, but we have PayPal, which is huge; they also do lots of similar things. There are also many other small startups, I just can’t think of them right now.
Is it possible to get a perfect credit analysis model?
No. I don’t think any real data can have a perfect model, just because it’s real data. I don’t expect a model to give perfect results. If it gives perfect results, then there’s a problem, for sure. That’s what happened to me, actually: I got a perfect result, so I knew there had to be a problem. I usually expect the highest achievable accuracy to be about 80%, maybe a bit more than that.
Do you feel that the models you make one month are useless the next month because the data is changing so much?
Yes. That’s the thing I talked about in my presentation: we do continuous model training. We started this framework a few months ago. Many of our models we retrain automatically in production by reading all the new data from the database, or the new data we didn’t use yet, and adding it to the model: retraining the model with the new data and forgetting the data that is too old. Then, if it’s good enough on our metrics, what we use is the new model with the updated data.
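A windowed retrain-and-promote loop of the kind described could be sketched like this. All the names, the constant-predictor “model,” and the metric are assumptions for illustration, not the actual production framework.

```python
from collections import deque

class ContinuousTrainer:
    """Toy sketch of continuous training: keep only the freshest examples,
    retrain on them, and promote the candidate model only if it clears a
    metric threshold."""

    def __init__(self, window_size, min_metric):
        self.window = deque(maxlen=window_size)  # too-old data falls off
        self.min_metric = min_metric
        self.model = None

    def _fit(self, data):
        # Stand-in for a real fit: the mean label as a constant predictor.
        return sum(y for _, y in data) / len(data)

    def _evaluate(self, model, data):
        # Stand-in metric: 1 minus the mean absolute error.
        return 1 - sum(abs(y - model) for _, y in data) / len(data)

    def ingest(self, new_batch):
        self.window.extend(new_batch)
        candidate = self._fit(self.window)
        if self._evaluate(candidate, self.window) >= self.min_metric:
            self.model = candidate  # promote only if good enough
        return self.model

trainer = ContinuousTrainer(window_size=4, min_metric=0.4)
trainer.ingest([("deal", 1), ("deal", 1), ("deal", 0), ("deal", 0)])
```

The bounded `deque` plays the role of “forgetting the data that is too old,” and the metric gate mirrors the “if it’s good enough in our metrics” check.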
It must get tiresome, starting from scratch every month or every other month.
I don’t work on it; it’s automatic. I just built a framework, and it does this automatically; I don’t do it by hand. That’s the whole thing, and it’s a really big breakthrough for us. Before that, the models were stale. We were just building models and releasing them to production, and after a few months we would have to work on them again because they got old. Now, it’s done automatically.