How to Tune a Model Using Feature Contribution and Simple Analytics

Editor’s note: Ori Nakar is a speaker for ODSC Europe this June 15th-16th. Be sure to check out his talk, “How to Tune a Model using Feature Contribution and Simple Analytics,” there!

Tuning a model is a core element of a data scientist’s work. An important and integral part of the model tuning process is the feature selection process. This is because in many cases, the model itself is a “black box,” which makes it hard to understand the features’ performance. 

We can add, remove, or change features. Each feature typically has both positive and negative impacts on the model’s performance. We would like to know whether a feature is good or bad overall, and how its performance changes across different test sets. To accomplish this, we suggest a simple process that consists of two main steps:

  1. Calculate and keep the feature contribution data for different experiments and test sets
  2. Analyze the data using a query engine or other analytics tools

In this post, you will train a model, calculate the feature contribution for a test set, and perform an analysis on top of the results. You will use different methods to analyze the feature performance. Later you will decide how to update your features and see how the model improves.

Step 1: Learn about the data

A web application firewall (WAF) is used to detect and block malicious traffic.


SQL injection is one of the most common web hacking techniques. It is the placement of malicious code in SQL statements via web page input. For example, an attacker can submit the following input in the “User Id” field:

105 OR 1=1

The application may then construct the following statement and return the entire “users” table:

SELECT * FROM users WHERE user_id = 105 OR 1=1

A successful attack may result in the unauthorized viewing of user lists, the deletion of entire tables and, in certain cases, the attacker gaining administrative rights to a database.

The data set we will use contains labeled query strings: benign inputs and SQL injection attempts.


Next, we will build a model for detecting SQL Injection attacks.

Step 2: Train a model for detection of SQL Injection attacks

First, we read the data and split it into a train set and a test set:

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("sqliv2-updated.csv.gz")
train, test = train_test_split(df, random_state=0)

The feature extraction step is defined using a lambda function per feature. Each lambda gets a query string and returns a value:

import re
import string
from typing import Callable, Dict

def count_chars(text: str, chars: str) -> int:
   # Helper assumed by the lambdas below: counts the characters of `text` that appear in `chars`
   return sum(1 for c in text if c in chars)

features: Dict[str, Callable[[str], object]] = {
   "length": lambda x: len(x),
   "single_quotes": lambda x: x.count("'"),
   "punctuation": lambda x: count_chars(x, string.punctuation),
   "single_line_comments": lambda x: x.count("--"),
   "keywords": lambda x: len(re.findall("select|delete|insert|union", x)),
   "whitespaces": lambda x: count_chars(x, string.whitespace),
   "operators": lambda x: count_chars(x, "+-*/%&|^=><"),
   "special_chars": lambda x: len(x) - count_chars(x, string.ascii_lowercase),
}

def process_feature_vectors(input_df: pd.DataFrame) -> pd.DataFrame:
   return pd.DataFrame({f: [features[f](qs) for qs in input_df["query_string"]] for f in features})

train_input = process_feature_vectors(train)
test_input = process_feature_vectors(test)

Each query string is now represented by a vector of numeric features.
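To inspect a few example vectors (an inspection step added here, not shown as code in the original post):

print(train_input.head())  # one row per query string, one column per feature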

We train a binary classification model:

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=5, random_state=0)
model.fit(train_input, train["label"])
predictions = model.predict(test_input)

We calculate metrics and print them:

from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

labels = test["label"]
print(f"precision: {round(precision_score(labels, predictions), 3)}")
print(f"accuracy: {round(accuracy_score(labels, predictions), 3)}")
print_conf_mat(confusion_matrix(labels, predictions))
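print_conf_mat is the author’s display helper and its code is not shown in the post. Here is a minimal sketch (it must be defined before the metrics code runs), assuming a 2×2 matrix with the benign class first:

def print_conf_mat(mat) -> None:
   # Hypothetical helper: label the confusion-matrix cells for readability
   (tn, fp), (fn, tp) = mat
   print(f"TN: {tn}  FP: {fp}")
   print(f"FN: {fn}  TP: {tp}")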

Running this prints the precision and accuracy scores along with the confusion matrix for the test set.

We also print the feature importances, which rank the features by their effect on the model’s predictions:

model.feature_importances_
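Pairing each importance with its feature name makes the ranking easier to read (a presentation step added here, not from the original post):

for name, importance in sorted(zip(features, model.feature_importances_), key=lambda p: p[1], reverse=True):
   print(f"{name}: {importance:.3f}")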


Now we have an initial working model and we know the feature importance. It is time to calculate the feature performance and tune our model.

Step 3: Feature Contribution Data

The feature contribution data consists of an array of floating point numbers per prediction. Each feature gets a number between -1 and 1 representing its contribution to the prediction.

We calculate the values using the SHAP library:

import shap

explainer = shap.TreeExplainer(model)
contrib = explainer.shap_values(test_input)[1]  # SHAP values for the positive class
contrib_df = pd.DataFrame(contrib, columns=features)

SHAP can also plot a waterfall chart for a single prediction:

i = 0  # index of a single test-set prediction to explain
shap.plots.waterfall(shap.Explanation(contrib[i], explainer.expected_value[1], test_input.iloc[i], feature_names=list(features)))

This chart explains how each feature contributed to a single prediction.


To perform analytics, we remove very small contribution values so that averages are based on meaningful values. In our case, about 10% of the values were removed:

contrib_df[abs(contrib_df) <= 0.005] = None  # mask near-zero contributions as missing (NaN)
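To verify the masked fraction (a check added here, not from the original post):

print(f"masked values: {contrib_df.isna().mean().mean():.1%}")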

We group the data by classification and print the result:

contrib_df["classification"] = get_classification(predictions, labels)
cls_mean_df = contrib_df.groupby("classification").mean()
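get_classification is not defined in the post. Here is a minimal sketch (it must be defined before the grouping code runs), assuming it names the confusion-matrix cell of each prediction:

def get_classification(preds, labels) -> list:
   # Hypothetical reconstruction: map each (prediction, label) pair to TP/FP/TN/FN
   names = {(1, 1): "TP", (1, 0): "FP", (0, 0): "TN", (0, 1): "FN"}
   return [names[(int(p), int(l))] for p, l in zip(preds, labels)]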

From the result we can find, for example, the feature that contributes the most to false positives, and try to update it.

If we invert the contribution values for classes in which a positive contribution is considered bad, we can draw a heat map of feature performance per classification.
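Here is a minimal sketch of that inversion and heat map, assuming seaborn is available and the TP/FP/TN/FN labels from the sketch above; for the FP and TN rows the true label is benign, so a push toward the attack class is bad:

import seaborn as sns

signed_df = cls_mean_df.copy()
signed_df.loc[["FP", "TN"]] *= -1  # invert the rows where a positive contribution hurts
sns.heatmap(signed_df, center=0, annot=True, cmap="RdYlGn")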

We can also use the same values to score our features:

cls_mean_df.mean()
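To rank the features by that score (a presentation step added here):

print(cls_mean_df.mean().sort_values(ascending=False))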

This is the flow we used: extract features and train the model, calculate the SHAP contributions on the test set, filter out near-zero values, group by classification, and average the results to score the features.


Takeaway

We believe that knowing more about feature contributions will help you learn about both your model and your data. That will lead to better results, and it will be easier and faster to achieve them. You can use analytics tools to calculate your feature metrics and use them during development and in production.

About the Author/ODSC Europe 2022 Speaker on Feature Contribution

Ori Nakar is a principal cyber-security researcher, a data engineer, and a data scientist at Imperva Threat Research group. Ori has many years of experience as a software engineer and engineering manager, focused on cloud technologies and big data infrastructure. In the Threat Research group, Ori is responsible for the data infrastructure and involved in analytics projects, machine learning, and innovation projects.

More on his ODSC Europe 2022 session:

Tuning a model is a core element of a data scientist’s work. It is often very difficult, requiring both experience and expertise to do effectively. An important and integral part of the model tuning process is the feature selection process. This is because in many cases, the model itself is a “black box,” which makes it hard to understand the features’ performance.

In one case, we had created a working model and we decided to update our test data. That is when everything fell apart. Inexplicably, model accuracy dropped instantly. To fix the problem, we first tried looking at the misclassifications but we saw nothing remarkable. They all looked different from each other. Next, we calculated the feature contributions of the model’s predictions and performed an analysis on top of it. This process helped us find the trouble-making features. We’ll show how we did it by describing our experience with tuning a model using analytics on the features’ contribution data.

 
