Working on behalf of an insurance company, our goal is to determine whether its clients would potentially be interested in purchasing auto insurance. For this purpose we will build a model classifying whether or not a client would be interested. At our disposal we have information about the clients (gender, age, region), their vehicle (age of the vehicle, presence of damage) and their insurance contract (annual premium paid by the client, channel used to contact the client). Finally (the part we are interested in), we want to explain the predictions of the model (e.g. why a given customer was classified as a potential car insurance policy holder) using SHAP.
The dataset used for this project can be found here.
import os, sys
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import xgboost as xgb
import lightgbm as lgb
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV, cross_validate, StratifiedKFold
from sklearn.preprocessing import StandardScaler, RobustScaler, OneHotEncoder, OrdinalEncoder, Binarizer
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from imblearn.under_sampling import RandomUnderSampler
from category_encoders import TargetEncoder
import shap #SHAP LIBRARY
shap.initjs()
pd.options.mode.chained_assignment = None # default='warn'
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
#Printing the shape of the train and test set:
print(f'The train set has {train.shape[0]} rows and {train.shape[1]} columns')
print(f'The test set has {test.shape[0]} rows and {test.shape[1]} columns')
variables = {'id':'Unique ID for the customer',
'Gender':'Gender of the customer',
'Age':'Age of the customer',
'Driving_License':'0 : Customer does not have DL, 1 : Customer already has DL',
'Region_Code':'Unique code for the region of the customer',
'Previously_Insured':'1 : Customer already has Vehicle Insurance, 0 : Customer doesn\'t have Vehicle Insurance',
'Vehicle_Age':'Age of the Vehicle',
'Vehicle_Damage':'1 : Customer got his/her vehicle damaged in the past. 0 : Customer didn\'t get his/her vehicle damaged in the past.',
'Annual_Premium':'The amount customer needs to pay as premium in the year',
'Policy_Sales_Channel':'Anonymized code for the channel used to reach out to the customer, i.e. different agents, over mail, over phone, in person, etc.',
'Vintage':'Number of Days, Customer has been associated with the company',
'Response':'1 : Customer is interested, 0 : Customer is not interested'}
#Creating a dataframe for data description:
variables_df = pd.DataFrame.from_dict(variables,orient='index')
variables_df.rename(columns={0:'Definition'},inplace=True)
variables_df
#Preview of train set:
train.head()
#Plot of the target variable:
fig, ax = plt.subplots(figsize=(7,5))
sns.barplot(x=train['Response'].value_counts().index, y=train['Response'].value_counts())
plt.title('Value counts of target class')
plt.show()
As we can see, our dataset is imbalanced. Using metrics such as accuracy here is misleading (the F1 score is much more relevant). If we want a well-performing classifier we will have to take care of the imbalance. This issue can be addressed by undersampling the majority class (0: not interested in the auto insurance).
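To make this concrete, here is a minimal sketch (not part of the original notebook) showing that a naive baseline always predicting the majority class looks excellent on accuracy while being useless on the F1 score:
#Hypothetical sanity check: always predict the majority class (0: not interested)
from sklearn.metrics import accuracy_score
y_naive = np.zeros(len(train), dtype=int)
print(f"Accuracy of the naive baseline: {accuracy_score(train['Response'], y_naive):.3f}") # roughly the share of class 0
print(f"F1 score of the naive baseline: {f1_score(train['Response'], y_naive):.3f}") # 0.0, no interested client is ever found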
#Check if the dataset contains NaN values:
train.isna().sum()
#Check if the dataset contains duplicated rows:
train[train.duplicated(subset='id', keep=False)]
#Dropping the 'id' column in the train and test sets:
train.drop('id',axis=1,inplace=True)
test.drop('id',axis=1,inplace=True)
# Binarize 'Gender' & 'Vehicle_Damage':
def binarize_gender(x):
    '''Function to binarize the Gender column'''
    if x == 'Male':
        return 1
    else:
        return 0

def binarize_vehicle_dmg(x):
    '''Function to binarize the Vehicle_Damage column'''
    if x == 'Yes':
        return 1
    else:
        return 0

train['Gender'] = train['Gender'].apply(binarize_gender)
test['Gender'] = test['Gender'].apply(binarize_gender)
train['Vehicle_Damage'] = train['Vehicle_Damage'].apply(binarize_vehicle_dmg)
test['Vehicle_Damage'] = test['Vehicle_Damage'].apply(binarize_vehicle_dmg)
#Splitting features from the target columns:
X_train, y_train = train.drop('Response', axis=1), train['Response']
X_test = test
#Ordinal encoding of the 'Vehicle_Age' column in the train and test sets:
ord_encoder = OrdinalEncoder()
vehicle_age_encoded_train = ord_encoder.fit_transform(np.array(train['Vehicle_Age']).reshape(-1,1))
X_train['Vehicle_Age'] = pd.Series(vehicle_age_encoded_train.ravel())
vehicle_age_encoded_test = ord_encoder.transform(np.array(test['Vehicle_Age']).reshape(-1,1))
X_test['Vehicle_Age'] = pd.Series(vehicle_age_encoded_test.ravel())
#Pipelines and columns transformer:
numerical_col = ['Age', 'Annual_Premium', 'Vintage']
categorical_col = ['Region_Code','Policy_Sales_Channel']
numerical_transformer = Pipeline(steps=[('stand_scaler',StandardScaler())])
categorical_transformer = Pipeline(steps=[('target_encoder', TargetEncoder(cols=categorical_col))])
preprocessor = ColumnTransformer(transformers=[
('num',numerical_transformer,numerical_col),
('cat',categorical_transformer, categorical_col)],remainder='drop', n_jobs=-1)
#Applying our preprocessing to the train and test set:
X_train_processed = preprocessor.fit_transform(X_train, y_train)
X_test_processed = preprocessor.transform(X_test)
#Creating a function to retrieve column names:
def get_column_names_from_ColumnTransformer(column_transformer):
    '''Retrieve the column names after applying a ColumnTransformer'''
    col_name = []
    for transformer_in_columns in column_transformer.transformers_[:-1]:  # the last transformer is ColumnTransformer's 'remainder'
        raw_col_name = transformer_in_columns[2]
        if isinstance(transformer_in_columns[1], Pipeline):
            transformer = transformer_in_columns[1].steps[-1][1]
        else:
            transformer = transformer_in_columns[1]
        try:
            names = transformer.get_feature_names()
        except AttributeError:  # if the transformer has no 'get_feature_names' method, use the raw column names
            names = raw_col_name
        if isinstance(names, np.ndarray):  # e.g. the names returned by OneHotEncoder
            col_name += names.tolist()
        elif isinstance(names, list):
            col_name += names
        elif isinstance(names, str):
            col_name.append(names)
    return col_name
#Building our preprocessed DataFrame:
processed_cols = get_column_names_from_ColumnTransformer(preprocessor)
col_dropped = [col for col in X_train.columns if col not in processed_cols]
X_train_processed_df = pd.DataFrame(X_train_processed, columns=processed_cols)
X_test_processed_df = pd.DataFrame(X_test_processed, columns=processed_cols)
X_train_final = pd.concat([X_train_processed_df,X_train.loc[:,col_dropped]], axis=1)
X_test_final = pd.concat([X_test_processed_df,X_test.loc[:,col_dropped]], axis=1)
#Quick peek at our processed train set:
X_train_final.head()
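Side note: on recent versions of scikit-learn (1.0+), the helper above is no longer necessary because ColumnTransformer exposes get_feature_names_out() directly. A hedged sketch, assuming your scikit-learn and category_encoders versions both implement this method:
#Assumption: scikit-learn >= 1.0 and a category_encoders release that implements get_feature_names_out().
#Names come back prefixed with the transformer name, e.g. 'num__Age' or 'cat__Region_Code'.
print(preprocessor.get_feature_names_out())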
#Performing undersampling of the majority class (0: not interested in the auto insurance):
rus = RandomUnderSampler(sampling_strategy='majority')
X_train_final_rus, y_train_rus = rus.fit_resample(X_train_final,y_train)
plt.figure(figsize = (7,5))
sns.barplot(x = y_train_rus.value_counts().index, y = y_train_rus.value_counts())
plt.title('Value counts of target class after undersampling')
plt.show()
I compared the performance of two gradient boosting algorithms (XGBoost and LightGBM), with and without undersampling, to observe the effect on the metrics (ROC AUC and F1 score).
#XGBOOST WITHOUT UNDERSAMPLING:
xgb_clf = xgb.XGBClassifier()
xgb_clf.fit(X_train_final, y_train)
y_pred_train_xgb_baseline = xgb_clf.predict(X_train_final)
roc_score_train_xgb_baseline = roc_auc_score(y_train, xgb_clf.predict_proba(X_train_final)[:,1])
print(f'ROC AUC score on train set: {roc_score_train_xgb_baseline}')
print(f'F1 score on train set: {f1_score(y_train,y_pred_train_xgb_baseline)}')
xgb_clf_cv = cross_validate(xgb_clf,X_train_final,y_train,scoring=['roc_auc','f1'],cv=10,n_jobs=-1)
roc_auc_val = xgb_clf_cv['test_roc_auc'].mean()
f1_score_val = xgb_clf_cv['test_f1'].mean()
print(f'ROC AUC score on validation set: {roc_auc_val}')
print(f'F1 score on validation set: {f1_score_val}')
#XGBOOST WITH UNDERSAMPLING:
xgb_clf = xgb.XGBClassifier()
xgb_clf.fit(X_train_final_rus, y_train_rus)
y_pred_train_xgb_baseline = xgb_clf.predict(X_train_final_rus)
roc_score_train_xgb_baseline = roc_auc_score(y_train_rus, xgb_clf.predict_proba(X_train_final_rus)[:,1])
print(f'ROC AUC score on train set: {roc_score_train_xgb_baseline}')
print(f'F1 score on train set: {f1_score(y_train_rus,y_pred_train_xgb_baseline)}')
xgb_clf_cv = cross_validate(xgb_clf,X_train_final_rus,y_train_rus,scoring=['roc_auc','f1'],cv=10,n_jobs=-1)
roc_auc_val = xgb_clf_cv['test_roc_auc'].mean()
f1_score_val = xgb_clf_cv['test_f1'].mean()
print(f'ROC AUC score on validation set: {roc_auc_val}')
print(f'F1 score on validation set: {f1_score_val}')
As expected, the ROC AUC score doesn't change much with or without undersampling, whereas the F1 score increases dramatically, going from 0.04 to 0.82 on the validation set. Without undersampling, the model would not perform well when predicting on new data, which means that a good ROC AUC score is misleading here: undersampling is essential, and the F1 score is the metric to focus on.
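To illustrate why the two metrics diverge, here is a small diagnostic sketch (not part of the original notebook). ROC AUC only measures how well the predicted probabilities rank the two classes, while F1 is computed at the default 0.5 threshold, which the minority class rarely reaches when the model is trained on the imbalanced data; lowering the threshold recovers a decent F1 without any resampling:
#Hypothetical diagnostic: refit on the imbalanced data and vary the decision threshold
xgb_imbalanced = xgb.XGBClassifier()
xgb_imbalanced.fit(X_train_final, y_train)
proba = xgb_imbalanced.predict_proba(X_train_final)[:,1]
for threshold in [0.5, 0.3, 0.2, 0.1]:
    preds = (proba >= threshold).astype(int)
    print(f'Threshold {threshold}: F1 = {f1_score(y_train, preds):.3f}')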
#LIGHTGBM WITHOUT UNDERSAMPLING:
lgb_clf = lgb.LGBMClassifier()
lgb_clf.fit(X_train_final, y_train)
y_pred_train_lgb_baseline = lgb_clf.predict(X_train_final)
roc_score_train_lgb_baseline = roc_auc_score(y_train, lgb_clf.predict_proba(X_train_final)[:,1])
print(f'ROC AUC score on train set: {roc_score_train_lgb_baseline}')
print(f'F1 score on train set: {f1_score(y_train,y_pred_train_lgb_baseline)}')
lgb_clf_cv = cross_validate(lgb_clf,X_train_final,y_train,scoring=['roc_auc','f1'],cv=10,n_jobs=-1)
roc_auc_val = lgb_clf_cv['test_roc_auc'].mean()
f1_score_val = lgb_clf_cv['test_f1'].mean()
print(f'ROC AUC score on validation set: {roc_auc_val}')
print(f'F1 score on validation set: {f1_score_val}')
#LIGHTGBM WITH UNDERSAMPLING:
lgb_clf = lgb.LGBMClassifier()
lgb_clf.fit(X_train_final_rus, y_train_rus)
y_pred_train_lgb_baseline = lgb_clf.predict(X_train_final_rus)
roc_score_train_lgb_baseline = roc_auc_score(y_train_rus, lgb_clf.predict_proba(X_train_final_rus)[:,1])
print(f'ROC AUC score on train set: {roc_score_train_lgb_baseline}')
print(f'F1 score on train set: {f1_score(y_train_rus,y_pred_train_lgb_baseline)}')
lgb_clf_cv = cross_validate(lgb_clf,X_train_final_rus,y_train_rus,scoring=['roc_auc','f1'],cv=10,n_jobs=-1)
roc_auc_val = lgb_clf_cv['test_roc_auc'].mean()
f1_score_val = lgb_clf_cv['test_f1'].mean()
print(f'ROC AUC score on validation set: {roc_auc_val}')
print(f'F1 score on validation set: {f1_score_val}')
LightGBM performs slightly better than XGBoost on both the ROC AUC and the F1 score on the validation set. We will use this model to explain the predictions.
Since the ROC AUC score isn't really relevant here, we can try to tune the model's hyperparameters to improve the F1 score. Still, with an F1 score of 0.829 on the train set and 0.823 on the validation set, the improvement can only be minimal.
When it comes to hyperparameter tuning, most people are familiar with GridSearchCV and RandomizedSearchCV. These methods are relatively inefficient because they do not choose the next hyperparameters to evaluate based on previous results. Bayesian optimization is a way to perform informed tuning, meaning that the search for the best parameters depends on the previous results. If you want to know more about Bayesian optimization, check this article.
Thus, I tried to perform Bayesian optimization using the HyperOpt library, but this approach turned out to be too computationally expensive, so I abandoned it.
Because the improvement could only have been minimal, I chose to use the untuned LightGBM model to explain the predictions.
(I leave the code for the Bayesian optimization below if you're interested in reusing it.)
### HyperOpt Parameter Tuning
from hyperopt import tpe, STATUS_OK, Trials, hp, fmin
N_FOLDS = 3
MAX_EVALS = 25
def objective(params, n_folds=N_FOLDS):
    """Objective function for LightGBM hyperparameter tuning"""
    # Perform n-fold cross-validation with the candidate hyperparameters
    # and evaluate with the F1 score
    clf = lgb.LGBMClassifier(**params, random_state=0, verbose=0)
    scores = cross_val_score(clf, X_train_final_rus, y_train_rus, cv=n_folds, scoring='f1')
    # Extract the best score
    best_score = max(scores)
    # The loss must be minimized
    loss = 1 - best_score
    # Dictionary with the information needed by hyperopt
    return {'loss': loss, 'params': params, 'status': STATUS_OK}
space = {'max_depth': hp.choice('max_depth',range(2,8)),
'learning_rate': hp.uniform('learning_rate',0.001,0.3),
'n_estimators': hp.choice('n_estimators',range(100,1000)),
'num_leaves': hp.choice('num_leaves',range(10,80)),
'reg_alpha': hp.uniform('reg_alpha',0.01,0.5)}
# Algorithm
tpe_algorithm = tpe.suggest
# Trials object to track progress
bayes_trials = Trials()
# Optimize
best = fmin(fn = objective, space = space, algo = tpe.suggest, max_evals = MAX_EVALS, trials = bayes_trials)
best #Best model
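One caveat with hp.choice: fmin returns the index of the chosen option, not the value itself. hyperopt's space_eval maps the indices back onto the search space; a minimal sketch, assuming the search above completed:
#space_eval converts the indices returned by fmin back into actual hyperparameter values
from hyperopt import space_eval
best_params = space_eval(space, best)
print(best_params) #Can be unpacked below with lgb.LGBMClassifier(**best_params)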
best_lgb_clf = lgb.LGBMClassifier() #Plug in the best parameters found above (e.g. lgb.LGBMClassifier(**best_params))
best_lgb_clf.fit(X_train_final_rus, y_train_rus)
y_pred_train_lgb_best = best_lgb_clf.predict(X_train_final_rus)
roc_score_train_lgb_best = roc_auc_score(y_train_rus,best_lgb_clf.predict_proba(X_train_final_rus)[:,1])
print(f'ROC AUC score on train set: {roc_score_train_lgb_best}')
print(f'F1 score on train set: {f1_score(y_train_rus,y_pred_train_lgb_best)}')
best_lgb_clf_cv = cross_validate(best_lgb_clf,X_train_final_rus,y_train_rus,scoring=['roc_auc','f1'],cv=10,n_jobs=-1)
best_roc_auc_val = best_lgb_clf_cv['test_roc_auc'].mean()
best_f1_score_val = best_lgb_clf_cv['test_f1'].mean()
print(f'ROC AUC score on validation set: {best_roc_auc_val}')
print(f'F1 score on validation set: {best_f1_score_val}')
#Computing predictions probabilities and classes:
lgb_clf.fit(X_train_final_rus, y_train_rus)
preds_proba = lgb_clf.predict_proba(X_test_final)[:,1]
preds_test = lgb_clf.predict(X_test_final)
pred_target_df = pd.Series(preds_test, name='Target_pred')
pred_proba_df = pd.Series(preds_proba, name='Target_proba')
#Test set sample with predictions:
pd.concat([X_test,pred_proba_df,pred_target_df], axis=1).head(n=5)
Developed by Lundberg and Lee (2017), SHAP is a library created to explain the predictions of a machine learning model.
Rooted in game theory, SHAP relies on the concept of Shapley values.
#Computing shapley values:
explainer = shap.TreeExplainer(lgb_clf) # TreeSHAP is a variant of SHAP for tree-based machine learning models such as decision trees, random forests and gradient boosted trees.
shap_values = explainer.shap_values(X_train_final_rus) # Compute shapley values on the train set
shap.summary_plot(shap_values[0],X_train_final_rus, plot_type='bar')
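Before interpreting the plots, it can be reassuring to check SHAP's additivity property. A minimal sanity check, assuming a SHAP version that (as the [0]/[1] indexing above suggests) returns one array of Shapley values per class and uses the raw log-odds output for LightGBM:
#For every row, expected_value + sum of the Shapley values should equal the model's raw log-odds output for class 1
p = np.clip(lgb_clf.predict_proba(X_train_final_rus)[:,1], 1e-12, 1 - 1e-12)
raw_log_odds = np.log(p / (1 - p))
reconstructed = explainer.expected_value[1] + shap_values[1].sum(axis=1)
print(np.allclose(raw_log_odds, reconstructed, atol=1e-3)) #Should print True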
When talking about feature importance, SHAP provides a natural definition based on Shapley values: features with a large mean |Shapley value| are more important, which means that they contribute more to the predictions.
Looking at our feature importance plot, we see that Previously_Insured, Vehicle_Damage and Policy_Sales_Channel are the three features that contribute most to the predictions.
In a high-dimensional setting, SHAP feature importance can also be used to reduce dimensionality by removing the features with a low mean |Shapley value| (see the sketch below).
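As a hedged illustration of that idea, one could rank the features by their mean |Shapley value| and keep only the top ones (the cut-off below is arbitrary, chosen for the example):
#Hypothetical SHAP-based feature selection: rank features by mean absolute Shapley value
mean_abs_shap = np.abs(shap_values[1]).mean(axis=0)
shap_importance = pd.Series(mean_abs_shap, index=X_train_final_rus.columns).sort_values(ascending=False)
print(shap_importance)
top_features = shap_importance.head(5).index.tolist() #Keep only the 5 most important features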
shap.summary_plot(shap_values[0],X_train_final_rus,plot_type='dot', plot_size=(20,10))
The SHAP summary plot combines feature importance with feature effects.
Each point on the plot is the Shapley value of one feature for one instance. The position is defined by the Shapley value on the x-axis and the feature on the y-axis. The color bar on the right represents the feature value, from low (blue) to high (red).
Here, a low value of Previously_Insured (0: not insured) means a negative SHAP value that reduces the probability of being interested in a car insurance (reminder: the output is represented as the sum of the SHAP values). Conversely, a high value of Previously_Insured (1: insured) means a positive SHAP value that increases the probability of being interested in a car insurance.
Nevertheless, some feature effects are difficult to interpret on this plot because the data were scaled. For a better understanding, it helps to look at predictions at the individual level.
def force_plot(ind_num):
    plot = shap.force_plot(base_value=explainer.expected_value[1], shap_values=shap_values[1][ind_num,:],
                           features=X_train_final.loc[ind_num,:], feature_names=X_train_final.columns, link='logit')
    return plot

def frame(ind_num):
    return pd.DataFrame(X_train.loc[ind_num,:]).rename(columns={ind_num:'Values'}).T
frame(3)
force_plot(3)
Force plots are really useful to understand a single prediction. For this client, the model predicts a probability of 0.61 and consequently classifies the client as interested in a car insurance. Looking at the plot, we see that, starting from the base value (the average prediction), Age, Policy_Sales_Channel, Vehicle_Damage and Previously_Insured are the main features increasing the probability.
How can this be interpreted?
Well, the fact that this client is 21 years old, has already been insured, that his vehicle has not been damaged, and that he was reached through the channel with code 152 increases his probability of being interested in a car insurance.
(Note: continuous values on the force plot are scaled.)
frame(1)
force_plot(1)
For this client, the model predicts a near-zero probability, and as a result he is classified as not interested in a car insurance.
The fact that this client has not been previously insured, that his vehicle has not received any damage, and that he was reached through the channel with code 26 reduces the probability of being interested in the insurance below the base value.
shap.force_plot(base_value=explainer.expected_value[1], shap_values=shap_values[1][:1000,:], features=X_train_final_rus.iloc[:1000,:], feature_names=X_train_final_rus.columns, link='logit')
If we are interested in looking at multiple observations, we can simply plot a stacked force plot. The plot consists of many force plots, each of which explains the prediction of one instance. The force plots are rotated vertically and placed side by side according to their clustering similarity.
Here we clearly see two groups, based mostly on Previously_Insured, Vehicle_Damage and Policy_Sales_Channel similarities: those not interested, with a large blue part (reducing the probability), and those interested, with a large red part (increasing the probability).
def age_stand_back(age_stand):
    '''Since the Age feature has been standardized, convert a scaled value back to the original age'''
    mean_age = X_train['Age'].mean()
    std_age = X_train['Age'].std()
    age = int((age_stand * std_age) + mean_age)
    return age
age_stand_back(-0.955), age_stand_back(0.65)
A stacked force plot can also be used to observe the effect of each feature. Taking the Age feature as an example, we can see on the plot that being between 24 and 48 years old increases the probability, while older ages reduce it.
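The same per-feature view can be obtained more directly with a SHAP dependence plot; a minimal sketch for the Age column (keep in mind that Age is still standardized here, so the x-axis is in standard deviations, not years):
#Dependence plot of the (standardized) Age feature against its Shapley values
shap.dependence_plot('Age', shap_values[1], X_train_final_rus, interaction_index=None)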
The aim of this project was to demonstrate how SHAP can be used to explain ML models and, in turn, to produce new knowledge with high added value. Based on game theory through Shapley values, SHAP has a solid theoretical foundation, which is a strong advantage for explainability.
Taking an example from the field of insurance, we demonstrated how Shapley values can be used to compute feature importance and feature effects at a global scale, but also for each instance of a dataset.
To summarize, SHAP can be used to compute feature importance on a global scale, to visualize feature effects, and to explain individual predictions.