Project : Insurance cross sell prediction explainability


Description

Working on behalf of an insurance company, our goal is to determine whether its clients would potentially be interested in purchasing auto insurance. For this purpose we will build a model classifying whether or not a client would be interested. At our disposal we have information about the clients (gender, age, region), their vehicle (age of the vehicle, presence of damage) and their insurance contract (premium paid by the client, channel used to contact the client). Finally (the part we are interested in), we want to explain the predictions of the model (e.g. why a given customer was classified as a potential car insurance policyholder) using SHAP.

The dataset used for this project can be found here.

Packages

In [209]:
import os, sys
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import xgboost as xgb 
import lightgbm as lgb

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV, cross_validate, StratifiedKFold
from sklearn.preprocessing import StandardScaler, RobustScaler, OneHotEncoder, OrdinalEncoder, Binarizer
from sklearn.metrics import  f1_score, roc_auc_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from imblearn.under_sampling import RandomUnderSampler
from category_encoders import TargetEncoder

import shap #SHAP LIBRARY


shap.initjs()
pd.options.mode.chained_assignment = None  # default='warn'
In [210]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

Data Description

In [211]:
#Printing the shape of the train and test set:

print(f'The train set has {train.shape[0]} rows and {train.shape[1]} columns')

print(f'The test set has {test.shape[0]} rows and {test.shape[1]} columns')
The train set has 381109 rows and 12 columns
The test set has 127037 rows and 11 columns
In [212]:
variables = {'id':'Unique ID for the customer',
'Gender':'Gender of the customer',
'Age':'Age of the customer',
'Driving_License':'0 : Customer does not have DL, 1 : Customer already has DL',
'Region_Code':'Unique code for the region of the customer',
'Previously_Insured':'1 : Customer already has Vehicle Insurance, 0 : Customer doesn\'t have Vehicle Insurance',
'Vehicle_Age':'Age of the Vehicle',
'Vehicle_Damage':'1 : Customer got his/her vehicle damaged in the past. 0 : Customer didn\'t get his/her vehicle damaged in the past.',
'Annual_Premium':'The amount customer needs to pay as premium in the year',
'Policy_Sales_Channel':'Anonymized code for the channel of outreach to the customer, i.e. different agents, over mail, over phone, in person, etc.',
'Vintage':'Number of Days, Customer has been associated with the company',
'Response':'1 : Customer is interested, 0 : Customer is not interested'}
In [213]:
#Creating a dataframe for data description:

variables_df = pd.DataFrame.from_dict(variables,orient='index')
variables_df.rename(columns={0:'Definition'},inplace=True)

variables_df
Out[213]:
Definition
id Unique ID for the customer
Gender Gender of the customer
Age Age of the customer
Driving_License 0 : Customer does not have DL, 1 : Customer al...
Region_Code Unique code for the region of the customer
Previously_Insured 1 : Customer already has Vehicle Insurance, 0 ...
Vehicle_Age Age of the Vehicle
Vehicle_Damage 1 : Customer got his/her vehicle damaged in th...
Annual_Premium The amount customer needs to pay as premium in...
Policy_Sales_Channel Anonymized code for the channel of outreach to ...
Vintage Number of Days, Customer has been associated w...
Response 1 : Customer is interested, 0 : Customer is no...
In [214]:
#Preview of train set:

train.head()
Out[214]:
id Gender Age Driving_License Region_Code Previously_Insured Vehicle_Age Vehicle_Damage Annual_Premium Policy_Sales_Channel Vintage Response
0 1 Male 44 1 28.0 0 > 2 Years Yes 40454.0 26.0 217 1
1 2 Male 76 1 3.0 0 1-2 Year No 33536.0 26.0 183 0
2 3 Male 47 1 28.0 0 > 2 Years Yes 38294.0 26.0 27 1
3 4 Male 21 1 11.0 1 < 1 Year No 28619.0 152.0 203 0
4 5 Female 29 1 41.0 1 < 1 Year No 27496.0 152.0 39 0
In [215]:
#Plot of target variable:

fig, ax = plt.subplots(figsize=(7,5))

sns.barplot(x=train['Response'].value_counts().index, y=train['Response'].value_counts())
plt.title('Value counts of target class')

plt.show()

As we can see, our dataset is imbalanced. Using metrics such as accuracy here is misleading (the F1 score is much more relevant). If we want a well-performing classifier we will have to take care of this imbalance, which can be done by undersampling the majority class (0: not interested in the auto insurance).
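To make this concrete, a trivial classifier that always predicts the majority class already scores a high accuracy while its F1 score is zero, since it never produces a true positive. A quick sanity-check sketch:

# Accuracy of a constant "not interested" classifier: simply the share of the majority class
majority_accuracy = (train['Response'] == 0).mean()
print(f'Accuracy of always predicting 0: {majority_accuracy:.2%}')  # high despite being useless
print('F1 score of always predicting 0: 0.0 (no true positives)')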

Data cleaning and preprocessing

In [216]:
#Check if the dataset contains NaN values:

train.isna().sum()
Out[216]:
id                      0
Gender                  0
Age                     0
Driving_License         0
Region_Code             0
Previously_Insured      0
Vehicle_Age             0
Vehicle_Damage          0
Annual_Premium          0
Policy_Sales_Channel    0
Vintage                 0
Response                0
dtype: int64
In [217]:
#Check if the dataset contains duplicated values:

train[train.duplicated(subset='id', keep=False)]
Out[217]:
id Gender Age Driving_License Region_Code Previously_Insured Vehicle_Age Vehicle_Damage Annual_Premium Policy_Sales_Channel Vintage Response
In [218]:
#Dropping the 'id' column in the train and test sets:

train.drop('id',axis=1,inplace=True)

test.drop('id',axis=1,inplace=True)
In [219]:
# Binarize 'Gender' & 'Vehicle_Damage':

def binarize_gender(x):
    
    ''' Function to binarize the gender column'''
    
    if x == 'Male':
        return 1
    else:
        return 0
    
def binarize_vehicule_dmg(x):
    
    '''Function to binarize the Vehicle_Damage column'''
    
    if x == 'Yes':
        return 1
    else:
        return 0
    
train['Gender'] = train['Gender'].apply(binarize_gender)
test['Gender'] = test['Gender'].apply(binarize_gender)

train['Vehicle_Damage'] = train['Vehicle_Damage'].apply(binarize_vehicule_dmg)
test['Vehicle_Damage'] = test['Vehicle_Damage'].apply(binarize_vehicule_dmg)
In [220]:
#Splitting features from the target columns:

X_train, y_train = train.drop('Response', axis=1), train['Response']
X_test = test
In [221]:
#Ordinal encoding of the 'Vehicle_Age' column in train and test set:

ord_encoder = OrdinalEncoder()
vehicle_age_encoded_train = ord_encoder.fit_transform(np.array(train['Vehicle_Age']).reshape(-1,1))
X_train['Vehicle_Age'] = pd.Series(vehicle_age_encoded_train.reshape(1,-1)[0])

vehicle_age_encoded_test = ord_encoder.transform(np.array(test['Vehicle_Age']).reshape(-1,1))
X_test['Vehicle_Age'] = pd.Series(vehicle_age_encoded_test.reshape(1,-1)[0])
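Note that OrdinalEncoder orders categories lexicographically by default, which here gives '1-2 Year' → 0, '< 1 Year' → 1 and '> 2 Years' → 2, so the codes do not follow the actual vehicle age. A small sketch of how an explicit order could be enforced instead:

# Optional: pass the categories explicitly so the codes follow the vehicle age order
ordered_encoder = OrdinalEncoder(categories=[['< 1 Year', '1-2 Year', '> 2 Years']])
# '< 1 Year' -> 0.0, '1-2 Year' -> 1.0, '> 2 Years' -> 2.0
vehicle_age_ordered = ordered_encoder.fit_transform(train[['Vehicle_Age']])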
In [224]:
#Pipelines and columns transformer:

numerical_col = ['Age', 'Annual_Premium', 'Vintage']
categorical_col = ['Region_Code','Policy_Sales_Channel']

numerical_transformer = Pipeline(steps=[('stand_scaler',StandardScaler())])

categorical_transformer = Pipeline(steps=[('target_encoder', TargetEncoder(cols=categorical_col))])

preprocessor = ColumnTransformer(transformers=[
    ('num',numerical_transformer,numerical_col),
    ('cat',categorical_transformer, categorical_col)],remainder='drop', n_jobs=-1)
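For intuition, target encoding roughly replaces each category by a (smoothed) average of the target observed for that category, turning a high-cardinality categorical column into a single informative numeric column. A naive, unsmoothed illustration of the idea on Region_Code (the actual TargetEncoder additionally shrinks rare categories toward the global mean):

# Illustration only: naive target encoding of Region_Code (no smoothing)
region_target_mean = y_train.groupby(X_train['Region_Code']).mean()
naive_region_encoding = X_train['Region_Code'].map(region_target_mean)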
In [225]:
#Applying our preprocessing to the train and test set:

X_train_processed = preprocessor.fit_transform(X_train, y_train)
X_test_processed = preprocessor.transform(X_test)
In [226]:
#Creating a function to retrieve column names:

def get_column_names_from_ColumnTransformer(column_transformer):
    
    '''Retrieve the column names after applying the ColumnTransformer'''
    
    col_name = []
    for transformer_in_columns in column_transformer.transformers_[:-1]:#the last transformer is ColumnTransformer's 'remainder'
        raw_col_name = transformer_in_columns[2]
        if isinstance(transformer_in_columns[1],Pipeline): 
            transformer = transformer_in_columns[1].steps[-1][1]
        else:
            transformer = transformer_in_columns[1]
        try:
            names = transformer.get_feature_names()
        except AttributeError: # if no 'get_feature_names' function, use raw column name
            names = raw_col_name
        if isinstance(names,np.ndarray): # e.g. get_feature_names() returned an array
            col_name += names.tolist()
        elif isinstance(names,list):
            col_name += names    
        elif isinstance(names,str):
            col_name.append(names)
    return col_name
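Side note: recent scikit-learn versions (1.0+) expose this directly on the ColumnTransformer, provided every inner transformer implements it; a hedged fallback sketch:

# Sketch: use the built-in method when available, otherwise fall back to the helper above
try:
    feature_names = preprocessor.get_feature_names_out()
except AttributeError:
    feature_names = get_column_names_from_ColumnTransformer(preprocessor)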
In [227]:
#Building our preprocessed DataFrame:

col_dropped = [col for col in X_train.columns if col not in get_column_names_from_ColumnTransformer(preprocessor)]

X_train_processed_df = pd.DataFrame(X_train_processed,columns=get_column_names_from_ColumnTransformer(preprocessor))
X_test_processed_df = pd.DataFrame(X_test_processed, columns=get_column_names_from_ColumnTransformer(preprocessor))

X_train_final = pd.concat([X_train_processed_df,X_train.loc[:,col_dropped]], axis=1)
X_test_final = pd.concat([X_test_processed_df,X_test.loc[:,col_dropped]], axis=1)
In [229]:
#Quick peek at our processed train set:

X_train_final.head()
Out[229]:
Age Annual_Premium Vintage Region_Code Policy_Sales_Channel Gender Driving_License Previously_Insured Vehicle_Age Vehicle_Damage
0 0.333777 0.574539 0.748795 0.187163 0.199385 1 1 0 2.0 1
1 2.396751 0.172636 0.342443 0.127662 0.199385 1 1 0 0.0 0
2 0.527181 0.449053 -1.521998 0.187163 0.199385 1 1 0 2.0 1
3 -1.148985 -0.113018 0.581474 0.112760 0.028624 1 1 1 1.0 0
4 -0.633242 -0.178259 -1.378580 0.121776 0.028624 0 1 1 1.0 0
In [230]:
#Performing undersampling on the majority class (0: not interested in the auto insurance)

rus = RandomUnderSampler(sampling_strategy='majority')

X_train_final_rus, y_train_rus = rus.fit_resample(X_train_final,y_train)
In [231]:
plt.figure(figsize = (7,5))

sns.barplot(x = y_train_rus.value_counts().index, y = y_train_rus.value_counts())

plt.title('Value counts of target class after undersampling')

plt.show()

Machine Learning

I compared the performance of two gradient boosting algorithms (XGBoost and LightGBM) with and without undersampling to observe the effect on the metrics (ROC AUC and F1 score).

XGBoost

  • No Undersampling
In [232]:
#XGBOOST WITHOUT UNDERSAMPLING:

xgb_clf = xgb.XGBClassifier()

xgb_clf.fit(X_train_final, y_train)
y_pred_train_xgb_baseline = xgb_clf.predict(X_train_final)

roc_score_train_xgb_baseline = roc_auc_score(y_train, xgb_clf.predict_proba(X_train_final)[:,1])
print(f'ROC AUC score on train set: {roc_score_train_xgb_baseline}')
print(f'F1 score on train set: {f1_score(y_train,y_pred_train_xgb_baseline)}')

xgb_clf_cv = cross_validate(xgb_clf,X_train_final,y_train,scoring=['roc_auc','f1'],cv=10,n_jobs=-1)
roc_auc_val = xgb_clf_cv['test_roc_auc'].mean()
f1_score_val = xgb_clf_cv['test_f1'].mean()

print(f'ROC AUC score on validation set: {roc_auc_val}')
print(f'F1 score on validation set: {f1_score_val}')
ROC AUC score on train set: 0.8760163235016266
F1 score on train set: 0.06681096091803815
ROC AUC score on validation set: 0.8565369592420277
F1 score on validation set: 0.04151269451370867
  • With undersampling
In [233]:
#XGBOOST WITH UNDERSAMPLING:

xgb_clf = xgb.XGBClassifier()

xgb_clf.fit(X_train_final_rus, y_train_rus)
y_pred_train_xgb_baseline = xgb_clf.predict(X_train_final_rus)

roc_score_train_xgb_baseline = roc_auc_score(y_train_rus, xgb_clf.predict_proba(X_train_final_rus)[:,1])
print(f'ROC AUC score on train set: {roc_score_train_xgb_baseline}')
print(f'F1 score on train set: {f1_score(y_train_rus,y_pred_train_xgb_baseline)}')


xgb_clf_cv = cross_validate(xgb_clf,X_train_final_rus,y_train_rus,scoring=['roc_auc','f1'],cv=10,n_jobs=-1)
roc_auc_val = xgb_clf_cv['test_roc_auc'].mean()
f1_score_val = xgb_clf_cv['test_f1'].mean()

print(f'ROC AUC score on validation set: {roc_auc_val}')
print(f'F1 score on validation set: {f1_score_val}')
ROC AUC score on train set: 0.8912025073423655
F1 score on train set: 0.8400213915999771
ROC AUC score on validation set: 0.8521452645976364
F1 score on validation set: 0.8198450269782491

As expected, the ROC AUC score barely changes with or without undersampling, whereas the F1 score jumps from 0.04 to 0.82 on the validation set. Without undersampling, the model would not perform well on new data: a good ROC AUC score is misleading here, so undersampling is essential and the F1 score is the metric to focus on.

LightGBM

In [234]:
#LIGHTGBM WITHOUT UNDERSAMPLING

lgb_clf = lgb.LGBMClassifier()


lgb_clf.fit(X_train_final, y_train)
y_pred_train_lgb_baseline = lgb_clf.predict(X_train_final)
roc_score_train_lgb_baseline = roc_auc_score(y_train, lgb_clf.predict_proba(X_train_final)[:,1])
print(f'ROC AUC score on train set: {roc_score_train_lgb_baseline}')
print(f'F1 score on train set: {f1_score(y_train,y_pred_train_lgb_baseline)}')

lgb_clf_cv = cross_validate(lgb_clf,X_train_final,y_train,scoring=['roc_auc','f1'],cv=10,n_jobs=-1)
roc_auc_val = lgb_clf_cv['test_roc_auc'].mean()
f1_score_val = lgb_clf_cv['test_f1'].mean()

print(f'ROC AUC score on validation set: {roc_auc_val}')
print(f'F1 score on validation set: {f1_score_val}')
ROC AUC score on train set: 0.8624093044607041
F1 score on train set: 0.008509008913186837
ROC AUC score on validation set: 0.857707757245102
F1 score on validation set: 0.006908681887180042
In [235]:
#LIGHTGBM WITH UNDERSAMPLING

lgb_clf = lgb.LGBMClassifier()

lgb_clf.fit(X_train_final_rus, y_train_rus)
y_pred_train_lgb_baseline = lgb_clf.predict(X_train_final_rus)
roc_score_train_lgb_baseline = roc_auc_score(y_train_rus, lgb_clf.predict_proba(X_train_final_rus)[:,1])
print(f'ROC AUC score on train set: {roc_score_train_lgb_baseline}')
print(f'F1 score on train set: {f1_score(y_train_rus,y_pred_train_lgb_baseline)}')

lgb_clf_cv = cross_validate(lgb_clf,X_train_final_rus,y_train_rus,scoring=['roc_auc','f1'],cv=10,n_jobs=-1)
roc_auc_val = lgb_clf_cv['test_roc_auc'].mean()
f1_score_val = lgb_clf_cv['test_f1'].mean()

print(f'ROC AUC score on validation set: {roc_auc_val}')
print(f'F1 score on validation set: {f1_score_val}')
ROC AUC score on train set: 0.8679884844979024
F1 score on train set: 0.8296300500392557
ROC AUC score on validation set: 0.8565216829349349
F1 score on validation set: 0.8230964498858955

LightGBM performs slightly better than XGBoost on both ROC AUC and F1 score on the validation set. We will use this model to explain the predictions.

Bayesian optimisation

Since the ROC AUC score isn't really relevant here, we can try to tune the model's hyperparameters to improve the F1 score. Still, with an F1 score of 0.829 on the train set and 0.823 on the validation set, the improvement can only be minimal.

When it comes to hyperparameter tuning, most people are familiar with GridSearchCV and RandomizedSearchCV. However, these methods are relatively inefficient because they do not choose the next hyperparameters to evaluate based on previous results. Bayesian optimization performs informed tuning: the search for the best parameters depends on the previous results. If you want to know more about Bayesian optimization, check this article.

I tried to perform Bayesian optimization using the HyperOpt library, but it proved too expensive computationally and did not complete.

Because the improvement could only have been minimal, I chose to use the untuned LightGBM model to explain the predictions.

(I leave the code for the bayesian optimization if you're interested in reusing it)

In [23]:
### HyperOpt Parameter Tuning
from hyperopt import tpe
from hyperopt import STATUS_OK
from hyperopt import Trials
from hyperopt import hp
from hyperopt import fmin
In [24]:
N_FOLDS = 3
MAX_EVALS = 25
In [28]:
def objective(params, n_folds = N_FOLDS):
    """Objective function for LightGBM hyperparameter tuning"""

    # Perform n_fold cross-validation with the sampled hyperparameters,
    # evaluated with the F1 score

    clf = lgb.LGBMClassifier(**params, random_state=0, verbose=0)
    scores = cross_val_score(clf, X_train_final_rus, y_train_rus, cv=n_folds, scoring='f1')

    # Extract the best score
    best_score = max(scores)

    # Loss must be minimized
    loss = 1 - best_score

    # Dictionary with information for evaluation
    return {'loss': loss, 'params': params, 'status': STATUS_OK}
In [29]:
space = {'max_depth': hp.choice('max_depth',range(2,8)),
        'learning_rate': hp.uniform('learning_rate',0.001,0.3),
        'n_estimators': hp.choice('n_estimators',range(100,1000)),
        'num_leaves': hp.choice('num_leaves',range(10,80)),
        'reg_alpha': hp.uniform('reg_alpha',0.01,0.5)}
In [ ]:
# Algorithm
tpe_algorithm = tpe.suggest

# Trials object to track progress
bayes_trials = Trials()

# Optimize
best = fmin(fn = objective, space = space, algo = tpe.suggest, max_evals = MAX_EVALS, trials = bayes_trials)
In [102]:
best #Best model 
Out[102]:
{'learning_rate': 0.14268576366595795,
 'max_depth': 0,
 'n_estimators': 14,
 'num_leaves': 60,
 'reg_alpha': 0.3197264198339445}
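A word of caution: for parameters defined with hp.choice, fmin returns the index of the selected option rather than the value itself (here 'max_depth': 0 means the first element of range(2,8), i.e. 2). hyperopt's space_eval can translate the result back into actual values; a minimal sketch:

from hyperopt import space_eval

# Map the indices returned by fmin back to the actual hyperparameter values
best_params = space_eval(space, best)
best_params  # e.g. max_depth becomes 2 instead of index 0

These values could then be passed to lgb.LGBMClassifier(**best_params) in the cell below.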
In [33]:
best_lgb_clf = lgb.LGBMClassifier() #Put the best parameters here (from the previous cells)

best_lgb_clf.fit(X_train_final_rus, y_train_rus)
y_pred_train_lgb_best = best_lgb_clf.predict(X_train_final_rus)

roc_score_train_lgb_best = roc_auc_score(y_train_rus,best_lgb_clf.predict_proba(X_train_final_rus)[:,1])
print(f'ROC AUC score on train set: {roc_score_train_lgb_best}')
print(f'F1 score on train set: {f1_score(y_train_rus,y_pred_train_lgb_best)}')

best_lgb_clf_cv = cross_validate(best_lgb_clf,X_train_final_rus,y_train_rus,scoring=['roc_auc','f1'],cv=10,n_jobs=-1)
best_roc_auc_val = best_lgb_clf_cv['test_roc_auc'].mean()
best_f1_score_val = best_lgb_clf_cv['test_f1'].mean()

print(f'ROC AUC score on validation set: {best_roc_auc_val}')
print(f'F1 score on validation set: {best_f1_score_val}')
ROC AUC score on train set: 0.7900300280852155
F1 score on train set: 0.8193740070193076
ROC AUC score on validation set: 0.7900303672509622
F1 score on validation set: 0.8193779981868665

Predictions on test set

In [236]:
#Computing predictions probabilities and classes:

lgb_clf.fit(X_train_final_rus, y_train_rus)

preds_proba = lgb_clf.predict_proba(X_test_final)[:, 1]
preds_test = lgb_clf.predict(X_test_final)

pred_target_df = pd.Series(preds_test, name='Target_pred')
pred_proba_df = pd.Series(preds_proba, name='Target_proba')
In [237]:
#Test set sample with predictions:

pd.concat([X_test,pred_proba_df,pred_target_df], axis=1).head(n=5)
Out[237]:
Gender Age Driving_License Region_Code Previously_Insured Vehicle_Age Vehicle_Damage Annual_Premium Policy_Sales_Channel Vintage Target_proba Target_pred
0 1 25 1 11.0 1 1.0 0 35786.0 152.0 53 0.002063 0
1 1 40 1 28.0 0 0.0 1 33762.0 7.0 111 0.764131 1
2 1 47 1 28.0 0 0.0 1 40050.0 124.0 199 0.762986 1
3 1 24 1 27.0 1 1.0 1 37356.0 152.0 187 0.026325 0
4 1 27 1 28.0 1 1.0 0 59097.0 152.0 297 0.004745 0

Model Explainability

Developed by Lundberg and Lee (2016), SHAP is a library created to explain the predictions of a machine learning model.

Based on game theory, SHAP uses the concept of Shapley values.
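More precisely, the Shapley value of a feature i is its average marginal contribution to the prediction over all possible coalitions S of the remaining features F \ {i}:

$$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F|-|S|-1)!}{|F|!}\left[f_{S \cup \{i\}}(x_{S \cup \{i\}}) - f_S(x_S)\right]$$

where f_S denotes the model restricted to the features in S. For tree models these values are computed efficiently by the TreeSHAP algorithm used below.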

In [238]:
#Computing shapley values:

explainer = shap.TreeExplainer(lgb_clf) # TreeSHAP is a variant of SHAP for tree-based machine learning models such as decision trees, random forests and gradient boosted trees.
shap_values = explainer.shap_values(X_train_final_rus) # Compute shapley values on the train set
Setting feature_perturbation = "tree_path_dependent" because no background data was given.
LightGBM binary classifier with TreeExplainer shap values output has changed to a list of ndarray

SHAP feature importance

In [239]:
shap.summary_plot(shap_values[0],X_train_final_rus, plot_type='bar')

When talking about feature importance we can think of:

  • Weights in a linear regression (features with a large weight are more important in the model, provided the features are on the same scale).
  • Information gain for tree-based models (features whose splits reduce impurity the most are more important).

SHAP can be used to obtain feature importance based on Shapley values: features with a large average |Shapley value| are more important, meaning they contribute more to the predictions.

Looking at our feature importance plot, we see that Previously_Insured, Vehicle_Damage and Policy_Sales_Channel are the three features that contribute most to the predictions.

In a high-dimensional setting, SHAP feature importance could also be used to reduce dimensionality by removing the features with a low average |Shapley value|, as sketched below.
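A minimal sketch of that idea, using the Shapley values computed above (the 0.01 threshold is purely illustrative):

# Rank features by their mean absolute Shapley value (same quantity as the bar plot above)
mean_abs_shap = pd.Series(np.abs(shap_values[0]).mean(axis=0),
                          index=X_train_final_rus.columns).sort_values(ascending=False)

# Keep only the features above an (arbitrary) importance threshold
selected_features = mean_abs_shap[mean_abs_shap > 0.01].index.tolist()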

SHAP Summary plot

In [240]:
shap.summary_plot(shap_values[0],X_train_final_rus,plot_type='dot', plot_size=(20,10))

The SHAP summary plot combines feature importance with feature effects.

Each point on the plot is the Shapley value of one feature for one instance. The position is defined by the Shapley value on the x-axis and the feature on the y-axis. On the right, the color bar represents the feature value from low (blue) to high (red).

Here, a low value of Previously_Insured (0: not insured) means a negative SHAP value, which reduces the probability of being interested in a car insurance (reminder: the output is represented as the sum of SHAP values). On the contrary, a high value of Previously_Insured (1: insured) means a positive SHAP value, which increases the probability of being interested in a car insurance.

Nevertheless, some feature effects are difficult to interpret on this plot since the numerical features were scaled. For better comprehension, looking at individual observations can help.

SHAP individual observation

In [241]:
def force_plot(ind_num):
    plot = shap.force_plot(base_value=explainer.expected_value[1], shap_values=shap_values[1][ind_num,:], features=X_train_final.loc[ind_num,:], feature_names=X_train_final.columns, link='logit')
    return plot

def frame(ind_num):
    return pd.DataFrame(X_train.loc[ind_num,:]).rename(columns={0:'Values'}).T
  • Client 1:
In [242]:
frame(3)
Out[242]:
Gender Age Driving_License Region_Code Previously_Insured Vehicle_Age Vehicle_Damage Annual_Premium Policy_Sales_Channel Vintage
3 1.0 21.0 1.0 11.0 1.0 1.0 0.0 28619.0 152.0 203.0
In [243]:
force_plot(3) 
Out[243]:
(Interactive SHAP force plot; rendering requires the JavaScript library loaded by shap.initjs().)

Force plots are really useful when we want to understand a single prediction. For this client, we observe that the model predicts a probability of 0.61 and consequently classifies the client as interested in a car insurance. Looking at the plot, we see that, starting from the base value (the average prediction), Age, Policy_Sales_Channel, Vehicle_Damage and Previously_Insured are the main features increasing the probability.

How can this be interpreted?

Well, the fact that this client is 21 years old, has already been insured, that his vehicle has not been damaged, and that he was reached through the channel with code 152 increases his probability of being interested in a car insurance.

(Note: continuous values on the force plot are scaled.)
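Under the hood, the force plot relies on the additivity of Shapley values: the base value plus the sum of an instance's Shapley values equals the model's raw output, and the link='logit' argument maps this log-odds value to a probability. A quick sanity-check sketch (assuming, as is the case for TreeExplainer on a LightGBM binary classifier, that the SHAP values are expressed in log-odds units):

from scipy.special import expit  # logistic (inverse-logit) function

i = 0  # any row of the undersampled train set used to compute the SHAP values
raw_margin = explainer.expected_value[1] + shap_values[1][i, :].sum()
prob_from_shap = expit(raw_margin)
prob_from_model = lgb_clf.predict_proba(X_train_final_rus.iloc[[i]])[0, 1]
# prob_from_shap and prob_from_model should match up to numerical precision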

  • Client 2:
In [185]:
frame(1)
Out[185]:
Gender Age Driving_License Region_Code Previously_Insured Vehicle_Age Vehicle_Damage Annual_Premium Policy_Sales_Channel Vintage
1 1.0 76.0 1.0 3.0 0.0 0.0 0.0 33536.0 26.0 183.0
In [97]:
force_plot(1)
Out[97]:
(Interactive SHAP force plot; rendering requires the JavaScript library loaded by shap.initjs().)

For this client, the model predicts a near-zero probability, so the client is classified as not interested in a car insurance.

The fact that this client has not been previously insured, that his vehicle has not received any damage, and that he was reached through the channel with code 26 pushes the probability of being interested in the insurance down from the base value.

SHAP multiple observations

In [99]:
shap.force_plot(base_value=explainer.expected_value[1], shap_values=shap_values[1][:1000,:], features=X_train_final_rus.loc[:1000,:],feature_names= X_train_final_rus.columns, link='logit')
Out[99]:
(Interactive stacked SHAP force plot; rendering requires the JavaScript library loaded by shap.initjs().)

If we are interested in looking at multiple observations, we can simply plot a stacked force plot. The plot consists of many force plots, each of which explains the prediction of one instance. The force plots are rotated vertically and placed side by side according to their clustering similarity.

Here we clearly see two groups, based mostly on Previously_Insured, Vehicle_Damage and Policy_Sales_Channel similarities: those not interested, with a large blue part (reducing the probability), and those interested, with a large red part (increasing the probability).

In [264]:
def age_stand_back(age_stand):
    
    '''Map a standardized Age value back to the original scale'''
    
    mean_age = X_train['Age'].mean()
    std_age = X_train['Age'].std()
    
    age = int((age_stand*std_age) + mean_age)
    
    return age
In [266]:
age_stand_back(-0.955), age_stand_back(0.65)
Out[266]:
(24, 48)

A stacked force plot can also be used to observe each feature's effect. Taking the Age feature as an example, the plot shows that being between 24 and 48 years old increases the probability, while being older reduces it.
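For a closer look at a single feature, SHAP also provides a dependence plot, which scatters the (scaled) feature values against their Shapley values for every instance; a minimal sketch for Age:

# SHAP dependence plot of the Age feature (class 1 output)
shap.dependence_plot('Age', shap_values[1], X_train_final_rus)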

Conclusion

The aim of this project was to demonstrate how SHAP can be used to explain ML models and, in turn, extract valuable knowledge from them. Grounded in game theory through Shapley values, SHAP has a solid theoretical foundation, which is a strong advantage for explainability.

Taking an example from the field of insurance, we demonstrated how Shapley values can be used to compute feature importance and effects both at a global scale and for each individual instance of a dataset.

To summarize, SHAP can be used for:

  • Understanding the global and local importance of the features
  • Explaining the predictions of a ML model
  • Comparing two observations
  • Clustering instances by the similarity of feature effects
In [ ]: