Ramgopal Prajapat:

Learnings and Views

Insurance Claim Fraud Modeling - Step by step Approach

By: Ram on Sep 03, 2020

Insurance Claim Fraud

Insurance fraud is committed to obtain monetary benefits that the claimant is not entitled to under the insurance policy. It can be committed by insurance company employees, brokers, intermediaries providing services to claimants (e.g. hospitals, auto service garages, etc.), and policyholders.

Fraud can occur when the policy is taken out, for example by hiding facts or information to reduce the policy premium.

The claim process is another important area in which to look for fraud. A policyholder may try to get benefits for losses that were never incurred, or claim a higher amount than the actual loss.

In the US alone, the FBI estimates that the total cost of insurance fraud (excluding health insurance) is more than $40 billion per year.

In insurance claims, many interesting problems are being solved using Machine Learning, Deep Learning, and AI.

Some Machine Learning modeling scenarios in insurance (by no means an exhaustive list):

  • Predicting Potential Claim by a Policyholder: What are the chances that a policyholder will file a claim?
  • Claim Volume Forecasting
  • Predicting Claim Fraud at First Report
  • Refreshing Claim Fraud Model at various stages of the Claim Process
  • Predicting Litigation Probability on a Claim
  • Predicting Subrogation Potential for Claim
  • Based on Auto Accident or Property Damage Images - Estimating Claim Amount

In this blog, we use a sample dataset to model insurance claim fraud. The objective is to find ways to predict whether a claim is fraudulent or not.

Modeling Methodology:

* Understanding Objective
* Data and Context
* Exploratory Data Analysis (EDA)
* Feature Engineering and Variable Transformation
* Model Variable Selection
* Model Selection 
* Validation: Training and Testing Samples

Data and Context

In a claim process, typically three sets of information are available: Customer, Policy, and Claim. The claims adjudication process involves assessing this information before approving a claim. In this data sample, we have features related to these three sections, and we aim to predict whether the claim is fraudulent or not.


# read data 
import pandas as pd

claim_insurance = pd.read_csv("http://ramgopalprajapat.com/static/files/2020/09/02/insurance_claims.csv")


# Remove last column 
claim_insurance=claim_insurance.drop('_c39', axis=1)

# Row and Column Counts
print(claim_insurance.shape)

# Column Names
print(claim_insurance.columns)

We need to find the date columns so that we can create features from them and then drop the original columns.

Date columns available in the data are: policy_bind_date, incident_date, and auto_year (a model year rather than a full date).

We can create the following interval features from these date features:

Months between incident_date and auto_year
Months between policy_bind_date and incident_date
Months between policy_bind_date and auto_year

import numpy as np
import pandas as pd

# Parse the date columns (dates look like 2014-10-17; auto_year holds only a year)
incident_date = pd.to_datetime(claim_insurance.incident_date, format="%Y-%m-%d")
bind_date = pd.to_datetime(claim_insurance.policy_bind_date, format="%Y-%m-%d")
auto_date = pd.to_datetime(claim_insurance.auto_year.astype(str), format="%Y")

# Months are approximated as days / 30.4375 (average month length); a month is not
# a fixed-length unit, and recent pandas versions may reject np.timedelta64(1, 'M')
days_per_month = 30.4375
claim_insurance['bind_months'] = (incident_date - bind_date).dt.days / days_per_month
claim_insurance['auto_incident_months'] = (incident_date - auto_date).dt.days / days_per_month
claim_insurance['bind_auto_months'] = (bind_date - auto_date).dt.days / days_per_month


# Plot Histograms
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.hist(claim_insurance['bind_months'], bins=15, cumulative=False)
ax.set_xlabel('Bind Months')

fig, ax = plt.subplots()
ax.hist(claim_insurance['auto_incident_months'], bins=15, cumulative=False)
ax.set_xlabel('Months Between Auto Year and Incident')

fig, ax = plt.subplots()
ax.hist(claim_insurance['bind_auto_months'], bins=15, cumulative=False)
ax.set_xlabel('Months Between Auto Year and Policy Bind Date')




# Drop Date Columns
claim_insurance=claim_insurance.drop(['policy_bind_date' , 'incident_date','auto_year'], axis=1)

Summary Statistics

We need to check the summary statistics to see if there is anything unusual. The describe() method gives us the summary statistics, and we transpose the result with transpose() so that all the columns/features can be seen clearly.



policy_number and insured_zip are not relevant as numeric features, since both are identifiers. ZIP code can still be used indirectly to see whether fraud is more common in certain ZIP codes, which we check below.

policy_deductable looks like an ordinal variable. We need to check the levels of this variable.

umbrella_limit is an interesting variable: it has a very low (negative) minimum and a very high maximum, while the values at the 25th, 50th, and 75th percentiles are all 0.

number_of_vehicles_involved, bodily_injuries, and witnesses are ordinal variables.


Policy Deductible


1000    351
500     342
2000    307
Name: policy_deductable, dtype: int64

We can do one-hot encoding for this variable instead of using the raw variable. The Pandas package has a get_dummies method to create dummy variables.

# Merge dummy variables using merge (joined on the row index)
claim_insurance = claim_insurance.merge(pd.get_dummies(claim_insurance['policy_deductable'], prefix='policy_deductable', dtype=int),
                                        left_index=True, right_index=True)
# drop original variable
claim_insurance=claim_insurance.drop(['policy_deductable'], axis=1)

Umbrella Limit

An umbrella insurance policy is extra liability coverage above the limits of home and auto insurance. We need to explore the distribution and get some useful insights.


It is evident that a value of 0 indicates that no umbrella insurance policy is associated with the motor insurance policy. The value -1000000 may be a data issue with the minus sign.

Now, instead of using the amount, we may want to create an indicator variable that shows whether there is associated umbrella coverage.
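A minimal sketch of such an indicator, shown on a toy frame rather than the claim data. Treating the negative limit as a sign error (and taking its absolute value) is an assumption, and the column name has_umbrella is hypothetical.

```python
import pandas as pd

# Toy stand-in for the umbrella_limit column (values are illustrative)
toy = pd.DataFrame({"umbrella_limit": [0, 0, 5000000, -1000000, 0, 6000000]})

# Assumption: a negative limit is a sign error, so use the absolute value
toy["umbrella_limit"] = toy["umbrella_limit"].abs()

# 1 if the policyholder has any umbrella coverage, else 0
toy["has_umbrella"] = (toy["umbrella_limit"] > 0).astype(int)

print(toy["has_umbrella"].tolist())
```

On the claim data, the same two lines could be applied to claim_insurance before dropping the original umbrella_limit column.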

Zip Code

ZIP code could be an important variable if it is linked to insurance fraud. We can do frequency encoding for ZIP codes, but first we need to check the number of unique ZIP codes.

fq = claim_insurance.groupby('insured_zip').size()/len(claim_insurance)


There are 1000 rows in the data and 995 unique ZIP codes, so almost every row has its own ZIP code. The variable behaves like an ID and may not be useful for the analysis.

So, it is better to drop this variable.
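The check and the drop can be sketched as follows, on a toy frame with illustrative ZIP codes. The column name insured_zip_fq is hypothetical; the frequency encoding reuses the fq expression from above.

```python
import pandas as pd

# Toy ZIP codes (illustrative values); only one code repeats
toy = pd.DataFrame({"insured_zip": [466132, 468176, 430632, 608117, 610706, 466132]})

# Share of distinct values: close to 1 means the column behaves like an ID
unique_ratio = toy["insured_zip"].nunique() / len(toy)

# Frequency encoding: map each ZIP code to its relative frequency
fq = toy.groupby("insured_zip").size() / len(toy)
toy["insured_zip_fq"] = toy["insured_zip"].map(fq)

print(f"unique ratio: {unique_ratio:.2f}")
print(toy["insured_zip_fq"].round(2).tolist())

# With an ID-like column the encoded values are nearly constant, so drop both
toy = toy.drop(columns=["insured_zip", "insured_zip_fq"])
```

When nearly every value is unique, the frequency-encoded column carries almost no variation, which is why dropping the variable is the simpler choice here.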

Number of vehicles involved

In an incident, the number of vehicles involved could be an important indicator of a fraudulent claim. The distribution shows that almost 94% of incidents involved 1 or 3 vehicles. We can group the rows with 2 and 4 vehicles together instead of creating separate columns/dummy variables.


We can create a bivariate analysis table.

# Count claims by number of vehicles and fraud flag
number_of_vehicles = claim_insurance.groupby(['number_of_vehicles_involved','fraud_reported']).size().reset_index(name='counts')

# Restructure
number_of_vehicles = number_of_vehicles.pivot(index='number_of_vehicles_involved', columns='fraud_reported', values='counts')
# Fill with 0
number_of_vehicles = number_of_vehicles.fillna(0)
# Rename the columns (pivot orders them as N, Y)
number_of_vehicles.columns = ['fraud_reported_N', 'fraud_reported_Y']

# Calculate the fraud rate for each number of vehicles
number_of_vehicles['pct_obs'] = (number_of_vehicles['fraud_reported_N']+number_of_vehicles['fraud_reported_Y'])/(sum(number_of_vehicles['fraud_reported_N'])+sum(number_of_vehicles['fraud_reported_Y']))
number_of_vehicles['pct_fraud_reported_N']= number_of_vehicles['fraud_reported_N']/(number_of_vehicles['fraud_reported_N']+number_of_vehicles['fraud_reported_Y'])
number_of_vehicles['pct_fraud_reported_Y']= number_of_vehicles['fraud_reported_Y']/(number_of_vehicles['fraud_reported_N']+number_of_vehicles['fraud_reported_Y'])


Interesting observation: when the number of vehicles is odd, the percentage of fraud reported is lower. Only 6% of the incidents involved an even number of vehicles, but we may still create an indicator variable to capture the association with fraud.

claim_insurance['odd_vehicles']=np.where( (claim_insurance.number_of_vehicles_involved==1) | (claim_insurance.number_of_vehicles_involved==3) , 1, 0)

# Drop 'number_of_vehicles_involved'
claim_insurance= claim_insurance.drop(['number_of_vehicles_involved'], axis=1)


Bodily Injuries

In motor insurance, bodily injury can also be claimed. This variable takes the values 0, 1, and 2. Probably 0 indicates no bodily injury claim and the other values indicate the number of claims.

We can create dummy variables for each of these levels using one-hot encoding.


# Merge dummy variables using merge (joined on the row index)
claim_insurance = claim_insurance.merge(pd.get_dummies(claim_insurance['bodily_injuries'], prefix='bodily_injuries', dtype=int),
                                        left_index=True, right_index=True)
# drop original variable
claim_insurance=claim_insurance.drop(['bodily_injuries'], axis=1)


Witnesses

If there are witnesses to the incident, it gives some confidence that the case is not fraudulent. Again, we can create dummy variables from the levels of this variable.


# Merge dummy variables using merge (joined on the row index)
claim_insurance = claim_insurance.merge(pd.get_dummies(claim_insurance['witnesses'], prefix='witnesses', dtype=int),
                                        left_index=True, right_index=True)
# drop original variable
claim_insurance=claim_insurance.drop(['witnesses'], axis=1)

EDA for Categorical Variables

This data contains a number of categorical variables. We can assess which of these variables are relevant and what type of grouping/encoding should be used for them.

The function below assigns each variable a type based on its scale of measurement.

Variable Type - Measurement Scale

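# Find Continuous and Categorical Features
def featureType(df):
    import pandas as pd
    from pandas.api.types import is_numeric_dtype

    columns = df.columns
    rows = len(df)
    # Type labels below are assumed names for each measurement scale
    colTypeBase = []
    for col in columns:
        uniq = df[col].nunique()
        if rows > 10:
            if is_numeric_dtype(df[col]):
                if uniq == 1:
                    colTypeBase.append('Unary')
                elif uniq == 2:
                    colTypeBase.append('Binary')
                elif rows/uniq > 3 and uniq > 5:
                    colTypeBase.append('Continuous')
                else:
                    colTypeBase.append('Ordinal')
            else:
                if uniq == 1:
                    colTypeBase.append('Category-Unary')
                elif uniq == 2:
                    colTypeBase.append('Category-Binary')
                else:
                    colTypeBase.append('Nominal')
        else:
            # Too few rows to infer a measurement scale
            if is_numeric_dtype(df[col]):
                colTypeBase.append('Numeric')
            else:
                colTypeBase.append('Non-numeric')
    # Create dataframe
    df_out = pd.DataFrame({'Feature': columns,
                           'BaseFeatureType': colTypeBase})
    # Sort by BaseFeatureType
    df_out = df_out.sort_values('BaseFeatureType')
    return df_out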



Two variables, fraud_reported (the label/target variable) and insured_sex, are categorical with only two levels, so we can convert them to binary (0/1) variables.

print('Reported Fraud \n',claim_insurance.fraud_reported.value_counts())

print('Gender of Insured Person \n',claim_insurance.insured_sex.value_counts())


claim_insurance['fraud']= np.where(claim_insurance['fraud_reported']=='Y', 1, 0)

claim_insurance['insured_male']= np.where(claim_insurance['insured_sex']=='MALE', 1, 0)


# Distribution of new column

print('Fraud: \n', claim_insurance.fraud.value_counts())

print('Insured Male: \n', claim_insurance.insured_male.value_counts())


# drop original columns
claim_insurance = claim_insurance.drop(['fraud_reported', 'insured_sex'], axis=1)


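Next, we get a level count report for each of the categorical columns.

# Find Feature Type
colType = featureType(claim_insurance)
# Select Nominal Features
catFeatures = colType.loc[colType['BaseFeatureType'] == 'Nominal', 'Feature']

for feature in catFeatures:
    print(claim_insurance[feature].value_counts())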


For the first iteration, we may want to drop the below variables.

  • incident_location: Too many values with almost no repeats, so drop the feature.
  • insured_hobbies: For 1000 rows of observations, this variable has too many levels. We may drop this column.
  • insured_occupation: Too many levels, so drop it.
  • auto_model: Too many levels, so drop it.
  • auto_make: This is not a clear-cut drop, but we will drop it for the first iteration.

The remaining variables will be considered for a weight of evidence (WOE) based transformation. WOE can be really useful here, considering the small dataset and the fairly long list of categorical variables. If we created one-hot encodings for each level, there might be too many features.

# drop variables with too many levels
claim_insurance = claim_insurance.drop(['incident_location','insured_hobbies','insured_occupation','auto_model','auto_make'], axis=1)

Weight of Evidence (WOE) based Transformations
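Before the helper functions, here is a small worked example of the quantity they compute. For each level of a categorical variable, WOE is taken here as ln(share of fraud cases in the level / share of non-fraud cases in the level), with a small constant added to avoid division by zero. The counts are illustrative and not taken from the claim data.

```python
import math

# Illustrative counts per level: (non_fraud_count, fraud_count)
counts = {"YES": (200, 100), "NO": (300, 50), "?": (250, 100)}

total_good = sum(good for good, _ in counts.values())  # all non-fraud claims
total_bad = sum(bad for _, bad in counts.values())     # all fraud claims
eps = 0.00001  # same smoothing constant as in woe_cat below

woe = {}
for level, (good, bad) in counts.items():
    # Positive WOE: the level holds a larger share of fraud than of non-fraud
    woe[level] = math.log((bad / total_bad + eps) / (good / total_good + eps))

for level, value in woe.items():
    print(f"{level}: {value:.3f}")
```

A level with a positive WOE is over-represented among fraudulent claims, a negative WOE means it is under-represented, and a value near 0 means the level carries little information.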

def woe_cat(df, var, label):
    import numpy as np
    grp = df.groupby([var, label]).size().reset_index(name='counts')
    # Restructure
    pivot = grp.pivot(index=var, columns=label, values='counts')
    # Fill with 0
    pivot = pivot.fillna(0)
    # Rename the columns and keep var as a regular column
    pivot.columns = ['label_0', 'label_1']
    pivot = pivot.reset_index()
    # Find total 0 and 1
    sum_0 = pivot.label_0.sum()
    sum_1 = pivot.label_1.sum()
    pivot['woe']= np.log( (pivot['label_1']/sum_1 +0.00001)/(pivot['label_0']/sum_0 +0.00001))
    # Create a bar chart for WOE (one figure per variable)
    import matplotlib.pyplot as plt
    plt.figure()
    plt.bar(pivot[var], pivot['woe'], color='green')
    plt.ylabel("WOE Value")
    plt.title("WOE Across Categories")
    plt.xticks(rotation=90) # change orientation of X axis tick label
    plt.show()
    return pivot
# property_damage_woe=woe_cat(claim_insurance, 'property_damage', 'fraud')

def woe_cat_features(df,woeDF, var):
    new_feature = var+"_woe"
    for r in range(len(woeDF)):
        df.loc[(df[var]== woeDF[var][r]),new_feature] = woeDF['woe'][r]        
# woe_cat_features(claim_insurance, property_damage_woe,'property_damage')

# Transform all the features
def cat_woe_trans(df, LabelVar, catFeature=()):
    for f in catFeature:
        woe = woe_cat(df, f, LabelVar)
        # Add the <feature>_woe column to df
        woe_cat_features(df, woe, f)

cat_woe_trans(claim_insurance,'fraud',catFeature=['property_damage', 'police_report_available', 'collision_type', 'incident_city',
                                          'incident_state', 'policy_state', 'policy_csl', 'authorities_contacted', 'incident_severity',
                                          'insured_education_level', 'incident_type', 'insured_relationship'])


# Summary statistics
claim_insurance.describe().transpose()


We have completed the required feature engineering and transformations. We can now drop the original categorical variables, as we already have their WOE-transformed versions.

claim_insurance = claim_insurance.drop(['property_damage', 'police_report_available', 'collision_type', 'incident_city',
                                          'incident_state', 'policy_state', 'policy_csl', 'authorities_contacted', 'incident_severity',
                                          'insured_education_level', 'incident_type', 'insured_relationship'], axis=1)

Now, we can review the columns/features. We expect all of them to be numeric: binary, continuous, or ordinal. We are ready to proceed to model development.

# drop policy_number
claim_insurance = claim_insurance.drop(['policy_number'], axis=1)

The scale of measurement may play an important role in some algorithms. But we are using tree-based methods and logistic regression, so we should be fine for now.

Split Data to Training and Validation Samples

The sample is not very big, so we create only two samples, Training and Validation, with 70% and 30% of the observations.

label = claim_insurance['fraud']
features = claim_insurance.drop(['fraud'],axis=1)
from sklearn.model_selection import train_test_split
# 70% training, 30% validation
claim_train, claim_test, fraud_train, fraud_test = train_test_split(features, label, test_size=0.3)

Insurance Claim Fraud: Logistic Regression

We have a high number of features compared to the number of data points, so we select a good subset of features and build the model on it. This can be done in a number of ways, each with its own limitations.

Information Value is one of them. Here, we use the Recursive Feature Elimination (RFE) method available in scikit-learn.
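For comparison, this is a rough sketch of how Information Value (IV) could be computed for one categorical feature. IV sums, over the levels, the difference between the fraud share and the non-fraud share multiplied by the level's WOE. The function name information_value and the counts are hypothetical; this is not part of the RFE workflow that follows.

```python
import math

def information_value(counts, eps=0.00001):
    """counts maps each level to (non_fraud_count, fraud_count)."""
    total_good = sum(good for good, _ in counts.values())
    total_bad = sum(bad for _, bad in counts.values())
    iv = 0.0
    for good, bad in counts.values():
        good_share = good / total_good
        bad_share = bad / total_bad
        woe = math.log((bad_share + eps) / (good_share + eps))
        # Each term is non-negative: the share gap and WOE have the same sign
        iv += (bad_share - good_share) * woe
    return iv

# Illustrative features: one separates fraud well, the other barely at all
strong = {"Major": (100, 150), "Minor": (350, 50), "Total": (300, 50)}
weak = {"A": (250, 84), "B": (250, 83), "C": (250, 83)}

print(round(information_value(strong), 3), round(information_value(weak), 5))
```

A feature whose levels have very different fraud and non-fraud shares gets a high IV, so ranking features by IV is one way to shortlist candidates before fitting a model.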

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
logitreg = LogisticRegression()
# We want to select top 20 features (keyword argument required in recent scikit-learn)
rfe = RFE(logitreg, n_features_to_select=20)
rfe = rfe.fit(claim_train, fraud_train.values.ravel())

Select the variables and create a list to exclude variables that are not significant based on p-value.

select_vars =claim_train.columns[rfe.support_].to_list()

exclude = ['witnesses_0','bodily_injuries_2','insured_male','bodily_injuries_1','policy_csl_woe','policy_state_woe']  # list is cut off in the source after this entry
select_vars_updated = np.setdiff1d(select_vars,exclude)


import statsmodels.api as sm



Multi-collinearity among the set of features plays a role in the stability of the model. We can check whether the final model features have a high level of multi-collinearity.

Variance Inflation Factor (VIF) = 1/(1-R^2), where R^2 comes from regressing one feature on all the other features, is one of the commonly used metrics to assess multi-collinearity.
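To make the formula concrete, here is a small pure-Python sketch for the two-feature case, where the R^2 from regressing one feature on the other equals the squared Pearson correlation. The helper names and the data are illustrative assumptions.

```python
def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def vif_two_features(x, y):
    """VIF = 1 / (1 - R^2); with one other feature, R^2 = r^2."""
    r_squared = pearson_r(x, y) ** 2
    return 1.0 / (1.0 - r_squared)

x1 = [1, 2, 3, 4, 5, 6]
x2_related = [2, 4, 5, 8, 11, 12]    # moves almost in step with x1
x2_unrelated = [3, 1, 4, 1, 5, 2]    # little linear relation to x1

print(round(vif_two_features(x1, x2_related), 1))
print(round(vif_two_features(x1, x2_unrelated), 2))
```

A VIF close to 1 means the feature is barely explained by the others, while large values signal redundancy. Note that this sketch uses an R^2 with an intercept; as far as I know, statsmodels' variance_inflation_factor regresses on the columns exactly as given, so a constant column usually needs to be included in the design matrix for the results to match this definition.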

from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif(X):
    # Create empty DataFrame
    vif_df = pd.DataFrame()
    # Add Feature Names
    vif_df["Features"] = X.columns
    # Calculate VIF for each Feature as dependent and rest all as independent
    vif_df["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return vif_df

# Final Model Selected Feature DataFrame
model_features = claim_train[select_vars_updated]
vif_values = vif(model_features)

VIF values are less than 2, so we conclude that there is no significant collinearity among these selected features.

Model Prediction

Now we can select this model for model performance evaluation. We can predict the label using the model and compare it with actual label values. For the selected variables, fit the model using sklearn's LogisticRegression classifier.

# Import logistic Regression classifier
from sklearn.linear_model import LogisticRegression
# Call function with default parameters
logreg = LogisticRegression()
# Model Development: fit the model with data
logreg.fit(claim_train[select_vars_updated], fraud_train)

# Predict The Fraud Class
pred_class = logreg.predict(claim_train[select_vars_updated])

Model Performance: Training

Various model classification metrics are calculated to see the performance on the training sample.

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
# Model Accuracy
print('Model Accuracy(Training) {:.2f}'.format(logreg.score(claim_train[select_vars_updated], fraud_train)))
# Confusion Matrix (stored under a new name so the function is not overwritten)
cm_train = confusion_matrix(fraud_train, pred_class)
print("Confusion Matrix (Training): \n",cm_train)
print("Model Classification Report (Training): \n",classification_report(fraud_train, pred_class))
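To show how the numbers in the classification report relate to the confusion matrix, here is a short sketch that derives them by hand. It assumes the scikit-learn layout for a binary problem, [[TN, FP], [FN, TP]], with fraud as the positive class; the function name and the counts are illustrative, not results from this model.

```python
def metrics_from_confusion(cm):
    """Derive accuracy, precision, recall and F1 from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Illustrative counts only
example_cm = [[480, 45], [60, 115]]
example_metrics = metrics_from_confusion(example_cm)

for name, value in example_metrics.items():
    print(f"{name}: {value:.3f}")
```

When fraud is the minority class, accuracy alone can look good even if many fraudulent claims are missed, so precision and recall for the fraud class are worth reading alongside it.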

Model Performance on Test Sample

# Score validation/test sample: Predict The Fraud Class
test_pred_class = logreg.predict(claim_test[select_vars_updated])

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

# Model Accuracy
print('Model Accuracy(Test) {:.2f}'.format(logreg.score(claim_test[select_vars_updated], fraud_test)))

# Confusion Matrix
cm_test = confusion_matrix(fraud_test, test_pred_class)
print("Confusion Matrix (Test): \n",cm_test)

print("Model Classification Report (Test): \n",classification_report(fraud_test, test_pred_class))


Model performance is very stable across the training and testing samples. In my view, this is one of the great advantages of logistic regression: performance is typically stable across samples if the model is developed rigorously.

Some areas of improvement in this model development process:

  • Feature Transformation: some of the continuous variables can also be explored for WOE-based or mathematical transformations
  • Feature Selection: using IV, Decision Tree, or Random Forest-based feature selection to find the combinations that can improve the model's predictive power


In the next section, we will use the same data sample and build Random Forest and XGBoost models.
