By: Ram on Jun 19, 2020
Predictive modeling is an approach to building a statistical relationship between a response variable and a set of independent variables using a data sample (called the development sample). The model so developed is then used to predict values of the response variable on new data.
Predictive models are used across functional areas and industry verticals. A predictive model can be built for identifying target customers for cross-selling, retaining existing customers, and various other business problems. Depending on the type of predictive modeling problem and the levels of the response variable, appropriate statistical and machine learning techniques can be used. Some of the commonly used techniques are logistic regression, decision trees, and random forests.
Now we will walk through the steps involved in developing a predictive model. The first step is to understand the business context and scenario.
In a real-world setting, one has to identify a business problem and understand the business context to define the target variable appropriately. Let's understand this through a scenario.
The marketing department of a bank runs various campaigns for cross-selling products, improving customer retention, and providing better customer service. Now let's assume that the bank wants to cross-sell its term deposit product to existing customers. How can the bank find the list of customers to contact for the term deposit campaign?
Contacting all customers is costly and does not create a good customer experience. So the bank's teams need a better mechanism for selecting which customers to contact.
Relying on the marketing team's judgment or experience is a common approach in banks that are not data-science driven. A better approach is to build a predictive model that identifies the customers who are most likely to respond to the term deposit cross-sell campaign. The bank can then contact customers in order of their predicted probability of take-up, within the limits of the marketing budget. Now we will discuss the data required for building the predictive model.
Before proceeding with predictive or statistical model development, we should assemble, explore, and understand the data.
We need two kinds of data: first, a record of which customers have historically responded and which have not; and second, the features, i.e. the customer information that helps us differentiate term deposit responders from non-responders.
Defining the target variable: the customers contacted historically, with a flag identifying those who took the term deposit.
Features or independent variables: customer information such as demographics, product holdings, and transaction behavior.
For a response model, we need to review historical campaigns that are similar to the current scenario or the planned future campaign roll-out.
For each historical campaign, the response duration (the time between contact and a positive response) is analyzed. We can develop a response curve for each of the selected campaigns. The response curve analysis helps in selecting the performance window and defining the target variable (which customers should be marked 1 and which 0).
Response Rate: the number of customers who responded positively divided by the number of customers contacted.
Performance Window & Target Variable: based on whether a customer responded positively or negatively (or did not respond at all) within the performance window, the target variable is defined as 1 or 0.
In the term deposit cross-sell example, based on historical campaigns, customers who took the term deposit within 30 days can be called responders and the rest non-responders.
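As a minimal sketch of this target definition (the column names here, such as days_to_response, are hypothetical stand-ins, not from the original campaign data):

```python
import pandas as pd

# Hypothetical campaign response data; days_to_response is NaN
# for customers who never responded.
campaign = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "days_to_response": [12, None, 45, 7],
})

# Responders: took the term deposit within the 30-day performance window.
# NaN comparisons evaluate to False, so non-responders get 0.
campaign["target"] = (campaign["days_to_response"] <= 30).astype(int)
```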
Data preparation and treatment involves several steps: aggregating, appending, and merging data. We need one record per customer in the final model data. From the transaction information, we may need to summarize transactions into total, count, min & max spend for the selected time window. Similarly, we may need to summarize other information before merging across the different data sources.
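For illustration, a pandas sketch of the transaction roll-up, assuming a hypothetical transaction table with customer_id and spend columns:

```python
import pandas as pd

# Hypothetical transaction-level data for the selected time window.
txns = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "spend": [120.0, 80.0, 50.0, 300.0, 10.0],
})

# One record per customer: total, count, min & max spend.
customer_spend = txns.groupby("customer_id")["spend"].agg(
    total_spend="sum", txn_count="count", min_spend="min", max_spend="max"
).reset_index()
```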
For some variables, values might be missing, e.g. a customer may not have used a credit card in the last 30 days or may not have taken a home loan. So missing value treatment may be required.
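A minimal sketch of missing value treatment, assuming (purely as an illustration) that a missing credit card spend genuinely means no usage:

```python
import pandas as pd

# Hypothetical merged model data with missing values.
model_data = pd.DataFrame({
    "cc_spend_30d": [250.0, None, 90.0],   # NaN: card not used in last 30 days
    "income": [52000.0, 61000.0, None],
})

# A missing credit card spend here means "no usage", so impute zero.
model_data["cc_spend_30d"] = model_data["cc_spend_30d"].fillna(0)
# For other numeric variables, median imputation is a common default.
model_data["income"] = model_data["income"].fillna(model_data["income"].median())
```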
Once the dataset is ready from all these sources, we need to do exploratory analysis. One of the basic analyses is univariate analysis.
Univariate Analysis
Univariate analysis involves frequency analysis for categorical variables and understanding the distribution and summary statistics of continuous variables. It helps with missing value treatment, understanding distributions, and outlier treatment.
A summary view of the input variables: type, count, and number of missing values.
Summary Statistics for numeric variables.
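In Python, for example, both views can be produced with a few pandas calls (model_data here is a hypothetical stand-in for the merged model data):

```python
import pandas as pd

# Hypothetical merged model data.
model_data = pd.DataFrame({
    "age": [34, 51, None, 29],
    "job": ["admin", "technician", "admin", None],
    "balance": [1200.0, 450.0, 3100.0, 80.0],
})

model_data.info()                                     # variable types and non-missing counts
print(model_data.isna().sum())                        # missing values per variable
print(model_data.describe())                          # summary statistics for numeric variables
print(model_data["job"].value_counts(dropna=False))   # frequency table for a categorical variable
```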
We can use box plots to understand the distribution of continuous variables; they are also a useful tool for outlier analysis, as the sketch below shows.
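A minimal matplotlib sketch, using a hypothetical balance variable:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical account balances; the last value is an obvious outlier.
balance = pd.Series([1200.0, 450.0, 3100.0, 80.0, 52000.0])

plt.boxplot(balance)        # whiskers and fliers highlight potential outliers
plt.ylabel("balance")
plt.show()
```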
Variable Creation and Transformation
Based on the input base variables, we may need to create an additional set of variables.
Some of the reasons for creating new variables are outlined below.
Dummy variable creation: character variables are transformed and new dummy variables are created. How do you create a dummy variable in R/Python? A sketch follows below.
Above, we have grouped the job levels into 5 coarse classes, so we can create (5 - 1) = 4 dummy variables based on the job-level groups formed using WOE (weight of evidence).
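In Python, one common route is pandas' get_dummies (in R, factor encoding via model.matrix() serves the same purpose). A minimal sketch, assuming a hypothetical job_group column holding the 5 coarse classes:

```python
import pandas as pd

# Hypothetical coarse-classed job levels (the 5 groups formed using WOE).
df = pd.DataFrame({"job_group": ["G1", "G2", "G3", "G4", "G5", "G1"]})

# drop_first=True creates (5 - 1) = 4 dummy variables.
dummies = pd.get_dummies(df["job_group"], prefix="job", drop_first=True)
df = pd.concat([df, dummies], axis=1)
print(df.head())
```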
Mathematical transformations for continuous variables: some of the commonly used transformations are log, square root, square, and inverse.
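For example, with numpy (a sketch on a hypothetical balance variable; log1p is used so that zero values are handled safely):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"balance": [0.0, 450.0, 3100.0, 52000.0]})

df["balance_log"]  = np.log1p(df["balance"])   # log(1 + x), safe at zero
df["balance_sqrt"] = np.sqrt(df["balance"])    # square root
df["balance_sq"]   = df["balance"] ** 2        # square
```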
Variable Selection
When we have a long list of independent variables, we may need to reduce it. Dimension reduction is a separate family of techniques; here we are focusing on selecting the most important variables.
There are many variable selection techniques and some of the options are:
Information Value based: select variables that have a higher Information Value, typically above 0.1.
Decision Tree: we can try a few versions of a decision tree and select the variables that appear in the fitted trees.
Stepwise Selection: Forward, Backward, and Stepwise options typically available in logistic regression
Random Forest: RF produces a variable importance output, which we can use to select the most important variables (see the sketch after this list).
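As a sketch of the random forest option with scikit-learn, using synthetic data in place of the prepared model features and target flag:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-ins for the prepared model features and target flag.
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(500, 5)),
                 columns=[f"var_{i}" for i in range(5)])
y = (X["var_0"] + rng.normal(size=500) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)

# Rank variables by importance and keep the top ones.
importance = pd.Series(rf.feature_importances_, index=X.columns)
print(importance.sort_values(ascending=False))
```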
We can create train/development and test/validation samples by splitting the data randomly, typically in a ratio such as 70:30, 60:40, or 50:50.
The development sample is used to train the model, which is then validated on the validation sample to assess whether its performance is stable on unseen data.
In real life, the model is sometimes also validated on an out-of-time sample before deployment.
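A minimal scikit-learn sketch of a 70:30 stratified random split, again with synthetic stand-ins for the prepared features and target:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the prepared features and target flag.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["v1", "v2", "v3"])
y = pd.Series(rng.integers(0, 2, size=100))

# 70:30 development/validation split; stratify=y keeps the response
# rate similar in both samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
```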
Based on business reasons, variable significance, multicollinearity, and predictive performance, we can select the final model. For a binary predictive model, we should check the performance statistics below.
Confusion Matrix
The structure of the confusion matrix is as below.

                 Predicted: 0          Predicted: 1
    Actual: 0    True Negatives        False Positives
    Actual: 1    False Negatives       True Positives
The overall accuracy of the model is around 84%, but it captures only 1033 of 4640 responders. What can be done to improve the capture rate?
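As an illustration of how accuracy and capture rate are computed from the confusion matrix (the toy counts below are hypothetical, not the results of the model above):

```python
from sklearn.metrics import confusion_matrix, accuracy_score

y_true = [0, 0, 0, 1, 1, 1, 1, 0]   # hypothetical actual responses
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]   # hypothetical predictions at a 0.5 cutoff

# For binary labels, ravel() returns counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = accuracy_score(y_true, y_pred)
capture_rate = tp / (tp + fn)       # share of actual responders the model catches

# Lowering the probability cutoff below 0.5 typically raises the capture
# rate (more responders flagged) at the cost of more false positives.
print(accuracy, capture_rate)
```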