Ramgopal Prajapat:

Learnings and Views

Concepts and Applications of Multiple Regression

By: Ram on Jun 18, 2020

When do you use Multiple Regression?

Based on the scale of measurement, variables can be classified as binary, nominal, ordinal, or continuous (interval and ratio scale). When the decision (target or dependent) variable is continuous, multiple regression is one of the statistical methods available for building the model. Such scenarios are classified as regression problems.

 

Some scenarios and ideas are listed below. The examples span functional areas and business verticals.

Multiple Regression Applications across Industries

| Industry Vertical | Scenario | Scenario Description |
|---|---|---|
| Human Resources | Salary Estimate | Predicting or estimating a person's salary based on attributes such as years of experience, level of education, industry of work, previous salary, etc. |
| Human Resources | Months of Stickiness | Given a high level of employee churn, a multiple regression model estimates, at the time of recruitment, the months of stickiness (expected tenure with the new employer) based on candidate attributes. |
| Human Resources | Resource Demand | Causal forecasting of demand for each technical skill. At most big IT services providers, the bench is an important lever for winning and delivering projects, but it also adds to cost. An accurate estimate of demand by skill can help manage resource requirements at the right cost. |
| Real Estate | House Price Prediction | Predicting house prices from house, locality, and builder characteristics. |
| Real Estate | House Demand Forecast | Developing a forecasting model for the volume of houses on sale in a month, given economic factors, seasonality, and other dimensions. |
| Banking/Financial Services | Customer Value Estimation | Estimating customer value from customer-level attributes. |
| Banking/Financial Services | Customer Spend Value | Spend on a credit card is a strong indicator of customer engagement and of whether the card is the customer's front-of-wallet card. Predicting cardholders' spend value can help product and marketing teams engage customers with an appropriate treatment strategy. |
| Banking/Financial Services | Balance Inflow into Transaction or Savings Account | Predicting the balance expected to be deposited into customers' transaction and savings accounts using customer-level characteristics. |
| Banking/Financial Services | Drivers of Account Open Volume | Building a Marketing or Media Mix Model to find the economic, advertising spend (across media or channels), competitor, and offer-related variables that impact new account open volume in a week. |
| Banking/Financial Services | Portfolio Loss Forecasting | In portfolio risk estimation, the Loan Over Line Equivalent concept is estimated using a multiple regression framework, with account variables (such as account line and outstanding balance at observation) and economic factors as independent variables. Reference: https://www.philadelphiafed.org/-/media/research-and-data/publications/working-papers/2014/wp14-10.pdf |
| Insurance/Financial Services | Claim Amount Estimation | Insurance providers charge a premium based on the estimated claim amount for the target group of customers, along with other factors. The claim could be against a motor, home, or pet policy. The estimated claim amount can also be used for operational cash reserve calculations. Reference: https://www.casact.org/pubs/proceed/proceed87/87354.pdf |
| Healthcare/Insurance | Healthcare Cost | Estimating the healthcare cost of an individual to the healthcare insurer using previous claim history, demographic data, and other data available about the individual. |
| Retailer/CPG | Sales Volume and Return on Investment Modeling | Finding the drivers of retail product sales as a function of spend across media channels, economic factors, and competitor actions. |
| Bank | Revenue Regression Model | Predicting customers' revenue and identifying parameters linked to increased revenue. This helps business bankers realign their priorities and focus. |

 

 

Overall Approach of Regression Model Development

 

  • Objective: What is the objective of a regression model? For example, building a regression model to estimate claim amount given a list of independent variables
  • Data Preparation
    • In the claim regression model, the data sample can be shared on request.
    • Otherwise, you have to think about the data sources and variables you will use.
    • Some of the dimensions of data for a claim scenario are:
      • Policy Details - when the policy was taken, the insured amount, how many times it was renewed, etc.
      • Policy Holder Details - age, gender, and others
      • Claim Details - how many times claimed in the last 3, 6, 9, and 12 months; summary statistics (min, max, avg, total amount) of claim amount in the last 3, 6, 9, and 12 months
    • Variable Creation/transformation
      • Create dummy variables based on categorical/character variables
      • Mathematical transformations - log, square root, and square of numeric variables
  • Exploratory Data Analysis (EDA)
    • Univariate Analysis (One variable at a time)
      • Summary statistics
      • Missing Value
      • Outliers
    • Bivariate Analysis - Claim Amount and Other variables
      • Summary claim amount by categorical variable values e.g. average claim amount for Male & Female
      • Correlation Analysis - e.g. how the claim amount and age are related
  • Model development and validation samples
    • 70% development or training sample and 30% validation or testing sample
  • Model Development
    • Run Multiple Regression
    • Select variables which are significant
    • Step-wise variable selection
    • Assumptions - Residual Analysis and Multi-Collinearity Checks
    • Final model - all variables significant (p-value less than 0.001), high R2 and Adj R2, and overall model F statistic significant (p-value less than 0.001)
  • Score both Development and Validation samples
    • Using Model developed, score both development and validation samples
    • Check MAPE (Mean Absolute % Error) for Development and Validation samples
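
The steps above can be sketched end to end in a few lines. This is a minimal illustration, not the author's actual claim model: the data is synthetic and made up, the column names (age, gender, number of claims) and helper functions are hypothetical, and the fit uses the normal equations in plain Python instead of a statistics package.

```python
import random

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [row[n] for row in m]

def fit_ols(rows, y):
    """Return OLS coefficients [B0, B1, ...] via the normal equations."""
    design = [[1.0] + list(r) for r in rows]  # leading 1.0 is the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)

def predict(coefs, rows):
    return [coefs[0] + sum(b * x for b, x in zip(coefs[1:], r)) for r in rows]

def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

# Data preparation: synthetic policyholder records (assumed, for illustration only).
rng = random.Random(42)
records = []
for _ in range(200):
    age = rng.randint(20, 70)
    gender = rng.choice(["M", "F"])
    n_claims = rng.randint(0, 5)
    amount = 500 + 12 * age + (150 if gender == "M" else 0) + 300 * n_claims + rng.gauss(0, 50)
    records.append((age, gender, n_claims, amount))

# Variable creation: a dummy variable replaces the categorical gender column.
features = [(r[0], 1.0 if r[1] == "M" else 0.0, r[2]) for r in records]
target = [r[3] for r in records]

# 70% development sample and 30% validation sample.
indices = list(range(len(records)))
rng.shuffle(indices)
cut = int(0.7 * len(indices))
dev_idx, val_idx = indices[:cut], indices[cut:]
x_dev, y_dev = [features[i] for i in dev_idx], [target[i] for i in dev_idx]
x_val, y_val = [features[i] for i in val_idx], [target[i] for i in val_idx]

# Model development on the development sample only, then score both samples.
coefs = fit_ols(x_dev, y_dev)
mape_dev = mape(y_dev, predict(coefs, x_dev))
mape_val = mape(y_val, predict(coefs, x_val))
print("Coefficients:", [round(c, 2) for c in coefs])
print(f"MAPE development: {mape_dev:.2f}%  validation: {mape_val:.2f}%")
```

A validation MAPE close to the development MAPE suggests the model generalizes; a much higher validation MAPE points to overfitting. In practice a statistics package would also report the significance tests that this sketch omits.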

 

Multiple Regression Algorithm: Concepts

 

A multiple regression problem is formulated as follows:

Y = B0 + B1 * X1 + B2 * X2 + B3 * X3 + … + Error

 

Y is the target or dependent variable.

X1, X2, X3, etc. are the independent variables or features.

B0 is the intercept, and B1, B2, and B3 are the coefficients of the corresponding independent variables.

The main aim of the model is to find the values of these parameters. The method used for estimating them is Ordinary Least Squares (OLS), which finds the parameter values that minimize the model's overall error, measured as the sum of squared errors.
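
Written out, the OLS criterion looks as follows. This is a sketch in standard notation; the indices i (observation), n (number of observations), and p (number of independent variables) are introduced here for illustration and do not appear in the formulation above.

```latex
\[
\min_{B_0, B_1, \dots, B_p} \; \sum_{i=1}^{n}
\left( Y_i - B_0 - B_1 X_{i1} - B_2 X_{i2} - \dots - B_p X_{ip} \right)^2
\]

% Matrix form, with X the n x (p+1) matrix whose first column is all ones.
% The closed form holds when X^T X is invertible.
\[
\hat{B} = (X^{\top} X)^{-1} X^{\top} Y
\]
```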

The simplest case of multiple regression is Simple Linear Regression, in which only one independent variable is considered. It takes the form

Y = B0 + B1 * X1 + Error

Next, we explain the steps to find the values of the intercept (B0) and B1, the coefficient of the X1 variable.

 

Parameter Estimation in Simple Regression

Simple Regression Parameter Estimation
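
For simple regression, minimizing the sum of squared errors has a well-known closed form: B1 = Sxy / Sxx and B0 = mean(Y) - B1 * mean(X), where Sxy is the sum of products of deviations of X and Y from their means and Sxx is the sum of squared deviations of X. The sketch below applies it to made-up experience and salary figures; the function name and the data are illustrative assumptions.

```python
def fit_simple_regression(x, y):
    """Return (b0, b1) minimizing the sum of squared errors of y = b0 + b1 * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sxy: co-variation of x and y; Sxx: variation of x around its mean.
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_mean - b1 * x_mean
    return b0, b1

# Made-up data: years of experience vs. salary (in thousands).
experience = [1, 2, 3, 4, 5]
salary = [32, 35, 41, 44, 48]

b0, b1 = fit_simple_regression(experience, salary)
print(f"B0 = {b0:.2f}, B1 = {b1:.2f}")
```

With these numbers the script prints B0 = 27.70, B1 = 4.10: each extra year of experience is associated with an estimated 4.1 thousand increase in salary.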

Review Multiple Regression Output

Most analytical tools (such as Python, SAS, R, and SPSS) give similar output for a regression model.

Regression model output typically has three parts.

  • Analysis of Variance (ANOVA)

Regression - ANOVA
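
As a sketch of what the ANOVA part reports: it splits the total variation of Y into a part explained by the model and a residual part, and tests the overall model with an F statistic. The symbols below (n observations, p independent variables, fitted values Ŷ) are notation assumed here; tools differ in how they label these sums of squares.

```latex
\[
\underbrace{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}_{\text{SST (total)}}
=
\underbrace{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}_{\text{SSR (regression)}}
+
\underbrace{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}_{\text{SSE (error)}}
\]

\[
F = \frac{\mathrm{SSR} / p}{\mathrm{SSE} / (n - p - 1)}
\]
```

A large F with a small p-value indicates that at least one coefficient is different from zero.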

  • Regression Model Performance Statistics – R2 and Adj R2

Regression - R2

One of the key performance statistics for a regression model is R2, which indicates the percentage of variance explained by the model. But R2 never decreases when more variables are added, even irrelevant ones, so Adj R2, which penalizes the number of variables, is evaluated to avoid selecting an unnecessarily complicated model.
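
A minimal sketch of both statistics, using the standard definitions R2 = 1 - SSE / SST and Adj R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1). The actual and predicted values are made up for illustration, and the function names are hypothetical.

```python
def r_squared(actual, predicted):
    """Share of the variance in actual explained by predicted."""
    mean_y = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # unexplained
    sst = sum((a - mean_y) ** 2 for a in actual)                # total
    return 1.0 - sse / sst

def adjusted_r_squared(r2, n_obs, n_vars):
    """R2 penalized for the number of independent variables."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_vars - 1)

# Made-up actual values and model predictions.
actual = [32, 35, 41, 44, 48]
predicted = [31.8, 35.9, 40.0, 44.1, 48.2]

r2 = r_squared(actual, predicted)
adj_one_var = adjusted_r_squared(r2, n_obs=len(actual), n_vars=1)
adj_two_vars = adjusted_r_squared(r2, n_obs=len(actual), n_vars=2)
print(f"R2 = {r2:.4f}")
print(f"Adj R2 with 1 variable = {adj_one_var:.4f}, with 2 variables = {adj_two_vars:.4f}")
```

For the same R2, the adjusted value drops as the variable count rises, which is why an extra variable must improve the fit enough to justify its inclusion.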

  • Variable Significance or Parameter Estimates

Regression Parameter Estimates

The main objective of modeling is to find parameter estimates. Variable significance is evaluated based on the T statistic and P-value. In the regression model, the null hypothesis is "Beta Coefficient or Parameter Estimate for a variable is Zero". The P-value is the probability of observing a T statistic at least as extreme as the one computed if the null hypothesis were true, so a low P-value is evidence against the null hypothesis. A variable with a low P-value can therefore be kept in the model.
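
In symbols, the test for a single variable j can be sketched as follows. The notation (n observations, p independent variables, SE for the standard error of an estimate) is assumed here, and the two-sided p-value shown is the convention most tools report by default.

```latex
\[
H_0 : B_j = 0
\qquad
t_j = \frac{\hat{B}_j}{\mathrm{SE}(\hat{B}_j)}
\]

% T_{n-p-1} is a Student t random variable with n - p - 1 degrees of freedom.
\[
\text{p-value}_j = P\left( \left| T_{n-p-1} \right| \ge \left| t_j \right| \right)
\]
```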

Model Selection

  • Variable Significance: All variables in the model have to be significant at the agreed significance level. In practical scenarios, we keep variables with a P-value below 0.001, but this is a judgment call; in other scenarios you may be fine selecting variables with P-values below 0.01.
  • Variance Explained - R2 and Adj R2: the higher the values, the better the model
  • Review Multiple Regression Assumptions
    • Multi-collinearity: VIF (Variance Inflation Factor) can be used to check for multi-collinearity
    • Outliers, Independence of Error and Homoscedasticity: Residual Analysis
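
The VIF check can be sketched as below. For each independent variable, VIF = 1 / (1 - R2), where R2 comes from regressing that variable on all the others. The data and column meanings (age, years insured, number of claims) are made up for illustration, the helpers are hypothetical, and the threshold of roughly 5 to 10 used to flag a variable is a common rule of thumb, not a value from this article.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [row[n] for row in m]

def r_squared_of_regression(rows, y):
    """R2 from an OLS regression of y on the columns of rows (with intercept)."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    coefs = solve(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coefs, d)) for d in design]
    mean_y = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - sse / sst

def vif(rows):
    """VIF per column: 1 / (1 - R2) of that column regressed on the others."""
    result = []
    for j in range(len(rows[0])):
        target = [r[j] for r in rows]
        others = [[v for i, v in enumerate(r) if i != j] for r in rows]
        result.append(1.0 / (1.0 - r_squared_of_regression(others, target)))
    return result

# Made-up rows: (age, years insured, number of claims).
# Years insured tracks age closely, so both should show a high VIF.
rows = [
    (25, 2, 1), (30, 5, 3), (35, 12, 0), (40, 15, 2),
    (45, 22, 2), (50, 25, 0), (55, 32, 3), (60, 35, 1),
]
vifs = vif(rows)
for name, value in zip(["age", "years_insured", "n_claims"], vifs):
    print(f"{name}: VIF = {value:.1f}")
```

Here age and years insured carry nearly the same information, so one of them would typically be dropped or the two combined before the final model is fit.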

 
