Ramgopal Prajapat:

Learnings and Views

Cox Regression - Survival Modeling

By: Ram on Dec 19, 2020

Cox Regression: Overview

In the previous blog on Survival Modeling, we have given an overview of Survival Modeling and also described on Kaplan Meier estimator and Log Rank test.

In this blog, the focus is on Cox Proportional Hazards (PH) model. Cox Proportional Hazards model is also called as Cox Model, Cox Regression, or Proportional Hazard Model.

Cox Model is used for the analysis of Survival data and finding out the relationship between Survival Time and covariates (also called explanatory variables or predictors).

Survival Modeling involves both time (survival time) and covariates. Kaplan-Meier focuses only on time. Cox Proportional Hazard Model considers both time and explanatory variables.

Cox Model formulation has two parts - baseline hazard function and exponential transformation of Covariates.

PH(t,X) = h0(t) e (b0+b1x1 +b2x2+…)

One of the key assumptions in Cox Regression is that the proportional hazards do not depend on the covariates. And exponential expression involves only covariates and does not depend on time. So, covariates are time-independent. When covariates are time-dependent, a parametric survival model or extended Cox Model is used.

Cox Model or Cox Regression is semi-parametric regression. It is semi-parametric as the baseline hazard function could take a different distribution.

The Cox Regression model is different from Multiple Regression as the first involves both time and covariate but the latter only covariates.

The Cox Regression model has an exponential formulation similar to logistic regression but the baseline hazard function in addition to logistic regression.

Also, multiple regression uses the least square estimation for parameter estimation whereas logistic regression leverage maximum likelihood estimation. Cox Regression uses maximum likelihood but “partial” likelihood. Since the likelihood formulation considers only “failed” cases and not the censored, it is called partial likelihood estimation.

Hazard Ratio in the Cox Regression Model

In Cox Model, Hazard is formulated as

h(t,X) = h0(t) e (b0+b1x1 +b2x2+…)

Consider example for i and j cases

hi(t) = h0(t) e (b0+b1x1i +b2x2i+…)                                                         ..... (1)

hj(t) = h0(t) e (b0+b1x1j +b2x2j+…)                                                         ..... (2)

Hazard Ratio for i and j

hi(t)/hj(t) = e (b1(x1i - x1j ) +b2(x2i - x2j))

Hazards Ratio  hi(t)/hj(t)  does not depend on the time. This should be parallel across time points for the cases. If we want to check per unit change for a variable  X1  keeping other variables constant then Hazard Ratio will be

hi(t)/hj(t) = e b1

It does not depend on value of X, similar to Odds Ratio in Logistic Regression.

Hazard Ratio is Hazard for one individual to Hazard for the second individual. And it is dependent on a set of exploratory variable values for these individuals. The difference between Odd Ratio and Hazard Ratio is that Hazard Ratio is the ratio of failure rates whereas Odd Ratio is the ratio of proportion.

We will show an example later in the blog.

Business Context and Data

We are using the same context of wealth management as discussed in the previous blog. We want to understand the factors impacting attrition/retention rates across 24 months.

Data will have a set of independent/explanatory variables and two dependent/target variables - time to event/attrition and whether the customer has attrited/censored.

Cox Regression in R

In the survival package in R, coxph the function can be used for building the Cox Proportional Hazard Model.

install.packages("survival")
library(survival)

Surv function is used for creating a survival object before using coxph function.

Some of the main parameters in coxph function are

the formula: Defines the input Cox Regression Model. - a formula object, with the response on the left of a ~ operator, and the terms on the right. The response must be a survival object as returned by the surv function.

data: Input data frame which has the variables listed in the formula, subset, and weights argument.

weights: When the input sample is biased and if the analyst wants to provide weights of cases. This is a vector of case weights.

ties: A couple of methods to handle Ties in Cox Regression in R. This is a character string specifying the method for tie handling. Options are “Efron”, “Breslow”, and “exact”.

Build a Survival Model - using Cox Regression

A simple model is built using Cox Regression. The explanatory variable is the customers' age.

cox.ph.model <- coxph(Surv(Time2Event, attrition)~CUSTOMER_AGE,
                      data=wsurv.data)

Summary of Survival Model

Summary of the model results.

summary(cox.ph.model )

## Call:
## coxph(formula = Surv(Time2Event, attrition) ~ CUSTOMER_AGE, data = wsurv.data)
## 
##   n= 1957, number of events= 607 
## 
##                  coef exp(coef) se(coef)     z Pr(>|z|)    
## CUSTOMER_AGE -0.02067   0.97954  0.00239 -8.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##              exp(coef) exp(-coef) lower .95 upper .95
## CUSTOMER_AGE      0.98       1.02     0.975     0.984
## 
## Concordance= 0.606  (se = 0.012 )
## Rsquare= 0.041   (max possible= 0.99 )
## Likelihood ratio test= 82.9  on 1 df,   p=0
## Wald test            = 74.8  on 1 df,   p=0
## Score (logrank) test = 74.6  on 1 df,   p=0

Hazard Ratio Cox Regression Model estimates the Hazard Ratio and does not calculate the baseline function.

The summary result has coef and exp(coef). These shows coefficient for the variable Age and exponentiation of coefficient.

Hazard Ratio = H1/H0 = e(b0+b0Age1)/e(b1+b1Age0)

H1/H0 = e b1

So, the Hazard Ratio for the variable age in the Cox Regression Model is exp(coef). It indicates that the attrition rate goes down by (1-0.979538) = 3% with one unit change in age.

Survival Model

The model should have a two-component - baseline hazard function and exponential transformation. But the current summary output has only one component exponential model.

We can get the baseline hazard function using basehaz function. survfit function could also be used for baseline hazard function estimation.

base.hazard <- survfit(Surv(Time2Event, attrition)~1,
 data=wsurv.data)

basehaz(cox.ph.model)

## hazard time
## 1 0.01630 3
## 2 0.03095 31
## 3 0.04586 59
## 4 0.06206 91
## 5 0.08955 122
## 6 0.11036 153
## 7 0.12452 181
## 8 0.13725 213
## 9 0.15985 244
## 10 0.17372 276
## 11 0.19020 304
## 12 0.20944 335
## 13 0.22419 367
## 14 0.24621 398
## 15 0.27344 426
## 16 0.30774 458
## 17 0.31740 486
## 18 0.32227 517
## 19 0.32577 549
## 20 0.32717 577
## 21 0.33351 608
## 22 0.33634 640
## 23 0.34061 671
## 24 0.34705 699
## 25 0.35281 731

We can run a Stratified Cox Model. The strata variable could be gender. We can either use the strata option to define/give strata variables in this case it is gender. Or we can create dummy variables and use them as explanatory variables.

cox.ph.model <- coxph(Surv(Time2Event, attrition)~(CUSTOMER_AGE)*strata(GENDER),
                      data=wsurv.data)
summary(cox.ph.model)

Cox Regression - Interpret Result and Predict

In the survival package in R, the coxph function can be used for building the Cox Proportional Hazard Model.

install.packages("survival")

library(survival)

cox.ph.m <- coxph(Surv(Time2Event, attrition)~CUSTOMER_AGE+MoB,

                      data=wsurv.data)

Summary output

summary(cox.ph.m )

## Call:

## coxph(formula = Surv(Time2Event, attrition) ~ CUSTOMER_AGE +

##     MoB, data = wsurv.data)

##

##   n= 1957, number of events= 607

##

##                  coef exp(coef) se(coef)     z Pr(>|z|)   

## CUSTOMER_AGE -0.00310   0.99690  0.00443  -0.7     0.48   

## MoB          -0.27636   0.75854  0.01824 -15.2

cox_regression_output

The summary output gives information on a formula used for the Cox Regression Model. Also, it gives sample information, in this sample size is 1957 observations and the number of events(attrition) is 607.

Similar to other regression output, Cox Regression output has a beta coefficient, standard error, statistics, and P-value.

Cox Regression has an additional column for eb (exponentiation of beta coefficients).

The goodness of Fit Test for each variable is to assess whether a variable is significant or not. It is done using Z statistics and its P-Value.

For assessing the overall Cox Regression Model, a few statistics are calculated and shown in the summary output.

Concordance: Similar to the concordance level in logistic regression, concordance shows the fraction of pairs in the sample, where the observations with the higher survival time have the higher probability of survival predicted by the model.

Rsquare: R2 is a measure variance explained by the model under consideration.

Likelihood ratio test, Wald test, and Score (logrank) test tests are using for testing Global Null Hypothesis Beta =0. In this example, the P-Value for each of these tests is 0, so the null hypothesis could not be accepted.

One of the great ways to measure model predictive power is by comparing predicted vs actual results.

Prediction using Cox Regression

predict function can be used for finding out the predicted values using the Cox Model developed. In this example, the same sample has been used for calculating predicted values.

If we want to find Hazard Probability, we can use the model we have built. The below function gives Hazard Probability

Hazard Function

Now, we will discuss steps to get values using R.

type=“lp” calculate linear predicted (xb). We can then use the baseline probability for a time and do the exponential transformation of linear prediction.

lp.pred <- predict(cox.ph.m,

                  type="lp",

                  data=wsurv.data)

# Baseline Function

base <- basehaz(cox.ph.m)

 

# Base value H0 = 0.025880965 at 59 Days

# Predicted Value at time 59 days

 

Pred.val <- 0.025880965*exp(lp.pred)

Using Predicted Probability, we can create a predicted target level. Now, we have predicted probabilities at a time: 59 days. We can define the target value (actual at a time 59 days). Similarly, for different time periods, we can get baseline probabilities using basehaz function output.

Pred.target <- ifelse(Pred.val>0.09,1,0)

Considering that the input data has attrition/censoring values at different times, we have to define observed target values at 59 days for an appropriate comparison.

target <- ifelse(wsurv.data$Time2Event <=59,    0,

                 wsurv.data$attrition)

Model Accuracy: Confusion Matrix

Using the Predicted and Observed target level, a confusion matrix is created. For this example, the accuracy level & confusion matrix is created at 59 days.

ConfusionMatrix

 

The overall accuracy of the model prediction is 79%. Since action will be on predicted group 1, 61% of the 1s are actually ones.

Gains Table and Lift Chart

Gains Table can be created for each of the important time intervals and validate the accuracy of the model.

Gain Table for one of the time point is as follow.

GainTable

 

The lift chart shows the lift of the model at each of the deciles. For the selected time, the lift chart is as follows.

Lift Chart

If we want to calculate Survival Probability, we can use the below code.

install.packages("pec")

library(pec)

Pred_Prob <- predictSurvProb(cox.ph.m,

                newdata=wsurv.data,

                times=c(59))     

# predicts the survival probability based on the model for the dataset -wsurv. data and for the time 59 days

 

References

  • http://stats.stackexchange.com/questions/29815/how-to-interpret-the-output-for-calculating-concordance-index-c-index
  • https://apha.confex.com/apha/134am/techprogram/paper_135906.htm
  • David G. Kleinbaum, Mitchel Klein, Survival Analysis - A Self-Learning Text
  • http://cran.r-project.org/doc/contrib/Fox-Companion/appendix-cox-regression.pdf

Leave a comment