By: Ram on Jun 18, 2020
One of the Decision Tree algorithms is CART (Classification and Regression Tree). CART is developed by Breiman, Friedman, Olshen, & Stone in 1984 (Book - Classification and Regression Trees).
Classification Tree: When a decision or target variable is categorical, the decision tree is a classification decision tree. e.g. predicting whether a customer will default or not (Binary Target variable). Or predicting food choices of the customers (nominal variable) using a set of independent variables is also an example of Classification Decision Tree.
Regression Tree: When the decision or target variable is a continuous variable, the decision tree is called the regression decision tree. e.g. predicting house prices using attributes of houses such as the size of a house, type of house (independent, apartment etc) and others. The independent variables can continuous and categorical variables.
CART algorithm can be used for building both Classification and Regression Decision Trees. The impurity (or purity) measure used in building decision tree in CART is Gini Index. The decision tree built by CART algorithm is always a binary decision tree (each node will have only two child nodes).
Consider an example, the target variable is Binary variable which means it takes two values (Yes and No). There can be 4 combinations of target variable values.
|1|1|
|0|0|
P(Target=1)P(Target=1) + P(Target=1).P(Target=0) + P(Target=0).P(Target=1) + P(Target=0).P(Target=0) = 1
P(Target=1).P(Target=0) + P(Target=0).P(Target=1) = 1 – P2(Target=0) – P2(Target=1)
Gini Index for Binary Target variable is
= 1 – P2(Target=0) – P2(Target=1)
Pt : Proportion of observations with target variable value t. In Binary t takes value 0 and 1.
Similarly if Target Variable is a categorical variable with multiple levels, the Gini Index will be still similar. If Target variable takes k different values, the Gini Index will be
The maximum value of the Gini Index could be when all target values are equally distributed.
For Binary Target variable, Max Gini Index value
= 1 - (1/2)2 - (1/2)2
= 1 - 2*(1/2)2
= 1- 2*(1/4)
= 1-0.5
= 0.5
Similarly, for Nominal variable with k level, the maximum value Gini Index is
= 1 - 1/k
Minimum value of Gini Index will be 0 when all observations belong to one label.
Consider an example where the target variable is binary, the summary table for such an example will be similar to the below table.
Gini Index = 1 - 0.742 - 0.262
= 1-0.5476 -0.0676
= 0.3848
When the target variable is a nominal variable with different levels, we can calculate Gini Index in a similar way. When the target variable is nominal the summary table looks similar to the below table. In India, there are a number of different cuisines available and people have different food preferences. We have considered a few options and the table below shows the proportion of people with their food preferences. This is an illustrative example.
Gini Index = 1 – 0.082 - 0.132 - 0.292 -0.52
= 1 - 0.006 - 0.018 - 0.083 - 0.250
= 0.643
GINI of a split
GINI (s,t) = GINI (t) – PL GINI (tL) – PR GINI (tR)
Where
s : split
t : node
GINI (t) : Gini Index of input node t
PL : Proportion of observation in Left Node after split, s
GINI (tL) : Gini of Left Node after split, s
PR : Proportion of observation in Right Node after split, s
GINI (tR) : Gini of Right Node after split, s
Example
For Example, banks and financial institutions grant credit facilities after evaluating the credit risk involved. Credit risk involved in credit decisions is evaluated using a Credit Scorecard. Also, there are a few additional decisions involved in credit underwriting.