Ramgopal Prajapat:

Learnings and Views

Gini Index Explained with Worked out Examples

By: Ram on Jun 18, 2020

One of the Decision Tree algorithms is CART (Classification and Regression Tree). CART is developed by Breiman, Friedman, Olshen, & Stone in 1984 (Book - Classification and Regression Trees).

Classification Tree: When a decision or target variable is categorical, the decision tree is a classification decision tree.  e.g. predicting whether a customer will default or not (Binary Target variable). Or predicting food choices of the customers (nominal variable) using a set of independent variables is also an example of Classification Decision Tree.

Regression Tree: When the decision or target variable is a continuous variable, the decision tree is called the regression decision tree. e.g. predicting house prices using attributes of houses such as the size of a house, type of house (independent, apartment etc) and others. The independent variables can continuous and categorical variables.

CART algorithm can be used for building both Classification and Regression Decision Trees. The impurity (or purity) measure used in building decision tree in CART is Gini Index. The decision tree built by CART algorithm is always a binary decision tree (each node will have only two child nodes).

 

Gini Index

 

Gini Index: Explained for Binary Target Variable (Binary Classification)

Consider an example, the target variable is Binary variable which means it takes two values (Yes and No). There can be 4 combinations of target variable values.


|1|1|
|0|0|

P(Target=1)P(Target=1) + P(Target=1).P(Target=0) + P(Target=0).P(Target=1) + P(Target=0).P(Target=0) = 1

P(Target=1).P(Target=0)  + P(Target=0).P(Target=1)   = 1 – P2(Target=0) – P2(Target=1)

 

Gini Index for Binary Target variable is

= 1 – P2(Target=0) – P2(Target=1)

Gini Index Prob

 Pt : Proportion of observations with target variable value t. In Binary t takes value 0 and 1.

Similarly if Target Variable is a categorical variable with multiple levels, the Gini Index will be still similar. If Target variable takes k different values, the Gini Index will be

Gini Index Target k

 
The maximum value of the Gini Index could be when all target values are equally distributed.

For Binary Target variable, Max Gini Index value

= 1 - (1/2)2 - (1/2)2
= 1 - 2*(1/2)2
= 1- 2*(1/4)
= 1-0.5
= 0.5

Similarly, for Nominal variable with k level, the maximum value Gini Index is

= 1 - 1/k

Minimum value of Gini Index will be 0 when all observations belong to one label.

 

Binary Target Variable: Worked out Example

Consider an example where the target variable is binary, the summary table for such an example will be similar to the below table.

Binary Target Gini Index

  

Gini Index = 1 - 0.74- 0.262

                       = 1-0.5476 -0.0676

= 0.3848

Nominal Target Variable: Worked out Example

When the target variable is a nominal variable with different levels, we can calculate Gini Index in a similar way. When the target variable is nominal the summary table looks similar to the below table. In India, there are a number of different cuisines available and people have different food preferences. We have considered a few options and the table below shows the proportion of people with their food preferences. This is an illustrative example.

Nominal Target Gini Index

 

 Gini Index =  1 – 0.082  - 0.132 - 0.292 -0.52

= 1 -  0.006 -  0.018 -  0.083 - 0.250

                               = 0.643

GINI of a split

GINI (s,t)             = GINI (t) – PL GINI (tL) – PR GINI (tR)

 

Where

s                              : split

t                              : node

GINI (t)                 : Gini Index of input node t

PL                            : Proportion of observation in Left Node after split, s

GINI (tL)                : Gini of Left Node after split, s

PR                            : Proportion of observation in Right Node after split, s

GINI (tR)               : Gini of Right Node after split, s

Example

For Example, banks and financial institutions grant credit facilities after evaluating the credit risk involved. Credit risk involved in credit decisions is evaluated using a Credit Scorecard. Also, there are a few additional decisions involved in credit underwriting.

Leave a comment