By: Ram on Dec 28, 2020
A decision tree is a hierarchical, tree-like representation of decisions. As a modelling technique, it repeatedly breaks an input data sample (a node) into two or more smaller samples (child nodes). This recursive partitioning continues until a specified stopping condition is met.
A decision tree is a method for objective segmentation. The aim of recursive partitioning is to reduce the impurity of the output nodes, that is, to make each one more homogeneous with respect to the target than the node it came from. The output nodes are called child nodes and the input node is called the parent node. If the algorithm breaks the parent node into exactly two child nodes at each stage, the result is called a binary decision tree.
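The text does not say which impurity measure is used. As a minimal sketch, assuming Gini impurity (one common choice for classification trees) and a two-class target such as default / no default, the impurity of a node can be computed from its default rate alone. The function name `gini_impurity` is illustrative.

```python
def gini_impurity(default_rate: float) -> float:
    """Gini impurity of a two-class node, given the share of one class.

    For two classes, 1 - p^2 - (1 - p)^2 simplifies to 2 * p * (1 - p).
    """
    if not 0.0 <= default_rate <= 1.0:
        raise ValueError("default_rate must be between 0 and 1")
    return 2.0 * default_rate * (1.0 - default_rate)


# A 50/50 node is as mixed as a two-class node can be; a pure node scores 0.
for rate in (0.0, 0.25, 0.5, 1.0):
    print(f"default rate {rate:.2f} -> Gini impurity {gini_impurity(rate):.3f}")
```

Under this measure, a split is useful when the child nodes end up with lower impurity than the parent, in other words when their default rates move away from an even mix.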
For example, banks and financial institutions grant a credit facility only after evaluating the credit risk involved.
Suppose the last two years of customer performance on meeting credit obligations are available to us. We want to understand which variable(s) explain the high risk of customers who defaulted on a credit facility given to them.
The sample has 24 customers. To keep it simple, only customer age and gender are considered. Age is a continuous variable and Gender is a nominal variable.
The input sample has 12 customers who have defaulted on the credit facility, so the default rate is 50%.
We want to understand whether customers in a certain age group have a higher chance of defaulting, or whether one gender has a higher default rate than the other.
A decision tree is one of the techniques that can help us answer these questions. The decision tree process has to find the variable to split on and the split rule for it: a cut-off for a numeric variable, or a grouping of values for a nominal variable. The aim of the split is to make the child nodes purer than the parent, which here means moving their default rates away from the 50% of the input sample.
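To illustrate how a cut-off for a numeric variable might be found, here is a small sketch that tries every midpoint between consecutive distinct values and keeps the one with the lowest weighted impurity. It assumes Gini impurity as the measure, and the ages and default flags are hypothetical toy values, not the 24-customer sample described here. The function names are illustrative.

```python
from typing import List, Optional, Tuple


def gini(labels: List[int]) -> float:
    """Gini impurity of a list of 0/1 labels (1 = defaulted)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_numeric_cutoff(
    values: List[float], labels: List[int]
) -> Tuple[Optional[float], float]:
    """Return (cutoff, weighted impurity) of the best 'value > cutoff' split.

    The cutoff is None if no candidate beats the unsplit node.
    """
    best_cutoff, best_score = None, gini(labels)
    distinct = sorted(set(values))
    # Candidate cut-offs are midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        cutoff = (low + high) / 2.0
        left = [y for x, y in zip(values, labels) if x <= cutoff]
        right = [y for x, y in zip(values, labels) if x > cutoff]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_cutoff, best_score = cutoff, score
    return best_cutoff, best_score


# Hypothetical toy data, not the article's sample.
ages = [25, 32, 38, 41, 47, 52, 58, 63]
defaulted = [1, 1, 1, 0, 0, 1, 0, 0]

cutoff, score = best_numeric_cutoff(ages, defaulted)
print(f"best cut-off: Age > {cutoff}, weighted impurity {score:.3f}")
```

On this toy data the search settles on the midpoint between ages 38 and 41, because that split isolates a group made up entirely of defaulters. For a nominal variable the same scoring applies, but the candidates are groupings of category values rather than midpoints.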
Based on exploratory analysis, the Male group has a higher default rate (63%) than the Female group (25%). The average age of defaulting customers is around 39 years, compared with 47 years for non-defaulting customers.
A decision tree can help find the cut-off for the Age variable and the interaction effect between Age and Gender. Also, when there are many more variables, exploratory data analysis becomes much less effective at selecting variables or finding associations with the target variable.
In this example, the Gender variable is selected for partitioning the input data sample. After the split, there are two samples (or child nodes) – one for each Gender.
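The effect of the Gender split can be checked with a short calculation. The group sizes are not stated in the text; the counts below (16 males with 10 defaults, 8 females with 2 defaults) are an assumption, chosen because they are consistent with 24 customers, 12 defaults, and default rates of about 63% and 25%. Gini impurity is likewise an assumed measure.

```python
def gini(defaults: int, total: int) -> float:
    """Gini impurity of a two-class node from its default count and size."""
    p = defaults / total
    return 2.0 * p * (1.0 - p)


# ASSUMED counts, inferred from the quoted rates rather than given in the text:
# 10 / 16 = 62.5% (about 63%) for males, 2 / 8 = 25% for females.
parent = (12, 24)
children = {"Male": (10, 16), "Female": (2, 8)}

parent_impurity = gini(*parent)
# Each child's impurity is weighted by its share of the parent's customers.
weighted_child_impurity = sum(
    total / parent[1] * gini(defaults, total)
    for defaults, total in children.values()
)

print(f"parent impurity:         {parent_impurity:.4f}")
print(f"weighted child impurity: {weighted_child_impurity:.4f}")
print(f"impurity reduction:      {parent_impurity - weighted_child_impurity:.4f}")
```

Under these assumptions the weighted impurity of the two child nodes is lower than that of the parent, which is what makes Gender a worthwhile split.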
After the split, the default rate in one child node has risen to 63%, from 50% in the parent node. Each of these child nodes is now partitioned further to reduce impurity. Since each child node undergoes the same partitioning process as its parent node, the process is called recursive partitioning.
The left node (Gender = Female) is partitioned on the condition Age > 40, and the right node (Gender = Male) on the condition Age > 50. In this example, Age is the only other variable, so both nodes are partitioned on the same variable. In practice, all the input variables are considered at each node, and the best variable and split point are selected for each node separately.
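The whole procedure can be sketched as a short recursive function: at each node, score every candidate split on every variable, keep the best one, and repeat on the two child nodes until a stopping condition is met. This is a simplified illustration rather than a production algorithm. It assumes Gini impurity, binary splits, one-versus-rest splits for nominal variables, and two stopping rules (`max_depth`, `min_size`) chosen for the example. The customer records are hypothetical.

```python
import json
from typing import Any, Dict, Iterator, List, Tuple

Row = Dict[str, Any]  # one customer, e.g. {"gender": "M", "age": 35, "default": 1}


def gini(rows: List[Row]) -> float:
    """Gini impurity of a node, based on its default rate."""
    if not rows:
        return 0.0
    p = sum(r["default"] for r in rows) / len(rows)
    return 2.0 * p * (1.0 - p)


def candidate_splits(rows: List[Row], features: List[str]) -> Iterator[Tuple[str, Any, bool]]:
    """Yield (feature, value, is_numeric) for every candidate binary split."""
    for feature in features:
        values = sorted({r[feature] for r in rows})
        if all(isinstance(v, (int, float)) for v in values):
            for low, high in zip(values, values[1:]):
                yield feature, (low + high) / 2, True  # midpoint cut-off
        else:
            for value in values:
                yield feature, value, False  # one category versus the rest


def partition(rows: List[Row], feature: str, value: Any, numeric: bool):
    """Split rows into (left, right) using a cut-off or a category match."""
    if numeric:
        matches = [r[feature] <= value for r in rows]
    else:
        matches = [r[feature] == value for r in rows]
    left = [r for r, m in zip(rows, matches) if m]
    right = [r for r, m in zip(rows, matches) if not m]
    return left, right


def build_tree(rows, features, depth=0, max_depth=2, min_size=2) -> Dict[str, Any]:
    """Recursively partition rows; returns a nested dict describing the tree."""
    if not rows:
        raise ValueError("cannot build a node from an empty sample")
    node = {"n": len(rows), "default_rate": sum(r["default"] for r in rows) / len(rows)}
    # Stopping conditions: depth limit reached, or the node is already pure.
    if depth >= max_depth or gini(rows) == 0.0:
        return node
    best, best_score = None, gini(rows)
    for feature, value, numeric in candidate_splits(rows, features):
        left, right = partition(rows, feature, value, numeric)
        if len(left) < min_size or len(right) < min_size:
            continue  # skip splits that create very small child nodes
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if score < best_score:
            best, best_score = (feature, value, numeric, left, right), score
    if best is None:
        return node  # no split reduces impurity, so this stays a leaf
    feature, value, numeric, left, right = best
    node["split"] = f"{feature} <= {value}" if numeric else f"{feature} == {value}"
    node["left"] = build_tree(left, features, depth + 1, max_depth, min_size)
    node["right"] = build_tree(right, features, depth + 1, max_depth, min_size)
    return node


# Hypothetical customers, not the article's 24-customer sample.
customers = [
    {"gender": g, "age": a, "default": d}
    for g, a, d in [
        ("F", 25, 0), ("F", 33, 0), ("F", 41, 0), ("F", 58, 0),
        ("M", 27, 1), ("M", 36, 1), ("M", 48, 1), ("M", 56, 0), ("M", 62, 0),
    ]
]
tree = build_tree(customers, ["gender", "age"])
print(json.dumps(tree, indent=2))
```

On this toy sample the function splits on gender first and then on age within the male node, mirroring the shape of the example in the text. Both variables are evaluated at every node, so a different sample could just as well produce a tree that splits on age first.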
The default rate for male customers aged 50 or below is 77%, compared with the overall default rate of 50% for the sample.
This is a very simple example of building a decision tree. We still need to understand and answer a few more questions about decision trees.