The decision tree is a striking algorithm as it can be intuitively explained to anyone. It is also referred to as CART (Classification And Regression Tree). A simple analogy could be how we make decisions in real life. For instance, when planning up for a team meeting, we carefully decide on the timing to ensure the maximum presence of team members. Likewise, we split the data based on different features(time in the example), whichever predictor gives the least error comparatively that will be considered.
Table of contents:
- Classification Requirement
- Discrete data and impurity measures
- Gini index and impurity
- Merits & Demerits
Classification Requirement: Just like any other ML algorithm, the approach is distinctive for classification and regression. But before delving into classification, we need to understand concepts such as Entropy, Information gain and Gini impurity. It all starts with a root node but how do we choose the root? The impurity/purity measures just mentioned above is the solution.
We will create our own dataset for the understanding purpose.
Discrete data and impurity measures: To gain better clarity, let’s take 3 discrete input predictors(Property_Area, Self_employed & Education) to check from which one the branching can begin.
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Property_Area 6 non-null object
1 Self_employed 6 non-null object
2 Education 6 non-null object
3 Target 6 non-null object
memory usage: 320.0+ bytes
Let’s check out the distinctive values under each variable,
print('Property_Area unique values:', train_df['Property_Area'].unique())print('Self_employed unique values:', train_df['Self_employed'].unique())print('Education unique values:', train_df['Education'].unique())
print('Target unique values:', train_df['Target'].unique())#output:
Property_Area unique values: ['Urban' 'Semi-Urban']
Self_employed unique values: ['Yes' 'No']
Education unique values: ['UG' 'PG']
Target unique values: ['Yes' 'No']
Since the data we are dealing with is discrete, we will be computing the counts, proportions and probability. In these scenarios, crosstab is the best choice to quickly view how the target variable is split across different categories in the independent predictors.
Step1: Calculating the parent entropy (i.e) proportion of the output target variable before the split.
p(yes) = 3/ 6= 0.5
p(no) = 3/ 6= 0.5
entropy(parent) = -( p(yes) * log(p(yes)) ) -( p(no) * log(p(no)) ) = 1
When we have balanced data then the proportion split will be equal so the entropy will be 1. Entropy also explains the degree of impurity within data. Say suppose if all the values belong to one single class(log1=0), then entropy is zero. If there is a equal mix of data, then it is highly impure with value as 1.
Step2: Calculation of information gain for each input predictor.
Information gain = H(S) — summation(p(t) * t(s))
H(S) — entropy of the parent node
t(s) — entropy of the subnode, p(t) — proportion within each category
Information gain & entropy for Property_Area:
entropy(Semi-Urban) = -( p(yes) * log(p(yes)) ) -( p(no) * log(p(no)) ) = -(0/3 * log(0/3))-(3/3 * log(3/3)) = 0
entropy(Urban) = -( p(yes) * log(p(yes)) ) -( p(no) * log(p(no)) ) = -(3/3 * log(3/3))-(0/3*log(0/3)) = 0
Information gain = 1— ((3 / 6) * 0) — ((3/6) * 0) = 1
Information gain & entropy for Self_employed:
entropy(No) = -(0/ 2 * log(0/2))-(2/2* log(2/2)) = 0
entropy(Yes) = -(3/4 * log(3/4))-(1/4*log(1/4)) = 0.81
information gain = 1— ((2/ 6) * 0) — ((4/6) * 0.81) = 0.46
Information gain & entropy for Education:
entropy(PG) = -(2/ 4* log(2/4))-(2/4* log(2/4)) = 1
entropy(UG) = -(1/2 * log(1/2))-(1/2*log(1/2)) = 1
information gain = 1 — ((4/ 6) * 1) — ((2/6) * 1) = 0
- When comparing the above 3 information gains, we can conclude Property_Area is the best predictor, whereas Education has the least predicting power. If we observe the data closely, the logic holds true because when we split the data based on the Property_Area it completely aligns with the target output. For the very same reason, it has been treated as the topmost one.
- On the other hand, the feature ‘Education’ is not able to differentiate between different outputs as it has the same value across the classes.
- Basically, the information gain compares the previous entropy and the current impurity. If the subnode is not able to distinguish between the data, then the initial impurity is carry forwarded without alteration.
Gini index and impurity: It is an alternative method to measure the impurity within the data.
Gini impurity/index(weighted) = 1 — summation of (Pi) ^2
In the case of entropy, we have used log but here the proportions are squared. The low value of the Gini index implies the node is pure, while, a high value denotes the leaf node is impure.
Gini impurity for Property_Area:
Gini impurity(Semi-Urban) = 1 —( (0/3)² + (3/3)²) = 0
Gini impurity(Urban) = 1- ((3/3)² + (0/3)²) = 0
Gini impurity = (3/6) * 0 + (3/6) * 0 = 0
Gini impurity for Self_employed:
Gini impurity(No) = 1-((0/ 2)² + (2/2)²)) = 0
Gini impurity(Yes) = 1-((3/4)² +(1/4)²) = 0.3725
Gini impurity= ((2/ 6) * 0) + ((4/6) * 0.3725) = 0.24
Gini impurity for Education:
Gini impurity(PG) = 1- ((2/4)² + (2/4)²) = 0.5
Gini impurity(UG) = 1-((1/2)² + (1/2)²) = 0.5
Gini impurity = (4/ 6) * 0. 5 + (2/6) * 0.5 = 0.5
When compared to all the values Property Area has the least value. So the branching will start from there.
Notes: In the next post we will be discussing how to handle continuous data and regression problem.
- It becomes the most preferred choice because it is effortless to prepare data (pre-processing).
- The algorithm even handles outliers very well as the split can be compared to binning. So anomalies will be placed in separate bins.
- The input data to the decision tree need not be scaled, unlike other supervised learning algorithms. Since the data is segregated based on the labels/values(in case of continuous data).
- It is a non-parametric technique (i.e) the underlying data distribution does not disturb the splitting process.
- Multi-dimensional features are also well handled.
- It is a go-to algorithm when there exists a non-linear relationship between the predictor variables and the target.
- One of the major drawbacks of the decision tree is overfitting. If we do not control the growth of the tree, it keeps expanding until each leaf(last) node contains only one training sample. This will result in overfitting as every minute details in the data is completely absorbed.
- It is computationally heavy as every feature and the corresponding values are compared in order to decide the branching procedure.
- In addition to that, it is also time-consuming.
Gini Index vs Information Entropy
Everything you need to know about impurity measurements and information gain