Let’s build a tree with a pen and paper


In practice, Gini impurity is used most of the time, as it gives good splits and is inexpensive to compute. We have a dummy dataset below: the features (X) are Chest pain, Good blood circulation, and Blocked arteries, and the column to be predicted (y) is Heart disease. Every column has two possible values, yes and no.
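For a yes/no target like ours, the Gini impurity of a node reduces to 1 − p_yes² − p_no². A minimal Python sketch (the counts passed in are illustrative, not from the dataset above):

```python
# Gini impurity of a node: 1 - sum of squared class probabilities.
# With only two classes (yes/no) it simplifies to 1 - p_yes^2 - p_no^2.
def gini(yes: int, no: int) -> float:
    total = yes + no
    if total == 0:
        return 0.0
    p_yes = yes / total
    p_no = no / total
    return 1.0 - p_yes**2 - p_no**2

# A pure node (all "yes") has impurity 0; a 50/50 node has the maximum, 0.5.
print(gini(10, 0))  # 0.0
print(gini(5, 5))   # 0.5
```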

We aim to build a decision tree such that, given a new record of chest pain, good blood circulation, and blocked arteries, we can tell whether that person has heart disease or not. At the start, all our samples are in the root node. We have to decide which feature the root node should be split on first.

We will first look at how heart disease changes with Chest pain (ignoring good blood circulation and blocked arteries). The dummy numbers are shown below.

Taking all three splits together in the image below, we can observe that no single feature separates heart disease yes from no cleanly on its own. Any of them could serve as the root node, but it would not give a complete tree; we will have to split again further down in the hope of a better separation.

To decide which feature the root node should be split on, we calculate the Gini impurity for all the leaf nodes as shown below. After calculating it for the leaf nodes, we take their weighted average to get the Gini impurity of the split.
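That weighted average can be sketched like this; the leaf counts used here are hypothetical stand-ins, not taken from the figures:

```python
def gini(yes: int, no: int) -> float:
    total = yes + no
    if total == 0:
        return 0.0
    return 1.0 - (yes / total) ** 2 - (no / total) ** 2

def weighted_gini(leaves):
    """leaves: list of (yes, no) counts, one pair per leaf of the split.
    Each leaf's impurity is weighted by the fraction of samples it holds."""
    total = sum(y + n for y, n in leaves)
    return sum((y + n) / total * gini(y, n) for y, n in leaves)

# Hypothetical two-way split on one feature:
# left leaf: 37 heart-disease-yes / 127 no; right leaf: 100 yes / 33 no.
print(round(weighted_gini([(37, 127), (100, 33)]), 3))  # 0.36
```

The feature whose split yields the lowest weighted Gini impurity wins.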

We do this for all three features and select the one with the least Gini impurity, as it splits the dataset in the best way of the three.

Hence we choose good blood circulation as the root node.

We now do the same for the child nodes of Good blood circulation. In the image below we split the left child, with a total of 164 samples, on blocked arteries, as its Gini impurity is lower than that of chest pain (we calculate the Gini impurity again with the same formula as above, just on a smaller subset of the samples – 164 in this case).

One thing to note in the image below: when we try to split the right child of blocked arteries on chest pain, the Gini impurity of the split is 0.29, but the Gini impurity of that node itself is 0.20. This means that splitting the node any further does not reduce impurity, so it becomes a leaf node.
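That stopping check can be sketched as a comparison between the node's own impurity and the weighted impurity of the proposed children; the counts below are made up for illustration:

```python
def gini(yes: int, no: int) -> float:
    total = yes + no
    return 0.0 if total == 0 else 1.0 - (yes / total) ** 2 - (no / total) ** 2

def split_improves(parent_counts, children_counts):
    """Return True only if the split lowers the weighted Gini impurity
    below the parent node's own impurity."""
    parent = gini(*parent_counts)
    total = sum(y + n for y, n in children_counts)
    weighted = sum((y + n) / total * gini(y, n) for y, n in children_counts)
    return weighted < parent

# A split that leaves the class mix unchanged in both children gives
# no impurity decrease, so the node stays a leaf:
print(split_improves((20, 80), [(10, 40), (10, 40)]))  # False
# A split that isolates the classes perfectly is clearly worth making:
print(split_improves((20, 80), [(20, 0), (0, 80)]))    # True
```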

We repeat the same process for all the nodes and get the following tree. This looks like a good enough fit for our training data.

This is how we create a tree from data. What should we do if we have a column with numerical values? It is simple: order the values in ascending order, calculate the mean of every two consecutive numbers, make a split at each of those midpoints, and calculate the Gini impurity using the same method. We choose the split with the least Gini impurity, as always.
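A sketch of that procedure, with a hypothetical numeric column (patient weight) and heart-disease labels; all names and numbers here are invented for illustration:

```python
def gini(yes: int, no: int) -> float:
    total = yes + no
    return 0.0 if total == 0 else 1.0 - (yes / total) ** 2 - (no / total) ** 2

def best_numeric_split(values, labels):
    """Try the midpoint between each pair of consecutive sorted values and
    return (threshold, weighted_gini) of the best split.
    `labels` are 1 for heart disease, 0 for none."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for (a, _), (b, _) in zip(pairs, pairs[1:]):
        if a == b:
            continue  # identical neighbours give no usable midpoint
        t = (a + b) / 2
        left = [lab for v, lab in pairs if v < t]
        right = [lab for v, lab in pairs if v >= t]
        w = (len(left) * gini(sum(left), len(left) - sum(left))
             + len(right) * gini(sum(right), len(right) - sum(right))) / len(pairs)
        if w < best[1]:
            best = (t, w)
    return best

# Hypothetical patient weights and heart-disease labels:
weights = [155, 180, 190, 220, 225, 260]
disease = [0, 0, 1, 0, 1, 1]
print(best_numeric_split(weights, disease))  # (185.0, 0.25)
```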

If we have a ranked (ordinal) column in the dataset, we split at every rank, calculate the Gini impurity for each split, and select the one with the least Gini impurity.

If we have categorical choices in our dataset, then the split and the Gini impurity calculation need to be done for every possible combination of choices.
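For example, a hypothetical feature with choices {blue, green, red} can be split as {blue} vs {green, red}, {green} vs {blue, red}, or {red} vs {blue, green}. A small sketch that enumerates the distinct binary splits:

```python
from itertools import combinations

def candidate_subsets(choices):
    """Every way to send a non-empty proper subset of choices to the left
    child (its complement goes right). For k choices there are 2**(k-1) - 1
    distinct binary splits, because a subset and its complement describe
    the same split."""
    choices = sorted(choices)
    seen = []
    for r in range(1, len(choices)):
        for subset in combinations(choices, r):
            complement = tuple(c for c in choices if c not in subset)
            if complement not in seen:  # skip mirror-image duplicates
                seen.append(subset)
    return seen

# Hypothetical categorical feature with three options:
print(candidate_subsets(["blue", "green", "red"]))
# [('blue',), ('green',), ('red',)]
```

Each candidate subset is then scored with the same weighted Gini impurity as before.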

Avoid Overfitting in Decision Trees

Overfitting is one of the key challenges in tree-based algorithms. If no limit is set, the tree will fit the training data 100%, because in the worst case it ends up making a leaf node for every single observation. Hence we need to take some precautions to avoid overfitting. It is mostly done in two ways:

- Pre-pruning: stop growing the tree early, for example by limiting its depth, the minimum number of samples per node, or the minimum impurity decrease required for a split.
- Post-pruning: grow the full tree first, then cut back branches that do not improve performance on held-out data.
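The pre-pruning side can be sketched as a set of checks applied before each split; all parameter names and threshold values below are arbitrary illustrations, not recommendations:

```python
def should_split(depth, n_samples, impurity_decrease,
                 max_depth=3, min_samples_split=20,
                 min_impurity_decrease=0.01):
    """Typical pre-pruning checks: stop growing when the node is deep
    enough, holds too few samples, or the best available split barely
    reduces impurity. Thresholds here are illustrative defaults."""
    if depth >= max_depth:
        return False
    if n_samples < min_samples_split:
        return False
    if impurity_decrease < min_impurity_decrease:
        return False
    return True

print(should_split(depth=1, n_samples=164, impurity_decrease=0.10))  # True
print(should_split(depth=3, n_samples=164, impurity_decrease=0.10))  # False
```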


May 6, 2022
