We’ve already seen an ensemble technique called bagging, where multiple individual models run in parallel and their outputs are combined to decide the final outcome. Another category of ensemble methods is boosting. The main difference between boosting and bagging is the way the models are run. In boosting, the models are connected serially, i.e. what the first model gets wrong is passed on to the second, and so on.
The main idea is to reduce the error as the data passes from one model to the next. As in bagging, the ensemble is typically built from a homogeneous set of models. Many popular boosting algorithms are available, and they are often preferred over other methods in ML hackathons because of their ability to capture complex patterns in the input data. For this discussion, we will consider the adaptive boosting (AdaBoost) algorithm.
Adaptive boosting algorithm (base: decision trees): Here, instead of building fully grown trees as in bagging, we form multiple stumps (trees of minimal depth). Bagging reduces the overfitting (high variance) caused by the individual deep models, whereas boosting brings down the high bias caused by the shallow trees.
- The crux of boosting is combining many weak learners to create a strong learner. So what is the working principle behind the scenes, and how does it create an overall powerful model? A weight is assigned to each model's output based on its accuracy, and a weight is also assigned to each data point. Based on the weights given to the data points, subsequent trees are built, and the same procedure is repeated. In other words, the successor trees depend on their predecessors.
- Let's take an example to get a clear understanding: the objective is to predict the target ‘High Analytical Skills’. We assume the total number of individual decision trees is 3.
- Step1: Initially, when feeding the input to the first model, we give equal weight to all the training samples. Each weight is 1/n, where ‘n’ is the total number of data points in the training set.
- Step2: Using the Gini index or information gain, we decide which feature to split on. For our example, let's say “Proficient in Maths” is the best predictor and is used in the first stump. This stump makes two misclassifications.
- Step3: We give more weight to the incorrectly classified data points. In addition, another value that needs to be computed is the ‘amount of say’ of the individual model. If a particular model has higher accuracy, it has more say in the final outcome.
amount of say = (1/2) * ln((1 - Total Error) / Total Error), where ln is the natural logarithm and the total error is the sum of the weights of the misclassified samples. The smaller the total error, the larger the amount of say. With two misclassifications among six equally weighted samples, the total error for our case = 2 * (1/6) = 0.33.
amount of say = (1/2) * ln((2/3) / (1/3)) = (1/2) * ln(2) = 0.35
When the total error exceeds 0.5, the amount of say becomes negative; for example, a total error of 0.75 gives (1/2) * ln(0.25/0.75) = -0.55. A negative ‘amount of say’ turns the stump's vote into the opposite one.
If the data is correctly classified,
new_weight = previous_weight * e ^ -amount of say = (1/6) * e ^ -0.35 = 0.118
If the data is incorrectly classified,
new_weight = previous_weight * e ^ amount of say = (1/6) * e ^ 0.35 = 0.236
As we can see, the correctly classified points are given less weight than the mislabeled ones.
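The Step3 arithmetic can be checked with a few lines of Python. This is a minimal sketch, not a reference implementation: the function names `amount_of_say` and `update_weight` are made up for illustration, the natural logarithm is assumed (as in the standard AdaBoost formulation), and the total error is assumed to be 2/6, i.e. the two misclassifications from Step2 among six equally weighted samples.

```python
import math

def amount_of_say(total_error):
    """Say of one stump: 0.5 * ln((1 - error) / error), natural log assumed."""
    return 0.5 * math.log((1.0 - total_error) / total_error)

def update_weight(previous_weight, say, misclassified):
    """Scale a sample weight up if it was misclassified, down otherwise."""
    exponent = say if misclassified else -say
    return previous_weight * math.exp(exponent)

n = 6                               # assumed sample count (weights of 1/6)
initial_weight = 1.0 / n
total_error = 2 * initial_weight    # two misclassified samples -> 1/3

say = amount_of_say(total_error)
w_correct = update_weight(initial_weight, say, misclassified=False)
w_wrong = update_weight(initial_weight, say, misclassified=True)

print(round(say, 3))                  # 0.347
print(round(w_correct, 3))            # 0.118
print(round(w_wrong, 3))              # 0.236
print(round(amount_of_say(0.75), 3))  # -0.549 (error above 0.5 -> negative say)
```

Note that a stump with a total error above 0.5 gets a negative say, so the same update rule would shrink the weights of the points it mislabeled; this is consistent with its vote being flipped.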
- Step4: We can normalize the weights by dividing each value by the total sum. This will bring the total summation to 1.
- Step5: The normalised weights can be used by the subsequent model in two ways to improve accuracy. In the first method, the normalised values are treated as sample weights (similar to the class-imbalance setting). This places more weight on the misclassified records while computing the Gini index or information gain. In the second method, the normalised weights drive data sampling: we compute their cumulative values so that each data point owns a range between 0 and 1 whose width equals its weight.
- When we randomly sample the training data based on the cumulative values, a data point with a wide range (a large weight) gets sampled more frequently than the rest. In this way, each subsequent model concentrates on the examples its predecessors got wrong.
- For the final result, similar to bagging, we take a voting approach, but each vote is weighted by the model's “amount of say”. A model with strong predictive power has more influence on the final result.
- In the case of regression, we use the mean squared error or deviation to update the weights.
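Steps 4 and 5 and the final vote can be sketched in plain Python. This is an illustrative sketch under stated assumptions: the helper names (`normalize`, `resample_indices`, `weighted_vote`) are hypothetical, the two misclassified samples are assumed to be the last two of six, the starting weights 0.118 and 0.236 are what updating 1/6 with an amount of say of about 0.35 would give, class labels are encoded as +1 / -1, and the three "amount of say" values in the vote are made up.

```python
import bisect
import itertools
import random

def normalize(weights):
    """Step4: divide each weight by the total so the weights sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

def resample_indices(weights, k, rng):
    """Step5 (sampling variant): draw k indices using cumulative weights.

    Each index owns a slice of [0, total) as wide as its weight, so a
    heavier sample is hit by the random draw more often.
    """
    cumulative = list(itertools.accumulate(weights))
    picks = []
    for _ in range(k):
        r = rng.random() * cumulative[-1]
        idx = bisect.bisect_right(cumulative, r)
        picks.append(min(idx, len(weights) - 1))  # guard against rounding
    return picks

def weighted_vote(predictions, says):
    """Final result: sign of the say-weighted sum of +1 / -1 votes.

    A tie (score of exactly 0) is resolved as +1 here, an arbitrary choice.
    """
    score = sum(p * s for p, s in zip(predictions, says))
    return 1 if score >= 0 else -1

# Assumed weights after Step3: four correct samples, then two misclassified.
weights = [0.118, 0.118, 0.118, 0.118, 0.236, 0.236]
normalized = normalize(weights)
print([round(w, 3) for w in normalized])  # [0.125, 0.125, 0.125, 0.125, 0.25, 0.25]

# The two misclassified samples (indices 4 and 5) hold half the total weight,
# so they should make up roughly half of the resampled training set.
rng = random.Random(0)
picks = resample_indices(normalized, 1000, rng)
print(sum(1 for i in picks if i >= 4) / len(picks))

# One confident stump outvotes two weaker ones (made-up says).
print(weighted_vote([1, -1, -1], [0.9, 0.3, 0.2]))  # 1
```

The cumulative-sum lookup is written out by hand to mirror the description above; the standard library's `random.choices(population, weights=...)` performs the same kind of weighted draw.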
My learnings are from this Reference Video: