Day 56 (ML) — Random Forest (Ensembling + Bagging)


“None of us is as smart as all of us” ~ Ken Blanchard. This is exactly what ensembling implies. Rather than depending on just one ML model, we bring together different models to make the final decision.

Table of contents:

  • Bagging — an ensembling technique

  • Random Forest (Collection of Decision Trees)

Bagging — an ensembling technique: The topic of this discussion is one of the ensembling techniques, called bagging. It is also termed Bootstrap aggregation, and it combines only homogeneous models (e.g., 10 KNNs or 50 SVMs). But how does it make each model distinctive within the ensemble?

  • That’s where we apply the concept of random sampling of the input data with replacement. Every model is fed a different mix of training samples, so each individual model learns from a unique group of data points. The assumption here is that the models are largely independent of one another; a small sketch of this sampling follows below.
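A minimal sketch of this bootstrap sampling, assuming a tiny illustrative dataset size and using NumPy purely for demonstration (the variable names and numbers are assumptions, not part of the original post):

import numpy as np

rng = np.random.default_rng(42)   # fixed seed, only for reproducibility
n_samples, n_models = 10, 3       # tiny numbers, purely illustrative

# Each model in the ensemble receives its own bootstrap sample:
# indices drawn with replacement, so some rows repeat and some are left out.
for model_id in range(n_models):
    idx = rng.choice(n_samples, size=n_samples, replace=True)
    print(f"Model {model_id} trains on rows: {sorted(idx)}")

Each pass of the loop yields a different mix of rows, which is exactly what makes the individual models distinctive.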
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import BaggingRegressor
The base estimator can be any algorithm, but every estimator in the ensemble must be of the same type.

Some of the hyperparameters include,

max_samples — The number (or fraction) of training samples to be drawn from the input dataset and given to each base estimator.

max_features — The number (or fraction) of features to be considered by each model. The selection is completely random.

bootstrap — If True, the training samples are drawn with replacement; otherwise they are drawn without replacement.

bootstrap_features — The logic is similar to the attribute “bootstrap”, but instead of the input rows, the model randomly samples the input features.

n_estimators — The number of base estimators in the ensemble. For example, if we set it to 100, then 100 models are created as part of the ensemble.
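Putting these hyperparameters together, here is a minimal, hedged sketch of a bagging ensemble of KNN classifiers. The Iris dataset and all parameter values are illustrative assumptions; also note that recent scikit-learn versions name the parameter estimator, while older releases call it base_estimator.

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 KNN models, each trained on 80% of the rows (drawn with replacement)
# and on a random half of the features (drawn without replacement).
bag = BaggingClassifier(
    estimator=KNeighborsClassifier(),  # base_estimator in older scikit-learn versions
    n_estimators=100,
    max_samples=0.8,
    max_features=0.5,
    bootstrap=True,
    bootstrap_features=False,
    random_state=0,
)
bag.fit(X_train, y_train)
print("Test accuracy:", bag.score(X_test, y_test))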

Random Forest (Collection of Decision Trees): It follows the same logic as the bagging classifier, but the base model is a Decision Tree, which is why it is called a Random Forest. It uses both random sampling of the training data and random sampling of the independent features.

  • When the training samples are randomly drawn, some input data points may not be included for a given tree. These are referred to as out-of-bag (OOB) records. The model is evaluated on the OOB records, so OOB acts as an internal testing set.
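A minimal sketch of this OOB evaluation, again assuming the Iris dataset purely for illustration:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# oob_score=True evaluates each tree on the rows it never saw
# during its bootstrap sample, i.e. the out-of-bag records.
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X, y)
print("OOB score:", forest.oob_score_)

Because the OOB records behave like an internal test set, we get an accuracy estimate without holding out a separate validation split.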

Advantages:

  • It helps to alleviate the problem of overfitting, which is an inherent tendency of decision trees. Although the individual trees may each have high variance, the aggregated ensemble does not, since the selection of the input data (and features) is randomized.

Disadvantages:

  • The algorithm is computationally heavy, as it has to build many decision trees to produce a Random Forest model.
