“None of us is as smart as all of us.” ~ Ken Blanchard. This is exactly what ensembling implies: rather than depending on just one ML model, we bring together different models to make a final decision.
Table of contents:
- Bagging — an ensembling technique
- Random Forest(Collection of Decision Trees)
- Advantages & Disadvantages
Bagging — an ensembling technique: The topic of this discussion is one of the ensembling techniques, called bagging. It is also termed Bootstrap aggregation, and it unites only homogeneous models (e.g. 10 KNNs or 50 SVMs). But how does it make each model distinctive within the ensemble?
- That’s where we apply the concept of random sampling of the input data with replacement. Each model is fed a different mix of training samples, so every individual model learns from a unique group of data points. The assumption here is that the models are highly independent of one another.
- Even though the isolated models may be extremely complex, when we merge them the finalised outcome retains only the relevant information.
- The reason is that, in the case of classification, we apply voting to pick the most frequent prediction (the mode), whereas for regression the resulting values are averaged out.
- Each model is executed in parallel and the results are finally combined.
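The steps above can be sketched with plain Python (a toy illustration of the idea, not the scikit-learn API; the helper name `bootstrap_sample` and the label data are made up for this example):

```python
import random
import statistics

def bootstrap_sample(data, rng):
    # Sampling with replacement: some points repeat, others are left out.
    return [rng.choice(data) for _ in range(len(data))]

rng = random.Random(42)
labels = [0, 1, 1, 0, 1, 1, 1, 0, 1, 1]

# "Train" 5 toy models, each one voting with the mode of its own
# bootstrap sample of the data.
votes = [statistics.mode(bootstrap_sample(labels, rng)) for _ in range(5)]

# Classification: combine by majority vote (mode).
# Regression would instead average the individual predictions.
print(statistics.mode(votes))
```

Because each model sees a different resample, their individual mistakes tend to cancel out when the votes are combined.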
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import BaggingRegressor
The base estimator could be any algorithm, but every model in the ensemble must be of the same type.
Some of the hyperparameters include:
- max_samples — the number of training samples to be drawn from the input dataset and given to each estimator.
- max_features — the number of features to be considered for each model. The selection process is completely random.
- bootstrap — if True, the training samples are drawn with replacement; otherwise, without replacement.
- bootstrap_features — the logic is similar to “bootstrap”, but instead of the input data, the model randomly samples the input predictors (features).
- n_estimators — the number of base estimators to be taken into account. For example, if we give a value such as 100, then 100 models are created to be part of the ensemble.
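Putting these hyperparameters together, a minimal sketch (the dataset here is synthetic, generated with make_classification purely for illustration; the base estimator is left at its default, a decision tree):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic dataset purely for illustration.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The hyperparameters below control the random sampling of rows and
# columns that makes each model in the ensemble distinctive.
bag = BaggingClassifier(
    n_estimators=100,          # 100 models in the ensemble
    max_samples=0.8,           # 80% of the training rows per model
    max_features=0.5,          # half of the features per model
    bootstrap=True,            # rows drawn with replacement
    bootstrap_features=False,  # features drawn without replacement
    random_state=0,
)
bag.fit(X_train, y_train)
print(bag.score(X_test, y_test))
```

Swapping BaggingClassifier for BaggingRegressor (and a regression dataset) gives the averaged-output version of the same idea.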
Random Forest (Collection of Decision Trees): It follows the same logic as the bagging classifier, but the model picked is a Decision Tree, which is why the name Random Forest. It uses both random sampling of the training data and random sampling of the independent features.
- When the training samples are randomly drawn, some input data points might not be included. These are referred to as out-of-bag (OOB) records. The model is evaluated using the OOB records; in other words, OOB acts as an internal testing set.
- In the standard Random Forest every tree gets an equal vote; the out-of-bag error instead serves as an internal estimate of how accurate the ensemble is, without needing a separate validation set. In the case of classification, the confusion matrix is used to measure accuracy, while regression typically uses the mean squared error.
- It helps to alleviate the problem of overfitting, which is inherent in decision trees. The independent trees individually have high variance, but the aggregated model does not, since the selection of input data is randomized.
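A minimal sketch tying these points together (the dataset is synthetic, generated with make_classification; oob_score=True asks scikit-learn to evaluate each training point using only the trees whose bootstrap sample left it out):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic dataset purely for illustration.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Each tree sees a bootstrap sample of the rows, and each split considers
# only a random subset of the features (max_features="sqrt").
forest = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",
    oob_score=True,   # evaluate internally on the out-of-bag records
    random_state=0,
)
forest.fit(X, y)

print(len(forest.estimators_))  # the individual decision trees
print(forest.oob_score_)        # accuracy estimated from OOB records
```

The oob_score_ attribute gives a rough generalisation estimate "for free", since no data had to be held out explicitly.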
Advantages & Disadvantages:
- It works well on high-dimensional data.
- It has the capability to run on large databases.
- When compared to other machine learning algorithms (KNN, SVM), it works well for complex data.
- The algorithm is computationally heavy, as it has to create many decision trees to produce a Random Forest model.
- The time taken is comparatively high because of the random sampling and the execution of multiple models.