202412112103
Status: #idea
Tags: Decision Trees
State: #nascent

Boosting

Another method that is often used with Decision Trees to make them suck less.

In comparison to Bagging (Bootstrapped Aggregation) and Random Forests, which generate a litany of trees on bootstrapped samples (Random Forests adding random feature selection) and then combine them, boosting adopts a more refined, I daresay parsimonious, approach.
It fits a small tree (small because of constraints put on it), computes the residuals of the current model, and then fits the next tree to those residuals rather than to the outcome, so each new tree improves the fit where the model so far falls short.

This approach learns slowly, governed by three tuning parameters: d, which controls the number of splits per tree (leading to d+1 terminal nodes); λ, used for regularization (shrinkage), meaning that instead of adding a tree's full prediction to the model we add λ×prediction; and B, which controls the number of trees we fit.
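The slow-learning procedure can be sketched for regression: start from a zero prediction, repeatedly fit a small tree to the current residuals, and add a λ-shrunken fraction of its prediction. A minimal sketch using scikit-learn trees, with illustrative data and parameter values:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

B, lam, d = 500, 0.01, 1            # number of trees, shrinkage λ, splits per tree
f_hat = np.zeros_like(y)            # the model's prediction starts at 0
residuals = y.copy()                # so the residuals start at y itself
trees = []

for _ in range(B):
    tree = DecisionTreeRegressor(max_leaf_nodes=d + 1)  # d splits -> d+1 leaves
    tree.fit(X, residuals)                              # fit to residuals, not to y
    update = lam * tree.predict(X)                      # shrunken contribution
    f_hat += update
    residuals -= update                                 # residuals shrink slowly
    trees.append(tree)

def predict(X_new):
    # The boosted model is the λ-weighted sum of all fitted trees.
    return lam * sum(t.predict(X_new) for t in trees)
```

With d=1 every tree is a stump, yet the accumulated sum of 500 shrunken stumps fits the training data far better than any single stump could.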

Contrary to Random Forests and Bagging (Bootstrapped Aggregation), boosting will overfit the training set if we set B too high, although thanks to the λ parameter this overfitting, if it happens, happens extremely slowly.

The learning rate λ is typically a small value like 0.01 or 0.001, though the right choice will vary with the problem. To compensate for the small size of each contribution (recall, λ represents the "fraction" of each tree's prediction that we add to our prediction f^(x)), a very small λ usually requires a large B to perform well.
The division number (I call it that) d, as mentioned, represents the number of splits per tree. Given the approach used here, there is often no need for much complexity at the tree level, since whatever error one tree makes will be corrected by trees further down the line; for those reasons d=1 is often a good choice.
In that case we have a special kind of tree called a stump. More generally, d is called the interaction depth, which is a fancy way of saying the number of variables we allow to interact within any given tree.

We can pick it through Cross-Validation, similarly to λ.
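A hedged sketch of that tuning step, using scikit-learn's built-in gradient boosting and grid-search cross-validation; the grid values and data here are illustrative assumptions, not recommendations:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(150, 2))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.2, size=150)

param_grid = {
    "learning_rate": [0.001, 0.01, 0.1],  # candidate λ values
    "max_depth": [1, 2, 3],               # candidate interaction depths d
}
search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=200),  # B held fixed here for brevity
    param_grid,
    cv=5,  # 5-fold cross-validation over the grid
)
search.fit(X, y)
```

In practice B would be tuned too (or chosen by early stopping), since the best B depends strongly on the chosen λ.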

Still, this type of model is really powerful and often achieves higher accuracy than Random Forests (which themselves typically beat bagged trees thanks to de-correlation).

Advantages

Disadvantages

There are a few boosting algorithms; the main ones to know are:

- AdaBoost
- Gradient Boosting
- XGBoost (and related implementations like LightGBM and CatBoost)