202412112103
Status: #idea
Tags: Decision Trees
State: #nascent

Boosting

Another method that is often used with Decision Trees to make them suck less.

In comparison to Bagging (Bootstrapped Aggregation) and Random Forests, which generate a litany of trees on bootstrapped samples (Random Forests adding random feature selection) and then combine them, boosting adopts a more refined, I daresay parsimonious, approach.
It fits a small tree (small because of constraints put on it), computes the residuals of the current model, and then fits the next tree to those residuals rather than to the outcome, so each new tree improves the fit where the model so far falls short.

This approach learns slowly, governed by three tuning parameters: d, which controls the number of splits per tree (leading to d+1 terminal nodes); λ, used for regularization (shrinkage), meaning that instead of adding a tree's full prediction to the model we add λ×prediction; and B, which controls the number of trees we fit.
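The slow-learning procedure can be sketched for regression: start from a zero prediction, repeatedly fit a small tree to the current residuals, and add a λ-shrunken fraction of its prediction. A minimal sketch using scikit-learn trees, with illustrative data and parameter values:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

B, lam, d = 500, 0.01, 1            # number of trees, shrinkage λ, splits per tree
f_hat = np.zeros_like(y)            # the model's prediction starts at 0
residuals = y.copy()                # so the residuals start at y itself
trees = []

for _ in range(B):
    tree = DecisionTreeRegressor(max_leaf_nodes=d + 1)  # d splits -> d+1 leaves
    tree.fit(X, residuals)                              # fit to residuals, not to y
    update = lam * tree.predict(X)                      # shrunken contribution
    f_hat += update
    residuals -= update                                 # residuals shrink slowly
    trees.append(tree)

def predict(X_new):
    # The boosted model is the λ-weighted sum of all fitted trees.
    return lam * sum(t.predict(X_new) for t in trees)
```

With d=1 every tree is a stump, yet the accumulated sum of 500 shrunken stumps fits the training data far better than any single stump could.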

Contrary to Random Forests and Bagging (Bootstrapped Aggregation), boosting will overfit the training set if we set B too high, although thanks to the λ parameter this overfitting, if it happens, happens extremely slowly.

The learning rate λ is typically a small value like 0.01 or 0.001, though the right choice will vary with the problem. To compensate for the small size of each contribution (recall, λ represents the "fraction" of each tree's prediction that we add to our prediction f^(x)), a very small λ usually requires a large B to perform well.
The division number (I call it that) d, as mentioned, represents the number of splits per tree. Given the approach used here, there is often no need for much complexity at the tree level, since whatever error one tree makes will be corrected by trees further down the line; for those reasons d=1 is often a good choice.
In that case we have a special kind of tree called a stump. More generally, d is called the interaction depth, which is a fancy way of saying the number of variables we allow to interact within any given tree.

We can pick it through Cross-Validation, similarly to λ.
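A hedged sketch of that tuning step, using scikit-learn's built-in gradient boosting and grid-search cross-validation; the grid values and data here are illustrative assumptions, not recommendations:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(150, 2))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.2, size=150)

param_grid = {
    "learning_rate": [0.001, 0.01, 0.1],  # candidate λ values
    "max_depth": [1, 2, 3],               # candidate interaction depths d
}
search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=200),  # B held fixed here for brevity
    param_grid,
    cv=5,  # 5-fold cross-validation over the grid
)
search.fit(X, y)
```

In practice B would be tuned too (or chosen by early stopping), since the best B depends strongly on the chosen λ.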

Still, this type of model is really powerful and often achieves higher accuracy than Random Forests (which themselves typically beat bagged trees thanks to de-correlation).

Advantages

Disadvantages

There are a few boosting algorithms; the main ones to know are:

- AdaBoost
- Gradient Boosting
- XGBoost (and related implementations like LightGBM and CatBoost)