# Bagging and Boosting

# Decision Trees

- ID3 and C4.5 choose the best split using entropy-based criteria (information gain and gain ratio, respectively).
- CART chooses the best split using the Gini index.
- XGBoost chooses the best split using coefficients from a second-order Taylor expansion of the loss.
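
A minimal sketch (assuming NumPy) of the Gini index that CART minimizes when evaluating a candidate binary split:

```python
import numpy as np

def gini(labels):
    """Gini impurity of a label array: 1 - sum_k p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_split(labels_left, labels_right):
    """Size-weighted Gini impurity of a binary split (CART's criterion)."""
    n_l, n_r = len(labels_left), len(labels_right)
    n = n_l + n_r
    return n_l / n * gini(labels_left) + n_r / n * gini(labels_right)

# A pure node has impurity 0; a 50/50 node has impurity 0.5.
print(gini(np.array([1, 1, 1, 1])))   # 0.0
print(gini(np.array([0, 0, 1, 1])))   # 0.5
```

CART evaluates `gini_split` for every candidate threshold and keeps the split with the lowest weighted impurity.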

# Ensemble Method / Meta-Algorithm

- bagging / bootstrap aggregating:
one way to reduce the variance of an estimate is to average together multiple estimates.
    - **bootstrap sampling** obtains the data subsets for training the base learners: given a training set $D$, generate $n$ new training sets $D_i$, each sampled from $D$ with replacement.
    - **aggregating** combines the base learners' outputs: voting for classification, averaging for regression.
    - Example: random forest. $f(x) = \frac{1}{M} \sum_{m=1}^M f_m(x)$
- Random forests also provide feature importance scores (e.g., the mean decrease in Gini impurity over all splits that use a feature).
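
The bagging recipe above can be sketched in a few lines. This toy example (assuming NumPy, and substituting a degree-1 least-squares fit for the decision trees a random forest would use) trains base learners on bootstrap resamples and aggregates by averaging:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = 2x + noise.
X = np.linspace(0.0, 1.0, 50)
y = 2 * X + rng.normal(scale=0.1, size=X.shape)

def bagged_fit(X, y, n_models=25):
    """Fit each base learner on a bootstrap resample of the data.
    A degree-1 least-squares fit stands in for a decision tree."""
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)   # sample n indices with replacement
        models.append(np.polyfit(X[idx], y[idx], deg=1))
    return models

def bagged_predict(models, X):
    """Aggregate by averaging: f(x) = (1/M) * sum_m f_m(x)."""
    return np.mean([np.polyval(m, X) for m in models], axis=0)

models = bagged_fit(X, y)
pred = bagged_predict(models, X)
```

Averaging over resamples reduces the variance of the individual fits without changing what each base learner can represent.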

- boosting:
primarily reduces bias (and can also reduce variance).
It incrementally builds an ensemble by training each new model instance to emphasize the training instances that previous models misclassified.
    - **sampling:** more weight is given to examples that were misclassified by earlier rounds.
    - **aggregation:** a weighted majority vote (classification) or a weighted sum (regression).
    - Example: adaptive boosting / AdaBoost.
- Example: GBDT (gradient-boosted decision trees), which uses first-order gradients.
- Example: XGBoost, an efficient implementation of gradient boosting that uses second-order gradients.

- Bagging trains each learner on a bootstrap sample of the same size as the original data (drawn with replacement), then combines them all at once (simple voting/averaging).
- Boosting trains each learner in turn on the full data set (with adjustable sample weights), and combines them iteratively (weighted sum).

The principal difference between boosting and committee methods such as bagging is that the base learners are trained in sequence on a weighted version of the data.

# AdaBoost

$N$ training samples $(x_i, y_i)$ with $y_i \in \{-1, +1\}$, and $M$ weak classifiers.

- Initialize sample weights $w_i = \frac{1}{N}$.
- For $m = 1, 2, \dots, M$:
    - Learn a weak classifier $G_m$.
    - Compute the weighted misclassification rate on the training data: $e_m = \sum_{i=1}^N w_i I(G_m(x_i) \ne y_i)$.
    - Compute the coefficient of $G_m(x)$: $\alpha_m = \frac{1}{2} \log \frac{1 - e_m}{e_m}$.
    - Update the sample weights: $w_i \leftarrow \frac{w_i \exp(-\alpha_m y_i G_m(x_i))}{Z_m}$, where $Z_m = \sum_{i=1}^N w_i \exp(-\alpha_m y_i G_m(x_i))$.

- Take a linear combination of the $M$ weak models: $f(x) = \sum_{m=1}^M \alpha_m G_m(x)$. The final classifier is $G(x) = \operatorname{sign}(f(x)) = \operatorname{sign}\left(\sum_{m=1}^M \alpha_m G_m(x)\right)$.
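
The steps above translate directly into code. A minimal sketch, assuming NumPy, 1-D inputs, labels in $\{-1, +1\}$, and decision stumps as the weak classifiers:

```python
import numpy as np

def fit_stump(X, y, w):
    """Weighted decision stump on 1-D inputs: predict s if x > t, else -s.
    Returns the (t, s) pair with the lowest weighted error."""
    best_t, best_s, best_err = X[0], 1, np.inf
    for t in np.unique(X):
        for s in (1, -1):
            pred = np.where(X > t, s, -s)
            err = np.sum(w * (pred != y))
            if err < best_err:
                best_t, best_s, best_err = t, s, err
    return best_t, best_s, best_err

def adaboost(X, y, M=10):
    """AdaBoost with decision stumps; labels y must be in {-1, +1}."""
    N = len(X)
    w = np.full(N, 1.0 / N)           # uniform initial weights
    ensemble = []
    for _ in range(M):
        t, s, e_m = fit_stump(X, y, w)
        e_m = max(e_m, 1e-12)         # guard against log(0)
        alpha = 0.5 * np.log((1 - e_m) / e_m)
        G = np.where(X > t, s, -s)
        w = w * np.exp(-alpha * y * G)
        w /= w.sum()                  # dividing by Z keeps the weights normalized
        ensemble.append((alpha, t, s))
    return ensemble

def predict(ensemble, X):
    """G(x) = sign(sum_m alpha_m G_m(x))."""
    f = sum(a * np.where(X > t, s, -s) for a, t, s in ensemble)
    return np.sign(f)

# Toy data that no single stump classifies perfectly.
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1, 1, -1, -1, 1, 1])
model = adaboost(X, y, M=20)
```

Each round upweights the points the previous stump got wrong, so later stumps focus on the hard region; the weighted vote then classifies points that no single stump can.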

# Gradient Boosting Decision Tree

A boosting tree is a boosting method that uses decision trees as its weak learners. GBDT fits each new tree to the negative gradient of the loss at the current ensemble's predictions; for squared loss, that negative gradient is simply the residual.
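
A minimal sketch of gradient boosting for squared loss, assuming NumPy, 1-D inputs, and regression stumps as the weak learners (real GBDT uses deeper trees):

```python
import numpy as np

def fit_reg_stump(X, r):
    """Least-squares regression stump: pick the threshold t that minimizes
    the squared error when each side predicts its own mean."""
    best = None
    for t in np.unique(X)[:-1]:                # keep both sides non-empty
        left, right = r[X <= t], r[X > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, left_val, right_val = best
    return t, left_val, right_val

def gbdt_fit(X, y, M=100, lr=0.1):
    """Gradient boosting for squared loss: each stump is fitted to the
    current residuals, which equal the negative first-order gradient."""
    init = y.mean()                            # initial constant model
    f = np.full(len(y), init)
    stumps = []
    for _ in range(M):
        r = y - f                              # residual = -d/df of (1/2)(y - f)^2
        t, lv, rv = fit_reg_stump(X, r)
        f = f + lr * np.where(X <= t, lv, rv)  # shrink each step by the learning rate
        stumps.append((t, lv, rv))
    return init, stumps

def gbdt_predict(init, stumps, X, lr=0.1):
    f = np.full(len(X), float(init))
    for t, lv, rv in stumps:
        f = f + lr * np.where(X <= t, lv, rv)
    return f

# Piecewise-constant target, well suited to stumps.
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 1.0, 3.0, 3.0, 5.0, 5.0])
init, stumps = gbdt_fit(X, y)
```

The learning rate `lr` shrinks each stump's contribution, trading more boosting rounds for better generalization.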

### Important notions

bootstrap sampling: random sampling with replacement
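
A quick illustration of bootstrap sampling, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.arange(10)

# A bootstrap sample has the same size as the original data and is drawn
# with replacement, so some items repeat while others are left out
# (the "out-of-bag" items).
sample = rng.choice(data, size=len(data), replace=True)
out_of_bag = np.setdiff1d(data, sample)
```

In expectation, a bootstrap sample contains about 63% of the distinct original items; the out-of-bag remainder can serve as a free validation set.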