Bagging and Boosting
Decision Trees
- ID3 and C4.5 use information entropy (information gain and gain ratio, respectively) to find the best split.
- CART uses the Gini index to find the best split.
- XGBoost uses a second-order Taylor expansion of the loss to find the best split.
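A minimal NumPy sketch of the entropy and Gini criteria above (XGBoost's second-order objective is not covered here); the function names and the toy labels are purely illustrative:

```python
import numpy as np

def entropy(labels):
    """Information entropy H(D) = -sum_k p_k log2 p_k, used by ID3/C4.5."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    """Gini index Gini(D) = 1 - sum_k p_k^2, used by CART."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_gain(labels, mask, criterion=entropy):
    """Impurity reduction from splitting `labels` into the mask / ~mask subsets.

    Assumes both subsets are non-empty.
    """
    n = len(labels)
    left, right = labels[mask], labels[~mask]
    weighted = len(left) / n * criterion(left) + len(right) / n * criterion(right)
    return criterion(labels) - weighted

# Toy check: a perfect split of [0, 0, 1, 1] gives a gain of 1.0 bit.
y = np.array([0, 0, 1, 1])
print(split_gain(y, np.array([True, True, False, False])))  # 1.0
```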
Ensemble Method / Meta-Algorithm
- bagging / bootstrap aggregating:
one way to reduce the variance of an estimate is to average together multiple estimates
- bootstrap sampling to obtain the data subsets for training the base learners: given a training set $D$, generate $n$ new training sets $D_i$, each sampled from $D$ with replacement.
- aggregating: voting for classification and averaging for regression to combine the outputs of the base learners.
- Example: random forest. $f(x) = \frac{1}{M} \sum_{m=1}^M f_m(x)$
- feature importance: a random forest also yields per-feature importance scores, e.g. via the mean decrease in impurity across trees; see the bagging sketch after this list.
- boosting:
primarily reduces bias (it can also reduce variance).
The ensemble is built incrementally: each new model is trained to emphasize the training instances that previous models misclassified.
- weighting: more weight is given to examples that were misclassified in earlier rounds
- aggregation: a weighted majority vote (classification) or a weighted sum (regression)
- Example: adaptive boosting / AdaBoost.
- Example: GBDT (gradient boosted decision tree), which uses first-order gradients of the loss.
- Example: XGBoost, an efficient implementation of gradient boosted trees that uses second-order gradients.
- Bagging draws, with replacement, a same-sized sample of the data to train each learner, then combines all the learners at once by simple voting.
- Boosting trains the learners one after another on the full data set (with adjustable sample weights) and combines them iteratively with a weighted (smoothed) sum.
The principal difference between boosting and committee methods such as bagging is that the base learners are trained in sequence on a weighted version of the data.
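A rough from-scratch sketch of the bagging recipe above (bootstrap-sample the data, fit one tree per sample, aggregate by majority vote), assuming NumPy arrays with non-negative integer class labels and scikit-learn's DecisionTreeClassifier as the base learner; the class name `BaggedTrees` and the hyperparameters are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class BaggedTrees:
    def __init__(self, n_estimators=25, random_state=0):
        self.n_estimators = n_estimators
        self.rng = np.random.default_rng(random_state)
        self.trees = []

    def fit(self, X, y):
        n = len(X)
        for _ in range(self.n_estimators):
            idx = self.rng.integers(0, n, size=n)  # bootstrap: n indices drawn with replacement
            self.trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return self

    def predict(self, X):
        # Majority vote over the base learners (use the mean instead for regression).
        votes = np.stack([t.predict(X) for t in self.trees])
        return np.apply_along_axis(lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)
```

A random forest is essentially this plus random feature subsampling at each split; scikit-learn's `RandomForestClassifier` implements it and exposes per-feature importance scores via its `feature_importances_` attribute.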
AdaBoost
Given $N$ training samples and $M$ weak classifiers.
- Initialize the sample weights: $w_i = \frac{1}{N}$ for $i = 1, \dots, N$.
- For $m = 1, 2, \dots, M$:
  - Learn a weak classifier $G_m(x)$ on the weighted training data.
  - Compute the weighted misclassification error on the training data: $e_m = \sum_{i=1}^N w_i I(G_m(x_i) \ne y_i)$
  - Compute the coefficient of $G_m(x)$: $\alpha_m = \frac{1}{2} \log \frac{1 - e_m}{e_m}$
  - Update the sample weights: $w_i \leftarrow \frac{w_i \exp(-\alpha_m y_i G_m(x_i))}{Z_m}$, where $Z_m = \sum_{i=1}^N w_i \exp(-\alpha_m y_i G_m(x_i))$.
- Take the linear combination of the $M$ weak models: $f(x) = \sum_{m=1}^M \alpha_m G_m(x)$. The final classifier is $G(x) = \operatorname{sign}(f(x)) = \operatorname{sign}\left(\sum_{m=1}^M \alpha_m G_m(x)\right)$.
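A minimal sketch of the loop above, assuming binary labels in $\{-1, +1\}$ and scikit-learn decision stumps (depth-1 trees) as the weak classifiers $G_m$; the clipping of $e_m$ is an added numerical safeguard, not part of the algorithm:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=50):
    """y is expected to contain labels in {-1, +1}."""
    N = len(y)
    w = np.full(N, 1.0 / N)                                    # initial sample weights w_i = 1/N
    stumps, alphas = [], []
    for _ in range(M):
        G = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = G.predict(X)
        e = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)  # weighted error e_m
        alpha = 0.5 * np.log((1 - e) / e)                        # coefficient alpha_m
        w = w * np.exp(-alpha * y * pred)                        # re-weight: up-weight mistakes
        w /= w.sum()                                             # normalize by Z_m
        stumps.append(G)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    # f(x) = sum_m alpha_m G_m(x); the final classifier is sign(f(x)).
    f = sum(a * G.predict(X) for a, G in zip(alphas, stumps))
    return np.sign(f)
```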
Gradient Boosting Decision Tree
Boosting tree: a boosting method that uses decision trees as the weak learners. GBDT fits each new tree to the negative gradient of the loss with respect to the current model's predictions (the pseudo-residuals); for squared loss this is simply the residual $y - f(x)$.
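A compact sketch of gradient boosting for regression with squared loss, where the negative gradient is just the residual $y - f(x)$; the shrinkage rate `lr` and the tree depth are illustrative choices, not values from the notes:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_trees=100, lr=0.1, max_depth=3):
    f0 = y.mean()                                # initial constant prediction
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        residual = y - pred                      # negative gradient of 1/2 (y - f)^2
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        pred += lr * tree.predict(X)             # add the new tree with shrinkage
        trees.append(tree)
    return f0, trees

def gbdt_predict(X, f0, trees, lr=0.1):
    return f0 + lr * sum(t.predict(X) for t in trees)
```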
Important notions
bootstrap sampling: random sampling with replacement
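For concreteness, a short NumPy illustration of bootstrap sampling (duplicates are expected; as the data set grows, each resample contains on average about $1 - 1/e \approx 63.2\%$ of the distinct samples):

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)
# Draw n indices with replacement from a data set of size n.
resample = data[rng.integers(0, len(data), size=len(data))]
print(resample)   # some values repeat, others are left out
```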