Decision Trees

  • ID3 and C4.5 choose the best split using information entropy (information gain / gain ratio).
  • CART chooses the best split using the Gini index.
  • XGBoost chooses the best split using the coefficients of a second-order Taylor expansion of the loss.
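As a concrete illustration of the first two criteria, here is a minimal Python sketch that scores a candidate split by its impurity decrease; the function names and toy labels are illustrative, not taken from any particular library:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label vector (used by ID3/C4.5 via information gain)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    """Gini impurity of a label vector (used by CART)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_gain(parent, left, right, impurity=entropy):
    """Impurity decrease achieved by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted

y = np.array([0, 0, 0, 1, 1, 1, 1, 1])
print(split_gain(y, y[:3], y[3:], impurity=entropy))  # information gain (ID3/C4.5)
print(split_gain(y, y[:3], y[3:], impurity=gini))     # Gini decrease (CART)
```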

Ensemble Method / Meta-Algorithm

  • bagging / bootstrap aggregating: one way to reduce the variance of an estimate is to average together multiple estimates trained on different bootstrap samples (a minimal sketch follows below).
    • sampling: bootstrap sampling obtains the data subsets for training the base learners. Given a training set $D$, generate $n$ new training sets $D_i$, each sampled from $D$ with replacement.
    • aggregating: combine the base learners' outputs by majority voting for classification and by averaging for regression.
    • Example: random forest. $f(x) = \frac{1}{M} \sum_{m=1}^M f_m(x)$
      • feature importance: commonly estimated as the mean decrease in split impurity (or via permutation importance) averaged over the trees.
  • boosting: primarily reduces bias (it can also reduce variance). It incrementally builds an ensemble by training each new model instance to emphasize the training instances that previous models misclassified.
    • sampling: more weight is given to examples that were misclassified by earlier rounds
    • aggregation: a weighted majority vote (classification) or a weighted sum (regression)
    • Example: adaptive boosting / AdaBoost.
    • Example: GBDT (gradient boosted decision tree). Fits each new tree to the first-order (negative) gradient of the loss.
    • Example: XGBoost, an efficient implementation of GBDT that also uses the second-order gradient. Reference
  • Bagging draws, with replacement, a sample of the same size to train each learner, then combines the learners all at once (simple voting).
  • Boosting trains the learners one after another on the full sample set (with adjustable weights) and combines them iteratively (weighted combination).

The principal difference between boosting and the committee methods, such as bagging, is that base learners are trained in sequence on a weighted version of the data.
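A minimal bagging sketch under these definitions, assuming scikit-learn decision trees as the base learners; the helper names and hyperparameters are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_fit(X, y, n_estimators=10, seed=0):
    """Train base learners on bootstrap samples of (X, y)."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, len(X), size=len(X))          # sample with replacement
        models.append(DecisionTreeRegressor(max_depth=3).fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Aggregate by averaging (regression); use majority voting for classification."""
    return np.mean([m.predict(X) for m in models], axis=0)

X = np.random.rand(200, 2)
y = np.sin(6 * X[:, 0]) + 0.1 * np.random.randn(200)
models = bagging_fit(X, y)
print(bagging_predict(models, X[:5]))
```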

AdaBoost

Given $N$ samples and $M$ weak classifiers:

  1. Initialize sample weights $w_i = \frac{1}{N}$.
  2. For $m = 1, 2, \ldots, M$:
    1. Learn a weak classifier $G_m(x)$ on the weighted training data.
    2. Compute its weighted error rate on the training data: $e_m = \sum_{i=1}^N w_i I(G_m(x_i) \ne y_i)$
    3. Compute the coefficient of $G_m(x)$: $\alpha_m = \frac{1}{2} \log \frac{1 - e_m}{e_m}$
    4. Update the sample weights: $w_i = \frac{w_i \exp(-\alpha_m y_i G_m(x_i))}{Z_m}$, where $Z_m = \sum_{i=1}^N w_i \exp(-\alpha_m y_i G_m(x_i))$ is a normalizer.
  3. Take a linear combination of the $M$ weak models: $f(x) = \sum_{m=1}^M \alpha_m G_m(x)$. The final classifier is $G(x) = \operatorname{sign}(f(x)) = \operatorname{sign}\left(\sum_{m=1}^M \alpha_m G_m(x)\right)$.
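A minimal NumPy/scikit-learn sketch of the steps above, assuming labels in $\{-1, +1\}$ and depth-1 trees (stumps) as the weak classifiers; names and hyperparameters are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=20):
    """AdaBoost for labels y in {-1, +1}, following the steps above."""
    N = len(X)
    w = np.full(N, 1.0 / N)                        # step 1: uniform sample weights
    learners, alphas = [], []
    for _ in range(M):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        e = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)  # step 2.2: weighted error
        alpha = 0.5 * np.log((1 - e) / e)          # step 2.3: classifier coefficient
        w = w * np.exp(-alpha * y * pred)          # step 2.4: re-weight samples
        w /= w.sum()                               # normalize by Z_m
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    """Step 3: sign of the weighted sum of weak classifiers."""
    f = sum(a * g.predict(X) for g, a in zip(learners, alphas))
    return np.sign(f)

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
learners, alphas = adaboost_fit(X, y)
print((adaboost_predict(learners, alphas, X) == y).mean())  # training accuracy
```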

Gradient Boosting Decision Tree

Boosting tree: a boosting method that uses decision trees as the weak models. Gradient boosting fits each new tree to the negative gradient of the loss with respect to the current prediction; for squared loss this is simply the residual.
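A minimal sketch of gradient boosting for regression with squared loss, where the negative gradient equals the residual; the learning rate, tree depth, and helper names are illustrative choices, not taken from the references:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_trees=100, lr=0.1, max_depth=3):
    """Gradient boosting for regression with squared loss."""
    f0 = y.mean()                        # initial constant prediction
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        residual = y - pred              # negative gradient of 1/2 (y - f)^2
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        pred += lr * tree.predict(X)     # take a small step along the new tree
        trees.append(tree)
    return f0, trees

def gbdt_predict(f0, trees, X, lr=0.1):
    return f0 + lr * sum(t.predict(X) for t in trees)

X = np.linspace(0, 6, 200).reshape(-1, 1)
y = np.sin(X[:, 0]) + 0.1 * np.random.randn(200)
f0, trees = gbdt_fit(X, y)
print(np.mean((gbdt_predict(f0, trees, X) - y) ** 2))  # training MSE
```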

Important notions

bootstrap sampling: random sampling with replacement
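For example, a bootstrap sample in NumPy is drawn with `replace=True` and has the same size as the original set:

```python
import numpy as np

data = np.arange(10)
rng = np.random.default_rng(42)
# A bootstrap sample is the same size as the original data and drawn with
# replacement, so some items repeat and roughly 1/e ≈ 36.8% are left out.
sample = rng.choice(data, size=len(data), replace=True)
print(sample)
```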

Reference 1

Reference 2: wiki ensemble

Reference 3: GBDT