Decision Trees

  • ID3 and C4.5 choose the best split using information entropy (information gain / gain ratio).
  • CART chooses the best split using the Gini index.
  • XGBoost chooses the best split using the coefficients of a second-order Taylor expansion of the loss.
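As a concrete illustration of the first two criteria, here is a minimal Python sketch that scores a candidate split by its impurity decrease; the function names and toy labels are illustrative, not taken from any particular library:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label vector (used by ID3/C4.5 via information gain)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    """Gini impurity of a label vector (used by CART)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_gain(parent, left, right, impurity=entropy):
    """Impurity decrease achieved by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted

y = np.array([0, 0, 0, 1, 1, 1, 1, 1])
print(split_gain(y, y[:3], y[3:], impurity=entropy))  # information gain (ID3/C4.5)
print(split_gain(y, y[:3], y[3:], impurity=gini))     # Gini decrease (CART)
```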

Ensemble Method / Meta-Algorithm

  • bagging / bootstrap aggregating: one way to reduce the variance of an estimate is to average together multiple estimates trained on different bootstrap samples (a minimal sketch follows below).
    • sampling: bootstrap sampling obtains the data subsets for training the base learners. Given a training set $D$, generate $n$ new training sets $D_i$, each sampled from $D$ with replacement.
    • aggregating: combine the base learners' outputs by majority voting for classification and by averaging for regression.
    • Example: random forest. $f(x) = \frac{1}{M} \sum_{m=1}^M f_m(x)$
      • feature importance: commonly estimated as the mean decrease in split impurity (or via permutation importance) averaged over the trees.
  • boosting: primarily reduces bias (it can also reduce variance). It incrementally builds an ensemble by training each new model instance to emphasize the training instances that previous models misclassified.
    • sampling: more weight is given to examples that were misclassified by earlier rounds
    • aggregation: a weighted majority vote (classification) or a weighted sum (regression)
    • Example: adaptive boosting / AdaBoost.
    • Example: GBDT (gradient boosted decision tree). Fits each new tree to the first-order (negative) gradient of the loss.
    • Example: XGBoost, an efficient implementation of GBDT that also uses the second-order gradient. Reference
  • Bagging draws, with replacement, a sample of the same size to train each learner, then combines the learners all at once (simple voting).
  • Boosting trains the learners one after another on the full sample set (with adjustable weights) and combines them iteratively (weighted combination).

The principal difference between boosting and the committee methods, such as bagging, is that base learners are trained in sequence on a weighted version of the data.
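A minimal bagging sketch under these definitions, assuming scikit-learn decision trees as the base learners; the helper names and hyperparameters are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_fit(X, y, n_estimators=10, seed=0):
    """Train base learners on bootstrap samples of (X, y)."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, len(X), size=len(X))          # sample with replacement
        models.append(DecisionTreeRegressor(max_depth=3).fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Aggregate by averaging (regression); use majority voting for classification."""
    return np.mean([m.predict(X) for m in models], axis=0)

X = np.random.rand(200, 2)
y = np.sin(6 * X[:, 0]) + 0.1 * np.random.randn(200)
models = bagging_fit(X, y)
print(bagging_predict(models, X[:5]))
```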

AdaBoost

Given $N$ samples and $M$ weak classifiers:

  1. Initialize sample weights $w_i = \frac{1}{N}$.
  2. For $m = 1, 2, \ldots, M$:
    1. Learn a weak classifier $G_m(x)$ on the weighted training data.
    2. Compute its weighted error rate on the training data: $e_m = \sum_{i=1}^N w_i I(G_m(x_i) \ne y_i)$
    3. Compute the coefficient of $G_m(x)$: $\alpha_m = \frac{1}{2} \log \frac{1 - e_m}{e_m}$
    4. Update the sample weights: $w_i = \frac{w_i \exp(-\alpha_m y_i G_m(x_i))}{Z_m}$, where $Z_m = \sum_{i=1}^N w_i \exp(-\alpha_m y_i G_m(x_i))$ is a normalizer.
  3. Take a linear combination of the $M$ weak models: $f(x) = \sum_{m=1}^M \alpha_m G_m(x)$. The final classifier is $G(x) = \operatorname{sign}(f(x)) = \operatorname{sign}\left(\sum_{m=1}^M \alpha_m G_m(x)\right)$.
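A minimal NumPy/scikit-learn sketch of the steps above, assuming labels in $\{-1, +1\}$ and depth-1 trees (stumps) as the weak classifiers; names and hyperparameters are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=20):
    """AdaBoost for labels y in {-1, +1}, following the steps above."""
    N = len(X)
    w = np.full(N, 1.0 / N)                        # step 1: uniform sample weights
    learners, alphas = [], []
    for _ in range(M):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        e = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)  # step 2.2: weighted error
        alpha = 0.5 * np.log((1 - e) / e)          # step 2.3: classifier coefficient
        w = w * np.exp(-alpha * y * pred)          # step 2.4: re-weight samples
        w /= w.sum()                               # normalize by Z_m
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    """Step 3: sign of the weighted sum of weak classifiers."""
    f = sum(a * g.predict(X) for g, a in zip(learners, alphas))
    return np.sign(f)

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
learners, alphas = adaboost_fit(X, y)
print((adaboost_predict(learners, alphas, X) == y).mean())  # training accuracy
```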

Gradient Boosting Decision Tree

Boosting tree: a boosting method that uses decision trees as the weak models. Gradient boosting fits each new tree to the negative gradient of the loss with respect to the current prediction; for squared loss this is simply the residual.
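A minimal sketch of gradient boosting for regression with squared loss, where the negative gradient equals the residual; the learning rate, tree depth, and helper names are illustrative choices, not taken from the references:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_trees=100, lr=0.1, max_depth=3):
    """Gradient boosting for regression with squared loss."""
    f0 = y.mean()                        # initial constant prediction
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        residual = y - pred              # negative gradient of 1/2 (y - f)^2
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        pred += lr * tree.predict(X)     # take a small step along the new tree
        trees.append(tree)
    return f0, trees

def gbdt_predict(f0, trees, X, lr=0.1):
    return f0 + lr * sum(t.predict(X) for t in trees)

X = np.linspace(0, 6, 200).reshape(-1, 1)
y = np.sin(X[:, 0]) + 0.1 * np.random.randn(200)
f0, trees = gbdt_fit(X, y)
print(np.mean((gbdt_predict(f0, trees, X) - y) ** 2))  # training MSE
```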

Important notions

bootstrap sampling: random sampling with replacement
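For example, a bootstrap sample in NumPy is drawn with `replace=True` and has the same size as the original set:

```python
import numpy as np

data = np.arange(10)
rng = np.random.default_rng(42)
# A bootstrap sample is the same size as the original data and drawn with
# replacement, so some items repeat and roughly 1/e ≈ 36.8% are left out.
sample = rng.choice(data, size=len(data), replace=True)
print(sample)
```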

Reference 1

Reference 2: wiki ensemble

Reference 3: GBDT