# Decision Trees

• ID3 and C4.5 choose the best split using information entropy (information gain and gain ratio, respectively).
• CART chooses the best split using the Gini index.
• XGBoost chooses the best split using coefficients from a second-order Taylor expansion of the loss.
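As a quick illustration of the first two criteria, a minimal Python sketch of entropy and Gini impurity (function names and toy labels are my own, for illustration only):

```python
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy of a label distribution (used by ID3/C4.5)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a label distribution (used by CART)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

# A pure node has zero impurity; a 50/50 split is maximally impure.
print(entropy([1, 1, 0, 0]))  # 1.0
print(gini([1, 1, 0, 0]))     # 0.5
```

A split is then scored by how much it reduces the (weighted) impurity of the child nodes relative to the parent.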

# Ensemble Method / Meta-Algorithm

• bagging / bootstrap aggregating: one way to reduce the variance of an estimate is to average together multiple estimates.
• bootstrap sampling to obtain the data subsets for training the base learners: given a training set $D$, generate $n$ new training sets $D_i$, each sampled from $D$ with replacement.
• aggregating: voting for classification and averaging for regression to combine the outputs of the base learners.
• Example: random forest. $f(x) = \frac{1}{M} \sum_{m=1}^M f_m(x)$
• feature importance: a random forest can estimate it, e.g., as the mean decrease in impurity over all splits that use a given feature.
• boosting: primarily reduces bias (and, to a lesser extent, variance). It incrementally builds an ensemble by training each new model to emphasize the training instances that previous models misclassified.
• sampling: more weight is given to examples that were misclassified in earlier rounds.
• aggregation: a weighted majority vote (classification) or a weighted sum (regression).
• Example: XGBoost, an efficient implementation of GBDT (gradient-boosted decision trees) that uses second-order gradient information.
• Bagging trains each learner on a same-size sample drawn with replacement, then combines them all at once (simple voting).
• Boosting trains the learners one after another on the full dataset (with adjustable sample weights), combining them iteratively (weighted sum).
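A minimal sketch of bagging in pure Python for the regression case: each base "learner" simply memorizes the mean of its bootstrap sample (a stand-in for a real regressor), and the ensemble averages the base predictions. `n_learners`, the toy data, and the function names are assumptions for illustration:

```python
import random

def bootstrap_sample(data, rng):
    """Sample len(data) points from data with replacement."""
    return [rng.choice(data) for _ in data]

def bagging_fit(data, n_learners, seed=0):
    """Train base learners; here each 'learner' just memorizes the mean
    of its bootstrap sample (stand-in for a real regressor)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_learners):
        sample = bootstrap_sample(data, rng)
        models.append(sum(sample) / len(sample))
    return models

def bagging_predict(models):
    """Aggregate by averaging: f(x) = (1/M) * sum_m f_m(x)."""
    return sum(models) / len(models)

data = [1.0, 2.0, 3.0, 4.0, 5.0]
models = bagging_fit(data, n_learners=100)
print(bagging_predict(models))  # close to the sample mean 3.0
```

Averaging many bootstrap estimates leaves the expectation roughly unchanged while shrinking the variance, which is the point of bagging.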

The principal difference between boosting and committee methods such as bagging is that the base learners are trained in sequence, each on a weighted version of the data.

AdaBoost with $N$ samples and $M$ weak classifiers:

1. Initialize the sample weights uniformly: $w_i = \frac{1}{N}$, $i = 1, \dots, N$.
2. For $m$ in $[1, 2, \dots, M]$:
    1. Learn a weak classifier $G_m(x)$ on the weighted training data.
    2. Compute its weighted error rate on the training data: $e_m = \sum_{i=1}^N w_i I(G_m(x_i) \ne y_i)$.
    3. Compute the coefficient of $G_m(x)$: $\alpha_m = \frac{1}{2} \log \frac{1 - e_m}{e_m}$.
    4. Update the sample weights: $w_i \leftarrow \frac{w_i \exp(-\alpha_m y_i G_m(x_i))}{Z_m}$, where $Z_m = \sum_{i=1}^N w_i \exp(-\alpha_m y_i G_m(x_i))$.
3. Form the linear combination of the $M$ weak models: $f(x) = \sum_{m=1}^M \alpha_m G_m(x)$. The final classifier is $G(x) = \mathrm{sign}(f(x)) = \mathrm{sign}\left(\sum_{m=1}^M \alpha_m G_m(x)\right)$.
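The steps above can be sketched in Python with NumPy, using decision stumps as the weak learners. The stump learner, the toy 1-D dataset, and `M=5` are assumptions for illustration, not part of the original notes:

```python
import numpy as np

def stump_learn(X, y, w):
    """Find the best one-feature threshold classifier under weights w.
    Returns (feature, threshold, polarity) minimizing weighted error."""
    best, best_err = (0, 0.0, 1), np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = np.where(pol * (X[:, j] - t) >= 0, 1, -1)
                err = w[pred != y].sum()
                if err < best_err:
                    best_err, best = err, (j, t, pol)
    return best

def stump_predict(stump, X):
    j, t, pol = stump
    return np.where(pol * (X[:, j] - t) >= 0, 1, -1)

def adaboost(X, y, M):
    """AdaBoost following the steps above; labels y must be +/-1."""
    N = len(y)
    w = np.full(N, 1.0 / N)               # step 1: uniform weights
    models, alphas = [], []
    for _ in range(M):
        stump = stump_learn(X, y, w)      # step 2.1: weak classifier G_m
        pred = stump_predict(stump, X)
        e = w[pred != y].sum()            # step 2.2: weighted error e_m
        e = np.clip(e, 1e-10, 1 - 1e-10)  # avoid division by zero / log(0)
        alpha = 0.5 * np.log((1 - e) / e)  # step 2.3: coefficient alpha_m
        w = w * np.exp(-alpha * y * pred)  # step 2.4: reweight and
        w /= w.sum()                       # normalize by Z_m
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    """Step 3: G(x) = sign(sum_m alpha_m G_m(x))."""
    f = sum(a * stump_predict(s, X) for s, a in zip(models, alphas))
    return np.sign(f)

# Toy 1-D data: +1 for x >= 3, -1 otherwise (learnable by one stump).
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([-1, -1, 1, 1, 1])
models, alphas = adaboost(X, y, M=5)
print(adaboost_predict(models, alphas, X))  # matches y on this toy set
```

Note that misclassified points get their weight multiplied by $e^{\alpha_m}$ while correct ones get $e^{-\alpha_m}$, which is exactly how later rounds are forced to focus on the hard examples.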