Gradient Boosting Decision Tree

In the previous article, we talked about AdaBoost, which combines the outputs of weak learners into a weighted sum that represents the final output of the boosted classifier. If you know little about AdaBoost or additive models, we highly recommend reading that article first.

Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.

Boosting Tree

A boosting tree is based on the additive model, which can be represented as follows:

\begin{align}f_M(x) = \sum_{m=1}^{M} T(x; \theta_m)\end{align}

where $T(x; \theta_m)$ stands for a decision tree, $M$ is the number of decision trees, and $\theta_m$ denotes the parameters of the m-th decision tree.

Assume that the initial boosting tree is $f_0(x) = 0$. The m-th step of this model is

\begin{align}f_m(x) = f_{m-1}(x) + T(x; \theta_m)\end{align}

where $f_{m-1}(x)$ is the current model. The parameters of the next decision tree $\theta_m$ can be determined by minimizing the following cost function:

\begin{align}\displaystyle\theta_m^* = \arg\min_{\theta_m} \sum_{i=1}^N L(y_i, f_{m-1}(x_i) + T(x_i; \theta_m))\end{align}

For binary classification problems, the boosting tree is equivalent to the AdaBoost classifier. In this article, we mainly discuss boosting trees for regression problems.

Given a training set $\mathbf{T}=\left\{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \dots, (\mathbf{x}_N, y_N)\right\}$, where $x_i \in \mathbb{R}^n$ and $y_i \in \mathbb{R}$, assume that the input space has been divided into $J$ disjoint regions $R_1, R_2, \dots, R_J$. A regression tree can then be formulated as:

\begin{align}T(x; \theta) = \sum_{j=1}^J c_j I(x \in R_j)\end{align}

where $c_j$ is the constant output on region $R_j$, $\theta = \{(R_1, c_1), (R_2, c_2), \dots, (R_J, c_J)\}$ stands for the partition of the input space, and $J$ is the number of leaf nodes of the regression tree.

As mentioned above, boosting trees for regression problems are fitted by forward stagewise modeling. Here we use the squared error as the loss function:

\begin{align}L(y, f(x)) = (y - f(x))^2\end{align}

The cost can be calculated as:

\begin{align}L(y, f_{m-1}(x) + T(x; \theta_m)) &= (y - f_{m-1}(x) - T(x; \theta_m))^2\\ &= (r - T(x; \theta_m))^2\end{align}

where $r = y - f_{m-1}(x)$ is the residual of the current model.

The procedure of boosting trees is listed below:

Input: training set $\mathbf{T}=\left\{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \dots, (\mathbf{x}_N, y_N)\right\}$, where $x_i \in \mathbb{R}^n$ and $y_i \in \mathbb{R}$.

Output: boosting tree $f_M(x)$

1. Initialize $f_0(x) = 0$
2. For $m = 1, 2, \dots, M$
• Calculate the residuals according to
$r_{mi} = y_i - f_{m-1}(x_i), \quad i = 1, 2, \dots, N$
• Obtain $T(x; \theta_m)$ by fitting a regression tree to the residuals $r_{mi}$
• Update $f_m(x) = f_{m-1}(x) + T(x; \theta_m)$
3. Obtain the boosting tree for the regression problem:
$\displaystyle f_M(x) = \sum_{m=1}^M T(x; \theta_m)$
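The procedure above can be sketched in a few lines of Python. This is a minimal sketch, assuming scikit-learn's `DecisionTreeRegressor` as the base learner (depth-1 stumps); the function names are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosting_tree(X, y, M=6, max_depth=1):
    """Forward stagewise boosting with squared error loss:
    each new tree T(x; theta_m) is fitted to the residuals
    r_mi = y_i - f_{m-1}(x_i) of the current model."""
    trees = []
    f = np.zeros(len(y))                    # f_0(x) = 0
    for m in range(M):
        r = y - f                           # residuals of the current model
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
        f += tree.predict(X)                # f_m = f_{m-1} + T(x; theta_m)
        trees.append(tree)
    return trees

def boosting_predict(trees, X):
    # f_M(x) = sum of all fitted trees
    return np.sum([t.predict(X) for t in trees], axis=0)
```

Note that with squared error there is no explicit gradient computation: each stage simply refits the residuals.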

An Example of Boosting Tree

The training examples are given below:

| $x_i$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $y_i$ | 5.56 | 5.7 | 5.91 | 6.4 | 6.8 | 7.05 | 8.9 | 8.7 | 9 | 9.05 |

Table 1. Training examples

To build the first regression tree, consider the following optimization problem:

\begin{align}\displaystyle \min_s\left[\min_{c_1} \sum_{x_i \in R_1} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2} (y_i - c_2)^2\right]\end{align}

where $R_1 = \{x | x \le s\}, R_2 = \{x | x > s\}$, $\displaystyle c_1 = \frac{1}{N_1} \sum_{x_i \in R_1} y_i,c_2 = \frac{1}{N_2} \sum_{x_i \in R_2} y_i$, and $N_1, N_2$ are the number of points in $R_1, R_2$ respectively.

Consider the following candidate values for $s$: $1.5, 2.5, 3.5, \dots, 9.5$. For each candidate, define

\begin{align}m(s) = \min_{c_1} \sum_{x_i \in R_1} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2} (y_i - c_2)^2\end{align}

Take $s = 1.5$ as an example: $R_1 = \{1\}, R_2 = \{2, 3, \dots, 10\}$, and $c_1 = 5.56, c_2 = 7.50$. Therefore, $m(s) = m(1.5) = 0 + 15.72 = 15.72.$

The results of $s$ and $m(s)$ are listed below:

| $s$ | 1.5 | 2.5 | 3.5 | 4.5 | 5.5 | 6.5 | 7.5 | 8.5 | 9.5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $m(s)$ | 15.72 | 12.07 | 8.36 | 5.78 | 3.91 | 1.93 | 8.01 | 11.73 | 15.74 |

Table 2. m(s) values of different values of s
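The search over split points can be reproduced with a short script. This is a minimal sketch using NumPy; variable names are illustrative:

```python
import numpy as np

x = np.arange(1, 11)
y = np.array([5.56, 5.7, 5.91, 6.4, 6.8, 7.05, 8.9, 8.7, 9.0, 9.05])

def m(s):
    """Squared error of the best two-region fit with split point s:
    each region predicts the mean of its targets."""
    left, right = y[x <= s], y[x > s]
    return ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()

candidates = np.arange(1.5, 10.0, 1.0)   # 1.5, 2.5, ..., 9.5
best_s = min(candidates, key=m)          # -> 6.5
```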

Obviously, $s = \arg\min_s m(s) = 6.5$, with corresponding regions $R_1 = \{1, 2, \dots, 6\}$ and $R_2 = \{7, 8, 9, 10\}$, and constants $c_1 = 6.24$, $c_2 = 8.91$. So the first regression tree $T_1(x)$ is

$T_1 (x) = \begin{cases}6.24, & x <6.5\\8.91, & x \ge 6.5\end{cases}$

$f_1(x) = T_1(x)$

The residuals of $f_1(x)$ are listed below, where $r_{2i} = y_i - f_1(x_i), i = 1, 2, \dots, 10$.

| $x_i$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $r_{2i}$ | -0.68 | -0.54 | -0.33 | 0.16 | 0.56 | 0.81 | -0.01 | -0.21 | 0.09 | 0.14 |

Table 3. Residual values

The cost of $f_1(x)$ is $L(y, f_1(x)) = \sum_{i=1}^{10} (y_i - f_1(x_i))^2 = 1.93$.

Following the same steps, $T_2(x)$ is obtained by fitting a regression tree to the residuals in Table 3:

$T_2 (x) = \begin{cases}-0.52, & x < 3.5\\0.22, & x \ge 3.5\end{cases}$

Then $f_2(x) = f_1(x) + T_2(x) = \begin{cases}5.72 & x < 3.5, \\6.46, & 3.5 \le x < 6.5 \\9.13, & x \ge 6.5\end{cases}$

The cost of $f_2(x)$ is $L(y, f_2(x)) = \sum_{i=1}^{10} (y_i - f_2(x_i))^2 = 0.79$.

Continuing in the same way, we obtain:

$T_3 (x) = \begin{cases}0.15, & x < 6.5\\-0.22, & x \ge 6.5\end{cases}, \quad L(y, f_3(x)) = 0.47$

$T_4 (x) = \begin{cases}-0.16, & x < 4.5\\0.11, & x \ge 4.5\end{cases}, \quad L(y, f_4(x)) = 0.30$

$T_5 (x) = \begin{cases}0.07, & x < 6.5\\-0.11, & x \ge 6.5\end{cases}, \quad L(y, f_5(x)) = 0.23$

$T_6 (x) = \begin{cases}-0.15, & x < 2.5\\0.04, & x \ge 2.5\end{cases}, \quad L(y, f_6(x)) = 0.17$

Suppose the loss $0.17$ is below our error threshold, so the boosting stops here. The final regression tree is:

$f(x) = f_6(x) = f_5(x) + T_6(x) = \begin{cases}5.63, & x < 2.5\\5.82, & 2.5 \le x < 3.5\\6.56, & 3.5 \le x < 4.5\\6.83, & 4.5 \le x < 6.5\\8.95, & x \ge 6.5\end{cases}$

For a general loss function, minimizing the cost function at each stage is not as straightforward. Friedman proposed the gradient boosting algorithm, in which the data-based analogue of the unconstrained negative gradient:

\begin{align}-\left[\frac{\partial L(y, f(x_i))}{\partial f(x_i)}\right]_{f(x) = f_{m-1}(x)}\end{align}

gives the best steepest-descent step direction in the data space at $f_{m-1}(x)$.
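For the squared error loss used earlier, this negative gradient is simply proportional to the residual, so gradient boosting reduces to the residual fitting described above:

\begin{align}-\left[\frac{\partial (y_i - f(x_i))^2}{\partial f(x_i)}\right]_{f(x) = f_{m-1}(x)} = 2(y_i - f_{m-1}(x_i)) = 2 r_{mi}\end{align}

For other differentiable losses the pseudo-residuals differ; for example, with the absolute loss $L(y, f(x)) = |y - f(x)|$ they become $\operatorname{sign}(y_i - f_{m-1}(x_i))$.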

The procedure of gradient boosting decision trees is listed below:

Input: training set $\mathbf{T}=\left\{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \dots, (\mathbf{x}_N, y_N)\right\}$, where $x_i \in \mathbb{R}^n$ and $y_i \in \mathbb{R}$.

Output: gradient boosting tree $f_M(x)$

1. Initialize $\displaystyle f_0(x) = \arg\min_{c} \sum_{i=1}^N L(y_i, c)$, where $c$ is a constant.
2. For $m = 1, 2, \dots, M$
• For $i = 1, 2, \dots, N$ compute the pseudo-residuals
$r_{mi} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x) = f_{m-1}(x)}$
• Fit a regression tree to the targets $r_{mi}$, giving terminal regions $R_{mj}, j = 1, 2, \dots, J$.
• For $j = 1, 2, \dots, J$ compute
$\displaystyle c_{mj} = \arg\min_c \sum_{x_i \in R_{mj}} L(y_i, f_{m-1}(x_i) + c)$
• Update $f_m(x) = f_{m-1}(x) + \sum_{j=1}^J c_{mj} I(x \in R_{mj})$
3. Obtain the gradient boosting tree for the regression problem:
$\displaystyle f_M(x) = f_0(x) + \sum_{m=1}^M \sum_{j=1}^J c_{mj} I(x \in R_{mj})$
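As a concrete instance of this procedure, the sketch below implements gradient boosting with the absolute loss $L(y, f(x)) = |y - f(x)|$. It is a minimal sketch assuming scikit-learn's `DecisionTreeRegressor`; the function names are illustrative. Step 1 gives $f_0$ as the median of $y$, the pseudo-residuals are $\operatorname{sign}(y_i - f_{m-1}(x_i))$, and each leaf constant $c_{mj}$ is the median of the residuals in that region (the arg-min of the absolute loss over the leaf):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbdt_lad(X, y, M=20, max_depth=1):
    """Gradient boosting with absolute loss L(y, f) = |y - f|."""
    f0 = np.median(y)                        # f_0 = argmin_c sum_i L(y_i, c)
    f = np.full(len(y), f0)
    stages = []
    for m in range(M):
        g = np.sign(y - f)                   # pseudo-residuals r_mi
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, g)
        leaves = tree.apply(X)               # terminal regions R_mj
        # c_mj = argmin_c sum_{x_i in R_mj} |y_i - f_{m-1}(x_i) - c|
        gamma = {leaf: np.median((y - f)[leaves == leaf])
                 for leaf in np.unique(leaves)}
        f += np.array([gamma[leaf] for leaf in leaves])
        stages.append((tree, gamma))
    return f0, stages

def predict_gbdt(f0, stages, X):
    f = np.full(len(X), f0, dtype=float)
    for tree, gamma in stages:
        leaves = tree.apply(X)
        f += np.array([gamma[leaf] for leaf in leaves])
    return f
```

The tree is used only to propose the regions $R_{mj}$; the leaf values it fitted to the pseudo-residuals are discarded and replaced by the loss-minimizing constants $c_{mj}$, exactly as in step 2 above.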

References

• Li, Hang. Statistical Learning Methods. Tsinghua University Press (2012).
• Friedman, Jerome H. Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics 29.5 (2001): 1189-1232.