Gradient Boosting Decision Tree

October 9, 2016

Machine Learning

32 minutes

In the previous article, we’ve talked about AdaBoost which combines output of weak learners into a weighted sum that represents the final output of the boosted classifier. If you know little about AdaBoost or additive model, we highly recommend you read the article first.

Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.

Boosting Tree

Boosting tree is based on additive model which can be repsentated as following:

$$\begin{align}f_M(x) = \sum_{m=1}^{M} T(x; \theta_m)\end{align}$$

where $T(x; \theta_m)$ stands for a decision tree, $M$ is the number of decision trees and $\theta_m$ is the parameters of the m-th decision tree.

Assume that the initial boosting tree $f_0(x) = 0$ the m-th step of this model is

$$\begin{align}f_m(x) = f_{m-1}(x) + T(x; \theta_m)\end{align}$$

where $f_{m-1}(x)$ is model currently. The parameters of the next decision tree $\theta_m$ can be deterimined by minimizing the following cost function:

$$ \theta_m^* = \text{argmin}_{\theta_m} \sum_{i=1}^M L(y_i, f_{m-1}(x_i) + T(x_i; \theta_m)) $$

For binary classification problem, the boosting tree is AdaBoost classifier. In this article, we mainly talk about boosting trees in regression problem.

Given a training set $\mathbf{T}=\left\{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \dots, (\mathbf{x}_n, y_n)\right\}$, where $x_i \in \mathbb{R}^n$, $y_i \in \mathbb{R}$. Assume that the input space has been divided into separate and disjoint regions $R_1, R_2, \dots, R_J$, the boosting tree can be formulated as:

$$\begin{align}T(x; \theta) = \sum_{j=1}^J c_j I(x \in R_j)\end{align}$$

where $c_j$ is a constant weight of the region, $\theta = \{(R_1, c_1), (R_2, c_2), \dots, (R_J, c_J)\}$ stands for the separation of input space, and $J$ is the number of leaf nodes of the regression tree.

As mentioned above, the boosting trees of regression problems fit forward stagewise modeling. Here we use squared error as loss function:

$$\begin{align}L(y, f(x)) = (y - f(x))^2\end{align}$$

The cost can be calculated as:

$$ \begin{align} L(y, f_{m-1}(x) + T(x; \theta_m)) &= (y - f_{m-1}(x) - T(x; \theta_m))^2 \\ &= (r - T(x; \theta_m))^2 \end{align} $$

where $r = y - f_{m-1}(x)$ which is the residual of the current model.

The procedure of boosting trees is listed below:

Input: training set $\mathbf{T}=\left\{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \dots, (\mathbf{x}_n, y_n)\right\}$, where $x_i \in \mathbb{R}^n$, $y_i \in \mathbb{R}$.

Output: boosting tree $f_M(x)$

Initialize $f_0(x) = 0$
For $m = 1, 2, \dots, M$
- Calculate residual according to
  $r_{mi} = y_i - f_{m-1}(x_i), i = 1, 2, \dots, N$
- Obtain $T(x, \theta_m)$ by fitting residuals
- Update $f_m(x) = f_{m-1}(x) + T(x; \theta_m)$
Obtain boosting tree for regression problem
$f_M(x) = \sum_{m=1}^M T(x; \theta_m)$

An Example of Boosting Tree

The training examples are given below:

Table 1. Training examples


$x_i$	1	2	3	4	5	6	7	8	9	10
$y_i$	5.56	5.70	5.91	6.40	6.80	7.05	8.90	8.70	9.00	9.05

Consider following optimization problem:

$$\begin{align} \min_s\left[\min_{c_1} \sum_{x_i \in R_1} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2} (y_i - c_2)^2\right]\end{align}$$

where $R_1 = \{x | x \le s\}, R_2 = \{x | x > s\}$, $c_1 = \frac{1}{N_1} \sum_{x_i \in R_1} y_i,c_2 = \frac{1}{N_2} \sum_{x_i \in R_2} y_i$, and $N_1, N_2$ are the number of points in $R_1, R_2$ respectively.

Consider following candidate values for $s$: $1.5, 2.5, 3.5, \dots, 9.5$.

$$\begin{align}m(s) = \min_{c_1} \sum_{x_i \in R_1} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2} (y_i - c_2)^2\end{align}$$

Take $s = 1.5$ as an example: $R_1 = \{1\}, R_2 = {2, 3, \dots, 10}$, and $c_1 = 5.56, c_2 = 7.50$. Therefore, $m(s) = m(1.5) = 0 + 15.72 = 15.72.$

The results of $s$ and $m(s)$ are listed below:

Table 2. m(s) values of different values of s


$s$	1.5	2.5	3.5	4.5	5.5	6.5	7.5	8.5	9.5
$m(s)$	15.72	12.07	8.36	5.78	3.91	1.93	8.01	11.73	15.74

Obvious, $s = \text{argmin}_s m(s) = 6.5$. Corresponding $R_1 = \{1, 2, \dots, 6\}$, $R_2 = \{7, 8, 9, 10\}$, $c_1 = 6.24$, $c_2 = 8.91$. So regression tree $T_1(x)$ is

$$T_1 (x) = \begin{cases}6.24, & x <6.5 \\ 8.91, & x \ge 6.5\end{cases}$$

$$f_1(x) = T_1(x)$$

The residual of $f_1(x)$ is listed below, where $r_{2i} = y_i - f_1(x_i), i = 1, 2, \dots, 10$.

Table 3. Residual values


$x_i$	1	2	3	4	5	6	7	8	9	10
$r_{2i}$	-0.68	-0.54	-0.33	0.16	0.56	0.81	-0.01	-0.21	0.09	0.14

The cost of $f_1(x)$ is $L(y, f_1(x)) = \sum_{i=1}^{10} (y_i - f_1(x_i))^2 = 1.93$.

Similar to above steps, $T_2(x)$ can be calculated by fitting values in Table 3:

$$T_2 (x) = \begin{cases}-0.52 & x <3.5 \\ 0.22 & x \ge 3.5\end{cases}$$

Then $f_2(x) = f_1(x) + T_2(x) = \begin{cases}5.72 & x < 3.5, \\ 6.46, & 3.5 \le x < 6.5 \\ 9.13, & x \ge 6.5\end{cases}$

The cost of $f_2(x)$ is $L(y, f_2(x)) = \sum_{i=1}^{10} (y_i - f_1(x_i))^2 = 0.79$.

We can obtain following values using similar way:

$$T_3 (x) = \begin{cases}0.15 & x <6.5 \\ -0.22 & x \ge 6.5\end{cases}, L(y, f_3(x)) = 0.47$$

$$T_4 (x) = \begin{cases}-0.16 & x <4.5 \\ 0.11 & x \ge 4.5\end{cases}, L(y, f_4(x)) = 0.30$$

$$T_5 (x) = \begin{cases}0.07 & x <6.5 \\ -0.11 & x \ge 6.5\end{cases}, L(y, f_5(x)) = 0.23$$

$$T_6 (x) = \begin{cases}-0.15 & x <2.5 \\ 0.04 & x \ge 2.5\end{cases}, L(y, f_3(x)) = 0.17$$

Suppose $cost = 0.17$ is below the threshold of error. And the final regression tree is:

Then $f(x) = f_6(x) = f_5(x) + T_6(x) = \begin{cases} 5.63 & x < 2.5, \\ 5.82, & 2.5 \le x < 3.5 \\ 6.56, & 3.5 \le x < 4.5 \\ 6.83, & 4.5 \le x < 6.5 \\ 8.95, & x \ge 6.5 \end{cases}$