In one of the working papers under review, I am using the Extreme Gradient Boosting (XGBoost) to identify examinees with potential item preknowledge in a certification exam. In the original paper, I did not talk much about the technical aspects of XGBoost and went straight to my application. However, one of the reviewers asked more description to the XGBoost to help the readers get a conceptual understanding of how XGBoost works.
After spending a few weeks and going over the original paper and this presentation so many times, I finally felt that I had a good grasp of it. Then, I came up with a small numerical illustration in the context of the problem I am working on. Here, I provide an informal introduction of this illustration with accompanying R code, so that anyone can play with it.
You can download the illustration dataset from this link.
Suppose, we have 30 observations with three variables. RT1 and RT2 are the response time (in seconds) for two test items and our features (predictors). \(y\) is a binary outcome variable with 0 indicating that the individual did not have item preknowledge and 1 indicating that the individual had item preknowledge. The goal is to develop a tree ensemble model to predict the outcome given the feature variables.
d < read.csv("illustration2.csv")
d$Obs < 1:30
d < d[,c(4,1,2,3)]
require(knitr)
require(kableExtra)
t = kable(d,digits=1,row.names=FALSE)
column_spec(t,column=1:4,width="1em")
Obs  RT1  RT2  y 

1  114  36  1 
2  92  121  1 
3  49  24  1 
4  58  30  1 
5  40  17  1 
6  67  66  1 
7  30  79  1 
8  19  14  1 
9  33  8  1 
10  172  38  1 
11  55  37  0 
12  103  79  0 
13  68  117  0 
14  194  17  0 
15  68  84  0 
16  48  127  0 
17  48  200  0 
18  61  47  0 
19  37  155  0 
20  58  84  0 
21  139  97  0 
22  99  84  0 
23  47  50  0 
24  41  54  0 
25  29  90  0 
26  117  31  0 
27  44  104  0 
28  34  62  0 
29  83  20  0 
30  64  57  0 
First, we need an objective function to optimize. So, we can look back and check if we are improving our model. In the context of predicting a binary outcome, XGBoost is using a logistic loss function.
\[\ell(y_i,\hat{y}_i)=y_i ln(1+e^{\hat{y}_i})+(1y_i)ln(1+e^{\hat{y}_i}),\]
where \(y_i\) and \(\hat{y}_i\) are the observed and predicted outcome for the \(i^{th}\) observation. Before we start iterations, XGBoost assigns a predicted value of 0 for each observation as our initial prediction. Let's call this Iteration 0.
d$y0 < 0
colnames(d)[5] < "$\\hat{y}^0$"
t = kable(d,digits=1,row.names=FALSE)
column_spec(t,column=1:5,width="1em")
Obs  RT1  RT2  y  \(\hat{y}^0\) 

1  114  36  1  0 
2  92  121  1  0 
3  49  24  1  0 
4  58  30  1  0 
5  40  17  1  0 
6  67  66  1  0 
7  30  79  1  0 
8  19  14  1  0 
9  33  8  1  0 
10  172  38  1  0 
11  55  37  0  0 
12  103  79  0  0 
13  68  117  0  0 
14  194  17  0  0 
15  68  84  0  0 
16  48  127  0  0 
17  48  200  0  0 
18  61  47  0  0 
19  37  155  0  0 
20  58  84  0  0 
21  139  97  0  0 
22  99  84  0  0 
23  47  50  0  0 
24  41  54  0  0 
25  29  90  0  0 
26  117  31  0  0 
27  44  104  0  0 
28  34  62  0  0 
29  83  20  0  0 
30  64  57  0  0 
The value of the loss function based on \(y_i\) and \(\hat{y}^0\) at Iteration 0 is 20.79442.
y < d[,4]
y0 < d[,5]
loss0 < sum(y*log(1+exp(y0)) + (1y)*log(1+exp(y0)))
loss0
[1] 20.79442
To build the first tree, we need the first and second order gradient statistics of the loss function from Iteration 0. Chen and Guestrin (2016) used a secondorder Taylor approximation for the loss function and derived the equations for the first and secondorder gradient statistics. In this case, we can compute the first and secondorder gradient statistics for each observation as the following:
\[g_i^{(1)} = \frac{(y_i1)e^{\hat{y_i}^{(0)}}+y_i}{e^{\hat{y_i}^{(0)}}+1}\]
\[h_i^{(1)} = \frac{e^{\hat{y_i}^{(0)}}}{(e^{\hat{y_i}^{(0)}}+1)^2}\]
d$g1 < ((y1)*exp(y0)+y)/(exp(y0)+1)
d$h1 < exp(y0)/((exp(y0)+1)^2)
t = kable(d,digits=1,row.names=FALSE)
column_spec(t,column=1:7,width="1em") %>%
add_header_above(c(" "," "," "," "," ","Iteration 1" = 2))
Iteration 1



Obs  RT1  RT2  y  \(\hat{y}^0\)  g1  h1 
1  114  36  1  0  0.5  0.2 
2  92  121  1  0  0.5  0.2 
3  49  24  1  0  0.5  0.2 
4  58  30  1  0  0.5  0.2 
5  40  17  1  0  0.5  0.2 
6  67  66  1  0  0.5  0.2 
7  30  79  1  0  0.5  0.2 
8  19  14  1  0  0.5  0.2 
9  33  8  1  0  0.5  0.2 
10  172  38  1  0  0.5  0.2 
11  55  37  0  0  0.5  0.2 
12  103  79  0  0  0.5  0.2 
13  68  117  0  0  0.5  0.2 
14  194  17  0  0  0.5  0.2 
15  68  84  0  0  0.5  0.2 
16  48  127  0  0  0.5  0.2 
17  48  200  0  0  0.5  0.2 
18  61  47  0  0  0.5  0.2 
19  37  155  0  0  0.5  0.2 
20  58  84  0  0  0.5  0.2 
21  139  97  0  0  0.5  0.2 
22  99  84  0  0  0.5  0.2 
23  47  50  0  0  0.5  0.2 
24  41  54  0  0  0.5  0.2 
25  29  90  0  0  0.5  0.2 
26  117  31  0  0  0.5  0.2 
27  44  104  0  0  0.5  0.2 
28  34  62  0  0  0.5  0.2 
29  83  20  0  0  0.5  0.2 
30  64  57  0  0  0.5  0.2 
Now the tricky part here is to find the best split of these observations that would improve the predictions most (reduce the value of loss function most). The observations can be split based on the RT1 or RT2 features. For instance, suppose we divide the observations into two groups based on whether the RT1 value is below or above 58.5 seconds. Below is a hypothetical tree structure with two leaves. There are 16 observations in the first leaf (\(S_1\)), and 14 observations in the second leaf (\(S_2\)).
There are many potential splits available. We can say the number of possible theoretical splits are \(2^{30}\). However, most of them are not relevant here. In this case, there are only 53 possible splits of interest, 27 based on RT1 and 26 based on RT2 (as many as the number of unique elements for RT1 and RT2).
So, how could we know if this particular split is good or bad? Or, how could we know the best split that would improve our predictions most? To evaluate a tree structure like this, we need a quantity to measure its quality. Chen and Guestrin (2016) derived the Gain score for evaluating a fixed tree structure.
\[ Gain = \left [ \sum_{l=1}^{L} \left ( \frac{\sum_{i \epsilon S_l}g_i)^2}{\sum_{i \epsilon S_l}h_i+\lambda}  \frac{(\sum_{i}g_i)^2}{\sum_{i}h_i+\lambda} \right ) \right ]\gamma \]
Here, \(l\) indexes the leaf, \(L\) is the number of leaves in the tree structure, and \(S_l\) indicates the set of observations in the \(l^{th}\) leaf. \(\lambda\) and \(\gamma\) are the parameters to penalize the complexity of the tree structure and should be set to specific values. In practice, these values are typically determined through a parameter tuning procedure. For the sake of this illustration, I set the values to 1.5 and 1 for \(\lambda\) and \(\gamma\), respectively.
Now, we will compute the gain score for each possible split based on the RT1 and RT2 features, and then check which one gives the best solution.
g < d$g1
h < d$h1
lambda = 1.5
gamma = 1
# All Splits based on RT1
gain < c()
s < sort(unique(d$RT1))+.5
for(i in 1:length(s)) {
split < (d$RT1<s[i])*1
GL < sum(g[which(split==1)])
GR < sum(g[which(split==0)])
HL < sum(h[which(split==1)])
HR < sum(h[which(split==0)])
gain[i] < ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda))  (((GL+GR)^2)/(HL+HR+lambda)))  gamma
}
pp1 = cbind(as.data.frame(cbind(s,gain)),"RT1")
colnames(pp1) < c("Threshold","Gain","Feature")
# All Splits for RT2
gain < c()
s < sort(unique(d$RT2))+.5
for(i in 1:length(s)) {
split < (d$RT2<s[i])*1
GL < sum(g[which(split==1)])
GR < sum(g[which(split==0)])
HL < sum(h[which(split==1)])
HR < sum(h[which(split==0)])
gain[i] < ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda))  (((GL+GR)^2)/(HL+HR+lambda)))  gamma
}
pp2 = cbind(as.data.frame(cbind(s,gain)),"RT2")
colnames(pp2) < c("Threshold","Gain","Feature")
pp < rbind(pp1,pp2)
pp < pp[order(pp$Gain,decreasing=TRUE),]
kable(head(pp),digits=3,row.names=FALSE)
Threshold  Gain  Feature 

38.5  3.512  RT2 
36.5  3.081  RT2 
30.5  2.742  RT2 
47.5  2.444  RT2 
37.5  2.011  RT2 
31.5  1.651  RT2 
As we can see, the split with the largest gain is the one based on a threshold level of 38.5 for the RT2 feature. So, the first branch of the first tree is developed.
The algorithm will be repeated for the observations on each leaf. The algorithm will compute a gain score for all potential splits on each leaf separately to see if there is a new branch with a positive gain that can be added to any of the two leaves (or to both leaves). Let's first look at all potential splits on the first leaf (RT2 < 38.5).
# All RT1 Splits on Leaf 1
gain < c()
s < sort(d[d$RT2<38.5,]$RT1)+.5
for(i in 1:length(s)) {
gr1 < which(d$RT2<38.5 & d$RT1<s[i])
gr2 < which(d$RT2<38.5 & d$RT1>s[i])
GL < sum(g[gr1])
GR < sum(g[gr2])
HL < sum(h[gr1])
HR < sum(h[gr2])
gain[i] < ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda))  (((GL+GR)^2)/(HL+HR+lambda)))  gamma
}
pp1 = cbind(as.data.frame(cbind(s,gain)),"RT1")
colnames(pp1) < c("Threshold","Gain","Feature")
# All RT2 Splits on Leaf 1
gain < c()
s < sort(d[d$RT2<38.5,]$RT2)+.5
for(i in 1:length(s)) {
gr1 < which(d$RT2<38.5 & d$RT2<s[i])
gr2 < which(d$RT2<38.5 & d$RT2>s[i])
GL < sum(g[gr1])
GR < sum(g[gr2])
HL < sum(h[gr1])
HR < sum(h[gr2])
gain[i] < ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda))  (((GL+GR)^2)/(HL+HR+lambda)))  gamma
}
pp2 = cbind(as.data.frame(cbind(s,gain)),"RT1")
colnames(pp2) < c("Threshold","Gain","Feature")
pp < rbind(pp1,pp2)
pp < pp[order(pp$Gain,decreasing=TRUE),]
kable(head(pp),digits=3,row.names=FALSE)
Threshold  Gain  Feature 

49.5  0.148  RT1 
58.5  0.105  RT1 
114.5  0.275  RT1 
172.5  0.387  RT1 
40.5  0.529  RT1 
55.5  0.711  RT1 
The results indicate that there is only one potential split with a positive gain, a split based on a threshold level of 49.5 for the RT1 feature. Now, let's check all potential splits on the second leaf (RT2 > 38.5).
# All RT1 Splits on Leaf 2
gain < c()
s < sort(d[d$RT2>38.5,]$RT1)+.5
for(i in 1:length(s)) {
gr1 < which(d$RT2>38.5 & d$RT1<s[i])
gr2 < which(d$RT2>38.5 & d$RT1>s[i])
GL < sum(g[gr1])
GR < sum(g[gr2])
HL < sum(h[gr1])
HR < sum(h[gr2])
gain[i] < ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda))  (((GL+GR)^2)/(HL+HR+lambda)))  gamma
}
pp1 = cbind(as.data.frame(cbind(s,gain)),"RT1")
colnames(pp1) < c("Threshold","Gain","Feature")
# All RT2 Splits on Leaf 2
gain < c()
s < sort(d[d$RT2>38.5,]$RT2)+.5
for(i in 1:length(s)) {
gr1 < which(d$RT2>38.5 & d$RT2<s[i])
gr2 < which(d$RT2>38.5 & d$RT2>s[i])
GL < sum(g[gr1])
GR < sum(g[gr2])
HL < sum(h[gr1])
HR < sum(h[gr2])
gain[i] < ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda))  (((GL+GR)^2)/(HL+HR+lambda)))  gamma
}
pp2 = cbind(as.data.frame(cbind(s,gain)),"RT1")
colnames(pp2) < c("Threshold","Gain","Feature")
pp < rbind(pp1,pp2)
pp < pp[order(pp$Gain,decreasing=TRUE),]
kable(head(pp),digits=3,row.names=FALSE)
Threshold  Gain  Feature 

30.5  0.412  RT1 
139.5  1.000  RT1 
200.5  1.000  RT1 
34.5  1.103  RT1 
64.5  1.512  RT1 
37.5  1.598  RT1 
It seems none of the possible splits on the second leaf yield a positive gain score. In this case, the search algorithm suggests adding a new branch on the first leaf based on a threshold level of 49.5 for the RT1 feature. Our first tree structure now looks like this.
The depth of our tree is now 2. As you can imagine, we can continue to search, and add even more branches to the tree by picking the best split as long as there is a positive gain. This would increase the depth of the tree. The more depth a tree structure has, the more complex interactions it accounts for when predicting the outcome. For the sake of this illustration, we stop and end Iteration 1 here.
Once, the tree structure is determined, we can now compute the weights for each leaf. Our first tree structure has three leaves now. Chen and Guestrin (2016) derived that the optimal leaf weight for a fixed tree structure is equal to
\[w_l= \frac{\sum_{i \epsilon S_l}g_i}{\sum_{i \epsilon S_l}h_i+\lambda}\]
Let's compute the leaf weights on the first tree structure.
gr1 < which(d$RT2 > 38.5)
gr1
[1] 2 6 7 12 13 15 16 17 18 19 20 21 22 23 24 25 27 28 30
gr2 < which(d$RT2 <38.5 & d$RT1<49.5)
gr2
[1] 3 5 8 9
gr3 < which(d$RT2 <38.5 & d$RT1>49.5)
gr3
[1] 1 4 10 11 14 26 29
w1 = sum(g[gr1])/(sum(h[gr1])+lambda)
w2 = sum(g[gr2])/(sum(h[gr2])+lambda)
w3 = sum(g[gr3])/(sum(h[gr3])+lambda)
w1
[1] 1.04
w2
[1] 0.8
w3
[1] 0.1538462
These weights represent how we will improve our predictions at the end of Iteration 1 before we move on to Iteration 2. Each observation in our dataset is now assigned to one of these weights depending on which leaf they belong to. In the end, we add these estimated leaf weights to the previous predictions (\(\hat{y}^0\)) and obtain the new predictions at the end of Iteration 1 (\(\hat{y}^1\)).
d$w < NA
d$y1 < NA
d[gr1,]$w = w1
d[gr2,]$w = w2
d[gr3,]$w = w3
d$y1 < d[,5] + d$w
colnames(d)[8] < "$w^{1}$"
colnames(d)[9] < "$\\hat{y}^1$"
t = kable(d,digits=3,row.names=FALSE)
column_spec(t,column=1:7,width="1em") %>%
add_header_above(c(" "," "," "," "," ","Iteration 1" = 4))
Iteration 1



Obs  RT1  RT2  y  \(\hat{y}^0\)  g1  h1  \(w^{1}\)  \(\hat{y}^1\) 
1  114  36  1  0  0.5  0.25  0.154  0.154 
2  92  121  1  0  0.5  0.25  1.040  1.040 
3  49  24  1  0  0.5  0.25  0.800  0.800 
4  58  30  1  0  0.5  0.25  0.154  0.154 
5  40  17  1  0  0.5  0.25  0.800  0.800 
6  67  66  1  0  0.5  0.25  1.040  1.040 
7  30  79  1  0  0.5  0.25  1.040  1.040 
8  19  14  1  0  0.5  0.25  0.800  0.800 
9  33  8  1  0  0.5  0.25  0.800  0.800 
10  172  38  1  0  0.5  0.25  0.154  0.154 
11  55  37  0  0  0.5  0.25  0.154  0.154 
12  103  79  0  0  0.5  0.25  1.040  1.040 
13  68  117  0  0  0.5  0.25  1.040  1.040 
14  194  17  0  0  0.5  0.25  0.154  0.154 
15  68  84  0  0  0.5  0.25  1.040  1.040 
16  48  127  0  0  0.5  0.25  1.040  1.040 
17  48  200  0  0  0.5  0.25  1.040  1.040 
18  61  47  0  0  0.5  0.25  1.040  1.040 
19  37  155  0  0  0.5  0.25  1.040  1.040 
20  58  84  0  0  0.5  0.25  1.040  1.040 
21  139  97  0  0  0.5  0.25  1.040  1.040 
22  99  84  0  0  0.5  0.25  1.040  1.040 
23  47  50  0  0  0.5  0.25  1.040  1.040 
24  41  54  0  0  0.5  0.25  1.040  1.040 
25  29  90  0  0  0.5  0.25  1.040  1.040 
26  117  31  0  0  0.5  0.25  0.154  0.154 
27  44  104  0  0  0.5  0.25  1.040  1.040 
28  34  62  0  0  0.5  0.25  1.040  1.040 
29  83  20  0  0  0.5  0.25  0.154  0.154 
30  64  57  0  0  0.5  0.25  1.040  1.040 
The value of the loss function based on \(y_i\) and \(\hat{y}^1\) at the end of Iteration 1 is 15.15 (reduced from 20.79)
y < d[,4]
y1 < d[,9]
loss1 < sum(y*log(1+exp(y1)) + (1y)*log(1+exp(y1)))
loss1
[1] 15.15075
The remaining of this analysis is simply a repetition of Iteration 1. At the end of Iteration 1, we developed the first tree in our model. Now, we will develop the second one. First, we have to update our first and secondorder gradient statistics based on the new predictions at the end of Iteration 1.
\[g_i^{(2)} = \frac{(y_i1)e^{\hat{y_i}^{(1)}}+y_i}{e^{\hat{y_i}^{(1)}}+1}\]
\[h_i^{(2)} = \frac{e^{\hat{y_i}^{(1)}}}{(e^{\hat{y_i}^{(1)}}+1)^2}\]
d$g2 < ((y1)*exp(y1)+y)/(exp(y1)+1)
d$h2 < exp(y1)/((exp(y1)+1)^2)
t = kable(d,digits=3,row.names=FALSE)
column_spec(t,column=1:7,width="1em") %>%
add_header_above(c(" "," "," "," "," ","Iteration 1" = 4, "Iteration 2" = 2))
Iteration 1

Iteration 2



Obs  RT1  RT2  y  \(\hat{y}^0\)  g1  h1  \(w^{1}\)  \(\hat{y}^1\)  g2  h2 
1  114  36  1  0  0.5  0.25  0.154  0.154  0.538  0.249 
2  92  121  1  0  0.5  0.25  1.040  1.040  0.739  0.193 
3  49  24  1  0  0.5  0.25  0.800  0.800  0.310  0.214 
4  58  30  1  0  0.5  0.25  0.154  0.154  0.538  0.249 
5  40  17  1  0  0.5  0.25  0.800  0.800  0.310  0.214 
6  67  66  1  0  0.5  0.25  1.040  1.040  0.739  0.193 
7  30  79  1  0  0.5  0.25  1.040  1.040  0.739  0.193 
8  19  14  1  0  0.5  0.25  0.800  0.800  0.310  0.214 
9  33  8  1  0  0.5  0.25  0.800  0.800  0.310  0.214 
10  172  38  1  0  0.5  0.25  0.154  0.154  0.538  0.249 
11  55  37  0  0  0.5  0.25  0.154  0.154  0.462  0.249 
12  103  79  0  0  0.5  0.25  1.040  1.040  0.261  0.193 
13  68  117  0  0  0.5  0.25  1.040  1.040  0.261  0.193 
14  194  17  0  0  0.5  0.25  0.154  0.154  0.462  0.249 
15  68  84  0  0  0.5  0.25  1.040  1.040  0.261  0.193 
16  48  127  0  0  0.5  0.25  1.040  1.040  0.261  0.193 
17  48  200  0  0  0.5  0.25  1.040  1.040  0.261  0.193 
18  61  47  0  0  0.5  0.25  1.040  1.040  0.261  0.193 
19  37  155  0  0  0.5  0.25  1.040  1.040  0.261  0.193 
20  58  84  0  0  0.5  0.25  1.040  1.040  0.261  0.193 
21  139  97  0  0  0.5  0.25  1.040  1.040  0.261  0.193 
22  99  84  0  0  0.5  0.25  1.040  1.040  0.261  0.193 
23  47  50  0  0  0.5  0.25  1.040  1.040  0.261  0.193 
24  41  54  0  0  0.5  0.25  1.040  1.040  0.261  0.193 
25  29  90  0  0  0.5  0.25  1.040  1.040  0.261  0.193 
26  117  31  0  0  0.5  0.25  0.154  0.154  0.462  0.249 
27  44  104  0  0  0.5  0.25  1.040  1.040  0.261  0.193 
28  34  62  0  0  0.5  0.25  1.040  1.040  0.261  0.193 
29  83  20  0  0  0.5  0.25  0.154  0.154  0.462  0.249 
30  64  57  0  0  0.5  0.25  1.040  1.040  0.261  0.193 
The algorithm computes the same search as in the Iteration 1 for all possible splits based on the RT1 and RT2 features using the updated gradient statistics.
g < d$g2
h < d$h2
# All Splits based on RT1
gain < c()
s < sort(unique(d$RT1))+.5
for(i in 1:length(s)) {
split < (d$RT1<s[i])*1
GL < sum(g[which(split==1)])
GR < sum(g[which(split==0)])
HL < sum(h[which(split==1)])
HR < sum(h[which(split==0)])
gain[i] < ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda))  (((GL+GR)^2)/(HL+HR+lambda)))  gamma
}
pp1 = cbind(as.data.frame(cbind(s,gain)),"RT1")
colnames(pp1) < c("Threshold","Gain","Feature")
# All Splits for RT2
gain < c()
s < sort(unique(d$RT2))+.5
for(i in 1:length(s)) {
split < (d$RT2<s[i])*1
GL < sum(g[which(split==1)])
GR < sum(g[which(split==0)])
HL < sum(h[which(split==1)])
HR < sum(h[which(split==0)])
gain[i] < ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda))  (((GL+GR)^2)/(HL+HR+lambda)))  gamma
}
pp2 = cbind(as.data.frame(cbind(s,gain)),"RT2")
colnames(pp2) < c("Threshold","Gain","Feature")
pp < rbind(pp1,pp2)
pp < pp[order(pp$Gain,decreasing=TRUE),]
kable(head(pp),digits=3,row.names=FALSE)
Threshold  Gain  Feature 

33.5  0.009  RT1 
79.5  0.001  RT2 
38.5  0.124  RT2 
36.5  0.248  RT2 
40.5  0.315  RT1 
30.5  0.350  RT2 
It seems the best split for the first branch of the second tree is based on a threshold value of 33.5 for the RT1 feature.
Similar to Iteration 1, the algorithm will compute a gain score for all potential splits on each leaf of the second tree to see if there is a new branch with a positive gain that can be added to any of the two leaves (or to both leaves).
Let's first look at all potential splits on the first leaf (RT1 > 33.5).
# All RT1 Splits on Leaf 1
gain < c()
s < sort(d[d$RT1>33.5,]$RT1)+.5
for(i in 1:length(s)) {
gr1 < which(d$RT1>33.5 & d$RT1<s[i])
gr2 < which(d$RT1>33.5 & d$RT1>s[i])
GL < sum(g[gr1])
GR < sum(g[gr2])
HL < sum(h[gr1])
HR < sum(h[gr2])
gain[i] < ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda))  (((GL+GR)^2)/(HL+HR+lambda)))  gamma
}
pp1 = cbind(as.data.frame(cbind(s,gain)),"RT1")
colnames(pp1) < c("Threshold","Gain","Feature")
# All RT2 Splits on Leaf 1
gain < c()
s < sort(d[d$RT1>33.5,]$RT2)+.5
for(i in 1:length(s)) {
gr1 < which(d$RT1>33.5 & d$RT2<s[i])
gr2 < which(d$RT1>33.5 & d$RT2>s[i])
GL < sum(g[gr1])
GR < sum(g[gr2])
HL < sum(h[gr1])
HR < sum(h[gr2])
gain[i] < ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda))  (((GL+GR)^2)/(HL+HR+lambda)))  gamma
}
pp2 = cbind(as.data.frame(cbind(s,gain)),"RT1")
colnames(pp2) < c("Threshold","Gain","Feature")
pp < rbind(pp1,pp2)
pp < pp[order(pp$Gain,decreasing=TRUE),]
kable(head(pp),digits=3,row.names=FALSE)
Threshold  Gain  Feature 

38.5  0.319  RT1 
36.5  0.517  RT1 
47.5  0.567  RT1 
66.5  0.631  RT1 
30.5  0.680  RT1 
83.5  0.702  RT1 
The results indicate that there is no split with a positive gain that can be added to the first leaf. Let's check the second leaf (RT1 < 33.5).
# All RT1 Splits on Leaf 2
gain < c()
s < sort(d[d$RT1<33.5,]$RT1)+.5
for(i in 1:length(s)) {
gr1 < which(d$RT1<33.5 & d$RT1<s[i])
gr2 < which(d$RT1<33.5 & d$RT1>s[i])
GL < sum(g[gr1])
GR < sum(g[gr2])
HL < sum(h[gr1])
HR < sum(h[gr2])
gain[i] < ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda))  (((GL+GR)^2)/(HL+HR+lambda)))  gamma
}
pp1 = cbind(as.data.frame(cbind(s,gain)),"RT1")
colnames(pp1) < c("Threshold","Gain","Feature")
# All RT2 Splits on Leaf 2
gain < c()
s < sort(d[d$RT1<33.5,]$RT2)+.5
for(i in 1:length(s)) {
gr1 < which(d$RT1<33.5 & d$RT2<s[i])
gr2 < which(d$RT1<33.5 & d$RT2>s[i])
GL < sum(g[gr1])
GR < sum(g[gr2])
HL < sum(h[gr1])
HR < sum(h[gr2])
gain[i] < ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda))  (((GL+GR)^2)/(HL+HR+lambda)))  gamma
}
pp2 = cbind(as.data.frame(cbind(s,gain)),"RT1")
colnames(pp2) < c("Threshold","Gain","Feature")
pp < rbind(pp1,pp2)
pp < pp[order(pp$Gain,decreasing=TRUE),]
kable(head(pp),digits=3,row.names=FALSE)
Threshold  Gain  Feature 

79.5  0.610  RT1 
29.5  0.943  RT1 
33.5  1.000  RT1 
90.5  1.000  RT1 
19.5  1.169  RT1 
30.5  1.169  RT1 
There is also no split with a positive gain that can be added to the second leaf. Therefore, we stop and Iteration 2 ends. Our second tree has a depth of one. Let's compute the leaf weights on the second tree structure.
gr1 < which(d$RT1 > 33.5)
gr2 < which(d$RT1 < 33.5)
w1 = sum(g[gr1])/(sum(h[gr1])+lambda)
w2 = sum(g[gr2])/(sum(h[gr2])+lambda)
w1
[1] 0.2951779
w2
[1] 0.4744527
These weights similarly represent how we will improve our predictions at the end of Iteration 2 before we move on to Iteration 3. Each observation in our dataset is now assigned to one of these weights depending on which leaf they belong to in Tree 2. In the end, we add these estimated leaf weights to the previous predictions (\(\hat{y}^1\)) and obtain the new predictions at the end of Iteration 2 (\(\hat{y}^2\)).
d$w < NA
d$y2 < NA
d[gr1,]$w = w1
d[gr2,]$w = w2
d$y2 < d[,9] + d$w
colnames(d)[12] < "$w^{2}$"
colnames(d)[13] < "$\\hat{y}^2$"
t = kable(d,digits=3,row.names=FALSE)
column_spec(t,column=1:7,width="1em") %>%
add_header_above(c(" "," "," "," "," ","Iteration 1" = 4, "Iteration 2" = 4))
Iteration 1

Iteration 2



Obs  RT1  RT2  y  \(\hat{y}^0\)  g1  h1  \(w^{1}\)  \(\hat{y}^1\)  g2  h2  \(w^{2}\)  \(\hat{y}^2\) 
1  114  36  1  0  0.5  0.25  0.154  0.154  0.538  0.249  0.295  0.449 
2  92  121  1  0  0.5  0.25  1.040  1.040  0.739  0.193  0.295  1.335 
3  49  24  1  0  0.5  0.25  0.800  0.800  0.310  0.214  0.295  0.505 
4  58  30  1  0  0.5  0.25  0.154  0.154  0.538  0.249  0.295  0.449 
5  40  17  1  0  0.5  0.25  0.800  0.800  0.310  0.214  0.295  0.505 
6  67  66  1  0  0.5  0.25  1.040  1.040  0.739  0.193  0.295  1.335 
7  30  79  1  0  0.5  0.25  1.040  1.040  0.739  0.193  0.474  0.566 
8  19  14  1  0  0.5  0.25  0.800  0.800  0.310  0.214  0.474  1.274 
9  33  8  1  0  0.5  0.25  0.800  0.800  0.310  0.214  0.474  1.274 
10  172  38  1  0  0.5  0.25  0.154  0.154  0.538  0.249  0.295  0.449 
11  55  37  0  0  0.5  0.25  0.154  0.154  0.462  0.249  0.295  0.449 
12  103  79  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335 
13  68  117  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335 
14  194  17  0  0  0.5  0.25  0.154  0.154  0.462  0.249  0.295  0.449 
15  68  84  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335 
16  48  127  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335 
17  48  200  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335 
18  61  47  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335 
19  37  155  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335 
20  58  84  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335 
21  139  97  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335 
22  99  84  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335 
23  47  50  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335 
24  41  54  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335 
25  29  90  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.474  0.566 
26  117  31  0  0  0.5  0.25  0.154  0.154  0.462  0.249  0.295  0.449 
27  44  104  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335 
28  34  62  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335 
29  83  20  0  0  0.5  0.25  0.154  0.154  0.462  0.249  0.295  0.449 
30  64  57  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335 
The value of the loss function based on \(y_i\) and \(\hat{y}^2\) at the end of Iteration 2 is 14.35 (reduced from 15.15).
y < d[,4]
y2 < d[,13]
loss2 < sum(y*log(1+exp(y2)) + (1y)*log(1+exp(y2)))
loss2
[1] 14.34646
Our tree ensemble model at the end of Iteration 2 now looks like this.
We keep repeating the same cycle. First, we update our first and secondorder gradient statistics based on the new predictions at the end of Iteration 2.
\[g_i^{(3)} = \frac{(y_i1)e^{\hat{y_i}^{(2)}}+y_i}{e^{\hat{y_i}^{(2)}}+1}\]
\[h_i^{(3)} = \frac{e^{\hat{y_i}^{(2)}}}{(e^{\hat{y_i}^{(2)}}+1)^2}\]
d$g3 < ((y1)*exp(y2)+y)/(exp(y2)+1)
d$h3 < exp(y2)/((exp(y2)+1)^2)
t = kable(d,digits=3,row.names=FALSE)
column_spec(t,column=1:7,width="1em") %>%
add_header_above(c(" "," "," "," "," ","Iteration 1" = 4, "Iteration 2" = 4, "Iteration 3" = 2))
Iteration 1

Iteration 2

Iteration 3



Obs  RT1  RT2  y  \(\hat{y}^0\)  g1  h1  \(w^{1}\)  \(\hat{y}^1\)  g2  h2  \(w^{2}\)  \(\hat{y}^2\)  g3  h3 
1  114  36  1  0  0.5  0.25  0.154  0.154  0.538  0.249  0.295  0.449  0.610  0.238 
2  92  121  1  0  0.5  0.25  1.040  1.040  0.739  0.193  0.295  1.335  0.792  0.165 
3  49  24  1  0  0.5  0.25  0.800  0.800  0.310  0.214  0.295  0.505  0.376  0.235 
4  58  30  1  0  0.5  0.25  0.154  0.154  0.538  0.249  0.295  0.449  0.610  0.238 
5  40  17  1  0  0.5  0.25  0.800  0.800  0.310  0.214  0.295  0.505  0.376  0.235 
6  67  66  1  0  0.5  0.25  1.040  1.040  0.739  0.193  0.295  1.335  0.792  0.165 
7  30  79  1  0  0.5  0.25  1.040  1.040  0.739  0.193  0.474  0.566  0.638  0.231 
8  19  14  1  0  0.5  0.25  0.800  0.800  0.310  0.214  0.474  1.274  0.218  0.171 
9  33  8  1  0  0.5  0.25  0.800  0.800  0.310  0.214  0.474  1.274  0.218  0.171 
10  172  38  1  0  0.5  0.25  0.154  0.154  0.538  0.249  0.295  0.449  0.610  0.238 
11  55  37  0  0  0.5  0.25  0.154  0.154  0.462  0.249  0.295  0.449  0.390  0.238 
12  103  79  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335  0.208  0.165 
13  68  117  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335  0.208  0.165 
14  194  17  0  0  0.5  0.25  0.154  0.154  0.462  0.249  0.295  0.449  0.390  0.238 
15  68  84  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335  0.208  0.165 
16  48  127  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335  0.208  0.165 
17  48  200  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335  0.208  0.165 
18  61  47  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335  0.208  0.165 
19  37  155  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335  0.208  0.165 
20  58  84  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335  0.208  0.165 
21  139  97  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335  0.208  0.165 
22  99  84  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335  0.208  0.165 
23  47  50  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335  0.208  0.165 
24  41  54  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335  0.208  0.165 
25  29  90  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.474  0.566  0.362  0.231 
26  117  31  0  0  0.5  0.25  0.154  0.154  0.462  0.249  0.295  0.449  0.390  0.238 
27  44  104  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335  0.208  0.165 
28  34  62  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335  0.208  0.165 
29  83  20  0  0  0.5  0.25  0.154  0.154  0.462  0.249  0.295  0.449  0.390  0.238 
30  64  57  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335  0.208  0.165 
The algorithm searches among all possible splits based on the RT1 and RT2 features using the updated gradient statistics.
g < d$g3
h < d$h3
# All Splits based on RT1
gain < c()
s < sort(unique(d$RT1))+.5
for(i in 1:length(s)) {
split < (d$RT1<s[i])*1
GL < sum(g[which(split==1)])
GR < sum(g[which(split==0)])
HL < sum(h[which(split==1)])
HR < sum(h[which(split==0)])
gain[i] < ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda))  (((GL+GR)^2)/(HL+HR+lambda)))  gamma
}
pp1 = cbind(as.data.frame(cbind(s,gain)),"RT1")
colnames(pp1) < c("Threshold","Gain","Feature")
# All Splits for RT2
gain < c()
s < sort(unique(d$RT2))+.5
for(i in 1:length(s)) {
split < (d$RT2<s[i])*1
GL < sum(g[which(split==1)])
GR < sum(g[which(split==0)])
HL < sum(h[which(split==1)])
HR < sum(h[which(split==0)])
gain[i] < ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda))  (((GL+GR)^2)/(HL+HR+lambda)))  gamma
}
pp2 = cbind(as.data.frame(cbind(s,gain)),"RT2")
colnames(pp2) < c("Threshold","Gain","Feature")
pp < rbind(pp1,pp2)
pp < pp[order(pp$Gain,decreasing=TRUE),]
kable(head(pp),digits=3,row.names=FALSE)
Threshold  Gain  Feature 

79.5  0.116  RT2 
38.5  0.131  RT2 
36.5  0.356  RT2 
47.5  0.382  RT2 
66.5  0.436  RT2 
30.5  0.542  RT2 
The best split was found based on a threshold value of 79..5 for the RT2 feature.
The algorithm computes a gain score for all potential splits on each leaf of the third tree to see if there is a new branch with a positive gain that can be added to any of the two leaves (or to both leaves).
Let's first look at all potential splits on the first leaf (RT2 > 79.5).
# All RT1 Splits on Leaf 1
gain < c()
s < sort(d[d$RT2>79.5,]$RT1)+.5
for(i in 1:length(s)) {
gr1 < which(d$RT2>79.5 & d$RT1<s[i])
gr2 < which(d$RT2>79.5 & d$RT1>s[i])
GL < sum(g[gr1])
GR < sum(g[gr2])
HL < sum(h[gr1])
HR < sum(h[gr2])
gain[i] < ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda))  (((GL+GR)^2)/(HL+HR+lambda)))  gamma
}
pp1 = cbind(as.data.frame(cbind(s,gain)),"RT1")
colnames(pp1) < c("Threshold","Gain","Feature")
# All RT2 Splits on Leaf 1
gain < c()
s < sort(d[d$RT2>79.5,]$RT2)+.5
for(i in 1:length(s)) {
gr1 < which(d$RT2>79.5 & d$RT2<s[i])
gr2 < which(d$RT2>79.5 & d$RT2>s[i])
GL < sum(g[gr1])
GR < sum(g[gr2])
HL < sum(h[gr1])
HR < sum(h[gr2])
gain[i] < ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda))  (((GL+GR)^2)/(HL+HR+lambda)))  gamma
}
pp2 = cbind(as.data.frame(cbind(s,gain)),"RT1")
colnames(pp2) < c("Threshold","Gain","Feature")
pp < rbind(pp1,pp2)
pp < pp[order(pp$Gain,decreasing=TRUE),]
kable(head(pp),digits=3,row.names=FALSE)
Threshold  Gain  Feature 

68.5  0.399  RT1 
68.5  0.399  RT1 
117.5  0.650  RT1 
58.5  0.846  RT1 
104.5  0.846  RT1 
48.5  0.995  RT1 
The results indicate that there is no split with a positive gain that can be added to the first leaf. Let's check the second leaf (RT2 < 79.5).
# All RT1 Splits on Leaf 2
gain < c()
s < sort(d[d$RT2<79.5,]$RT1)+.5
for(i in 1:length(s)) {
gr1 < which(d$RT2<79.5 & d$RT1<s[i])
gr2 < which(d$RT2<79.5 & d$RT1>s[i])
GL < sum(g[gr1])
GR < sum(g[gr2])
HL < sum(h[gr1])
HR < sum(h[gr2])
gain[i] < ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda))  (((GL+GR)^2)/(HL+HR+lambda)))  gamma
}
pp1 = cbind(as.data.frame(cbind(s,gain)),"RT1")
colnames(pp1) < c("Threshold","Gain","Feature")
# All RT2 Splits on Leaf 2
gain < c()
s < sort(d[d$RT2<79.5,]$RT2)+.5
for(i in 1:length(s)) {
gr1 < which(d$RT2<79.5 & d$RT2<s[i])
gr2 < which(d$RT2<79.5 & d$RT2>s[i])
GL < sum(g[gr1])
GR < sum(g[gr2])
HL < sum(h[gr1])
HR < sum(h[gr2])
gain[i] < ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda))  (((GL+GR)^2)/(HL+HR+lambda)))  gamma
}
pp2 = cbind(as.data.frame(cbind(s,gain)),"RT1")
colnames(pp2) < c("Threshold","Gain","Feature")
pp < rbind(pp1,pp2)
pp < pp[order(pp$Gain,decreasing=TRUE),]
kable(head(pp),digits=3,row.names=FALSE)
Threshold  Gain  Feature 

172.5  0.610  RT1 
67.5  0.687  RT1 
62.5  0.743  RT1 
114.5  0.784  RT1 
40.5  0.842  RT1 
33.5  0.879  RT1 
There is also no split with a positive gain that can be added to the second leaf. Therefore, we stop and Iteration 3 ends. Our third tree has a depth of one as the second tree. Let's compute the leaf weights on the third tree structure.
gr1 < which(d$RT2 > 79.5)
gr2 < which(d$RT2 < 79.5)
w1 = sum(g[gr1])/(sum(h[gr1])+lambda)
w2 = sum(g[gr2])/(sum(h[gr2])+lambda)
w1
[1] 0.4275847
w2
[1] 0.3063322
We use these weights to update the predictions at the end of Iteration 3 before we move on to Iteration 4. Each observation in our dataset is assigned to one of these weights depending on which leaf they belong to in Tree 3. Then, we add these estimated leaf weights to the previous predictions (\(\hat{y}^2\)) and obtain the new predictions at the end of Iteration 3 (\(\hat{y}^3\)).
d$w < NA
d$y3 < NA
d[gr1,]$w = w1
d[gr2,]$w = w2
d$y3 < d[,13] + d$w
colnames(d)[16] < "$w^{3}$"
colnames(d)[17] < "$\\hat{y}^3$"
t = kable(d,digits=3,row.names=FALSE)
column_spec(t,column=1:7,width="1em") %>%
add_header_above(c(" "," "," "," "," ","Iteration 1" = 4, "Iteration 2" = 4, "Iteration 3" = 4))
Iteration 1

Iteration 2

Iteration 3



Obs  RT1  RT2  y  \(\hat{y}^0\)  g1  h1  \(w^{1}\)  \(\hat{y}^1\)  g2  h2  \(w^{2}\)  \(\hat{y}^2\)  g3  h3  \(w^{3}\)  \(\hat{y}^3\) 
1  114  36  1  0  0.5  0.25  0.154  0.154  0.538  0.249  0.295  0.449  0.610  0.238  0.306  0.143 
2  92  121  1  0  0.5  0.25  1.040  1.040  0.739  0.193  0.295  1.335  0.792  0.165  0.428  1.763 
3  49  24  1  0  0.5  0.25  0.800  0.800  0.310  0.214  0.295  0.505  0.376  0.235  0.306  0.811 
4  58  30  1  0  0.5  0.25  0.154  0.154  0.538  0.249  0.295  0.449  0.610  0.238  0.306  0.143 
5  40  17  1  0  0.5  0.25  0.800  0.800  0.310  0.214  0.295  0.505  0.376  0.235  0.306  0.811 
6  67  66  1  0  0.5  0.25  1.040  1.040  0.739  0.193  0.295  1.335  0.792  0.165  0.306  1.029 
7  30  79  1  0  0.5  0.25  1.040  1.040  0.739  0.193  0.474  0.566  0.638  0.231  0.306  0.259 
8  19  14  1  0  0.5  0.25  0.800  0.800  0.310  0.214  0.474  1.274  0.218  0.171  0.306  1.581 
9  33  8  1  0  0.5  0.25  0.800  0.800  0.310  0.214  0.474  1.274  0.218  0.171  0.306  1.581 
10  172  38  1  0  0.5  0.25  0.154  0.154  0.538  0.249  0.295  0.449  0.610  0.238  0.306  0.143 
11  55  37  0  0  0.5  0.25  0.154  0.154  0.462  0.249  0.295  0.449  0.390  0.238  0.306  0.143 
12  103  79  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335  0.208  0.165  0.306  1.029 
13  68  117  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335  0.208  0.165  0.428  1.763 
14  194  17  0  0  0.5  0.25  0.154  0.154  0.462  0.249  0.295  0.449  0.390  0.238  0.306  0.143 
15  68  84  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335  0.208  0.165  0.428  1.763 
16  48  127  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335  0.208  0.165  0.428  1.763 
17  48  200  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335  0.208  0.165  0.428  1.763 
18  61  47  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335  0.208  0.165  0.306  1.029 
19  37  155  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335  0.208  0.165  0.428  1.763 
20  58  84  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335  0.208  0.165  0.428  1.763 
21  139  97  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335  0.208  0.165  0.428  1.763 
22  99  84  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335  0.208  0.165  0.428  1.763 
23  47  50  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335  0.208  0.165  0.306  1.029 
24  41  54  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335  0.208  0.165  0.306  1.029 
25  29  90  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.474  0.566  0.362  0.231  0.428  0.993 
26  117  31  0  0  0.5  0.25  0.154  0.154  0.462  0.249  0.295  0.449  0.390  0.238  0.306  0.143 
27  44  104  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335  0.208  0.165  0.428  1.763 
28  34  62  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335  0.208  0.165  0.306  1.029 
29  83  20  0  0  0.5  0.25  0.154  0.154  0.462  0.249  0.295  0.449  0.390  0.238  0.306  0.143 
30  64  57  0  0  0.5  0.25  1.040  1.040  0.261  0.193  0.295  1.335  0.208  0.165  0.306  1.029 
The value of the loss function based on \(y_i\) and \(\hat{y}^3\) at the end of Iteration 3 is 13.57 (reduced from 14.35).
y < d[,4]
y3 < d[,17]
loss3 < sum(y*log(1+exp(y3)) + (1y)*log(1+exp(y3)))
loss3
[1] 13.5685
Our tree ensemble model at the end of Iteration 3 looks like this.
We update our first and secondorder gradient statistics based on the new predictions at the end of Iteration 3. As can be seen below, the search process did not find any split with a positive gain score. Therefore, the whole search process ended because there is no fourth tree structure that can be added with an improvement in our predictions. So, our final model remains the structure at the end of Iteration 3.
\[g_i^{(4)} = \frac{(y_i1)e^{\hat{y_i}^{(3)}}+y_i}{e^{\hat{y_i}^{(3)}}+1}\]
\[h_i^{(4)} = \frac{e^{\hat{y_i}^{(3)}}}{(e^{\hat{y_i}^{(3)}}+1)^2}\]
d$g4 < ((y1)*exp(y3)+y)/(exp(y3)+1)
d$h4 < exp(y3)/((exp(y3)+1)^2)
g < d$g4
h < d$h4
# All Splits based on RT1
gain < c()
s < sort(unique(d$RT1))+.5
for(i in 1:length(s)) {
split < (d$RT1<s[i])*1
GL < sum(g[which(split==1)])
GR < sum(g[which(split==0)])
HL < sum(h[which(split==1)])
HR < sum(h[which(split==0)])
gain[i] < ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda))  (((GL+GR)^2)/(HL+HR+lambda)))  gamma
}
pp1 = cbind(as.data.frame(cbind(s,gain)),"RT1")
colnames(pp1) < c("Threshold","Gain","Feature")
# All Splits for RT2
gain < c()
s < sort(unique(d$RT2))+.5
for(i in 1:length(s)) {
split < (d$RT2<s[i])*1
GL < sum(g[which(split==1)])
GR < sum(g[which(split==0)])
HL < sum(h[which(split==1)])
HR < sum(h[which(split==0)])
gain[i] < ((GL^2/(HL+lambda)) + (GR^2/(HR+lambda))  (((GL+GR)^2)/(HL+HR+lambda)))  gamma
}
pp2 = cbind(as.data.frame(cbind(s,gain)),"RT2")
colnames(pp2) < c("Threshold","Gain","Feature")
pp < rbind(pp1,pp2)
pp < pp[order(pp$Gain,decreasing=TRUE),]
kable(head(pp),digits=3,row.names=FALSE)
Threshold  Gain  Feature 

38.5  0.662  RT2 
33.5  0.690  RT1 
36.5  0.721  RT2 
30.5  0.769  RT2 
40.5  0.787  RT1 
79.5  0.797  RT2 
Since we just developed (trained) our model, we can now use it to predict the probability of the outcome for a new observation. Suppose a new test taker comes and responds the first item and second item in 80 and 40 seconds, respectively (RT1 = 80, RT2= 40). We first place this examinee on the corresponding leaf for each tree depending on these inputs.
We sum the weights for this person from all trees, \[(1.040) + (0.295) + (0.306)= 1.029.\] Note that this is on the logit scale, and we can transform it to a probability as \[\frac{e^{1.641}}{e^{1.641}+1} = 0.263.\] So, a person with this pattern of response time, our model predicts that there is a 26% chance that this person had item preknowledge.
Let's do another example. Suppose another test taker comes and responds the first item and second item in 25 and 15 seconds, respectively (RT1 = 25, RT2= 15). This pattern corresponds to a weight of 0.800 (RT2 < 38.5 and RT1 < 49.5) for the first tree, a weight of 0.474 (RT1 < 33.5) for the second tree, and a weight of 0.306 (RT < 79.5) for the third tree. The sum of the weights is 1.58, corresponding to a probability of 0.83. So, a person with this particular pattern of response time, our model predicts that there is a 83% chance that this person had item preknowledge.
Notice that the smaller the response time is (and exceeds certain threshold levels in the model), the more likely the predicted probability of item preknowledge is. This makes sense in this context, so the model works as it is supposed to.
xgboost
package in R
require(xgboost)
dtrain < xgb.DMatrix(data = data.matrix(d[,2:3]), label=d$y)
fit < xgb.train(data=dtrain,
nround = 5,
eta = 1,
min_child_weight = 0,
max_depth = 2,
gamma = 1,
max_delta_step=0,
subsample = 1,
colsample_bytree = 1,
lambda = 1.5,
alpha =0,
scale_pos_weight =1,
num_parallel_tree = 1,
nthread = 1,
objective = "binary:logistic",
eval_metric="rmse",predict=TRUE)
First, let me briefly comment on some of the parameters here, and you can find many other webpage to give you more information about them.
As you can see, I set the gamma = 1
and lambda = 1.5
. These are the same numbers I set the value for gamma and lambda in my demonstration above.
nround = 5
is the number of maximum iterations you want to run, so this is technically the maximum number of trees you want to have in your tree ensemble model.
eta
is the learning parameter. In the above demonstration, I use eta=1
without explicitly saying because I added the exact amount of weight to update the prediction at each iteration. Instead, one can decide to add a smaller fraction of the weight (such as half of it, or 10% of it) to update the prediction. In this case, for instance, you can set eta=.1
, and this tells the algorithm to add 10% of the leaf weight estimated at each iteration when updating the predictions.
min_child_weight
is the minimum sum of the secondorder gradient statistics to add a new branch.
max_depth
is the maximum depth of a tree. I set it to 2 as in my demo above.
alpha
is another regularization parameter such as lambda and gamma. The original paper does not mention it, so I am not quite sure how it changes the formulas I provided in this tutorial, but it can be in the source code the algorithm is using (https://github.com/dmlc/xgboost/blob/master/src/tree/param.h, see Line 45, Line 142145, Line 303310). I set it to 0, so the XGBoost runs the same way in my demonstration.
Now, let's check the tree structure the original XGBoost algorithm comes up.
xgb.plot.tree(model = fit)
Let's try to understand the numbers in this model calibrated by the XGboost function and check if it is aligned with the numbers in our illustration above. I will briefly go over the numbers for the first tree and will leave it to you to check the numbers for the other two trees.
There are two numbers listed in the first box of Tree 0.
Cover: 7.5
Gain: 4.51163387
What Cover: 7.5
here refers is the sum of the second order gradient statistics for all 30 observations in the first iteration. So, if you go back to the big table reported above and sum the values of h1, that gives you 7.5. The Gain score reported is 4.5116. If we go back and check the gain score we found, it is 3.512. If you check the formula for the Gain above, you will see the \(\gamma\) is substracted at the end in our computation. So, the difference of 1 (after rounding to the third decimal) between the Gain Score I computed and the Gain Score XGBoost reported is probably due to the \(\gamma\) parameter used, \(\gamma = 1\).
I believe the original XGBoost must be using the regularization term for the search as we specify \(\gamma = 1\) in the function; however, it is probably reporting the Gain Score without substracting \(\gamma\) when it comes to reporting in this plot.
First, the threshold to split the first branch based on RT2 value is reported as 42.5. In my search, I found it as 38.5. So, this is a bit confusing. However, if we sort the values available on RT2, we see that there is no value in the data available between 38 and 47.
sort(d$RT2)
[1] 8 14 17 17 20 24 30 31 36 37 38 47 50 54 57 62
[17] 66 79 79 84 84 84 90 97 104 117 121 127 155 200
So, practically, there is no difference in splitting this training sample based on a threshold value of 38.5 or a threshold value of 42.5. The number comes up in my computation by design. In my naive search, I added .5 to each unique value in the dataset and then did a bruteforce search for every possible score. That's why 38.5 came up in my search as the threshold value. I am not sure how XGBoost comes up with 42.5. In short, there is no difference between the two for this small training sample.
Within the ellipse, there are two values Cover:4.75
and Value:1.03999996
. Cover
is again the sum of the second gradient statistics for those observations on this particular leaf (RT2 > 42.5), and Value
corresponds to the leaf weight for those observations on this leaf. Note that, the leaf weight we computed above in my computation is 1.04. So, we got the same weight in our computation.
Within the box, there are two values Cover:2.75
and Gain:1.14751124
. Cover
is the sum of the second gradient statistics for those observations on this particular leaf (RT2 < 42.5), and Gain is the gain score obtained for the next additional best split on this leaf. Note that the gain score for this split I computed above is 0.148. Again, there is a difference of 1 (\(\gamma = 1\)) because it seems XGBoost doesn't care about the regularization parameter when it comes to reporting the gain scores in this plot.
Again, the threshold value for the second split on the RT1 variable is 52. In my naive search, I came up with 49.5. Similarly, there is no value in the data between 49 and 55 for the RT1 variable. So, practically, there is no difference picking a threshold of 49.5 or 52 for this training sample.
sort(d$RT1)
[1] 19 29 30 33 34 37 40 41 44 47 48 48 49 55 58 58
[17] 61 64 67 68 68 83 92 99 103 114 117 139 172 194
Within the first ellipse above, we see Cover:1
and Value:0.80000000012
. Cover
is the sum of the second gradient statistics for those observations on this particular leaf (RT2 < 42.5 and RT1 < 52) and Value
corresponds to the leaf weight. Note that we computed the same weight in our computation for this leaf.
Within the second ellipse above, we see Cover:1.75
and Value:0.15384616
. Cover
is the sum of the second gradient statistics for those observations on this particular leaf (RT2 < 42.5 and RT1 > 52) and Value
corresponds to the leaf weight. Again, note that we computed the exact same weight for this leaf as well.
I hope this detailed illustration helps interested people understand better the mechanics of XGBoost. It certainly helped me. I have a better grasp of what is happening behind the scene whenever I use XGBoost for most complex problems, and the numbers make more sense whenever I look at the calibrated tree ensemble model.
Please feel free to send me any feedback and comments. I am happy to address any errors/typos in the post.
Thanks for reading.