This is one of the supplements to our whitepaper on open-source NMEC tools. The previous supplements discussed the basic TOWT model and the occupancy detection algorithm.
In this post, I’ll address some common questions about the Gradient Boosting Machine (GBM) model used in RMV2.0, using code to find the answers. The post will be mostly code, with only brief discussion.
To use xgboost’s visualization functions, you may need to install a couple of additional packages.
install.packages("igraph")
install.packages("DiagrammeR")
As before, I’ve already run the RMV2.0 add-in to create my model and save a project file. I will just load the file into this notebook.
#rds_file <- "C:/RMV2.0 Workshop/deteleme/Project_05.12.rds"
rds_file <- "C:/RMV2.0 Workshop/deteleme/Project_05.17.rds"
Project <- readRDS(rds_file)
The project has the following models.
print(names(Project$model_obj_list$models_list))
[1] "Data_pre_2_2.csv"
First, let’s look at the hyperparameters selected by the k-fold cross-validation grid search.
res_baseline <- Project$model_obj_list$models_list[["Data_pre_2_2.csv"]]
print(res_baseline$tuned_parameters)
$best_iter
[1] 400
$best_depth
[1] 5
$best_lr
[1] 0.05
$best_subsample
[1] 0.5
print(res_baseline$gbm_cv_results)
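RMV2.0 selects these values by scoring a grid of candidate hyperparameters with k-fold cross-validation. The snippet below is not RMV2.0’s actual implementation, just a minimal sketch of the idea using xgboost::xgb.cv; the grid, the fold count, and the target column name ("eload") are my assumptions here.
# Sketch of a k-fold grid search over depth, learning rate, and subsample.
# NOT RMV2.0's actual code; grid, nfold, and the "eload" column name are assumptions.
dtrain <- xgboost::xgb.DMatrix(
  as.matrix(res_baseline$train[, res_baseline$variables]),
  label = res_baseline$train$eload)
grid <- expand.grid(max_depth = c(3, 5, 7), eta = c(0.05, 0.1), subsample = c(0.5, 0.8))
cv_rmse <- apply(grid, 1, function(g) {
  cv <- xgboost::xgb.cv(params = list(max_depth = g["max_depth"], eta = g["eta"],
                                      subsample = g["subsample"], nthread = 1),
                        data = dtrain, nrounds = 400, nfold = 5,
                        early_stopping_rounds = 10, verbose = 0)
  min(cv$evaluation_log$test_rmse_mean)
})
# The winning combination, analogous to the tuned_parameters above
grid[which.min(cv_rmse), ]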
Next, we’ll get an XGBoost model object. There are several ways to do this, depending on how the model was saved. We are using function calls from the xgboost library; I’ll qualify them with the package name for clarity.
#library(xgboost)
# This model was stored directly in the project file, from an element of the environment.
# This is available from RMV2.0.
model_as_is <- res_baseline$gbm_model
# These are other ways to store a GBM, for example. These are not usually stored by RMV2.0.
#xgboost::xgb.save(gbm_model, "C:/RMV2.0 Workshop/deteleme/xgb.model")
gbm_model1 <- xgboost::xgb.load("C:/RMV2.0 Workshop/deteleme/xgb.model")
#model_raw <- xgboost::xgb.save.raw(gbm_model)
model_raw <- res_baseline$gbm_model_raw
gbm_model2 <- xgboost::xgb.load.raw(model_raw)
#model_ser <- xgboost::xgb.serialize(gbm_model)
model_ser <- res_baseline$gbm_model_serialized
gbm_model3 <- xgboost::xgb.unserialize(model_ser)
They don’t look like much yet.
print(model_as_is)
##### xgb.Booster
Handle is invalid! Suggest using xgb.Booster.complete
raw: 1.3 Mb
call:
xgb.train(params = params, data = dtrain, nrounds = nrounds,
watchlist = watchlist, verbose = verbose, print_every_n = print_every_n,
early_stopping_rounds = early_stopping_rounds, maximize = maximize,
save_period = save_period, save_name = save_name, xgb_model = xgb_model,
callbacks = callbacks, max_depth = ..1, eta = ..2, subsample = ..3,
nthread = 1)
params (as set within xgb.train):
max_depth = "5", eta = "0.05", subsample = "0.5", nthread = "1", validate_parameters = "1"
callbacks:
cb.evaluation.log()
# of features: 3
niter: 400
nfeatures : 3
evaluation_log:
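The “Handle is invalid” line just means the external pointer wasn’t restored when the booster came back from the .rds file; the raw bytes are still attached to the R object. As the message suggests, xgb.Booster.complete() can rebuild the handle, for example:
# Rebuild the internal handle from the raw bytes carried in the R object
model_as_is <- xgboost::xgb.Booster.complete(model_as_is)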
print(gbm_model1)
##### xgb.Booster
raw: 1.3 Mb
xgb.attributes:
niter
niter: 399
print(gbm_model2)
<pointer: 0x000002dd866036f0>
attr(,"class")
[1] "xgb.Booster.handle"
print(gbm_model3)
<pointer: 0x000002dd86600940>
attr(,"class")
[1] "xgb.Booster.handle"
Can we use the saved model objects to generate predictions, for example from performance period data? Let’s first verify that they reproduce the fit on the training inputs.
train <- res_baseline$train
variables <- res_baseline$variables
train_input <- train[,variables]
print(head(train_input))
y_fit0 <- predict(model_as_is, as.matrix(train_input))
print(head(y_fit0))
[1] 286.9458 285.5898 287.0888 287.4875 304.9283 339.1770
y_fit1 <- predict(gbm_model1, as.matrix(train_input))
print(head(y_fit1))
[1] 286.9458 285.5898 287.0888 287.4875 304.9283 339.1770
# OK, even though this object looks empty, this still works.
y_fit2 <- predict(gbm_model2, as.matrix(train_input))
print(head(y_fit2))
[1] 286.9458 285.5898 287.0888 287.4875 304.9283 339.1770
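All three loaded copies reproduce the baseline fit on the training inputs. Applying them to actual performance period data would follow the same pattern; here is a hypothetical sketch, assuming a pre-processed CSV with the same predictor columns (the file name is a placeholder):
# Hypothetical: predict on pre-processed performance period data
perf <- read.csv("C:/RMV2.0 Workshop/deteleme/Data_perf.csv")  # placeholder file name
perf_input <- perf[, variables]
y_pred <- predict(model_as_is, as.matrix(perf_input))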
Now, let’s apply some of XGBoost’s functions for inspecting the model.
# This only works on the model loaded with xgb.load(), not the ones from xgb.load.raw() or xgb.unserialize()
xgboost::xgb.plot.deepness(gbm_model1)
# This doesn't work unless we set the cb.gblinear.history() callback:
try(
xgboost::xgb.gblinear.history(gbm_model1)
)
Error in xgboost::xgb.gblinear.history(gbm_model1) :
model must be trained while using the cb.gblinear.history() callback
# A data.table with columns Feature, Gain, Cover, Frequency
importance_matrix <- xgboost::xgb.importance(colnames(as.matrix(train_input)), model = gbm_model1)
xgboost::xgb.plot.importance(
importance_matrix = importance_matrix,
rel_to_first = TRUE, xlab = "Relative importance"
)
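The numbers behind the plot can also be inspected directly; by default the plot ranks features by Gain.
# The raw importance table (Feature, Gain, Cover, Frequency)
print(importance_matrix)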
# An interesting list of the trees, and their nodes and leaves
# In the sample project, there are:
# 400 trees/boosters, and
# 23186-400 = 22786 non-root nodes, and
# on average, 57 non-root nodes per tree
# Would love to visualize one tree, for example
model_dump <- xgboost::xgb.dump(gbm_model1, with_stats = TRUE)
cat(paste(head(model_dump,45),"\n"),
"...\n",
paste(length(model_dump),"lines of text"),
sep="")
booster[0]
0:[f1<31.5] yes=1,no=2,missing=1,gain=148948096,cover=4394
1:[f0<20.8500004] yes=3,no=4,missing=3,gain=526056,cover=798
3:[f1<17.5] yes=7,no=8,missing=7,gain=499728,cover=752
7:[f1<10.5] yes=13,no=14,missing=13,gain=150288,cover=417
13:[f1<6.5] yes=25,no=26,missing=25,gain=110672,cover=268
25:leaf=13.4230042,cover=166
26:leaf=16.1880283,cover=102
14:leaf=12.1138716,cover=149
8:[f1<30.5] yes=15,no=16,missing=15,gain=148036,cover=335
15:leaf=16.0547409,cover=309
16:leaf=21.1384792,cover=26
4:leaf=20.8913136,cover=46
2:[f1<142.5] yes=5,no=6,missing=5,gain=89664768,cover=3596
5:[f0<15.4500008] yes=9,no=10,missing=9,gain=85160448,cover=2913
9:[f1<128.5] yes=17,no=18,missing=17,gain=29442944,cover=2268
17:[f1<118.5] yes=27,no=28,missing=27,gain=31947904,cover=2022
27:leaf=38.6704979,cover=1798
28:leaf=18.520731,cover=224
18:[f1<139.5] yes=29,no=30,missing=29,gain=6115168,cover=246
29:leaf=59.0120316,cover=197
30:leaf=37.60606,cover=49
10:[f2<0.5] yes=19,no=20,missing=19,gain=6733056,cover=645
19:[f0<19.7000008] yes=31,no=32,missing=31,gain=4600448,cover=633
31:leaf=55.8055,cover=382
32:leaf=65.7654495,cover=251
20:[f1<52] yes=33,no=34,missing=33,gain=137242.625,cover=12
33:leaf=23.7987671,cover=8
34:leaf=8.92444515,cover=4
6:[f1<151.5] yes=11,no=12,missing=11,gain=6792640,cover=683
11:[f1<143.5] yes=21,no=22,missing=21,gain=627056,cover=243
21:[f0<9.14999962] yes=35,no=36,missing=35,gain=68218,cover=34
35:leaf=26.868927,cover=15
36:leaf=18.1845856,cover=19
22:[f1<150.5] yes=37,no=38,missing=37,gain=203452,cover=209
37:leaf=14.1395741,cover=186
38:leaf=19.9934063,cover=23
12:[f1<159.5] yes=23,no=24,missing=23,gain=11340768,cover=440
23:[f1<152.5] yes=39,no=40,missing=39,gain=888928,cover=209
39:leaf=24.126606,cover=25
40:leaf=36.4227371,cover=184
24:[f1<162.5] yes=41,no=42,missing=41,gain=995096,cover=231
41:leaf=23.9780502,cover=71
42:leaf=16.4212399,cover=160
booster[1]
...
23212 lines of text
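As a rough check on the counts quoted above, we can tally the trees, nodes, and leaves directly from the text dump:
# Tally trees, nodes, and leaves in the text dump
n_trees  <- sum(grepl("^booster\\[", model_dump))
n_nodes  <- sum(grepl(":", model_dump, fixed = TRUE))   # every node line has an "id:" prefix
n_leaves <- sum(grepl("leaf=", model_dump, fixed = TRUE))
cat(n_trees, "trees,", n_nodes, "nodes,", n_leaves, "leaves\n")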
# Plot all the trees.
# Just kidding, we have 400 trees. Seriously, don't do it.
#xgboost::xgb.plot.tree(model = gbm_model1)
Now, tree visualizations. Zoom in to see the details on the nodes.
# Plot only the first tree (index 0) and display the node IDs:
xgboost::xgb.plot.tree(model = gbm_model1, trees = 0, show_node_id = TRUE)
# Plot the second tree (index 1):
xgboost::xgb.plot.tree(model = gbm_model1, trees = 1, show_node_id = TRUE)
# Plot the second-to-last tree (index 398):
xgboost::xgb.plot.tree(model = gbm_model1, trees = 398, show_node_id = TRUE)
# Plot the last tree (index 399):
xgboost::xgb.plot.tree(model = gbm_model1, trees = 399, show_node_id = TRUE)
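If you want to keep one of these diagrams rather than view it interactively, xgb.plot.tree can return the graph object instead of rendering it (render = FALSE), and DiagrammeR can export it. A sketch, which may also require the DiagrammeRsvg and rsvg packages:
# Sketch: export the first tree to a PDF instead of rendering it in the viewer
gr <- xgboost::xgb.plot.tree(model = gbm_model1, trees = 0,
                             show_node_id = TRUE, render = FALSE)
DiagrammeR::export_graph(gr, file_name = "tree_0.pdf")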
Let’s try the function xgboost::xgb.plot.multi.trees. From the help file, here’s what it does: “This function tries to capture the complexity of a gradient boosted tree model in a cohesive way by compressing an ensemble of trees into a single tree-graph representation. The goal is to improve the interpretability of a model generally seen as black box.”
#"This function tries to capture the complexity of a gradient boosted tree model
#in a cohesive way by compressing an ensemble of trees into a single tree-graph
#representation. The goal is to improve the interpretability of a model generally
#seen as black box."
xgboost::xgb.plot.multi.trees(model = gbm_model1, feature_names = variables)
To summarize, we explored the GBM model and several ways to save and load a GBM created by XGBoost in R. We saw that the model holds a large amount of information, usually in binary form, but that you can export a text representation. And we saw what one regression tree looks like, and the kind of branching rules that were automatically generated from our data.
Thank you for reading. Go back to the article or the other supplements.