Stephen Wade
🚧 Under construction 🚧
literanger is an adaptation of the ranger R package for training and
predicting from random forest models within multiple imputation algorithms.
ranger is a fast implementation of random forests (Breiman, 2001) or
recursive partitioning, particularly suited for high dimensional data
(Wright and Ziegler, 2017). literanger enables random forests to be embedded
in the fully conditional specification framework for multiple imputation
known as ‘Multiple Imputation via Chained Equations’ (Van Buuren, 2007).
Implementations of multiple imputation with random forests include:
- mice, which uses random forests to predict in a similar fashion to Doove
  et al. (2014), i.e. for each observation, a draw is taken from the sample
  of all values that belong to the terminal node of a randomly drawn tree.
- miceRanger and missRanger, which use predictive mean matching.

This package enables a minor variation on mice’s use of random forests. For
each missing data point, the prediction can be drawn from the in-bag samples
in the terminal node. The computational effort during prediction therefore
scales with the number of missing values, rather than with the product of
the size of the whole dataset and the number of trees (as in mice).
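As a rough illustration of the idea above, the following sketch fills in missing values of a single variable with in-bag draws. It is not a full multiple imputation algorithm. It assumes that train() and predict() accept the arguments used in this README's example, and that newdata only needs to contain the predictor columns. The names iris_miss, miss_idx, rf_sepal and iris_imputed are made up for illustration.

```r
library(literanger)

set.seed(42)

# Keep the numeric columns of iris and blank out 20 values of Sepal.Length
# to mimic missing data (illustrative only).
iris_miss <- iris[, 1:4]
miss_idx <- sample(nrow(iris_miss), 20)
iris_miss$Sepal.Length[miss_idx] <- NA

# Train on the rows where the target variable is observed.
observed <- !is.na(iris_miss$Sepal.Length)
rf_sepal <- train(data=iris_miss[observed, ], response_name="Sepal.Length")

# Predict only for the rows with a missing value. With
# prediction_type="inbag", each value is a draw from in-bag samples rather
# than a forest average (as described in this README).
# Assumption: newdata only needs the predictor columns.
newdata <- iris_miss[!observed, names(iris_miss) != "Sepal.Length"]
draws <- predict(rf_sepal, newdata=newdata, prediction_type="inbag")

# Fill the gaps with the drawn values.
iris_imputed <- iris_miss
iris_imputed$Sepal.Length[!observed] <- draws$values
```

Only the 20 rows with a missing value are passed to predict(), which is where the scaling with the number of missing values comes from. In a chained-equations setting this step would presumably be repeated for each incomplete variable and over several iterations.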
A more general advantage of this package is re-cycling of the trained
forest object and the separation of the (training) data from the forest
(see ranger issue #304).
A multiple imputation algorithm using this package, called mimputest, is
under development.
require(literanger)
train_idx <- sample(nrow(iris), 2/3 * nrow(iris))
iris_train <- iris[ train_idx, ]
iris_test <- iris[-train_idx, ]
rf_iris <- train(data=iris_train, response_name="Species")
pred_iris_bagged <- predict(rf_iris, newdata=iris_test,
                            prediction_type="bagged")
pred_iris_inbag <- predict(rf_iris, newdata=iris_test,
                           prediction_type="inbag")
# compare bagged vs actual test values
table(iris_test$Species, pred_iris_bagged$values)
# compare bagged prediction vs in-bag draw
table(pred_iris_bagged$values, pred_iris_inbag$values)
Installation is easy using devtools:
library(devtools)
install_github('stephematician/literanger')
The cpp11 package is also required, available on CRAN:
install.packages('cpp11')
References (not exhaustive):
Breiman, L., 2001. Random forests. Machine Learning, 45, pp. 5-32. doi:10.1023/A:1010933404324.
Doove, L.L., Van Buuren, S. and Dusseldorp, E., 2014. Recursive partitioning for missing data imputation in the presence of interaction effects. Computational Statistics & Data Analysis, 72, pp. 92-104. doi:10.1016/j.csda.2013.10.025.
Van Buuren, S., 2007. Multiple imputation of discrete and continuous data by fully conditional specification. Statistical Methods in Medical Research, 16(3), pp. 219-242. doi:10.1177/0962280206074463.
Wright, M.N. and Ziegler, A., 2017. ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, 77(1), pp. 1-17. doi:10.18637/jss.v077.i01.