literanger: A fast implementation of random forests for multiple imputation

Stephen Wade

🚧 Under construction 🚧

literanger is an adaptation of the ranger R package for training and predicting from random forest models within multiple imputation algorithms. ranger is a fast implementation of random forests (Breiman, 2001) or recursive partitioning, particularly suited for high dimensional data (Wright and Ziegler, 2017). literanger enables random forests to be embedded in the fully conditional specification framework for multiple imputation known as ‘Multiple Imputation via Chained Equations’ (Van Buuren, 2007).

Implementations of multiple imputation with random forests include:

mice which uses random forests to predict in a similar fashion to Doove et al. (2014), i.e. for each observation, a draw is taken from the sample of all values that belong to the terminal node of a randomly drawn tree.
miceRanger and missRanger which use predictive mean matching.

This package enables a minor variation on mice’s use of random forests: the prediction for each missing data point can be drawn from the in-bag samples in the terminal node. The computational effort during prediction therefore scales with the number of missing values, rather than with the product of the size of the whole dataset and the number of trees (as in mice).
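The idea of drawing from in-bag samples in a terminal node can be sketched in base R. This is a toy illustration only, not literanger's implementation: each "tree" is a single-split stump, and the names `grow_stump` and `draw_one` are hypothetical helpers introduced for this sketch.

```r
# Toy sketch in base R. Assumption: each "tree" is a one-split stump, chosen
# only to keep the example short; real forests grow much deeper trees.
set.seed(42)

n <- 200
x <- runif(n)
y <- ifelse(x < 0.5, rnorm(n, mean = 0), rnorm(n, mean = 5))
is_missing <- seq_len(n) %in% sample(n, 20)  # rows whose y we treat as missing
obs <- which(!is_missing)

# Grow a stump on a bootstrap (in-bag) sample of the observed rows and keep
# the in-bag responses that land in each of its two terminal nodes.
grow_stump <- function(x_obs, y_obs) {
    in_bag <- sample(length(y_obs), replace = TRUE)
    x_bag <- x_obs[in_bag]
    y_bag <- y_obs[in_bag]
    split <- median(x_bag)
    list(split = split,
         left  = y_bag[x_bag <= split],
         right = y_bag[x_bag >  split])
}
forest <- replicate(25, grow_stump(x[obs], y[obs]), simplify = FALSE)

# Draw for one missing row: pick a tree at random, find the row's terminal
# node, and sample a single in-bag response from that node.
draw_one <- function(x_new, forest) {
    tree <- forest[[sample(length(forest), 1)]]
    node <- if (x_new <= tree$split) tree$left else tree$right
    node[sample(length(node), 1)]
}

# One draw per missing value, so the work grows with the number of missing rows.
y_imputed <- y
y_imputed[is_missing] <- vapply(x[is_missing], draw_one, numeric(1),
                                forest = forest)
```

Every imputed value is an observed response that shared a terminal node with the missing row, so the imputations stay within the range of the observed data.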

A more general advantage of this package is that the trained forest object can be recycled and is kept separate from the (training) data; see ranger issue #304.

A multiple imputation algorithm using this package, called mimputest, is under development.

Example

library(literanger)

# hold out one third of the rows for testing
train_idx <- sample(nrow(iris), round(2/3 * nrow(iris)))
iris_train <- iris[ train_idx, ]
iris_test  <- iris[-train_idx, ]
# train a forest with Species as the response
rf_iris <- train(data=iris_train, response_name="Species")
pred_iris_bagged <- predict(rf_iris, newdata=iris_test,
                            prediction_type="bagged")
pred_iris_inbag  <- predict(rf_iris, newdata=iris_test,
                            prediction_type="inbag")
# compare bagged vs actual test values
table(iris_test$Species, pred_iris_bagged$values)
# compare bagged prediction vs in-bag draw
table(pred_iris_bagged$values, pred_iris_inbag$values)

Installation

Installation is easy using devtools:

library(devtools)
install_github('stephematician/literanger')

The cpp11 package is also required, available on CRAN:

install.packages('cpp11')

To-do

Not exhaustive:

~~prediction type: terminal nodes for every tree (e.g. for mice algorithm);~~
~~finish documentation, e.g. this README;~~
prepare CRAN submission;
implement variable importance measures;
probability and survival forests.

References

Breiman, L., 2001. Random forests. Machine Learning, 45(1), pp. 5-32. doi:10.1023/A:1010933404324.

Doove, L.L., Van Buuren, S. and Dusseldorp, E., 2014. Recursive partitioning for missing data imputation in the presence of interaction effects. Computational Statistics & Data Analysis, 72, pp. 92-104. doi:10.1016/j.csda.2013.10.025.

Van Buuren, S., 2007. Multiple imputation of discrete and continuous data by fully conditional specification. Statistical Methods in Medical Research, 16(3), pp. 219-242. doi:10.1177/0962280206074463.

Wright, M.N. and Ziegler, A., 2017. ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, 77(1), pp. 1-17. doi:10.18637/jss.v077.i01.