An R package for random-forest-empowered imputation of missing data
This is the repository for the R package RfEmpImp, for multiple imputation using random forests (RF).
This R package implements the RfPred and RfNode algorithms and currently operates under the multiple imputation computation framework mice.
The R package contains both newly proposed and improved existing algorithms for random-forest-based multiple imputation of missing data.
For details of the newly proposed algorithms, please refer to: arXiv:2004.14823 (further updates pending).
With version 2.0.0, the names of parameters were further simplified; please refer to the documentation for details.
For data with mixed types of variables, the RfEmp method is a shortcut for using RfPred.Emp for continuous variables and RfPred.Cate for categorical variables (of type logical or factor).
Example:
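A minimal sketch of the RfEmp workflow, mirroring the RfNode example further down. It assumes that the package exports a wrapper named imp.rfemp, analogous to imp.rfnode.cond, and that reg.ests accepts the pooled object; please check the package documentation for the exact names and arguments.

```r
# Load packages (the nhanes data set ships with mice)
library(mice)
library(RfEmpImp)

# Prepare data: treat age and hyp as categorical variables
df <- nhanes
df[, c("age", "hyp")] <- lapply(df[, c("age", "hyp")], as.factor)

# Do imputation (imp.rfemp is an assumed wrapper name, see the note above)
imp <- imp.rfemp(df)

# Do analyses on each imputed data set
regObj <- with(imp, lm(chl ~ bmi + hyp))

# Pool analyzed results
poolObj <- pool(regObj)

# Extract estimates
res <- reg.ests(poolObj)
```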
For continuous variables, the RfPred.Emp method uses the empirical distribution of the random forest’s out-of-bag prediction errors to construct the conditional distribution of the variable under imputation, with the aim of providing conditional distributions of better quality. Users can set method = "rfpred.emp" in the call to mice to use it.
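The idea of resampling out-of-bag errors can be illustrated directly with ranger. This is a simplified sketch on simulated data, not the package's implementation: the toy data, the variable names, and the single-draw step are all assumptions made for illustration.

```r
library(ranger)

set.seed(1)

# Toy data (hypothetical): y depends on x1 and x2, 40 of 200 values are missing
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 2 * dat$x1 - dat$x2 + rnorm(n)
miss <- sample(n, 40)
dat$y[miss] <- NA

obs <- dat[-miss, ]                 # rows with observed y
mis <- dat[miss, c("x1", "x2")]     # predictors of rows with missing y

# Fit the forest on the observed rows only
rf <- ranger(y ~ x1 + x2, data = obs, num.trees = 100)

# For the training rows, rf$predictions holds out-of-bag predictions
oob_err <- obs$y - rf$predictions
oob_err <- oob_err[!is.na(oob_err)]  # drop rows that were never out-of-bag

# Point predictions for the rows with missing y
pred <- predict(rf, data = mis)$predictions

# One imputation draw: prediction plus a resampled out-of-bag error
draw <- pred + sample(oob_err, length(pred), replace = TRUE)
```

Repeating the last step gives different draws, which is what makes the approach usable inside a multiple imputation loop.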
Alternatively, the RfPred.Norm method assumes normality of the RF prediction errors, as proposed by Shah et al.; users can set method = "rfpred.norm" in the call to mice to use it.
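As a sketch of how the method strings might be passed to mice, the calls below use the nhanes data from mice, where all variables are numeric. The values m = 5 and printFlag = FALSE are arbitrary choices for illustration, not package recommendations.

```r
library(mice)
library(RfEmpImp)

# Empirical out-of-bag error distribution
imp_emp <- mice(nhanes, method = "rfpred.emp", m = 5, printFlag = FALSE)

# Normality assumed for the prediction errors
imp_norm <- mice(nhanes, method = "rfpred.norm", m = 5, printFlag = FALSE)

# Analyze each imputed data set and pool the results
fit <- with(imp_emp, lm(chl ~ bmi + age))
pooled <- pool(fit)
```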
For categorical variables, the RfPred.Cate method uses probability machine theory: the predictions of missing categories are based on the predicted probabilities for each missing observation. Users can set method = "rfpred.cate" in the call to mice to use it.
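The probability-based prediction can be sketched with a ranger probability forest. This is an illustration on the built-in iris data, not the package's code: the split into "observed" and "missing" rows is artificial and the object names are hypothetical.

```r
library(ranger)

set.seed(1)

# Pretend Species is missing for every fifth row of iris
idx <- seq(5, 150, by = 5)
obs <- iris[-idx, ]
mis <- iris[idx, names(iris) != "Species"]

# Probability forest: each prediction is a vector of class probabilities
rf <- ranger(Species ~ ., data = obs, num.trees = 100, probability = TRUE)
prob <- predict(rf, data = mis)$predictions

# Draw one category per missing row according to its predicted probabilities
draw <- apply(prob, 1, function(p) sample(colnames(prob), 1, prob = p))
draw <- factor(draw, levels = levels(obs$Species))
```

Sampling from the probabilities, rather than always taking the most likely class, keeps the uncertainty of the prediction in the imputed values.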
For both continuous and categorical variables, the observations under the predicting nodes of random forest are used as candidates for imputation.
Two methods are now available for the RfNode algorithm.
Example:
# Load packages (the nhanes data set ships with mice)
library(mice)
library(RfEmpImp)
# Prepare data: treat age and hyp as categorical variables
df <- nhanes
df[, c("age", "hyp")] <- lapply(X = nhanes[, c("age", "hyp")], FUN = as.factor)
# Do imputation
imp <- imp.rfnode.cond(df)
# Or: imp <- imp.rfnode.prox(df)
# Do analyses
regObj <- with(imp, lm(chl ~ bmi + hyp))
# Pool analyzed results
poolObj <- pool(regObj)
# Extract estimates
res <- reg.ests(poolObj)
RfNode.Cond uses the conditional distribution formed by the prediction nodes, i.e. the weight changes of observations caused by the bootstrapping of the random forest are taken into account, and only “in-bag” observations are used. Users can set method = "rfnode.cond" in the call to mice to use it.
RfNode.Prox uses the concept of proximity matrices of random forests: observations that fall under the same predicting nodes are used as candidates for imputation. Users can set method = "rfnode.prox" in the call to mice to use it.
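The notion of drawing candidates from shared predicting nodes can be sketched with ranger's terminal-node predictions. This is a simplified illustration on simulated data: it picks one tree at random per missing row and treats all observed rows in the same terminal node as donors, which is closer in spirit to the proximity idea. It does not reproduce the in-bag weighting of RfNode.Cond, and all names are hypothetical.

```r
library(ranger)

set.seed(1)

# Toy data (hypothetical): 40 of 200 rows are treated as having missing y
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 2 * dat$x1 - dat$x2 + rnorm(n)
miss <- sample(n, 40)
obs <- dat[-miss, ]
mis <- dat[miss, c("x1", "x2")]

rf <- ranger(y ~ x1 + x2, data = obs, num.trees = 50, min.node.size = 5)

# Terminal node IDs: one row per observation, one column per tree
nodes_obs <- predict(rf, data = obs, type = "terminalNodes")$predictions
nodes_mis <- predict(rf, data = mis, type = "terminalNodes")$predictions

# For one missing row: pick a tree, collect observed rows in the same node,
# and sample one of them as the donor
impute_one <- function(i) {
  tree <- sample(ncol(nodes_mis), 1)
  donors <- which(nodes_obs[, tree] == nodes_mis[i, tree])
  obs$y[donors[sample.int(length(donors), 1)]]
}

draw <- vapply(seq_len(nrow(mis)), impute_one, numeric(1))
```

Because every imputed value is an observed value from a donor, the draws always stay within the range of the observed data.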
Model building for the random forests is accelerated using parallel computation powered by ranger. The ranger software package supports parallel computation in native C++. In our simulations, parallel computation provided a substantial performance boost for the multiple imputation process (about 4x faster on a quad-core laptop).
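To get a rough feel for the effect of multithreading on a given machine, one can time ranger itself with different values of its num.threads argument. The sketch below uses simulated data with arbitrary sizes; the observed speed-up depends on hardware and data, and it measures forest fitting only, not a full imputation run.

```r
library(ranger)

set.seed(1)

# Simulated regression data (sizes are arbitrary)
n <- 5000
dat <- as.data.frame(matrix(rnorm(n * 20), ncol = 20))
dat$y <- dat$V1 + rnorm(n)

# Fit the same forest with one thread and with four threads
t1 <- system.time(ranger(y ~ ., data = dat, num.trees = 200, num.threads = 1))
t4 <- system.time(ranger(y ~ ., data = dat, num.trees = 200, num.threads = 4))

# Elapsed wall-clock time in seconds
elapsed <- c(one_thread = t1[["elapsed"]], four_threads = t4[["elapsed"]])
print(elapsed)
```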