This vignette demonstrates the use of the oosse package for estimating the out-of-sample R² and its standard error through resampling algorithms, as described in “The out-of-sample R²: estimation and inference” by Hawinkel et al. (2023).
install.packages("oosse")
library(oosse)
The R2oosse function works with any pair of fitting and prediction functions. Here we illustrate a number of them, but any prediction function implemented in R can be used. The built-in dataset Brassica is used, which contains rlog-transformed gene expression measurements for the 1,000 most expressed genes in the Expr slot, as well as 5 outcome phenotypes in the Pheno slot.
data(Brassica)
The fitting model must accept at least an outcome vector y and a regressor matrix x:
fitFunLM = function(y, x){lm.fit(y = y, x = cbind(1, x))}
The predictive model must accept arguments mod (the fitted model) and x, the regressor matrix for a new set of observations.
predFunLM = function(mod, x) {cbind(1,x) %*% mod$coef}
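Before passing such a pair of functions to R2oosse, it can help to check the interface on a small simulated dataset. The following is a minimal base R sketch; the toy data and the names xTrain, yTrain and xNew are illustrative assumptions and not part of the package.

```r
set.seed(1)
# Hypothetical toy data: 50 training and 5 new observations, 3 regressors
xTrain = matrix(rnorm(50 * 3), 50, 3)
yTrain = drop(xTrain %*% c(1, -1, 0.5)) + rnorm(50)
xNew = matrix(rnorm(5 * 3), 5, 3)

# Same definitions as in the text
fitFunLM = function(y, x){lm.fit(y = y, x = cbind(1, x))}
predFunLM = function(mod, x) {cbind(1,x) %*% mod$coef}

# Fit on the training data, then predict for the new observations
mod = fitFunLM(y = yTrain, x = xTrain)
preds = predFunLM(mod, xNew)

# The prediction function should return one value per new observation
length(preds) == nrow(xNew)
```

If the final check fails for a custom pair of functions, the prediction function is probably not returning one prediction per row of x.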
Now that these functions have been defined, we fit the prediction model for Leaf_8_width using the first 10 genes. Multithreading is used automatically via the BiocParallel package. Change the following setup depending on your system.
library(BiocParallel)
nCores = 2 # For CRAN build
register(MulticoreParam(nCores))
Now estimate the \(R^2\); a rough estimate of the computation time is also printed. Remember to register the parallel backend, as done above, for multithreading.
R2lm = R2oosse(y = Brassica$Pheno$Leaf_8_width, x = Brassica$Expr[, 1:10],
               fitFun = fitFunLM, predFun = predFunLM)
## Fitting and evaluating the model once took 0 seconds.
## You requested 200 repeats of 10-fold cross-validation with 10 cores, which is expected to last for roughly
## 2.05 seconds
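To make the "repeats of 10-fold cross-validation" in the message above concrete, here is a minimal base R sketch of a single repeat of K-fold cross-validation for the MSE. It is an illustration on simulated data under stated assumptions, not the package's internal code; the variable names (folds, sqErr, mseCV) are hypothetical.

```r
set.seed(2)
# Hypothetical toy data: 60 observations, 3 regressors
n = 60
x = matrix(rnorm(n * 3), n, 3)
y = drop(x %*% c(1, -1, 0.5)) + rnorm(n)

# Fitting and prediction functions following the y/x and mod/x convention
fitFun = function(y, x){lm.fit(y = y, x = cbind(1, x))}
predFun = function(mod, x){drop(cbind(1, x) %*% mod$coefficients)}

nFolds = 10
# Randomly assign every observation to one of nFolds equally sized folds
folds = sample(rep(seq_len(nFolds), length.out = n))
sqErr = numeric(n)
for(k in seq_len(nFolds)){
  test = folds == k
  # Fit on all folds except fold k, predict the held-out fold
  mod = fitFun(y = y[!test], x = x[!test, , drop = FALSE])
  sqErr[test] = (y[test] - predFun(mod, x[test, , drop = FALSE]))^2
}
# Cross-validated MSE for this single repeat
mseCV = mean(sqErr)
mseCV
```

Repeating this with fresh random fold assignments and averaging the results reduces the variability due to the particular split, which is presumably why many repeats are requested by default.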
Estimates and standard error of the different components are now available.
#R2
R2lm$R2
## R2 R2SE
## 0.4332216 0.1647254
#MSE
R2lm$MSE
## MSE MSESE
## 3.2415675 0.7110744
#MST
R2lm$MST
## MST MSTSE
## 5.719286 1.035600
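The three estimates are linked: the printed values are consistent with the out-of-sample \(R^2\) being one minus the ratio of the MSE to the MST, where the MST is presumably the mean squared error of predicting with the sample mean alone. A short check using the numbers printed above (copied by hand, so agreement holds only up to rounding):

```r
mse = 3.2415675 # MSE estimate printed above
mst = 5.719286  # MST estimate printed above

# Out-of-sample R2 as one minus the ratio of the two error estimates
r2 = 1 - mse / mst
r2
```

The result agrees with the reported R2 estimate up to rounding of the printed inputs.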
Confidence intervals can also be constructed:
# R2
buildConfInt(R2lm)
## 2.5% 97.5%
## 0.1103657 0.7560775
#MSE, 90% confidence interval
buildConfInt(R2lm, what = "MSE", conf = 0.9)
## 5% 95%
## 2.071954 4.411181
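The intervals printed above are numerically consistent with a normal approximation, that is, estimate plus or minus a standard normal quantile times the standard error. This is an observation from the printed numbers rather than a statement about how buildConfInt is implemented. A minimal sketch with the estimates and standard errors copied from the output above; the helper name waldCI is hypothetical:

```r
# Hypothetical helper: two-sided normal-approximation confidence interval
waldCI = function(est, se, conf = 0.95){
  z = qnorm(1 - (1 - conf) / 2)
  est + c(-1, 1) * z * se
}

# R2: estimate and standard error printed earlier, 95% interval
ciR2 = waldCI(est = 0.4332216, se = 0.1647254)
ciR2

# MSE: estimate and standard error printed earlier, 90% interval
ciMSE = waldCI(est = 3.2415675, se = 0.7110744, conf = 0.9)
ciMSE
```

Both intervals reproduce the buildConfInt output to the printed precision.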
By default, cross-validation (CV) is used to estimate the MSE, and nonparametric bootstrapping is used to estimate the correlation between MSE and MST estimators. Other choices can be supplied though, e.g. for bootstrap .632 estimation of the MSE and jackknife estimation of the correlation:
R2lm632jn = R2oosse(y = Brassica$Pheno$Leaf_8_width, x = Brassica$Expr[, 1:10],
                    fitFun = fitFunLM, predFun = predFunLM, methodMSE = "bootstrap",
                    methodCor = "jackknife")
## Fitting and evaluating the model once took 0 seconds.
## You requested 200 .632 bootstrap instances with 10 cores, which is expected to last for roughly
## 2.56 seconds
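For readers unfamiliar with the .632 bootstrap, the idea is to combine the optimistic apparent (resubstitution) error with the pessimistic out-of-bag error using weights 0.368 and 0.632. The sketch below is a simplified base R illustration on simulated data: it averages the out-of-bag error per bootstrap sample, which may differ in detail from what the package does, and all variable names are hypothetical.

```r
set.seed(3)
# Hypothetical toy data: 60 observations, 3 regressors
n = 60
x = matrix(rnorm(n * 3), n, 3)
y = drop(x %*% c(1, -1, 0.5)) + rnorm(n)

fitFun = function(y, x){lm.fit(y = y, x = cbind(1, x))}
predFun = function(mod, x){drop(cbind(1, x) %*% mod$coefficients)}

# Apparent error: fit and evaluate on the same data
modFull = fitFun(y = y, x = x)
errApp = mean((y - predFun(modFull, x))^2)

# Out-of-bag error: fit on a bootstrap sample, evaluate on the left-out observations
nBoot = 200
errOobAll = vapply(seq_len(nBoot), function(b){
  id = sample(n, replace = TRUE)
  oob = setdiff(seq_len(n), id)
  mod = fitFun(y = y[id], x = x[id, , drop = FALSE])
  mean((y[oob] - predFun(mod, x[oob, , drop = FALSE]))^2)
}, numeric(1))
errOob = mean(errOobAll)

# .632 estimate: weighted combination of apparent and out-of-bag error
mse632 = 0.368 * errApp + 0.632 * errOob
mse632
```

By construction the .632 estimate lies between the apparent and the out-of-bag error.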
For high-dimensional problems, such as the Brassica dataset, a regularised linear model is better suited to incorporate information from all genes. We use the cv.glmnet function from the glmnet package, which includes internal cross-validation for tuning the penalty parameter. The following custom function definitions are needed to fit in with the naming convention of the oosse package.
fitFunReg = function(y, x, ...) {cv.glmnet(y = y, x = x, ...)}
predFunReg = function(mod, x, ...){predict(mod, newx = x, ...)}
We adapt the parameter settings a bit to reduce computation time of the vignette.
if(require(glmnet)){
    R2pen = R2oosse(y = Brassica$Pheno$Leaf_8_width, x = Brassica$Expr[, seq_len(5e1)], #Subset genes for speed
                    nFolds = 4, cvReps = 1e2, nBootstrapsCor = 30,
                    fitFun = fitFunReg, predFun = predFunReg, alpha = 1)#Lasso model
    R2pen$R2
}
## Loading required package: glmnet
## Loading required package: Matrix
## Loaded glmnet 4.1-6
## Fitting and evaluating the model once took 0.17 seconds.
## You requested 100 repeats of 4-fold cross-validation with 10 cores, which is expected to last for roughly
## 29.41 seconds
## R2 R2SE
## 0.62688448 0.09036653
As a final example we use a random forest as a prediction model. We use the implementation from the randomForest package.
if(require(randomForest)){
    fitFunrf = function(y, x, ...){randomForest(y = y, x, ...)}
    predFunrf = function(mod, x, ...){predict(mod, x, ...)}
    R2rf = R2oosse(y = Brassica$Pheno$Leaf_8_width, x = Brassica$Expr[, seq_len(5e1)],
                   nFolds = 4, cvReps = 1e2, nBootstrapsCor = 30,
                   fitFun = fitFunrf, predFun = predFunrf)
    R2rf$R2
}
## Loading required package: randomForest
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## Fitting and evaluating the model once took 0.08 seconds.
## You requested 100 repeats of 4-fold cross-validation with 10 cores, which is expected to last for roughly
## 13.59 seconds
## R2 R2SE
## 0.67076588 0.08850596
The \(R^2\) estimate is comparable to that of the penalised regression model.
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=de_BE.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=de_BE.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=de_BE.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=de_BE.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] randomForest_4.7-1.1 glmnet_4.1-6 Matrix_1.5-3
## [4] BiocParallel_1.32.6 oosse_1.0.2
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.10 rstudioapi_0.14 knitr_1.42 splines_4.2.1
## [5] lattice_0.20-45 R6_2.5.1 rlang_1.1.0 foreach_1.5.2
## [9] fastmap_1.1.1 tools_4.2.1 rbibutils_2.2.13 parallel_4.2.1
## [13] grid_4.2.1 xfun_0.37 cli_3.6.0 jquerylib_0.1.4
## [17] iterators_1.0.14 htmltools_0.5.4 survival_3.5-5 yaml_2.3.7
## [21] digest_0.6.31 sass_0.4.5 Rdpack_2.4 codetools_0.2-19
## [25] shape_1.4.6 cachem_1.0.7 evaluate_0.20 rmarkdown_2.20
## [29] compiler_4.2.1 bslib_0.4.2 jsonlite_1.8.4