This vignette is adapted from the homepage of the SimEngine website.
library(SimEngine)
#> Loading required package: magrittr
#> Welcome to SimEngine! Full package documentation can be found at:
#> https://avi-kenny.github.io/SimEngine
SimEngine is an open-source R package for structuring, maintaining, running, and debugging statistical simulations on both local and cluster-based computing environments. Emphasis is placed on thorough documentation and scalability.
The goal of many statistical simulations is to test how a new statistical method performs against existing methods. Most statistical simulations include three basic phases: (1) generate some data, (2) run one or more methods using the generated data, and (3) compare the performance of the methods.
To briefly illustrate how these phases are implemented using SimEngine, we will use the example of estimating the average treatment effect of a drug in the context of a randomized controlled trial (RCT).
The simulation object (an R object of class sim_obj) will contain all data, functions, and results related to your simulation. Note that we make extensive use of the pipe operators (%>% and %<>%) from the magrittr package; if you have never used pipes, check out the magrittr documentation.
library(SimEngine)
sim <- new_sim()
In SimEngine, functions that generate data are called creators. Our creator will simulate data from an RCT in which we compare a continuous outcome (e.g. blood pressure) between two groups (the “treatment group” versus the “control group”). We generate the data by looping through individuals, assigning them randomly to either the treatment group or the control group, and generating their outcome according to a simple model (note: although we use a for-loop for illustrative purposes, vectorized methods are often faster).
create_rct_data <- function (num_patients) {
  # Start with an empty data frame and fill it one patient at a time
  df <- data.frame(
    "patient_id" = integer(),
    "group" = character(),
    "outcome" = double(),
    stringsAsFactors = FALSE
  )
  for (i in 1:num_patients) {
    group <- ifelse(sample(c(0,1), size=1)==1, "treatment", "control")
    treatment_effect <- ifelse(group=="treatment", -7, 0)
    outcome <- rnorm(n=1, mean=130, sd=5) + treatment_effect
    df[i,] <- list(i, group, outcome)
  }
  return (df)
}
# Test our creator function
create_rct_data(5)
#> patient_id group outcome
#> 1 1 control 137.4695
#> 2 2 control 128.2048
#> 3 3 treatment 120.0571
#> 4 4 control 127.7421
#> 5 5 control 126.9210
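As noted above, a vectorized creator is often faster than a for-loop. The following is a minimal sketch of what that could look like in base R, assuming the same outcome model as create_rct_data(). The name create_rct_data_vec is hypothetical and is not part of SimEngine. Because it draws random numbers in a different order, it will not reproduce the loop version's data for a given seed.

```r
# Hypothetical vectorized variant of create_rct_data(); not part of SimEngine
create_rct_data_vec <- function(num_patients) {
  # Assign every patient to a group in a single call
  group <- sample(c("treatment", "control"), size = num_patients, replace = TRUE)
  # Same outcome model as the loop: N(130, 5^2), shifted by -7 under treatment
  treatment_effect <- ifelse(group == "treatment", -7, 0)
  outcome <- rnorm(n = num_patients, mean = 130, sd = 5) + treatment_effect
  data.frame(
    patient_id = seq_len(num_patients),
    group = group,
    outcome = outcome,
    stringsAsFactors = FALSE
  )
}

# Quick check that the result has the same columns as the loop version
df_vec <- create_rct_data_vec(5)
str(df_vec)
```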
Once you have declared your creator function, add it to your simulation object using the add_creator() function.
sim %<>% add_creator(create_rct_data)
In this example, we test two different estimators of the average treatment effect. The first estimator uses the known probability of being assigned to the treatment group (0.5), whereas the second estimator uses an estimate of this probability based on the observed data. Don’t worry too much about the mathematical details; the important thing is that both methods take in the dataset generated by the create_rct_data() function and return an estimate of the true treatment effect, which in this case is -7.
estimator_1 <- function(df) {
  n <- nrow(df)
  true_prob <- 0.5
  sum_t <- sum(df$outcome * (df$group=="treatment"))
  sum_c <- sum(df$outcome * (df$group=="control"))
  return ( sum_t/(n*true_prob) - sum_c/(n*(1-true_prob)) )
}
estimator_2 <- function(df) {
  n <- nrow(df)
  est_prob <- sum(df$group=="treatment") / n
  sum_t <- sum(df$outcome * (df$group=="treatment"))
  sum_c <- sum(df$outcome * (df$group=="control"))
  return ( sum_t/(n*est_prob) - sum_c/(n*(1-est_prob)) )
}
# Test our estimator functions
df <- create_rct_data(10000)
estimator_1(df)
#> [1] -4.83141
estimator_2(df)
#> [1] -6.856129
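For readers who do want a little intuition: n*est_prob is simply the number of treated patients, so the second estimator reduces algebraically to the difference in group means. The first estimator instead divides each group's sum by n/2, so any chance imbalance in group sizes is multiplied by outcomes near 130, which is one plausible reason it is noisier. The base R sketch below checks the first claim on a toy dataset; the data frame toy and the seed are illustrative assumptions.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
n <- 200
toy <- data.frame(
  group = sample(c("treatment", "control"), size = n, replace = TRUE),
  stringsAsFactors = FALSE
)
toy$outcome <- rnorm(n, mean = 130, sd = 5) +
  ifelse(toy$group == "treatment", -7, 0)

# Same formula as estimator_2
est_prob <- sum(toy$group == "treatment") / n
sum_t <- sum(toy$outcome * (toy$group == "treatment"))
sum_c <- sum(toy$outcome * (toy$group == "control"))
est_2 <- sum_t / (n * est_prob) - sum_c / (n * (1 - est_prob))

# Plain difference in group means
diff_means <- mean(toy$outcome[toy$group == "treatment"]) -
  mean(toy$outcome[toy$group == "control"])

# The two quantities should agree up to floating-point tolerance
all.equal(est_2, diff_means)
```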
Next, add the methods to your simulation object using the add_method() function.
sim %<>% add_method(estimator_1)
sim %<>% add_method(estimator_2)
Often, we want to run the same simulation multiple times (with each run referred to as a “simulation replicate”), but with certain things changed. In this example, perhaps we want to vary the number of patients and the method used to estimate the average treatment effect. We refer to the things that vary as “simulation levels”. By default, SimEngine will run our simulation 1,000 times for each level combination. Below, since there are two methods and three values of num_patients, we have six level combinations and so SimEngine will run a total of 6,000 simulation replicates.
sim %<>% set_levels(
  estimator = c("estimator_1", "estimator_2"),
  num_patients = c(50, 200, 1000)
)
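To see where the six level combinations come from, it can help to enumerate them by hand. The sketch below uses base R's expand.grid() purely as an illustration. It is not a SimEngine call, and we do not claim SimEngine builds its levels this way internally.

```r
# Illustration only: base R's expand.grid(), not a SimEngine function
levels_grid <- expand.grid(
  estimator = c("estimator_1", "estimator_2"),
  num_patients = c(50, 200, 1000),
  stringsAsFactors = FALSE
)
print(levels_grid)

nrow(levels_grid)          # number of level combinations
nrow(levels_grid) * 1000   # total replicates at 1,000 per combination
```

Each row is one level combination, and multiplying the row count by the number of replicates per combination gives the total number of simulation replicates.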
The simulation script is a function that runs a single simulation replicate and returns the results. Within a script, you can reference the current simulation level values using the variable L. For example, when the first simulation replicate is running, L$estimator will equal “estimator_1” and L$num_patients will equal 50. In the last simulation replicate, L$estimator will equal “estimator_2” and L$num_patients will equal 1,000. This can be used in conjunction with use_method() to dynamically run different methods within different simulation replicates (as is illustrated below). Your script will have access to any creators and methods that have been added to your simulation object.
sim %<>% set_script(function() {
  df <- create_rct_data(L$num_patients)
  estimate <- use_method(L$estimator, list(df))
  return (list("estimate"=estimate))
})
Your script should always return a named list, although your list can be complex and contain dataframes, multiple levels of nesting, etc.
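To make the mechanics of a single replicate concrete, here is a rough base R sketch that runs without SimEngine. It rests on several assumptions: L is built by hand as a plain list (in a real script SimEngine supplies it), do.call() stands in for use_method() as a way to call a function by name, and toy_creator and toy_estimator are hypothetical stand-ins for the creator and methods defined earlier.

```r
# Stand-ins so the sketch runs without SimEngine (hypothetical names)
toy_creator <- function(num_patients) {
  group <- sample(c("treatment", "control"), size = num_patients, replace = TRUE)
  outcome <- rnorm(num_patients, mean = 130, sd = 5) +
    ifelse(group == "treatment", -7, 0)
  data.frame(group = group, outcome = outcome, stringsAsFactors = FALSE)
}
toy_estimator <- function(df) {
  mean(df$outcome[df$group == "treatment"]) -
    mean(df$outcome[df$group == "control"])
}

# Built by hand here; inside a real script SimEngine provides L
L <- list(estimator = "toy_estimator", num_patients = 1000)

one_replicate <- function() {
  df <- toy_creator(L$num_patients)
  # do.call() looks the function up by name, standing in for use_method()
  estimate <- do.call(L$estimator, list(df))
  return(list("estimate" = estimate))
}

set.seed(2)  # arbitrary seed
res <- one_replicate()
str(res)
```

The returned value is a named list with a single element, which is the shape the script is expected to return.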
The set_config() function controls options related to your entire simulation, such as the number of simulation replicates to run for each level combination and how to parallelize your code. This is discussed in detail on the set_config page. We set num_sim to 10, and so SimEngine will run a total of 60 simulation replicates (10 for each level combination).
sim %<>% set_config(
  num_sim = 10,
  n_cores = 2,
  parallel = "outer"
)
A single call to run() executes all 60 simulation replicates and stores the results in the simulation object.
sim %<>% run()
#> Done. No errors or warnings detected.
Once the simulations have finished, use the summarize() function to calculate common summary statistics, such as bias, variance, MSE, and coverage.
sim %>% summarize(
  bias = list(name="bias_ate", truth=-7, estimate="estimate"),
  mse = list(name="mse_ate", truth=-7, estimate="estimate")
)
#> level_id estimator num_patients bias_ate mse_ate
#> 1 1 estimator_1 50 -7.7614791 444.8814693
#> 2 2 estimator_2 50 -0.3634045 1.3219396
#> 3 3 estimator_1 200 -5.2242565 200.0395729
#> 4 4 estimator_2 200 -0.2626850 0.3503838
#> 5 5 estimator_1 1000 2.0829173 55.2037179
#> 6 6 estimator_2 1000 -0.2683631 0.1852299
In this example, we see that the MSE of estimator 1 is much higher than that of estimator 2 and that MSE decreases with increasing sample size for both estimators, as expected.
You can also directly access the results for individual simulation replicates.
head(sim$results)
#> sim_uid level_id sim_id estimator num_patients runtime estimate
#> 1 1 1 1 estimator_1 50 0.017210007 -37.745086
#> 2 2 1 2 estimator_1 50 0.089748144 -8.037211
#> 3 3 1 3 estimator_1 50 0.007508039 -36.700368
#> 4 4 1 4 estimator_1 50 0.005068064 -25.745564
#> 5 5 1 5 estimator_1 50 0.003932953 -5.914498
#> 6 6 1 6 estimator_1 50 0.012669086 -8.192042
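Replicate-level results make it possible to compute summary statistics by hand. The sketch below applies the textbook definitions of bias and MSE to the six estimates shown above; that summarize() uses exactly these formulas internally is an assumption on our part. Since these are only the first six of the ten replicates for this level, the numbers will not match the summary table.

```r
# First six replicate estimates for estimator_1 with num_patients = 50,
# copied from the head(sim$results) output above
estimates <- c(-37.745086, -8.037211, -36.700368,
               -25.745564, -5.914498, -8.192042)
truth <- -7

bias_by_hand <- mean(estimates) - truth               # mean estimate minus truth
mse_by_hand <- mean((estimates - truth)^2)            # mean squared error
var_by_hand <- mean((estimates - mean(estimates))^2)  # variance with divisor n

# Standard decomposition: MSE = variance + bias^2
all.equal(mse_by_hand, var_by_hand + bias_by_hand^2)
```

The decomposition helps explain why an estimator can have a large MSE even when its bias is moderate: the variance term dominates when individual estimates are widely scattered.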