Tidy Simulation with simglm

2019-01-23

Introducing a new framework for simulation and power analysis with the simglm package: tidy simulation. The package now has helper functions in which the first argument is always the data and the second argument is always the simulation arguments. A second vignette contains a more exhaustive list of the simulation arguments that are allowed; this vignette gives the basics needed to specify simulation arguments and generate simulated data.

Functions for Basic Simulation

There are four primary functions to be used for basic simulation functionality. These include:

  • simulate_fixed: simulates the fixed portion of the model (the design matrix).
  • simulate_error: simulates the random error term.
  • simulate_randomeffect: simulates random effects for nested designs.
  • generate_response: combines the simulated pieces to generate the outcome.

We will dive into what each of these components represents in a second, but first let’s start with an example that simulates a linear regression model.

First Linear Regression Example

Let us assume for this first example that our outcome of interest is continuous and that the data is not clustered. In this example, our model would look like: \(Y_{j} = X_{j} \beta + e_{j}\). In this equation, \(Y_{j}\) represents the continuous outcome, \(X_{j} \beta\) represents the fixed portion of the model comprised of regression coefficients (\(\beta\)) and a design matrix (\(X_{j}\)), and \(e_{j}\) represents random error.

Fixed Portion of Model

The fixed portion of the model represents variables that are treated as fixed. This means that the values observed in the data are the only values we are interested in, will generalize to, or consider values of interest in our population. Let us consider the following formula: y ~ 1 + x1 + x2. In this formula there are a total of two variables that are assumed to be fixed, x1 and x2. These variables, together with an intercept, make up the design matrix \(X_{j}\). Let’s generate some example data using the simglm package and the simulate_fixed function.
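A call that generates output like the one below could look as follows. The specific means and SDs (x1 centered near 180, x2 near 40) are inferred from the printed values, so treat this as an illustrative reconstruction rather than the exact original code.

```r
library(simglm)
library(magrittr)

set.seed(321)

sim_arguments <- list(
  formula = y ~ 1 + x1 + x2,   # only the right-hand side is used by simulate_fixed
  fixed = list(
    x1 = list(var_type = 'continuous', mean = 180, sd = 30),  # inferred from output
    x2 = list(var_type = 'continuous', mean = 40, sd = 5)     # inferred from output
  ),
  sample_size = 10
)

simulate_fixed(data = NULL, sim_arguments)
```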

##    X.Intercept.       x1       x2 level1_id
## 1             1 231.1471 41.73851         1
## 2             1 158.6388 47.42296         2
## 3             1 171.6605 40.94163         3
## 4             1 176.4105 52.21630         4
## 5             1 176.2812 34.23280         5
## 6             1 188.0455 35.97664         6
## 7             1 201.8052 42.28035         7
## 8             1 186.9941 42.10166         8
## 9             1 190.1734 42.88792         9
## 10            1 163.4426 42.23178        10

The first three columns of the resulting data frame are the design matrix described above, \(X_{j}\). You may have noticed that the simulate_fixed function needs three elements defined in the simulation arguments (called sim_arguments). These elements are:

  • formula: this argument is an R formula representing the model to be simulated. The simulate_fixed function uses only the right-hand side.
  • fixed: this argument gives the specific details for generating each variable (other than the intercept) on the right-hand side of the formula simulation argument. Each variable is specified as its own named list, and var_type specifies the type of variable to generate. The var_type argument is required for every fixed variable. Optional arguments, for example mean = and sd = above, will be discussed in more detail later.
  • sample_size: this argument tells the function how many responses to generate.

The columns x1 and x2 represent variables that we would gather if these data were real. To reflect a real-life scenario, consider the following fixed simulation.
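A sketch of the same simulation with more realistic variable names; only the formula and the names inside fixed change, while the distributional details are reused from the reconstruction above:

```r
set.seed(321)

sim_arguments <- list(
  formula = y ~ 1 + weight + age,
  fixed = list(
    weight = list(var_type = 'continuous', mean = 180, sd = 30),
    age = list(var_type = 'continuous', mean = 40, sd = 5)
  ),
  sample_size = 10
)

simulate_fixed(data = NULL, sim_arguments)
```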

##    X.Intercept.   weight      age level1_id
## 1             1 231.1471 41.73851         1
## 2             1 158.6388 47.42296         2
## 3             1 171.6605 40.94163         3
## 4             1 176.4105 52.21630         4
## 5             1 176.2812 34.23280         5
## 6             1 188.0455 35.97664         6
## 7             1 201.8052 42.28035         7
## 8             1 186.9941 42.10166         8
## 9             1 190.1734 42.88792         9
## 10            1 163.4426 42.23178        10

Now, instead of the variables being called x1 and x2, they reflect weight (measured in lbs) and age (measured continuously, not rounded to whole digits). If we wished the age variable to be a whole integer, we could change the variable type to 'ordinal', as follows.
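A sketch of the ordinal specification; the levels = 30:60 range matches the integer ages in the output below and the power example later in the post:

```r
set.seed(321)

sim_arguments <- list(
  formula = y ~ 1 + weight + age,
  fixed = list(
    weight = list(var_type = 'continuous', mean = 180, sd = 30),
    age = list(var_type = 'ordinal', levels = 30:60)  # integer ages sampled from 30-60
  ),
  sample_size = 10
)

simulate_fixed(data = NULL, sim_arguments)
```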

##    X.Intercept.   weight age level1_id
## 1             1 231.1471  49         1
## 2             1 158.6388  60         2
## 3             1 171.6605  58         3
## 4             1 176.4105  45         4
## 5             1 176.2812  47         5
## 6             1 188.0455  53         6
## 7             1 201.8052  60         7
## 8             1 186.9941  43         8
## 9             1 190.1734  33         9
## 10            1 163.4426  48        10

When specifying var_type = 'ordinal', an additional levels argument is needed to determine which values are possible in the simulation. The last type of variable worth discussing now is a factor, or categorical, variable. These variables can be generated by setting var_type = 'factor'.
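This specification matches the one used explicitly in the power-analysis example later in the post:

```r
set.seed(321)

sim_arguments <- list(
  formula = y ~ 1 + weight + age + sex,
  fixed = list(
    weight = list(var_type = 'continuous', mean = 180, sd = 30),
    age = list(var_type = 'ordinal', levels = 30:60),
    sex = list(var_type = 'factor', levels = c('male', 'female'))  # labels given directly
  ),
  sample_size = 10
)

simulate_fixed(data = NULL, sim_arguments)
```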

##    X.Intercept.   weight age sex  weight1 age1   sex1 level1_id
## 1             1 231.1471  49   1 231.1471   49   male         1
## 2             1 158.6388  60   0 158.6388   60 female         2
## 3             1 171.6605  58   0 171.6605   58 female         3
## 4             1 176.4105  45   0 176.4105   45 female         4
## 5             1 176.2812  47   0 176.2812   47 female         5
## 6             1 188.0455  53   1 188.0455   53   male         6
## 7             1 201.8052  60   0 201.8052   60 female         7
## 8             1 186.9941  43   1 186.9941   43   male         8
## 9             1 190.1734  33   0 190.1734   33 female         9
## 10            1 163.4426  48   0 163.4426   48 female        10

The required arguments for a factor variable are identical to those for an ordinal variable; however, for factor variables the levels argument can either be an integer or a character vector in which the labels are specified directly. As you can see from the output, when a character vector is used, two variables are returned: one that represents the variable numerically and another that represents it as a character vector. More details on this behavior to follow.

Simulate Random Error

The simulation of random error (\(e_{j}\) in the equation above) is a bit simpler than generating the fixed effects. Suppose, for example, we simply want to simulate random errors that are normally distributed with a mean of 0 and a variance of 1. This can be done with the simulate_error function.
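A minimal sketch of such a call; simulate_error follows the same data-then-arguments convention as simulate_fixed, and here relies entirely on the defaults:

```r
set.seed(321)

sim_arguments <- list(
  sample_size = 10
)

simulate_error(data = NULL, sim_arguments)
```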

##         error level1_id
## 1   1.7049032         1
## 2  -0.7120386         2
## 3  -0.2779849         3
## 4  -0.1196490         4
## 5  -0.1239606         5
## 6   0.2681838         6
## 7   0.7268415         7
## 8   0.2331354         8
## 9   0.3391139         9
## 10 -0.5519147        10

The simulate_error function only needs to know how many data values to simulate. By default, the rnorm function is used to generate random error, and this function assumes a mean of 0 and a standard deviation of 1. However, I personally prefer slightly more verbose code.
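The more verbose version states the error variance explicitly; with variance = 1 the output below is identical to the default output above:

```r
set.seed(321)

sim_arguments <- list(
  error = list(variance = 1),  # explicit, even though 1 is the default
  sample_size = 10
)

simulate_error(data = NULL, sim_arguments)
```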

##         error level1_id
## 1   1.7049032         1
## 2  -0.7120386         2
## 3  -0.2779849         3
## 4  -0.1196490         4
## 5  -0.1239606         5
## 6   0.2681838         6
## 7   0.7268415         7
## 8   0.2331354         8
## 9   0.3391139         9
## 10 -0.5519147        10

This code makes it clearer when the error variance is to be set to some other value. For example:
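The errors below are exactly five times those above, consistent with a variance of 25 (standard deviation of 5):

```r
set.seed(321)

sim_arguments <- list(
  error = list(variance = 25),  # SD of 5; the simulated errors scale accordingly
  sample_size = 10
)

simulate_error(data = NULL, sim_arguments)
```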

##         error level1_id
## 1   8.5245161         1
## 2  -3.5601928         2
## 3  -1.3899246         3
## 4  -0.5982451         4
## 5  -0.6198031         5
## 6   1.3409189         6
## 7   3.6342074         7
## 8   1.1656771         8
## 9   1.6955694         9
## 10 -2.7595733        10

Generate Response Variable

Now that we have seen the basics of simulating fixed variables (covariates) and random error, we can generate the response by combining the previous two steps and using the generate_response function. What makes this a tidy simulation is that the pipe from magrittr, %>%, can be used to combine the steps into a simulation pipeline.
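A sketch of the full pipeline; the reg_weights values are taken from the power-analysis example later in the post and are consistent with the fixed_outcome column below (row 1: 2 + 0.3 * 231.15 - 0.1 * 49 + 0.5 * 1 ≈ 66.94):

```r
set.seed(321)

sim_arguments <- list(
  formula = y ~ 1 + weight + age + sex,
  fixed = list(
    weight = list(var_type = 'continuous', mean = 180, sd = 30),
    age = list(var_type = 'ordinal', levels = 30:60),
    sex = list(var_type = 'factor', levels = c('male', 'female'))
  ),
  error = list(variance = 25),
  sample_size = 10,
  reg_weights = c(2, 0.3, -0.1, 0.5)  # intercept, weight, age, sex
)

simulate_fixed(data = NULL, sim_arguments) %>%
  simulate_error(sim_arguments) %>%
  generate_response(sim_arguments)
```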

##    X.Intercept.   weight age sex  weight1 age1   sex1 level1_id
## 1             1 231.1471  49   1 231.1471   49   male         1
## 2             1 158.6388  60   0 158.6388   60 female         2
## 3             1 171.6605  58   0 171.6605   58 female         3
## 4             1 176.4105  45   0 176.4105   45 female         4
## 5             1 176.2812  47   0 176.2812   47 female         5
## 6             1 188.0455  53   1 188.0455   53   male         6
## 7             1 201.8052  60   0 201.8052   60 female         7
## 8             1 186.9941  43   1 186.9941   43   male         8
## 9             1 190.1734  33   0 190.1734   33 female         9
## 10            1 163.4426  48   0 163.4426   48 female        10
##          error fixed_outcome random_effects        y
## 1    4.5862776      66.94413              0 71.53041
## 2   -0.5353077      43.59165              0 43.05635
## 3    4.9416770      47.69814              0 52.63981
## 4   -5.3611940      50.42316              0 45.06196
## 5   -3.7900764      50.18435              0 46.39428
## 6    0.4750036      53.61365              0 54.08866
## 7  -11.6546559      56.54157              0 44.88692
## 8    2.0875799      54.29822              0 56.38580
## 9   -5.6016371      55.75202              0 50.15039
## 10  -2.3734235      46.23277              0 43.85934

The only additional argument needed by the generate_response function is reg_weights. This argument represents the regression coefficients, \(\beta\), in the equation \(Y_{j} = X_{j} \beta + e_{j}\). The regression coefficients are multiplied by the design matrix to generate the column labeled “fixed_outcome” in the output. The output also contains a “random_effects” column, all 0 here because there are no random effects, and the response variable, “y”.

Non-normal Outcomes

Non-normal outcomes are possible with simglm. Two non-normal outcomes are currently supported, with more coming in the future. Binary and count outcomes are supported and can be specified with the outcome_type simulation argument. If outcome_type = 'logistic' or outcome_type = 'binary', then a binary outcome is generated (i.e., a 0/1 variable) using the rbinom function. If outcome_type = 'count' or outcome_type = 'poisson', then the outcome is transformed into a count variable (i.e., a discrete variable: 0, 1, 2, etc.).

Here is an example of generating a binary outcome.
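Building on the pipeline above, the only new simulation argument is outcome_type; everything else is reused from the reconstruction of the previous example:

```r
set.seed(321)

sim_arguments <- list(
  formula = y ~ 1 + weight + age + sex,
  fixed = list(
    weight = list(var_type = 'continuous', mean = 180, sd = 30),
    age = list(var_type = 'ordinal', levels = 30:60),
    sex = list(var_type = 'factor', levels = c('male', 'female'))
  ),
  error = list(variance = 25),
  sample_size = 10,
  reg_weights = c(2, 0.3, -0.1, 0.5),
  outcome_type = 'binary'   # 'logistic' behaves the same way
)

simulate_fixed(data = NULL, sim_arguments) %>%
  simulate_error(sim_arguments) %>%
  generate_response(sim_arguments)
```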

##    X.Intercept.   weight age sex  weight1 age1   sex1 level1_id
## 1             1 231.1471  49   1 231.1471   49   male         1
## 2             1 158.6388  60   0 158.6388   60 female         2
## 3             1 171.6605  58   0 171.6605   58 female         3
## 4             1 176.4105  45   0 176.4105   45 female         4
## 5             1 176.2812  47   0 176.2812   47 female         5
## 6             1 188.0455  53   1 188.0455   53   male         6
## 7             1 201.8052  60   0 201.8052   60 female         7
## 8             1 186.9941  43   1 186.9941   43   male         8
## 9             1 190.1734  33   0 190.1734   33 female         9
## 10            1 163.4426  48   0 163.4426   48 female        10
##          error fixed_outcome random_effects untransformed_outcome y
## 1    4.5862776      66.94413              0              71.53041 1
## 2   -0.5353077      43.59165              0              43.05635 1
## 3    4.9416770      47.69814              0              52.63981 1
## 4   -5.3611940      50.42316              0              45.06196 1
## 5   -3.7900764      50.18435              0              46.39428 1
## 6    0.4750036      53.61365              0              54.08866 1
## 7  -11.6546559      56.54157              0              44.88692 1
## 8    2.0875799      54.29822              0              56.38580 1
## 9   -5.6016371      55.75202              0              50.15039 1
## 10  -2.3734235      46.23277              0              43.85934 1

And finally, an example of generating a count outcome. Note that the weight variable here has been grand-mean centered in the generation (i.e., mean = 0). This helps to ensure that the counts are not too large.
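A sketch consistent with the output below: weight is generated with mean = 0 (grand-mean centered), and reg_weights of c(2, 0.01, 0.5) reproduce the fixed_outcome column (row 1: 2 + 0.01 * 51.15 ≈ 2.51). The error variance of 25 is an assumption carried over from the earlier examples.

```r
set.seed(321)

sim_arguments <- list(
  formula = y ~ 1 + weight + sex,
  fixed = list(
    weight = list(var_type = 'continuous', mean = 0, sd = 30),  # grand-mean centered
    sex = list(var_type = 'factor', levels = c('male', 'female'))
  ),
  error = list(variance = 25),    # assumed; carried over from earlier examples
  sample_size = 10,
  reg_weights = c(2, 0.01, 0.5),  # inferred from the fixed_outcome column
  outcome_type = 'count'          # 'poisson' behaves the same way
)

simulate_fixed(data = NULL, sim_arguments) %>%
  simulate_error(sim_arguments) %>%
  generate_response(sim_arguments)
```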

##    X.Intercept.     weight sex    weight1   sex1 level1_id      error
## 1             1  51.147097   0  51.147097 female         1 -4.0233584
## 2             1 -21.361157   0 -21.361157 female         2  2.2803457
## 3             1  -8.339547   0  -8.339547 female         3  2.1016629
## 4             1  -3.589471   1  -3.589471   male         4  2.8879225
## 5             1  -3.718819   0  -3.718819 female         5  2.2317803
## 6             1   8.045513   0   8.045513 female         6  4.5862776
## 7             1  21.805245   0  21.805245 female         7 -0.5353077
## 8             1   6.994062   1   6.994062   male         8  4.9416770
## 9             1  10.173416   1  10.173416   male         9 -5.3611940
## 10            1 -16.557440   0 -16.557440 female        10 -3.7900764
##    fixed_outcome random_effects untransformed_outcome    y
## 1       2.511471              0             -1.511887    0
## 2       1.786388              0              4.066734   53
## 3       1.916605              0              4.018267   58
## 4       2.464105              0              5.352028  194
## 5       1.962812              0              4.194592   70
## 6       2.080455              0              6.666733  743
## 7       2.218052              0              1.682745    7
## 8       2.569941              0              7.511618 1767
## 9       2.601734              0             -2.759460    0
## 10      1.834426              0             -1.955651    0

Functions for Power Analysis

Now that the basics of tidy simulation have been shown in the context of a linear regression model, let’s explore an example in which the power for this model is evaluated. In particular, suppose we are interested in estimating empirical power for the three fixed effects in the formula y ~ 1 + weight + age + sex; that is, power for the “weight”, “age”, and “sex” variables. A few additional functions are needed for this step, including:

  • model_fit: fits a statistical model to the simulated data.
  • extract_coefficients: extracts the parameter estimates, standard errors, test statistics, and p-values from the fitted model.

To fit a model and extract coefficients, we could do the following building off the example from the previous section:

set.seed(321) 

sim_arguments <- list(
  formula = y ~ 1 + weight + age + sex,
  fixed = list(weight = list(var_type = 'continuous', mean = 180, sd = 30),
               age = list(var_type = 'ordinal', levels = 30:60),
               sex = list(var_type = 'factor', levels = c('male', 'female'))),
  error = list(variance = 25),
  sample_size = 10,
  reg_weights = c(2, 0.3, -0.1, 0.5)
)

simulate_fixed(data = NULL, sim_arguments) %>%
  simulate_error(sim_arguments) %>%
  generate_response(sim_arguments) %>% 
  model_fit(sim_arguments) %>%
  extract_coefficients()
## # A tibble: 4 x 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)  11.6      20.2        0.573  0.588 
## 2 weight        0.208     0.0945     2.20   0.0698
## 3 age          -0.0362    0.191     -0.190  0.856 
## 4 sex           8.79      4.07       2.16   0.0741

This output contains the model results for a single data simulation; more specifically, we can see the parameter name, the parameter estimate, its standard error, the test statistic, and the p-value. These were estimated using the lm function based on the same formula defined in sim_arguments. It is possible to specify your own formula, model fitting function, and regression weights for the model_fit function. For example, suppose we knew that weight was an important predictor but were unable to collect it in real life. We could then specify an alternative model when evaluating power, while still including the variable in the data generation step.

set.seed(321) 

sim_arguments <- list(
  formula = y ~ 1 + weight + age + sex,
  fixed = list(weight = list(var_type = 'continuous', mean = 180, sd = 30),
               age = list(var_type = 'ordinal', levels = 30:60),
               sex = list(var_type = 'factor', levels = c('male', 'female'))),
  error = list(variance = 25),
  sample_size = 10,
  reg_weights = c(2, 0.3, -0.1, 0.5),
  model_fit = list(formula = y ~ 1 + age + sex,
                   model_function = 'lm',
                   reg_weights = c(2, -0.1, 0.5))
)

simulate_fixed(data = NULL, sim_arguments) %>%
  simulate_error(sim_arguments) %>%
  generate_response(sim_arguments) %>% 
  model_fit(sim_arguments) %>%
  extract_coefficients()
## Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
##  extra argument 'reg_weights' will be disregarded
## # A tibble: 3 x 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)  50.6       12.0       4.20  0.00403
## 2 age          -0.0802     0.236    -0.340 0.744  
## 3 sex          13.9        4.15      3.36  0.0120

Notice that we now get very different parameter estimates for this single data generation, reflecting the contribution of the “weight” variable that is not accounted for in the model fitting.

Replicate Analysis

When evaluating empirical power, it is essential to replicate the analysis, just as in a Monte Carlo study, to avoid drawing conclusions from a single extreme simulated data set. The simglm package offers functions to aid in this replication given the simulation conditions and the desired simulation framework. Two additional functions are used:

  • replicate_simulation: repeats the entire simulation and model fitting process a specified number of times.
  • compute_statistics: aggregates the replicated results into power, type I error, and related statistics.
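A sketch of what the replication pipeline could look like, reusing the sim_arguments from the previous section. The replications and extract_coefficients argument names are assumptions about the package interface rather than something shown in this post.

```r
set.seed(321)

sim_arguments <- list(
  formula = y ~ 1 + weight + age + sex,
  fixed = list(
    weight = list(var_type = 'continuous', mean = 180, sd = 30),
    age = list(var_type = 'ordinal', levels = 30:60),
    sex = list(var_type = 'factor', levels = c('male', 'female'))
  ),
  error = list(variance = 25),
  sample_size = 10,
  reg_weights = c(2, 0.3, -0.1, 0.5),
  model_fit = list(formula = y ~ 1 + age + sex,
                   model_function = 'lm',
                   reg_weights = c(2, -0.1, 0.5)),
  replications = 10,            # assumed argument name
  extract_coefficients = TRUE   # assumed argument name
)

replicate_simulation(sim_arguments) %>%
  compute_statistics(sim_arguments)
```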

## Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
##  extra argument 'reg_weights' will be disregarded

## Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
##  extra argument 'reg_weights' will be disregarded

## Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
##  extra argument 'reg_weights' will be disregarded

## Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
##  extra argument 'reg_weights' will be disregarded

## Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
##  extra argument 'reg_weights' will be disregarded

## Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
##  extra argument 'reg_weights' will be disregarded

## Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
##  extra argument 'reg_weights' will be disregarded

## Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
##  extra argument 'reg_weights' will be disregarded

## Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
##  extra argument 'reg_weights' will be disregarded

## Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
##  extra argument 'reg_weights' will be disregarded
## # A tibble: 3 x 10
##   term  avg_estimate power avg_test_stat type_1_error avg_adjtest_stat
##   <chr>        <dbl> <dbl>         <dbl>        <dbl>            <dbl>
## 1 (Int…       60.4     0.8         3.55           0.7            3.43 
## 2 age         -0.189   0.2        -0.420          0.2           -0.154
## 3 sex         -1.95    0.3        -0.208          0.3           -0.287
## # … with 4 more variables: param_estimate_sd <dbl>,
## #   avg_standard_error <dbl>, precision_ratio <dbl>, replications <dbl>

As can be seen from the output, the default behavior is to return, for each term, statistics for power, the average test statistic, the type I error rate, the adjusted average test statistic, the standard deviation of the parameter estimates, the average standard error, the precision ratio (standard deviation of the parameter estimates divided by the average standard error), and the number of replications.

To generate this output, only the number of replications needs to be added to the simulation arguments. Here only 10 replications were done; in practice, many more replications would be used to ensure accurate results.

The default power parameters are a normal distribution, two-tailed alternative hypotheses, and an alpha of 0.05. These defaults can be overridden by including a power argument within the simulation arguments. For example, if a t-distribution with one degree of freedom and an alpha of 0.02 is desired, these can be added as follows:
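A sketch of how the power argument could be specified; the dist, alpha, and opts element names are assumptions about the package's power interface, where opts holds extra arguments (here df) passed to the quantile function.

```r
set.seed(321)

sim_arguments <- list(
  formula = y ~ 1 + weight + age + sex,
  fixed = list(
    weight = list(var_type = 'continuous', mean = 180, sd = 30),
    age = list(var_type = 'ordinal', levels = 30:60),
    sex = list(var_type = 'factor', levels = c('male', 'female'))
  ),
  error = list(variance = 25),
  sample_size = 10,
  reg_weights = c(2, 0.3, -0.1, 0.5),
  model_fit = list(formula = y ~ 1 + age + sex,
                   model_function = 'lm',
                   reg_weights = c(2, -0.1, 0.5)),
  replications = 10,
  power = list(dist = 'qt',           # assumed: t-distribution quantile function
               alpha = .02,           # assumed: alpha level
               opts = list(df = 1))   # assumed: extra arguments for the distribution
)

replicate_simulation(sim_arguments) %>%
  compute_statistics(sim_arguments)
```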

## Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
##  extra argument 'reg_weights' will be disregarded

## Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
##  extra argument 'reg_weights' will be disregarded

## Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
##  extra argument 'reg_weights' will be disregarded

## Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
##  extra argument 'reg_weights' will be disregarded

## Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
##  extra argument 'reg_weights' will be disregarded

## Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
##  extra argument 'reg_weights' will be disregarded

## Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
##  extra argument 'reg_weights' will be disregarded

## Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
##  extra argument 'reg_weights' will be disregarded

## Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
##  extra argument 'reg_weights' will be disregarded

## Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
##  extra argument 'reg_weights' will be disregarded
## # A tibble: 3 x 10
##   term  avg_estimate power avg_test_stat type_1_error avg_adjtest_stat
##   <chr>        <dbl> <dbl>         <dbl>        <dbl>            <dbl>
## 1 (Int…       60.4       0         3.55             0            3.43 
## 2 age         -0.189     0        -0.420            0           -0.154
## 3 sex         -1.95      0        -0.208            0           -0.287
## # … with 4 more variables: param_estimate_sd <dbl>,
## #   avg_standard_error <dbl>, precision_ratio <dbl>, replications <dbl>

Nested Designs

Nested designs are ones in which data belong to multiple levels. An example could be individuals nested within neighborhoods. In this example, a specific individual is tied directly to one neighborhood. These types of data often include correlations between individuals within a neighborhood that need to be taken into account when modeling the data. These types of data can be generated with simglm.

To do this, the formula syntax introduced above is modified slightly to include information on the nesting structure. In the example below, the nesting structure is specified in the portion of the formula within parentheses, which can be read as: add a random intercept effect for each neighborhood. This random intercept effect is similar to the random error found in regression models, except that the error is associated with neighborhoods and takes the same value for all individuals within a neighborhood. The simulation argument randomeffect provides information about this term, including its variance.

Finally, the sample_size argument needs to be modified to include information about the two levels of nesting. You could read the sample_size = list(level1 = 10, level2 = 20) argument below as: create 20 neighborhoods (i.e., level2 = 20), each containing 10 individuals (i.e., level1 = 10). The total sample size (i.e., the number of rows in the data) is therefore 200.

set.seed(321) 

sim_arguments <- list(
  formula = y ~ 1 + weight + age + sex + (1 | neighborhood),
  reg_weights = c(4, -0.03, 0.2, 0.33),
  fixed = list(weight = list(var_type = 'continuous', mean = 180, sd = 30),
               age = list(var_type = 'ordinal', levels = 30:60),
               sex = list(var_type = 'factor', levels = c('male', 'female'))),
  randomeffect = list(int_neighborhood = list(variance = 8, var_level = 2)),
  sample_size = list(level1 = 10, level2 = 20)
)

nested_data <- simulate_fixed(data = NULL, sim_arguments) %>%
  simulate_randomeffect(sim_arguments) %>%
  simulate_error(sim_arguments) %>%
  generate_response(sim_arguments)

head(nested_data, n = 10)
##    X.Intercept.   weight age sex  weight1 age1   sex1 level1_id
## 1             1 231.1471  33   0 231.1471   33 female         1
## 2             1 158.6388  51   1 158.6388   51   male         2
## 3             1 171.6605  60   0 171.6605   60 female         3
## 4             1 176.4105  38   1 176.4105   38   male         4
## 5             1 176.2812  51   1 176.2812   51   male         5
## 6             1 188.0455  60   0 188.0455   60 female         6
## 7             1 201.8052  47   1 201.8052   47   male         7
## 8             1 186.9941  33   1 186.9941   33   male         8
## 9             1 190.1734  37   0 190.1734   37 female         9
## 10            1 163.4426  36   0 163.4426   36 female        10
##    neighborhood int_neighborhood       error fixed_outcome random_effects
## 1             1        -2.018169  0.03970688      3.665587      -2.018169
## 2             1        -2.018169  1.10030811      9.770835      -2.018169
## 3             1        -2.018169  1.48822364     10.850186      -2.018169
## 4             1        -2.018169 -0.29997335      6.637684      -2.018169
## 5             1        -2.018169  0.47789336      9.241565      -2.018169
## 6             1        -2.018169 -0.38778144     10.358635      -2.018169
## 7             1        -2.018169 -0.35268182      7.675843      -2.018169
## 8             1        -2.018169  0.93313126      5.320178      -2.018169
## 9             1        -2.018169 -1.39912533      5.694798      -2.018169
## 10            1        -2.018169 -0.01603532      6.296723      -2.018169
##            y
## 1   1.687125
## 2   8.852974
## 3  10.320241
## 4   4.319542
## 5   7.701289
## 6   7.952684
## 7   5.304992
## 8   4.235141
## 9   2.277503
## 10  4.262519
nrow(nested_data)
## [1] 200

The vignette simulation_arguments contains more examples of specifying random effects and additional nesting designs, including three-level nested and cross-classified models.