saeSim is designed to make the data simulation process in the context of small area estimation more compact, while remaining flexible enough for customization.

An initial example

Consider a linear mixed model. It contains two components: a fixed-effects part and an error component. The error component can be split into a random-effects part and a model error.

library(saeSim)
setup <- sim_base() %>% 
  sim_gen_x() %>% 
  sim_gen_e() %>% 
  sim_resp_eq(y = 100 + 2 * x + e) %>% 
  sim_simName("Doku")
setup
## data.frame [10,000 x 5]
## 
##    idD idU          x         e         y
## 1    1   1 -2.5058152 -3.217326  91.77104
## 2    1   2  0.7345733 -4.226103  97.24304
## 3    1   3 -3.3425144 -4.141583  89.17339
## 4    1   4  6.3811232 -4.742241 108.02001
## 5    1   5  1.3180311 -2.001758 100.63430
## 6    1   6 -3.2818735 -2.099955  91.33630
## .. ... ...        ...       ...       ...

sim_base() is responsible for supplying a data.frame to which variables can be added. By default it creates a data.frame with the indicator variables idD and idU (a 2-level model), which uniquely identify observations. sim_resp adds a variable y as the response.

The setup itself does not contain the simulated data but the functions to process the data. To start a simulation, use the function sim. It returns a list containing data.frames as elements:

dataList <- sim(setup, R = 10)
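For analysing results across replications it is often convenient to stack this list into a single data.frame, which can be done with base R. A minimal sketch, assuming the setup from the initial example:

```r
library(saeSim)
setup <- sim_base() %>%
  sim_gen_x() %>%
  sim_gen_e() %>%
  sim_resp_eq(y = 100 + 2 * x + e)

dataList <- sim(setup, R = 10)

# Stack all replications into one data.frame for analysis:
allData <- do.call(rbind, dataList)
```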

You can coerce a simulation setup to a data.frame with as.data.frame.

simData <- sim_base() %>% 
  sim_gen_x() %>% 
  sim_gen_e() %>% 
  as.data.frame

Naming and structure

Components in a simulation setup should be added using the pipe operator %>% from magrittr. A component defines ‘when’ a specific function will be applied in a chain of functions. To add a component you can use one of: sim_gen, sim_resp, sim_comp_pop, sim_sample, sim_comp_sample, sim_agg and sim_comp_agg. They all expect a simulation setup as their first argument and a function as their second, and they take care of the order in which the functions are called. The reason for this is the following:

setup <- sim_base() %>% 
  sim_gen_x() %>% 
  sim_gen_e() %>% 
  sim_resp_eq(y = 100 + 2 * x + e)

setup1 <- setup %>% sim_sample(sample_fraction(0.05))
setup2 <- setup %>% sim_sample(sample_number(5))

You can define a rudimentary scenario and then only have to specify how other scenarios differ from it; you do not have to redefine them. setup1 and setup2 only differ in the way samples are drawn. sim_sample ensures that sampling takes place at the appropriate position in the chain of functions, no matter how setup was composed.

If you can’t remember all function names, use auto-completion. All functions that add components start with sim_, and all functions meant to be used in a given phase start with the corresponding prefix: for example, to set the sampling scheme you use sim_sample, and all functions controlling sampling have the prefix sample_.

Exploring the setup

You will want to check your results regularly when working with sim_setup objects. Some methods are supplied to do that:

setup <- sim_base_lmm()
plot(setup)

autoplot(setup)

autoplot(setup, "e")

autoplot(setup %>% sim_gen_vc())

sim_gen

Semi-custom data

saeSim has an interface for supplying any random number generator. If things get more complex, you can always define new generator functions.

base_id(2, 3) %>% sim_gen(gen_generic(rnorm, mean = 5, sd = 10, name = "x", groupVars = "idD"))
## data.frame [6 x 3]
## 
##   idD idU         x
## 1   1   1 15.041606
## 2   1   2 15.041606
## 3   1   3 15.041606
## 4   2   1  6.311484
## 5   2   2  6.311484
## 6   2   3  6.311484

You can supply any random number generator to gen_generic, and since we are in small area estimation there is an optional group variable to generate ‘area-level’ variables. Some shortcuts for data generation are sim_gen_x, sim_gen_v and sim_gen_e, which add normally distributed variables named ‘x’, ‘v’ and ‘e’ respectively. There are also some functions with the prefix ‘gen’, which will be extended in the future.
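If gen_generic does not cover your case, a generator can be written as a plain function that takes a data.frame and returns it with additional columns, following the same convention as the other functions in the package. A minimal sketch of a hypothetical uniform generator (the name gen_runif and its arguments are made up for illustration):

```r
library(saeSim)

# Hypothetical custom generator: adds a uniformly distributed
# variable 'z'. Any function with this shape (data.frame in,
# data.frame out) can be passed to sim_gen.
gen_runif <- function(dat) {
  dat$z <- runif(nrow(dat), min = -1, max = 1)
  dat
}

base_id(2, 3) %>% sim_gen(gen_runif)
```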

library(saeSim)
setup <- sim_base() %>% 
  sim_gen_x() %>% # Variable 'x'
  sim_gen_e() %>% # Variable 'e'
  sim_gen_v() %>% # Variable 'v' as a random-effect
  sim_gen(gen_v_sar(name = "vSp")) %>% # Variable 'vSp' as a random-effect following a SAR(1)
  sim_resp_eq(y = 100 + x + v + vSp + e) # Computing 'y'
setup
## data.frame [10,000 x 7]
## 
##    idD idU         x          e         v       vSp         y
## 1    1   1 -1.799064  4.2836539 0.4444671 -2.132081 100.79698
## 2    1   2  5.018624  1.5867889 0.4444671 -2.132081 104.91780
## 3    1   3 -5.371760  1.7791534 0.4444671 -2.132081  94.71978
## 4    1   4  4.588712 -2.6134366 0.4444671 -2.132081 100.28766
## 5    1   5  2.167278  0.8112708 0.4444671 -2.132081 101.29093
## 6    1   6 -3.344273 -2.8669573 0.4444671 -2.132081  92.10116
## .. ... ...       ...        ...       ...       ...       ...

Contaminated data

For contaminated data you can use the same generator functions; however, instead of sim_gen you use sim_gen_cont, which has some extra arguments to control the contamination. To extend the above setup with a contaminated spatially correlated error component, you can add the following:

contSetup <- setup %>% 
  sim_gen_cont(gen_v_sar(sd = 40, name = "vSp"), 
               nCont = 0.05, type = "area", areaVar = "idD", fixed = TRUE)

Note that the generator is the same but with a higher standard deviation. The argument nCont controls how many observations are contaminated. A value smaller than 1 is interpreted as a probability; a single number of at least 1 as the number of contaminated units (areas, observations in each area, or observations, depending on type); a vector with length(nCont) > 1 as the number of contaminated observations in each area. Use the following example to see how these settings play together:

sim(base_id(3, 4) %>% sim_gen_x() %>% sim_gen_e() %>% 
      sim_gen_ec(mean = 0, sd = 150, name = "eCont", nCont = c(1, 2, 3)))
## [[1]]
## Source: local data frame [12 x 8]
## 
##    idD idU          x          e      eCont   idC idR simName
## 1    1   1 -7.0869715 -1.2010541    0.00000 FALSE   1        
## 2    1   2  5.0377895 -4.9334098    0.00000 FALSE   1        
## 3    1   3 -1.2496585 -2.9116419    0.00000 FALSE   1        
## 4    1   4  6.2790116  5.9097423   21.38064  TRUE   1        
## 5    2   1 -3.0381640  0.9991914    0.00000 FALSE   1        
## 6    2   2 -0.7691143  0.5233775    0.00000 FALSE   1        
## 7    2   3 -5.3370109  1.1227648   21.56736  TRUE   1        
## 8    2   4 -2.0164550 -0.9282183 -200.65968  TRUE   1        
## 9    3   1 -4.9522436  2.4741763    0.00000 FALSE   1        
## 10   3   2  2.3734406  1.3206359  -17.28150  TRUE   1        
## 11   3   3  4.6538017 -7.0182881   58.63521  TRUE   1        
## 12   3   4 -0.3226721 -2.1961317   40.20155  TRUE   1

sim_comp

Here are some examples of how to add components for computation to a sim_setup. Computations can be attached at three points in the simulation: before sampling with sim_comp_pop, after sampling with sim_comp_sample, and after aggregation with sim_comp_agg.

base_id(2, 3) %>% sim_gen_x() %>% sim_gen_e() %>% sim_gen_ec() %>% 
  sim_resp_eq(y = 100 + x + e) %>%
  sim_comp_pop(comp_var(popMean = mean(y)), by = "idD")
## data.frame [6 x 7]
## 
##   idD idU         x         e   idC         y   popMean
## 1   1   1 -2.866610 -5.620434 FALSE  91.51296  93.78281
## 2   1   2 -2.856790 -4.529366 FALSE  92.61384  93.78281
## 3   1   3 -4.123865  1.345498 FALSE  97.22163  93.78281
## 4   2   1  3.616128 -1.409045 FALSE 102.20708 101.79017
## 5   2   2 -3.304737  4.973058 FALSE 101.66832 101.79017
## 6   2   3 -1.389154  2.884267 FALSE 101.49511 101.79017

The function comp_var is a wrapper around dplyr::mutate, so you can add simple data manipulations. The argument by is a small extension that lets you define operations within groups identified by a variable in the data. In this case the mean of the variable ‘y’ is computed for every group identified by the variable ‘idD’. This is done before sampling, hence the prefix ‘pop’ for population.

Add custom computation functions

By adding computation functions you can extend the functionality of a sim_setup to wrap up your whole simulation. This extends the utility of the package beyond merely generating data.

comp_linearPredictor <- function(dat) {
  dat$linearPredictor <- lm(y ~ x, dat) %>% predict
  dat
}

sim_base_lm() %>% 
  sim_comp_pop(comp_linearPredictor)
## data.frame [10,000 x 6]
## 
##    idD idU           x         e        y linearPredictor
## 1    1   1  6.34140993  5.661696 112.0031       106.43504
## 2    1   2 10.45746460  1.219927 111.6774       110.59657
## 3    1   3 -1.16609862 -2.339601  96.4943        98.84459
## 4    1   4 -0.06835324  1.526961 101.4586        99.95446
## 5    1   5  3.76945746  1.719263 105.4887       103.83467
## 6    1   6  0.72752929  6.816419 107.5439       100.75914
## .. ... ...         ...       ...      ...             ...

Or, should this be desirable, you can directly produce a list of lm objects or add them as an attribute to the data. The intended way of writing functions is that they return the modified data of class ‘data.frame’.

sim_base_lm() %>% 
  sim_comp_pop(function(dat) lm(y ~ x, dat)) %>%
  sim(R = 1)
## [[1]]
## 
## Call:
## lm(formula = y ~ x, data = dat)
## 
## Coefficients:
## (Intercept)            x  
##      99.974        1.026
comp_linearModelAsAttr <- function(dat) {
  attr(dat, "linearModel") <- lm(y ~ x, dat)
  dat
}

dat <- sim_base_lm() %>% 
  sim_comp_pop(comp_linearModelAsAttr) %>%
  as.data.frame

attr(dat, "linearModel")
## 
## Call:
## lm(formula = y ~ x, data = dat)
## 
## Coefficients:
## (Intercept)            x  
##      100.02         1.01

If you use any kind of sampling, the ‘linearPredictor’ can be added after sampling. This is where small area models are supposed to be applied.

sim_base_lm() %>% 
  sim_sample() %>%
  sim_comp_sample(comp_linearPredictor)
## data.frame [500 x 6]
## 
##    idD idU          x         e         y linearPredictor
## 1    1  76 -2.2947855  5.431366 103.13658        97.52755
## 2    1   8 -0.4504411 -5.447749  94.10181        99.52448
## 3    1  29  0.2154760 -3.260434  96.95504       100.24549
## 4    1  45 -2.6594728  1.178989  98.51952        97.13269
## 5    1  86 -1.8721674 -7.583018  90.54481        97.98513
## 6    2  97 -3.9807029  8.498786 104.51808        95.70215
## .. ... ...        ...       ...       ...             ...

Should you want to apply area level models, use sim_comp_agg instead.

sim_base_lm() %>% 
  sim_sample() %>%
  sim_agg() %>% 
  sim_comp_agg(comp_linearPredictor)
## data.frame [100 x 5]
## 
##    idD          x          e         y linearPredictor
## 1    1  3.3735775 -0.1181419 103.25544       103.41059
## 2    2 -3.6610769 -0.5894422  95.74948        96.20928
## 3    3 -1.1068366 -1.1722528  97.72091        98.82403
## 4    4  2.5583782 -0.7228991 101.83548       102.57607
## 5    5  2.2590130 -1.5862100 100.67280       102.26962
## 6    6  0.6274665 -3.4695357  97.15793       100.59942
## .. ...        ...        ...       ...             ...

sim_sample

After data generation you may want to draw a sample from the population. Use the function sim_sample() to add a sampling component to your sim_setup. Two predefined functions for sampling are available, sample_number and sample_fraction:

base_id(3, 4) %>% sim_gen_x() %>% sim_sample(sample_number(1L))
## data.frame [1 x 3]
## 
##   idD idU          x
## 1   3   4 -0.1312856
base_id(3, 4) %>% sim_gen_x() %>% sim_sample(sample_number(1L, groupVars = "idD"))
## data.frame [3 x 3]
## 
##   idD idU         x
## 1   1   2 -1.177060
## 2   2   4  2.645778
## 3   3   3  1.598545
# simple random sampling:
sim_base_lm() %>% sim_sample(sample_number(size = 10L))
## data.frame [10 x 5]
## 
##    idD idU          x           e         y
## 1   66  67 -0.5944115  0.03267922  99.43827
## 2    1 100 -2.9433992 -0.54127630  96.51532
## 3   73  99  6.3808657  0.79340777 107.17427
## 4   62  20 -7.1363313  2.63308637  95.49676
## 5   49  60  3.3553125  8.18254749 111.53786
## 6   69  37  2.2515884  8.01779244 110.26938
## .. ... ...        ...         ...       ...
sim_base_lm() %>% sim_sample(sample_fraction(size = 0.05))
## data.frame [500 x 5]
## 
##    idD idU          x         e         y
## 1   49  75  2.5751745  6.063451 108.63863
## 2   75  10  5.3654064 -2.038573 103.32683
## 3   79   5  3.4695434 -3.051016 100.41853
## 4   87  66 -1.9807620 -9.068360  88.95088
## 5   71   2  0.2441391  4.913220 105.15736
## 6   23  80 -2.7194351  2.817812 100.09838
## .. ... ...        ...       ...       ...
# srs in each domain/cluster
sim_base_lm() %>% sim_sample(sample_number(size = 10L, groupVars = "idD"))
## data.frame [1,000 x 5]
## 
##    idD idU          x           e        y
## 1    1  72 -3.5624987  1.90023162 98.33773
## 2    1  15 -0.8577756  0.41381975 99.55604
## 3    1  58 -5.6696066  1.78688735 96.11728
## 4    1  26 -1.5296847 -6.99181690 91.47850
## 5    1  41 -1.2100313  0.06752227 98.85749
## 6    1  74  1.9328002 -6.95022260 94.98258
## .. ... ...        ...         ...      ...
sim_base_lm() %>% sim_sample(sample_fraction(size = 0.05, groupVars = "idD"))
## data.frame [500 x 5]
## 
##    idD idU         x         e         y
## 1    1  78 -1.208951 -3.158383  95.63267
## 2    1  14 -6.380093 10.393940 104.01385
## 3    1  85 -1.678192  4.664943 102.98675
## 4    1  59  7.653317  2.664625 110.31794
## 5    1  67  8.611572 -4.419287 104.19228
## 6    2  87 -9.100455 -5.013380  85.88617
## .. ... ...       ...       ...       ...
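If neither predefined function matches your design, the same convention as for computation functions applies: any function taking and returning a data.frame can be passed to sim_sample. A minimal sketch of a hypothetical sampler drawing every second unit (the name sample_everySecond is made up for illustration):

```r
library(saeSim)

# Hypothetical custom sampler: keeps every second row. Like
# computation functions, a sampler maps a data.frame to a
# (reduced) data.frame.
sample_everySecond <- function(dat) {
  dat[seq(1, nrow(dat), by = 2), ]
}

sim_base_lm() %>% sim_sample(sample_everySecond)
```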