Data generation combines two aspects: first you specify how data is added to the model, then you specify what data is added. The following functions control how data is added:
sim_gen_x(mean, sd, name = "x")
sim_gen_e(mean, sd, name = "e")
sim_gen_v(mean, sd, name = "v")
What data is added is specified with one of the following generators:
gen_norm(mean = 0, sd = 1, name) - normally distributed variable
gen_v_norm(mean = 0, sd = 1, name) - normally distributed variable but constant within domain
gen_v_sar(mean = 0, sd = 1, rho = 0.5, type = "rook", name) - normally distributed variable following a SAR(1), constant within domain
This means we can specify a linear mixed model set-up with the regressor x, the model error e, a random effect v and a spatially correlated random effect vSp as follows:
library(saeSim)
setup <- sim_base() %>% sim_gen_x() %>% sim_gen_e() %>% sim_gen_v() %>%
sim_gen(gen_v_sar(name = "vSp")) %>% sim_resp_eq(y = 100 + x + v + vSp + e)
setup
## idD idU x e v vSp y
## 1 1 1 -4.13032 -5.765 1.106 -0.8515 90.36
## 2 1 2 -0.03954 -5.174 1.106 -0.8515 95.04
## 3 1 3 -6.52239 3.600 1.106 -0.8515 97.33
## 4 1 4 -0.76599 2.028 1.106 -0.8515 101.52
## 5 1 5 -1.82341 1.158 1.106 -0.8515 99.59
## 6 1 6 5.55645 10.608 1.106 -0.8515 116.42
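Written out, the response equation passed to sim_resp_eq corresponds to the model below. The indices i (domain, idD) and j (unit within domain, idU) are notation introduced here for illustration; they reflect that v and vSp take the same value for every unit of a domain in the printed output, while x and e vary by unit.

```latex
\[
  y_{ij} = 100 + x_{ij} + v_i + v^{Sp}_i + e_{ij}
\]
```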
To get the simulated data as a list:
dataList <- sim(setup, R = 500)
When you are interested in contamination, it is important to know that the contamination is added to the values already in the data. This changes how data is added to the model; the data generators stay the same. If you want a contaminated spatially correlated error component, you can add the following to the setup object from above:
contSetup <- setup %>%
sim_gen_cont(gen_v_sar(sd = 40, name = "vSp"), nCont = 0.05, type = "area", areaVar = "idD", fixed = TRUE)
Note that the generator is the same but with a higher standard deviation. The argument nCont controls how many observations are contaminated. Values < 1 are interpreted as a probability. A single number is interpreted as the number of contaminated units (these can be areas, observations in each area, or observations). A vector with length(nCont) > 1 is interpreted as the number of contaminated observations in each area. Use the following example to see how these things play together:
sim(base_id(3, 4) %>% sim_gen_x() %>% sim_gen_e() %>%
sim_gen_ec(mean = 0, sd = 150, name = "eCont", nCont = c(1, 2, 3)))
## [[1]]
## Source: local data frame [12 x 8]
##
## idD idU x e eCont idC idR simName
## 1 1 1 3.0426745 -8.895868 0.000 FALSE 1
## 2 1 2 -2.1870423 1.312302 0.000 FALSE 1
## 3 1 3 -6.6599026 1.871466 0.000 FALSE 1
## 4 1 4 -9.4016207 1.041706 5.916 TRUE 1
## 5 2 1 -2.2121344 -3.195060 0.000 FALSE 1
## 6 2 2 6.5940150 3.988802 0.000 FALSE 1
## 7 2 3 -0.0279630 6.005806 -191.437 TRUE 1
## 8 2 4 1.5433492 -0.006177 13.658 TRUE 1
## 9 3 1 10.1495513 -1.645273 0.000 FALSE 1
## 10 3 2 -6.5109211 0.205836 310.542 TRUE 1
## 11 3 3 -1.3331190 -4.534455 292.471 TRUE 1
## 12 3 4 -0.0001271 -8.006533 -265.403 TRUE 1
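To make the additive nature of the contamination concrete, the following base R sketch rebuilds a data set of the same shape by hand. It is not how saeSim works internally: the standard deviations of x and e, the choice to contaminate the last observations of each area (as seen in the output above), and the response equation are assumptions made for illustration.

```r
# Base R sketch, not saeSim internals: additive contamination by area.
set.seed(1)

nDomains <- 3          # number of areas (idD)
nUnits <- 4            # observations per area (idU)
nCont <- c(1, 2, 3)    # contaminated observations in areas 1, 2 and 3

dat <- data.frame(
  idD = rep(seq_len(nDomains), each = nUnits),
  idU = rep(seq_len(nUnits), times = nDomains)
)

# Assumed standard deviations for the regressor and the regular error.
dat$x <- rnorm(nrow(dat), mean = 0, sd = 4)
dat$e <- rnorm(nrow(dat), mean = 0, sd = 4)

# Flag the last nCont[d] observations of area d as contaminated (assumption).
dat$idC <- dat$idU > nUnits - nCont[dat$idD]

# Zero for clean observations, an outlying draw for contaminated ones.
dat$eCont <- ifelse(dat$idC, rnorm(nrow(dat), mean = 0, sd = 150), 0)

# The contamination enters additively, next to the regular error term.
dat$y <- 100 + dat$x + dat$e + dat$eCont

# Number of clean (FALSE) and contaminated (TRUE) observations per area.
table(dat$idD, dat$idC)
```

The regular error e is left untouched for contaminated rows; the outlying component is simply added on top, which matches the pattern of the e and eCont columns in the output above.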