The package `superb` includes the function `GRD()`. This function is used to easily generate random data
sets. With a few options, it is possible to obtain data from any design,
with any effects. This function, first created for SPSS (Harding & Cousineau, 2015), was exported to R
(Calderini & Harding, 2019). A brief
report shows one possible use in the classroom for teaching statistics to
undergraduates (Cousineau, 2020).

This vignette illustrates some of its uses.

The simplest use relies on the default values:
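As a minimal sketch, a call with no arguments yields a data frame like the one printed below. The `set.seed()` line is an addition made here for reproducibility and is not required by `GRD()`; the exact values will differ from those shown.

```r
library(superb)   # provides GRD()

set.seed(42)      # assumption: added only so that the sketch is reproducible
dta <- GRD()      # all defaults: a single group of scores from a standard normal
head(dta)         # first six rows, with the columns id and DV
```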

```
## id DV
## 1 1 0.1371554
## 2 2 1.6912893
## 3 3 1.3174311
## 4 4 -0.6777181
## 5 5 0.8912996
## 6 6 1.6578969
```

By default, one hundred scores are generated from a normal
distribution with mean 0 and standard deviation 1. In other words, it
generates 100 z scores. The dependent variable, the last column of the
generated data frame, is called `DV` by default.
The first column is an “id” column containing a number identifying each
*simulated* participant. To change the dependent variable’s name,
use the argument `RenameDV`.
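A short sketch of renaming the dependent variable, assuming the `RenameDV` argument of `GRD()`; the name `"score"` is an arbitrary choice made for illustration.

```r
library(superb)

# Same default design, but the last column is named "score" instead of "DV"
dta <- GRD(RenameDV = "score")
names(dta)
```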

To add various groups to the dataset, use the argument `BSFactors`, as in `BSFactors = "Group(3)"`.
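As a minimal sketch of a between-subject design, the call below requests one factor named `Group` with three levels and then counts the simulated participants per level. The use of `table()` is only a convenient check added here.

```r
library(superb)

# One between-subject factor, "Group", with 3 levels
dta <- GRD(BSFactors = "Group(3)")

table(dta$Group)   # number of simulated participants in each group
nrow(dta)          # total number of rows
```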

There will be 100 random z scores in each of three groups, for a
total of 300 data points. The group number is given in an additional
column, here called `Group`. A factorial design can be
generated with more than one factor, such as `BSFactors = c('Surgery(2)', 'Therapy(3)')`,
which results in 2 \(\times\) 3, that is, 6 different groups, crossing all the levels of Surgery (1 and 2) and all the levels of Therapy (1, 2 and 3). The levels can receive names rather than numbers, as in

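```
library(superb)  # provides GRD() and superbPlot()

dta <- GRD(
    BSFactors = c('Surgery(yes, no)', 'Therapy(CBT,Control,Exercise)')
)
unique(dta$Surgery)   # level names of the first factor
unique(dta$Therapy)   # level names of the second factor
```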

`## [1] "yes" "no"`

`## [1] "CBT" "Control" "Exercise"`

Finally, within-subject factors can also be given, as in

```
dta <- GRD(
    BSFactors = c('Surgery(yes,no)', 'Therapy(CBT, Control,Exercise)'),
    WSFactors = 'Contrast(C1,C2,C3)'
)
```

For within-subject designs, the repeated measures will appear in
distinct columns (here “DV.C1”, “DV.C2”, and “DV.C3”). This format is
called **wide** format, meaning that the repeated measures
are all on the same line for a given *simulated* participant.
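To make the wide layout concrete, the sketch below inspects the column names and then converts the data to long format with base R's `reshape()`. The conversion is not part of `GRD()`; it is one possible approach, and it assumes that the `id` column identifies each simulated participant uniquely. The group size of 10 is chosen only to keep the example small.

```r
library(superb)

dta <- GRD(
    BSFactors = c("Surgery(yes,no)", "Therapy(CBT,Control,Exercise)"),
    WSFactors = "Contrast(C1,C2,C3)",
    SubjectsPerGroup = 10
)
names(dta)   # wide format: one column per repeated measure

# Wide to long with base R: one row per participant and measurement
long <- reshape(
    dta,
    direction = "long",
    varying   = c("DV.C1", "DV.C2", "DV.C3"),
    v.names   = "DV",
    timevar   = "Contrast",
    times     = c("C1", "C2", "C3"),
    idvar     = "id"
)
nrow(long)   # three times the number of rows of the wide data
```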

The default is to generate 100 participants in each between-subject
group. This default can be changed with `SubjectsPerGroup`.
The most straightforward specification is, e.g.,
`SubjectsPerGroup = 25` for 25 participants in each group.
Unequal group sizes can be specified by giving one size per group, e.g., `SubjectsPerGroup = c(2, 5, 1)`:
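A complete call consistent with the listing that follows could look like this sketch. The factor name `Therapy` and the sizes 2, 5, and 1 are read off that listing; the scores themselves are random and will differ.

```r
library(superb)

# Three therapy groups with 2, 5, and 1 simulated participants respectively
dta <- GRD(
    BSFactors = "Therapy(3)",
    SubjectsPerGroup = c(2, 5, 1)
)
dta
```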

```
## id Therapy DV
## 1 1 1 0.07961340
## 2 2 1 0.48631236
## 3 3 2 -0.06054161
## 4 4 2 -0.74075781
## 5 5 2 0.70536687
## 6 6 2 1.28486467
## 7 7 2 -0.31966069
## 8 8 3 -0.86372677
```

To sample random data, it is necessary to specify a theoretical
population distribution. The default is a normal distribution
(the famous “bell-shaped” curve). That population has a grand mean
(`GM`, \(\mu\)) given by the
element `mean` and a standard deviation (\(\sigma\)) given by the element
`stddev`. These can be redefined using the argument
`Population` with a list of the relevant elements. For example,
IQ scores can be simulated with `Population = list(mean = 100, stddev = 15)`
(increase the number of participants using
`SubjectsPerGroup` to, say, 10,000, and the bell-shaped curve
will be evident!).
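As a minimal sketch of the IQ example, the call below redefines the population mean and standard deviation and draws a histogram. Using `hist()` for the display is an assumption made here; any plotting function would do.

```r
library(superb)

# Simulated IQ scores: normal population with mean 100 and sd 15
dta <- GRD(
    Population = list(mean = 100, stddev = 15)
)
hist(dta$DV)

mean(dta$DV)   # should be in the vicinity of 100
sd(dta$DV)     # should be in the vicinity of 15
```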

Internally, `GRD()` uses `rnorm` to generate these scores, passing the grand mean
(internally called `GM`) as the mean parameter and the provided
standard deviation (internally called `STDDEV`) as the
standard deviation parameter. This can be stated explicitly using the
element `scores`, as in:

```
dta <- GRD(
    BSFactors = "Group(2)",
    Population = list(
        mean   = 100,  # this sets GM to 100
        stddev = 15,   # this sets STDDEV to 15
        scores = "rnorm(1, mean = GM, sd = STDDEV )"
    )
)
```

Using `scores`, it is possible to alter the parameters,
for example, to have a mean proportional to the group number, or a
standard deviation proportional to the group number, as in:

```
dta <- GRD(
    BSFactors = "Group(2)",
    Population = list(
        mean   = 100,  # this sets GM to 100
        stddev = 15,   # this sets STDDEV to 15
        scores = "rnorm(1, mean = GM, sd = Group * STDDEV )"  # sd grows with the group number
    )
)
superbPlot(dta,
    BSFactors = "Group",
    variables = "DV",
    plotStyle = "pointjitterviolin" )
```

Any valid R instruction can be placed in the `scores` element, such as
`scores = "rnorm(1, mean = GM, sd = ifelse(Group==1,10,50) )"`
to select the standard deviation according to `Group`, or
`scores = "1"` to generate constants. Other theoretical
distributions can also be chosen by calling the corresponding random generator in `scores`.
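As a sketch of a non-normal population, the call below draws scores from a shifted Weibull distribution using base R's `rweibull()`. The shape, scale, and shift values are arbitrary choices made for illustration.

```r
library(superb)

# Scores from a Weibull distribution, shifted upward by 200
dta <- GRD(
    SubjectsPerGroup = 5000,
    Population = list(
        scores = "rweibull(1, shape = 2, scale = 40) + 200"
    )
)
hist(dta$DV)   # right-skewed, with no score below 200
```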

It is possible to generate non-null effects on the factors using the
argument `Effects`. Effects can be `slope(x)` (an
increase of `x` points for each level of the factor),
`extent(x)` (a total increase of `x` over all the
levels), or `custom(x, y, etc)` for an effect of `x`
points for the first level of the factor, `y` points for the
second, etc.

Here is a slope effect:

```
dta <- GRD(
BSFactors = 'Therapy(CBT, Control, Exercise)',
WSFactors = 'Contrast(3)',
SubjectsPerGroup = 1000,
Effects = list('Contrast' = slope(2))
)
superbPlot(dta,
BSFactors = "Therapy", WSFactors = "Contrast(3)",
variables = c("DV.1","DV.2","DV.3"),
plotStyle = "line" )
```
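The other effect types are specified in the same way. The sketch below uses `custom()` on a between-subject factor and compares the observed group means; the effect sizes (0, 0, and 2 points) and the group size are arbitrary, and `aggregate()` is simply one way of summarizing the result.

```r
library(superb)

set.seed(1)   # assumption: added only so that the sketch is reproducible
dta <- GRD(
    BSFactors = "Therapy(CBT,Control,Exercise)",
    SubjectsPerGroup = 500,
    Effects = list("Therapy" = custom(0, 0, 2))   # only the third level is shifted
)

# Observed mean of DV in each therapy group
aggregate(DV ~ Therapy, data = dta, FUN = mean)
```

With a standard deviation of 1 and 500 participants per group, the Exercise mean should stand about 2 points above the other two.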

Effects can also be any R code manipulating the factors, using
`Rexpression`. One example:

```
dta <- GRD(
BSFactors = 'Therapy(CBT,Control,Exercise)',
WSFactors = 'Contrast(3) ',
SubjectsPerGroup = 1000,
Effects = list(
"code1"=Rexpression("if (Therapy =='CBT'){-1} else {0}"),
"code2"=Rexpression("if (Contrast ==3) {+1} else {0}")
)
)
superbPlot(dta,
BSFactors = "Therapy", WSFactors = "Contrast(3)",
variables = c("DV.1","DV.2","DV.3"),
plotStyle = "line" )
```

Repeated measures can also be generated from a multivariate normal
distribution with a correlation `rho`, with, e.g.,

```
dta <- GRD(
WSFactors = 'Difficulty(1, 2)',
SubjectsPerGroup = 1000,
Population=list(mean = 0,stddev = 20, rho = 0.5)
)
plot(dta$DV.1, dta$DV.2)
```

In the case of a multivariate normal distribution, the parameters for the mean and the standard deviations can be vectors of length equal to the number of repeated measures. However, covariances are constants.
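As a quick check of the correlation parameter, the sketch below regenerates two correlated measures and computes their sample correlation with base R's `cor()`. The seed is an addition made here for reproducibility.

```r
library(superb)

set.seed(123)   # assumption: added only so that the sketch is reproducible
dta <- GRD(
    WSFactors = "Difficulty(1,2)",
    SubjectsPerGroup = 1000,
    Population = list(mean = 0, stddev = 20, rho = 0.5)
)

cor(dta$DV.1, dta$DV.2)   # sample correlation, expected to be close to rho
sd(dta$DV.1)              # sample standard deviation, expected to be close to 20
```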

Contaminants can be inserted in the simulated data using
`Contaminant`. This argument works exactly like
`Population` except for the additional option
`proportion`, which indicates the proportion of contaminants
in the samples:

```
dta <- GRD(SubjectsPerGroup = 5000,
Population= list( mean=100, stddev = 15 ),
Contaminant=list( mean=200, stddev = 15, proportion = 0.10 )
)
hist(dta$DV,breaks=seq(-25,300,by=2.5))
```
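Because the two distributions in this example barely overlap, the share of contaminants can be estimated by counting the scores above a cutoff. In the sketch below, the cutoff of 150 (halfway between the two means) is an assumption chosen for illustration.

```r
library(superb)

set.seed(7)   # assumption: added only so that the sketch is reproducible
dta <- GRD(
    SubjectsPerGroup = 5000,
    Population  = list(mean = 100, stddev = 15),
    Contaminant = list(mean = 200, stddev = 15, proportion = 0.10)
)

# Proportion of scores above the cutoff, expected to be near 0.10
mean(dta$DV > 150)
```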

Contaminants can be normally distributed (as above) or come from any theoretical distribution which can be simulated in R:

```
dta <- GRD(SubjectsPerGroup = 10000,
Population=list( mean=100, stddev = 15 ),
Contaminant=list( proportion = 0.10,
scores="rweibull(1,shape=1.5, scale=30)+1.5*GM")
)
hist(dta$DV,breaks=seq(0,365,by=2.5))
```

Finally, contaminants can be used to add missing data (missing completely at random) with:
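One way to do this, sketched here under the assumption that the contaminant `scores` element may simply return `NA`, is shown below. The factor names `grp` and `M`, the group size, and the proportion of 0.2 are hypothetical choices made for illustration.

```r
library(superb)

set.seed(99)   # assumption: added only so that the sketch is reproducible
dta <- GRD(
    BSFactors = "grp(2)",
    WSFactors = "M(3)",
    SubjectsPerGroup = 50,
    Population  = list(mean = 100, stddev = 15),
    Contaminant = list(scores = "NA", proportion = 0.2)   # contaminants are missing values
)

# Proportion of missing values in each repeated-measure column
dvcols <- grep("^DV", names(dta), value = TRUE)
colMeans(is.na(dta[, dvcols]))
```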

`GRD()` is a convenient function to generate almost any
sort of data set with any form of effects. The data can simulate any
factorial design involving between-subject designs, repeated-measure
designs, and multivariate data.

One use is of course in the classroom: students can test their skills by generating random data sets and running statistical procedures. To illustrate type-I errors, it then becomes easy to generate data with no effect whatsoever and ask the students who obtain a rejection decision to raise their hands.
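The classroom demonstration of type-I errors can be sketched as a small simulation in which each replication plays the role of one student. The number of replications, the group size, and the choice of a *t* test are assumptions made for illustration.

```r
library(superb)

set.seed(2024)   # assumption: added only so that the sketch is reproducible
nsim <- 200      # number of simulated "students"

# Each student generates two groups with no true difference and runs a t test
pvals <- replicate(nsim, {
    dta <- GRD(BSFactors = "Group(2)", SubjectsPerGroup = 20)
    t.test(DV ~ Group, data = dta)$p.value
})

# Proportion of students who would reject the null hypothesis at the .05 level
mean(pvals < 0.05)
```

Since there is no effect in the population, this proportion should hover around .05, the nominal type-I error rate.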

Calderini, M., & Harding, B. (2019). GRD for
R: An intuitive tool for generating random data in
R. *The Quantitative Methods for Psychology*,
*15*(1), 1–11. https://doi.org/10.20982/tqmp.15.1.p001

Cousineau, D. (2020). In-class activity comparing standard errors as a
function of sample size with SPSS. *The Quantitative Methods for
Psychology*, *16*(2), v4–v7. https://doi.org/10.20982/tqmp.16.2.v009

Harding, B., & Cousineau, D. (2014). GRD: An SPSS
extension command for generating random data. *The Quantitative
Methods for Psychology*, *10*(2), 80–94. https://doi.org/10.20982/tqmp.10.2.p080

Harding, B., & Cousineau, D. (2015). An extended SPSS
extension command for generating random data. *The Quantitative
Methods for Psychology*, *11*(3), 127–138. https://doi.org/10.20982/tqmp.11.3.p127