Getting started

Maarten Kruijver

2022-05-04

Simulation studies are often used in forensic genetics to study the behaviour of probabilistic genotyping software. In many cases, researchers use simplified models that ignore complexities such as peak height variability and stutters. The goal of this package is to make it straightforward to use probabilistic genotyping models in simulation studies.

For a general introduction to forensic genetics and probabilistic genotyping, see (Buckleton, Bright, and Taylor 2018). See also (Gill et al. 2021) for a recent review of probabilistic genotyping systems.

Basic log normal example

Below, we load allele frequencies provided with the package and demonstrate how mixtures can be sampled using a log normal model for peak heights (refer to (Taylor, Bright, and Buckleton 2013) for details on the model). The template parameters for each contributor are picked uniformly between 50 and 10,000. The degradation parameter is picked from a gamma distribution with shape 2.5 and scale 0.001.

freqs <- read_allele_freqs(system.file("extdata","FBI_extended_Cauc.csv",
                                       package = "simDNAmixtures"))
data(gf)

sampling_parameters <- list(min_template = 50., max_template = 10000.,
                            degradation_shape = 2.5, degradation_scale = 1e-3)

After the setup, a single function call is sufficient to generate a set of samples. We set n=2 to generate two samples. The contributors to the sample are named U1 and U2 which means that each mixed sample has two unrelated contributors.

set.seed(1)
mixtures <- sample_mixtures(n = 2, contributors = c("U1", "U2"), freqs = freqs,
                            sampling_parameters = sampling_parameters,
                            model_settings = gf$log_normal_bwfw_settings,
                            sample_model = sample_log_normal_model)

The mixtures object contains the simulation results. The sample_mixtures function has an optional results_directory argument. If the path to a directory is provided, then all simulation results will be persisted on disk in file formats that are readily imported in probabilistic genotyping software. We inspect the parameter_summary property of the result of sample_mixtures.

knitr::kable(mixtures$parameter_summary[1:5])
SampleName template1 template2 degradation1 degradation2
sample_0001_N2_9584_4870_bc 9584.164 4869.979 0.0032058 0.004264
sample_0002_N2_6366_5585_ba 6366.157 5584.630 0.0031578 0.001269

The first sample is in approximately a 2:1 ratio. We print the first 10 peaks of the profile below.

knitr::kable(head(mixtures$samples[[1]]$mixture, 10))
Locus Allele Height Size
D3S1358 13 216 113.25
D3S1358 14 4760 117.33
D3S1358 15 18947 121.40
D3S1358 16 107 125.48
vWA 14 218 168.84
vWA 15 4246 172.87
vWA 16 3629 176.91
vWA 17 6545 180.95
vWA 18 6905 184.99
vWA 19 97 189.02

Basic gamma example

We repeat the example above with a gamma distribution for peak heights instead of a log normal one. The mean peak height parameter mu is chosen uniformly between 50 and 5,000. The variance parameter cv parameter (\(\sigma\) in (Bleka, Storvik, and Gill 2016)) is chosen uniformly between 0.05 and 0.35. The degradation parameter is sampled from a Beta distribution with parameters 10 and 1, which means most profile will be only mildly degraded.

gamma_sampling_parameters <- list(min_mu = 50., max_mu = 5e3,
                            min_cv = 0.05, max_cv = 0.35,
                            degradation_shape1 = 10, degradation_shape2 = 1)

We now invoke the sample_mixtures function again to sample two mixtures; both consisting of two unrelated contributors. We do not sample stutters.

set.seed(2)
mixtures <- sample_mixtures(n = 2, contributors = c("U1", "U2"), freqs = freqs,
                            sampling_parameters = gamma_sampling_parameters,
                            model_settings = gf$gamma_settings_no_stutter,
                            sample_model = sample_gamma_model)

We see that the first sample has \(\mu\) around 1,236 and is in a 65:35 ratio.

knitr::kable(mixtures$parameter_summary[1:4]) 
SampleName mixture_proportions1 mixture_proportions2 mu
sample_0001_N2_1236_65_35_aa 0.6512712 0.3487288 1235.641
sample_0002_N2_2906_82_18_aa 0.8158866 0.1841134 2905.758

The coefficient of variation for a full heterozygote is about 9% for the first sample.

knitr::kable(mixtures$parameter_summary[c(1,5:7)])
SampleName cv degradation_beta1 degradation_beta2
sample_0001_N2_1236_65_35_aa 0.0913336 0.8311816 0.8311816
sample_0002_N2_2906_82_18_aa 0.3151883 0.8186755 0.8186755

We print the first 10 peaks of the profile below.

knitr::kable(head(mixtures$samples[[1]]$mixture, 10))
Locus Allele Height Size
D3S1358 15 1394 121.40
D3S1358 16 358 125.48
D3S1358 18 535 133.64
vWA 14 928 168.84
vWA 16 1077 176.91
vWA 17 334 180.95
D16S539 11 1607 251.67
D16S539 12 341 255.70
CSF1PO 10 920 298.34
CSF1PO 12 456 306.26

References

Bleka, Øyvind, Geir Storvik, and Peter Gill. 2016. “EuroForMix: An Open Source Software Based on a Continuous Model to Evaluate STR DNA Profiles from a Mixture of Contributors with Artefacts.” Forensic Science International: Genetics 21: 35–44. https://doi.org/10.1016/j.fsigen.2015.11.008.
Buckleton, John S, Jo-Anne Bright, and Duncan Taylor. 2018. Forensic DNA Evidence Interpretation. CRC press.
Gill, Peter, Corina Benschop, John Buckleton, Øyvind Bleka, and Duncan Taylor. 2021. “A Review of Probabilistic Genotyping Systems: EuroForMix, DNAStatistX and STRmix™.” Genes 12 (10): 1559. https://doi.org/10.3390/genes12101559.
Taylor, Duncan, Jo-Anne Bright, and John Buckleton. 2013. “The Interpretation of Single Source and Mixed DNA Profiles.” Forensic Science International: Genetics 7 (5): 516–28. https://doi.org/10.1016/j.fsigen.2013.05.011.