This document provides brief examples of how jackalope
can be used to generate sequencing data that can inform some common sampling decisions for HTS studies.
For an example reference assembly, I simply simulated one of size 10 Mb using the following code:
This resulted in the following ref_genome
object:
#> < Set of 10 sequences >
#> # Total size: 10,000,000 bp
#> name sequence length
#> seq0 TGTTCTCACGGAAAACGGGAAGTTT...TACGGTGCGGGGCTCGCTAGGCGGAT 1000000
#> seq1 AGAATACCTTGGACGACAGGCAACA...TTGGTACTACTGAACTCAAATCTCGA 1000000
#> seq2 TTTGAGGATCAGCGTTAGTGTCAAC...AGATATCGCCTAGCGTACCGCCCGGC 1000000
#> seq3 CTCTACAATCATACCGACTCATCCG...TTTTGGTTGAGCCCAATCTTTCAAGC 1000000
#> seq4 GAACGATTAGAACATGACCTTGCCG...AACTAACCCGCTTAGGCGCGATGATA 1000000
#> seq5 TGAACACTGTCTTTGTCCTTAACGC...CTTTCGGGATTTGCGAAACCAATCGA 1000000
#> seq6 GGGTGCCAACCCCCCTAATTACTAA...CCTAGCAGACGTCGTCTCTATTACAG 1000000
#> seq7 CGAATAGGGGTGCCTCTTACCTTCA...TCCTAGTCTCAGGACGGATAACCAGT 1000000
#> seq8 AGAAATATGCTTACGCATTAGCAGT...CGCCGAAAGACACCTCCCGTCAGTGA 1000000
#> seq9 GCAGTTACATTTACGCATCCTTACC...GGGGGCCTCGTGAGCTAGAGCTAGGT 1000000
For molecular-evolution information, I used the TN93 model, an insertion rate of 2e-5
for lengths from 1 to 10, and a deletion rate of 1e-5
for lengths from 1 to 40.
sub <- sub_TN93(pi_tcag = c(0.1, 0.2, 0.3, 0.4),
alpha_1 = 0.0001, alpha_2 = 0.0002,
beta = 0.00015)
ins <- indels(rate = 2e-5, max_length = 10)
del <- indels(rate = 1e-5, max_length = 40)
The example here produces FASTQ files from the known reference assembly that could test strategies for how to assemble a similar genome using HTS data.
The first strategy is to use only PacBio sequencing. The PacBio Sequel system produces up to 500,000 reads per Single Molecule, Real-Time (SMRT) cell, so you could run the following for two cells:
An alternative, hybrid strategy uses 1 SMRT cell of PacBio sequencing and 1 lane (\(\sim 500\) million reads) of \(2 \times 100\)bp Illumina sequencing on the HiSeq 2500 system (the default Illumina system in jackalope
):
pacbio(ref, out_prefix = "pacbio", n_reads = 500e3)
illumina(ref, out_prefix = "illumina", n_reads = 500e6, paired = TRUE,
read_length = 100)
The last strategy combines 1 lane of \(2 \times 100\)bp Illumina HiSeq 2500 sequencing with 1 flow cell of \(2 \times 250\)bp mate-pair sequencing on an Illumina MiSeq v3. The mate-pair sequencing uses longer fragments (defaults are mean of 400 and standard deviation of 100) to better cover highly repetitive regions.
Here, I will demonstrate how to generate population-genomic data of a type that might be used to estimate the divergence between two populations. I first use the scrm
package to conduct coalescent simulations that will generate segregating sites for 40 haploid variants from the reference genome. Twenty of the variants are from one population, twenty from another. The symmetrical migration rate is 100 individuals per generation. To generate many mutations, I set the population-scaled mutation rate (\(\theta = 4 N_0 \mu\)) to 100
for this example.
Using the previously created objects for molecular evolution information (sub
, ins
, and del
), I create variants from the reference genome:
This results in the following set of variants:
#> << Variants object >>
#> # Variants: 40
#> # Mutations: 76,851
#>
#> << Reference genome info: >>
#> < Set of 10 sequences >
#> # Total size: 10,000,000 bp
#> name sequence length
#> seq0 TGTTCTCACGGAAAACGGGAAGTTT...TACGGTGCGGGGCTCGCTAGGCGGAT 1000000
#> seq1 AGAATACCTTGGACGACAGGCAACA...TTGGTACTACTGAACTCAAATCTCGA 1000000
#> seq2 TTTGAGGATCAGCGTTAGTGTCAAC...AGATATCGCCTAGCGTACCGCCCGGC 1000000
#> seq3 CTCTACAATCATACCGACTCATCCG...TTTTGGTTGAGCCCAATCTTTCAAGC 1000000
#> seq4 GAACGATTAGAACATGACCTTGCCG...AACTAACCCGCTTAGGCGCGATGATA 1000000
#> seq5 TGAACACTGTCTTTGTCCTTAACGC...CTTTCGGGATTTGCGAAACCAATCGA 1000000
#> seq6 GGGTGCCAACCCCCCTAATTACTAA...CCTAGCAGACGTCGTCTCTATTACAG 1000000
#> seq7 CGAATAGGGGTGCCTCTTACCTTCA...TCCTAGTCTCAGGACGGATAACCAGT 1000000
#> seq8 AGAAATATGCTTACGCATTAGCAGT...CGCCGAAAGACACCTCCCGTCAGTGA 1000000
#> seq9 GCAGTTACATTTACGCATCCTTACC...GGGGGCCTCGTGAGCTAGAGCTAGGT 1000000
For a file of true divergences from the reference genome, the write_vcf
function writes the variants
object to a VCF file:
Lastly, I simulate 1 lane of \(2 \times 100\)bp Illumina HiSeq 2500 sequencing. In this case, individuals within a population are pooled, and the population sequences are derived from are identified by barcodes.
illumina(vars, out_prefix = "vars_illumina", n_reads = 500e6, paired = TRUE,
read_length = 100, barcodes = c(rep("AACCGCGG", 20),
rep("GGTTATAA", 20)))
The below example instead has each individual variant’s reads output to separate FASTQ files:
This section shows how jackalope
can generate variants from a phylogeny, then simulate sequencing data from those variants to test phylogeny reconstruction methods. First, I generated a random coalescent tree of 10 species using the ape
package.
Then I input that to the create_variants
function alongside the molecular evolution information.
This results in the following variants
object:
#> << Variants object >>
#> # Variants: 10
#> # Mutations: 36,395
#>
#> << Reference genome info: >>
#> < Set of 10 sequences >
#> # Total size: 10,000,000 bp
#> name sequence length
#> seq0 TGTTCTCACGGAAAACGGGAAGTTT...TACGGTGCGGGGCTCGCTAGGCGGAT 1000000
#> seq1 AGAATACCTTGGACGACAGGCAACA...TTGGTACTACTGAACTCAAATCTCGA 1000000
#> seq2 TTTGAGGATCAGCGTTAGTGTCAAC...AGATATCGCCTAGCGTACCGCCCGGC 1000000
#> seq3 CTCTACAATCATACCGACTCATCCG...TTTTGGTTGAGCCCAATCTTTCAAGC 1000000
#> seq4 GAACGATTAGAACATGACCTTGCCG...AACTAACCCGCTTAGGCGCGATGATA 1000000
#> seq5 TGAACACTGTCTTTGTCCTTAACGC...CTTTCGGGATTTGCGAAACCAATCGA 1000000
#> seq6 GGGTGCCAACCCCCCTAATTACTAA...CCTAGCAGACGTCGTCTCTATTACAG 1000000
#> seq7 CGAATAGGGGTGCCTCTTACCTTCA...TCCTAGTCTCAGGACGGATAACCAGT 1000000
#> seq8 AGAAATATGCTTACGCATTAGCAGT...CGCCGAAAGACACCTCCCGTCAGTGA 1000000
#> seq9 GCAGTTACATTTACGCATCCTTACC...GGGGGCCTCGTGAGCTAGAGCTAGGT 1000000
Now I can generate data for 1 flow cell of \(2 \times 250\)bp sequencing on an Illumina MiSeq v3.
Similar to the section above, the ultimate goal here is to test phylogeny reconstruction methods. The difference in this section is that instead of using a single, straightforward phylogeny, I use multiple gene trees per sequence. In the species used in these simulations, species 3 diverged from 1 and 2 at \(t = 0.5\), where \(t\) indicates time into the past and is in units of \(4 N_0\) generations. Species 1 and 2 diverged at \(t = 0.25\). I assume a recombination rate of \(1 / (4 N_0)\) recombination events per sequence per generation. Because each sequence is a different length and the length is required for including a recombination rate, I had to run scrm
separately for each sequence. There are 4 diploid individuals sampled per species.
# Run scrm for one sequence size:
one_seq <- function(size) {
sims <- scrm(
paste("24 1",
# Output gene trees:
"-T",
# Recombination:
"-r 1", size,
# 3 species with no ongoing migration:
"-I 3", paste(rep("8", 3), collapse = " "), "0",
# Species 2 derived from 1 at time 1.0:
"-ej 0.5 2 1",
# Species 3 derived from 2 at time 0.5:
"-ej 0.25 3 2"
))
return(sims$trees[[1]])
}
# For all sequences:
gtrees <- list(trees = lapply(ref$sizes(), one_seq))
This results in the following variants
object:
#> << Variants object >>
#> # Variants: 24
#> # Mutations: 38,130
#>
#> << Reference genome info: >>
#> < Set of 10 sequences >
#> # Total size: 10,000,000 bp
#> name sequence length
#> seq0 TGTTCTCACGGAAAACGGGAAGTTT...TACGGTGCGGGGCTCGCTAGGCGGAT 1000000
#> seq1 AGAATACCTTGGACGACAGGCAACA...TTGGTACTACTGAACTCAAATCTCGA 1000000
#> seq2 TTTGAGGATCAGCGTTAGTGTCAAC...AGATATCGCCTAGCGTACCGCCCGGC 1000000
#> seq3 CTCTACAATCATACCGACTCATCCG...TTTTGGTTGAGCCCAATCTTTCAAGC 1000000
#> seq4 GAACGATTAGAACATGACCTTGCCG...AACTAACCCGCTTAGGCGCGATGATA 1000000
#> seq5 TGAACACTGTCTTTGTCCTTAACGC...CTTTCGGGATTTGCGAAACCAATCGA 1000000
#> seq6 GGGTGCCAACCCCCCTAATTACTAA...CCTAGCAGACGTCGTCTCTATTACAG 1000000
#> seq7 CGAATAGGGGTGCCTCTTACCTTCA...TCCTAGTCTCAGGACGGATAACCAGT 1000000
#> seq8 AGAAATATGCTTACGCATTAGCAGT...CGCCGAAAGACACCTCCCGTCAGTGA 1000000
#> seq9 GCAGTTACATTTACGCATCCTTACC...GGGGGCCTCGTGAGCTAGAGCTAGGT 1000000
To store mutation information by diploid sample, the write_vcf
function writes the variants
object to a VCF file. It assigns haplotypes to diploid samples using a matrix for the sample_matrix
argument:
write_vcf(vars, out_prefix = "var_gtrees",
sample_matrix = matrix(1:vars$n_vars(), ncol = 2, byrow = TRUE))
Now I can generate data for 1 flow cell of \(2 \times 250\)bp sequencing on an Illumina MiSeq v3.