The gphmm package implements the generalized pair hidden Markov model (GPHMM).
# 1. Command line tool
../inst/gphmm --help
##
##
## ----- ----- ----- -----
##
## gphmm version 0.99.0
##
## Biostrings version 2.45.4
##
## jsonlite version 1.5
##
## docopt version 0.4.5
##
## ----- ----- ----- -----
##
##
## Compute or train a GPHMM model.
##
## Usage:
## gphmm compute <fasta> <csv> [options] [--param=<param>] [--out=<out>]
## gphmm train <fasta> <csv> [options] [--llthreshold=<l>] [--maxit=<i>]
##
## Options:
## <fasta> path to fasta file with DNA sequences.
## <csv> path to csv file with columns (1) query, (2) ref. seq., (optional 3) query QV. See vignette.
## --seed=<seed> integer, seed.
## --ncores=<n> integer, number of cores [Default: 0].
## --verbose if TRUE, print things.
## --param=<param> path to json file with GPHMM parameters [Default: ].
## --out=<out> path to the output file [Default: ].
## --llthreshold=<l> algo stops when diff. of log ll between 2 iterations is below the thr [Default: 0.00001].
## --maxit=<i> maximum number of iterations [Default: 10].
There are two ways to use the command line tool: you can either compute GPHMM probabilities for query/reference pairs using a given set of parameters, or train the model to estimate those parameters from noisy reads and the noise-free versions of the sequences.
The main inputs of the command line tool are a fasta file containing the DNA sequences and a csv file listing the query/reference pairs (and, optionally, the query quality values).
Each query and reference name in the csv file must match the name of a sequence in the fasta file. For each row of the csv file, the GPHMM probability is then computed for the corresponding query/reference pair.
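For instance, a minimal pair of input files could be prepared as follows (a sketch using Biostrings and base R; the file names and sequence names are placeholders):
library(Biostrings)
# Two toy sequences written to a fasta file; the names ('q1', 'r1')
# are the identifiers the csv file must refer to.
toy = DNAStringSet(c(q1 = 'ATGCGATGCA', r1 = 'ATGTACGATGA'))
writeXStringSet(toy, 'toy.fasta')
# One row per pair to score: query name, reference name and,
# optionally, the query quality value (Phred scale).
pairs = data.frame(query = 'q1', ref = 'r1', qv = 20)
write.table(pairs, 'toy.csv')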
# 2. Compute GPHMM probabilities
To compute GPHMM probabilities, we need a set of parameters for the model. These parameters should be estimated from a training set in which the queries were sequenced from known reference sequences (see next section). For the moment, let's use the function initializeGphmm to get the set of parameters that is also used as the starting point of the training phase
paramgphmm = initializeGphmm()
paramgphmm
## $qR
## [1] 0.25 0.25 0.25 0.25
##
## $qX
## [1] 0.25 0.25 0.25 0.25
##
## $qY
## [1] 0.25 0.25 0.25 0.25
##
## $deltaX
## [1] -0.2 -0.2
##
## $deltaY
## [1] -0.2 -0.2
##
## $epsX
## [1] 0.1
##
## $epsY
## [1] 0.1
##
## $tau
## [1] 0.0006666667
##
## $eta
## [1] 0.0006666667
##
## $pp
## [,1] [,2] [,3] [,4]
## [1,] 0.91 0.03 0.03 0.03
## [2,] 0.03 0.91 0.03 0.03
## [3,] 0.03 0.03 0.91 0.03
## [4,] 0.03 0.03 0.03 0.91
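These defaults can also be modified and written to a json file, which can then be passed to gphmm compute through the --param option. A sketch using jsonlite; the exact json layout expected by --param is assumed here to mirror the list returned by initializeGphmm:
library(jsonlite)
# Start from the default parameters and, for example, modify epsX and
# epsY (assumed here to control insertion/deletion extension).
customparam = initializeGphmm()
customparam$epsX = 0.05
customparam$epsY = 0.05
# Assumption: the file read back via --param has the same layout as this list.
write(toJSON(customparam, auto_unbox = TRUE, digits = 10), 'customparam.json')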
Let's compute the GPHMM probability (log scale) for the following two sequences
Query
read = 'ATGCGATGCA'
read
## [1] "ATGCGATGCA"
Reference sequence
ref = 'ATGTACGATGA'
ref
## [1] "ATGTACGATGA"
GPHMM probability
computegphmm(read = read,
ref = ref,
parameters = paramgphmm,
output = "short")
## [1] -23.35523
You can also look at a more detailed output where the path (i.e., the sequence of hidden states) is specified. Hidden states can take the following values: M means Match or Mismatch state, I means Insertion state, and D means Deletion state.
computegphmm(read = read,
ref = ref,
qv = 20,
parameters = paramgphmm,
output = "long")
## $V
## [1] -23.35523
##
## $path
## [1] "MMMDDMMMMMIM"
##
## $read
## [1] "ATGCGATGCA"
##
## $ref
## [1] "ATGTACGATGA"
n = 100
Let's randomly generate 100 sequences
seqs = generateRandomSequences(n = n, meanLen = 100, sdLen = 2,
seed = 7373)
writeXStringSet(seqs, 'queries.fasta')
seqs
## A DNAStringSet instance of length 100
## width seq names
## [1] 98 ACGTAGTTAGGGGGTGGCGA...AACAAAGTGTAATCCGCCA s1
## [2] 97 TCTGCCTTTTCAATGCGTGG...TCGAGCGGCTTGGCGTTGT s2
## [3] 100 AAGAAAATGGACTGACAAGC...AAATCTAGTTCGGTCTGCA s3
## [4] 103 TACAAATTGCTCCAACATGC...GGTAAATAAGGTTGTTTAT s4
## [5] 98 ACTACCACCGCGTACGCAAA...GGGGACAGCAAAATGAATC s5
## ... ... ...
## [96] 99 CAGGGGAGCACCGCCACGTC...CAGGAGGTTTCCTGGGCGA s96
## [97] 100 ACGCACTTCTCATGCTAGGT...CCGATTACATATCGACTGG s97
## [98] 99 CGACTCCGTGTACCATGACA...GAGGTTAGCTAGCGCCCGC s98
## [99] 101 GCATGTGCACTTCGAGAGGA...ACACTTCACCCCCGGCTTT s99
## [100] 100 CTGTTTGCTGGAGTCTGGGC...CTAACACTATTAAACGATA s100
We are going to compute the GPHMM probability for all pairs of sequences. Since the csv file does not include a column with quality values, a default quality value of 20 is used.
toCompute = data.frame(query = rep(paste0('s', 1:n), n),
ref = rep(paste0('s', 1:n), each = n))
write.table(toCompute, 'toCompute.csv')
head(toCompute)
## query ref
## 1 s1 s1
## 2 s2 s1
## 3 s3 s1
## 4 s4 s1
## 5 s5 s1
## 6 s6 s1
So, there are 10000 GPHMM probabilities to compute. Let's call the command line tool with gphmm compute
../inst/gphmm compute queries.fasta toCompute.csv --verbose --ncores=1
##
##
## ----- ----- ----- -----
##
## gphmm version 0.99.0
##
## Biostrings version 2.45.4
##
## jsonlite version 1.5
##
## docopt version 0.4.5
##
## ----- ----- ----- -----
##
##
## [1] "We are going to : compute"
## [1] "Number of cores used : 1"
## [1] "Writting output to : toCompute_gphmm.csv"
## Time difference of 39 secs
The output file is the same as the input csv file, with an additional column containing the computed (log) GPHMM probabilities
out = read.table('toCompute_gphmm.csv', stringsAsFactors = F)
head(out)
## query ref qv gphmm
## 1 s1 s1 20 -19.43737
## 2 s2 s1 20 -241.37551
## 3 s3 s1 20 -215.89889
## 4 s4 s1 20 -251.33904
## 5 s5 s1 20 -220.39076
## 6 s6 s1 20 -238.91572
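A typical use of this table is to assign each query to the reference with the highest GPHMM probability; a minimal sketch in base R, using the column names shown above:
# For each query, keep the row with the largest (log) GPHMM probability.
best = do.call(rbind, lapply(split(out, out$query), function(x) x[which.max(x$gphmm), ]))
head(best[, c('query', 'ref', 'gphmm')])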
# 3. Train GPHMM parameters
The GPHMM is a generative model: noisy reads can be generated from true reference sequences using a chosen set of emission and transition probabilities, and the parameters can then be estimated from these simulated reads with our training algorithm. Note that in real life the true emission and transition probabilities are unknown, so the GPHMM parameters have to be estimated from a training set generated in the lab, where noisy reads have been sequenced from known reference sequences.
n = 50
Our generative model has the following parameters
paramgphmm = initializeGphmm()
paramgphmm
## $qR
## [1] 0.25 0.25 0.25 0.25
##
## $qX
## [1] 0.25 0.25 0.25 0.25
##
## $qY
## [1] 0.25 0.25 0.25 0.25
##
## $deltaX
## [1] -0.2 -0.2
##
## $deltaY
## [1] -0.2 -0.2
##
## $epsX
## [1] 0.1
##
## $epsY
## [1] 0.1
##
## $tau
## [1] 0.0006666667
##
## $eta
## [1] 0.0006666667
##
## $pp
## [,1] [,2] [,3] [,4]
## [1,] 0.91 0.03 0.03 0.03
## [2,] 0.03 0.91 0.03 0.03
## [3,] 0.03 0.03 0.91 0.03
## [4,] 0.03 0.03 0.03 0.91
50 true reference sequences are randomly generated from our GPHMM model. More sequences would be needed to estimate unbiased parameters, but for this vignette we use a small number of sequences to reduce the computation time.
seqs = generateRandomSequences(n = n, meanLen = 100, sdLen = 5,
prob = paramgphmm$qR, seed = 7373)
seqs
## A DNAStringSet instance of length 50
## width seq names
## [1] 95 CGGGGACGCCGCTGCTACCG...CTTGTATGGGAGATTCCCTC s1
## [2] 94 GGAGTACGTAGTTAGGGGGT...CAGTCTGGGCAACAAAGTGT s2
## [3] 99 AATCCGCCATCTGCCTTTTC...TTCACCGTTCGAGCGGCTTG s3
## [4] 107 GCGTTGTAAGAAAATGGACT...AAAATCTAGTTCGGTCTGCA s4
## [5] 96 TACAAATTGCTCCAACATGC...TTGTACGAGGTAAATAAGGT s5
## ... ... ...
## [46] 95 TGTCTCATCACGAGCAAGTC...CTGCATTTATCTAAAATCAT s46
## [47] 100 AAAATGACAAAAAACCGTAG...GGCCAACTCTCAAGGGGGGG s47
## [48] 97 CATTGAGAGCACCGGTCCTG...GTCAAAACCGAAGAGCCATT s48
## [49] 96 GCAGATAAGATTCCTGGTCG...GTGTTAAAGTTATGTTCTAT s49
## [50] 95 TGTGGATCATGAAAAAGGTG...TCCTGAACGTGACTGTGTCG s50
Let's now simulate sequencing errors in the true reference sequences using our model and the true emission and transition probabilities
qv = rnorm(n, 20, 5)
qv[qv < 5] = 5
reads = list()
for (i in 1:n){
reads[[i]] = generateRead(seq = as.character(seqs[i]),
paramgphmm = paramgphmm,
qv = qv[i],
seed = i)
}
train = c(seqs, DNAStringSet(sapply(reads, '[[', 1)))
names(train) = c(names(train)[1:n], gsub('s', 't', names(train)[1:n]))
csv = data.frame(reads = paste0('t', 1:n), ref = paste0('s', 1:n), qv = qv)
# write files
writeXStringSet(train, 'train.fasta')
write.table(csv, 'train.csv')
plot(density(qv), main = 'Read QV (Phred)')
plot(density(width(seqs)), main = 'Sequence length')
plot(density(width(train[grepl('t', names(train))])), main = 'Read length')
The true counts for the emission and transition matrices are
emiTrans = lapply(reads, function(x) computeCounts(x))
emiTrans = lapply(lapply(c(1:4), function(i) lapply(emiTrans, '[[', i)), function(x) Reduce('+', x))
names(emiTrans) = c('counts_emissionM', 'counts_emissionD',
'counts_emissionI', 'counts_transition')
emiTrans
## $counts_emissionM
## A C G T N
## A 1123 44 33 39 0
## C 35 1087 29 28 0
## G 35 32 1119 39 0
## T 51 38 23 1074 0
## N 0 0 0 0 0
##
## $counts_emissionD
## A C G T N
## 34 29 21 33 0
##
## $counts_emissionI
## A C G T N
## 37 26 25 32 0
##
## $counts_transition
## MM II IM DD DM MI MD
## 4569 11 109 11 103 108 105
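These counts can be normalized into empirical probabilities, which gives a quick sanity check against the emission matrix pp used to simulate the reads (a sketch assuming counts_emissionM is the numeric matrix printed above; the empty N row and column are dropped):
# Row-normalize the match/mismatch emission counts to obtain an empirical
# substitution matrix comparable to paramgphmm$pp.
cm = emiTrans$counts_emissionM[1:4, 1:4]
round(sweep(cm, 1, rowSums(cm), '/'), 3)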
The command line gphmm train can be used to estimate the parameters of the model
../inst/gphmm train train.fasta train.csv --verbose --maxit=5 --ncores=1
##
##
## ----- ----- ----- -----
##
## gphmm version 0.99.0
##
## Biostrings version 2.45.4
##
## jsonlite version 1.5
##
## docopt version 0.4.5
##
## ----- ----- ----- -----
##
##
## [1] "We are going to : train"
## [1] "Number of cores used : 1"
## [1] "Training step 1"
## [1] 69.13625 -69.13625
## [1] "Training step 2"
## [1] 0.2744344 -68.8618172
## [1] "Training step 3"
## [1] 0.002517262 -68.864334461
## [1] "Training step 4"
## [1] 0.00000 -68.86433
## [1] "Writting gphmm parameters in : train_paramgphmm.json"
## [1] "Writting ll in : train_llgphmm.json"
## Time difference of 1 secs
The estimated parameters are
nucl = c('A', 'C', 'G', 'T')
estimator = fromJSON('train_paramgphmm.json')
names(estimator[['qR']]) = names(estimator[['qX']]) = names(estimator[['qY']]) =
colnames(estimator[['pp']]) = rownames(estimator[['pp']]) = nucl
names(estimator[['deltaX']]) = names(estimator[['deltaY']]) =
c('intercept', 'slope')
estimator
## $qR
## A C G T
## 0.2559 0.2483 0.2499 0.2458
##
## $qX
## A C G T
## 0.3085 0.1915 0.1915 0.3085
##
## $qY
## A C G T
## 0.3696 0.2609 0.1413 0.2283
##
## $deltaX
## intercept slope
## 0.3522 -0.2431
##
## $deltaY
## intercept slope
## -1.0936 -0.1592
##
## $epsX
## [1] 0.1064
##
## $epsY
## [1] 0.1348
##
## $tau
## [1] 7e-04
##
## $eta
## [1] 7e-04
##
## $pp
## A C G T
## A 0.8858 0.0394 0.0306 0.0442
## C 0.0405 0.8970 0.0338 0.0287
## G 0.0300 0.0333 0.8953 0.0414
## T 0.0463 0.0429 0.0261 0.8848
The log likelihood increases at each iteration of the training procedure and should reach a plateau by the end of training
ll = fromJSON('train_llgphmm.json')
plot(1:length(ll), ll, xlab = 'Iterations', ylab = 'Log Likelihood',
type = 'l', main = 'Log likelihood')
As the parameters were estimated from reads simulated with a known set of emission and transition probabilities, the performance of our training procedure can be assessed by looking at the bias (estimated minus true) of each parameter
bias = unlist(mapply('-', estimator, paramgphmm, SIMPLIFY = FALSE))
ppNames = paste0('pp.', paste0(rep(nucl, each = 4), rep(nucl, 4)))
names(bias)[grep('^pp', names(bias))] = ppNames
emission = grepl('A|C|G|T', names(bias))
plot(bias[emission], xaxt = "n", main = 'Emission parameters',
xlab = '', ylab = 'Bias')
axis(1, at = 1:length(bias[emission]), las = 2, labels = names(bias[emission]))
abline(h = 0)
transition = !emission
plot(bias[transition], xaxt = "n", main = 'Transition parameters',
xlab = '', ylab = 'Bias')
axis(1, at = 1:length(bias[transition]), las = 2, labels = names(bias[transition]))
abline(h = 0)
# remove generated files
system('rm queries.fasta')
system('rm toCompute_gphmm.csv')
system('rm toCompute.csv')
system('rm train.csv')
system('rm train.fasta')
system('rm train_paramgphmm.json')
system('rm train_llgphmm.json')