Ancestral Sequence Reconstruction

Klaus Schliep

Graz University of Technology
klaus.schliep@gmail.com

2023-01-23

Introduction

These notes describe the ancestral sequence reconstruction using the phangorn package (Schliep 2011). phangorn provides several methods to estimate ancestral character states with either Maximum Parsimony (MP) or Maximum Likelihood (ML). For more background on all the methods see e.g. (Felsenstein 2004) or (Yang 2006).

Parsimony reconstructions

To reconstruct ancestral sequences we first load some data and reconstruct a tree:

library(phangorn)
fdir <- system.file("extdata/trees", package = "phangorn")
primates <- read.phyDat(file.path(fdir, "primates.dna"),
                        format = "interleaved")
tree <- pratchet(primates, trace=0) |> acctran(primates)
parsimony(tree, primates)
## [1] 746

For parsimony analysis of the edge length represent the observed number of changes. Reconstructing ancestral states therefore defines also the edge lengths of a tree. However there can exist several equally parsimonious reconstructions or states can be ambiguous and therefore edge length can differ. “MPR” reconstructs the ancestral states for each (internal) node as if the tree would be rooted in that node. However the nodes are not independent of each other. If one chooses one state for a specific node, this can restrict the choice of neighboring nodes (figures 2 and 3). The function acctran (accelerated transformation) assigns edge length and internal nodes to the tree (Swofford and Maddison 1987).

anc.acctran <- ancestral.pars(tree, primates, "ACCTRAN")
anc.mpr <- ancestral.pars(tree, primates, "MPR")

All the ancestral reconstructions for parsimony are based on the fitch algorithm and so far only bifurcating trees are allowed. However trees can get pruned afterwards using the function multi2di from ape.

The seqLogo function from the seqLogo package from Bioconductor provides a neat way to show proportions of a nucleotides of ancestral states (see figure 1).

library(seqLogo)
seqLogo( t(subset(anc.mpr, getRoot(tree), 1:20)[[1]]), ic.scale=FALSE)

You may need to install seqLogo before

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("seqLogo")
plotAnc(tree, anc.mpr, 17)
title("MPR")

Fig 2. Ancestral reconstruction using MPR.

plotAnc(tree, anc.acctran, 17)
title("ACCTRAN")

Fig 3. Ancestral reconstruction using ACCTRAN.

Likelihood reconstructions

phangorn also offers the possibility to estimate ancestral states using a ML. The advantages of ML over parsimony is that the reconstruction accounts for different edge lengths. So far only a marginal construction is implemented (see (Yang 2006)).

fit <- pml(tree, primates)
fit <- optim.pml(fit, model="F81", control = pml.control(trace=0))

We can assign the ancestral states according to the highest likelihood (“ml”): \[ P(x_r = A) = \frac{L(x_r=A)}{\sum_{k \in \{A,C,G,T\}}L(x_r=k)} \] and the highest posterior probability (“bayes”) criterion: \[ P(x_r=A) = \frac{\pi_A L(x_r=A)}{\sum_{k \in \{A,C,G,T\}}\pi_k L(x_r=k)}, \] where \(L(x_r)\) is the joint probability of states at the tips and the state at the root \(x_r\) and \(\pi_i\) are the estimated base frequencies of state \(i\). Both methods agree if all states (base frequencies) have equal probabilities.

anc.ml <- ancestral.pml(fit, "ml")
anc.bayes <- ancestral.pml(fit, "bayes")

The differences of the two approaches for a specific site (17) are represented in the following figures.

plotAnc(tree, anc.ml, 17)
title("ML")

Fig 4. Ancestral reconstruction the using the maximum likelihood.

plotAnc(tree, anc.bayes, 17)
title("Bayes")

Fig 5. Ancestral reconstruction using (empirical) Bayes.

Session info

## R version 4.2.2 Patched (2022-11-10 r83330)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so
## 
## locale:
##  [1] LC_CTYPE=de_AT.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=de_AT.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=de_AT.UTF-8    LC_MESSAGES=de_AT.UTF-8   
##  [7] LC_PAPER=de_AT.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=de_AT.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] phangorn_2.11.1 ape_5.6-2      
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.9       bslib_0.4.2      compiler_4.2.2   jquerylib_0.1.4 
##  [5] highr_0.10       tools_4.2.2      digest_0.6.31    jsonlite_1.8.4  
##  [9] evaluate_0.20    lifecycle_1.0.3  nlme_3.1-161     lattice_0.20-45 
## [13] pkgconfig_2.0.3  rlang_1.0.6      Matrix_1.5-3     fastmatch_1.1-3 
## [17] igraph_1.3.5     cli_3.6.0        rstudioapi_0.14  yaml_2.3.6      
## [21] parallel_4.2.2   xfun_0.36        fastmap_1.1.0    stringr_1.5.0   
## [25] knitr_1.41       generics_0.1.3   vctrs_0.5.1      sass_0.4.4      
## [29] grid_4.2.2       glue_1.6.2       R6_2.5.1         rmarkdown_2.20  
## [33] magrittr_2.0.3   codetools_0.2-18 htmltools_0.5.4  quadprog_1.5-8  
## [37] stringi_1.7.12   cachem_1.0.6

References

Felsenstein, Joseph. 2004. Inferring Phylogenies. Sunderland: Sinauer Associates.
Schliep, Klaus Peter. 2011. “Phangorn: Phylogenetic Analysis in R.” Bioinformatics 27 (4): 592–93. https://doi.org/10.1093/bioinformatics/btq706.
Swofford, D. L., and W. P. Maddison. 1987. “Reconstructing Ancestral Character States Under Wagner Parsimony.” Math. Biosci. 87: 199–229.
Yang, Ziheng. 2006. Computational Molecular Evolution. Oxford: Oxford University Press.