Australian Systematic Botany Australian Systematic Botany Society
Taxonomy, biogeography and evolution of plants
L. A. S. JOHNSON REVIEW

Construction and annotation of large phylogenetic trees

Michael J. Sanderson

Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ 85721, USA.

Australian Systematic Botany 20(4) 287-301 https://doi.org/10.1071/SB07006
Submitted: 28 February 2007  Accepted: 22 May 2007   Published: 5 September 2007

Abstract

Broad availability of molecular sequence data allows construction of phylogenetic trees with 1000s or even 10 000s of taxa. This paper reviews methodological, technological and empirical issues raised in phylogenetic inference at this scale. Numerous algorithmic and computational challenges have been identified surrounding the core problem of reconstructing large trees accurately from sequence data, but many other obstacles, both upstream and downstream of this step, are less well understood. Before phylogenetic analysis, data must be generated de novo or extracted from existing databases, compiled into blocks of homologous data with controlled properties, aligned, examined for the presence of gene duplications or other kinds of complicating factors, and finally, combined with other evidence via supermatrix or supertree approaches. After phylogenetic analysis, confidence assessments are usually reported, along with other kinds of annotations, such as clade names, or annotations requiring additional inference procedures, such as trait evolution or divergence time estimates. Prospects for partial automation of large-tree construction are also discussed, as well as risks associated with ‘outsourcing’ phylogenetic inference beyond the systematics community.


References


Aho AV, Sagiv Y, Szymanski TG, Ullman JD (1981) Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM Journal on Computing 10, 405–421.

Maddison WP, Maddison DR (2000) ‘MacClade 4: analysis of phylogeny and character evolution.’ (Sinauer: Sunderland, MA)

Maddison WP, Maddison DR (2007) Mesquite: a modular system for evolutionary analysis. http://mesquiteproject.org/mesquite/mesquite.html [verified 17 July 2007].

McCubbin AG, Roalson EH (2005) Construction of bacterial artificial chromosome libraries for use in phylogenetic studies. Methods in Enzymology 395, 384–400.

McMahon MM, Sanderson MJ (2006) Phylogenetic supermatrix analysis of GenBank sequences from 2228 papilionoid legumes. Systematic Biology 55, 818–836.

Minh BQ, Vinh LS, von Haeseler A, Schmidt HA (2005) pIQPNNI: parallel reconstruction of large maximum likelihood phylogenies. Bioinformatics 21, 3794–3796.

Moles A, Ackerly D, Webb C, Tweddle J, Dickie J, Westoby M (2005) A brief history of seed size. Science 307, 576–580.

Moore B, Smith S, Donoghue MJ (2006) Increasing data transparency and estimating phylogenetic uncertainty in supertrees: approaches using nonparametric bootstrapping. Systematic Biology 55, 662–676.

Mort ME, Soltis PS, Soltis DE, Mabry ML (2000) Comparison of three methods for estimating internal support on phylogenetic trees. Systematic Biology 49, 160–171.

Mossel E (2007) Distorted metrics on trees and phylogenetic forests. IEEE/ACM Transactions on Computational Biology and Bioinformatics 4, 108–116.

Mossel E, Steel M (2005) How much can evolved characters tell us about the tree that generated them? In ‘Mathematics of evolution and phylogeny’. (Eds O Gascuel, M Steel) pp. 384–412. (Oxford University Press: New York)

Mower JP, Stefanovic S, Young GJ, Palmer JD (2004) Plant genetics—Gene transfer from parasitic to host plants. Nature 432, 165–166.

Munzner T (1998) Exploring large graphs in 3D hyperbolic space. IEEE Computer Graphics and Applications 18, 18–23.

Munzner T, Guimbretiere F, Tasiran S, Zhang L, Zhou YH (2003) TreeJuxtaposer: scalable tree comparison using Focus+Context with guaranteed visibility. ACM Transactions on Graphics 22, 453–462.

Myers DS, Cummings MP (2003) Necessity is the mother of invention: a simple grid computing system using commodity tools. Journal of Parallel and Distributed Computing 63, 578–589.

Nilsson RH, Rajashekar B, Larsson KH, Ursing BM (2004) GalaxieEST: addressing EST identity through automated phylogenetic analysis. BMC Bioinformatics 5,

Page RDM (1998) GeneTree: comparing gene and species phylogenies using reconciled trees. Bioinformatics 14, 819–820.

Page RDM, Charleston MA (1998) Trees within trees: phylogeny and historical associations. Trends in Ecology & Evolution 13, 356–359.

Parmentier G, Trystram D, Zola J (2006) Large scale multiple sequence alignment with simultaneous phylogeny inference. Journal of Parallel and Distributed Computing 66, 1534–1545.

Qiu Y-L, Lee J, Bernasconi-Quadroni F, Soltis DE, Soltis PS, Zanis M, Zimmer EA, Chen Z, Savolainen V, Chase MW (1999) The earliest angiosperms: evidence from mitochondrial, plastid and nuclear genomes. Nature 402, 404–407.

Qiu YL, Dombrovska O, Lee J, Li L, Whitlock BA, Bernasconi-Quadroni F, Rest JS, Davis CC, Borsch T, Hilu KW, Renner SS, Soltis DE, Soltis PS, Zanis MJ, Cannone JJ, Gutell RR, Powell M, Savolainen V, Chatrou LW, Chase MW (2005) Phylogenetic analyses of basal angiosperms based on nine plastid, mitochondrial, and nuclear genes. International Journal of Plant Sciences 166, 815–842.

de Queiroz A, Donoghue MJ, Kim J (1995) Separate versus combined analysis of phylogenetic evidence. Annual Review of Ecology and Systematics 26, 657–681.

Rice KA, Donoghue MJ, Olmstead RG (1997) Analyzing large data sets: rbcL 500 revisited. Systematic Biology 46, 554–563.

Robbertse B, Reeves JB, Schoch CL, Spatafora JW (2006) A phylogenomic analysis of the Ascomycota. Fungal Genetics and Biology 43, 715–725.

Rokas A, Williams B, King N, Carroll S (2003) Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature 425, 798–804.

Ross HA, Lento GM, Dalebout ML, Goode M, Ewing G, McLaren P, Rodrigo AG, Lavery S, Baker CS (2003) DNA Surveillance: web-based molecular identification of whales, dolphins and porpoises. Journal of Heredity 94, 111–114.

Rutschmann F (2006) Molecular dating of phylogenetic trees: a brief review of current methods that estimate divergence times. Diversity & Distributions 12, 35–48.

Salamin N, Hodkinson TR, Savolainen V (2002) Building supertrees: an empirical assessment using the grass family (Poaceae). Systematic Biology 51, 136–150.

Salamin N, Chase MW, Hodkinson TR, Savolainen V (2003) Assessing internal support with large phylogenetic DNA matrices. Molecular Phylogenetics and Evolution 27, 528–539.

Sanderson MJ (2006) Paloverde: an OpenGL 3D phylogeny browser. Bioinformatics 22, 1004–1006.

Sanderson MJ, McMahon MM (2007) Inferring angiosperm phylogeny from EST data with widespread gene duplication. BMC Evolutionary Biology 7 (Suppl. 1), S3.

Sanderson MJ, Wojciechowski MF (2000) Improved bootstrap confidence limits in large-scale phylogenies, with an example from Neo-Astragalus (Leguminosae). Systematic Biology 49, 671–685.

Sanderson MJ, Wojciechowski MF, Hu JM, Khan TS, Brady SG (2000) Error, bias, and long-branch attraction in data for two chloroplast photosystem genes in seed plants. Molecular Biology and Evolution 17, 782–797.

Sanderson MJ, Driskell AC, Ree RH, Eulenstein O, Langley S (2003) Obtaining maximal concatenated phylogenetic data sets from large sequence databases. Molecular Biology and Evolution 20, 1036–1042.

Sanderson MJ, Ané C, Eulenstein O, Fernandez-Baca D, Kim J, McMahon MM, Piaggio-Talice R (2007) Fragmentation of large data sets in phylogenetic analysis. In ‘Mathematics of evolution and phylogeny II’. (Eds O Gascuel, M Steel) (Oxford University Press: Oxford)

Schlueter JA, Dixon P, Granger C, Grant D, Clark L, Doyle JJ, Shoemaker RC (2004) Mining EST databases to resolve evolutionary events in major crop species. Genome 47, 868–876.

Schmidt HA, Strimmer K, Vingron M, von Haeseler A (2002) TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics 18, 502–504.

Semple C, Daniel P, Hordijk W, Page RDM, Steel M (2004) Supertree algorithms for ancestral divergence dates and nested taxa. Bioinformatics 20, 2355–2360.

Shimodaira H (2002) An approximately unbiased test of phylogenetic tree selection. Systematic Biology 51, 492–508.

Sneath P, Sokal R (1973) ‘Numerical taxonomy.’ (WH Freeman and Co.: San Francisco)

Soltis DE, Soltis PS, Nickrent DL, Johnson LA, Hahn WJ, Hoot SB, Sweere JA (1997) Angiosperm phylogeny inferred from 18S ribosomal sequences. Annals of the Missouri Botanical Garden 84, 1–49.

Soltis PS, Soltis DE, Wolf PG, Nickrent DL, Chaw S-M, Chapman RL (1999) The phylogeny of land plants inferred from 18S rDNA sequences: pushing the limits of rDNA signal? Molecular Biology and Evolution 16, 1774–1784.

Stamatakis A (2006) RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22, 2688–2690.

Stamatakis A, Ludwig T, Meier H (2005) RAxML-III: a fast program for maximum likelihood-based inference of large phylogenetic trees. Bioinformatics 21, 456–463.

Storm CEV, Sonnhammer ELL (2002) Automated ortholog inference from phylogenetic trees and calculation of orthology reliability. Bioinformatics 18, 92–99.

Tehler A, Little DP, Farris JS (2003) The full-length phylogenetic tree from 1551 ribosomal sequences of chitinous fungi. Mycological Research 107, 901–916.

Till M, Zhou BB, Zomaya A, Jermiin LS (2004) Phylogenetic analysis using maximum likelihood methods in homogeneous parallel environments. Lecture Notes in Computer Science 3320, 274–279.

de la Torre J, Egan M, Katari M, Brenner E, Stevenson D, Coruzzi G, Desalle R (2006) ESTimating plant phylogeny: lessons from partitioning. BMC Evolutionary Biology 6, 48.

Vilgalys R (2003) Taxonomic misidentification in public DNA databases. New Phytologist 160, 4–5.

Vogl C, Badger J, Kearney P, Li M, Clegg M, Jiang T (2003) Probabilistic analysis indicates discordant gene trees in chloroplast evolution. Journal of Molecular Evolution 56, 330–340.

Walters JD, Casavant TL, Robinson JP, Bair TB, Braun TA, Scheetz TE (2005) XenoCluster: a grid computing approach to finding ancient evolutionary genetic anomalies. Lecture Notes in Computer Science 3606, 355–366.

Webb CO, Donoghue MJ (2005) Phylomatic: tree assembly for applied phylogenetics. Molecular Ecology Notes 5, 181–183.

Webb CO, Losos JB, Agrawal AA (2006) Integrating phylogenies into community ecology. Ecology 87, S1–S2.

Wojciechowski MF, Sanderson MJ, Steel KP, Liston A (2000) Molecular phylogeny of the ‘temperate herbaceous tribes’ of papilionoid legumes: a supertree approach. In ‘Advances in legume systematics’. (Eds PS Herendeen, A Bruneau) pp. 277–298. (Royal Botanic Gardens, Kew: London)

Yan CH, Burleigh JG, Eulenstein O (2005) Identifying optimal incomplete phylogenetic data sets from sequence databases. Molecular Phylogenetics and Evolution 35, 528–535.

Yang ZH, Rannala B (2006) Bayesian estimation of species divergence times under a molecular clock using multiple fossil calibrations with soft bounds. Molecular Biology and Evolution 23, 212–226.

Yesson C, Culham A (2006) A phyloclimatic study of cyclamen. BMC Evolutionary Biology 6, 72.

Zwickl DJ (2006) Genetic algorithm approaches for the phylogenetic analysis of large biological sequence datasets under the maximum likelihood criterion. PhD Dissertation, University of Texas at Austin, Austin, TX.

1 Reflecting on one of many raging arguments over phenetic systematics in the late 1960s, L.A.S. Johnson argued that problems of homology (‘matching’) would not all be whisked away by large oceans of data: ‘…even if we knew the entire nucleotide sequences over a set of organisms we should still have to make many decisions on matching…’ (Johnson 1970: p. 227, based on his presidential address for the Linnean Society of New South Wales in 1968). At the time, the prospects for studying such complete genome sequences must have seemed remote. Now the data are here, and the newest genomics technologies (e.g. 454 Life Sciences’ FLX system) promise to deliver 100 million base pairs of sequence in an eight-hour run (50 chloroplast genomes or one entire Arabidopsis genome…). However, the number of ‘decisions’ to be made regarding the analysis of such data has grown along with the quantity of information.