Using MODELTEST and PAUP* to Select a Model of Nucleotide Substitution

David Posada1

1 Universidad de Vigo, Vigo, Spain
Publication Name:  Current Protocols in Bioinformatics
Unit Number:  Unit 6.5
DOI:  10.1002/0471250953.bi0605s00
Online Posting Date:  February, 2003
GO TO THE FULL TEXT: PDF or HTML at Wiley Online Library


Models of nucleotide substitution are commonly used in the analysis of DNA sequences. This unit describes the use of the program MODELTEST (coupled with PAUP*) to find the best‐fit model of substitution for the sequence alignment at hand. An example data file is analyzed and the interpretation of the results is discussed. Some background theory on model selection and a discussion of the relevance of models are included at the end of the unit.


Table of Contents

  • Guidelines for Understanding Results
  • Commentary
  • Figures
  • Tables





Literature Cited

   Adachi, J. and Hasegawa, M. 1995. Improved dating of the human/chimpanzee separation in the mitochondrial DNA tree: Heterogeneity among amino acid sites. J. Mol. Evol. 40:622‐628.
   Akaike, H. 1974. A new look at the statistical model identification. IEEE Trans. Autom. Contr. 19:716‐723.
   Bruno, W.J. and Halpern, A.L. 1999. Topological bias and inconsistency of maximum likelihood using wrong models. Mol. Biol. Evol. 16:564‐566.
   Buckley, T.R. 2002. Model misspecification and probabilistic tests of topology: Evidence from empirical data sets. Syst. Biol. 51:509‐523.
   Buckley, T.R. and Cunningham, C.W. 2002. The effects of nucleotide substitution model assumptions on estimates of nonparametric bootstrap support. Mol. Biol. Evol. 19:394‐405.
   Buckley, T.R., Simon, C., and Chambers, G.K. 2001. Exploring among‐site rate variation models in a maximum likelihood framework using empirical data: The effects of model assumptions on estimates of topology, edge lengths, and bootstrap support. Syst. Biol. 50:67‐86.
   Burnham, K.P. and Anderson, D.R. 1998. Model Selection and Inference: A Practical Information‐Theoretic Approach. Springer‐Verlag, New York.
   Cunningham, C.W., Zhu, H., and Hillis, D.M. 1998. Best‐fit maximum‐likelihood models for phylogenetic inference: Empirical tests with known phylogenies. Evolution 52:978‐987.
   Felsenstein, J. 1978. Cases in which parsimony or compatibility methods will be positively misleading. Syst. Zool. 27:401‐410.
   Felsenstein, J. 1981. Evolutionary trees from DNA sequences: A maximum likelihood approach. J. Mol. Evol. 17:368‐376.
   Frati, F., Simon, C., Sullivan, J., and Swofford, D.L. 1997. Gene evolution and phylogeny of the mitochondrial cytochrome oxidase gene in Collembola. J. Mol. Evol. 44:145‐158.
   Fukami‐Kobayashi, K. and Tateno, Y. 1991. Robustness of maximum likelihood tree estimation against different patterns of base substitutions. J. Mol. Evol. 32:79‐91.
   Gaut, B.S. and Lewis, P.O. 1995. Success of maximum likelihood phylogeny inference in the four‐taxon case. Mol. Biol. Evol. 12:152‐162.
   Goldman, N. 1993. Simple diagnostic statistical test of models of DNA substitution. J. Mol. Evol. 37:650‐661.
   Goldman, N. and Yang, Z. 1994. A codon‐based model of nucleotide substitution for protein‐coding DNA sequences. Mol. Biol. Evol. 11:725‐736.
   Hasegawa, M., Kishino, H., and Yano, T. 1985. Dating the human‐ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160‐174.
   Huelsenbeck, J.P. 1995. Performance of phylogenetic methods in simulation. Syst. Biol. 44:17‐48.
   Huelsenbeck, J.P. 2002. Testing a covariotide model of DNA substitution. Mol. Biol. Evol. 19:698‐707.
   Huelsenbeck, J.P. and Hillis, D.M. 1993. Success of phylogenetic methods in the four‐taxon case. Syst. Biol. 42:247‐264.
   Huelsenbeck, J.P. and Crandall, K.A. 1997. Phylogeny estimation and hypothesis testing using maximum likelihood. Annu. Rev. Ecol. Syst. 28:437‐466.
   Huelsenbeck, J.P. and Rannala, B. 1997. Phylogenetic methods come of age: Testing hypotheses in an evolutionary context. Science 276:227‐232.
   Huelsenbeck, J.P. and Nielsen, R. 1999. Variation in the pattern of nucleotide substitution across sites. J. Mol. Evol. 48:86‐93.
   Jukes, T.H. and Cantor, C.R. 1969. Evolution of protein molecules. In Mammalian Protein Metabolism (H.N. Munro, ed.) pp. 21‐132. Academic Press, New York.
   Kelsey, C.R., Crandall, K.A., and Voevodin, A.F. 1999. Different models, different trees: The geographic origin of PTLV‐I. Mol. Phylogenet. Evol. 13:336‐347.
   Kimura, M. 1980. A simple method for estimating evolutionary rate of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111‐120.
   Kimura, M. 1981. Estimation of evolutionary distances between homologous nucleotide sequences. Proc. Natl. Acad. Sci. USA 78:454‐458.
   Kullback, S. and Leibler, R.A. 1951. On information and sufficiency. Ann. Math. Stat. 22:79‐86.
   Leitner, T., Kumar, S., and Albert, J. 1997. Tempo and mode of nucleotide substitutions in gag and env gene fragments in human immunodeficiency virus type 1 populations with a known transmission history. J. Virol. 71:4761‐4770.
   Muse, S.V. and Gaut, B.S. 1994. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol. Biol. Evol. 11:715‐724.
   Posada, D. 2001. The effect of branch length variation on the selection of models of molecular evolution. J. Mol. Evol. 52:434‐444.
   Posada, D. and Crandall, K.A. 1998. Modeltest: Testing the model of DNA substitution. Bioinformatics 14:817‐818.
   Posada, D. and Crandall, K.A. 2001a. Selecting the best‐fit model of nucleotide substitution. Syst. Biol. 50:1‐22.
   Posada, D. and Crandall, K.A. 2001b. Selecting models of nucleotide substitution: An application to human immunodeficiency virus 1 (HIV‐1). Mol. Biol. Evol. 18:897‐906.
   Posada, D. and Crandall, K.A. 2001c. Simple (wrong) models for complex trees: Empirical bias. Mol. Biol. Evol. 18:271‐275.
   Rodríguez, F., Oliver, J.F., Marín, A., and Medina, J.R. 1990. The general stochastic model of nucleotide substitution. J. Theor. Biol. 142:485‐501.
   Schöniger, M. and von Haeseler, A. 1994. A stochastic model for the evaluation of autocorrelated DNA sequences. Mol. Phylogenet. Evol. 3:240‐247.
   Sullivan, J. and Swofford, D.L. 1997. Are guinea pigs rodents? The importance of adequate models in molecular phylogenies. J. Mamm. Evol. 4:77‐86.
   Sullivan, J. and Swofford, D.L. 2002. Should we use model‐based methods for phylogenetic inference when we know that assumptions about among‐site rate variation and nucleotide substitution pattern are violated? Syst. Biol. 50:723‐729.
   Swofford, D.L. 2000. PAUP*. Phylogenetic Analysis Using Parsimony (* and Other Methods). Sinauer Associates, Sunderland, Mass.
   Swofford, D.L., Olsen, G.J., Waddell, P.J., and Hillis, D.M. 1996. Phylogenetic inference. In Molecular Systematics (D.M. Hillis, C. Moritz, and B.K. Mable, eds.) pp. 407‐514. Sinauer Associates, Sunderland, Mass.
   Tamura, K. 1992. Estimation of the number of nucleotide substitutions when there are strong transition‐transversion and G+C content biases. Mol. Biol. Evol. 9:678‐687.
   Tamura, K. and Nei, M. 1993. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol. 10:512‐526.
   Thorne, J., Kishino, H., and Painter, I.S. 1998. Estimating the rate of evolution of the rate of molecular evolution. Mol. Biol. Evol. 15:1647‐1657.
   Tuffley, C. and Steel, M. 1998. Modeling the covarion hypothesis of nucleotide substitution. Math. Biosci. 147:63‐91.
   Wakeley, J. 1994. Substitution‐rate variation among sites and the estimation of transition bias. Mol. Biol. Evol. 11:436‐442.
   Whelan, S. and Goldman, N. 1999. Distributions of statistics used for the comparison of models of sequence evolution in phylogenetics. Mol. Biol. Evol. 16:1292‐1299.
   Xia, X. 2000. Phylogenetic relationships among horseshoe crab species: Effect of substitution models in phylogenetic analysis. Syst. Biol. 49:87‐100.
   Yang, Z. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods. J. Mol. Evol. 39:306‐314.
   Yang, Z. 1996. Among‐site rate variation and its impact on phylogenetic analysis. Trends Ecol. Evol. 11:367‐372.
   Yang, Z. 1997. How often do wrong models produce better phylogenies? Mol. Biol. Evol. 14:105‐108.
   Yang, Z., Goldman, N., and Friday, A. 1994. Comparison of models for nucleotide substitution used in maximum‐likelihood phylogenetic estimation. Mol. Biol. Evol. 11:316‐324.
   Yang, Z., Goldman, N., and Friday, A. 1995. Maximum likelihood trees from DNA sequences: A peculiar statistical estimation problem. Syst. Biol. 44:384‐399.
   Zhang, J. 1999. Performance of likelihood ratio tests of evolutionary hypotheses under inadequate substitution models. Mol. Biol. Evol. 16:868‐875.
   Zharkikh, A. 1994. Estimation of evolutionary distances between nucleotide sequences. J. Mol. Evol. 39:315‐329.
Key References
   Burnham and Anderson, 1998. See above.
  This book provides a very clear and accessible explanation of different issues around model selection, particularly for the AIC. The book is written by ecologists and it includes many biological examples. A fundamental reference for any biologist doing data analysis.
   Posada and Crandall, 2001a. See above.
  A simulation study of the performance of different strategies for selecting models of substitution. Includes a detailed description of the different selection strategies.
   Swofford et al., 1996. See above.
  This chapter is still the most comprehensive review of phylogenetic inference to date. It provides a detailed description of several substitution models and their use in phylogenetics.
Internet Resources
  The MODELTEST Web site.
  The PAUP* Web site.