Inference of Episodic Changes in Natural Selection Acting on Protein Coding Sequences via CODEML

Joseph P. Bielawski¹, Jennifer L. Baker², Joseph Mingrone¹

¹ Department of Mathematics & Statistics, Dalhousie University, Halifax, Nova Scotia
² Center for Research on Genomics and Global Health, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland
Publication Name: Current Protocols in Bioinformatics
Unit Number: Unit 6.15
DOI: 10.1002/cpbi.2
Online Posting Date: June 2016
GO TO THE FULL TEXT: PDF or HTML at Wiley Online Library


This unit provides protocols for using the CODEML program from the PAML package to make inferences about episodic natural selection in protein‐coding sequences. The protocols cover inference tasks such as maximum likelihood estimation of selection intensity, testing the hypothesis of episodic positive selection, and identifying sites with a history of episodic evolution. We provide protocols for using the rich set of models implemented in CODEML to assess the robustness of results, and for using bootstrapping to assess whether the requirements for reliable statistical inference have been met. An example dataset illustrates how the protocols are applied to real protein‐coding sequences. Once automated, the workflow is readily extendable to larger‐scale evolutionary surveys. © 2016 by John Wiley & Sons, Inc.

Keywords: codon model; natural selection; episodic evolution; maximum likelihood; dN/dS ratio; experimental design


Table of Contents

  • Introduction
  • Basic Protocol 1: Maximum Likelihood Estimation of Episodic Selection Intensity
  • Basic Protocol 2: Using the Bootstrap to Assess if the Requirements for Inference Have Been Met
  • Basic Protocol 3: Testing the Hypothesis of Episodic Evolution and Making Site‐Specific Inferences
  • Support Protocol 1: Obtain and Install PAML
  • Support Protocol 2: Obtain and Install CODEML_SBA for UNIX/UNIX‐Like and OS X Systems
  • Support Protocol 3: Labeling the Foreground Branch of a Newick Tree
  • Support Protocol 4: Assess Robustness of Results to Alternative Models for Codon Frequencies
  • Support Protocol 5: Smoothed Bootstrap Aggregation for Identifying Sites with a History of Positive Selection
  • Guidelines for Understanding Results
  • Commentary
  • Literature Cited
  • Figures
  • Tables