Phylogenomic Inference of Protein Molecular Function

Nandini Krishnamurthy¹, Kimmen Sjölander¹

¹ University of California, Berkeley, California
Publication Name: Current Protocols in Bioinformatics
Unit Number: Unit 6.9
DOI: 10.1002/0471250953.bi0609s11
Online Posting Date: October, 2005
GO TO THE FULL TEXT: PDF or HTML at Wiley Online Library


With the explosion in sequence data, accurate prediction of protein function has become vital for prioritizing experimental investigation. Computationally efficient methods have made homology‐based function prediction feasible in high‐throughput mode, but the approach is not without its dangers. Biological processes such as gene duplication, domain shuffling, and speciation produce families of related genes whose products can have vastly different molecular functions. Standard sequence‐comparison approaches may not discriminate effectively among these candidate homologs, leading to errors in database annotations. In this unit, we describe phylogenomic approaches that reduce the error rate in function prediction. Phylogenomic inference of protein molecular function consists of a series of subtasks. First, a cluster of homologs is identified; next, a multiple sequence alignment and a phylogenetic tree are constructed. Finally, the tree is overlaid with experimental data gathered for members of the family, so that changes in biochemical function can be traced along the evolutionary tree.
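The final subtask, overlaying experimental data on the tree, can be illustrated with a small sketch. This is not the TreeNotator tool named in the protocols, and it skips homolog gathering, alignment, and tree building entirely. It assumes a tree is already available as nested tuples, and every sequence name and function label in it is hypothetical. Under those assumptions, an uncharacterized sequence receives a function only when all experimentally annotated members of its smallest enclosing annotated subtree agree.

```python
# Toy illustration only: the tree shape, sequence names, and function
# labels are hypothetical and do not come from the unit.

# A rooted tree as nested tuples; leaves are sequence identifiers.
TREE = ((("seqA", "seqB"), "seqC"), (("seqD", "seqE"), "seqF"), "seqG")

# Experimentally supported functions for a subset of the leaves.
KNOWN = {
    "seqA": "kinase",
    "seqC": "kinase",
    "seqD": "phosphatase",
    "seqE": "phosphatase",
}


def leaves(node):
    """Return all leaf names under a node."""
    if isinstance(node, str):
        return [node]
    found = []
    for child in node:
        found.extend(leaves(child))
    return found


def predict(tree, known):
    """Map each unannotated leaf to a function, or None if ambiguous."""
    predictions = {}

    def visit(node, ancestors):
        if isinstance(node, str):
            if node in known:
                return
            # Walk outward from the immediate parent toward the root.
            for clade in reversed(ancestors):
                labels = {known[name] for name in leaves(clade) if name in known}
                if labels:
                    # Transfer only when the annotated members agree.
                    predictions[node] = labels.pop() if len(labels) == 1 else None
                    return
            predictions[node] = None
            return
        for child in node:
            visit(child, ancestors + [node])

    visit(tree, [])
    return predictions


if __name__ == "__main__":
    for name, function in sorted(predict(TREE, KNOWN).items()):
        print(name, function or "ambiguous: function differs across subtree")
```

In this toy tree, seqB and seqF each sit in a subtree whose annotated members share one function, so that function is transferred. seqG joins the tree only at the root, where both functions occur, so no prediction is made; this is the case a top-hit sequence comparison could annotate incorrectly.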

Keywords: Evolution; Homolog; Ortholog; Paralog; Function prediction; Phylogenomic; Subfamily; Phylogenetic

Table of Contents

  • Basic Protocol 1: Identifying Homologs and Constructing a Multiple Sequence Alignment Using FlowerPower and MUSCLE
  • Basic Protocol 2: Multiple Sequence Alignment Analysis and Editing Using Belvu
  • Support Protocol 1: Downloading and Installing the Belvu Software
  • Basic Protocol 3: Constructing a Phylogenetic Tree Using Bete
  • Basic Protocol 4: Phylogenomic Inference of Molecular Function Using TreeNotator
  • Commentary
  • Literature Cited
  • Figures

Literature Cited

   Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403‐410.
   Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI‐BLAST: A new generation of protein database search programs. Nucl. Acids Res. 25:3389‐3402.
   Apweiler, R., Bairoch, A., Wu, C.H., Barker, W.C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M., Martin, M.J., Natale, D.A., O'Donovan, C., Redaschi, N., and Yeh, L.S. 2004. UniProt: The Universal Protein knowledgebase. Nucl. Acids Res. 32:D115‐D119.
   Benson, D.A., Karsch‐Mizrachi, I., Lipman, D.J., Ostell, J., and Wheeler, D.L. 2004. GenBank: Update. Nucl. Acids Res. 32:D23‐D26.
   Bork, P. and Koonin, E.V. 1998. Predicting functions from protein sequences—where are the bottlenecks? Nature Genet. 18:313‐318.
   Brenner, S.E. 1999. Errors in genome annotation. Trends Genet. 15:132‐133.
   Brown, D., Krishnamurthy, N., Dale, J.M., Christopher, W., and Sjölander, K. 2005. Subfamily HMMs in functional genomics. Pac. Symp. Biocomput. 322‐333.
   Citerne, H.L., Luo, D., Pennington, R.T., Coen, E., and Cronk, Q.C. 2003. A phylogenomic investigation of CYCLOIDEA‐like TCP genes in the Leguminosae. Plant Physiol. 131:1042‐1053.
   Devos, D. and Valencia, A. 2001. Intrinsic errors in genome annotation. Trends Genet. 17:429‐431.
   Doolittle, R.F. 1995. The multiplicity of domains in proteins. Annu. Rev. Biochem. 64:287‐314.
   Doolittle, R.F. and Bork, P. 1993. Evolutionarily mobile modules in proteins. Sci. Am. 269:50‐56.
   Edgar, R.C. 2004. MUSCLE: A multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5:113.
   Eisen, J.A. 1998. Phylogenomics: Improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res. 8:163‐167.
   Eisen, J.A. and Wu, M. 2002. Phylogenetic analysis and gene functional predictions: Phylogenomics in action. Theor. Popul. Biol. 61:481‐487.
   Felsenstein, J. 1988. Phylogenies from molecular sequences: Inference and reliability. Annu. Rev. Genet. 22:521‐565.
   Fitch, W.M. 1970. Distinguishing homologous from analogous proteins. Syst. Zool. 19:99‐113.
   Galperin, M.Y. and Koonin, E.V. 1998. Sources of systematic error in functional annotation of genomes: Domain rearrangement, non‐orthologous gene displacement and operon disruption. In Silico Biol. 1:55‐67.
   Gerlt, J.A. and Babbitt, P.C. 2001. Divergent evolution of enzymatic function: Mechanistically diverse superfamilies and functionally distinct suprafamilies. Annu. Rev. Biochem. 70:209‐246.
   Gilks, W.R., Audit, B., De Angelis, D., Tsoka, S., and Ouzounis, C.A. 2002. Modeling the percolation of annotation errors in a database of protein sequences. Bioinformatics 18:1641‐1649.
   Hasegawa, M. and Fujiwara, M. 1993. Relative efficiencies of the maximum likelihood, maximum parsimony, and neighbor‐joining methods for estimating protein phylogeny. Mol. Phylogenet. Evol. 2:1‐5.
   Hollich, V., Storm, C.E., and Sonnhammer, E.L. 2002. OrthoGUI: Graphical presentation of Orthostrapper results. Bioinformatics 18:1272‐1273.
   Huelsenbeck, J.P. and Ronquist, F. 2001. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics 17:754‐755.
   Koonin, E.V. 2001. An apology for orthologs—or brave new memes. Genome Biol. 2:COMMENT1005.
   Kuhner, M.K. and Felsenstein, J. 1994. A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Mol. Biol. Evol. 11:459‐468.
   McClure, M.A., Vasi, T.K., and Fitch, W.M. 1994. Comparative analysis of multiple protein‐sequence alignment methods. Mol. Biol. Evol. 11:571‐592.
   Saitou, N. and Nei, M. 1987. The neighbor‐joining method: A new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406‐425.
   Sander, C. and Schneider, R. 1991. Database of homology‐derived protein structures and the structural meaning of sequence alignment. Proteins 9:56‐68.
   Sjölander, K. 1998. Phylogenetic inference in protein superfamilies: Analysis of SH2 domains. Proc. Int. Conf. Intell. Syst. Mol. Biol. 6:165‐174.
   Sjölander, K. 2004. Phylogenomic inference of protein molecular function: Advances and challenges. Bioinformatics 20:170‐179.
   Sjölander, K., Karplus, K., Brown, M., Hughey, R., Krogh, A., Mian, I.S., and Haussler, D. 1996. Dirichlet mixtures: A method for improved detection of weak but significant protein sequence homology. Comput. Appl. Biosci. 12:327‐345.
   Storm, C.E. and Sonnhammer, E.L. 2002. Automated ortholog inference from phylogenetic trees and calculation of orthology reliability. Bioinformatics 18:92‐99.
   Swofford, D. 2002. PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Version 4. Sinauer Associates, Sunderland, Mass.
   Thompson, J.D., Plewniak, F., and Poch, O. 1999. A comprehensive comparison of multiple sequence alignment programs. Nucl. Acids Res. 27:2682‐2690.
   Wheeler, W.C., Gatesy, J., and DeSalle, R. 1995. Elision: A method for accommodating multiple molecular sequence alignments with alignment‐ambiguous sites. Mol. Phylogenet. Evol. 4:1‐9.
   Zmasek, C.M. and Eddy, S.R. 2001. ATV: Display and manipulation of annotated phylogenetic trees. Bioinformatics 17:383‐384.
   Zmasek, C.M. and Eddy, S.R. 2002. RIO: Analyzing proteomes by automated phylogenomics using resampled inference of orthologs. BMC Bioinformatics 3:14.

Key References

   Bork and Koonin, 1998. See above.
   The authors of this paper identify common problems associated with function prediction by homology and present ways to avoid these errors.
   Eisen, 1998. See above.
   Jonathan Eisen's cogent presentation of the raison d'être behind phylogenomic analysis for improving prediction of gene function.
   Sjölander, 2004. See above.
   A detailed view of the challenges in phylogenomic analysis, with a description of new methods for key tasks in a phylogenomic pipeline.

Internet Resources

   The BPG resources Web site includes a variety of user‐friendly resources for phylogenomic inference of protein molecular function. A description of all the available tools can also be found on the Web site.