Protein Function Prediction: Problems and Pitfalls

William R. Pearson1

1 University of Virginia School of Medicine, Charlottesville
Publication Name:  Current Protocols in Bioinformatics
Unit Number:  Unit 4.12
DOI:  10.1002/0471250953.bi0412s51
Online Posting Date:  September, 2015
GO TO THE FULL TEXT: PDF or HTML at Wiley Online Library


The characterization of new genomes based on their protein sets has been revolutionized by new sequencing technologies, but biologists seeking to exploit new sequence information are often frustrated by the challenges associated with accurately assigning biological functions to newly identified proteins. Here, we highlight some of the challenges in functional inference from sequence similarity. Investigators can improve the accuracy of function prediction by (1) being conservative about the evolutionary distance to a protein of known function; (2) considering the ambiguous meaning of “functional similarity”; and (3) being aware of the limitations of annotations in functional databases. Protein function prediction does not offer “one‐size‐fits‐all” solutions. Prediction strategies work better when the idiosyncrasies of function and functional annotation are better understood. © 2015 by John Wiley & Sons, Inc.

Keywords: homology; orthology; paralogy; function prediction; gene ontology; EC numbers


Table of Contents

  • Introduction
  • Annotating Function
  • Homologs, Orthologs, and Paralogs
  • Function Prediction and Evolutionary Distance
  • Similarity Search, Database Size, and Database Redundancy
  • Summary
  • Literature Cited
  • Tables





Literature Cited

  Gene Ontology Consortium 2001. Creating the gene ontology resource: Design and implementation. Genome Res. 11:1425‐1433. doi: 10.1101/gr.180801.
  Gene Ontology Consortium 2014. Guide to GO evidence codes (‐go‐evidence‐codes).
  Altenhoff, A.M., Studer, R.A., Robinson‐Rechavi, M., and Dessimoz, C. 2012. Resolving the ortholog conjecture: Orthologs tend to be weakly, but significantly, more similar in function than paralogs. PLoS Comput. Biol. 8:e1002514. doi: 10.1371/journal.pcbi.1002514.
  Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI‐BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25:3389‐3402. doi: 10.1093/nar/25.17.3389.
  Blake, J.A. and Harris, M.A. 2008. The gene ontology (GO) project: Structured vocabularies for molecular biology and their application to genome and expression analysis. Curr. Protoc. Bioinform. 23:7.2:7.2.1‐7.2.9.
  Chen, X. and Zhang, J. 2012. The ortholog conjecture is untestable by the current gene ontology but is supported by RNA sequencing data. PLoS Comput. Biol. 8:e1002784. doi: 10.1371/journal.pcbi.1002784.
  Dalquen, D.A. and Dessimoz, C. 2013. Bidirectional best hits miss many orthologs in duplication‐rich clades such as plants and animals. Genome Biol. Evol. 5:1800‐1806.
  Dessimoz, C., Skunca, N., and Thomas, P.D. 2013. CAFA and the open world of protein function predictions. Trends Genet. 29:609‐610. doi: 10.1016/j.tig.2013.09.005.
  Devos, D. and Valencia, A. 2000. Practical limits of function prediction. Proteins 41:98‐107. doi: 10.1002/1097‐0134(20001001)41:1<98::AID‐PROT120>3.0.CO;2‐S.
  Eddy, S.R. 2011. Accelerated profile HMM searches. PLoS Comput. Biol. 7:e1002195. doi: 10.1371/journal.pcbi.1002195.
  Fischer, S., Brunk, B.P., Chen, F., Gao, X., Harb, O.S., Iodice, J.B., Shanmugam, D., Roos, D.S., and Stoeckert, C.J. 2011. Using OrthoMCL to assign proteins to OrthoMCL‐DB groups or to cluster proteomes into new ortholog groups. Curr. Protoc. Bioinform. 35:6.12.1‐6.12.19.
  Galperin, M.Y. and Koonin, E.V. 2012. Divergence and convergence in enzyme evolution. J. Biol. Chem. 287:21‐28. doi: 10.1074/jbc.R111.241976.
  Gerlt, J.A. and Babbitt, P.C. 2000. Can sequence determine function? Genome Biol. 1:reviews0005.1‐10. doi: 10.1186/gb‐2000‐1‐5‐reviews0005.
  Jensen, R.A. 2001. Orthologs and paralogs ‐ we need to get it right. Genome Biol. 2:interactions1002.1‐3. doi: 10.1186/gb‐2001‐2‐8‐interactions1002.
  Magrane, M. and Uniprot Consortium 2011. UniProt knowledgebase: A hub of integrated protein data. Database 2011:bar009. doi: 10.1093/database/bar009.
  Nehrt, N.L., Clark, W.T., Radivojac, P., and Hahn, M.W. 2011. Testing the ortholog conjecture with comparative functional genomic data from mammals. PLoS Comput. Biol. 7:e1002073. doi: 10.1371/journal.pcbi.1002073.
  Omelchenko, M.V., Galperin, M.Y., Wolf, Y.I., and Koonin, E.V. 2010. Non‐homologous isofunctional enzymes: A systematic analysis of alternative solutions in enzyme evolution. Biol. Direct 5:31. doi: 10.1186/1745‐6150‐5‐31.
  Pearson, W.R. 2013. An introduction to sequence similarity (“homology”) searching. Curr. Protoc. Bioinform. 42:3.1.1‐3.1.8.
  Skunca, N., Altenhoff, A., and Dessimoz, C. 2012. Quality of computationally inferred gene ontology annotations. PLoS Comput. Biol. 8:e1002533. doi: 10.1371/journal.pcbi.1002533.
  Thomas, P.D., Wood, V., Mungall, C.J., Lewis, S.E., Blake, J.A., and Gene Ontology Consortium 2012. On the use of gene ontology annotations to assess functional similarity among orthologs and paralogs: A short report. PLoS Comput. Biol. 8:e1002386. doi: 10.1371/journal.pcbi.1002386.
  Webb, E.C., (Ed.) 1992. Enzyme nomenclature 1992: Recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology on the nomenclature and classification of enzymes. Academic Press, San Diego.
  Wilson, C.A., Kreychman, J., and Gerstein, M. 2000. Assessing annotation transfer for genomics: Quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores. J. Mol. Biol. 297:233‐249. doi: 10.1006/jmbi.2000.3550.