Computational Methods for Protein Sequence Comparison and Search

Dong Xu1

1 Department of Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri‐Columbia, Columbia, Missouri
Publication Name:  Current Protocols in Protein Science
Unit Number:  Unit 2.1
DOI:  10.1002/0471140864.ps0201s56
Online Posting Date:  April, 2009
GO TO THE FULL TEXT: PDF or HTML at Wiley Online Library


Protein sequence comparison and search have become commonplace, not only for bioinformatics researchers but also for many experimentalists. With the exponential growth in sequence data, sequence comparison in particular has become an increasingly important tool. Relating a new gene sequence to known sequences often reveals its function, structure, and evolution. Many sequence comparison and search tools are available through public Web servers, and biologists can use them easily with little knowledge of computers or bioinformatics. This unit provides some theoretical background and describes popular tools for dot plots, sequence searches against a database, multiple sequence alignment, protein tree construction, and protein family and motif searches. Step‐by‐step examples illustrate how to use some of the best‐known tools. Finally, some general advice is given on combining different sequence analysis tools for biological inference. Curr. Protoc. Protein Sci. 56:2.1.1‐2.1.27. © 2009 by John Wiley & Sons, Inc.

Keywords: protein sequence comparison; dot plot; multiple sequence alignment; protein tree; protein family; motif search

Table of Contents

  • Introduction
  • Theoretical Background for Protein Sequence Analysis
  • Matrix Methods for Sequence Comparison: Dot Plots
  • Sequence Similarity Searching
  • Multiple Alignments
  • Protein Trees
  • Protein Family and Functional Site Identification
  • General Strategy for Sequence Analyses
  • Acknowledgement
  • Internet Resources
  • Literature Cited
  • Figures
  • Tables

Literature Cited

   Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403‐410.
   Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI‐BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25:3389‐3402.
   Argos, P. 1987. A sensitive procedure to compare amino acid sequences. J. Mol. Biol. 193:385‐396.
   Attwood, T.K., Bradley, P., Flower, D.R., Gaulton, A., Maudling, N., Mitchell, A., Moulton, G., Nordle, A., Paine, K., Taylor, P., Uddin, A., and Zygouri, C. 2003. PRINTS and its automatic supplement, prePRINTS. Nucleic Acids Res. 31:400‐402.
   Bairoch, A. 1992. PROSITE: A dictionary of protein sites and patterns. Nucleic Acids Res. 19:2241‐2245.
   Barton, G.J. 1990. Protein multiple sequence alignment and flexible pattern matching. Methods Enzymol. 183:403‐428.
   Borodovsky, M. and Ekisheva, S. 2006. Problems and Solutions in Biological Sequence Analysis. Cambridge University Press.
   Brendel, V., Bucher, P., Nourbaksh, I.R., Blaisdell, B.E., and Karlin, S. 1992. Methods and algorithms for statistical analysis of protein sequences. Proc. Natl. Acad. Sci. U.S.A. 89:2002‐2006.
   Burks, C. 1990. The flow of nucleotide sequence data into data banks: Role and impact of large‐scale sequencing projects. In Computers and DNA, Santa Fe Institute (G. Bell and T. Marr, eds.) pp. 35‐45. Addison‐Wesley, Reading, Mass.
   Chou, P.Y. and Fasman, G.D. 1974. Prediction of protein conformation. Biochemistry 13:222‐244.
   Corpet, F., Servant, F., Gouzy, J., and Kahn, D. 2000. ProDom and ProDom‐CG: Tools for protein domain analysis and whole genome comparisons. Nucleic Acids Res. 28:267‐269.
   Day, W.H.E. and McMorris, F.R. 1993. A consensus program for molecular sequences. CABIOS 9:653‐656.
   Dayhoff, M.O. 1978. Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Washington, D.C.
   Depiereux, E. and Feytmans, E. 1991. Simultaneous and multivariate alignment of protein sequences: Correspondence between physicochemical profiles and structurally conserved regions (SCR). Protein Eng. 4:603‐613.
   De Rijk, P. and De Wachter, R. 1993. DCSE, an interactive tool for sequence alignment and secondary structure search. CABIOS 9:735‐740.
   Dodo, H., Marsic, D., Callender, M., Cebert, E., and Viquez, O. 2002. Screening 34 peanut introductions for allergen content using ELISA. Food Agric. Immunol. 14:147‐154.
   Doolittle, R.F. 1981. Similar amino acid sequences: Chance or common ancestry? Science 214:149‐159.
   Doolittle, R.F. 1986. Of URFs and ORFs: A Primer on How to Analyze Derived Amino Acid Sequences. University Science Books, Ann Arbor, Mich.
   Doolittle, R.F. 1989. Redundancies in protein sequences. In Prediction of Protein Structure and the Principles of Protein Conformation (G.D. Fasman, ed.) pp. 599‐623. Plenum, New York.
   Doolittle, R.F. 1990. What we have learned and will learn from sequence databases. In Computers and DNA, Santa Fe Institute (G. Bell and T. Marr, eds.) pp. 21‐31. Addison‐Wesley, Reading, Mass.
   Dumas, J.P. and Ninio, J. 1982. Efficient algorithm for folding and comparing nucleic acid sequences. Nucleic Acids Res. 10:197‐206.
   Eddy, S.R. 1998. Profile hidden Markov models. Bioinformatics 14:755‐763.
   Edgar, R.C. and Sjolander, K. 2004. COACH: Profile‐profile alignment of protein families using hidden Markov models. Bioinformatics 20:1309‐1318.
   Eisen, M.B., Spellman, P.T., Brown, P.O., and Botstein, D. 1998. Cluster analysis and display of genome‐wide expression patterns. Proc. Natl. Acad. Sci. U.S.A. 95:14863‐14868.
   Eroshkin, A.M., Zhilkin, P.A., and Fomin, V.I. 1993. Algorithm and computer program: Pro_Anal for analysis of relationship between structure and activity in a family of proteins or peptides. CABIOS 9:491‐497.
   Falquet, L., Pagni, M., Bucher, P., Hulo, N., Sigrist, C.J., Hofmann, K., and Bairoch, A. 2002. The PROSITE database, its status in 2002. Nucleic Acids Res. 30:235‐238.
   Felsenstein, J. 1989. PHYLIP ‐ Phylogeny Inference Package (Version 3.2). Cladistics 5:164‐166.
   Feng, D.F. and Doolittle, R.F. 1987. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. 25:351‐360.
   Finkelstein, A.V. and Ptitsyn, O.B. 1987. Why do globular proteins fit the limited set of folding patterns? Prog. Biophys. Mol. Biol. 50:171‐190.
   Finn, R.D., Mistry, J., Schuster‐Bockler, B., Griffiths‐Jones, S., Hollich, V., Lassmann, T., Moxon, S., Marshall, M., Khanna, A., Durbin, R., Eddy, S.R., Sonnhammer, E.L., and Bateman, A. 2006. Pfam: Clans, web tools and services. Nucleic Acids Res. 34:D247‐D251.
   Fitch, W.M. 1966. An improved method of testing for evolutionary homology. J. Mol. Biol. 16:9‐16.
   Fitch, W.M. 1969. Locating gaps in amino acid sequences to optimize the homology between two proteins. Biochem. Genet. 3:99‐108.
   Fitch, W.M. 1970. Distinguishing homologous from analogous proteins. Syst. Zool. 19:99‐113.
   Fuchs, R. 1994. Fast protein block searches. CABIOS 10:79‐80.
   Garnier, J., Osguthorpe, D.J., and Robson, B. 1978. Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. J. Mol. Biol. 120:97‐120.
   Genetics Computer Group. 1994. GCG Program Manual for the Wisconsin Package, Version 8, September 1994. Genetics Computer Group Inc., Madison, Wis.
   George, D., Hunt, L.T., and Barker, W.C. 1990. Mutation data matrix and its uses. Methods Enzymol. 183:333‐351.
   Gibbs, A.J. and McIntyre, G.A. 1970. The diagram, a method for comparing sequences. Eur. J. Biochem. 16:1‐11.
   Henikoff, S. and Henikoff, J.G. 1993. Performance evaluation of amino acid substitution matrices. Proteins Struct. Funct. Genet. 17:49‐61.
   Henikoff, J.G., Greene, E.A., Pietrokovski, S., and Henikoff, S. 2000. Increased coverage of protein families with the blocks database servers. Nucleic Acids Res. 28:228‐230.
   Heringa, J., Sommerfeldt, H., Higgins, D.G., and Argos, P. 1992. OBSTRUCT: A program to obtain the largest cliques from a protein sequence set according to structural resolution and sequence similarity. CABIOS 8:599‐600.
   Hodgman, T.C. 1992. Nucleic acid and protein sequence management. In Microcomputers in Biochemistry: A Practical Approach (C.F.A. Bryce, ed.) pp. 131‐158. IRL Press, Oxford.
   Huang, H., Barker, W.C., Chen, Y., and Wu, C.H. 2003. iProClass: An integrated database of protein family, function and structure information. Nucleic Acids Res. 31:390‐392.
   Junier, T. and Pagni, M. 2000. Dotlet: Diagonal plots in a web browser. Bioinformatics 16:178‐179.
   Kanaoka, M., Kishimoto, F., Ueki, Y., and Umeyama, H. 1989. Alignment of protein sequences using the hydrophobic core scores. Protein Eng. 2:347‐351.
   Karlin, S.P., Morris, M., Ghandour, G., and Leung, M.‐Y. 1988. Algorithms for identifying local molecular sequence features. CABIOS 4:41‐51.
   Karlin, S.P., Ost, F., and Blaisdell, B.E. 1989. Patterns in DNA and amino acid sequences and their statistical significance. In Mathematical Methods for DNA Sequences (M.S. Waterman, ed.) pp. 133‐157. CRC Press, Boca Raton, Fla.
   Karlin, S., Bucher, P., and Brendel, V. 1991. Statistical methods and insights for protein and DNA sequences. Annu. Rev. Biophys. Chem. 20:175‐203.
   Karplus, K., Barrett, C., and Hughey, R. 1998. Hidden Markov models for detecting remote protein homologies. Bioinformatics 14:846‐856.
   Koonin, E.V., Makarova, K.S., and Aravind, L. 2001. Horizontal gene transfer in prokaryotes: quantification and classification. Annu. Rev. Microbiol. 55:709‐742.
   Kruskal, J.B. 1983. An overview of sequence comparison. In Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison (D. Sankoff and J.B. Kruskal, eds.) pp. 1‐44. Addison‐Wesley, Reading, Mass.
   Kruskal, J.B. and Sankoff, D. 1983. An anthology of algorithms and concepts for sequence comparison. In Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison (D. Sankoff and J.B. Kruskal, eds.) pp. 265‐310. Addison‐Wesley, Reading, Mass.
   Kyte, J. and Doolittle, R.F. 1982. A simple method for displaying the hydrophobic character of a protein. J. Mol. Biol. 157:105‐132.
   Landau, G.M., Vishkin, U., and Nussinov, R. 1988. Locating alignments with k differences for nucleotide and amino acid sequences. CABIOS 4:19‐24.
   Landau, G.M., Vishkin, U., and Nussinov, R. 1990. Fast alignment of DNA and protein sequences. Methods Enzymol. 183:487‐502.
   Landes, C., Henaut, A., and Risler, J.‐L. 1993. Dot‐plot comparisons by multivariate analysis (DOCMA): A tool for classifying protein sequences. CABIOS 9:191‐196.
   Lipman, D.J. and Pearson, W.R. 1985. Rapid and sensitive protein similarity searches. Science 227:1435‐1441.
   Livingstone, C.D. and Barton, G.J. 1993. Protein sequence alignments: A strategy for the hierarchical analysis of residue conservation. CABIOS 9:745‐756.
   Madera, M. and Gough, J. 2002. A comparison of profile hidden Markov model procedures for remote homology detection. Nucleic Acids Res. 30:4321‐4328.
   Maizel, J.V. and Lenk, R.P. 1981. Enhanced graphic matrix analysis of nucleic acids and protein sequences. Proc. Natl. Acad. Sci. U.S.A. 78:7665‐7669.
   McLachlan, A.D. 1971. Test for comparing related amino acid sequences: Cytochrome c and cytochrome c‐551. J. Mol. Biol. 61:409‐424.
   Mrazek, J. and Kypr, J. 1993. UNIREP: A microcomputer program to find unique and repetitive nucleotide sequences in genomes. CABIOS 9:355‐360.
   Nedde, D.N. and Ward, M.O. 1993. Visualizing relationships between nucleic acid sequences using correlation images. CABIOS 9:331‐335.
   Needleman, S.B. and Wunsch, C.D. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48:443‐453.
   Notredame, C., Holm, L., and Higgins, D. 1998. COFFEE: A new objective function for multiple sequence alignment. Bioinformatics 14:407‐422.
   Notredame, C., Higgins, D., and Heringa, J. 2000. T‐Coffee: A novel method for multiple sequence alignments. J. Mol. Biol. 302:205‐217.
   Panjukov, V.V. 1993. Finding steady alignments: Similarity and distance. CABIOS 9:285‐290.
   Pearson, W.R. 1990. Rapid and sensitive comparison with FASTP and FASTA. Methods Enzymol. 183:63‐98.
   Pearson, W.R. 1994. Using the FASTA program to search protein and DNA sequence databases. Methods Mol. Biol. 24:365‐389.
   Pearson, W.R. and Miller, W. 1992. Dynamic programming algorithms for biological sequence comparison. Methods Enzymol. 210:576‐610.
   Pevzner, P.A. 1992. Statistical distance between texts and filtration methods in sequence comparison. CABIOS 8:121‐127.
   Pizzi, E.M., Attimonelli, M., Liuni, S., Frontali, C., and Saccone, C. 1991. A simple method for global sequence comparison. Nucleic Acids Res. 20:131‐136.
   Raghava, G.P., Searle, S.M., Audley, P.C., Barber, J.D., and Barton, G.J. 2003. OXBench: A benchmark for evaluation of protein multiple sequence alignment accuracy. BMC Bioinformatics 4:47.
   Sankoff, D., Kruskal, J., and Nerbonne, J. (eds) 2000. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Cambridge University Press.
   Sellers, P.H. 1974. On the theory and computation of evolutionary distances. SIAM J. Appl. Math. 26:787‐793.
   Smith, R.F. and Smith, T.F. 1992. Pattern‐induced multisequence alignment (PIMA) algorithm employing secondary structure‐dependent gap penalties for use in comparative protein modeling. Protein Eng. 5:35‐41.
   Smith, T.F. and Waterman, M.S. 1981. Comparative biosequence metrics. J. Mol. Evol. 18:38‐46.
   Soding, J. 2005. Protein homology detection by HMM‐HMM comparison. Bioinformatics 21:951‐960.
   Sonnhammer, E.L. and Durbin, R. 1995. A dot‐matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis. Gene 167:GC1‐10.
   Sonnhammer, E.L. and Wootton, J.C. 2001. Integrated graphical analysis of protein sequence features predicted from sequence composition. Proteins 45:262‐273.
   Staden, R. 1994a. Statistical and structural analysis of protein sequences. Methods Mol. Biol. 24:125‐130.
   Staden, R. 1994b. Searching for motifs in protein sequences. Methods Mol. Biol. 24:131‐139.
   Staden, R. 1994c. Using patterns to analyze protein sequences. Methods Mol. Biol. 24:141‐154.
   Staden, R. 1994d. Comparing sequences. Methods Mol. Biol. 24:155‐170.
   States, D.J. 1992. Molecular sequence accuracy: Analyzing imperfect data. Trends Genet. 8:52‐55.
   States, D.J. and Boguski, M.S. 1990. Sequence Analysis Primer. Stockton Press, New York.
   Streletc, V.B., Shindyalov, I.N., Kolchanov, N.A., and Lim, H.A. 1991. Fast, statistically based alignment of amino acid sequences on the base of diagonal fragments of dot matrices. CABIOS 8:529‐534.
   Swofford, D.L. 2002. PAUP 4.0: Phylogenetic Analysis Using Parsimony (And Other Methods). Sinauer Associates, Sunderland, Mass.
   Tatusov, R.L., Koonin, E.V., and Lipman, D.J. 1997. A genomic perspective on protein families. Science 278:631‐637.
   Thompson, J.D., Higgins, D.G., and Gibson, T.J. 1994. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position‐specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673‐4680.
   Wan, X. and Xu, D. 2005. Computational methods for remote homolog identification. Curr. Protein Pept. Sci. 6:527‐546.
   Waterman, M.S. 1989. Sequence alignments. In Mathematical Methods for DNA Sequences (M.S. Waterman, ed.) pp. 53‐90. CRC Press, Boca Raton, Fla.
   Waterman, M.S. 1990. Consensus patterns in sequences. In Mathematical Methods for DNA Sequences (M.S. Waterman, ed.) pp. 93‐115. CRC Press, Boca Raton, Fla.
   Waterman, M.S. and Eggert, M. 1987. A new algorithm for best subsequence alignments with application to tRNA‐rRNA comparisons. J. Mol. Biol. 197:723‐728.
   Waterman, M.S. and Jones, R. 1990. Consensus methods for DNA and protein sequence alignment. Methods Enzymol. 183:221‐237.
   Wilbur, W.J. and Lipman, D.J. 1983. Rapid similarity searches of nucleic acid and protein data banks. Proc. Natl. Acad. Sci. U.S.A. 80:726‐730.
   Xu, D., Xu, Y., and Uberbacher, E.C. 2000. Computational tools for protein modeling. Curr. Protein Pept. Sci. 1:1‐21.
   Yona, G. and Levitt, M. 2002. Within the twilight zone: A sensitive profile‐profile comparison tool based on information theory. J. Mol. Biol. 315:1257‐1275.