An Overview of Multiple Sequence Alignment

Victor Simossis1, Jens Kleinjung1, Jaap Heringa1

1 Integrative Bioinformatics Institute (IBIVU), Free University, Amsterdam
Publication Name:  Current Protocols in Bioinformatics
Unit Number:  Unit 3.7
DOI:  10.1002/0471250953.bi0307s03
Online Posting Date:  November, 2003
GO TO THE FULL TEXT: PDF or HTML at Wiley Online Library

Abstract

Multiple sequence alignment is perhaps the most commonly applied bioinformatics technique. It often leads to fundamental biological insight into sequence‐structure‐function relationships of nucleotide or protein sequence families. In this unit, an overview of multiple sequence alignment techniques is presented, covering a history of nearly 30 years from the early pioneering methods to the current state‐of‐the‐art techniques. Methodological and biological issues and end‐user considerations, as well as alignment evaluation issues, are discussed.

     
 
GO TO THE FULL PROTOCOL:
PDF or HTML at Wiley Online Library

Table of Contents

  • MSA Methodology
  • MSA Methods
  • Assessment of MSA
  • Conclusion
  • Literature Cited
  • Figures
  • Tables
     
 
GO TO THE FULL PROTOCOL:
PDF or HTML at Wiley Online Library

Materials

GO TO THE FULL PROTOCOL:
PDF or HTML at Wiley Online Library

Figures

Videos

Literature Cited

   Altschul, S.F. and Erickson, B.W. 1986. A nonlinear measure of subalignment similarity and its significance levels. Bull. Math. Biol. 48:617‐632.
   Altschul, S.F., Carrol, R.J., and Lipman, D.J. 1989. Weights for data related by a tree. J. Mol. Biol. 207:647‐653.
   Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI‐BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25:3389‐3402.
   Bailey, T.L. and Elkan, C. 1994. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology. pp. 28‐36. AAAI Press, Menlo Park, Calif.
   Bailey, T.L. and Elkan, C. 1995. The value of prior knowledge in discovering motifs with MEME. In Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology. pp. 21‐29. AAAI Press, Menlo Park, Calif.
   Barton, G.J. and Sternberg, M.J.E. 1987. A strategy for the rapid multiple sequence alignment of protein sequences: Confidence levels from tertiary structure comparisons. J. Mol. Biol. 198:327‐337.
   Benner, S.A., Cohen, M.A., and Gonnet, G.H. 1993. Empirical and structural models for insertions and deletions in the divergent evolution of proteins. J. Mol. Biol. 229:1065‐1082.
   Blanken, R.L., Klotz, L.C., and Hinnebusch, A.G. 1982. Computer comparison of new and existing criteria for constructing evolutionary trees from sequence data. J. Mol. Evol. 19:9‐19.
   Boguski, M.S., Hardison, R.C., Schwartz, S., and Miller, W. 1992. Analysis of conserved domains and sequence motifs on cellular regulatory proteins and locus control regions using new software tools for multiple sequence alignment and visualization. New Biol. 4:247‐260.
   Brocchieri, L. and Karlin, S. 1998. A symmetric‐iterated multiple sequence alignment of protein sequences. J. Mol. Biol. 276:249‐264.
   Bucher, P., Karplus, K., Moeri, N., and Hofmann, K. 1996. A flexible motif search technique based on generalized profiles. Comput. Chem. 20:3‐24.
   Carillo, H. and Lipman, D.J. 1988. The multiple sequence alignment problem in biology. SIAM J. Appl. Math. 48:1073‐1082.
   Corpet, F. 1988. Multiple sequence alignment with hierarchical clustering. Nucl. Acids Res. 16:10881‐10890.
   Dayhoff, M.O., Schwart, R.M., and Orcutt, B.C. 1978. A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure (M. Dayhoff, ed.) pp. 345‐352. National Biomedical Research Foundation, Washington D.C.
   Dayhoff, M.O., Barker, W.C., and Hunt, L.T. 1983. Establishing homologies in protein sequences. Methods Enzymol. 91:524‐545.
   Delcher, A., Phillippy, A., Carlton, J., and Salzberg, S.L. 2002. Fast algorithms for large‐scale genome alignment and comparison. Nucleic Acids Res. 30:2478‐2483.
   Delcher, A.L., Kasif, S., Fleischmann, R.D., Peterson, J., White, O., and Salzberg, S.L. 1999. Alignment of whole genomes. Nucleic Acids Res. 27:2369‐2376.
   Deperieux, E. and Feytmans, E. 1992. Match‐Box: A fundamentally new algorithm for the simultaneous alignment of several protein sequences. Comp. Appl. Biosci. 8:501‐509.
   Durbin, R., Eddy, S.R., Krogh, A., and Mitchison, G. (eds.). 1998. Biological sequence analysis, Ch. 5.6, pp. 115‐119. Cambridge University Press, Cambridge.
   Eck, R.V. and Dayhoff, M.O. 1966. Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Silver Springs, Maryland.
   Eddy, S.R. 1998. Profile hidden Markov models. Bioinformatics 14:755‐763.
   Felsenstein, J. 1981. Evolutionary trees from DNA sequences: A maximum likelihood approach. J. Mol. Evol. 17:368‐376.
   Felsenstein, J. 1985. Confidence limits on phylogenies: An approach using the bootstrap. Evolution 39:783‐791.
   Feng, D.F. and Doolittle, R.F. 1987. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. 25:351‐360.
   Ficket, J.W. 1984. Fast optimal alignment. Nucl. Acids Res. 12:175‐180.
   Fitch, W. 1966. An improved method of testing for evolutionary homology. J. Mol. Biol. 16:9‐16.
   Fitch, W.M. and Margoliash, E. 1967. Construction of phylogenetic trees. Science 155:279‐284.
   Fitch, W.M. 1971. Toward defining the course of evolution: Minimum change for a specific tree topology. Syst. Zool. 20:406‐416.
   Frishman, D. and Argos, P. 1996. Incorporation of long‐distance interactions in a secondary structure prediction method. Protein Eng. 9:133‐142.
   Frishman, D. and Argos, P. 1997. Seventy‐five percent accuracy in protein secondary structure prediction. Proteins. 27:329‐335.
   GCG. 1993. Program manual for the GCG Package, v. 8. Genetics Computer Group, Madison, Wis.
   Gibbs, A.J. and McIntyre, G.A. 1970. The diagram: A method for comparing sequences. Its use with amino acid and nucleotide sequences. Eur. J. Biochem. 16:1‐11.
   Goldberg, D.E. 1989. Genetic Algorithms in Search, Optimization and Machine Learning, Addison‐Wesley, New York.
   Gonnet, G.H., Korostensky, C., and Benner, S. 2000. Evaluation measures of multiple sequence alignments. J. Comput. Biol. 7:261‐276.
   Gotoh, O. 1986. Alignment of three biological sequences with an efficient traceback procedure. J. Theor. Biol. 121:327‐337.
   Gotoh, O. 1993. Optimal alignment between groups of sequences and its application to multiple sequence alignment. Comput. Appl. Biosci. 9:361‐370.
   Gotoh, O. 1996. Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J. Mol. Biol. 264:823‐838.
   Gribskov, M., McLachlan, A.D., and Eisenberg, D. 1987. Profile analysis: Detection of distantly related proteins. Proc. Natl. Acad. Sci. U.S.A. 84:4355‐4358.
   Gropp, W., Lusk, E., Doss, N., and Skjellum, A. 1996. A high‐performance, portable implementation of the MPI Message‐Passing Interface standard. Parallel Computing 22:789‐828.
   Haussler, D., Krogh, A., Mian, I.S., and Sjölander, K. 1993. Protein modeling using hidden Markov models: Analysis of globins. In Proceedings of the Hawaii International Conference on System Sciences. pp. 792‐802. IEEE Computer Society Press, Los Alamitos, Calif.
   Henikoff, S. and Henikoff, J.G. 1994. Position‐based sequence weights. J. Mol. Biol. 243:574‐578.
   Henikoff, S. and Henikoff, J.G. 1999. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. U.S.A. 89:10915‐10919.
   Heringa, J. 1998. Detection of internal repeats: How common are they? Curr. Opin. Struct. Biol. 8:338‐345.
   Heringa, J. 1999. Two strategies for sequence comparison: Profile‐preprocessed and secondary structure‐induced multiple sequence alignment. Comput. Chem. 23: 341‐364.
   Heringa, J. 2002. Local weighting schemes for protein multiple sequence alignment. Comput. Chem. 26:459‐477.
   Heringa, J. and Taylor, W.R. 1997. Three‐dimensional domain duplication, swapping and stealing. Curr. Opin. Struct. Biol. 7:416‐21.
   Higgins, D.G. and Sharp, P.M. 1988. CLUSTAL: A package for performing multiple sequence alignment on a microcomputer. Gene 73:237‐244.
   Hogeweg, P. and Hesper, B. 1984. The alignment of sets of sequences and the construction of phyletic trees: An integrated method. J. Mol. Evol. 20:175‐186.
   Huang, X. and Miller, W. 1991. A time‐efficient, linear‐space local similarity algorithm. Adv. Appl. Math. 12:337‐357.
   Huang, X., Hardison, R.C., and Miller, W. 1990. A space‐efficient algorithm for local similarities. CABIOS 6:373‐381.
   Johnson, M.S. and Doolittle, R.F. 1986. A method for the simultaneous alignment of three or more amino acid sequences. J. Mol. Evol. 23:267‐278.
   Karplus, K. and Hu, B. 2001. Evaluation of protein multiple alignments by SAM‐T99 using the BAliBASE multiple alignment test set. Bioinformatics 17:713‐720.
   Karplus, K., Barrett, C., and Hughey, R. 1998. Hidden Markov models for detecting remote protein homologies. Bioinformatics 14:846‐856.
   Katoh, K., Misawa, K., Kuma, K., and Miyata, T. 2002. MAFFT: A novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30:3059‐3066.
   Kleinjung, J., Douglas, N., and Heringa, J. 2002. Parallelized multiple sequence alignment. Bioinformatics 18:1270‐1271.
   Kluge, A.G. and Farris, J.S. 1969. Quantitative phyletics and the evolution of anurans. Syst. Zool. 18:1‐32.
   Lassmann, T. and Sonnhammer, E.L.L. 2002. Quality assessment of multiple sequence alignment programs. FEBS Lett. 529:126‐130.
   Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F., and Wootton, J.C. 1993. Detecting subtle sequence signals: A Gibbs sampling strategy for multiple sequence alignment. Science 262:208‐214.
   Lee, C., Grasso, C., and Sharlow, M.F. 2002. Multiple sequence alignment using partial order graphs. Bioinformatics 18:452‐464.
   Lin, K.X., Kleinjung, J., Taylor, W.R, and Heringa, J. 2003. Testing homology with cao: A contact‐based markov model of protein evolution. Comp. Biol. Chem. 27:93‐102.
   Lipman, D.J., Altschul, S.F., and Kececioglu, J.D. 1989. A tool for multiple sequence alignment. Proc. Natl. Acad. Sci. U.S.A. 86:4412‐4415.
   Liu, J.S. and Lawrence, C.E. 1999. Bayesian inference on biopolymer models. Bioinformatics 15:38‐52.
   Lüthy, R., Xenarios, I., and Bucher, P. 1994. Improving the sensitivity of the sequence profile method. Protein Science 3:139‐146.
   Morgenstern, B. 1999. Dialign 2: Improvement of the segment‐to‐segment approach to multiple sequence alignment. Bioinformatics 15:211‐218.
   Morgenstern B., Dress, A., and Werner, T. 1996. Multiple DNA and protein sequence alignment based on segment‐to‐segment comparison. Proc. Natl. Acad. Sci. U.S.A. 93:12098‐12103.
   Müller, T. and Vingron, M. 2000. Modelling amino acid replacement. J. Comput. Biol. 7:761‐776.
   Murata, M., Richardson, J.S., and Sussman, J.L. 1985. Simultaneous comparison of three protein sequences. Proc. Natl. Acad. Sci. U.S.A. 82:3073‐3077.
   Needleman, S.B. and Wunsch, C.D. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48: 443‐453.
   Notredame, C. and Higgins, D.G. 1996. SAGA: Sequence alignment by genetic algorithm. Nucl. Acids Res. 24: 1515‐24.
   Notredame, C., Holm, L., and Higgins, D.G. 1998. COFFEE: An objective function for multiple sequence alignments. Bioinformatics 14:407‐422.
   Notredame, C., Higgins, D.G., and Heringa, J. 2000. T‐Coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302:205‐217.
   Pacheco, P.S. 1997. Parallel Programming with MPI. Morgan Kaufman Publishers, Inc., San Francisco.
   Reinert, K., Stoye, J., and Will, T. 2000. An iterative method for faster sum‐of‐pair multiple sequence alignment. Bioinformatics 16:808‐814.
   Rost, B. and Sander, C. 1993. Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol. 232:584‐599.
   Saitou, N. and Nei, M. 1987. The neighbor‐joining method: A new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406‐425.
   Saitou, N. 1990. Maximum likelihood methods. Meth. Enzymol. 183:584‐598.
   Schneider, T.D. 2002. Consensus sequence zen. Appl. Bioinformatics 1:111‐119.
   Schuler, G.D., Altschul, S.F., and Lipman, D.J. 1991. A workbench for multiple sequence alignment construction and analysis. Proteins 9:180‐190.
   Sibbald, P.R. and Argos, P. 1990. Weighting aligned protein or nucleic acid sequences to correct for unequal representation. J. Mol. Biol. 216:813‐818.
   Sjölander, K., Karplus, K., Brown, M., Hughly, R., Krogh, A., Mian, I., and Haussler, D. 1996. Dirichlet mixtures: A method for improved detection of weak but significant protein sequence homology. CABIOS 12:327‐345.
   Smith, T.F. and Waterman, M.S. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147:195‐197.
   Sneath, P.H. and Sokal, R.R. 1973. Numerical Taxonomy. Freeman, San Francisco.
   Sobel, E. and Martinez, H.M. 1986. A multiple sequence alignment program. Nucl. Acids Res. 14:363‐374.
   Sokal, R.R. and Michener, C.D. 1958. A statistical method for evaluating systematic relationships. Univ. Kansas Sci. Bull. 28:1409‐1438.
   Stoye, J. 1998. Multiple sequence alignment with the divide‐and‐conquer method. Gene 211:GC45‐GC56.
   Stoye, J., Moulton, V., and Dress, A.W.M. 1997. DCA: An efficient implementation of the divide‐and‐conquer approach to simultaneous multiple sequence alignment. Comput. Appl. Biosci. 13:625‐626.
   Tanner, M.A., and Wong, W.H. 1987. The calculation of posterior distributions by data augmentation. J. Am. Stat. Assoc. 82:528‐550.
   Taylor, W.R. 1988. A flexible method to align large numbers of biological sequences. J. Mol. Evol. 28:161‐169.
   Thompson, J.D., Higgins, D.G., and Gibson, T.J. 1994a. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position‐specific gap penalties and weight matrix choice. Nucl. Acids Res. 22:4673‐80.
   Thompson J.D., Higgins, D.G., and Gibson, T.J. 1994b. Improved sensitivity of profile searched through the use of sequence weights and gap excision. CABIOS 10:19‐29.
   Thompson, J.D., Gibson, T.J., Plewniak, F., Jeanmougin, F., and Higgins, D.G. 1997. The CLUSTAL X windows interface: Flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res. 25:4876‐4882.
   Thompson, J.D., Plewniak, F., and Poch, O. 1999a. BAliBASE: A benchmark alignments database for the evaluation of multiple sequence alignment programs. Bioinformatics 15:87‐88.
   Thompson, J.D., Plewniak, F., and Poch, O. 1999b. A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res. 27:2682‐2690.
   Vingron, M. and Argos, P. 1989. A fast and sensitive multiple sequence alignment program. CABIOS 5:115‐121.
   Vingron, M. and Sibbald, P.R. 1993. Weighting in sequence space: A comparison of methods in terms of generalized sequences. Proc. Natl. Acad. Sci. U.S.A. 90:8777‐8781.
   Viterbi, A.J. 1967. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Informat. Theory IT‐13:260‐269.
   Wang, L. and Jiang, T. 1994. On the complexity of multiple sequence alignment. J. Comput. Biol. 1:723‐728.
   Waterman, M.S. 1986. Multiple sequence alignment by consensus. Nucl. Acids Res. 14:9095‐9102.
   Waterman, M.S. and Eggert, M. 1987. A new algorithm for best subsequence alignments with application to tRNA‐rRNA comparisons. J. Mol. Biol. 197:723‐728.
   Waterman, M.S. and Jones, R. 1990. Consensus methods for DNA and protein sequence alignment. Methods Enzymol. 183:221‐237.
   Zhu J., Liu, J.S., and Lawrence, C. 1998. Bayesian adaptive sequence alignment algorithms. Bioinformatics 14:25‐31.
Key References
   Dayhoff et al., 1978. See above.
  This atlas represents a seminal approach to sequence alignment. The evolutionary model that is still used today in most methods is introduced here, together with the now classical PAM series of amino acid exchange matrices. The evolutionary model is often referred to as the Dayhoff model, while the most widely used early matrix in the PAM series, the PAM250 matrix, is commonly known as the Dayhoff matrix.
   Felsenstein, 1981. See above.
  In this paper, the important evolutionary method of maximum likelihood is introduced. The method, which attempts to find the tree that maximizes the probability that the observed data will fit the tree under a given evolutionary model, is now generally accepted as the most accurate strategy.
   Hogeweg and Hesper, 1984. See above.
  This paper introduces the progressive multiple alignment strategy, which is still the most widely used multiple alignment technique. In this early paper, alignment iteration is already addressed. Another interesting feature of the paper is the use of so‐called internode sequences, which are additionally inferred sequences ancestral to subgroups of sequences in the phylogenetic tree calculated for the query sequence set.
   Needleman and Wunsch, 1970. See above.
  This is one of the most quoted early papers on sequence alignment. In this paper, the global dynamic programming algorithm is introduced to the biological community and applied to pairwise amino acid sequences. The basic dynamic programming algorithm had been conceived before by the physicist Richard Belman, who published a large series of papers and books on the topic during the 1950s and 60s.
   Smith and Waterman, 1981. See above.
  Following the approach by Needleman and Wunsch (), Smith and Waterman derived the now classical algorithm for local pair‐wise sequence alignment. Most widely used homology search engines such as BLAST are fast approximations of the Smith and Waterman algorithm and perform local alignment.
Internet Resources
GO TO THE FULL PROTOCOL:
PDF or HTML at Wiley Online Library