Using MUMmer to Identify Similar Regions in Large Sequence Sets

Arthur L. Delcher¹, Steven L. Salzberg², Adam M. Phillippy²

¹ The Institute for Genomic Research, Rockville, Maryland, and Computer Science Department, Loyola College in Maryland, Baltimore, Maryland; ² The Institute for Genomic Research, Rockville, Maryland
Publication Name:  Current Protocols in Bioinformatics
Unit Number:  Unit 10.3
DOI:  10.1002/0471250953.bi1003s00
Online Posting Date:  February, 2003

Abstract

The MUMmer sequence alignment package is a suite of computer programs designed to detect regions of homology in long biological sequences. Version 2.1 makes several improvements to the package, including: increased speed and reduced memory requirements; the ability to handle both protein and DNA sequences; the ability to handle multiple sequence fragments; and new algorithms for clustering together basic matches. The system is particularly efficient at comparing highly similar sequences, such as alternative versions of fragment assemblies or closely related strains of the same bacterium.

     
 

Table of Contents

  • Basic Protocol 1: MUMmer2: Comparing a Set of Sequences to a Single Reference Sequence
  • Alternate Protocol 1: NUCmer: Comparing a Set of Sequences to Another Set of Sequences
  • Alternate Protocol 2: PROmer: Comparing Sequences Using Protein Translations
  • Alternate Protocol 3: MUMmer1: Aligning Two Single Sequences
  • Support Protocol 1: Obtaining and Installing the MUMmer Package
  • Guidelines for Understanding Results
  • Commentary
  • Figures
     
 

Materials

Basic Protocol 1: MUMmer2: Comparing a Set of Sequences to a Single Reference Sequence

  Necessary Resources
  • Hardware
    • Unix or Linux workstation. The largest program used in this protocol requires main memory of approximately 20 bytes per base of reference sequence plus 1 byte per base of query sequence. Thus, to compare 2 million bases of query sequence to 3 million bases of reference sequence, the computer should have at least (20 bytes × 3 million bases) + (1 byte × 2 million bases) = 62 MB of main memory.
  • Software
    • MUMmer 2.12 package (see Support Protocol 1 for download and installation)
  • Files
    • A multi‐FASTA query file and a single‐FASTA reference file (see appendix 1B for information on FASTA). The files used in this example are complete genomic sequences from two strains of Helicobacter pylori, known as 26695 and J99. These sequences can be downloaded from TIGR's Comprehensive Microbial Resource at http://www.tigr.org/CMR, from the NCBI at http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/micr.html, or from the Current Protocols in Bioinformatics Web site at http://www3.interscience.wiley.com/c_p/cpbi_sampledatafiles.htm.
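
The memory rule of thumb in the hardware note above can be checked before starting a run. The following is a minimal sketch, not part of the MUMmer package: the function name is hypothetical, and it assumes that 1 MB means 10^6 bytes and that sequence lengths are counted in bases. The figures are the approximate ones quoted above, so treat the result as a lower bound rather than an exact requirement.

```python
def mummer2_memory_bytes(reference_bases: int, query_bases: int) -> int:
    """Approximate main memory (in bytes) suggested by the rule of thumb.

    Hypothetical helper: about 20 bytes per reference base plus
    about 1 byte per query base.
    """
    bytes_per_reference_base = 20
    bytes_per_query_base = 1
    return (bytes_per_reference_base * reference_bases
            + bytes_per_query_base * query_bases)


# Worked example from the text: 3 million reference bases, 2 million query bases.
estimate = mummer2_memory_bytes(reference_bases=3_000_000, query_bases=2_000_000)
print(f"{estimate / 1_000_000:.0f} MB")  # prints "62 MB"
```

Because the reference sequence dominates the cost, choosing the shorter of two inputs as the reference keeps the estimate lower when either assignment is acceptable.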

Alternate Protocol 1: NUCmer: Comparing a Set of Sequences to Another Set of Sequences

  Necessary Resources
  • Hardware
    • Unix or Linux workstation. The largest program in the suite requires main memory of approximately 20 bytes per base of reference sequence plus 1 byte per base of query sequence. Thus, to compare 2 million bases of query sequence to 3 million bases of reference sequence, the computer should have at least (20 bytes × 3 million bases) + (1 byte × 2 million bases) = 62 MB of main memory.
  • Software
    • NUCmer is included in the MUMmer 2.12 package (see Support Protocol 1 for download and installation)
  • Files
    • A multi‐FASTA query file and a multi‐FASTA reference file (see appendix 1B for information on FASTA). The files used in this example are sequences extracted from alignment regions of the H. pylori genomes used in Basic Protocol 1. File 26695parts.seq has five 2‐kb sequences extracted in order from file hp26695.seq, and file j99parts.seq has five corresponding 2‐kb sequences from file hpj99.seq but in permuted order, with two sequences reversed. The positions of the sequences in the files from which they were extracted are indicated in the FASTA header lines in the files. These files can be obtained from the Current Protocols in Bioinformatics Web site at http://www3.interscience.wiley.com/c_p/cpbi_sampledatafiles.htm. The first field after the > on the FASTA header line of each sequence will be used to identify each sequence; therefore, these fields should be unique both within and between the query and reference files.
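
The uniqueness requirement on FASTA identifiers can be checked before running the comparison. The sketch below is an illustration only and is not part of MUMmer: the function names are hypothetical, the headers in the example are invented rather than taken from the sample files, and it assumes that the identifier is the first whitespace-delimited field after the >.

```python
from collections import Counter
from typing import Iterable, List, Set


def fasta_ids(lines: Iterable[str]) -> List[str]:
    """Return the first field after '>' from every FASTA header line."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            ids.append(fields[0] if fields else "")
    return ids


def duplicated_ids(query_lines: Iterable[str],
                   reference_lines: Iterable[str]) -> Set[str]:
    """Identifiers that occur more than once within or between the two inputs."""
    counts = Counter(fasta_ids(query_lines))
    counts.update(fasta_ids(reference_lines))
    return {seq_id for seq_id, n in counts.items() if n > 1}


# Invented headers for illustration; "seqB" appears in both inputs.
query = [">seqA 1-2000", "ACGT", ">seqB 5000-7000", "GGCC"]
reference = [">seqB 100-2100", "TTAA", ">seqC 9000-11000", "ACAC"]
print(sorted(duplicated_ids(query, reference)))  # prints "['seqB']"
```

Both functions accept any iterable of lines, so open file handles for the query and reference files can be passed directly in place of the in-memory lists.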

Alternate Protocol 2: PROmer: Comparing Sequences Using Protein Translations

  Necessary Resources
  • Hardware
    • Unix or Linux workstation. The largest program in the suite requires main memory of approximately 20 bytes per base of reference sequence plus 1 byte per base of query sequence. Thus, to compare 2 million bases of query sequence to 3 million bases of reference sequence, the computer should have at least (20 bytes × 3 million bases) + (1 byte × 2 million bases) = 62 MB of main memory.
  • Software
    • PROmer is included in the MUMmer 2.12 package (see Support Protocol 1 for download and installation)
  • Files
    • A multi‐FASTA query file and a multi‐FASTA reference file (see appendix 1B for information on FASTA). The files used in this example are sequences extracted from alignment regions of the H. pylori genomes used in Basic Protocol 1. File 26695parts.seq has five 2‐kb sequences extracted in order from file hp26695.seq, and file j99parts.seq has five corresponding 2‐kb sequences from file hpj99.seq but in permuted order, with two sequences reversed. The positions of the sequences in the files from which they were extracted are indicated in the FASTA header lines in the files. These files can be obtained from the Current Protocols in Bioinformatics Web site at http://www3.interscience.wiley.com/c_p/cpbi_sampledatafiles.htm. The first field after the > on the FASTA header line of each sequence will be used to identify each sequence; therefore, these fields should be unique both within and between the query and reference files.

Alternate Protocol 3: MUMmer1: Aligning Two Single Sequences

  Necessary Resources
  • Hardware
    • Same as the MUMmer2 protocol (Basic Protocol 1), except that more memory is required: ∼25 bytes per base of both the query and reference sequences. Thus, comparing two 2‐megabase genomes requires ∼100 MB of main memory.
  • Software
    • The MUMmer1 script is included in the MUMmer 2.12 package (see Support Protocol 1 for download and installation)
  • Files
    • A multi‐FASTA query file and a single‐FASTA reference file (see appendix 1B for information on FASTA). The files used in this example are complete genomic sequences from two strains of Helicobacter pylori, known as 26695 and J99. These sequences can be downloaded from TIGR's Comprehensive Microbial Resource at http://www.tigr.org/CMR, from the NCBI at http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/micr.html, or from the Current Protocols in Bioinformatics Web site at http://www3.interscience.wiley.com/c_p/cpbi_sampledatafiles.htm. These are the same files used in the example for Basic Protocol 1.
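
The MUMmer1 memory figure in the hardware note above follows from charging roughly 25 bytes to every base of both inputs. As a rough sketch (the function name is hypothetical, and 1 MB is assumed to mean 10^6 bytes), the arithmetic is:

```python
def mummer1_memory_bytes(reference_bases: int, query_bases: int) -> int:
    """Approximate main memory (in bytes) for a MUMmer1 run.

    Hypothetical helper: about 25 bytes per base, charged to both the
    reference and the query sequence.
    """
    bytes_per_base = 25
    return bytes_per_base * (reference_bases + query_bases)


# Worked example from the text: two 2-megabase genomes.
estimate = mummer1_memory_bytes(reference_bases=2_000_000, query_bases=2_000_000)
print(f"{estimate / 1_000_000:.0f} MB")  # prints "100 MB"
```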

Literature Cited
   Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403‐410.
   Arabidopsis Genome Initiative 2000. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408:796‐815.
   Chang, W.I. and Lawler, E.L. 1994. Sublinear expected time approximate string matching and biological applications. Algorithmica 12:327‐344.
   Chao, K.M., Zhang, J., Ostell, J., and Miller, W. 1995. A local alignment tool for very long DNA sequences. Comput. Appl. Biosci. 11:147‐153.
   Delcher, A.L., Kasif, S., Fleischmann, R.D., Peterson, J., White, O., and Salzberg, S.L. 1999. Alignment of whole genomes. Nucleic Acids Res. 27:2369‐2376.
   Delcher, A.L., Phillippy, A., Carlton, J., and Salzberg, S.L. 2002. Fast algorithms for large‐scale genome alignment and comparison. Nucleic Acids Res. 30:2478‐2483.
   Eisen, J.A., Heidelberg, J.F., White, O., and Salzberg, S.L. 2000. Evidence for symmetric chromosomal inversions around the replication origin in bacteria. Genome Biol. 1:research11.01‐09.
   Gusfield, D. 1997. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, New York.
   Henikoff, J.G., Pietrokovski, S., McCallum, C.M., and Henikoff, S. 2000. Blocks‐based methods for detecting protein homology. Electrophoresis 21:1700‐1706.
   Kurtz, S. 1999. Reducing the space requirement of suffix trees. Software Practice and Experience 29:1149‐1171.
   Lin, X., Kaul, S., Rounsley, S., Shea, T.P., Benito, M.I., Town, C.D., Fujii, C.Y., Mason, T., Bowman, C.L., and Barnstead, M. et al. 1999. Sequence and analysis of chromosome 2 of the plant Arabidopsis thaliana. Nature 402:761‐768.
   Mural, R.J., Adams, M.D., Myers, E.W., Smith, H.O., Miklos, G.L.G., Wides, R., Halpern, A., Li, P.W., Sutton, G., and Nadeau, J. et al. 2002. A comparison of whole‐genome shotgun‐derived mouse chromosome 16 and the human genome. Science 296:1661‐1671.
   Pearson, W.R. 2000. Flexible sequence similarity searching with the FASTA3 program package. Methods Mol. Biol. 132:185‐219.
   Perna, N.T., Plunkett, G., 3rd, Burland, V., Mau, B., Glasner, J.D., Rose, D.J., Mayhew, G.F., Evans, P.S., Gregor, J., and Kirkpatrick, H.A. et al. 2001. Genome sequence of enterohaemorrhagic Escherichia coli O157:H7. Nature 409:529‐533.
   Schwartz, S., Zhang, Z., Frazer, K.A., Smit, A., Riemer, C., Bouck, J., Gibbs, R., Hardison, R., and Miller, W. 2000. PipMaker—a web server for aligning two genomic DNA sequences. Genome Res. 10:577‐586.
   Venter, J.C., Adams, M.D., Myers, E.W., Li, P.W., Mural, R.J., Sutton, G.G., Smith, H.O., Yandell, M., Evans, C.A., and Holt, R.A. et al. 2001. The sequence of the human genome. Science 291:1304‐1351.
Key References
   Delcher et al., 1999. See above.
  This describes the original MUMmer1 algorithm.
   Delcher et al., 2002. See above.
  This describes the enhancements in version 2 of MUMmer, including improved efficiency, more flexible clustering and alignment options, and the ability to handle files with multiple sequences.
   Gusfield, 1997. See above.
  This is a comprehensive treatment of suffix trees and sequence alignment algorithms for those interested in computer science details.
Internet Resources
   http://www.tigr.org/software/mummer
  The MUMmer homepage.