Discovering Novel Sequence Motifs with MEME

Timothy L. Bailey1

1 University of Queensland, Brisbane, Australia
Publication Name:  Current Protocols in Bioinformatics
Unit Number:  Unit 2.4
DOI:  10.1002/0471250953.bi0204s00
Online Posting Date:  November, 2002
GO TO THE FULL TEXT: PDF or HTML at Wiley Online Library

Abstract

This unit illustrates how to use MEME to discover motifs in a group of related nucleotide or peptide sequences. A MEME motif is a sequence pattern that occurs repeatedly in one or more sequences in the input group. MEME can be used to discover novel patterns because it bases its discoveries only on the input sequences, not on any prior knowledge (such as databases of known motifs). The input to MEME is a set of unaligned sequences of the same type (peptide or nucleotide). For each motif it discovers, MEME reports the occurrences (sites), consensus sequence, and the level of conservation (information content) at each position in the pattern. MEME also produces block diagrams showing where all of the discovered motifs occur in the training set sequences. MEME's hypertext (HTML) output also contains buttons that allow for the convenient use of the motifs in other searches.

     
 

Table of Contents

  • Basic Protocol 1: Discovering Motifs in a Protein Sequence Family Using MEME
  • Support Protocol 1: Searching for Other Proteins Containing the Same Motifs
  • Alternate Protocol 1: Finding Repeated Motifs in Protein Sequences
  • Basic Protocol 2: Discovering DNA Motifs in a set of DNA Sequences with MEME
  • Alternate Protocol 2: Finding Repeated Motifs in DNA Sequences with MEME
  • Guidelines for Understanding Results
  • Commentary
  • Literature Cited
  • Figures
  • Tables
     
 

Materials

Basic Protocol 1: Discovering Motifs in a Protein Sequence Family Using MEME

  Necessary Resources
  • Hardware
    • Computer connected to the Internet
    • Command‐line MEME works on many uniprocessor computers, some multiprocessor computers, and clusters that have the MPICH message‐passing software installed. A list of supported operating systems and their manufacturers is available at ftp://ftp.sdsc.edu/pub/sdsc/biology/meme/README.
  • Software
    • Web browser (e.g., Internet Explorer, Netscape Navigator)
    • E‐mail reader (e.g., Netscape Messenger)
    • Command‐line MEME (optional) (http://meme.sdsc.edu/meme/website/meme‐download.html)
    MEME can be used remotely over the Web (Web MEME), with results being returned by E‐mail, or it can be installed and run on the user's Unix‐based computer (command‐line MEME). The Web interface has the advantage of not requiring any software installation, but some MEME features are only available in the command‐line version. Command‐line MEME removes the restriction on the size of the training set imposed by the MEME Web server (maximum of 60,000 characters). Web access is free (currently available at http://meme.sdsc.edu and http://bioweb.pasteur.fr/seqanal/motif/meme). The command‐line version is free for noncommercial use or can be obtained with a commercial license, and can be downloaded over the Web (http://meme.sdsc.edu/meme/website/meme‐download.html).
    When using MEME via a Web interface, results will typically arrive within a few hours. It is not possible to predict when the MEME results will arrive because the computers on which MEME runs at SDSC and the Pasteur Institute are shared resources. Depending on the load, it can sometimes take a day or more for a job to be processed. Please be patient. This unpredictability can be avoided by installing command‐line MEME locally on the user's Unix‐based computer.
  • Files
    • A sequence file (training set) containing one or more protein sequences (FASTA format required for command‐line MEME; appendix 1B). Other formats, described on the MEME Web site, are supported if using MEME via the Web interface, but the total number of characters in the sequences may not exceed 60,000.
    There are many ways to construct a family of protein sequences for input to MEME. For example, file tf4.fasta contains a family of bacterial protein sequences related to Entrez sequence gi|15897224|ref|NP_341829.1|, hypothetical protein [Sulfolobus solfataricus]. It was constructed by doing a BLASTP search (unit 3.4) of the nonredundant protein database using the sequence named above (gi|15897224) as the query. The accession numbers of all of the sequences matching the query with BLAST E‐values ≤0.01 were then placed in file tf4.acc. Then, Batch Entrez (unit 1.3) was used with the file of accession numbers to download the sequences in FASTA format into file tf4.fasta.
    The data file used in this example (tf4.fasta) should be downloaded from the Current Protocols Web site (http://www3.interscience.wiley.com/c_p/cpbi_sampledatafiles.htm).
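The 60,000-character limit on Web submissions can be checked before uploading a training set. The following is a minimal sketch, not part of MEME itself: the function names are hypothetical, and it assumes that only residue characters (not header lines) count toward the limit, which the text does not state explicitly.

```python
from typing import Dict, Iterable

# Limit quoted in the text for the MEME Web server.
WEB_MEME_MAX_CHARS = 60_000


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {sequence name: sequence}."""
    records: Dict[str, str] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Use the first word after '>' as the sequence name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            if name in records:
                raise ValueError(f"duplicate sequence name: {name!r}")
            records[name] = ""
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[name] += line
    return records


def total_characters(records: Dict[str, str]) -> int:
    """Total number of residue characters across all sequences."""
    return sum(len(seq) for seq in records.values())


def fits_web_limit(records: Dict[str, str],
                   limit: int = WEB_MEME_MAX_CHARS) -> bool:
    """True if the training set is within the assumed Web server limit."""
    return total_characters(records) <= limit


# Small in-memory example; a real run would pass an open file handle instead.
example = [">seq1 hypothetical protein", "MKTLLV", "AACD", ">seq2", "GHIKL"]
records = read_fasta(example)
print(total_characters(records), fits_web_limit(records))
```

For a file on disk, `read_fasta(open("tf4.fasta"))` would produce the same dictionary; a training set that fails the check is a candidate for command‐line MEME rather than the Web interface.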

Support Protocol 1: Searching for Other Proteins Containing the Same Motifs

  Necessary Resources
  • Hardware
    • Computer connected to the Internet
  • Software
    • Web browser (e.g., Internet Explorer, Netscape Navigator)
    • E‐mail reader (e.g., Netscape Messenger)
  • Files
    • Results from MEME (see Basic Protocol 1)

Alternate Protocol 1: Finding Repeated Motifs in Protein Sequences

  Necessary Resources
  • Hardware
    • Computer connected to the Internet
    • Command‐line MEME works on many uniprocessor computers, some multiprocessor computers, and clusters that have the MPICH message‐passing software installed. A list of supported operating systems and their manufacturers is available at: ftp://ftp.sdsc.edu/pub/sdsc/biology/meme/README.
  • Software
    • Web browser (e.g., Internet Explorer, Netscape Navigator)
    • E‐mail reader
    • Command‐line MEME (optional)
    MEME can be used remotely over the Web (Web MEME), with results being returned by E‐mail, or it can be installed and run on the user's Unix‐based computer (command‐line MEME). The Web interface has the advantage of not requiring any software installation, but some MEME features are only available in the command‐line version. Command‐line MEME removes the restriction on the size of the training set imposed by the MEME Web server (maximum of 60,000 characters). Web access is free (currently available at http://meme.sdsc.edu and http://bioweb.pasteur.fr/seqanal/motif/meme). The command‐line version is free for noncommercial use or can be obtained with a commercial license, and can be downloaded over the Web (http://meme.sdsc.edu/meme/website/meme‐download.html).
    When using MEME via a Web interface, results will typically arrive within a few hours. It is not possible to predict when the MEME results will arrive because the computers on which MEME runs at SDSC and the Pasteur Institute are shared resources. Depending on the load, it can sometimes take a day or more for a job to be processed. Please be patient. This unpredictability can be avoided by installing command‐line MEME locally on the user's Unix‐based computer.
  • Files
    • A sequence file (the training set) containing one or more protein sequences. Note that sequences must be in FASTA format (appendix 1B) if using command‐line MEME. Other formats, described on the MEME Web site, are supported if using MEME via the Web interface, but the total number of characters in the sequences may not exceed 60,000.
    There are many ways to construct a family of protein sequences for input to MEME. For example, file tf4.fasta contains a family of bacterial protein sequences related to Entrez sequence gi|15897224|ref|NP_341829.1|, hypothetical protein [Sulfolobus solfataricus]. It was constructed by doing a BLASTP search of the nonredundant protein database (unit 3.4) using the sequence named above (gi|15897224) as the query. The accession numbers of all of the sequences matching the query with BLAST E‐values ≤0.01 were then placed in file tf4.acc. Then, Batch Entrez (unit 1.3) was used with the file of accession numbers to download the sequences in FASTA format into file tf4.fasta.
    The data file (tf4.fasta) used in this example should be downloaded from the Current Protocols Web site (http://www3.interscience.wiley.com/c_p/cpbi_sampledatafiles.htm).
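The step of collecting accession numbers for hits with E‐values ≤0.01 can be sketched in a few lines. This is an illustration under stated assumptions rather than the procedure used to build tf4.acc: it assumes the BLAST hits have been saved in the 12-column tab-delimited layout written by current BLAST+ (`-outfmt 6`, subject ID in column 2 and E-value in column 11), and the function name and sample hit lines are hypothetical.

```python
from typing import Iterable, List

# Cutoff quoted in the text: keep hits with E-value <= 0.01.
EVALUE_CUTOFF = 0.01


def select_accessions(tabular_lines: Iterable[str],
                      cutoff: float = EVALUE_CUTOFF) -> List[str]:
    """Return unique subject IDs whose E-value is at or below the cutoff.

    Assumes 12-column tab-delimited BLAST output:
    column 2 = subject ID, column 11 = E-value.
    """
    seen = set()
    selected: List[str] = []
    for raw in tabular_lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        subject_id, evalue = fields[1], float(fields[10])
        if evalue <= cutoff and subject_id not in seen:
            seen.add(subject_id)  # report each subject only once
            selected.append(subject_id)
    return selected


# Made-up hit lines for illustration only.
hits = [
    "query\tNP_000001\t90.0\t100\t5\t0\t1\t100\t1\t100\t1e-50\t200",
    "query\tNP_000002\t30.0\t80\t50\t2\t1\t80\t5\t85\t0.5\t30",
    "query\tNP_000001\t85.0\t60\t9\t0\t1\t60\t120\t180\t0.001\t90",
    "query\tNP_000003\t40.0\t90\t50\t1\t1\t90\t1\t90\t0.01\t45",
]
# One accession per line, the form expected for a Batch Entrez upload.
print("\n".join(select_accessions(hits)))
```

Writing the printed list to a file gives an accession file equivalent in form to tf4.acc.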

Basic Protocol 2: Discovering DNA Motifs in a set of DNA Sequences with MEME

  Necessary Resources
  • Hardware
    • Computer connected to the Internet
    • (Optional) Command‐line MEME works on many uniprocessor computers, some multiprocessor computers, and clusters that have the MPICH message‐passing software installed. A list of supported operating systems and their manufacturers is available at ftp://ftp.sdsc.edu/pub/sdsc/biology/meme/README.
  • Software
    • Web browser (e.g., Internet Explorer, Netscape Navigator)
    • E‐mail reader
    • Command‐line MEME (optional)
    MEME can be used remotely over the Web (Web MEME), with results being returned by E‐mail, or it can be installed and run on the user's Unix‐based computer (command‐line MEME). The Web interface has the advantage of not requiring any software installation, but some MEME features are only available in the command‐line version. Command‐line MEME removes the restriction on the size of the training set imposed by the MEME Web server (maximum of 60,000 characters). Web access is free (currently available at http://meme.sdsc.edu and http://bioweb.pasteur.fr/seqanal/motif/meme). The command‐line version is free for noncommercial use or can be obtained with a commercial license, and can be downloaded over the Web (http://meme.sdsc.edu/meme/website/meme‐download.html).
    When using MEME via a Web interface, results will typically arrive within a few hours. It is not possible to predict when the MEME results will arrive because the computers on which MEME runs at SDSC and the Pasteur Institute are shared resources. Depending on the load, it can sometimes take a day or more for a job to be processed. Please be patient. This unpredictability can be avoided by installing command‐line MEME locally on the user's Unix‐based computer.
  • Files
    • A sequence file (the training set) containing one or more DNA sequences
    Note that sequences must be in FASTA format (appendix 1B) if using command‐line MEME. Other formats, described on the MEME Web site, are supported if using MEME via the Web interface, but the total number of characters in the sequences may not exceed 60,000.
    There are many ways to construct a set of DNA sequences for input into MEME; for example, a set of upstream regions from genes known to be co‐regulated, as determined by expression microarray experiments, can be used. In this example, a file (lex.fasta) will be used that contains a set of E. coli DNA sequences known to bind LexA. MEME will be used to discover the LexA binding sites and characterize the motif.
    The data file (lex.fasta) used in this example should be downloaded from the Current Protocols Web site (http://www3.interscience.wiley.com/c_p/cpbi_sampledatafiles.htm).
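Because MEME expects every sequence in a training set to be of the same type, it can help to confirm that a DNA training set contains only nucleotide letters before submitting it. The sketch below is an assumption-laden illustration, not a MEME feature: the accepted alphabet (A, C, G, T, plus N for unknown bases), the function names, and the sample sequences are all hypothetical choices.

```python
from typing import Dict, Set

# Assumption: plain bases plus N for unknown; other ambiguity codes are flagged.
DNA_LETTERS = set("ACGTN")


def non_dna_letters(sequence: str) -> Set[str]:
    """Letters in the sequence (case-insensitive) outside the DNA alphabet."""
    return set(sequence.upper()) - DNA_LETTERS


def find_non_dna(records: Dict[str, str]) -> Dict[str, str]:
    """Map each offending sequence name to its sorted unexpected letters."""
    problems: Dict[str, str] = {}
    for name, seq in records.items():
        extra = non_dna_letters(seq)
        if extra:
            problems[name] = "".join(sorted(extra))
    return problems


# Made-up sequences for illustration only.
training_set = {
    "site1": "ACGTacgtNN",   # DNA only: passes the check
    "protein1": "MKTLLV",    # peptide letters: flagged
}
print(find_non_dna(training_set))
```

A check of this kind cannot prove a sequence is DNA (a short peptide made up only of A, C, G, and T would pass); it only catches files where protein sequences or stray characters have been mixed into a nucleotide training set.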

Alternate Protocol 2: Finding Repeated Motifs in DNA Sequences with MEME

  Necessary Resources
  • Hardware
    • Computer connected to the Internet
    • (Optional) Command‐line MEME works on many uniprocessor computers, some multiprocessor computers, and clusters that have the MPICH message‐passing software installed. A list of supported operating systems and their manufacturers is available at ftp://ftp.sdsc.edu/pub/sdsc/biology/meme/README.
  • Software
    • Web browser (e.g., Internet Explorer, Netscape Navigator)
    • E‐mail reader
    • Command‐line MEME (optional)
    MEME can be used remotely over the Web (Web MEME), with results being returned by E‐mail, or it can be installed and run on the user's Unix‐based computer (command‐line MEME). The Web interface has the advantage of not requiring any software installation, but some MEME features are only available in the command‐line version. Command‐line MEME removes the restriction on the size of the training set imposed by the MEME Web server (maximum of 60,000 characters). Web access is free (currently available at http://meme.sdsc.edu and http://bioweb.pasteur.fr/seqanal/motif/meme). The command‐line version is free for noncommercial use or can be obtained with a commercial license, and can be downloaded over the Web (http://meme.sdsc.edu/meme/website/meme‐download.html).
    When using MEME via a Web interface, results will typically arrive within a few hours. It is not possible to predict when the MEME results will arrive because the computers on which MEME runs at SDSC and the Pasteur Institute are shared resources. Depending on the load, it can sometimes take a day or more for a job to be processed. Please be patient. This unpredictability can be avoided by installing command‐line MEME locally on the user's Unix‐based computer.
  • Files
    • A sequence file (the training set) containing one or more DNA sequences
    Note that sequences must be in FASTA format (appendix 1B) if using command‐line MEME. Other formats, described on the MEME Web site, are supported if using MEME via the Web interface, but the total number of characters in the sequences may not exceed 60,000.
    This example uses a file (INO_up800.fasta) that contains upstream regions from S. cerevisiae genes known to be repressed in the presence of inositol or choline (van Helden et al., 1998).
    The file (INO_up800.fasta) used in this example should be downloaded from the Current Protocols Web site (http://www3.interscience.wiley.com/c_p/cpbi_sampledatafiles.htm).

Literature Cited

   Bailey, T.L. and Elkan, C. 1995. The value of prior knowledge in discovering motifs with MEME. In Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology, pp. 21‐29. AAAI Press, Menlo Park, Calif.
   Bailey, T.L. and Gribskov, M. 1998. Combining evidence using p‐values: Application to sequence homology searches. Bioinformatics 14:48‐54.
   Grundy, W.N., Bailey, T.L., Elkan, C.P., and Baker, M.E. 1997. Meta‐MEME: Motif‐based hidden Markov models of protein families. Comp. Appl. Bio. Sci. 13:397‐406.
   Kyte, J. and Doolittle, R. 1982. A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. 157:105‐132.
   Lawrence, C.E. and Reilly, A.A. 1990. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins: Struct. Funct. Genet. 7:41‐50.
   Pietrokovski, S. 1996. Searching databases of conserved sequence regions by aligning protein multiple‐alignments. Nucl. Acids Res. 24:3836‐3845.
   Pietrokovski, S., Henikoff, S., and Henikoff, J. 1996. The BLOCKS database: A system for protein classification. Nucl. Acids Res. 24:197‐200.
   van Helden, J., André, B., and Collado‐Vides, J. 1998. Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J. Mol. Biol. 281:827‐842.
   Wingender, E., Chen, X., Hehl, R., Karas, H., Liebich, I., Matys, V., Meinhardt, T., Pruss, M., Reuter, I., and Schacherer, F. 2000. TRANSFAC: An integrated system for gene expression regulation. Nucl. Acids Res. 28:316‐319.