
Discovering Novel Sequence Motifs with MEME

Timothy L. Bailey

University of Queensland, Brisbane, Australia

Unit Number: Unit 2.4
DOI: 10.1002/0471250953.bi0204s00
Online Posting Date: January, 2003

Abstract

This unit illustrates how to use MEME to discover motifs in a group of related nucleotide or peptide sequences. A MEME motif is a sequence pattern that occurs repeatedly in one or more sequences in the input group. MEME can be used to discover novel patterns because it bases its discoveries only on the input sequences, not on any prior knowledge (such as databases of known motifs). The input to MEME is a set of unaligned sequences of the same type (peptide or nucleotide). For each motif it discovers, MEME reports the occurrences (sites), consensus sequence, and the level of conservation (information content) at each position in the pattern. MEME also produces block diagrams showing where all of the discovered motifs occur in the training set sequences. MEME's hypertext (HTML) output also contains buttons that allow for the convenient use of the motifs in other searches.
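
As a rough illustration of the per-position conservation measure mentioned above, the sketch below computes information content in bits for each column of a set of aligned sites. It assumes a uniform background distribution and applies no small-sample correction, so its numbers will generally differ from those MEME reports. The function name and the toy sites are hypothetical.

```python
import math
from collections import Counter


def information_content(sites, alphabet_size):
    """Return per-column information content (bits) for equal-length aligned sites.

    Assumption: uniform background and no small-sample correction.
    """
    if not sites:
        raise ValueError("need at least one site")
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")

    max_bits = math.log2(alphabet_size)
    result = []
    for col in range(width):
        counts = Counter(site[col] for site in sites)
        total = sum(counts.values())
        # Shannon entropy of the observed letter frequencies in this column.
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        result.append(max_bits - entropy)
    return result


# Toy DNA sites (hypothetical, not taken from any MEME run).
toy_sites = ["ACGT", "ACGA", "ACTT", "ACTA"]
ic = information_content(toy_sites, alphabet_size=4)
```

In this toy alignment the first two columns are perfectly conserved and score the 2-bit maximum for a four-letter alphabet, while the last two columns are split evenly between two letters and score 1 bit each.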

     
 

Table of Contents

  • Unit Introduction
  • Basic Protocol 1: Discovering Motifs in a Protein Sequence Family Using MEME
  • Support Protocol: Searching for Other Proteins Containing the Same Motifs
  • Alternate Protocol 1: Finding Repeated Motifs in Protein Sequences
  • Basic Protocol 2: Discovering DNA Motifs in a set of DNA Sequences with MEME
  • Alternate Protocol 2: Finding Repeated Motifs in DNA Sequences with MEME
  • Guidelines for Understanding Results
  • Commentary
  • Literature Cited
  • Figures
  • Tables
     
 

Materials

Basic Protocol 1: Discovering Motifs in a Protein Sequence Family Using MEME

 Necessary Resources
  • Hardware
    • Computer connected to the Internet
      Command-line MEME works on many uniprocessor computers, some multiprocessor computers, and clusters that have the MPICH message-passing software installed. A list of supported operating systems and their manufacturers is available at ftp://ftp.sdsc.edu/pub/sdsc/biology/meme/README.
  • Software
    MEME can be used remotely over the Web (Web MEME), with results being returned by E-mail, or it can be installed and run on the user's Unix-based computer (command-line MEME). The Web interface has the advantage of not requiring any software installation, but some MEME features are only available in the command-line version. Command-line MEME removes the restriction on the size of the training set imposed by the MEME Web server (maximum of 60,000 characters). Web access is free (currently available at http://meme.sdsc.edu and http://bioweb.pasteur.fr/seqanal/motif/meme). The command-line version is free for noncommercial use or can be obtained with a commercial license, and can be downloaded over the Web (http://meme.sdsc.edu/meme/website/meme-download.html).

    When using MEME via a Web interface, results will typically arrive within a few hours. It is not possible to predict when the MEME results will arrive because the computers on which MEME runs at SDSC and the Pasteur Institute are shared resources. Depending on the load, it can sometimes take a day or more for a job to be processed. Please be patient. This unpredictability can be avoided by installing command-line MEME locally on the user's Unix-based computer.



  • Files
    • A sequence file (training set) containing one or more protein sequences (FASTA format required for command-line MEME; appendix 1B). Other formats, described on the MEME Web site, are supported if using MEME via the Web interface, but the total number of characters in the sequences may not exceed 60,000.
    There are many ways to construct a family of protein sequences for input to MEME. For example, file tf4.fasta contains a family of bacterial protein sequences related to Entrez sequence gi|15897224|ref|NP_341829.1| hypothetical protein [Sulfolobus solfataricus]. It was constructed by doing a BLASTP search (unit 3.4) of the nonredundant protein database using the sequence named above (gi|15897224) as the query. The accession numbers of all of the sequences matching the query with BLAST E-values ≤0.01 were then placed in file tf4.acc. Then, Batch Entrez (unit 1.3) was used with the file of accession numbers to download the sequences in FASTA format into file tf4.fasta.
    The data file used in this example (tf4.fasta) should be downloaded from the Current Protocols Web site (http://www3.interscience.wiley.com/c_p/cpbi_sampledatafiles.htm).
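
The accession-list step described above can be sketched as follows. This assumes the BLAST hits were saved as tab-separated text with the subject identifier in the second column and the E-value in the eleventh (the common 12-column tabular layout). That layout, the function name, and the sample rows are assumptions for illustration, not part of the original protocol.

```python
def accessions_below_evalue(tabular_lines, max_evalue=0.01):
    """Collect unique subject IDs with E-value <= max_evalue, in first-seen order."""
    seen = set()
    kept = []
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 11:
            continue  # skip malformed rows
        subject = fields[1]
        evalue = float(fields[10])
        if evalue <= max_evalue and subject not in seen:
            seen.add(subject)
            kept.append(subject)
    return kept


# Made-up rows for illustration only.
sample = [
    "query1\thitA\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-30\t200",
    "query1\thitB\t40.0\t80\t48\t0\t1\t80\t5\t84\t0.01\t35",
    "query1\thitC\t30.0\t50\t35\t0\t1\t50\t9\t58\t2.5\t20",
    "query1\thitA\t85.0\t60\t9\t0\t101\t160\t120\t179\t3e-10\t90",
]
hits = accessions_below_evalue(sample)

# Writing one accession per line would give a file in the spirit of tf4.acc:
# with open("tf4.acc", "w") as fh:
#     fh.write("\n".join(hits) + "\n")
```

Duplicate subjects are collapsed so that each accession appears only once in the list passed to Batch Entrez.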

Support Protocol: Searching for Other Proteins Containing the Same Motifs

 Necessary Resources
  • Hardware
    • Computer connected to the Internet
  • Software
    • Web browser (e.g., Internet Explorer, Netscape Navigator)
    • E-mail reader (e.g., Netscape Messenger)
  • Files
    • Results from MEME (see Basic Protocol 1)

Alternate Protocol 1: Finding Repeated Motifs in Protein Sequences

 Necessary Resources
  • Hardware
    • Computer connected to the Internet
    • Command-line MEME works on many uniprocessor computers, some multiprocessor computers, and clusters that have the MPICH message-passing software installed. A list of supported operating systems and their manufacturers is available at: ftp://ftp.sdsc.edu/pub/sdsc/biology/meme/README.
  • Software
    • Web browser (e.g., Internet Explorer, Netscape Navigator)
    • E-mail reader
    • Command-line MEME (optional)
  • MEME can be used remotely over the Web (Web MEME), with results being returned by E-mail, or it can be installed and run on the user's Unix-based computer (command-line MEME). The Web interface has the advantage of not requiring any software installation, but some MEME features are only available in the command-line version. Command-line MEME removes the restriction on the size of the training set imposed by the MEME Web server (maximum of 60,000 characters). Web access is free (currently available at http://meme.sdsc.edu and http://bioweb.pasteur.fr/seqanal/motif/meme). The command-line version is free for noncommercial use or can be obtained with a commercial license, and can be downloaded over the Web (http://meme.sdsc.edu/meme/website/meme-download.html).


  • When using MEME via a Web interface, results will typically arrive within a few hours. It is not possible to predict when the MEME results will arrive because the computers on which MEME runs at SDSC and the Pasteur Institute are shared resources. Depending on the load, it can sometimes take a day or more for a job to be processed. Please be patient. This unpredictability can be avoided by installing command-line MEME locally on the user's Unix-based computer.


  • Files
    • A sequence file (the training set) containing one or more protein sequences. Note that sequences must be in FASTA format (appendix 1B) if using command-line MEME. Other formats, described on the MEME Web site, are supported if using MEME via the Web interface, but the total number of characters in the sequences may not exceed 60,000.

There are many ways to construct a family of protein sequences for input to MEME. For example, file tf4.fasta contains a family of bacterial protein sequences related to Entrez sequence gi|15897224|ref|NP_341829.1|, hypothetical protein [Sulfolobus solfataricus]. It was constructed by doing a BLASTP search of the nonredundant protein database (unit 3.4) using the sequence named above (gi|15897224) as the query. The accession numbers of all of the sequences matching the query with BLAST E-values ≤0.01 were then placed in file tf4.acc. Then, Batch Entrez (unit 1.3) was used with the file of accession numbers to download the sequences in FASTA format into file tf4.fasta.

The data file (tf4.fasta) used in this example should be downloaded from the Current Protocols Web site (http://www3.interscience.wiley.com/c_p/cpbi_sampledatafiles.htm).
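
Whether a training set fits under the Web server's 60,000-character limit can be checked before submission. The sketch below counts sequence characters in FASTA text, ignoring header lines and whitespace. Whether the server counts characters in exactly this way is an assumption, and the function names are hypothetical.

```python
WEB_MEME_MAX_CHARS = 60_000  # limit quoted in the text above


def fasta_sequence_chars(fasta_text):
    """Count sequence characters in FASTA text, skipping '>' header lines."""
    total = 0
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        # Drop any internal whitespace before counting.
        total += len("".join(line.split()))
    return total


def fits_web_meme(fasta_text, limit=WEB_MEME_MAX_CHARS):
    """Return True if the sequence character count is within the limit."""
    return fasta_sequence_chars(fasta_text) <= limit


# Toy input for illustration only.
example = ">seq1 toy protein\nMKVLA\nGHT\n>seq2\nMKV\n"
n_chars = fasta_sequence_chars(example)
```

For a real file, the text could be read with `open("tf4.fasta").read()` and passed to `fits_web_meme`; a training set over the limit would need command-line MEME instead.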


Basic Protocol 2: Discovering DNA Motifs in a set of DNA Sequences with MEME

 Necessary Resources
  • Hardware
    • Computer connected to the Internet
  • Command-line MEME works on many uniprocessor computers, some multiprocessor computers, and clusters that have the MPICH message-passing software installed. A list of supported operating systems and their manufacturers is available at ftp://ftp.sdsc.edu/pub/sdsc/biology/meme/README (optional).


  • Software
    • Web browser (e.g., Internet Explorer, Netscape Navigator)
    • E-mail reader
    • Command-line MEME (optional)
  • MEME can be used remotely over the Web (Web MEME), with results being returned by E-mail, or it can be installed and run on the user's Unix-based computer (command-line MEME). The Web interface has the advantage of not requiring any software installation, but some MEME features are only available in the command-line version. Command-line MEME removes the restriction on the size of the training set imposed by the MEME Web server (maximum of 60,000 characters). Web access is free (currently available at http://meme.sdsc.edu and http://bioweb.pasteur.fr/seqanal/motif/meme). The command-line version is free for noncommercial use or can be obtained with a commercial license, and can be downloaded over the Web (http://meme.sdsc.edu/meme/website/meme-download.html).
    When using MEME via a Web interface, results will typically arrive within a few hours. It is not possible to predict when the MEME results will arrive because the computers on which MEME runs at SDSC and the Pasteur Institute are shared resources. Depending on the load, it can sometimes take a day or more for a job to be processed. Please be patient. This unpredictability can be avoided by installing command-line MEME locally on the user's Unix-based computer.
  • Files
    • A sequence file (the training set) containing one or more DNA sequences
Note that sequences must be in FASTA format (appendix 1B) if using command-line MEME. Other formats, described on the MEME Web site, are supported if using MEME via the Web interface, but the total number of characters in the sequences may not exceed 60,000.
There are many ways to construct a set of DNA sequences for input to MEME. For example, a set of upstream regions from genes known to be co-regulated, as determined by expression microarray experiments, can be used. In this example, a file (lex.fasta) will be used that contains a set of E. coli DNA sequences known to bind LexA. MEME will be used to discover the LexA binding sites and characterize the motif.
The data file (lex.fasta) used in this example should be downloaded from the Current Protocols Web site (http://www3.interscience.wiley.com/c_p/cpbi_sampledatafiles.htm).

Alternate Protocol 2: Finding Repeated Motifs in DNA Sequences with MEME

 Necessary Resources
  • Hardware
    • Computer connected to the Internet
  • Command-line MEME works on many uniprocessor computers, some multiprocessor computers, and clusters that have the MPICH message-passing software installed. A list of supported operating systems and their manufacturers is available at ftp://ftp.sdsc.edu/pub/sdsc/biology/meme/README (optional).


  • Software
    • Web browser (e.g., Internet Explorer, Netscape Navigator)
    • E-mail reader
    • Command-line MEME (optional)
  • MEME can be used remotely over the Web (Web MEME), with results being returned by E-mail, or it can be installed and run on the user's Unix-based computer (command-line MEME). The Web interface has the advantage of not requiring any software installation, but some MEME features are only available in the command-line version. Command-line MEME removes the restriction on the size of the training set imposed by the MEME Web server (maximum of 60,000 characters). Web access is free (currently available at http://meme.sdsc.edu and http://bioweb.pasteur.fr/seqanal/motif/meme). The command-line version is free for noncommercial use or can be obtained with a commercial license, and can be downloaded over the Web (http://meme.sdsc.edu/meme/website/meme-download.html).
    When using MEME via a Web interface, results will typically arrive within a few hours. It is not possible to predict when the MEME results will arrive because the computers on which MEME runs at SDSC and the Pasteur Institute are shared resources. Depending on the load, it can sometimes take a day or more for a job to be processed. Please be patient. This unpredictability can be avoided by installing command-line MEME locally on the user's Unix-based computer.
  • Files
    • A sequence file (the training set) containing one or more DNA sequences

Note that sequences must be in FASTA format (appendix 1B) if using command-line MEME. Other formats, described on the MEME Web site, are supported if using MEME via the Web interface, but the total number of characters in the sequences may not exceed 60,000.

This example uses a file (INO_up800.fasta) that contains upstream regions from S. cerevisiae genes known to be repressed in the presence of inositol or choline (van Helden et al., 1998).

The file (INO_up800.fasta) used in this example should be downloaded from the Current Protocols Web site (http://www3.interscience.wiley.com/c_p/cpbi_sampledatafiles.htm).
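
Because MEME expects all sequences in a training set to be of the same type, it can be worth confirming that a file intended for a DNA run really contains nucleotide sequences. The sketch below is a simple heuristic, not anything MEME itself does: it reports the fraction of letters drawn from A, C, G, T, and N. The 0.9 threshold and the function names are assumptions chosen for illustration.

```python
DNA_LETTERS = set("ACGTN")


def dna_fraction(fasta_text):
    """Return the fraction of sequence letters that are A, C, G, T, or N."""
    letters = [
        ch.upper()
        for line in fasta_text.splitlines()
        if not line.startswith(">")  # skip FASTA header lines
        for ch in line
        if ch.isalpha()
    ]
    if not letters:
        return 0.0
    return sum(ch in DNA_LETTERS for ch in letters) / len(letters)


def looks_like_dna(fasta_text, threshold=0.9):
    """Heuristic check; the threshold is an arbitrary illustrative choice."""
    return dna_fraction(fasta_text) >= threshold


# Toy inputs for illustration only.
toy_dna = ">site1\nACGTTGCA\n>site2\nacgtnnac\n"
toy_protein = ">prot1\nMKVLAGHTWQ\n"
```

A short or compositionally unusual protein could still pass such a check, so the result is a hint rather than a guarantee.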



     
 

Figures

  • Figure 2.4.1
    Overview of the input and output of MEME.

  • Figure 2.4.2
    A typical protein motif discovered by MEME showing the aligned motif sites, multilevel consensus sequence, and information content.

  • Figure 2.4.3
    A typical Summary of Motifs diagram produced by MEME showing the positions of occurrences of the three motifs MEME discovered. The names of the training set sequences and a measure of how well all three motifs are represented in each sequence (combined P-value) are shown to the left of each.

  • Figure 2.4.4
    MEME input form for the tf4 protein family assuming zero or one occurrence of a single motif in each sequence.

  • Figure 2.4.5
    MEME verification screen.

  • Figure 2.4.6
    MEME confirmation E-mail message.

  • Figure 2.4.7
    MEME E-mail results header. The actual results are attached to the E-mail message as an HTML attachment. Many E-mail programs display the attachment directly, but it is usually better to save the attachment to a file and use a web browser to view it.

  • Figure 2.4.8
    Six buttons at the top of the MEME output, which allow for easy navigation through the file.

  • Figure 2.4.9
    MEME command-line summary section showing all the MEME parameters. This is useful for keeping track of distinct MEME runs.

  • Figure 2.4.10
    MEME training set section.

  • Figure 2.4.11
    Motif summary line.

  • Figure 2.4.12
    Simplified PSPM, information content diagram, consensus, and alignment.

  • Figure 2.4.13
    Motif 5 block diagrams showing schematically the position and strength of occurrences of motif 5 in the training set.

  • Figure 2.4.14
    Options available under Motif 5 in BLOCKS format. The user may choose to view the motif in block, FASTA, or raw format by clicking the appropriate button. The Submit BLOCK 5 button provides a link to the Blocks database (unit 2.2), which is valuable for further analysis of the motif.

  • Figure 2.4.15
    A display of Motif 5 in FASTA Format, obtained by clicking on the View FASTA 5 button shown in Figure 2.4.14.

  • Figure 2.4.16
    Motif 5 in Logos format obtained by clicking on the Submit Block 5 button shown in Figure 2.4.14, followed by clicking on the Logos:GIF button.

  • Figure 2.4.17
    Motif 5 neighbor-joining tree obtained by clicking on the Submit Block 5 button shown in Figure 2.4.14, followed by clicking on the Tree:Gif button.

  • Figure 2.4.18
    Motif 5 neighbor-joining tree obtained by clicking on the Submit Block 5 button shown in Figure 2.4.14, followed by clicking on the LAMA button.

  • Figure 2.4.19
    Summary of motifs diagram showing the positions of matches to all ten motifs discovered by MEME in the training set. The diagram is generated by searching for all nonoverlapping matches to the motif PSSMs, and may not correspond exactly to the individual motif block diagrams (e.g., Fig. 2.4.13). This diagram is generated by MEME using the MAST algorithm (Bailey and Gribskov, 1998).

  • Figure 2.4.20
    Top of MAST input form showing all of the required inputs—i.e., the user's E-mail address and the sequence to search.

  • Figure 2.4.21
    Partial results of a MAST search of yeast showing the motif matches in schematic format.

  • Figure 2.4.22
    MetaMEME input form.

  • Figure 2.4.23
    MetaMEME search of yeast.

  • Figure 2.4.24
    MEME input form for the tf4 protein family assuming any number of repeats of a single motif in each sequence.

  • Figure 2.4.25
    MEME motif summary: repeated motifs.

  • Figure 2.4.26
    LAMA search of BLOCKS motif database showing the top matches of the MEME motifs to known protein motifs.

  • Figure 2.4.27
    Neighbor-joining tree of motif 1.

  • Figure 2.4.28
    MAST search of yeast with repeated motifs.

  • Figure 2.4.29
    MEME input form for LexA binding sites.

  • Figure 2.4.30
    LexA binding site motif.

  • Figure 2.4.31
    Summary of motifs in lex.fasta. A minus sign before the motif number indicates that the match is on the reverse complement strand. A plus sign indicates that the motif is on the given strand.

  • Figure 2.4.32
    MAST input form for LexA.

  • Figure 2.4.33
    MAST results of search of E. coli with LexA binding site motif.

  • Figure 2.4.34
    MEME input form for genes repressed by inositol or choline.

  • Figure 2.4.35
    Inositol binding site motif identified by MEME.

Literature Cited

    Bailey, T.L. and Elkan, C. 1995. The value of prior knowledge in discovering motifs with MEME. In Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology, pp. 21-29. AAAI Press, Menlo Park, Calif.
    Bailey, T.L. and Gribskov, M. 1998. Combining evidence using p-values: Application to sequence homology searches. Bioinform. 14:48-54.
    Grundy, W.N., Bailey, T.L., Elkan, C.P., and Baker, M.E. 1997. Meta-MEME: Motif-based hidden Markov models of protein families. Comput. Appl. Biosci. 13:397-406.
    Kyte, J. and Doolittle, R. 1982. A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. 157:105-132.
    Lawrence, C.E. and Reilly, A.A. 1990. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins Struct. Funct. Genet. 7:41-51.
    Pietrokovski, S. 1996. Searching databases of conserved sequence regions by aligning protein multiple-alignments. Nucl. Acids Res. 24:3836-3845.
    Pietrokovski, S., Henikoff, S., and Henikoff, J. 1996. The BLOCKS database: A system for protein classification. Nucl. Acids Res. 24:197-200.
    van Helden, J., André, B., and Collado-Vides, J. 1998. Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J. Mol. Biol. 281:827-842.
    Wingender, E., Chen, X., Hehl, R., Karas, H., Liebich, I., Matys, V., Meinhardt, T., Pruss, M., Reuter, I., and Schacherer, F. 2000. TRANSFAC: An integrated system for gene expression regulation. Nucl. Acids Res. 28:316-319.
     
 