Using the Gibbs Motif Sampler to Find Conserved Domains in DNA and Protein Sequences

William Thompson1,  Lee Ann McCue2,  Charles E. Lawrence1

1Brown University, Providence, Rhode Island
2Center for Bioinformatics, The Wadsworth Center, New York State Department of Health, Albany, New York


Unit Number: Unit 2.8
DOI: 10.1002/0471250953.bi0208s10
Online Posting Date: July, 2005

Abstract

The Gibbs Motif Sampler (Gibbs) is a software package for discovering conserved elements in biopolymer sequences. This unit describes the basic operation of the Web-based interface to Gibbs, along with advanced examples of its use, and the Web interface to dscan, a sequence database search program.

Keywords: Gibbs sampling; Transcription factor binding site; Sequence alignment; Motif; DNA; Protein; Phylogenetic footprinting; Stochastic algorithm; Markov chain Monte Carlo; Bayesian statistics

     
 

Table of Contents

  • Unit Introduction
  • Basic Protocol 1: Running the Gibbs Motif Sampler
  • Basic Protocol 2: Searching for Other Sequences Containing Similar Motifs Using dscan
  • Guidelines for Understanding Results
  • Commentary
  • Appendix A
  • Appendix B
  • Literature Cited
  • Figures
  • Tables
     
 

Figures

  • Figure 2.8.1
    The contents of the file crp.dat.

  • Figure 2.8.2
    The Gibbs home page.

  • Figure 2.8.3
    The basic Gibbs entry form. The fields marked with an asterisk (*) are the minimum required entries to run Gibbs in Site Sampler mode.

  • Figure 2.8.4
    dscan motif models. (A) dscan model consisting of a collection of sites produced by Gibbs. (B) The same data displayed as a count matrix. (C) The same data displayed as a probability matrix.

  • Figure 2.8.5
    The dscan entry form.

  • Figure 2.8.6
    The output produced by running Gibbs on the crp.dat data file with a conserved motif width of 16, an estimated 22 sites, and 0, 1, or 2 sites allowed per sequence. (A) Gibbs program options. (B) Gibbs output, showing a listing of the input FASTA sequence headers. (C) Maximum MAP output.

  • Figure 2.8.7
    dscan output from scanning the database of E. coli intergenic sequences.

  • Figure 2.8.8
    Gibbs advanced options page for DNA data with default options selected.

  • Figure 2.8.9
    Gibbs advanced options screen for protein data.

  • Figure 2.8.10
    Restriction site for the enzyme EcoRI, illustrating its palindromic nature. The GAA at the 5′ end, at positions 1 through 3, is complementary to the TTC at the 3′ end, at positions 6 through 4.

  • Figure 2.8.11
    Background composition. The figure shows the distribution of the probabilities of each nucleotide at each position, as generated by the Bayesian segmentation algorithm (Liu and Lawrence, 1999) for a 131-bp region upstream of the Haemophilus influenzae purA gene.

  • Figure 2.8.12
    Output from a Gibbs run with the Wilcoxon signed-rank test option enabled. (A) The 18 E. coli CRP-regulated promoter sequences have been supplemented with 18 shuffled sequences. (B) Maximum MAP alignment and the results of the Wilcoxon signed-rank test. The p value of 0.000671 indicates that the alignment is highly significant despite the inclusion of three shuffled sequences in the alignment.

  • Figure 2.8.13
    Sample spacing distribution. (A) The probability distribution of the distances of sites from the start codon for the default spacing model for prokaryotic DNA sequences. This is the model used when the option Prokaryotic Defaults is selected. (B) Values for the spacing distribution shown in Figure 2.8.13A.

  • Figure 2.8.14
    Sample prior information file. Prior pseudocounts are shown for the CRP TFBS model. The model is 16 rows by 4 columns; the order of the columns is A, T, C, G. In this example, each row sums to 10, although this is not a requirement. Rows may have different sums. By default, each table entry is multiplied by 0.1, resulting in 1.0 total pseudocounts for each position. Prior probabilities for 0, 1, or 2 sites per sequence are included, with a weight of 0.1.

  • Figure 2.8.15
    Gibbs more advanced options screen for DNA. This screen includes options for controlling program performance.

  • Figure 2.8.16
    Alignments from phylogenetic footprinting. (A) Alignment from the phylogenetic footprinting of the E. coli purL gene and six orthologous genes from related species. (B) Alignment from the phylogenetic footprinting of the E. coli glnA gene and six orthologous genes from related species.

  • Figure 2.8.17
    (A) Alignment from the analysis of seven intergenic sequences that contain the ten M. tuberculosis promoters. (B) Sequence logo (Schneider and Stephens, 1990) of the alignment of seven intergenic sequences that contain the ten M. tuberculosis promoters. A sequence logo is a graphical representation of a multiple sequence alignment. The overall height of the letters at a position indicates the sequence conservation at that position. The height of the individual letters at a position indicates the relative frequency of the nucleotide at that position.

  • Figure 2.8.18
    (A) Alignments for motifs a and b for the M. tuberculosis hypoxia microarray data. (B,C) Sequence logos (Schneider and Stephens, 1990) of the alignments of motifs a and b respectively for the M. tuberculosis hypoxia microarray data.
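The captions for Figures 2.8.4 and 2.8.17 describe three views of the same motif: a collection of aligned sites, a count matrix, and a probability matrix, plus a sequence logo whose column height reflects conservation. The relationship between those views can be illustrated with a minimal Python sketch. It is not the Gibbs or dscan implementation. The function names and the toy sites are hypothetical, the A, T, C, G column order is borrowed from the Figure 2.8.14 caption and is only assumed to apply here, and the information content omits the small-sample correction that logo software usually applies.

```python
import math

# Column order taken from the Figure 2.8.14 caption; using it for
# count and probability matrices is an assumption of this sketch.
ALPHABET = "ATCG"


def count_matrix(sites):
    """Return one row per motif position and one column per letter in ALPHABET."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same width")
    counts = [[0] * len(ALPHABET) for _ in range(width)]
    for site in sites:
        for pos, base in enumerate(site.upper()):
            counts[pos][ALPHABET.index(base)] += 1
    return counts


def probability_matrix(counts, pseudocount=0.0):
    """Normalize each row of counts, optionally adding a flat pseudocount per letter."""
    probs = []
    for row in counts:
        total = sum(row) + pseudocount * len(row)
        probs.append([(c + pseudocount) / total for c in row])
    return probs


def information_content(probs):
    """Bits per position: log2(alphabet size) minus the Shannon entropy of the row."""
    max_bits = math.log2(len(ALPHABET))
    info = []
    for row in probs:
        entropy = -sum(p * math.log2(p) for p in row if p > 0)
        info.append(max_bits - entropy)
    return info


# Hypothetical toy sites for illustration only; they are not taken from crp.dat.
sites = ["GAATTC", "GAATTC", "GAATAC", "GTATTC"]
counts = count_matrix(sites)
probs = probability_matrix(counts)
info = information_content(probs)

for pos, (c_row, i_bits) in enumerate(zip(counts, info), start=1):
    print(pos, c_row, round(i_bits, 3))
```

A fully conserved position scores 2 bits for DNA, and a position with equal letter frequencies scores 0 bits, which is why conserved columns stand tall in a logo. Passing a nonzero `pseudocount` plays a role loosely similar to the prior pseudocounts described for Figure 2.8.14, although that file allows a different value for every table entry.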

Literature Cited

    Altschul, S.F. and Lipman, D.J. 1990. Protein database searches for multiple alignments. Proc. Natl. Acad. Sci. U.S.A. 87:5509-5513.
    Altschul, S.F., Boguski, M.S., Gish, W., and Wootton, J.C. 1994. Issues in searching molecular sequence databases. Nat. Genet. 6:119.
    Bailey, T.L. and Elkan, C. 1995. Unsupervised learning of multiple motifs in biopolymers using EM. Machine Learning 21:51-80.
    Claverie, J.M. and States, D.J. 1993. Information enhancement methods for large scale sequence analysis. Comput. Chem. 17:191-201.
    Florczyk, M.A., McCue, L.A., Purkayastha, A., Currenti, E., Wolin, M.J., and McDonough, K.A. 2003. A family of acr-coregulated Mycobacterium tuberculosis genes shares a common DNA motif and requires Rv3133c (dosR or devR) for expression. Infect. Immun. 71:5332-5343.
    Lawrence, C.E. and Reilly, A.A. 1990. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins Struct. Funct. Genet. 7:41-51.
    Lawrence, C., Altschul, S., Boguski, M., Liu, J., Neuwald, A., and Wootton, J. 1993. Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment. Science 262:208-214.
    Liu, J.S. and Lawrence, C.E. 1999. Bayesian inference on biopolymer models. Bioinformatics 15:38-52.
    Liu, J., Neuwald, A., and Lawrence, C. 1995. Bayesian models for multiple local sequence alignment and Gibbs sampling strategies. J. Am. Stat. Assoc. 90:1156-1170.
    Liu, X., Brutlag, D.L., and Liu, J.S. 2001. BioProspector: Discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. In Proceedings of the Pacific Symposium on Biocomputing, pp. 127-138. World Scientific Press, Hawaii.
    McCue, L., Thompson, W., Carmack, C., Ryan, M.P., Liu, J.S., Derbyshire, V., and Lawrence, C.E. 2001. Phylogenetic footprinting of transcription factor binding sites in proteobacterial genomes. Nucl. Acids Res. 29:774-782.
    McCue, L.A., Thompson, W., Carmack, C.S., and Lawrence, C.E. 2002. Factors influencing the identification of transcription factor binding sites by cross-species comparison. Genome Res. 12:1523-1532.
    Neuwald, A., Liu, J., and Lawrence, C. 1995. Gibbs motif sampling: Detection of bacterial outer membrane protein repeats. Protein Sci. 4:1618-1632.
    Schneider, T.D. and Stephens, R.M. 1990. Sequence logos: A new way to display consensus sequences. Nucl. Acids Res. 18:6097-6100.
    Schwartz, S., Kent, W.J., Smit, A., Zhang, Z., Baertsch, R., Hardison, R.C., Haussler, D., and Miller, W. 2003. Human-mouse alignments with BLASTZ. Genome Res. 13:103-107.
    Sherman, D.R., Voskuil, M., Schnappinger, D., Liao, R., Harrell, M.I., and Schoolnik, G.K. 2001. Regulation of the Mycobacterium tuberculosis hypoxic response gene encoding alpha-crystallin. Proc. Natl. Acad. Sci. U.S.A. 98:7534-7539.
    Thompson, W., Rouchka, E.C., and Lawrence, C.E. 2003. Gibbs Recursive Sampler: Finding transcription factor binding sites. Nucl. Acids Res. 31:3580-3585.
    Thompson, W., Palumbo, M.J., Wasserman, W.W., Liu, J.S., and Lawrence, C.E. 2004. Decoding human regulatory circuits. Genome Res. 14:1967-1974.
    Wanner, B.L. 1996. Phosphorus assimilation and control of the phosphate regulon. In Escherichia coli and Salmonella: Cellular and Molecular Biology (F.C. Neidhardt, ed.), pp. 1357-1381. ASM Press, Washington, D.C.
    Webb, B.J., Liu, J.S., and Lawrence, C.E. 2002. BALSA: Bayesian algorithm for local sequence alignment. Nucl. Acids Res. 30:1268-1277.
Internet Resources
    http://bayesweb.wadsworth.org/gibbs/gibbs.html
    http://www.bioinfo.rpi.edu/applications/bayesian/gibbs/gibbs.html

Web sites for running the Gibbs sampler

    http://bayesweb.wadsworth.org/GIBBS-SAMPLER-ACADEMIC.htm
    http://bayesweb.wadsworth.org/GIBBS-SAMPLER-COMMERCIAL.htm

The above sites provide information about obtaining Gibbs.

    http://bayesweb.wadsworth.org/gibbs/module

Auxiliary data for running the examples

    http://www.chem.qmul.ac.uk/iupac/AminoAcid/A2021.html#AA21

IUPAC amino acid codes

    http://bayesweb.wadsworth.org/web_help.PF.html
    http://bayesweb.wadsworth.org/web_help_text.CE.htm

Annotated examples using Gibbs to analyze bacterial data

     
 