Sequence Similarity Searching Using the BLAST Family of Programs

Tyra G. Wolfsberg, Thomas L. Madden

National Center for Biotechnology Information, National Library of Medicine, NIH, Bethesda, Maryland
Publication Name:  Current Protocols in Molecular Biology
Unit Number:  Unit 19.3
DOI:  10.1002/0471142727.mb1903s46
Online Posting Date:  May, 2001

Abstract

Database sequence similarity searching is carried out thousands of times each day by researchers worldwide and has become a very valuable tool. Over the years, a number of algorithms have been implemented to facilitate database searching. The BLAST (Basic Local Alignment Search Tool) family of sequence similarity search programs allows searches to be done quickly and easily while retaining sensitivity and rigorous statistics. In this unit, which is a completely new version of its predecessor of the same title, the user learns how to access the databases, determine the correct searching strategies, and apply examples of BLAST searches to his or her own data.

     
 

Table of Contents

  • Accessing BLAST Programs and Documentation
  • Introduction to BLAST
  • Examples of BLAST Searches
  • Searching Strategies
  • Sequence Alignment Algorithms
  • Appendix A: BLAST Parameters
  • Appendix B: Sequence Identifier Syntax
  • Literature Cited
  • Figures
  • Tables
     
 
