Finding Protein and Nucleotide Similarities with FASTA

William R. Pearson1

1 University of Virginia School of Medicine, Charlottesville, Virginia
Publication Name:  Current Protocols in Bioinformatics
Unit Number:  Unit 3.9
DOI:  10.1002/0471250953.bi0309s53
Online Posting Date:  March, 2016
GO TO THE FULL TEXT: PDF or HTML at Wiley Online Library


The FASTA programs provide a comprehensive set of rapid similarity searching tools (fasta36, fastx36, tfastx36, fasty36, tfasty36), similar to those provided by the BLAST package, as well as programs for slower, optimal local and global similarity searches (ssearch36, ggsearch36), and for searching with short peptides and oligonucleotides (fasts36, fastm36). The FASTA programs use an empirical strategy for estimating statistical significance that accommodates a range of similarity scoring matrices and gap penalties, improving alignment boundary accuracy and search sensitivity. The FASTA programs can produce “BLAST‐like” alignment and tabular output, for ease of integration into existing analysis pipelines, and can search small, representative databases and then report results for a larger set of sequences, using links from the smaller dataset. The FASTA programs work with a wide variety of database formats, including MySQL and PostgreSQL databases. The programs also provide a strategy for integrating domain and active site annotations into alignments and highlighting the mutational state of functionally critical residues. These protocols describe how to use the FASTA programs to characterize protein and DNA sequences, using protein:protein, protein:DNA, and DNA:DNA comparisons. © 2016 by John Wiley & Sons, Inc.

Keywords: similarity; homology; expectation; E()‐value; alignment annotation; scoring matrices
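
As an illustration of the pipeline integration mentioned above, the following minimal Python sketch parses "BLAST‐like" tabular output and keeps hits below an E()‐value threshold. It assumes the output uses the conventional 12‐column, tab‐separated BLAST tabular layout (as the FASTA documentation describes for the `-m 8` output option; check this against your installed version). The `Hit` type, the `parse_tabular` function, the threshold, and the sample records are all hypothetical and shown only for illustration.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """One row of 12-column BLAST-style tabular output (assumed layout)."""
    query: str
    subject: str
    percent_identity: float
    alignment_length: int
    mismatches: int
    gap_opens: int
    query_start: int
    query_end: int
    subject_start: int
    subject_end: int
    evalue: float
    bit_score: float


def parse_tabular(text: str, max_evalue: float = 1e-3) -> List[Hit]:
    """Return hits whose E()-value is at or below max_evalue.

    Blank lines, comment lines starting with '#', and rows with fewer
    than 12 tab-separated fields are skipped.
    """
    hits: List[Hit] = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 12:
            continue
        hit = Hit(
            fields[0], fields[1], float(fields[2]),
            int(fields[3]), int(fields[4]), int(fields[5]),
            int(fields[6]), int(fields[7]), int(fields[8]), int(fields[9]),
            float(fields[10]), float(fields[11]),
        )
        if hit.evalue <= max_evalue:
            hits.append(hit)
    return hits


# Hypothetical sample records, not real search results.
SAMPLE = (
    "# hypothetical tabular output\n"
    "query1\tsubjA\t45.2\t210\t110\t3\t1\t208\t5\t214\t2.1e-40\t150.3\n"
    "query1\tsubjB\t28.0\t90\t60\t2\t10\t99\t20\t108\t0.5\t30.1\n"
)

significant = parse_tabular(SAMPLE)
for hit in significant:
    print(hit.query, hit.subject, hit.evalue)
```

The 0.001 threshold here is only a placeholder; an appropriate cutoff depends on the database searched and the goals of the analysis.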


Table of Contents

  • Introduction
  • Strategic Planning
  • Basic Protocol 1: Using the Fasta Programs
  • Support Protocol 1: Downloading and Installing the FASTA Programs
  • Support Protocol 2: Downloading and Preparing Sequence Databases
  • Basic Protocol 2: Large‐Scale Sequence Analysis with Alignment Annotation
  • Alternate Protocol 1: Using Annotation Files
  • Summary
  • Guidelines for Understanding Results
  • Commentary
  • Literature Cited
  • Figures
  • Tables