Mathematically Complete Nucleotide and Protein Sequence Searching Using Ssearch

Alexander J. Ropelewski1, Hugh B. Nicholas1, David W. Deerfield1

1 Pittsburgh Supercomputing Center, Pittsburgh, Pennsylvania
Publication Name:  Current Protocols in Bioinformatics
Unit Number:  Unit 3.10
DOI:  10.1002/0471250953.bi0310s04
Online Posting Date:  February, 2004
GO TO THE FULL TEXT: PDF or HTML at Wiley Online Library


In this unit a protocol is described for predicting the structure of simple transmembrane α‐helical bundles. The protocol is based on a global molecular dynamics search (GMDS) of the configuration space of the helical bundle, yielding several candidate structures. The correct structure among these candidates is selected using information from silent amino acid substitutions, on the premise that only the correct structure must (by definition) accept all of the silent amino acid substitutions. Thus, the correct structure is found by repeating the GMDS for several close homologs and selecting the structure that persists in all of the trials.


Table of Contents

  • Guidelines for Understanding Results
  • Commentary
  • Literature Cited
  • Figures


Basic Protocol 1:

  Necessary Resources
  • Hardware
    • The Ssearch code is resource‐intensive and generally requires a substantial computational platform with adequate CPU, memory, and disk space. However, these requirements vary greatly with how the code is used. For example, searching against compilations of known protein sequences, such as the PIR or Swiss‐Prot databases, can be done on an inexpensive PC running Windows, Linux, or Macintosh OS. Performing regular searches of complete nucleic acid libraries is a task best suited to higher‐performance, multiprocessor machines.
  • Software
    • The Ssearch code is part of the FASTA package (unit 3.9) from Dr. William Pearson, which is available via anonymous FTP. Dr. Pearson can be contacted at the Department of Biochemistry, School of Medicine, University of Virginia, Charlottesville, Va. 22908.
  • Files
    • Sequence files: The Ssearch code requires an input file that contains the query sequence in FASTA format (appendix 1B). It also requires one or more installed sequence data libraries (e.g., GenBank, NBRF‐PIR, Swiss‐Prot, or EMBL) or, in lieu of an installed library, a set of sequences in FASTA format (appendix 1C) against which the query sequence is to be compared.
    • Scoring matrix file: The Ssearch code can use scoring matrices (unit 3.5) that are not built into the code, such as the BLOSUM35 matrix (Fig. ). The matrix file should be in the same format accepted by the BLAST (Altschul et al., ; units 3.3 & 3.4) family of programs.
    • A variety of compatible scoring matrices that can be used with the Ssearch program can be found at the NCBI FTP site. The Ssearch program does not require an external scoring matrix file if one of the built‐in scoring matrices is selected.
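
The two file types listed above can be illustrated with a minimal Python sketch. It is not part of the FASTA package and makes several assumptions: that a FASTA file consists of ">" header lines followed by sequence lines, and that a BLAST-style matrix file has "#" comment lines, a header row of residue letters, and one row of integer scores per residue. The function names read_fasta and read_matrix and the toy data are hypothetical and serve only to show the layout of each file.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def read_matrix(handle):
    """Parse a BLAST-style scoring matrix into a {(row, col): score} dict."""
    columns, scores = None, {}
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if columns is None:
            columns = fields  # first data line names the columns
            continue
        row, values = fields[0], fields[1:]
        if len(values) != len(columns):
            raise ValueError(f"row {row!r} has {len(values)} scores, "
                             f"expected {len(columns)}")
        for col, value in zip(columns, values):
            scores[(row, col)] = int(value)
    return scores


# Toy inputs (hypothetical; the matrix is NOT a real BLOSUM table).
QUERY = """>query1 hypothetical example
ACDA
CA
"""
MATRIX = """# toy matrix for illustration only
   A  C  D
A  4  0 -2
C  0  9 -3
D -2 -3  6
"""

records = list(read_fasta(StringIO(QUERY)))
matrix = read_matrix(StringIO(MATRIX))
print(records)
print(matrix[("A", "D")])
```

A check like this can be useful before a long search, since a malformed query or matrix file is cheaper to catch up front than after the search has started.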



Literature Cited

   Aho, A.V., Hopcroft, J.E., and Ullman, J.D. 1983. Data Structures and Algorithms. Addison‐Wesley, Reading, Mass.
   Altschul, S.F. 1991. Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol. 219:555‐565.
   Altschul, S.F., Boguski, M.S., Gish, W., and Wootton, J.C. 1994. Issues in searching molecular sequence databases. Nature Genet. 6:119‐129.
   Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI‐BLAST: A new generation of protein database search programs. Nucl. Acids Res. 25:3389‐3402.
   Agarwal, P. and States, D. 1998. Comparative accuracy of methods for protein sequence similarity search. Bioinformatics 14:40‐47.
   Bork, P. and Gibson, T.J. 1996. Applying motif and profile searches. Methods Enzymol. 266:383‐402.
   Brendel, V., Bucher, P., Nourbakhsh, I.R., Blaisdell, B.E., and Karlin, S. 1992. Methods and algorithms for statistical analysis of protein sequences. Proc. Natl. Acad. Sci. U.S.A. 89:2002‐2006.
   Dayhoff, M., Schwartz, R.M., and Orcutt, B.C. 1978. A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure, Vol. 5, supp. 3 (M. Dayhoff, ed.) pp. 345‐352. National Biomedical Research Foundation, Silver Spring, Md.
   Henikoff, S. and Henikoff, J.G. 1992. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. U.S.A. 89:10915‐10919.
   Jones, D.T., Taylor, W.R., and Thornton, J.M. 1992. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. 8:275‐282.
   Kann, M., Qian, B., and Goldstein, R.A. 2000. Optimization of a new score function for the detection of remote homologs. Proteins 41:498‐503.
   Needleman, S.B. and Wunsch, C.D. 1970. A general method applicable to the search for similarities in the amino acid sequences of two proteins. J. Mol. Biol. 48:443‐453.
   Nicholas, H.B., Deerfield, D.W. II, and Ropelewski, A.J. 2000. Strategies for searching sequence databases. BioTechniques 28:1174‐1191.
   Nicholas, H.B., Ropelewski, A.J., and Deerfield, D.W. II. 2002. Strategies for multiple sequence alignment. BioTechniques 32:572‐591.
   Pearson, W.R. 1995. Comparison of methods for searching protein sequence databases. Protein Sci. 4:1145‐1160.
   Pearson, W.R. 1998. Empirical statistical estimates for sequence similarity searches. J. Mol. Biol. 276:71‐84.
   Pearson, W.R. and Lipman, D.J. 1988. Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. U.S.A. 85:2444‐2448.
   Pizzi, E. and Frontali, C. 2001. Low‐complexity regions in Plasmodium falciparum proteins. Genome Res. 11:218‐229.
   Ropelewski, A.J., Nicholas, H.B., and Deerfield, D.W. II. 2000. Selective and sensitive comparison of genetic sequence data. In Industrial Strength Parallel Computing (A. Koniges, ed.) pp. 453‐479. Morgan Kaufmann, San Francisco.
   Smith, T.F. and Waterman, M.S. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147:195‐197.
   Spang, R. and Vingron, M. 1998. Statistics of large‐scale sequence searching. Bioinformatics 14:279‐284.
   Spang, R. and Vingron, M. 2001. Limits of homology detection by pairwise sequence comparison. Bioinformatics 17:338‐342.
   States, D.J., Gish, W., and Altschul, S.F. 1991. Improved sensitivity of nucleic acid database searches using application‐specific scoring matrices. Methods Enzymol. 3:66‐77.
   Vingron, M. and Waterman, M.S. 1994. Sequence alignment and penalty choice: Review of concepts, case studies and implications. J. Mol. Biol. 235:1‐12.
   Waterman, M.S. and Eggert, M. 1987. A new algorithm for subsequence alignments with application to tRNA‐rRNA comparisons. J. Mol. Biol. 197:723‐728.
   Waterman, M.S. and Vingron, M. 1994. Rapid and accurate estimates of statistical significance for sequence database searches. Proc. Natl. Acad. Sci. U.S.A. 91:4625‐4628.
   Webber, C. and Barton, G.J. 2003. Increased coverage obtained by combination of methods for protein sequence database searching. Bioinformatics 19:1397‐1403.
   Weizhong, L., Po, F., Pawlowski, K., and Godzik, A. 2000. Saturated BLAST: An automated multiple intermediate sequence search used to detect distant homology. Bioinformatics 16:1105‐1110.
   Wootton, J.C. and Federhen, S. 1993. Statistics of local complexity in amino acid sequences and sequence databases. Comput. Chem. 17:149‐163.
Key References
   Altschul et al., 1994. See above.
  This review provides detailed information about local alignment statistics, extreme value distributions, scoring matrices, and low‐complexity regions.
   Agarwal and States, 1998. See above.
  This article compares the Smith‐Waterman, FASTA, original BLAST code, WU‐BLAST2, and Probabilistic Smith‐Waterman codes.
   Nicholas et al., 2000. See above.
  This review discusses the advantages and disadvantages of the BLAST, FASTA, and Smith‐Waterman search algorithms, how to select appropriate scoring matrices, scoring insertions and deletions, as well as a few different methods by which statistical significance can be computed.
   Pearson, 1995. See above.
  This article compares FASTA, Smith‐Waterman and original BLAST algorithms in the context of which method did the best job in finding members of 67 different protein superfamilies.
Internet Resources
  The FASTA package (in which the Ssearch code is included) can be obtained here.
  A variety of protein scoring matrices that can be used with the Ssearch code can be obtained here.