Finding Homologs to Nucleic Acid or Protein Sequences Using the Framesearch Program

Matthew Healy¹

¹ Bristol‐Myers Squibb Pharmaceutical Research Institute, Wallingford, Connecticut
Publication Name:  Current Protocols in Bioinformatics
Unit Number:  Unit 3.2
DOI:  10.1002/0471250953.bi0302s00
Online Posting Date:  August, 2002
GO TO THE FULL TEXT: PDF or HTML at Wiley Online Library

Abstract

Framesearch builds the possibility of frameshift errors into its alignment algorithm and can therefore find alignments that span different reading frames. The protocols in this unit describe how to use Framesearch to search a protein sequence database for sequences similar to a query nucleotide sequence, and to search a nucleotide sequence database for sequences similar to a query protein sequence. Three alternate protocols describe ways to speed up Framesearch and thus make it practical for routine use. Framesearch is especially appropriate for low‐quality, single‐read nucleotide sequence data, such as expressed sequence tags (ESTs) or early drafts of genomic sequences; for relatively high‐quality nucleotide sequences with few single‐nucleotide insertion or deletion errors, it offers no significant advantage over less CPU‐intensive algorithms.
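To illustrate why reading-frame changes matter for low-quality reads, the following minimal Python sketch translates a short DNA sequence in all three forward frames before and after a single-base insertion. It is not the Framesearch algorithm itself; the example sequence and the function name `translate` are hypothetical, chosen only for illustration.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna, frame=0):
    """Translate dna starting at offset frame (0, 1, or 2).

    Codons not in the table (for example, containing N) become X.
    """
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


# Hypothetical error-free sequence and the same read with one extra base.
clean = "ATGGCTAAAGAACTGTGG"
read_with_insertion = clean[:9] + "C" + clean[9:]

print("clean  ", translate(clean))                 # MAKELW
for frame in range(3):
    print("frame", frame, translate(read_with_insertion, frame))
# frame 0 -> MAKRTV (correct start, wrong tail)
# frame 1 -> WLNELW (wrong start, correct tail)
# frame 2 -> G*TNC
```

No single frame of the damaged read reproduces the full peptide: the start is recovered in frame 0 and the tail in frame 1. An alignment method that permits a frameshift can join the two pieces, which a search restricted to one frame at a time cannot do.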

     
 

Table of Contents

  • Basic Protocol 1: Framesearch Using a Nucleic Acid Query Sequence
  • Basic Protocol 2: Framesearch Using a Protein Query Sequence
  • Alternate Protocol 1: Prefiltering with a Search Algorithm to Improve the Speed of Framesearch with a Nucleic Acid Query Sequence
  • Alternate Protocol 2: Prefiltering with a Search Algorithm to Improve the Speed of Framesearch with a Protein Query Sequence
  • Alternate Protocol 3: Improving Speed of Framesearch by Using Specialized Hardware
  • Support Protocol 1: Downloading and Converting Sequence Files for the Examples Used in the Protocols
  • Guidelines for Understanding Results
  • Commentary
  • Figures
     
 

Materials

Basic Protocol 1: Framesearch Using a Nucleic Acid Query Sequence

  Necessary Resources
  • Hardware
    • Framesearch can be run on any Unix or VMS system that has the Wisconsin Package installed; because it is so CPU‐intensive, Framesearch should be run on the fastest computer available to the user
  • Software
    • GCG Wisconsin Package (v. 8.1 or higher)
  • Files
    • DNA sequence file of interest (this will be the query sequence; maximum length, 350 kb)
    • Protein database of sequences to which the DNA sequence will be compared
    For example, BA000007.faa contains the amino acid translations of all putative genes found in this bacterial genome by the laboratory that sequenced it, as a single FASTA‐format text file (appendix 1B). The files used in this example can be downloaded from NCBI or from the Current Protocols Web site (http://www3.interscience.wiley.com/c_p/cpbi_sampledatafiles.htm); both the query sequence and the database files must then be converted to GCG format, as described in Support Protocol 1.
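Before converting files, it can be useful to confirm that a FASTA file parses cleanly and that the query respects the length limit stated above. The following Python sketch is one way to do that. It assumes that "350 kb" means 350,000 bases, and the names `read_fasta`, `query_fits`, and the example records are hypothetical. It does not perform the GCG conversion itself.

```python
import io

# Assumption: the 350 kb query limit is read as 350,000 bases.
MAX_QUERY_LENGTH = 350_000


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-format text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def query_fits(sequence, limit=MAX_QUERY_LENGTH):
    """Return True if the query sequence is no longer than the limit."""
    return len(sequence) <= limit


# Hypothetical two-record FASTA text standing in for a real file.
example = io.StringIO(
    ">query1 hypothetical EST\nATGGCTAAAG\nAACTGTGG\n>query2\nACGT\n"
)
records = list(read_fasta(example))
for header, sequence in records:
    print(header, len(sequence), query_fits(sequence))
```

For a real file, the `io.StringIO` object would be replaced by an open file handle, for example `open("BA000007.faa")`.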

Basic Protocol 2: Framesearch Using a Protein Query Sequence

  Necessary Resources
  • Hardware
    • Framesearch can be run on any Unix or VMS system that has the Wisconsin Package installed; because it is so CPU‐intensive, Framesearch should be run on the fastest computer available to the user
  • Software
    • GCG Wisconsin Package (v. 8.1 or higher)
  • Files
    • Protein sequence file of interest (this will be the query sequence)
    • Nucleic acid database of sequences to which the protein sequence will be compared
    For example, BA000007.fna contains the nucleotide sequences of all putative genes found in this bacterial genome by the laboratory that sequenced it, as a single FASTA‐format text file (appendix 1B). The files used in this example can be downloaded from NCBI or from the Current Protocols Web site (http://www3.interscience.wiley.com/c_p/cpbi_sampledatafiles.htm); both the query sequence and the database files must then be converted to GCG format, as described in Support Protocol 1.

Alternate Protocol 1: Prefiltering with a Search Algorithm to Improve the Speed of Framesearch with a Nucleic Acid Query Sequence

  Necessary Resources
  • Hardware
    • Framesearch can be run on any Unix or VMS system that has the Wisconsin Package installed; because it is so CPU‐intensive, Framesearch should be run on the fastest computer available to the user
  • Software
    • GCG Wisconsin Package (v. 8.1 or higher)
    • BLAST program (unit 3.4)
    In the GCG environment assumed for these examples, both BLAST and Framesearch are included.
  • Files
    • DNA sequence file of interest (this will be the query sequence; maximum length, 350 kb)
    • Protein database of sequences to which the DNA sequence will be compared
    For example, the protein database used here contains the amino acid translations of all putative genes found in this bacterial genome by the laboratory that sequenced it, as a single FASTA‐format text file (appendix 1B). The files used in this example can be downloaded from NCBI or from the Current Protocols Web site (http://www3.interscience.wiley.com/c_p/cpbi_sampledatafiles.htm); both the query sequence and the database files must then be converted to GCG format, as described in Support Protocol 1.
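The idea behind prefiltering can be sketched in a few lines of Python: a fast search produces scored hits, only the best-scoring database entries are kept, and the slower, frameshift-tolerant search is then run against that much smaller subset. This is a generic illustration rather than the procedure given in the unit. The hit list, score threshold, and function names are hypothetical, and no BLAST or Framesearch output format is assumed.

```python
def select_candidates(hits, min_score, max_candidates):
    """Return IDs of the best-scoring hits, highest score first.

    hits is an iterable of (subject_id, score) pairs. When an ID occurs
    more than once, only its best score is considered.
    """
    best = {}
    for subject_id, score in hits:
        if score >= min_score and score > best.get(subject_id, float("-inf")):
            best[subject_id] = score
    ranked = sorted(best, key=lambda sid: (-best[sid], sid))
    return ranked[:max_candidates]


def subset_database(records, wanted_ids):
    """Keep (header, sequence) records whose first header word is wanted."""
    wanted = set(wanted_ids)
    return [
        (header, seq)
        for header, seq in records
        if (header.split() or [""])[0] in wanted
    ]


# Hypothetical hits from a fast first-pass search.
hits = [("protA", 52.0), ("protB", 18.5), ("protA", 40.0), ("protC", 75.0)]
candidates = select_candidates(hits, min_score=30.0, max_candidates=10)

# Hypothetical database records as (header, sequence) pairs.
database = [
    ("protA first protein", "MAKELW"),
    ("protB", "MKV"),
    ("protC", "MSTN"),
]
reduced = subset_database(database, candidates)
print(candidates)  # ['protC', 'protA']
print(reduced)     # records for protA and protC only
```

The trade-off is sensitivity: any true homolog that the fast first pass misses or scores below the threshold never reaches the slower search, so the threshold should be set permissively.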

Alternate Protocol 2: Prefiltering with a Search Algorithm to Improve the Speed of Framesearch with a Protein Query Sequence

  Necessary Resources
  • Hardware
    • Framesearch can be run on any Unix or VMS system that has the Wisconsin Package installed; because it is so CPU‐intensive, Framesearch should be run on the fastest computer available to the user
  • Software
    • GCG Wisconsin Package (v. 8.1 or higher)
    • BLAST program (unit 3.4)
    In the GCG environment assumed for these examples, both BLAST and Framesearch are included.
  • Files
    • Protein sequence file of interest (this will be the query sequence)
    • Nucleic acid database of sequences to which the protein sequence will be compared
    For example, BA000007.fna contains the nucleotide sequences of all putative genes found in this bacterial genome by the laboratory that sequenced it, as a single FASTA‐format text file (appendix 1B). The files used in this example can be downloaded from NCBI or from the Current Protocols Web site (http://www3.interscience.wiley.com/c_p/cpbi_sampledatafiles.htm); both the query sequence and the database files must then be converted to GCG format, as described in Support Protocol 1.

Alternate Protocol 3: Improving Speed of Framesearch by Using Specialized Hardware

  Necessary Resources
  • Hardware
    • Any Unix or VMS system that has the Wisconsin Package installed
  • Software
    • GCG Wisconsin Package (v. 8.1 or higher; includes FROMFASTA)
  • Files
    • The files used in this example can be downloaded from the NCBI FTP server as described below, or from the Current Protocols Web site (http://www3.interscience.wiley.com/c_p/cpbi_sampledatafiles.htm)

Literature Cited

   Accelerys. 2001. Announcement of new features in SeqWeb version 2. http://www.accelerys.com/products/seqweb/whats_new2p0.html.
   Edelman, I., Faigler, S., Mintz, E., Natan, A., and Devereux, J. 1995. Framesearch: A rigorous alignment program for searching protein databases with nucleic acid queries. Poster, Genome Sequence and Analysis Conference, Hilton Head, South Carolina, 1995.
   NOTE: The text of this poster can be found at http://sulu.gcg.com/company/posters/framesearch.html.
   GCG. 1995. GCG Transcript 3:2. Genetics Computer Group, Madison, Wisconsin.
   NOTE: The GCG Transcript, subtitled “Bio‐Computing News for Users of the Wisconsin Package,” was published by the company for a number of years. The text of this issue, which features a discussion of the newly added Framesearch program, can be found at http://sulu.gcg.com/pub/newsletter/vol3_no2_nov95.html.
   Halperin, E., Faigler, S., and Gill‐More, R. 1999. FramePlus: Aligning DNA to protein sequences. Bioinformatics 15(11):867‐873.
   TimeLogic. 2001. Manuals supplied with a DeCypher bioinformatics accelerator. TimeLogic Corporation, Incline Village, Nevada.
   Zhang, Z., Pearson, W.R., and Miller, W. 1997. Aligning a DNA sequence with a protein sequence. Journal of Computational Biology 4(3):339‐349.
Key References
   Edelman et al., 1995. See above.
  The key reference for the Framesearch algorithm is the poster by Edelman et al. (1995). The key reference for any particular implementation of Framesearch is the documentation supplied with that implementation.
Internet Resources
   http://www.accelerys.com/
  Web site of Accelerys, the corporate parent of GCG.
   http://www.cgen.com/
  Web site of the Compugen company.
   http://www.paracel.com/
  Web site of the Paracel company.
   http://www.timelogic.com
  Web site of the TimeLogic company.