Installing, Maintaining, and Using a Local Copy of BLAST for Intranet and Workstation Use

Timothy G. Littlejohn1

1 IBM Life Sciences, St. Leonards, NSW
Publication Name:  Current Protocols in Bioinformatics
Unit Number:  Unit 3.11
DOI:  10.1002/0471250953.bi0311s05
Online Posting Date:  May, 2004

Abstract

The Basic Local Alignment Search Tool (BLAST) is one of the most widely used and most useful applications in sequence‐based bioinformatics analysis. It is often not practical or possible to use remote BLAST services through the Internet, either because of security or technical restrictions or because high‐throughput analysis requires more processing power than remote services provide. This unit describes the steps involved in obtaining and installing a copy of the BLAST software for use on a local intranet or stand‐alone workstation. Once installed, the BLAST package can be used to create BLAST‐searchable nucleotide and protein sequence databanks. Various popular hardware (PPC, Intel) and operating system (Mac OS X, FreeBSD, and Linux) options for running and maintaining the software are discussed. Finally, steps for indexing proprietary and third‐party (publicly available) sequence databanks for use with BLAST, and for managing these resources, are discussed.

Keywords: BLAST; sequence similarity searching; Unix‐like operating systems

     
 

Table of Contents

  • Strategic Planning
  • Basic Protocol 1: Installing and Running BLAST Locally under Unix‐Like Operating Systems such as Linux
  • Alternate Protocol 1: Installing and Running BLAST Locally under Microsoft Windows
  • Commentary
  • Literature Cited
  • Figures
  • Tables
     
 

Materials

Basic Protocol 1: Installing and Running BLAST Locally under Unix‐Like Operating Systems such as Linux

  Necessary Resources
  • Hardware
    • The hardware requirements for running BLAST locally are modest: any Intel or equivalent (e.g., AMD)–based architecture is adequate for this protocol. However, a few hardware considerations need to be addressed:
      • BLAST can be computationally intensive. When searching large databases, using CPU‐intensive BLAST algorithms (e.g., TBLASTX), or searching with many query sequences, more CPU power is better.
      • BLAST can be memory‐hungry; for best performance, the machine should have enough memory to hold the entire indexed database comfortably in RAM.
      • Databases can be large and require large amounts of disk space. Researchers who take on ambitious projects, such as downloading the entire GenBank database for local searching, should keep this in mind.
    • BLAST hardware can be scaled easily by adding disk space or RAM or by moving to multiprocessor architectures. In addition, when querying a large number of sequences against a common database, the work is easily parallelized: copy the indexed database, the relevant query sequences, and the blastall application to each machine on which the analyses are to be performed, and run the searches at the same time.
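The parallelization described above depends on dividing a multi‐sequence query file among machines. The following is a minimal Python sketch of one way to do that, assuming the queries are already in FASTA format; the function names and the round‐robin strategy are illustrative choices, not part of the BLAST package.

```python
from typing import Iterable, List


def read_fasta_records(lines: Iterable[str]) -> List[str]:
    """Group FASTA lines into records; each record starts at a '>' header line."""
    records: List[str] = []
    current: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        if line.startswith(">") and current:
            records.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        records.append("\n".join(current))
    return records


def split_round_robin(records: List[str], n_chunks: int) -> List[List[str]]:
    """Deal records into n_chunks lists so each machine gets a similar number of queries."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    chunks: List[List[str]] = [[] for _ in range(n_chunks)]
    for index, record in enumerate(records):
        chunks[index % n_chunks].append(record)
    return chunks


if __name__ == "__main__":
    # Hypothetical in-memory query set; in practice the lines would come from a file.
    demo = [">q1", "ACGT", ">q2", "GGCC", "TTAA", ">q3", "ATAT"]
    for number, chunk in enumerate(split_round_robin(read_fasta_records(demo), 2), start=1):
        print(f"chunk {number}: {len(chunk)} record(s)")
```

Each chunk can then be written to its own file and copied, together with the indexed database and blastall, to one machine. Round‐robin dealing balances the number of queries per machine but not their lengths, so uneven run times are still possible.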
  • Software
    • The stand‐alone BLAST package contains several programs. The two needed to run BLAST locally are formatdb, which creates BLAST‐searchable databases, and blastall, which queries these databases using any of the BLAST algorithms (blastn, blastp, blastx, tblastn, and tblastx). formatdb formats FASTA‐formatted databases for searching with BLAST. Details on formatdb can be found in the file README.formatdb distributed with the BLAST package. The options for formatdb are listed in Table 3.11.1. blastall is the main BLAST application. It is used to run queries against the indexed databases created with formatdb. Details on blastall can be found in the file README.bls distributed with the BLAST package. Some of the most commonly used blastall options are listed in Table 3.11.2.
      Table 3.11.1   Options for formatdb (a)

      Option   Explanation
      -t       Title for database file [String] (optional)
      -i       Input file(s) for formatting [File In] (this parameter must be set)
      -l       Logfile name [File Out] (optional); default = formatdb.log
      -p       Type of file [T/F] (optional): T = protein, F = nucleotide; default = T
      -o       Parse options [T/F] (optional): T (true) = parse SeqID and create indexes, F (false) = do not parse SeqID and do not create indexes; default = F
      -a       Input file is database in ASN.1 format (otherwise FASTA is expected) [T/F] (optional): T = true, F = false; default = F
      -b       ASN.1 database in binary mode [T/F] (optional): T = binary, F = text mode; default = F
      -e       Input is a Seq entry [T/F] (optional); default = F
      -n       Base name for BLAST files [String] (optional)
      -v       Number of sequence bases to be created in the volume [Integer] (optional); default = 0
      -s       Create indexes limited only to accessions: sparse [T/F] (optional); default = F
      -V       Verbose: check for nonunique string IDs in the database [T/F] (optional); default = F
      -A       Create ASN.1 structured deflines [T/F] (optional); default = F

      (a) For an example of using these options, see Basic Protocol 1.
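As an illustration of how the formatdb options combine, the following minimal Python sketch assembles a formatdb command line from a few of the options listed for formatdb (-i, -p, -o, -t, -n). The helper name and the file names are hypothetical, and the sketch only builds and prints the command; running it requires that formatdb from the stand‐alone BLAST package be installed and on the search path.

```python
import shlex
from typing import List, Optional


def formatdb_command(
    fasta_path: str,
    protein: bool = True,
    parse_seqids: bool = False,
    title: Optional[str] = None,
    base_name: Optional[str] = None,
) -> List[str]:
    """Build an argument list for formatdb using options from the formatdb table."""
    cmd = [
        "formatdb",
        "-i", fasta_path,                    # input FASTA file (required)
        "-p", "T" if protein else "F",       # T = protein, F = nucleotide
        "-o", "T" if parse_seqids else "F",  # T = parse SeqID and create indexes
    ]
    if title is not None:
        cmd += ["-t", title]                 # optional database title
    if base_name is not None:
        cmd += ["-n", base_name]             # optional base name for BLAST files
    return cmd


if __name__ == "__main__":
    # "my_ests.fasta" and "my_ests" are placeholder names for illustration only.
    cmd = formatdb_command("my_ests.fasta", protein=False, parse_seqids=True,
                           title="My ESTs", base_name="my_ests")
    print(shlex.join(cmd))
    # To run it for real: import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string means that values containing spaces, such as the title, reach formatdb as one argument without extra shell quoting.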
      Table 3.11.2   Options for blastall (b)

      Option   Explanation
      -p       Program name [String]. Input should be one of blastp, blastn, blastx, tblastn, or tblastx.
      -d       Database [String]; default = nr. The database specified must first be formatted with formatdb. An example would be -d "nr est", which will search both the nr and est databases, presenting the results as if one "virtual" database consisting of all the entries from both were searched. The statistics are based on the "virtual" database of nr and est.
      -i       Query file [File In]; default = stdin. The query should be in FASTA format. If multiple FASTA entries are in the input file, all queries will be searched.
      -e       Expectation value (E) [Real]; default = 10.0
      -o       BLAST report output file [File Out] (optional); default = stdout
      -F       Filter query sequence (dust with BLASTN, seg with others) [String]; default = T. BLAST 2.0 and 2.1 use the dust low‐complexity filter for BLASTN and seg for the other programs. Both dust and seg are integral parts of the NCBI Toolkit and are accessed automatically. With -F T, normal filtering by seg or dust (for BLASTN) occurs; with -F F, no filtering is done. This option also takes a string as an argument, which can be used to change the specific parameters of seg or to invoke other filters.
      -S       Query strands to search against database (for BLASTN, BLASTX, and TBLASTX) [Integer]: 3 = both, 1 = top, 2 = bottom; default = 3
      -T       Produce HTML output [T/F]; default = F
      -l       Restrict search of database to list of GIs [String] (optional). Only the subset of the database determined by the list of GIs (i.e., NCBI identifiers) in a file is searched. A list of GIs for a given Entrez query can be obtained from http://www.ncbi.nlm.nih.gov/Entrez/batch.html. This file should be in the same directory as the database, or in the directory from which BLAST is called.
      -U       Use lowercase filtering of FASTA sequence [T/F] (optional). Any lowercase letters in the input FASTA file are masked.

      (b) For an example of using these options, see Basic Protocol 1.
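The blastall options can be combined in the same way. The following minimal Python sketch builds a blastall command line from options listed for blastall (-p, -d, -i, -e, -F, -T, -o). The helper name, database names, and file names are hypothetical, and the sketch assumes that several databases are passed to -d as a single space‐separated argument, as in the -d "nr est" example.

```python
import shlex
from typing import List, Optional, Sequence

# Program names accepted by the -p option.
BLAST_PROGRAMS = ("blastp", "blastn", "blastx", "tblastn", "tblastx")


def blastall_command(
    program: str,
    databases: Sequence[str],
    query_path: str,
    evalue: float = 10.0,
    output_path: Optional[str] = None,
    filter_query: bool = True,
    html: bool = False,
) -> List[str]:
    """Build an argument list for blastall using options from the blastall table."""
    if program not in BLAST_PROGRAMS:
        raise ValueError(f"unknown BLAST program: {program!r}")
    if not databases:
        raise ValueError("at least one database name is required")
    cmd = [
        "blastall",
        "-p", program,
        "-d", " ".join(databases),           # several names form one "virtual" database
        "-i", query_path,                    # FASTA query file
        "-e", str(evalue),                   # expectation value cutoff
        "-F", "T" if filter_query else "F",  # low-complexity filtering on or off
        "-T", "T" if html else "F",          # HTML output on or off
    ]
    if output_path is not None:
        cmd += ["-o", output_path]           # report file; stdout if omitted
    return cmd


if __name__ == "__main__":
    # "query.fasta" and "report.txt" are placeholder names for illustration only.
    cmd = blastall_command("blastn", ["nr", "est"], "query.fasta",
                           evalue=0.001, output_path="report.txt")
    print(shlex.join(cmd))
    # To run it for real: import subprocess; subprocess.run(cmd, check=True)
```

Checking the program name before launching avoids starting a long search with a mistyped -p value. The databases named must already have been formatted with formatdb.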
  • Files
    • Input data files must be in FASTA format (see Appendix 1B).
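Because both formatdb and blastall expect FASTA input, it can be worth checking a file before formatting or searching. The following minimal Python sketch performs a few basic structural checks, assuming the common FASTA convention of a '>' header line followed by one or more sequence lines; the function name and the particular checks are illustrative and do not reproduce the validation done by the BLAST programs themselves.

```python
from typing import Iterable, List


def fasta_problems(lines: Iterable[str]) -> List[str]:
    """Return a list of structural problems found in FASTA-formatted lines."""
    problems: List[str] = []
    seen_header = False
    header_line_no = 0
    residues_in_record = 0
    for line_no, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines are ignored
        if line.startswith(">"):
            # Close the previous record before starting a new one.
            if seen_header and residues_in_record == 0:
                problems.append(f"line {header_line_no}: header has no sequence")
            if len(line) == 1:
                problems.append(f"line {line_no}: header has no identifier")
            seen_header = True
            header_line_no = line_no
            residues_in_record = 0
        elif not seen_header:
            problems.append(f"line {line_no}: sequence data before the first '>' header")
        else:
            residues_in_record += len(line)
    if seen_header and residues_in_record == 0:
        problems.append(f"line {header_line_no}: header has no sequence")
    if not seen_header:
        problems.append("no '>' header line found")
    return problems


if __name__ == "__main__":
    # Hypothetical two-record input; in practice, pass an open file object.
    example = [">seq1 example protein", "MKVLAT", ">seq2", "ACGTACGT"]
    print(fasta_problems(example) or "no problems found")
```

An empty result means only that the file has the expected overall shape; it does not confirm that the residues are valid nucleotide or protein codes.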

Literature Cited

   Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403‐410.
   Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI‐BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25:3389‐3402.
Internet Resources
   http://www.ibiostation.com
  Web site of iBiostation, from which the book iBiostation Linux: Bioinformatics for Linux (2003), by M. Hobbs, T. G. Littlejohn and K. Castle (BioLateral Pty. Ltd., Sydney, Au.; ISBN 0‐9750583‐0‐4), may be purchased.