Finding Similar Nucleotide Sequences Using Network BLAST Searches

Istvan Ladunga1

1 Departments of Statistics, Biochemistry, and School of Biological Sciences, University of Nebraska‐Lincoln, Lincoln, Nebraska
Publication Name:  Current Protocols in Bioinformatics
Unit Number:  Unit 3.3
DOI:  10.1002/cpbi.29
Online Posting Date:  June, 2017
GO TO THE FULL TEXT: PDF or HTML at Wiley Online Library


The Basic Local Alignment Search Tool (BLAST) is usually the first tool applied in the annotation of nucleotide or amino acid sequences. BLAST is a flagship of bioinformatics due to its performance and user‐friendliness. In this unit, beginners and intermediate users will learn how to design and submit blastn and Megablast searches on the Web pages at the National Center for Biotechnology Information (NCBI). We map nucleic acid sequences to genomes, find identical or similar mRNA, expressed sequence tag (EST), and noncoding RNA sequences, and run Megablast searches, which are much faster than blastn. Interpretation of results is assisted by taxonomy reports, genomic views, and multiple alignments. We interpret expected frequency thresholds, biological significance, and statistical significance. Weak hits are not evidence in themselves, but they provide hints for further analyses. We find genes that may code for homologous proteins by translated BLAST. We reduce false positives by filtering out low‐complexity regions. Parsed BLAST results can be integrated into analysis pipelines. Links in the output connect to Entrez and PubMed, as well as to structural, sequence, interaction, and expression databases, which facilitates integration with a wide spectrum of biological knowledge. © 2017 by John Wiley & Sons, Inc.

Keywords: BLAST; sequence alignment; database search; homology search; mapping; nucleic acid; DNA; RNA; genome; blastn; Megablast
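
The abstract mentions expected frequency (E-value) thresholds and the integration of parsed BLAST results into analysis pipelines. As a rough illustration only, and not the unit's own procedure, the sketch below reads a BLAST hit table in the common 12-column tab-separated layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score) and keeps hits at or below a chosen E-value. The column layout, the 1e-5 cutoff, the function names, and the example rows are all assumptions made for this illustration. The helper expected_hits applies the standard relation E = m * n * 2^(-S'), where S' is the bit score and m and n stand for the query and database lengths; BLAST itself uses effective lengths, so reported E-values will differ somewhat.

```python
import io
from typing import Iterator, List, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    percent_identity: float
    alignment_length: int
    evalue: float
    bit_score: float


def parse_tabular(handle) -> Iterator[Hit]:
    """Yield hits from a 12-column tab-separated BLAST hit table (assumed layout)."""
    for line in handle:
        line = line.rstrip("\n")
        # Skip blank lines and the '#' comment lines some exports include.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a hit row in the assumed layout
        yield Hit(
            query=fields[0],
            subject=fields[1],
            percent_identity=float(fields[2]),
            alignment_length=int(fields[3]),
            evalue=float(fields[10]),
            bit_score=float(fields[11]),
        )


def significant_hits(handle, max_evalue: float = 1e-5) -> List[Hit]:
    """Keep hits whose E-value is at or below max_evalue (an illustrative cutoff)."""
    return [hit for hit in parse_tabular(handle) if hit.evalue <= max_evalue]


def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E-value from a bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical rows, invented for illustration; not real search output.
EXAMPLE = (
    "# hypothetical hit table\n"
    "query1\tsubjA\t98.50\t200\t3\t0\t1\t200\t501\t700\t1e-100\t364\n"
    "query1\tsubjB\t82.00\t50\t9\t0\t10\t59\t120\t169\t0.52\t31.9\n"
)

if __name__ == "__main__":
    for hit in significant_hits(io.StringIO(EXAMPLE)):
        print(hit.subject, hit.evalue, hit.bit_score)
```

A weak hit such as the second hypothetical row is dropped by the filter, but it can still be inspected separately as a hint for further analysis rather than treated as evidence.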

Table of Contents

  • Introduction
  • Basic Protocol 1: Using the WEB‐Interface Blast from the NCBI Blast Server for Nucleotide Sequences
  • Basic Protocol 2: The Default Blastn Result Output
  • Support Protocol 1: Setting Optional Parameters
  • Support Protocol 2: Formatting Results of a Blast Search
  • Alternate Protocol 1: Megablast Search for Ribosomal RNA
  • Alternate Protocol 2: Finding Transcribed Gene Copies and Splice Variants Using Megablast
  • Guidelines for Understanding Results
  • Commentary
  • Literature Cited
  • Figures
  • Tables

Literature Cited

  Altschul, S. F. (1991). Amino acid substitution matrices from an information theoretic perspective. Journal of Molecular Biology, 219, 555–565. doi: 10.1016/0022‐2836(91)90193‐A.
  Altschul, S. F., Boguski, M. S., Gish, W., & Wootton, J. C. (1994). Issues in searching molecular sequence databases. Nature Genetics, 6, 119–129. doi: 10.1038/ng0294‐119.
  Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of Molecular Biology, 215, 403–410. doi: 10.1016/S0022‐2836(05)80360‐2.
  Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W., & Lipman, D. J. (1997). Gapped BLAST and PSI‐BLAST: A new generation of protein database search programs. Nucleic Acids Research, 25, 3389–3402. doi: 10.1093/nar/25.17.3389.
  Bailey, T. L. (2002). Discovering novel sequence motifs with MEME. Current Protocols in Bioinformatics, 00, 2.4.1–2.4.35.
  Baker, M. E., Yan, L., & Pear, M. R. (2000). Three‐dimensional model of human TIP30, a coactivator for HIV‐1 Tat‐activated transcription, and CC3, a protein associated with metastasis suppression. Cellular and Molecular Life Sciences, 57, 851–858. doi: 10.1007/s000180050047.
  Barrett, C., Hughey, R., & Karplus, K. (1997). Scoring hidden Markov models. Computer Applications in the Biosciences, 13, 191–199.
  Baxevanis, A. D. (2005). Assessing pairwise sequence similarity: BLAST and FASTA. In A. D. Baxevanis & B. F. Ouellette (Eds.), Bioinformatics. A practical guide to the analysis of genes and proteins (pp. 295–324). Hoboken, NJ: John Wiley & Sons.
  Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., … Bourne, P. E. (2000). The protein data bank. Nucleic Acids Research, 28, 235–242. doi: 10.1093/nar/28.1.235.
  Bolten, E., Schliep, A., Schneckener, S., Schomburg, D., & Schrader, R. (2001). Clustering protein sequences–structure prediction by transitive homology. Bioinformatics, 17, 935–941. doi: 10.1093/bioinformatics/17.10.935.
  Coggill, P., Finn, R. D., & Bateman, A. (2008). Identifying protein domains with the Pfam database. Current Protocols in Bioinformatics, 23, 2.5.1–2.5.17.
  Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics, 14, 755–763. doi: 10.1093/bioinformatics/14.9.755.
  Elbashir, S. M., Harborth, J., Weber, K., & Tuschl, T. (2002). Analysis of gene function in somatic mammalian cells using small interfering RNAs. Methods, 26, 199–213. doi: 10.1016/S1046‐2023(02)00023‐3.
  Finn, R. D., Tate, J., Mistry, J., Coggill, P. C., Sammut, S. J., Hotz, H. R., … Bateman, A. (2008). The Pfam protein families database. Nucleic Acids Research, 36, D281‐D288. doi: 10.1093/nar/gkm960.
  Gerstein, M. (1998). Measurement of the effectiveness of transitive sequence comparison, through a third ‘intermediate’ sequence. Bioinformatics, 14, 707–714. doi: 10.1093/bioinformatics/14.8.707.
  Gibney, G., & Baxevanis, A. D. (2011). Searching NCBI databases using Entrez. Current Protocols in Bioinformatics, 34, 1.3.1–1.3.25. doi: 10.1002/0471250953.bi0103s34.
  Healy, M. (2002). Finding homologs to nucleic acid or protein sequences using the Framesearch program. Current Protocols in Bioinformatics, 00, 3.2.1–3.2.23.
  Henikoff, S., & Henikoff, J. G. (1992). Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences of the United States of America, 89, 10915–10919. doi: 10.1073/pnas.89.22.10915.
  Henikoff, J. G., Greene, E. A., Taylor, N., Henikoff, S., & Pietrokovski, S. (2002). Using the blocks database to recognize functional domains. Current Protocols in Bioinformatics, 00, 2.2.1–2.2.32. doi: 10.1002/0471250953.bi0202s00.
  Holm, L., & Sander, C. (1998). Removing near‐neighbor redundancy from large protein sequence collections. Bioinformatics, 14, 423–429. doi: 10.1093/bioinformatics/14.5.423.
  Johnson, M., Zaretskaya, I., Raytselis, Y., Merezhuk, Y., McGinnis, S., & Madden, T. L. (2008). NCBI BLAST: A better web interface. Nucleic Acids Research, 36, W5‐W9. doi: 10.1093/nar/gkn201.
  Jurka, J., Kapitonov, V. V., Kohany, O., & Jurka, M. V. (2007). Repetitive sequences in complex genomes: Structure and evolution. Annual Review of Genomics and Human Genetics, 8, 241–259. doi: 10.1146/annurev.genom.8.080706.092416.
  Karlin, S., & Altschul, S. F. (1990). Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proceedings of the National Academy of Sciences of the United States of America, 87, 2264–2268. doi: 10.1073/pnas.87.6.2264.
  Karlin, S., & Bucher, P. (1992). Correlation analysis of amino acid usage in protein classes. Proceedings of the National Academy of Sciences of the United States of America, 89, 12165–12169. doi: 10.1073/pnas.89.24.12165.
  Kent, W. J. (2002). BLAT–the BLAST‐like alignment tool. Genome Research, 12, 656–664. doi: 10.1101/gr.229202.
  Korf, I., Yandell, M., & Bedell, J. (2003). BLAST. An essential guide to the basic local alignment tool. Sebastopol, CA: O'Reilly.
  Ladunga, I. (2009). Finding homologs in amino acid sequences using network BLAST searches. Current Protocols in Bioinformatics, 25, 3.4:3.4.1–3.4.34. doi: 10.1002/0471250953.bi0304s25.
  Leonard, S. A. (2002). IUPAC/IUB single‐letter codes within nucleic acid and amino acid sequences. Current Protocols in Bioinformatics, 00, 1A:A.1A.1. doi: 10.1002/0471250953.bia01as00.
  Letunic, I., Copley, R. R., Pils, B., Pinkert, S., Schultz, J., & Bork, P. (2006). SMART 5: Domains in the context of genomes and networks. Nucleic Acids Research, 34, D257‐D260. doi: 10.1093/nar/gkj079.
  Liang, Y. D. (2006). Introduction to JAVA programming: Comprehensive version (3rd ed.). Lebanon, Ind.: Pearson Prentice Hall.
  Mills, L. (2014). Common file formats. Current Protocols in Bioinformatics, 1, A.1B.1–A.1B.18. doi: 10.1002/0471250953.bia01bs45.
  Møller, A., & Schwartzbach, M. I. (2006). An introduction to XML and Web technologies. New York: Addison‐Wesley.
  Morgulis, A., Coulouris, G., Raytselis, Y., Madden, T. L., Agarwala, R., & Schaffer, A. A. (2008). Database indexing for production MegaBLAST searches. Bioinformatics, 24, 1757–1764. doi: 10.1093/bioinformatics/btn322.
  Morgulis, A., Gertz, E. M., Schaffer, A. A., & Agarwala, R. (2006). A fast and symmetric DUST implementation to mask low‐complexity DNA sequences. Journal of Computational Biology, 13, 1028–1040. doi: 10.1089/cmb.2006.13.1028.
  Neuwald, A. F., & Altschul, S. F. (2016). Inference of functionally‐relevant n‐acetyltransferase residues based on statistical correlations. PLoS Computational Biology, 12, e1005294. doi: 10.1371/journal.pcbi.1005294.
  Schultz, J., Milpetz, F., Bork, P., & Ponting, C. P. (1998). SMART, a simple modular architecture research tool: Identification of signaling domains. Proceedings of the National Academy of Sciences of the United States of America, 95, 5857–5864. doi: 10.1073/pnas.95.11.5857.
  Stajich, J. E. (2007). An introduction to BioPerl. Methods in Molecular Biology, 406, 535–548.
  Stein, L. (1998). Official guide to programming with CGI.pm: The standard for building web scripts. New York: John Wiley & Sons.
  Stein, L. (2013). Creating databases for biological information: An introduction. Current Protocols in Bioinformatics, 42, 9.1:9.1.1–9.1.10.
  Stein, L. D. (2015). Unix survival guide. Current Protocols in Bioinformatics, 51, A1.C.1–A1.C.27. doi: 10.1002/0471250953.bia01cs51.
  Stephens, Z. D., Lee, S. Y., Faghri, F., Campbell, R. H., Zhai, C., Efron, M. J., … Robinson, G. E. (2015). Big data: Astronomical or genomical? PLoS Biology, 13, e1002195. doi: 10.1371/journal.pbio.1002195.
  Tarailo‐Graovac, M., & Chen, N. (2009). Using RepeatMasker to identify repetitive elements in genomic sequences. Current Protocols in Bioinformatics, 25, 4.10.1–4.10.14. doi: 10.1002/0471250953.bi0410s25.
  Thompson, J. D., Gibson, T. J., & Higgins, D. G. (2002). Multiple sequence alignment using ClustalW and ClustalX. Current Protocols in Bioinformatics, 00, 2.3.1–2.3.22. doi: 10.1002/0471250953.bi0203s00.
  Tisdall, J. D. (2001). Beginning PERL for bioinformatics. An introduction to PERL for biologists. Sebastopol, CA: O'Reilly.
  Tyner, C., Barber, G. P., Casper, J., Clawson, H., Diekhans, M., Eisenhart, C., … Kent, W. J. (2016). The UCSC genome browser database: 2017 update. Nucleic Acids Research. doi: 10.1093/nar/gkw1134.
  Ullman, L. (2006). MySQL: Visual quickstart guide. Berkeley, CA: Peachpit Press.
  Wang, Y., Addess, K. J., Chen, J., Geer, L. Y., He, J., He, S., … Bryant, S. H. (2007). MMDB: Annotating protein sequences with Entrez's 3D‐structure database. Nucleic Acids Research, 35, D298‐D300. doi: 10.1093/nar/gkl952.
  Wootton, J. C., & Federhen, S. (1996). Analysis of compositionally biased regions in sequence databases. Methods in Enzymology, 266, 554–571. doi: 10.1016/S0076‐6879(96)66035‐2.
  Zhang, Z., Schwartz, S., Wagner, L., & Miller, W. (2000). A greedy algorithm for aligning DNA sequences. Journal of Computational Biology, 7, 203–214. doi: 10.1089/10665270050081478.

Key References

  Altschul et al., 1994. See above.
  An excellent review of the application of pairwise BLAST tools to the identification of possible coding regions and the elucidation of gene structure and protein function. This review discusses significance, sequence filtering, database issues, alignment statistics, gap costs, scoring systems, and other topics.
  Altschul et al., 1997. See above.
  This is the original research paper on gapped BLAST and position‐specific iterated BLAST (PSI‐BLAST). It covers a series of algorithmic and performance improvements, gap penalties, and statistical considerations, as well as biological examples with marginal similarities.
  Baxevanis, A. D., & Ouellette, B. F. (2005). Bioinformatics. A practical guide to the analysis of genes and proteins. Hoboken, NJ: John Wiley & Sons.
  A widely taught, clearly written textbook that introduces pairwise sequence similarity searches, biological databases, and many other areas of bioinformatics. Reviews the general concepts of alignments, scoring matrices, and BLAST with practical applications and guidelines for interpretation.
  Korf et al., 2003. See above.
  An excellent overview of the theory and practice of the BLAST tools as of 2003. This comprehensive and easy‐to‐understand textbook is highly recommended to everyone in bioinformatics or computational biology.

Internet Resources

  The NCBI BLAST Web site.
  The Entrez Documentation at NCBI.
  The Entrez site for nucleic acid searches at NCBI.
  The BioPerl site.
  The full documentation for BLAST at NCBI.
  The new Server for the Washington University BLAST.
  The RepeatMasker Web site.