Finding Homologs in Amino Acid Sequences Using Network BLAST Searches

Istvan Ladunga

Departments of Statistics, Biochemistry and School of Biological Sciences, University of Nebraska–Lincoln, Lincoln, Nebraska
Publication Name:  Current Protocols in Bioinformatics
Unit Number:  Unit 3.4
DOI:  10.1002/cpbi.34
Online Posting Date:  September, 2017
GO TO THE FULL TEXT: PDF or HTML at Wiley Online Library

Abstract

BLAST, the Basic Local Alignment Search Tool, is used more frequently than any other biosequence database search program. We show how to run searches on the Web, and demonstrate how to increase performance by fine‐tuning arguments for a specific research project. We offer guidance for interpreting results, statistical significance and biological relevance issues, and suggest complementary analyses. This unit covers both protein‐to‐protein (blastp) searches and translated searches (blastx, tblastn, tblastx). blastx conceptually translates the query sequence and tblastn translates all nucleotide sequences in a database, while tblastx translates both the query and the database sequences into amino acid sequences. © 2017 by John Wiley & Sons, Inc.

Keywords: BLAST; Basic Local Alignment Search Tool; Sequence similarity search; Protein function prediction; Homology; translated BLAST searches
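
The conceptual translation that the abstract attributes to blastx, tblastn, and tblastx can be illustrated with a minimal six-frame translation sketch. It is not how BLAST itself is implemented; it only shows the idea that a nucleotide sequence yields six candidate amino acid sequences (three reading frames on each strand), which are then compared at the protein level. The function names and the frame labels (1 to 3 for the forward strand, -1 to -3 for the reverse strand) are hypothetical, and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
# (first base varies slowest, third base fastest). Assumption: standard code only.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an unambiguous DNA string."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Map illustrative frame labels (1..3, -1..-3) to conceptual translations."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in sorted(six_frame_translations("ATGGCCTAA").items()):
        print(frame, protein)
```

In these terms, blastx would apply such a translation to the query only, tblastn to every database sequence, and tblastx to both sides before protein-level comparison.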

     
 

Table of Contents

  • Introduction
  • Basic Protocol 1: Using the Web‐Interface BLAST for BLASTP: Protein‐to‐Protein Searches
  • Support Protocol 1: Setting Arguments (Options) for Advanced BLAST
  • Support Protocol 2: Formatting Results from a BLAST Search
  • Basic Protocol 2: Translated BLAST Searches
  • Basic Protocol 3: Comparing Two or More Sequences
  • Guidelines for Understanding Results
  • Commentary
  • Literature Cited
  • Figures
     
 

Literature Cited

  Adams, M. D., Dubnick, M., Kerlavage, A. R., Moreno, R., Kelley, J. M., Utterback, T. R., … Venter, J. C. (1992). Sequence identification of 2,375 human brain genes. Nature, 355(6361), 632–634. doi: 10.1038/355632a0.
  Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of Molecular Biology, 215(3), 403–410. doi: 10.1016/S0022‐2836(05)80360‐2.
  Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W., & Lipman, D. J. (1997). Gapped BLAST and PSI‐BLAST: A new generation of protein database search programs. Nucleic Acids Research, 25(17), 3389–3402. doi: 10.1093/nar/25.17.3389.
  Benson, D. A., Cavanaugh, M., Clark, K., Karsch‐Mizrachi, I., Lipman, D. J., Ostell, J., & Sayers, E. W. (2017). GenBank. Nucleic Acids Research, 45(D1), D37–D42. doi: 10.1093/nar/gkw1070.
  Birney, E., Clamp, M., & Durbin, R. (2004). GeneWise and Genomewise. Genome Research, 14(5), 988–995. doi: 10.1101/gr.1865504.
  Chen, C., Huang, H., & Wu, C. H. (2017). Protein bioinformatics databases and resources. Methods in Molecular Biology (Clifton, N.J.), 1558, 3–39. doi: 10.1007/978‐1‐4939‐6783‐4_1.
  Dayhoff, M., & Eck, R. (1968). Atlas of protein sequence and structure 1967–1968. Silver Spring, MD: National Biomedical Research Foundation.
  Dorn, M., e Silva, M. B., Buriol, L. S., & Lamb, L. C. (2014). Three‐dimensional protein structure prediction: Methods and computational strategies. Computational Biology and Chemistry, 53PB, 251–276. doi: 10.1016/j.compbiolchem.2014.10.001.
  Finn, R. D., Coggill, P., Eberhardt, R. Y., Eddy, S. R., Mistry, J., Mitchell, A. L., … Bateman, A. (2016). The Pfam protein families database: Towards a more sustainable future. Nucleic Acids Research, 44(D1), D279–D285. doi: 10.1093/nar/gkv1344.
  Germani, F., Moens, L., & Dewilde, S. (2013). Haem‐based sensors: A still growing old superfamily. Advances in Microbial Physiology, 63, 1–47. doi: 10.1016/B978‐0‐12‐407693‐8.00001‐7.
  Gish, W., & States, D. J. (1993). Identification of protein coding regions by database similarity search. Nature Genetics, 3(3), 266–272. doi: 10.1038/ng0393‐266.
  Henikoff, S., & Henikoff, J. G. (1993). Performance evaluation of amino acid substitution matrices. Proteins, 17(1), 49–61. doi: 10.1002/prot.340170108.
  Hoff, K. J., Lange, S., Lomsadze, A., Borodovsky, M., & Stanke, M. (2016). BRAKER1: Unsupervised RNA‐seq‐based genome annotation with GeneMark‐ET and AUGUSTUS. Bioinformatics (Oxford, England), 32(5), 767–769. doi: 10.1093/bioinformatics/btv661.
  Jurka, J. (2008). Conserved eukaryotic transposable elements and the evolution of gene regulation. Cellular and Molecular Life Sciences, 65(2), 201–204. doi: 10.1007/s00018‐007‐7369‐3.
  Kapitonov, V. V., & Jurka, J. (2008). A universal classification of eukaryotic transposable elements implemented in Repbase. Nature Reviews Genetics, 9(5), 411–412; author reply 414. doi: 10.1038/nrg2165‐c1.
  Krogh, A., Brown, M., Mian, I. S., Sjolander, K., & Haussler, D. (1994). Hidden Markov models in computational biology. Applications to protein modeling. Journal of Molecular Biology, 235(5), 1501–1531. doi: 10.1006/jmbi.1994.1104.
  Ladunga, I. (1992). Phylogenetic continuum indicates “galaxies” in the protein universe: Preliminary results on the natural group structures of proteins. Journal of Molecular Evolution, 34(4), 358–375. doi: 10.1007/BF00160244.
  Ladunga, I. (2017). Finding similar nucleotide sequences using network BLAST searches. Current Protocols in Bioinformatics, 58, 3.3.1–3.3.25. doi: 10.1002/cpbi.29.
  Letunic, I., Doerks, T., & Bork, P. (2015). SMART: Recent updates, new developments and status in 2015. Nucleic Acids Research, 43(Database issue), D257–D260. doi: 10.1093/nar/gku949.
  Mackey, A. J., & Pearson, W. R. (2004). Using relational databases for improved sequence similarity searching and large‐scale genomic analyses. Current Protocols in Bioinformatics, 7, 9.4.1–9.4.25. doi: 10.1002/0471250953.bi0904s7.
  NCBI Resource Coordinators (2017). Database resources of the National Center for Biotechnology Information. Nucleic Acids Research, 45(D1), D12–D17. doi: 10.1093/nar/gkw1071.
  Pearson, W. R. (2013a). An introduction to sequence similarity (“homology”) searching. Current Protocols in Bioinformatics, 42, 3.1.1–3.1.8. doi: 10.1002/0471250953.bi0301s42.
  Pearson, W. R. (2013b). Selecting the right similarity‐scoring matrix. Current Protocols in Bioinformatics, 43, 3.5.1–3.5.9. doi: 10.1002/0471250953.bi0305s43.
  Petsko, G. A. (2006). An introduction to modeling structure from sequence. Current Protocols in Bioinformatics, 15, 5.1.1–5.1.3. doi: 10.1002/0471250953.bi0501s15.
  Pruitt, K. D., Brown, G. R., Hiatt, S. M., Thibaud‐Nissen, F., Astashyn, A., Ermolaeva, O., … Ostell, J. M. (2014). RefSeq: An update on mammalian reference sequences. Nucleic Acids Research, 42(Database issue), D756–D763. doi: 10.1093/nar/gkt1114.
  Reeck, G., de Haen, C., Teller, D., Doolittle, R., Fitch, W., Dickerson, R., … Zuckerkandl, E. (1987). Homology in proteins and nucleic acids: A terminology muddle and a way out of it. Cell, 50, 667. doi: 10.1016/0092‐8674(87)90322‐9.
  Rodionov, M. A., & Blundell, T. L. (1998). Sequence and structure conservation in a protein core. Proteins, 33(3), 358–366.
  Ropelewski, A. J., Nicholas, H. B., & Deerfield, D. W. (2004). Mathematically complete nucleotide and protein sequence searching using Ssearch. Current Protocols in Bioinformatics, 4, 3.10.1–3.10.12. doi: 10.1002/0471250953.bi0310s04.
  Rose, P. W., Prlic, A., Altunkaya, A., Bi, C., Bradley, A. R., Christie, C. H., … Burley, S. K. (2017). The RCSB Protein Data Bank: Integrative view of protein, gene and 3D structural information. Nucleic Acids Research, 45(D1), D271–D281. doi: 10.1093/nar/gkw1000.
  Schaffer, A. A., Aravind, L., Madden, T. L., Shavirin, S., Spouge, J. L., Wolf, Y. I., … Altschul, S. F. (2001). Improving the accuracy of PSI‐BLAST protein database searches with composition‐based statistics and other refinements. Nucleic Acids Research, 29(14), 2994–3005. doi: 10.1093/nar/29.14.2994.
  Skolnick, J., & Zhou, H. (2017). Why is there a glass ceiling for threading based protein structure prediction methods? The Journal of Physical Chemistry. B, 121(15), 3546–3554. doi: 10.1021/acs.jpcb.6b09517.
  Smith, T. F., & Waterman, M. S. (1981). Identification of common molecular subsequences. Journal of Molecular Biology, 147(1), 195–197. doi: 10.1016/0022‐2836(81)90087‐5.
  Stockinger, H., Altenhoff, A. M., Arnold, K., Bairoch, A., Bastian, F., Bergmann, S., … Appel, R. D. (2014). Fifteen years SIB Swiss Institute of Bioinformatics: Life science databases, tools and support. Nucleic Acids Research, 42(Web Server issue), W436–W441. doi: 10.1093/nar/gku380.
  Stormo, G. D. (2011). An introduction to recognizing functional domains. Current Protocols in Bioinformatics, 34, 2.1.1–2.1.6. doi: 10.1002/0471250953.bi0201s34.
  Tatusova, T. A., & Madden, T. L. (1999). BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences. FEMS Microbiology Letters, 174(2), 247–250. doi: 10.1111/j.1574‐6968.1999.tb13575.x.
  Thomas, P. D., Kejariwal, A., Campbell, M. J., Mi, H., Diemer, K., Guo, N., … Doremieux, O. (2003). PANTHER: A browsable database of gene products organized by biological function, using curated protein family and subfamily classification. Nucleic Acids Research, 31(1), 334–341. doi: 10.1093/nar/gkg115.
  Tyner, C., Barber, G. P., Casper, J., Clawson, H., Diekhans, M., Eisenhart, C., … Kent, W. J. (2016). The UCSC genome browser database: 2017 update. Nucleic Acids Research, 45(Database issue), D626–D634.
  Tzou, P. L., Huang, X., & Shafer, R. W. (2017). NucAmino: A nucleotide to amino acid alignment optimized for virus gene sequences. BMC Bioinformatics, 18(1), 138. doi: 10.1186/s12859‐017‐1555‐6.
  Wootton, J. C., & Federhen, S. (1996). Analysis of compositionally biased regions in sequence databases. Methods in Enzymology, 266, 554–571. doi: 10.1016/S0076‐6879(96)66035‐2.
  Yu, Y. K., Gertz, E. M., Agarwala, R., Schaffer, A. A., & Altschul, S. F. (2006). Retrieval accuracy, statistical significance and compositional similarity in protein sequence database searches. Nucleic Acids Research, 34(20), 5966–5973. doi: 10.1093/nar/gkl731.
  Zerbino, D. R. (2010). Using the Velvet de novo assembler for short‐read sequencing technologies. Current Protocols in Bioinformatics, 31, 11.5.1–11.5.12. doi: 10.1002/0471250953.bi1105s31.
  Zhou, L., Pertea, M., Delcher, A. L., & Florea, L. (2009). Sim4cc: A cross‐species spliced alignment program. Nucleic Acids Research, 37(11), e80. doi: 10.1093/nar/gkp319.

Key References

  Altschul, S. F., Boguski, M. S., Gish, W., & Wootton, J. C. (1994). Issues in searching molecular sequence databases. Nature Genetics, 6, 119–129.
  Probably the best description of the BLAST program that produced nongapped alignments at that time. This review discusses the underlying statistics and their biological interpretation, the scoring schemes, the search, the sensitivity, and selectivity on biological examples.
  Altschul et al., 1997. See above.
  The original research paper on gapped and PSI‐BLAST. Both are significant improvements over earlier BLAST versions. Computational speed, increased sensitivity, and decreased selectivity are analyzed.
  Gish & States, 1993. See above.
  The original research paper about translated BLAST. The authors evaluate the advantages and pitfalls of this application when processing introns, frameshifts, and similar issues. Besides the theory, implications for statistical significance are illustrated using detailed examples.

Internet Resources

  https://ncbi.nlm.nih.gov/BLAST
  The NCBI BLAST Web site.
  http://repeatmasker.genome.washington.edu/cgi-bin/RepeatMasker
  The RepeatMasker Web site.
  http://www.ch.embnet.org/software/COILS_form.html
  Coiled coil predictions.