Using SQL Databases for Sequence Similarity Searching and Analysis

William R. Pearson (1), Aaron J. Mackey (2)

(1) Department of Biochemistry and Molecular Genetics, University of Virginia, School of Medicine, Charlottesville
(2) Department of Public Health Sciences, University of Virginia, School of Medicine, Charlottesville
Publication Name:  Current Protocols in Bioinformatics
Unit Number:  Unit 9.4
DOI:  10.1002/cpbi.32
Online Posting Date:  September, 2017

Abstract

Relational databases can integrate diverse types of information and manage large sets of similarity search results, greatly simplifying genome‐scale analyses. By focusing on taxonomic subsets of sequences, relational databases can reduce the size and redundancy of sequence libraries and improve the statistical significance of homologs. In addition, by loading similarity search results into a relational database, it becomes possible to explore and summarize the relationships between all of the proteins in an organism and those in other biological kingdoms. This unit describes how to use relational databases to improve the efficiency of sequence similarity searching and demonstrates various large‐scale genomic analyses of homology‐related data. It also describes the installation and use of a simple protein sequence database, seqdb_demo, which is used as a basis for the other protocols. The unit also introduces search_demo, a database that stores sequence similarity search results. The search_demo database is then used to explore the evolutionary relationships between E. coli proteins and proteins in other organisms in a large‐scale comparative genomic analysis. © 2017 by John Wiley & Sons, Inc.

Keywords: relational database; sequence similarity; comparative genomic analysis; homology
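The two ideas in the abstract, extracting a taxonomic subset as a search library and querying stored search results by taxonomy, can be sketched as follows. This is a minimal illustration, not the unit's actual schema: the table names (taxon, protein, hit), column names, toy accessions, toy sequences, and the E-value cutoff of 1e-6 are all assumptions made for the example, and an in-memory SQLite database is used only to keep the sketch self-contained.

```python
import sqlite3

# Hypothetical schema for illustration; the real seqdb_demo and search_demo
# tables are defined in the full protocol and may differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE taxon   (taxon_id INTEGER PRIMARY KEY, name TEXT, kingdom TEXT);
CREATE TABLE protein (acc TEXT PRIMARY KEY,
                      taxon_id INTEGER REFERENCES taxon(taxon_id),
                      descr TEXT, seq TEXT);
CREATE TABLE hit     (query_acc TEXT, subject_acc TEXT, evalue REAL);
""")

# Toy rows (invented identifiers and sequences, not real database entries).
conn.executemany("INSERT INTO taxon VALUES (?, ?, ?)", [
    (1, "Escherichia coli", "Bacteria"),
    (2, "Homo sapiens", "Eukaryota"),
    (3, "Methanocaldococcus jannaschii", "Archaea"),
])
conn.executemany("INSERT INTO protein VALUES (?, ?, ?, ?)", [
    ("toy1", 1, "toy protein A", "MKTAYIAKQR"),
    ("toy2", 2, "toy protein B", "MGSSHHHHHH"),
    ("toy3", 3, "toy protein C", "MVLSPADKTN"),
])
conn.executemany("INSERT INTO hit VALUES (?, ?, ?)", [
    ("toy1", "toy2", 1e-20),   # significant hit
    ("toy1", "toy3", 0.5),     # not significant
])


def fasta_subset(conn, kingdom):
    """Return a FASTA-formatted library of all proteins from one kingdom."""
    rows = conn.execute(
        """SELECT p.acc, p.descr, t.name, p.seq
             FROM protein AS p
             JOIN taxon AS t ON t.taxon_id = p.taxon_id
            WHERE t.kingdom = ?
            ORDER BY p.acc""",
        (kingdom,),
    )
    return "".join(
        f">{acc} {descr} [{name}]\n{seq}\n" for acc, descr, name, seq in rows
    )


def kingdoms_with_homologs(conn, query_acc, max_evalue=1e-6):
    """List kingdoms containing at least one hit at or below the E-value cutoff."""
    rows = conn.execute(
        """SELECT DISTINCT t.kingdom
             FROM hit AS h
             JOIN protein AS p ON p.acc = h.subject_acc
             JOIN taxon AS t ON t.taxon_id = p.taxon_id
            WHERE h.query_acc = ? AND h.evalue <= ?
            ORDER BY t.kingdom""",
        (query_acc, max_evalue),
    )
    return [kingdom for (kingdom,) in rows]


library = fasta_subset(conn, "Bacteria")
homolog_kingdoms = kingdoms_with_homologs(conn, "toy1")
print(library, end="")
print(homolog_kingdoms)
```

The first query writes only the sequences from one kingdom, which is how a smaller, less redundant search library would be produced. The second joins stored hits back to the taxonomy table, the same pattern a genome-scale summary would use with an aggregate in place of the single-query filter.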

     
 

Table of Contents

  • Introduction
  • Basic Protocol 1: Installing and Populating the seqdb_demo Relational Database
  • Basic Protocol 2: Extracting Sequences from seqdb_demo for Similarity Searching to Improve Homology Detection
  • Basic Protocol 3: Storing Similarity Search Results in search_demo
  • Basic Protocol 4: Analyzing Similarity Search Results: Identifying Ancient Proteins
  • Basic Protocol 5: Analyzing Similarity Search Results: Taxonomic Groupings
  • Commentary
  • Literature Cited
  • Figures
  • Tables
     
 

Literature Cited

 
  Celko, J. (1999). Joe Celko's SQL for Smarties. San Francisco: Morgan Kaufmann.

Internet Resources

  ftp://ftp.ncbi.nih.gov/pub/blast/db/FASTA/nr.gz
  Comprehensive nr database (flat file protein sequence database).

  http://doi.org/10.5281/zenodo.377027
  William R. Pearson. (2017). CPBI_seqdb_demo sample QFO sequence library [Data set]. Zenodo.