Finding Pathogenic Nucleic Acid Sequences in Next Generation Sequencing Data

Michael Parfenov1, J.G. Seidman1

1 Harvard Medical School, Boston, Massachusetts
Publication Name:  Current Protocols in Human Genetics
Unit Number:  Unit 18.9
DOI:  10.1002/0471142905.hg1809s86
Online Posting Date:  July, 2015
Viruses and bacteria are established as major causes of human diseases ranging from hepatitis to cancer. Recently, the presence of such pathogens has been studied extensively using human whole-genome and transcriptome sequencing data. However, detecting and studying pathogens in next-generation sequencing data is demanding in both time and computational resources. In this protocol, we give instructions for a simple and quick method to find pathogenic DNA or RNA and to detect possible integration of the pathogen genome into the host genome. © 2015 by John Wiley & Sons, Inc.

Keywords: next generation sequencing; pathogens; viruses; bacteria; integration; detection

Basic Protocol 1:

  • Unix operating system (for a beginner's guide, see Stein)
  • Burrows‐Wheeler Aligner (BWA) 0.5.9rc1 (r1561)
  • Samtools 0.1.19-44428cd
  • bamUtil
  • Perl v5.10.1
  • Integrative Genomics Viewer (IGV)
  • Analysis scripts written in Perl and C
  • Reference human genome hg19 in FASTA format:
  • A database of reference viral genomes is provided. To update it or to create a customized database, download reference genomes from the NCBI database:
    • Viral genomes:
    • Bacterial genomes:
  • Database of reference pathogen genomes in FASTA format
  • Paired‐end DNA sequencing data in FASTQ format (sample.01.fastq and sample.02.fastq)
NOTE: A computing cluster is recommended.
NOTE: Later or earlier versions of the BWA aligner should be tested.
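Because the protocol starts from paired-end FASTQ files, it can help to confirm that the two files are in sync before alignment. The following is a minimal sketch in Python, not part of the published protocol. It assumes four-line FASTQ records (no wrapped sequence lines) and read names that are identical between mates apart from an optional /1 or /2 suffix; the function names fastq_headers, read_id, and check_pairs are hypothetical.

```python
import gzip
from itertools import zip_longest


def fastq_headers(path):
    """Yield the header line of each record, assuming 4 lines per record."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            # Every fourth line starting at 0 is a header; skip blank trailers.
            if i % 4 == 0 and line.strip():
                yield line.rstrip("\n")


def read_id(header):
    """Return the read name without the leading '@' or a /1, /2 mate suffix."""
    fields = header[1:].split()
    name = fields[0] if fields else ""
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def check_pairs(path1, path2):
    """Return the number of read pairs; raise ValueError on the first mismatch."""
    n = 0
    pairs = zip_longest(fastq_headers(path1), fastq_headers(path2))
    for h1, h2 in pairs:
        if h1 is None or h2 is None:
            raise ValueError(f"files differ in length after {n} pairs")
        if read_id(h1) != read_id(h2):
            raise ValueError(f"read names differ at pair {n + 1}: {h1} vs {h2}")
        n += 1
    return n
```

With the input files named as in the materials list, calling check_pairs("sample.01.fastq", "sample.02.fastq") would return the number of read pairs, or raise an error at the first pair whose names disagree.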
Literature Cited

