Genotyping in the Cloud with Crossbow

James Gurtowski (1), Michael C. Schatz (1), Ben Langmead (2)

(1) Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York
(2) Department of Computer Science, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland
Publication Name: Current Protocols in Bioinformatics
Unit Number: Unit 15.3
DOI: 10.1002/0471250953.bi1503s39
Online Posting Date: September 2012

Crossbow is a scalable, portable, and automatic cloud computing tool for identifying SNPs from high‐coverage, short‐read resequencing data. It is built on Apache Hadoop, an implementation of the MapReduce software framework, which allows Crossbow to distribute read alignment and SNP calling subtasks over a cluster of commodity computers. Two robust tools, Bowtie and SOAPsnp, implement the fundamental alignment and variant calling operations, respectively. Running within Crossbow, they have been shown to analyze approximately one billion short reads per hour on a commodity Hadoop cluster with 320 cores. Through protocol examples, this unit demonstrates the use of Crossbow for identifying variants in three operating modes: on a Hadoop cluster, on a single computer, and on the Amazon Elastic MapReduce cloud computing service.

Curr. Protoc. Bioinform. 39:15.3.1‐15.3.15. © 2012 by John Wiley & Sons, Inc.
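
The division of labor described above, with alignment and SNP calling as separate distributed subtasks, can be pictured with a small MapReduce-style sketch. It is a toy illustration only: it does not use Hadoop, Bowtie, or SOAPsnp, and it assumes (as one plausible arrangement, not one stated in the abstract) that alignment runs in the map step and SNP calling in the reduce step. The reference string, the reads, the bin size, and every function name are hypothetical. The "aligner" is a brute-force Hamming-distance scan and the "caller" is a simple majority vote with a depth cutoff.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: a short reference and four 8-base reads.
REFERENCE = "ACGTTGCAAGGCTTACCGATGCATTCGA"
READS = ["ACGTTGCA", "GCAAGGCG", "AGGCGTAC", "GCGTACCG"]
BIN_SIZE = 10  # hypothetical width of one genomic partition, in bases


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_align(read, max_mismatches=1):
    """Map step: place one read on the reference (stand-in for a real aligner).

    Emits (bin, (reference position, base)) pairs so that all observations
    for one genomic partition reach the same reducer.
    """
    best_pos, best_dist = None, max_mismatches + 1
    for pos in range(len(REFERENCE) - len(read) + 1):
        dist = hamming(read, REFERENCE[pos:pos + len(read)])
        if dist < best_dist:
            best_pos, best_dist = pos, dist
    if best_pos is None:
        return  # read could not be placed within the mismatch budget
    for offset, base in enumerate(read):
        ref_pos = best_pos + offset
        yield ref_pos // BIN_SIZE, (ref_pos, base)


def shuffle(pairs):
    """Group mapped values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_call_snps(observations, min_depth=2):
    """Reduce step: per-position majority vote (stand-in for a real SNP caller)."""
    pileup = defaultdict(Counter)
    for ref_pos, base in observations:
        pileup[ref_pos][base] += 1
    for ref_pos in sorted(pileup):
        base, count = pileup[ref_pos].most_common(1)[0]
        if count >= min_depth and base != REFERENCE[ref_pos]:
            yield ref_pos, REFERENCE[ref_pos], base, count


def run(reads):
    """Chain map, shuffle, and reduce over all reads and collect the calls."""
    pairs = (pair for read in reads for pair in map_align(read))
    snps = []
    for _bin, observations in sorted(shuffle(pairs).items()):
        snps.extend(reduce_call_snps(observations))
    return snps
```

Calling `run(READS)` on the toy data reports a single candidate at 0-based position 12, where three reads carry G against a reference T. Because each map call sees one read and each reduce call sees one partition, both steps could in principle be spread across many machines, which is the property a MapReduce framework exploits.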

Keywords: short reads; read alignment; SNP calling; cloud computing; Hadoop; software package

Table of Contents

  • Introduction
  • Basic Protocol 1: Running Crossbow on a Local Hadoop Cluster
  • Basic Protocol 2: Running Crossbow in Single‐Computer Mode
  • Basic Protocol 3: Running Crossbow on Amazon Web Services via the Command Line
  • Alternate Protocol 1: Running Crossbow on Amazon Web Services via the Web Interface
  • Support Protocol 1: Obtaining and Installing Crossbow
  • Support Protocol 2: Preparing Manifest Files with Sequence Read Information
  • Support Protocol 3: Preparing Reference Jars with Reference Genome Information
  • Guidelines for Understanding Results
  • Commentary
  • Literature Cited
  • Figures

Internet Resources
  Web site where the latest version of the software as well as an extensive manual are available.
  Web site with the Hadoop documentation and software.
  Describes how to get started using Amazon Web Services, including the Elastic Compute Cloud (EC2) and the Simple Storage Service (S3).
