Using the Seven Bridges Cancer Genomics Cloud to Access and Analyze Petabytes of Cancer Data

Raunaq Malhotra1, Isheeta Seth1, Erik Lehnert1, Jing Zhao1, Gaurav Kaushik1, Elizabeth H. Williams1, Anurag Sethi1, Brandi N. Davis‐Dusenbery1

1 Seven Bridges Genomics Inc., Cambridge, Massachusetts
Publication Name: Current Protocols in Bioinformatics
Unit Number: Unit 11.16
DOI: 10.1002/cpbi.39
Online Posting Date: December 2017

Abstract

Next‐generation sequencing has produced petabytes of data, but accessing and analyzing these data remain challenging. Traditionally, researchers investigating public datasets like The Cancer Genome Atlas (TCGA) would download the data to a high‐performance cluster, which could take several weeks even with a highly optimized network connection. The National Cancer Institute (NCI) initiated the Cancer Genomics Cloud Pilots program to provide researchers with the resources to process data with cloud computational resources. We present protocols using one of these Cloud Pilots, the Seven Bridges Cancer Genomics Cloud (CGC), to find and query public datasets, bring your own data to the CGC, analyze data using standard or custom workflows, and benchmark tools for accuracy with interactive analysis features. These protocols demonstrate that the CGC is a data‐analysis ecosystem that fully empowers researchers with a variety of areas of expertise and interests to collaborate in the analysis of petabytes of data. © 2017 by John Wiley & Sons, Inc.

Keywords: big data; cancer genomics; cloud computing; common workflow language; reproducible; scalable

     
 

Table of Contents

  • Introduction
  • Basic Protocol 1: Accessing TCGA Data on the CGC Platform
  • Support Protocol 1: Creating a Login for the CGC With/Without Controlled Data Access
  • Support Protocol 2: Create a Project and Collaborate Via the Graphical User Interface (GUI)
  • Alternate Protocol 1: Accessing Public Cancer Data Using the Datasets API
  • Basic Protocol 2: Scaling Analysis on the CGC Platform With Hundreds of Data Files
  • Support Protocol 3: Using the Workflow Editor to Run STAR‐Fusion With TCGA Files
  • Basic Protocol 3: Interactive Analysis of Results Using Data Cruncher
  • Basic Protocol 4: Analyzing Private Data on the CGC
  • Basic Protocol 5: Deploying Reproducible and Scalable Containerized Tools Implemented in CWL
  • Support Protocol 4: Creating a Docker Container for FunSeq2
  • Commentary
  • Literature Cited
  • Figures
  • Tables
     
 

Materials

Basic Protocol 1: Accessing TCGA Data on the CGC Platform

  Necessary Resources
  • A computer with Internet access and an up‐to‐date Internet browser (e.g., Firefox, Chrome, Safari)
  • An account on the Seven Bridges CGC (https://cgc.sbgenomics.com). All researchers can query the metadata. However, to access TCGA Controlled Data you must have permission from the Database of Genotypes and Phenotypes (dbGaP) through your eRA Commons or NIH Center for Information Technology (CIT) account (https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page=login). Note that these protocols were developed prior to the addition of Open Access raw sequencing data (e.g., CCLE data) to the Data Browser. Researchers who do not have access to TCGA Controlled Data can follow along with minor modifications using either their own data or other raw sequencing Open Data.

Support Protocol 1: Creating a Login for the CGC With/Without Controlled Data Access

  Necessary Resources
  • See protocol 1

Support Protocol 2: Create a Project and Collaborate Via the Graphical User Interface (GUI)

  Necessary Resources
  • See protocol 1

Alternate Protocol 1: Accessing Public Cancer Data Using the Datasets API

  Necessary Resources
    • A computer with Internet access and an up‐to‐date Internet browser (e.g., Firefox, Chrome, Safari)
    • An account on the Seven Bridges CGC (https://cgc.sbgenomics.com). All researchers can query the metadata. However, to access Controlled Data you must have permission from the Database of Genotypes and Phenotypes (dbGaP) through your eRA Commons or NIH Center for Information Technology (CIT) account (https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page=login). Note that these protocols were developed prior to the addition of Open Access raw sequencing data (e.g., CCLE data) to the Data Browser. Researchers who do not have access to TCGA Controlled Data can follow along with minor modifications using either their own data or other raw sequencing Open Data.
    • Installations of conda and of the Python bindings for the Seven Bridges API. The bindings can be installed using:
      • pip install sevenbridges-python
    • The Jupyter notebook corresponding to protocol 4 is available on GitHub at https://github.com/sbg-cancer-cloud/Current-Protocols-in-Bioinformatics-2017
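
  As a rough illustration of what a Datasets API query involves, the sketch below builds (but does not send) an authenticated POST request using only the Python standard library. It is not taken from the protocol or its notebook. The endpoint URL, the entity name `cases`, and the property name `hasDiseaseType` are assumptions that should be checked against the current CGC documentation, and `build_datasets_query` is a hypothetical helper name.

```python
import json
import os
import urllib.request

# Assumed endpoint for the TCGA Datasets API; verify against the CGC documentation.
DATASETS_QUERY_URL = "https://cgc-datasets-api.sbgenomics.com/datasets/tcga/v0/query"


def build_datasets_query(token, entity, filters):
    """Return an unsent POST request describing a metadata query.

    token   -- CGC authentication token (from the account's developer settings)
    entity  -- kind of record to search for, e.g. "cases" (assumed name)
    filters -- dict of property/value pairs that returned records must match
    """
    body = dict(filters)          # copy so the caller's dict is not modified
    body["entity"] = entity
    return urllib.request.Request(
        DATASETS_QUERY_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "X-SBG-Auth-Token": token,
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Only contacts the server when a token is supplied via the environment.
    token = os.environ.get("CGC_AUTH_TOKEN")
    if token:
        request = build_datasets_query(
            token, "cases", {"hasDiseaseType": "Breast Invasive Carcinoma"}
        )
        with urllib.request.urlopen(request) as response:
            print(json.load(response))
```

  Keeping request construction separate from sending makes the query easy to inspect before any network call is made. The sevenbridges-python bindings listed above wrap authentication and paging for the main platform API and are likely the more convenient route for project and file operations.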

Basic Protocol 2: Scaling Analysis on the CGC Platform With Hundreds of Data Files

  Necessary Resources
  • See protocol 1 and protocol 4
  • Also required: synthetic positive dataset (Note that we used the FusionMap dataset; Ge et al., 2011)

Support Protocol 3: Using the Workflow Editor to Run STAR‐Fusion With TCGA Files

  Necessary Resources
  • See protocol 4
  • Also required:
  • Personal AWS bucket with raw sequencing data
  • The Jupyter notebook corresponding to protocol 8 is available on GitHub at https://github.com/sbg-cancer-cloud/Current-Protocols-in-Bioinformatics-2017

Basic Protocol 3: Interactive Analysis of Results Using Data Cruncher

  Necessary Resources
  • A computer with Internet access and an up‐to‐date Internet browser (e.g., Firefox, Chrome, Safari)
  • An account on the CGC (https://cgc.sbgenomics.com)
  • Docker installed on a computer
  • The Dockerfile and CWL for the tools used in protocol 9 are available on GitHub at https://github.com/sbg-cancer-cloud/Current-Protocols-in-Bioinformatics-2017

Basic Protocol 4: Analyzing Private Data on the CGC

  Necessary Resources
  • See protocol 9

Literature Cited

   Amstutz, P., Crusoe, M. R., Tijanić, N., Chapman, B., Chilton, J., Heuer, M., … Stojanovic, L. (2016). Common Workflow Language, v1.0. https://doi.org/10.6084/m9.figshare.3115156.v2
   Barretina, J., Caponigro, G., Stransky, N., Venkatesan, K., Margolin, A. A., Kim, S., … Garraway, L. A. (2012). The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature, 483(7391), 603–607. doi: 10.1038/nature11003.
   Becnel, L. B., Pereira, S., Drummond, J. A., Gingras, M.‐C., Covington, K. R., Kovar, C. L., … Gibbs, R. A. (2016). An open access pilot freely sharing cancer genomic data from participants in Texas. Scientific Data, 3, 160010. doi: 10.1038/sdata.2016.10.
   Benelli, M., Pescucci, C., Marseglia, G., Severgnini, M., Torricelli, F., & Magi, A. (2012). Discovering chimeric transcripts in paired‐end RNA‐seq data by using EricScript. Bioinformatics, 28(24), 3232–3239. doi: 10.1093/bioinformatics/bts617.
   Cancer Genome Atlas Research Network, Weinstein, J. N., Collisson, E. A., Mills, G. B., Shaw, K. R. M., Ozenberger, B. A., … Stuart, J. M. (2013). The Cancer Genome Atlas Pan‐Cancer analysis project. Nature Genetics, 45(10), 1113–1120. doi: 10.1038/ng.2764.
   Fu, Y., Liu, Z., Lou, S., Bedford, J., Mu, X. J., Yip, K. Y., … Gerstein, M. (2014). FunSeq2: A framework for prioritizing noncoding regulatory variants in cancer. Genome Biology, 15(10), 480. doi: 10.1186/s13059-014-0480-5.
   Ge, H., Liu, K., Juan, T., Fang, F., Newman, M., & Hoeck, W. (2011). FusionMap: Detecting fusion genes from next‐generation sequencing data at base‐pair resolution. Bioinformatics, 27(14), 1922–1928. doi: 10.1093/bioinformatics/btr310.
   Koboldt, D. C., Zhang, Q., Larson, D. E., Shen, D., McLellan, M. D., Lin, L., … Wilson, R. K. (2012). VarScan 2: Somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Research, 22(3), 568–576. doi: 10.1101/gr.129684.111.
   Kumar, S., Vo, A. D., Qin, F., & Li, H. (2016). Comparative assessment of methods for the fusion transcripts detection from RNA‐Seq data. Scientific Reports, 6, 21597. doi: 10.1038/srep21597.
   Mallick, S., Li, H., Lipson, M., Mathieson, I., Gymrek, M., Racimo, F., … Reich, D. (2016). The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature, 538(7624), 201–206. doi: 10.1038/nature18964.
   Nicorici, D., Satalan, M., Edgren, H., Kangaspeska, S., Murumagi, A., Kallioniemi, O., … Kilkku, O. (2014 November 19). FusionCatcher ‐ a tool for finding somatic fusion genes in paired‐end RNA‐sequencing data. bioRxiv. https://doi.org/10.1101/011650
   Peng, R. D. (2011). Reproducible research in computational science. Science, 334(6060), 1226–1227. doi: 10.1126/science.1213847.
   Veeraraghavan, J., Ma, J., Hu, Y., & Wang, X.‐S. (2016). Recurrent and pathological gene fusions in breast cancer: Current advances in genomic discovery and clinical implications. Breast Cancer Research and Treatment, 158(2), 219–232. doi: 10.1007/s10549-016-3876-y.
   Yoshihara, K., Wang, Q., Torres‐Garcia, W., Zheng, S., Vegesna, R., Kim, H., & Verhaak, R. G. W. (2015). The landscape and therapeutic relevance of cancer‐associated transcript fusions. Oncogene, 34(37), 4845–4854. doi: 10.1038/onc.2014.406.
   Yung, C. K., O'Connor, B. D., Yakneen, S., Zhang, J., Ellrott, K., Kleinheinz, K., … PCAWG Technical Working Group (2017 July 10). Large‐scale uniform analysis of cancer whole genomes in multiple computing environments. bioRxiv. https://doi.org/10.1101/161638