cgpCaVEManWrapper: Simple Execution of CaVEMan in Order to Detect Somatic Single Nucleotide Variants in NGS Data

David Jones1, Keiran M. Raine1, Helen Davies1, Patrick S. Tarpey1, Adam P. Butler1, Jon W. Teague1, Serena Nik‐Zainal1, Peter J. Campbell1

1 Cancer Genome Project, Wellcome Trust Sanger Institute, Cambridge
Publication Name:  Current Protocols in Bioinformatics
Unit Number:  Unit 15.10
DOI:  10.1002/cpbi.20
Online Posting Date:  December, 2016
GO TO THE FULL TEXT: PDF or HTML at Wiley Online Library


CaVEMan is an expectation maximization–based somatic substitution‐detection algorithm written in C. The algorithm analyzes sequence data from a test sample, such as a tumor, relative to a reference normal sample from the same patient and to the reference genome. It performs a comparative analysis of the tumor and normal samples to derive a probabilistic estimate for putative somatic substitutions. When combined with a set of validated post‐hoc filters, CaVEMan generates a set of somatic substitution calls with high recall and positive predictive value. Here we provide instructions for using a wrapper script called cgpCaVEManWrapper, which runs the CaVEMan algorithm and the additional downstream post‐hoc filters. We describe both a simple one‐shot run of cgpCaVEManWrapper and a more in‐depth implementation suited to large‐scale compute farms. © 2016 by John Wiley & Sons, Inc.

Keywords: somatic; cancer; sequencing; SNV; substitution

Table of Contents

  • Introduction
  • Basic Protocol 1: Calling Substitutions via a Single Command for a Tumor/Normal Sample Pair
  • Alternate Protocol 1: Processing Other Sequencing Types
  • Support Protocol 1: Installation of cgpCaVEManWrapper and Dependencies
  • Alternate Protocol 2: Using cgpCaVEManWrapper With Compute Farm Infrastructure
  • Support Protocol 2: Static Reference File Generation
  • Support Protocol 3: ASCAT and Pindel Output File Manipulation
  • Commentary
  • Literature Cited
  • Figures
  • Tables


Basic Protocol 1: Calling Substitutions via a Single Command for a Tumor/Normal Sample Pair

  Necessary Resources
  • Each individual step has different hardware requirements and will need tuning for each sequencing type and species. The requirements described in Basic Protocol 1 serve as a good starting point.


