Using XHMM Software to Detect Copy Number Variation in Whole‐Exome Sequencing Data

Menachem Fromer1, Shaun M. Purcell1

1 Analytic and Translational Genetics Unit, Psychiatric and Neurodevelopmental Genetics Unit, Massachusetts General Hospital, Boston, Massachusetts
Publication Name:  Current Protocols in Human Genetics
Unit Number:  Unit 7.23
DOI:  10.1002/0471142905.hg0723s81
Online Posting Date:  April, 2014
GO TO THE FULL TEXT: PDF or HTML at Wiley Online Library

Abstract

Copy number variation (CNV) has emerged as an important genetic component in human diseases, which are increasingly being studied for large numbers of samples by sequencing the coding regions of the genome, i.e., exome sequencing. Nonetheless, detecting this variation from such targeted sequencing data is a difficult task, involving sorting out signal from noise, for which we have recently developed a set of statistical and computational tools called XHMM. In this unit, we give detailed instructions on how to run XHMM and how to use the resulting CNV calls in biological analyses. Curr. Protoc. Hum. Genet. 81:7.23.1‐7.23.21. © 2014 by John Wiley & Sons, Inc.

Keywords: next‐generation sequencing (NGS); copy number variation (CNV); principal component analysis (PCA); data normalization; Hidden Markov Model (HMM)

     
 

Table of Contents

  • Introduction
  • Basic Protocol 1: Installation of XHMM, Depth of Coverage Calculation, Filtering and Normalization, and CNV Calling
  • Basic Protocol 2: Visualize Resulting CNVs Using R Scripts
  • Basic Protocol 3: Call De Novo CNVs Using Plink/SEQ
  • Basic Protocol 4: Compare XHMM CNVs to External CNV Call Set
  • Support Protocol 1: Convert XHMM CNV Calls to Plink Format
  • Support Protocol 2: Request Support from the XHMM Users Forum
  • Commentary
  • Literature Cited
  • Figures
  • Tables
     
 

Materials

Basic Protocol 1: Installation of XHMM, Depth of Coverage Calculation, Filtering and Normalization, and CNV Calling

  Necessary Resources
  • Installed versions of the LAPACK (http://www.netlib.org/lapack/) and pthread (https://computing.llnl.gov/tutorials/pthreads/) C libraries, accessible to the C++ compiler (i.e., listed in the appropriate path environment variables). On some systems, LAPACK may also require ATLAS and ACML to be installed. LAPACK efficiently performs the singular value decomposition (SVD) step of the principal component analysis (PCA) used to normalize the data. Pthread speeds up certain computations by running them on multiple parallel threads; multithreading is not yet highly developed in XHMM, because we have found the steps following the read depth calculations to be quite fast in practice, even for datasets of thousands of samples (see Commentary).
  • Installed copy of the Genome Analysis ToolKit (GATK; http://www.broadinstitute.org/gatk/download). It is assumed that GATK is installed in Sting/dist/GenomeAnalysisTK.jar.
  • For certain optional (but preferred) steps, it is also necessary to install the latest version of Plink/Seq (http://atgu.mgh.harvard.edu/plinkseq). Up-to-date code can be downloaded at https://bitbucket.org/statgen/plinkseq/get/master.zip.
  • The human reference sequence database file (seqdb) can be downloaded at http://atgu.mgh.harvard.edu/plinkseq/resources.shtml
  • The following user‐input files are required in a number of the following steps and so are listed here once for convenience:
    • Reference genome FASTA file and associated BWA index file (http://bio-bwa.sourceforge.net/bwa.shtml). In the examples here, we refer to this file as human_g1k_v37.fasta (which can be downloaded as part of the GATK resource bundle at http://gatkforums.broadinstitute.org/discussion/1213/what-s-in-the-resource-bundle-and-how-can-i-get-it).
    • List of exome targets, in the ‘interval_list’ GATK format (http://gatkforums.broadinstitute.org/discussion/1204/what-input-files-does-the-gatk-accept). We refer to this file as EXOME.interval_list. This file should contain non-overlapping, sorted intervals. As an example, two lines for chromosome 22 coding sequence exons are:
        22:16449425-16449804
        22:17071768-17071966
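
The requirement that the target list contain non-overlapping, sorted intervals can be checked before running any depth calculations. The following is a minimal sketch in Python, not part of XHMM or GATK. The function names parse_interval and check_intervals are hypothetical. It assumes each non-blank line has the form chrom:start-end with a plain ASCII hyphen and inclusive coordinates, and it checks ordering only within each chromosome, not the order of the chromosomes themselves.

```python
import re

# One "chrom:start-end" line, e.g. "22:16449425-16449804" (assumed line format).
INTERVAL_RE = re.compile(r"^(?P<chrom>[^:\s]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_interval(line):
    """Parse 'chrom:start-end' into (chrom, start, end) with integer coordinates."""
    match = INTERVAL_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a chrom:start-end interval: {line!r}")
    start, end = int(match["start"]), int(match["end"])
    if start > end:
        raise ValueError(f"start exceeds end: {line!r}")
    return match["chrom"], start, end


def check_intervals(lines):
    """Return a list of problem descriptions; an empty list means the checks passed.

    Within each run of lines for one chromosome, intervals must be sorted by
    start position and must not overlap (coordinates are treated as inclusive).
    """
    problems = []
    seen_chroms = set()
    current = None  # (chrom, last start, largest end seen so far in this run)
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        chrom, start, end = parse_interval(line)
        if current is not None and current[0] == chrom:
            if start < current[1]:
                problems.append(f"line {number}: not sorted by start position")
            elif start <= current[2]:
                problems.append(f"line {number}: overlaps an earlier interval")
            current = (chrom, start, max(end, current[2]))
        else:
            if chrom in seen_chroms:
                problems.append(
                    f"line {number}: chromosome {chrom} appears in separate blocks"
                )
            current = (chrom, start, end)
        seen_chroms.add(chrom)
    return problems
```

For the two example lines above, check_intervals(["22:16449425-16449804", "22:17071768-17071966"]) returns an empty list. To check a whole file, pass the open file object for EXOME.interval_list, since the function accepts any iterable of lines.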

Basic Protocol 2: Visualize Resulting CNVs Using R Scripts

  Necessary Resources
  • For visualization, we use the R statistical analysis software, which can be downloaded at http://www.r-project.org
  • Also, install the latest version of Plink/Seq (http://atgu.mgh.harvard.edu/plinkseq). Up‐to‐date code can be downloaded at https://bitbucket.org/statgen/plinkseq/get/master.zip

Basic Protocol 3: Call De Novo CNVs Using Plink/SEQ

  Necessary Resources
  • Plink software (download from http://pngu.mgh.harvard.edu/~purcell/plink/download.shtml)
