Analyzing Copy Number Variation Using SNP Array Data: Protocols for Calling CNV and Association Tests

Chiao‐Feng Lin1, Adam C. Naj1, Li‐San Wang1

1 Perelman School of Medicine at the University of Pennsylvania, Philadelphia, Pennsylvania
Publication Name:  Current Protocols in Human Genetics
Unit Number:  Unit 1.27
DOI:  10.1002/0471142905.hg0127s79
Online Posting Date:  October, 2013
GO TO THE FULL TEXT: PDF or HTML at Wiley Online Library


High‐density SNP genotyping technology provides a low‐cost, effective tool for conducting Genome Wide Association (GWA) studies. The wide adoption of GWA studies has indeed led to discoveries of disease‐ or trait‐associated SNPs, some of which were subsequently shown to be causal. However, the nearly universal shortcoming of many GWA studies—missing heritability—has prompted great interest in searching for other types of genetic variation, such as copy number variation (CNV). Certain CNVs have been reported to alter disease susceptibility. Algorithms and tools have been developed to identify CNVs using SNP array hybridization intensity data. Such an approach provides an additional source of data with almost no extra cost. In this unit, we demonstrate the steps for calling CNVs from Illumina SNP array data using PennCNV and performing association analysis using R and PLINK. Curr. Protoc. Hum. Genet. 79:1.27.1‐1.27.15. © 2013 by John Wiley & Sons, Inc.

Keywords: copy number variations (CNV); CNV calling; genome‐wide association studies; SNP genotyping array; association study; burden analysis


Table of Contents

  • Introduction
  • Basic Protocol 1: Detect CNVs from Illumina Whole‐Genome Genotyping Array Data Using PennCNV
  • Basic Protocol 2: Use of R to Perform Association Tests for Common CNVs
  • Basic Protocol 3: Use of PLINK to Perform Burden Tests for Rare or Non‐Overlapping CNVs
  • Support Protocol 1: Visually Inspect CNVs on the UCSC Genome Browser
  • Commentary
  • Literature Cited
  • Figures


Basic Protocol 1: Detect CNVs from Illumina Whole‐Genome Genotyping Array Data Using PennCNV

  • Signal intensity data: LRR (Log R Ratio) and BAF (B Allele Frequency) of each individual and each probe
  • Additional input files for PennCNV as described in its manual: PFB (Population Frequency of B allele), HMM, and GCModel files
  • Linux environment with PennCNV installed: we assume the user has PennCNV installed or has the knowledge on how to obtain and install the software; more information is available on the PennCNV Web site (
  • GenomeStudio or BeadStudio (Illumina) for exporting signal intensity files from Illumina SNP array project files
  • We encourage the reader to browse the respective software package Web sites (provided at the end of this unit) for more details on hardware requirements. In general, modern PCs and Linux servers with 2 to 4 GB RAM should be sufficient for running the programs we use in this unit. Analysis of larger datasets (on the scale of thousands of subjects) may require more storage space and memory.
  • Illumina recommends running GenomeStudio (formerly BeadStudio) on a Windows (XP or later) computer with Intel Celeron Duo or faster 64‐bit CPU, at least 8 GB memory, at least 100 GB storage space, and 1280 × 1024 screen resolution for better viewing
  • PennCNV runs on Linux systems. Both source code and pre‐compiled executables are available. Instructions for installation on Windows systems with Cygwin or ActivePerl are provided on the PennCNV Web site. See Time Considerations for processing large datasets.
  • Precompiled executables for Windows and Linux systems are available for both R and PLINK

Basic Protocol 2: Use of R to Perform Association Tests for Common CNVs

  • Output file from the PennCNV software (see protocol 1) that contains all called CNVs
  • Individuals' case/control status and phenotypes or factors that may confound the relation between CNVs and the disease state; these pieces of information are equivalent to those of PLINK FAM and covariate files
  • Linux environment with R installed. We assume the user has installed R or has the knowledge on how to obtain and install the software from the R Web site (http://www.r‐ Comprehensive documentation is available there.
  • R script (penncnv2cnpr.r, which can be downloaded at
  • For hardware requirements, see protocol 1 materials list

Basic Protocol 3: Use of PLINK to Perform Burden Tests for Rare or Non‐Overlapping CNVs

  • Output file from the PennCNV software (see protocol 1) that contains all called CNVs
  • PLINK FAM file from the Genome‐Wide Association Study (GWAS) SNP data.
  • Optional files describing user‐specified genomic regions for burden tests. For example, a file containing the coordinates of all known genes on the human genome. Each row specifies one genomic region (chromosome, start, and end positions).
  • Linux environment with PLINK installed. We assume the user has installed PLINK or has the knowledge on how to obtain and install it (∼purcell/plink/). Comprehensive documentation is available there.
  • For hardware requirements, see protocol 1 materials list
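
To make the optional region file more concrete, the following Perl sketch counts, for each sample, the PennCNV calls that overlap at least one user-specified region. It is only an illustration under stated assumptions: the region rows are taken to be whitespace-delimited (chromosome, start, end), chromosome names are assumed to match those in the PennCNV output, the sample identifier is assumed to be the fifth field of each call line, and the function names are hypothetical. PLINK performs its own intersection of CNVs with regions; this is not its implementation.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse region rows "chromosome start end" (whitespace-delimited; assumed layout).
# Rows whose start/end are not integers (e.g., a header) are skipped.
sub parse_regions {
    my ($rows) = @_;
    my %by_chr;
    for my $row (@$rows) {
        my ($chr, $start, $end) = split ' ', $row;
        next unless defined $end && $start =~ /^\d+$/ && $end =~ /^\d+$/;
        push @{ $by_chr{$chr} }, [ $start, $end ];
    }
    return \%by_chr;
}

# Return 1 if the closed interval [start, end] on $chr overlaps any region, else 0.
sub overlaps_region {
    my ($regions, $chr, $start, $end) = @_;
    for my $r (@{ $regions->{$chr} || [] }) {
        return 1 if $start <= $r->[1] && $end >= $r->[0];
    }
    return 0;
}

# Count, per sample, the PennCNV calls that overlap at least one region.
sub count_overlapping_calls {
    my ($calls, $regions) = @_;
    my %count;
    for my $line (@$calls) {
        my @arr = split ' ', $line;
        next if @arr < 5;
        my ($chr, $start, $end) = $arr[0] =~ /^(\S+):(\d+)-(\d+)$/ or next;
        $count{ $arr[4] }++ if overlaps_region($regions, $chr, $start, $end);
    }
    return \%count;
}

# Made-up regions and calls, for demonstration only.
my $regions = parse_regions([ "chr1 1500 3000", "chr2 100 200" ]);
my @calls = (
    "chr1:1001-2000 numsnp=6 length=1,000 state2,cn=1 sampleA.txt startsnp=rs1 endsnp=rs2",
    "chr2:5001-9000 numsnp=9 length=4,000 state5,cn=3 sampleA.txt startsnp=rs3 endsnp=rs4",
    "chr1:2500-2600 numsnp=3 length=101 state1,cn=0 sampleB.txt startsnp=rs5 endsnp=rs6",
);
my $count = count_overlapping_calls(\@calls, $regions);
printf "%s\t%d\n", $_, $count->{$_} for sort keys %$count;
```

With the made-up data above, each of the two samples has exactly one call that overlaps a region. For a file of all known genes, a linear scan per call as used here would be slow; sorting the regions by chromosome and start position would be a natural refinement.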

Support Protocol 1: Visually Inspect CNVs on the UCSC Genome Browser

  • Output file from the PennCNV software (see protocol 1) that contains all called CNVs
  • A Web browser that is compatible with the UCSC Genome Browser.
  • Format CNV files into the BED format. A BED file is a tab‐delimited file that represents genomic features, such as genes or CNVs, as integer intervals, one interval per line, and describes how these intervals are to be displayed on the UCSC browser as a custom track (see unit 18.6). Only the first three fields (chromosome/scaffold name, start position, and end position), which describe the genomic location, are required, but the optional fields, such as name and strand, and the “track line” make the visualization more informative. Please refer to for more details.
    For this protocol, we put seven fields and a track line in one BED file. An example may look like this:
    track name=test1 description=test1 visibility=3 colorByStrand="255,0,0 0,0,255" useScore=0
    Although the track line may appear as two lines, it is in fact one single line. Setting both of the last two fields equal to the second one, i.e., the start position, makes the bars representing the CNVs thinner, so that more CNVs fit in a given space.
    For a small number of items, a generic text editor or Excel is sufficient to do the conversion manually. For a large number, however, a program is usually needed to do it efficiently and correctly. The following is an example Perl script that reads a PennCNV output file and converts it into a BED file. The script contrasts deletions with duplications by assigning a strand status to each CNV (‘+’ when CN < 2 and ‘−’ when CN > 2) and using the “colorByStrand” attribute in the track line. The two colors for the two strands are specified by RGB color codes and separated by a space. To visualize the contrast between cases and controls instead, the strand coding should encode disease status, and duplications and deletions should then be separated into two tracks:
    • #!/usr/bin/perl
    • use strict;
    • use warnings;
    • ## This script prints the output to STDOUT. Use redirect to output the results to a file.
    • # check if track name and input filename are provided
    • die "Usage: $0 trackname infile\n" if scalar @ARGV < 2;
    • my ($track, $infile) = @ARGV;
    • # print the track line
    • print "track name=$track description=$track visibility=3 colorByStrand=\"255,0,0 0,0,255\" useScore=0\n";
    • # open the input file and start processing line by line
    • open(my $fin, '<', $infile) or die "cannot open $infile: $!\n";
    • while (my $line = <$fin>) {
    • # split one line into fields (the delimiter can be one or multiple spaces)
    • my @arr = split(/\s+/, $line);
    • # further split the first field into chr and positions
    • my @ele = split(/[:-]/, $arr[0]);
    • # convert to 0-based position
    • my $start = $ele[1] - 1;
    • # split the copy number field (the fourth field of the line, e.g., "state2,cn=1")
    • my @cn = split(/[,=]/, $arr[3]);
    • # assign deletion (CN<2) to positive strand '+' and duplication '-'
    • printf("%s\t%d\t%d\t%s\t%s\t%d\t%d\n", $ele[0], $start, $ele[2], $arr[4], $cn[2] < 2 ? '+' : '-', $start, $start);
    • }
    • close $fin;
  • For hardware requirements, see protocol 1 materials list
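
The case/control variant mentioned above (strand encodes disease status, with deletions and duplications in separate tracks) could be sketched in Perl as follows. This is a minimal illustration, not the authors' script: the function names and the lookup from sample field to 'case' or 'control' are hypothetical, and the PennCNV line layout is assumed to be the one the script above parses. Unlike that script, this sketch also writes a placeholder score of 0 in the fifth column so that the strand falls in the sixth column, as in the standard BED field order; that choice is an assumption on our part.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse one PennCNV call line, e.g.
#   chr1:1001-2000 numsnp=6 length=1,000 state2,cn=1 sampleA.txt startsnp=rs1 endsnp=rs2
# Returns a hash ref, or nothing if the line does not look like a call.
sub parse_call {
    my ($line) = @_;
    my @arr = split ' ', $line;
    return if @arr < 5;
    my ($chr, $start, $end) = $arr[0] =~ /^(\S+):(\d+)-(\d+)$/ or return;
    my ($cn) = $arr[3] =~ /cn=(\d+)/ or return;
    # BED start positions are 0-based, so subtract 1 from the PennCNV start.
    return { chr => $chr, start => $start - 1, end => $end,
             cn => $cn, sample => $arr[4] };
}

# Build two BED tracks (deletions, then duplications) as one string.
# Strand encodes disease status: '+' for cases, '-' for all other known samples.
# %$status maps the sample field to 'case' or 'control' (hypothetical lookup).
sub build_tracks {
    my ($lines, $status) = @_;
    my %rows = (deletion => [], duplication => []);
    for my $line (@$lines) {
        my $c = parse_call($line) or next;
        next if $c->{cn} == 2;                        # copy-neutral, not a CNV
        my $grp = $status->{ $c->{sample} } or next;  # skip unknown samples
        my $strand = $grp eq 'case' ? '+' : '-';
        my $type   = $c->{cn} < 2 ? 'deletion' : 'duplication';
        # thickStart and thickEnd both equal the start, which keeps the bars thin
        push @{ $rows{$type} }, join("\t", $c->{chr}, $c->{start}, $c->{end},
            $c->{sample}, 0, $strand, $c->{start}, $c->{start});
    }
    my $out = '';
    for my $type (qw(deletion duplication)) {
        $out .= qq{track name=$type description=$type visibility=3 }
              . qq{colorByStrand="255,0,0 0,0,255" useScore=0\n};
        $out .= "$_\n" for @{ $rows{$type} };
    }
    return $out;
}

# Made-up calls and statuses, for demonstration only.
my @demo = (
    "chr1:1001-2000 numsnp=6 length=1,000 state2,cn=1 sampleA.txt startsnp=rs1 endsnp=rs2",
    "chr2:5001-9000 numsnp=9 length=4,000 state5,cn=3 sampleB.txt startsnp=rs3 endsnp=rs4",
);
my %demo_status = ('sampleA.txt' => 'case', 'sampleB.txt' => 'control');
print build_tracks(\@demo, \%demo_status);
```

The output contains a track line for deletions followed by its rows, then a track line for duplications followed by its rows, so the whole string can be saved as a single custom-track file. Within each track, the first colorByStrand color marks cases and the second marks controls.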



Internet Resources
  PennCNV Web site. Users can download the PennCNV source code, compile, and install on their own computers. The Web site also contains a wealth of information including program manual, annotation files, tutorials for the PennCNV software, and other useful tips such as visualization and quality control recommendations.
  R Web site. R is a free program for statistical computing and visualization. Users can download the compiled R package for their specific computing platforms. The Web site also lists URLs to the Comprehensive R Archive Network (CRAN). CRAN hosts user‐contributed packages that provide additional analysis capabilities.
  Illumina GenomeStudio Web site. The Web site contains instructions and FAQs for the GenomeStudio software, which is required to export SNP intensities from Illumina Chip projects for CNV calling. Illumina customers can obtain the software for free.
  PLINK Web site. PLINK is developed by Shaun Purcell at Harvard University. The free, open‐source program is widely used by the research community to process and analyze genome‐wide association studies (GWAS). Users can download the source code or obtain pre‐compiled binaries for installation from this Web site. This Web site also contains very detailed instructions on how to use the program.
  UCSC Genome Browser. Users can go to UCSC Genome Browser to download genomic annotations, or visualize CNV calls on the reference genome as outlined in the Support Protocol.
  List of Genetic variation databases. The Center for Human and Clinical Genetics at Leiden University Medical Center maintains a comprehensive list of genetic variation databases, including CNV databases.
  The Human Genome Structural Variation Project. This Web site, maintained by the Eichler lab at the University of Washington, provides a detailed map of CNVs and large structural variants.
  The Copy Number Variation (CNV) Project. The database is maintained by the Wellcome Trust Sanger Institute. It hosts CNVs identified through a variety of genotyping and hybridization approaches and provides extensive information on known CNV/phenotype associations.
  The Database of Genomic Variants. This database is maintained by the University of Toronto Centre for Applied Genomics. It is a comprehensive catalog of structural variants in the human genome, compiled from published reports on healthy controls. Its entries can be used as controls in studies that correlate CNVs with diseases and traits.
