Quality Control Procedures for Genome‐Wide Association Studies

Stephen Turner1, Loren L. Armstrong2, Yuki Bradford1, Christopher S. Carlson3, Dana C. Crawford1, Andrew T. Crenshaw4, Mariza de Andrade5, Kimberly F. Doheny6, Jonathan L. Haines1, Geoffrey Hayes2, Gail Jarvik7, Lan Jiang1, Iftikhar J. Kullo8, Rongling Li9, Hua Ling6, Teri A. Manolio9, Martha Matsumoto5, Catherine A. McCarty10, Andrew N. McDavid3, Daniel B. Mirel4, Justin E. Paschall11, Elizabeth W. Pugh6, Luke V. Rasmussen10, Russell A. Wilke12, Rebecca L. Zuvich1, Marylyn D. Ritchie1

1 Center for Human Genetics Research, Department of Molecular Physiology & Biophysics, Vanderbilt University, Nashville, Tennessee, 2 Division of Endocrinology, Metabolism, and Molecular Medicine, Feinberg School of Medicine, Northwestern University, Chicago, Illinois, 3 Cancer Prevention, Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington, 4 Genetic Analysis Platform and Program in Medical and Population Genetics, Broad Institute, Cambridge, Massachusetts, 5 Division of Biostatistics and Informatics, Department of Health Sciences Research, Mayo Clinic College of Medicine, Rochester, Minnesota, 6 Center for Inherited Disease Research, Johns Hopkins University, Baltimore, Maryland, 7 Department of Genome Sciences, University of Washington, Seattle, Washington, 8 Division of Cardiovascular Diseases, Department of Medicine, Mayo Clinic, Rochester, Minnesota, 9 Office of Population Genomics, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, 10 Biomedical Informatics Research Center, Marshfield Clinic Research Foundation, Marshfield, Wisconsin, 11 National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, 12 Division of Clinical Pharmacology, Department of Medicine, Vanderbilt University, Nashville, Tennessee
Publication Name:  Current Protocols in Human Genetics
Unit Number:  Unit 1.19
DOI:  10.1002/0471142905.hg0119s68
Online Posting Date:  January, 2011
GO TO THE FULL TEXT: PDF or HTML at Wiley Online Library


Genome‐wide association studies (GWAS) are being conducted at an unprecedented rate in population‐based cohorts and have increased our understanding of the pathophysiology of complex disease. Regardless of context, the practical utility of this information will ultimately depend upon the quality of the original data. Quality control (QC) procedures for GWAS are computationally intensive, operationally challenging, and constantly evolving. Here we enumerate some of the challenges in QC of GWAS data and describe the approaches that the electronic MEdical Records and Genomics (eMERGE) network is using for quality assurance in GWAS data, thereby minimizing potential bias and error in GWAS results. We discuss common issues associated with QC of GWAS data, including data file formats, software packages for data manipulation and analysis, sex chromosome anomalies, sample identity, sample relatedness, population substructure, batch effects, and marker quality. We propose best practices and discuss areas of ongoing and future research.

Curr. Protoc. Hum. Genet. 68:1.19.1‐1.19.18 © 2011 by John Wiley & Sons, Inc.

Keywords: genome‐wide association studies; GWAS; quality control; QC; biobanks; electronic medical records; eMERGE

Table of Contents

  • Introduction
  • GWAS Data Format
  • Sample Quality
  • Marker Quality
  • Batch Effects
  • Evaluation of QC After Association Analysis
  • Future Directions
  • Acknowledgements
  • Literature Cited
  • Figures
  • Tables