Quality Control Procedures for Genome‐Wide Association Studies

Stephen Turner1, Loren L. Armstrong2, Yuki Bradford1, Christopher S. Carlson3, Dana C. Crawford1, Andrew T. Crenshaw4, Mariza de Andrade5, Kimberly F. Doheny6, Jonathan L. Haines1, Geoffrey Hayes2, Gail Jarvik7, Lan Jiang1, Iftikhar J. Kullo8, Rongling Li9, Hua Ling6, Teri A. Manolio9, Martha Matsumoto5, Catherine A. McCarty10, Andrew N. McDavid3, Daniel B. Mirel4, Justin E. Paschall11, Elizabeth W. Pugh6, Luke V. Rasmussen10, Russell A. Wilke12, Rebecca L. Zuvich1, Marylyn D. Ritchie1

1 Center for Human Genetics Research, Department of Molecular Physiology & Biophysics, Vanderbilt University, Nashville, Tennessee
2 Division of Endocrinology, Metabolism, and Molecular Medicine, Feinberg School of Medicine, Northwestern University, Chicago, Illinois
3 Cancer Prevention, Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington
4 Genetic Analysis Platform and Program in Medical and Population Genetics, Broad Institute, Cambridge, Massachusetts
5 Division of Biostatistics and Informatics, Department of Health Sciences Research, Mayo Clinic College of Medicine, Rochester, Minnesota
6 Center for Inherited Disease Research, Johns Hopkins University, Baltimore, Maryland
7 Department of Genome Sciences, University of Washington, Seattle, Washington
8 Division of Cardiovascular Diseases, Department of Medicine, Mayo Clinic, Rochester, Minnesota
9 Office of Population Genomics, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland
10 Biomedical Informatics Research Center, Marshfield Clinic Research Foundation, Marshfield, Wisconsin
11 National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland
12 Division of Clinical Pharmacology, Department of Medicine, Vanderbilt University, Nashville, Tennessee

Publication Name: Current Protocols in Human Genetics
Unit Number: Unit 1.19
DOI: 10.1002/0471142905.hg0119s68
Online Posting Date: January 2011

Genome‐wide association studies (GWAS) are being conducted at an unprecedented rate in population‐based cohorts and have increased our understanding of the pathophysiology of complex disease. Regardless of context, the practical utility of this information will ultimately depend upon the quality of the original data. Quality control (QC) procedures for GWAS are computationally intensive, operationally challenging, and constantly evolving. Here we enumerate some of the challenges in QC of GWAS data and describe the approaches that the electronic MEdical Records and Genomics (eMERGE) network is using for quality assurance in GWAS data, thereby minimizing potential bias and error in GWAS results. We discuss common issues associated with QC of GWAS data, including data file formats, software packages for data manipulation and analysis, sex chromosome anomalies, sample identity, sample relatedness, population substructure, batch effects, and marker quality. We propose best practices and discuss areas of ongoing and future research.

Curr. Protoc. Hum. Genet. 68:1.19.1‐1.19.18 © 2011 by John Wiley & Sons, Inc.

Keywords: genome‐wide association studies; GWAS; quality control; QC; biobanks; electronic medical records; eMERGE

Table of Contents

  • Introduction
  • GWAS Data Format
  • Sample Quality
  • Marker Quality
  • Batch Effects
  • Evaluation of QC After Association Analysis
  • Future Directions
  • Acknowledgements
  • Literature Cited
  • Figures
  • Tables
