Genotype Imputation in Genome‐Wide Association Studies

Eleonora Porcu1, Serena Sanna1, Christian Fuchsberger2, Lars G. Fritsche2

1 Istituto di Ricerca Genetica e Biomedica (IRGB), CNR, Monserrato, Cagliari, 2 Department of Biostatistics, Center for Statistical Genetics, University of Michigan School of Public Health, Ann Arbor, Michigan
Publication Name:  Current Protocols in Human Genetics
Unit Number:  Unit 1.25
DOI:  10.1002/0471142905.hg0125s78
Online Posting Date:  July, 2013
GO TO THE FULL TEXT: PDF or HTML at Wiley Online Library


Imputation is an in silico method that can increase the power of association studies by inferring missing genotypes, harmonizing data sets for meta‐analyses, and increasing the overall number of markers available for association testing. This unit provides an introductory overview of the imputation method and describes a two‐step imputation approach that consists of phasing the study genotypes and then imputing reference panel genotypes into the study haplotypes. Detailed steps for data preparation and quality control illustrate how to run the computationally intensive two‐step imputation with the high‐density reference panels of the 1000 Genomes Project, which currently integrates more than 39 million variants. Additionally, the influence of reference panel selection, input marker density, and imputation settings on imputation quality is demonstrated with a simulated data set to give insight into crucial points of successful genotype imputation. Curr. Protoc. Hum. Genet. 78:1.25.1‐1.25.14. © 2013 by John Wiley & Sons, Inc.

Keywords: genome‐wide association studies; imputation; linkage disequilibrium; inference; 1000 Genomes Project; HapMap Project; rare variants; genotyping arrays


Table of Contents

  • Introduction
  • Imputation Methods: Overview
  • Data Preparation
  • Step 1: Prephasing
  • Step 2: Imputation
  • Measuring Imputation Quality
  • Association Testing
  • Conclusions
  • Literature Cited
  • Figures
  • Tables





Literature Cited

   1000 Genomes Project Consortium. 2012. An integrated map of genetic variation from 1,092 human genomes. Nature 491:56‐65.
   Abecasis, G.R. and Wigginton, J.E. 2005. Handling marker‐marker linkage disequilibrium: Pedigree analysis with clustered markers. Am. J. Hum. Genet. 77:754‐767.
   Browning, B.L. and Browning, S.R. 2009. A unified approach to genotype imputation and haplotype‐phase inference for large data sets of trios and unrelated individuals. Am. J. Hum. Genet. 84:210‐223.
   de Bakker, P.I., Ferreira, M.A., Jia, X., Neale, B.M., Raychaudhuri, S., and Voight, B.F. 2008. Practical aspects of imputation‐driven meta‐analysis of genome‐wide association studies. Hum. Mol. Genet. 17:R122‐R128.
   Howie, B.N., Donnelly, P., and Marchini, J. 2009. A flexible and accurate genotype imputation method for the next generation of genome‐wide association studies. PLoS Genet. 5:e1000529.
   Howie, B.N., Fuchsberger, C., Stephens, M., Marchini, J., and Abecasis, G.R. 2012. Fast and accurate genotype imputation in genome‐wide association studies through pre‐phasing. Nat. Genet. 44:955‐959.
   Huang, L., Li, Y., Singleton, A.B., Hardy, J.A., Abecasis, G., Rosenberg, N.A., and Scheet, P. 2009. Genotype‐imputation accuracy across worldwide human populations. Am. J. Hum. Genet. 84:235‐250.
   International HapMap 3 Consortium. 2010. Integrating common and rare genetic variation in diverse human populations. Nature 467:52‐58.
   Klein, R.J., Zeiss, C., Chew, E.Y., Tsai, J.Y., Sackler, R.S., Haynes, C., Henning, A.K., SanGiovanni, J.P., Mane, S.M., Mayne, S.T., Bracken, M.B., Ferris, F.L., Ott, J., Barnstable, C., and Hoh, J. 2005. Complement factor H polymorphism in age‐related macular degeneration. Science 308:385‐389.
   Kong, A., Masson, G., Frigge, M.L., Gylfason, A., Zusmanovich, P., Thorleifsson, G., Olason, P.I., Ingason, A., Steinberg, S., Rafnar, T., Sulem, P., Mouy, M., Jonsson, F., Thorsteinsdottir, U., Gudbjartsson, D.F., Stefansson, H., and Stefansson, K. 2008. Detection of sharing by descent, long‐range phasing and haplotype imputation. Nat. Genet. 40:1068‐1075.
   Li, Y., Willer, C.J., Ding, J., Scheet, P., and Abecasis, G.R. 2010. MaCH: Using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet. Epidemiol. 34:816‐834.
   Liu, E.Y., Buyske, S., Aragaki, A.K., Peters, U., Boerwinkle, E., Carlson, C., Carty, C., Crawford, D.C., Haessler, J., Hindorff, L.A., Marchand, L.L., Manolio, T.A., Matise, T., Wang, W., Kooperberg, C., North, K.E., and Li, Y. 2012. Genotype imputation of Metabochip SNPs using a study‐specific reference panel of ~4,000 haplotypes in African Americans from the Women's Health Initiative. Genet. Epidemiol. 36:107‐117.
   Meschia, J.F., Nalls, M., Matarin, M., Brott, T.G., Brown, R.D. Jr., Hardy, J., Kissela, B., Rich, S.S., Singleton, A., Hernandez, D., Ferrucci, L., Pearce, K., Keller, M., Worrall, B.B., and Siblings With Ischemic Stroke Study Investigators. 2011. Siblings with ischemic stroke study: Results of a genome‐wide scan for stroke loci. Stroke 42:2726‐2732.
   Metzker, M.L. 2010. Sequencing technologies ‐ the next generation. Nat. Rev. Genet. 11:31‐46.
   O'Connell, J.R. and Weeks, D.E. 1998. PedCheck: A program for identification of genotype incompatibilities in linkage analysis. Am. J. Hum. Genet. 63:259‐266.
   Purcell, S., Neale, B., Todd‐Brown, K., Thomas, L., Ferreira, M.A., Bender, D., Maller, J., Sklar, P., de Bakker, P.I., Daly, M.J., and Sham, P.C. 2007. PLINK: A tool set for whole‐genome association and population‐based linkage analyses. Am. J. Hum. Genet. 81:559‐575.
   Sanna, S., Jackson, A.U., Nagaraja, R., Willer, C.J., Chen, W.M., Bonnycastle, L.L., Shen, H., Timpson, N., Lettre, G., Usala, G., Chines, P.S., Stringham, H.M., Scott, L.J., Dei, M., Lai, S., Albai, G., Crisponi, L., Naitza, S., Doheny, K.F., Pugh, E.W., Ben‐Shlomo, Y., Ebrahim, S., Lawlor, D.A., Bergman, R.N., Watanabe, R.M., Uda, M., Tuomilehto, J., Coresh, J., Hirschhorn, J.N., Shuldiner, A.R., Schlessinger, D., Collins, F.S., Davey Smith, G., Boerwinkle, E., Cao, A., Boehnke, M., Abecasis, G.R., and Mohlke, K.L. 2008. Common variants in the GDF5‐UQCC region are associated with variation in human height. Nat. Genet. 40:198‐203.
   Scheet, P. and Stephens, M. 2006. A fast and flexible statistical model for large‐scale population genotype data: Applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet. 78:629‐644.
   Scott, L.J., Mohlke, K.L., Bonnycastle, L.L., Willer, C.J., Li, Y., Duren, W.L., Erdos, M.R., Stringham, H.M., Chines, P.S., Jackson, A.U., Prokunina‐Olsson, L., Ding, C.J., Swift, A.J., Narisu, N., Hu, T., Pruim, R., Xiao, R., Li, X.Y., Conneely, K.N., Riebow, N.L., Sprau, A.G., Tong, M., White, P.P., Hetrick, K.N., Barnhart, M.W., Bark, C.W., Goldstein, J.L., Watkins, L., Xiang, F., Saramies, J., Buchanan, T.A., Watanabe, R.M., Valle, T.T., Kinnunen, L., Abecasis, G.R., Pugh, E.W., Doheny, K.F., Bergman, R.N., Tuomilehto, J., Collins, F.S., and Boehnke, M. 2007. A genome‐wide association study of type 2 diabetes in Finns detects multiple susceptibility variants. Science 316:1341‐1345.
   Southam, L., Panoutsopoulou, K., Rayner, N.W., Chapman, K., Durrant, C., Ferreira, T., Arden, N., Carr, A., Deloukas, P., Doherty, M., Loughlin, J., McCaskie, A., Ollier, W.E., Ralston, S., Spector, T.D., Valdes, A.M., Wallis, G.A., Wilkinson, J.M., arcOGEN Consortium, Marchini, J., and Zeggini, E. 2011. The effect of genome‐wide association scan quality control on imputation outcome for common variants. Eur. J. Hum. Genet. 19:610‐614.
   Su, Z., Marchini, J., and Donnelly, P. 2011. HAPGEN2: Simulation of multiple disease SNPs. Bioinformatics 27:2304‐2305.
   Turner, S., Armstrong, L.L., Bradford, Y., Carlson, C.S., Crawford, D.C., Crenshaw, A.T., de Andrade, M., Doheny, K.F., Haines, J.L., Hayes, G., Jarvik, G., Jiang, L., Kullo, I.J., Li, R., Ling, H., Manolio, T.A., Matsumoto, M., McCarty, C.A., McDavid, A.N., Mirel, D.B., Paschall, J.E., Pugh, E.W., Rasmussen, L.V., Wilke, R.A., Zuvich, R.L., and Ritchie, M.D. 2011. Quality control procedures for genome‐wide association studies. Curr. Protoc. Hum. Genet. 68:1.19.1‐1.19.18.
   Voight, B.F., Kang, H.M., Ding, J., Palmer, C.D., Sidore, C., Chines, P.S., Burtt, N.P., Fuchsberger, C., Li, Y., Erdmann, J., Frayling, T.M., Heid, I.M., Jackson, A.U., Johnson, T., Kilpelainen, T.O., Lindgren, C.M., Morris, A.P., Prokopenko, I., Randall, J.C., Saxena, R., Soranzo, N., Speliotes, E.K., Teslovich, T.M., Wheeler, E., Maguire, J., Parkin, M., Potter, S., Rayner, N.W., Robertson, N., Stirrups, K., Winckler, W., Sanna, S., Mulas, A., Nagaraja, R., Cucca, F., Barroso, I., Deloukas, P., Loos, R.J., Kathiresan, S., Munroe, P.B., Newton‐Cheh, C., Pfeufer, A., Samani, N.J., Schunkert, H., Hirschhorn, J.N., Altshuler, D., McCarthy, M.I., Abecasis, G.R., and Boehnke, M. 2012. The metabochip, a custom genotyping array for genetic studies of metabolic, cardiovascular, and anthropometric traits. PLoS Genet. 8:e1002793.
   Wetterstrand, K. 2013. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP).
   Wigginton, J.E. and Abecasis, G.R. 2005. PEDSTATS: Descriptive statistics, graphics and quality assessment for gene mapping data. Bioinformatics 21:3445‐3447.
Internet Resources
  Tutorial for the MACH 1.0 program for carrying out genotype imputation.
  Frequently asked questions about the MaCH program.
  Using the minimac program to carry out genotype imputation.
  The 1000 Genomes Imputation Cookbook contains detailed documentation and example scripts for the MaCH+minimac platform.
  The 1000 Genomes Imputation Cookbook contains detailed documentation and example scripts for the IMPUTE2 platform.
  The 1000 Genomes Project Web site.
  The HapMap Project Web site.
  ∼yunmli/software.html
  Web site for Li Group Software.
  HAPGEN software for simulating haplotypes.