From FastQ Data to High‐Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline

Geraldine A. Van der Auwera¹, Mauricio O. Carneiro¹, Christopher Hartl¹, Ryan Poplin¹, Guillermo del Angel¹, Ami Levy‐Moonshine¹, Tadeusz Jordan¹, Khalid Shakir¹, David Roazen¹, Joel Thibault¹, Eric Banks¹, Kiran V. Garimella², David Altshuler¹, Stacey Gabriel¹, Mark A. DePristo¹

¹ Broad Institute, Cambridge, Massachusetts; ² University of Oxford, Oxford
Publication Name:  Current Protocols in Bioinformatics
Unit Number:  Unit 11.10
DOI:  10.1002/0471250953.bi1110s43
Online Posting Date:  October, 2013
GO TO THE FULL TEXT: PDF or HTML at Wiley Online Library


This unit describes how to use BWA and the Genome Analysis Toolkit (GATK) to map genome sequencing data to a reference and produce high‐quality variant calls that can be used in downstream analyses. The complete workflow includes the core NGS data‐processing steps that are necessary to make the raw data suitable for analysis by the GATK, as well as the key methods involved in variant discovery using the GATK. Curr. Protoc. Bioinform. 43:11.10.1‐11.10.33. © 2013 by John Wiley & Sons, Inc.

Keywords: NGS; WGS; exome; variant detection; genotyping

Table of Contents

  • Introduction
  • Strategic Planning
  • Basic Protocol 1: From FASTQ to Analysis‐Ready BAM: Preparing the Sequence Data
  • Basic Protocol 2: From Analysis‐Ready BAM to Raw Variants: Calling Variants in Diploid Organisms with HaplotypeCaller
  • Basic Protocol 3: From Raw to Analysis‐Ready Variants: Variant Quality Score Recalibration
  • Alternate Protocol 1: From Analysis‐Ready BAM to Raw Variants: Calling Variants in Non‐Diploid Organisms with UnifiedGenotyper
  • Alternate Protocol 2: From Raw to Analysis‐Ready Variants: Hard Filtering Small Datasets
  • Support Protocol 1: Obtaining and Installing the Software Used in This Unit
  • Support Protocol 2: From BAM Back to FASTQ: Reprocessing Old Data
  • Support Protocol 3: Fixing Improperly Formatted BAM Files
  • Support Protocol 4: Adding Variant Annotations with VariantAnnotator
  • Acknowledgments
  • Literature Cited
  • Figures

Literature Cited

  1000 Genomes Project Consortium. 2010. A map of human genome variation from population‐scale sequencing. Nature 467:1061‐1073.
  DePristo, M.A., Banks, E., Poplin, R., Garimella, K.V., Maguire, J.R., Hartl, C., Philippakis, A.A., del Angel, G., Rivas, M.A., Hanna, M., McKenna, A., Fennell, T.J., Kernytsky, A.M., Sivachenko, A.Y., Cibulskis, K., Gabriel, S.B., Altshuler, D., and Daly, M.J. 2011. A framework for variation discovery and genotyping using next‐generation DNA sequencing data. Nat. Genet. 43:491‐498.
  Fisher, R.A. 1922. On the interpretation of χ² from contingency tables, and the calculation of P. J. R. Stat. Soc. 85:87‐94.
  International HapMap 3 Consortium, Altshuler, D.M., Gibbs, R.A., Peltonen, L., Altshuler, D.M., Gibbs, R.A., Peltonen, L., Dermitzakis, E., Schaffner, S.F., Yu, F., Chang, K., Hawes, A., Lewis, L.R., Ren, Y., Wheeler, D., Gibbs, R.A., Muzny, D.M., Barnes, C., Darvishi, K., Hurles, M., Korn, J.M., Kristiansson, K., Lee, C., McCarrol, S.A., Nemesh, J., Dermitzakis, E., Keinan, A., Montgomery, S.B., Pollack, S., Price, A.L., Soranzo, N., Bonnen, P.E., Gibbs, R.A., Gonzaga‐Jauregui, C., Keinan, A., Price, A.L., Yu, F., Anttila, V., Brodeur, W., Daly, M.J., Leslie, S., McVean, G., Moutsianas, L., Nguyen, H., Schaffner, S.F., Zhang, Q., Ghori, M.J., McGinnis, R., McLaren, W., Pollack, S., Price, A.L., Schaffner, S.F., Takeuchi, F., Grossman, S.R., Shlyakhter, I., Hostetter, E.B., Sabeti, P.C., Adebamowo, C.A., Foster, M.W., Gordon, D.R., Licinio, J., Manca, M.C., Marshall, P.A., Matsuda, I., Ngare, D., Wang, V.O., Reddy, D., Rotimi, C.N., Royal, C.D., Sharp, R.R., Zeng, C., Brooks, L.D., and McEwen, J.E. 2010. Integrating common and rare genetic variation in diverse human populations. Nature 467:52‐58.
  Li, H. and Durbin, R. 2010. Fast and accurate long‐read alignment with Burrows‐Wheeler transform. Bioinformatics (Oxford) 26:589‐595.
  Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., and 1000 Genome Project Data Processing Subgroup. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics (Oxford) 25:2078‐2079.
  Mann, H.B. and Whitney, D.R. 1947. On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Stat. 18:50‐60.
  McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., Garimella, K., Altshuler, D., Gabriel, S., Daly, M., and DePristo, M.A. 2010. The Genome Analysis Toolkit: A MapReduce framework for analyzing next‐generation DNA sequencing data. Genome Res. 20:1297‐1303.
  Mills, R.E., Luttig, C.T., Larkins, C.E., and Beauchamp, A. 2006. An initial map of insertion and deletion (INDEL) variation in the human genome. Genome Res. 16:1182‐1190.
  Sherry, S.T., Ward, M.H., Kholodov, M., Baker, J., Phan, L., Smigielski, E.M., and Sirotkin, K. 2001. dbSNP: The NCBI database of genetic variation. Nucleic Acids Res. 29:308‐311.
  Wickham, H. 2009. ggplot2: Elegant Graphics for Data Analysis. Springer, New York.