An Introduction to the Informatics of “Next‐Generation” Sequencing

Lincoln D. Stein

Ontario Institute for Cancer Research, Toronto, Ontario, Canada
Publication Name:  Current Protocols in Bioinformatics
Unit Number:  Unit 11.1
DOI:  10.1002/0471250953.bi1101s36
Online Posting Date:  December, 2011

Abstract

Next‐generation sequencing (NGS) packs the sequencing throughput of a 2000s‐era genome center into a single affordable machine. However, software developed for conventional sequencing technologies is often ill suited to NGS data, which consist of very large numbers of short reads generated in parallel. This unit surveys the software packages available for managing and analyzing NGS data. Curr. Protoc. Bioinform. 36:11.1.1‐11.1.9. © 2011 by John Wiley & Sons, Inc.

Keywords: “next generation”; genome; sequencing; bioinformatics; analysis

     
 

Table of Contents

  • Introduction
  • Informatics Challenges
  • Next‐Generation Sequence Analysis Software
  • Next‐Generation Sequence Analysis in the Cloud
  • Literature Cited
     
 