RNA‐seq Data: Challenges in and Recommendations for Experimental Design and Analysis

Alexander G. Williams¹, Sean Thomas², Stacia K. Wyman¹, Alisha K. Holloway²

¹ Gladstone Institute of Cardiovascular Disease, San Francisco, California; ² University of California, San Francisco, California
Publication Name:  Current Protocols in Human Genetics
Unit Number:  Unit 11.13
DOI:  10.1002/0471142905.hg1113s83
Online Posting Date:  October, 2014


RNA‐seq is widely used to determine differential expression of genes or transcripts as well as to identify novel transcripts, identify allele‐specific expression, and precisely measure translation of transcripts. Thoughtful experimental design and choice of analysis tools are critical to ensure high‐quality data and interpretable results. Important considerations for experimental design include the number of replicates, whether to collect paired‐end or single‐end reads, sequence length, and sequencing depth. Analysis steps common to all RNA‐seq experiments include quality control, read alignment, assigning reads to genes or transcripts, and estimating gene or transcript abundance. Our aims are two‐fold: to make recommendations for common components of experimental design and to assess tool capabilities for each of these steps. We also test tools designed to detect differential expression, since this is the most widespread application of RNA‐seq. We hope that these analyses will help guide those who are new to RNA‐seq and will generate discussion about remaining needs for tool improvement and development. Curr. Protoc. Hum. Genet. 83:11.13.1‐11.13.20. © 2014 by John Wiley & Sons, Inc.

Keywords: RNA‐seq experimental design; biological replicates; sequence length; sequencing depth; splice‐aware alignment; paired‐end sequencing; transcript abundance; differential expression


Table of Contents

  • Introduction
  • Data Set
  • Technical Considerations and Quality Control
  • Factors Affecting Alignment and Tools for Splice‐Junction Mapping
  • Estimating Transcript Abundance
  • Detecting Differential Expression: Sequencing Depth, Biological Replicates, Normalization, and Estimating Variance
  • Summary Recommendations
  • Acknowledgements
  • Literature Cited
  • Figures
  • Tables

