Data Analysis Pipeline for RNA‐seq Experiments: From Differential Expression to Cryptic Splicing

Hari Krishna Yalamanchili1, Ying‐Wooi Wan1, Zhandong Liu2

1 Bioinformatics Core, Jan and Dan Duncan Neurological Research Institute at Texas Children's Hospital, Houston, Texas
2 Department of Pediatrics‐Neurology, Baylor College of Medicine, Houston, Texas
Publication Name: Current Protocols in Bioinformatics
Unit Number: Unit 11.15
DOI: 10.1002/cpbi.33
Online Posting Date: September 2017
GO TO THE FULL TEXT: PDF or HTML at Wiley Online Library


RNA sequencing (RNA‐seq) is a high‐throughput technology that provides unique insights into the transcriptome. Its applications include quantifying genes and isoforms and detecting non‐coding RNA, alternative splicing, and splice junctions. A thorough understanding of the cellular system requires comprehending the entire transcriptome. Several RNA‐seq analysis pipelines have been proposed to date, but no single pipeline can capture the dynamics of the entire transcriptome. Here, we compile and present a robust and commonly used analytical pipeline covering the entire spectrum of transcriptome analysis, including quality checks, alignment of reads, differential gene/transcript expression analysis, discovery of cryptic splicing events, and visualization. The challenges, critical parameters, and possible downstream functional analysis pipelines associated with each step are highlighted and discussed. This unit provides a comprehensive understanding of a state‐of‐the‐art RNA‐seq analysis pipeline and a greater understanding of the transcriptome. © 2017 by John Wiley & Sons, Inc.

Keywords: RNA‐seq; differential gene expression; differential isoform usage; alternative splicing; cryptic splicing


Table of Contents

  • Introduction
  • Strategic Planning
  • Basic Protocol 1: Differential Gene Expression Analysis of RNA‐seq
  • Basic Protocol 2: Beyond DEGs: Differential Expression and Usage of Isoforms
  • Basic Protocol 3: Deep Into RNA‐seq: Cryptic Splicing
  • Guidelines for Understanding Results
  • Acknowledgments
  • Literature Cited
  • Figures
  • Tables




