Integrative Analysis of Histone ChIP‐seq and RNA‐seq Data

Hans‐Ulrich Klein1, Martin Schäfer2

1 Institute of Medical Informatics, University of Münster, Münster, 2 Mathematical Institute, Heinrich Heine University Düsseldorf, Düsseldorf
Publication Name:  Current Protocols in Human Genetics
Unit Number:  Unit 20.3
DOI:  10.1002/cphg.17
Online Posting Date:  July, 2016
GO TO THE FULL TEXT: PDF or HTML at Wiley Online Library

Abstract

The R package epigenomix has been designed to detect differentially transcribed gene isoforms that also exhibit altered histone modifications at their respective genomic loci. The package provides methods to map histone ChIP‐seq profiles to isoforms and to estimate transcript abundances from RNA‐seq data. Based on the differences observed between case and control samples in the RNA‐seq and ChIP‐seq data, a correlation measure is calculated for each isoform. The distribution of this correlation measure is then modeled with a Bayesian mixture model to (i) reveal the relationship between the studied histone modification and transcriptional activity, and (ii) detect specific isoforms with differences in both transcription values and histone modifications. The method is designed for experiments with few or no replicates, and is superior to separate analyses of both data types in that setting. This unit illustrates the integrative analysis of ChIP‐seq and RNA‐seq data with epigenomix. © 2016 by John Wiley & Sons, Inc.
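
To make the idea of a per-isoform correlation measure concrete, the following minimal Python sketch computes, for each isoform, the product of the standardized case-minus-control differences in expression and in ChIP signal. This particular form is an assumption chosen for illustration, as are the function name and the toy values; the exact definition used by epigenomix should be taken from the package documentation.

```python
from statistics import mean, stdev


def correlation_scores(expr_diff, chip_diff):
    """Return one score per isoform: the product of the standardized
    expression difference and the standardized ChIP-seq difference.

    Illustrative assumption, not necessarily the epigenomix definition.
    """
    if len(expr_diff) != len(chip_diff):
        raise ValueError("both inputs need one value per isoform")
    mx, my = mean(expr_diff), mean(chip_diff)
    sx, sy = stdev(expr_diff), stdev(chip_diff)  # sample standard deviations
    return [((x - mx) / sx) * ((y - my) / sy)
            for x, y in zip(expr_diff, chip_diff)]


# Hypothetical case-minus-control differences for four isoforms.
expr_diff = [2.0, -2.0, 0.0, 0.0]
chip_diff = [1.0, -1.0, 0.0, 0.0]
scores = correlation_scores(expr_diff, chip_diff)
```

Under this definition, isoforms whose expression and histone signal deviate from the average in the same direction receive positive scores, unchanged isoforms score near zero, and the sum of all scores divided by n - 1 equals the Pearson correlation of the two difference vectors.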

Keywords: Bayesian mixture model; data integration; differential ChIP‐seq analysis; differential gene expression

     
 

Table of Contents

  • Introduction
  • Basic Protocol 1: Detecting Isoforms With Differences in RNA‐seq and ChIP‐seq Data
  • Support Protocol 1: Pre‐Processing of RNA‐seq Data
  • Support Protocol 2: Pre‐Processing of ChIP‐seq Data
  • Commentary
  • Literature Cited
  • Figures
  • Tables
     
 

Materials

Basic Protocol 1: Detecting Isoforms With Differences in RNA‐seq and ChIP‐seq Data

  Materials
  • Estimated transcript abundances obtained from RNA‐seq. Transcription values should be stored together with isoform annotation as an R object of class ExpressionSet. See Support Protocol 1 for a description of how to quantify transcript abundances from raw fastq files, obtain transcript annotation, and store the information in appropriate R data structures.
  • Aligned reads from ChIP‐seq experiments in bam file format. See Support Protocol 2 for a description of how to align reads and create bam files.
  • R programming environment with packages epigenomix (version ≥ 1.11.6), GenomicRanges, and GenomicAlignments. The packages depend on other packages which will be automatically installed along with these packages. The Bioconductor Web site (http://www.bioconductor.org) provides documentation about installing R and R/Bioconductor packages. All steps of this protocol are performed within the R environment and run on a standard desktop computer or laptop with ≥8 GB memory.

Support Protocol 1: Pre‐Processing of RNA‐seq Data

  Materials
  • Raw fastq files from RNA‐seq. The example data set consists of two single‐end samples available at the Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra; PrEC cells ID: SRR513107, LNCaP cells ID: SRR513108).
  • Reference transcriptome fasta file. The human transcriptome from Ensembl (version GRCh38) is used in the example. Reference transcriptomes for various species can be downloaded from the Ensembl ftp server (http://www.ensembl.org).
  • Installed version of the kallisto software. Kallisto can be downloaded at http://pachterlab.github.io/kallisto.
  • The R software, including the packages biomaRt and Biobase (http://www.bioconductor.org).
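
As a rough illustration of how these materials fit together, the sketch below assembles the two kallisto invocations that would typically be needed: building an index from the reference transcriptome and quantifying one single‐end sample. The helper name, file names, and the fragment length mean and standard deviation are hypothetical placeholders. kallisto requires fragment length values for single‐end reads, and the flags shown should be checked against the help output of the installed version.

```python
def kallisto_commands(transcriptome_fasta, fastq, index_path, out_dir,
                      frag_len_mean=200, frag_len_sd=20):
    """Build argument lists for 'kallisto index' and 'kallisto quant'.

    frag_len_mean and frag_len_sd are placeholder values; single-end
    quantification needs them because they cannot be estimated from
    unpaired reads.
    """
    index_cmd = ["kallisto", "index", "-i", index_path, transcriptome_fasta]
    quant_cmd = [
        "kallisto", "quant",
        "-i", index_path,          # index built by the first command
        "-o", out_dir,             # directory for the abundance estimates
        "--single",                # reads are single-end
        "-l", str(frag_len_mean),  # assumed mean fragment length
        "-s", str(frag_len_sd),    # assumed fragment length standard deviation
        fastq,
    ]
    return index_cmd, quant_cmd
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once kallisto is installed; the quantification command would be run once per sample, for example for SRR513107 and SRR513108.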

Support Protocol 2: Pre‐Processing of ChIP‐seq Data

  Materials
  • Raw fastq files from ChIP‐seq. The example data set consists of two single‐end anti‐H3K4me3 ChIP‐seq samples available at the Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra; PrEC cells ID: SRR513113, LNCaP cells ID: SRR513114)
  • Reference genome fasta file. The human genome version GRCh38 is used in the example. Fasta files for each chromosome can be downloaded from the Ensembl ftp server (http://www.ensembl.org). The files have to be concatenated to one large file in order to run BWA.
  • Installed version of the Burrows‐Wheeler Aligner (BWA). BWA can be downloaded at http://bio-bwa.sourceforge.net.
  • Picard tools together with a Java run time environment. The Picard tools can be downloaded at http://broadinstitute.github.io/picard.
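
The concatenation of per‐chromosome fasta files mentioned above can be sketched as follows. This is a minimal example that assumes the files have already been decompressed; the function name and any file names are hypothetical.

```python
def concatenate_fasta(chromosome_files, output_path):
    """Write all per-chromosome FASTA files into one file, in the given order."""
    with open(output_path, "wb") as out:
        for path in chromosome_files:
            ended_with_newline = True
            with open(path, "rb") as src:
                # Copy in 1 MiB chunks so whole chromosomes never sit in memory.
                while chunk := src.read(1 << 20):
                    out.write(chunk)
                    ended_with_newline = chunk.endswith(b"\n")
            # Keep the next '>' header on its own line even if the
            # previous file lacked a trailing newline.
            if not ended_with_newline:
                out.write(b"\n")
```

The resulting single file is what would then be indexed with BWA before the reads are aligned.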

Internet Resources
  http://www.bioconductor.org
  The Bioconductor Web site provides information on how to install R and the required Bioconductor packages. Detailed manuals and use cases are available for each Bioconductor package. The documentation of the packages epigenomix, GenomicRanges, and SummarizedExperiment might be particularly helpful.
  http://pachterlab.github.io/kallisto
  The kallisto software for RNA‐seq quantification and respective documentation can be downloaded from this site.
  http://bio-bwa.sourceforge.net
  The Burrows‐Wheeler Aligner (bwa) and respective documentation can be downloaded from this site.
  http://broadinstitute.github.io/picard
  The Picard tools for manipulating sam and bam files and respective documentation can be downloaded from this site.