InTAD: chromosome conformation guided analysis of enhancer target genes

Bioinformatics

New technologies for analyzing the three-dimensional chromosome organization in a genome-wide manner have revealed mechanisms by which chromosome communication is established [1]. Using high-throughput techniques such as ChIP-sequencing for different types of histone modifications, whole genome bisulfite sequencing, ATAC-sequencing, and DNase-Seq, many studies have discovered a large number of enhancers involved in gene regulation. Importantly, the analysis of active chromatin can uncover potential targets relevant for precision treatment of cancer [2]. To associate enhancers with their target genes in the absence of sample-matched chromosome conformation data, several computational methods have been developed.

A widely used approach to associate enhancers with their target genes is to consider the closest genes along the linear DNA. For example, the R package ELMER uses 450 K DNA methylation array data to first define enhancers based on hypo-methylated CpGs and then predicts enhancer target genes by computing the correlation between DNA methylation and gene expression restricting the analysis to the 10 closest genes up- and downstream of the enhancer [3]. Another example is TENET, an analytical approach that associates genome-wide expression changes of transcription factors with gain or loss in enhancer activities by correlating DNA methylation levels at enhancers with the gene expression of transcription factors [4]. However, both tools require DNA methylation array data as input and restrict the correlation to the ‘closest genes’ or to transcription factors that regulate enhancers.

The 11-zinc finger DNA-binding protein CCCTC-binding factor (CTCF) plays an important role in chromatin organization [5]. To improve the identification of gene-enhancer interactions, information on CTCF binding sites can be leveraged. The PreSTIGE method employs this strategy by accessing CTCF ChIP-seq data derived from 13 cell types [6]. Here, CTCF binding sites are considered as insulators separating enhancers from their target genes. This method is currently available as an online application. However, its functionality is limited to the available reference data, and each sample is analyzed independently.

A fundamental concept of chromatin organization is topologically associated domains (TADs). TADs are segments of the genome characterized by frequent chromosome interactions within themselves and they are insulated from adjacent TADs [7]. It has been shown that mutations disturbing the integrity of TADs can lead to the activation of proto-oncogenes causing tumor development [8, 9].

We have developed an R package, InTAD, that tests for significant correlations between genes and enhancers co-located in the same TAD (Fig. 1). Previously we employed this strategy to identify and validate enhancer-associated genes in different pediatric brain tumor types including medulloblastoma (n = 25 samples) [10], atypical teratoid/rhabdoid tumors (n = 11 samples) [11] and ependymoma (n = 24 samples) [12]. Importantly, InTAD is not restricted to specific data types and can detect enhancer-gene correlations in any cohort of samples analyzed by genome-wide gene expression and epigenetic profiling. While this approach cannot entirely compensate for the lack of condition-specific chromosome conformation data, it can predict proximal and distal enhancer target genes without limiting the analysis to the ‘closest gene’. The package is open-source and available at Bioconductor.

Fig. 1

Chromatin is organized in topologically associated domains (TADs). The InTAD software package tests for significant correlations between genes and enhancers restricted by TAD boundaries

Implementation

The structure of the InTAD package is outlined in Fig. 2a. InTAD requires three input data sets, two of which are a data matrix of epigenetic signals (e.g. normalized RPKM values at predefined enhancers derived from ChIP-seq data) and a gene expression matrix (e.g. normalized RPKM values from RNA-seq data). To identify enhancers and genes co-located in the same TAD, each data matrix has to contain the genomic coordinates of the enhancers or genes, respectively. The input data can be provided either as standard R objects, such as data frames, or as paths to text files in common formats for count tables and genomic annotations. The function that generates the central data object checks the input data for inconsistencies and provides various options, such as multi-core data processing to increase performance. As indicated in Fig. 2a, the analysis starts with the initialization of a MultiAssayExperiment R object [13].
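The kind of consistency check described above can be illustrated with a minimal sketch. It is written in plain Python rather than R and does not reproduce the InTAD API; the function name check_inputs, the dictionary layout, and the example values are all assumptions made for illustration only.

```python
def check_inputs(samples, matrix, coords, label):
    """Return a list of problems found in one data matrix (empty if consistent).

    samples: ordered sample names shared by both matrices (hypothetical layout)
    matrix:  feature name -> list of signal values, one per sample
    coords:  feature name -> (chromosome, start, end)
    """
    problems = []
    for feature, values in matrix.items():
        # Every feature needs exactly one value per sample.
        if len(values) != len(samples):
            problems.append(
                f"{label} {feature}: expected {len(samples)} values, got {len(values)}"
            )
        # Every feature needs genomic coordinates to be placed in a TAD.
        if feature not in coords:
            problems.append(f"{label} {feature}: missing genomic coordinates")
        else:
            _chrom, start, end = coords[feature]
            if start >= end:
                problems.append(f"{label} {feature}: start must be smaller than end")
    return problems


# Made-up example data: the gene matrix lacks one value and its coordinates.
samples = ["S1", "S2", "S3"]
enhancer_signals = {"enh1": [1.2, 0.4, 3.1]}
enhancer_coords = {"enh1": ("chr1", 1000, 2000)}
gene_expression = {"GENE_A": [5.0, 6.5]}
gene_coords = {}

problems = (
    check_inputs(samples, enhancer_signals, enhancer_coords, "enhancer")
    + check_inputs(samples, gene_expression, gene_coords, "gene")
)
for problem in problems:
    print(problem)
```

Collecting all problems before reporting them, instead of stopping at the first one, is a design choice of this sketch and not a statement about how the package behaves.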

Fig. 2

a Structure of the InTAD package. b Simulated Hi-C map based on correlations between enhancers (x-axis) and genes (y-axis). TAD boundaries are indicated as dashed boxes. Marked is EPHB2, a validated ependymoma oncogene that correlates significantly with proximal and distal enhancers. c The correlation plot reveals co-activation of EPHB2 and a distal enhancer element located 200 kbp away from the transcription start site. Both EPHB2 and the distal enhancer element are specifically expressed in ependymomas of the molecular subgroup ST-EPN-RELA

Moreover, InTAD requires a predefined set of TAD regions as input. Since approximately 60–80% of TADs remain stable across cell types [14], the package comes with a set of TADs derived from IMR90 human fibroblast cell lines [7], which we have accessed in previous studies [10–12]. However, to take into account cell-type specific TAD boundaries, other Hi-C data can also be integrated by providing the resulting TAD regions as input in BED format.
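To make the BED input concrete, the following minimal Python sketch reads TAD regions from BED-formatted text. It relies only on the general BED convention that the first three columns are chromosome, 0-based start, and end; the function name parse_bed and the example coordinates are hypothetical and are not taken from the package or from the IMR90 data set.

```python
def parse_bed(text):
    """Parse the first three BED columns into sorted (chrom, start, end) tuples."""
    regions = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if start >= end:
            raise ValueError(f"invalid region: {line}")
        regions.append((chrom, start, end))
    # Sort by chromosome name, then by start position.
    regions.sort()
    return regions


# Made-up TAD coordinates for illustration.
bed_text = """track name=exampleTADs
chr1\t500000\t1200000
chr1\t0\t500000
chr2\t100000\t900000
"""
tads = parse_bed(bed_text)
print(tads)
```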

Various parameters control the further steps of the analysis workflow. Genes can optionally be filtered based on the analysis of their expression distribution or by selecting specific types of RNA. Further, enhancers and genes are combined when their genomic coordinates are embedded in the same TAD. Since the boundaries of TADs have been shown to be sensitive to the analytical method applied and may vary across cell types, genes that do not fall into a TAD are assigned to the nearest TAD by default. Subsequently, correlations between all enhancer-gene pairs within the same TAD are computed using one of the supported methods: Pearson, Kendall or Spearman correlation. In addition, adjusted p-values can be calculated to control the false discovery rate using the R/Bioconductor package qvalue [15]. The final result table includes detailed information about the computed correlation values, adjusted p-values, and Euclidean distances as an additional measure that helps to identify potential correlations that suffer from scale invariance.
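The core of this workflow, combining enhancers and genes by TAD and correlating each pair, can be sketched in a few lines of Python. This is an illustration under stated assumptions and not the package's implementation: the function names, the tuple layout, the tie-breaking for the nearest TAD, and all values are hypothetical, only Pearson correlation is shown, and the p-value and q-value steps are omitted.

```python
from math import sqrt


def find_tad(tads, chrom, start, end, nearest=False):
    """Index of the TAD containing the interval, or the nearest one if requested."""
    best, best_gap = None, None
    for i, (t_chrom, t_start, t_end) in enumerate(tads):
        if t_chrom != chrom:
            continue
        if t_start <= start and end <= t_end:
            return i
        # Distance between the interval and the TAD (0 if they overlap).
        gap = max(t_start - end, start - t_end, 0)
        if nearest and (best_gap is None or gap < best_gap):
            best, best_gap = i, gap
    return best


def pearson(x, y):
    """Pearson correlation; returns NaN if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else float("nan")


def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def correlate_within_tads(tads, enhancers, genes):
    """Return (enhancer, gene, correlation, distance) for pairs sharing a TAD."""
    rows = []
    for g_name, (g_chrom, g_start, g_end, g_vals) in genes.items():
        # Genes outside any TAD fall back to the nearest TAD.
        g_tad = find_tad(tads, g_chrom, g_start, g_end, nearest=True)
        for e_name, (e_chrom, e_start, e_end, e_vals) in enhancers.items():
            e_tad = find_tad(tads, e_chrom, e_start, e_end)
            if g_tad is not None and g_tad == e_tad:
                rows.append((e_name, g_name, pearson(e_vals, g_vals),
                             euclidean(e_vals, g_vals)))
    return rows


# Made-up example: two TADs, two enhancers, two genes, four samples.
tads = [("chr1", 0, 500000), ("chr1", 600000, 1200000)]
enhancers = {
    "enh1": ("chr1", 100000, 101000, [1.0, 2.0, 3.0, 4.0]),
    "enh2": ("chr1", 700000, 701000, [4.0, 3.0, 2.0, 1.0]),
}
genes = {
    "GENE_A": ("chr1", 300000, 320000, [10.0, 20.0, 30.0, 40.0]),
    "GENE_B": ("chr1", 520000, 540000, [2.0, 4.0, 1.0, 3.0]),  # between TADs
}
rows = correlate_within_tads(tads, enhancers, genes)
for row in rows:
    print(row)
```

In this toy data, enh1 and GENE_A are perfectly correlated although their values differ by an order of magnitude, so the Euclidean distance is large. This is the kind of case in which reporting a distance next to the correlation can be informative.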

The results can be visualized by simulated Hi-C maps highlighting significant correlations at selected genomic loci (Fig. 2b). Additionally, correlations between a selected gene and enhancer pair can be visualized with custom colors by providing annotations that reflect groups of samples (Fig. 2c).
