New technologies for analyzing three-dimensional chromosome organization genome-wide have revealed mechanisms by which chromosome communication is established. Using high-throughput techniques such as ChIP-sequencing of different histone modifications, whole-genome bisulfite sequencing, ATAC-sequencing, and DNase-Seq, many studies have discovered a large number of enhancers involved in gene regulation. Importantly, the analysis of active chromatin can uncover potential targets relevant for precision treatment of cancer. To associate enhancers with their target genes in the absence of sample-matched chromosome conformation data, several computational methods have been developed.
A widely used approach to associate enhancers with their target genes is to consider the closest genes along the linear DNA. For example, the R package ELMER uses 450K DNA methylation array data to first define enhancers based on hypomethylated CpGs and then predicts enhancer target genes by computing the correlation between DNA methylation and gene expression, restricting the analysis to the 10 closest genes upstream and downstream of the enhancer. Another example is TENET, an analytical approach that associates genome-wide expression changes of transcription factors with gain or loss of enhancer activity by correlating DNA methylation levels at enhancers with the gene expression of transcription factors. However, both tools require DNA methylation array data as input and restrict the correlation to the ‘closest genes’ or to transcription factors that regulate enhancers.
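The closest-gene strategy described above can be illustrated with a minimal sketch. This is not ELMER's implementation: the function name `closest_genes`, the use of a single transcription start site per gene, and the gene coordinates are assumptions made for illustration, and "upstream" here simply means a lower genomic coordinate, ignoring strand.

```python
from typing import Dict, List


def closest_genes(enhancer_pos: int, gene_tss: Dict[str, int], k: int = 10) -> Dict[str, List[str]]:
    """Return up to k nearest genes on each side of an enhancer.

    All coordinates are assumed to lie on the same chromosome.
    """
    # Genes at lower coordinates than the enhancer, nearest first.
    upstream = sorted(
        (g for g, pos in gene_tss.items() if pos < enhancer_pos),
        key=lambda g: enhancer_pos - gene_tss[g],
    )
    # Genes at the same or higher coordinates, nearest first.
    downstream = sorted(
        (g for g, pos in gene_tss.items() if pos >= enhancer_pos),
        key=lambda g: gene_tss[g] - enhancer_pos,
    )
    return {"upstream": upstream[:k], "downstream": downstream[:k]}


# Hypothetical TSS coordinates on one chromosome.
tss = {"geneA": 100, "geneB": 400, "geneC": 900, "geneD": 1500, "geneE": 2600}
candidates = closest_genes(1000, tss, k=2)
```

The candidate set depends only on linear distance, so a distant gene in the same regulatory neighborhood is excluded as soon as `k` nearer genes exist.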
The 11-zinc finger DNA-binding protein CCCTC-binding factor (CTCF) plays an important role in chromatin organization. To improve the identification of gene-enhancer interactions, information on CTCF binding sites can be leveraged. The PreSTIGE method employs this strategy by accessing CTCF ChIP-seq data derived from 13 cell types. Here, CTCF binding sites are considered as insulators separating enhancers from their target genes. This method is currently available as an online application. However, its functionality is limited to the available reference data, and each sample is analyzed independently.
A fundamental concept of chromatin organization is that of topologically associated domains (TADs). TADs are genomic segments characterized by frequent internal chromosome interactions and by insulation from adjacent TADs. It has been shown that mutations disturbing the integrity of TADs can lead to the activation of proto-oncogenes, causing tumor development [8, 9].
Moreover, InTAD requires a predefined set of TAD regions as input. Since approximately 60–80% of TADs remain stable across cell types, the package comes with a set of TADs derived from the IMR90 human fibroblast cell line, which we have used in previous studies [10–12]. However, to take into account cell-type specific TAD boundaries, other Hi-C data can also be integrated by providing the resulting TAD regions as input in BED format.
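As an illustration of the BED input mentioned above, the following sketch reads TAD regions from the first three BED columns (chromosome, 0-based start, exclusive end). It is not InTAD's own loader: the `Tad` type, the `parse_bed` function, and the example coordinates are hypothetical.

```python
from typing import List, NamedTuple


class Tad(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED convention)
    end: int    # exclusive (BED convention)


def parse_bed(text: str) -> List[Tad]:
    """Parse the first three columns of BED-formatted text into sorted TADs."""
    tads = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if end <= start:
            raise ValueError(f"invalid interval: {line}")
        tads.append(Tad(chrom, start, end))
    # Sort by chromosome name, then start, then end.
    return sorted(tads)


# Hypothetical TAD calls, e.g. derived from a cell-type specific Hi-C experiment.
example_bed = """\
# hypothetical TAD calls
chr1\t3000000\t3800000
chr1\t1000000\t2000000
chr2\t500000\t1500000
"""
tads = parse_bed(example_bed)
```

Chromosomes are sorted lexicographically in this sketch, so `chr10` would precede `chr2`; a real pipeline would likely use a genome-aware ordering.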
Various parameters control the further steps of the analysis workflow. Genes can optionally be filtered based on the analysis of their expression distribution or by selecting specific types of RNA. Enhancers and genes are then combined when their genomic coordinates are embedded in the same TAD. Since TAD boundaries have been shown to be sensitive to the analytical method applied and may vary across cell types, genes that do not fall into a TAD are assigned to the nearest TAD by default. Subsequently, correlations between all enhancer-gene pairs within the same TAD are computed using one of the supported methods: Pearson, Kendall, or Spearman correlation. In addition, adjusted p-values can be calculated to control the false discovery rate using the R/Bioconductor package qvalue. The final result table includes the computed correlation values, adjusted p-values, and Euclidean distances. The latter serve as an additional measure for identifying correlations affected by scale invariance, where two profiles correlate strongly despite differing markedly in magnitude.
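The workflow above can be sketched in a few lines. This is an illustrative simplification rather than the InTAD code: all function names and data are hypothetical, regions are assigned to TADs by their midpoint (an assumption), only Pearson correlation is shown, and p-value computation and q-value adjustment are omitted.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), end exclusive


def find_tad(region: Interval, tads: List[Interval], nearest: bool = False) -> Optional[int]:
    """Index of the TAD containing the region midpoint.

    With nearest=True, fall back to the closest TAD on the same chromosome.
    """
    chrom, start, end = region
    mid = (start + end) // 2
    best, best_gap = None, None
    for i, (t_chrom, t_start, t_end) in enumerate(tads):
        if t_chrom != chrom:
            continue
        if t_start <= mid < t_end:
            return i
        gap = t_start - mid if mid < t_start else mid - t_end + 1
        if nearest and (best_gap is None or gap < best_gap):
            best, best_gap = i, gap
    return best


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; NaN if either profile is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def correlate_within_tads(enhancers: Dict[str, Interval], genes: Dict[str, Interval],
                          signal: Dict[str, List[float]], expression: Dict[str, List[float]],
                          tads: List[Interval]) -> List[Tuple[str, str, float, float]]:
    """Return (enhancer, gene, correlation, Euclidean distance) per same-TAD pair."""
    # Genes outside any TAD are assigned to the nearest one; enhancers are not.
    gene_tad = {g: find_tad(r, tads, nearest=True) for g, r in genes.items()}
    results = []
    for e, e_region in enhancers.items():
        e_tad = find_tad(e_region, tads)
        if e_tad is None:
            continue
        for g, g_tad in gene_tad.items():
            if g_tad == e_tad:
                results.append((e, g, pearson(signal[e], expression[g]),
                                euclidean(signal[e], expression[g])))
    return results


# Hypothetical data: two TADs, two enhancers, three genes, four samples.
tads = [("chr1", 0, 1000), ("chr1", 1200, 2000)]
enhancers = {"enh1": ("chr1", 100, 200), "enh2": ("chr1", 1500, 1600)}
genes = {"geneA": ("chr1", 600, 800), "geneB": ("chr1", 1050, 1090), "geneC": ("chr1", 1700, 1900)}
signal = {"enh1": [1.0, 2.0, 3.0, 4.0], "enh2": [4.0, 3.0, 2.0, 1.0]}
expression = {"geneA": [10.0, 20.0, 30.0, 40.0], "geneB": [2.0, 1.0, 4.0, 3.0],
              "geneC": [1.0, 1.5, 2.5, 3.0]}
results = correlate_within_tads(enhancers, genes, signal, expression, tads)
```

In this toy data, `geneB` lies between the two TADs and is assigned to the first, which is nearer. The pair `enh1`/`geneA` is perfectly correlated yet has a large Euclidean distance because the two profiles differ tenfold in magnitude, which is the situation the distance column is meant to flag.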
The results can be visualized by simulated Hi-C maps highlighting significant correlations at selected genomic loci (Fig. 2b). Additionally, correlations between a selected gene and enhancer pair can be visualized with custom colors by providing annotations that reflect groups of samples (Fig. 2c).