HiCcompare: an R-package for joint normalization and comparison of HI-C datasets

Bioinformatics

Hi-C data representation and properties

HiCcompare focuses on the joint analysis of multiple Hi-C datasets represented by chromatin interaction matrices, where rows and columns represent genomic regions (bins), and cells contain interaction counts (frequencies). A chromosome-specific Hi-C matrix is a square matrix of size N × N, where N is the number of genomic regions (bins) of size X on a chromosome. The size X of the genomic regions defines the resolution of the Hi-C data. Each cell in the matrix contains an interaction frequency IFi, j, where i and j are the indices of the interacting regions. The values on the diagonal trace represent interaction frequencies (IFs) of self-interacting regions. Each off-diagonal trace of values represents interaction frequencies for a pair of regions at a given unit-length distance. The unit-length distance is expressed in terms of resolution of the data (the size of genomic regions, typically measured in millions (thousands) of base pairs, MB (KB)). The concept of considering interaction frequencies at each off-diagonal trace is central for the joint normalization and differential chromatin interaction detection (Fig. 2).

Fig. 2

Distance-centric (off-diagonal) view of chromatin interaction matrices. Each off-diagonal vector of interaction frequencies represents interactions at a given distance between pairs of regions. Triangles mark pairs of genomic regions interacting at the same distance. Data for chromosome 1, K562 cell line, 50 KB resolution, spanning 0–7.5 Mb is shown

The interaction frequency drops as the distance between interacting regions increases. Numerous attempts have been made to parametrically model the inverse relationship between chromatin interaction frequency and the distance between interacting regions. However, Hi-C data are affected by technology- and DNA sequence-driven biases [1315], unpredictably altering chromatin interaction frequencies. Consequently, parametric approaches fail to model interaction frequencies across the full range of distances [12], confirmed by our observations (Additional file 1: Figure 2.1). For this study, data in the sparse upper triangular format from the GM12878, K562, and RWPE1 cell lines were used (Supplemental Methods, Additional file 1).

It is also important to note that HiCcompare is designed to analyze pre-processed Hi-C data, unlike many other tools which require the user to deal with the raw sequencing data. There are a growing number of Hi-C libraries, already processed into matrix format, available for download on many public repositories such as GEO. HiCcompare is specifically designed to make it easy for the user to perform their own analyses on these pre-processed Hi-C matrices.

Visualization of the differences between two Hi-C datasets

The first step of the HiCcompare procedure is to convert the data into what we refer to as an MD plot. The MD plot is similar to the MA plot (Bland-Altman plot) commonly used to visualize gene expression differences [22]. M is defined as the log difference between the two data sets M = log2(IF2/IF1), where IF1 and IF2 are interaction frequencies of the first and the second Hi-C datasets, respectively. D is defined as the distance between two interacting regions, expressed in unit-length of the X resolution of the Hi-C data. In terms of chromatin interaction matrices, D corresponds to the off-diagonal traces of interaction frequencies (Fig. 2). Because chromatin interaction matrices are sparse, i.e., contain an excess of zero interaction frequencies, and it cannot be determined if a zero IF represents missing data or a true absence of interaction, by default only the non-zero pairwise interaction are used for the construction of the MD plot. However, if the user wishes to include partial zero interactions, i.e. with a zero value in one of the matrices and a non-zero IF in the other the option is available.

Elimination of biases in jointly, but not individually, normalized Hi-C data

Discovery of biases in Hi-C data led to the development of numerous methods for normalizing individual datasets [6, 1416]. Although normalization of individual datasets improves reproducibility of replicated Hi-C data [13, 15], these methods focus on correcting biological and internal biases and do not explicitly account for biases between multiple Hi-C datasets. When the goal is to compare two Hi-C libraries it can be assumed that many of these internal and biological biases affect both libraries similarly and thus their correction is less important. It is the between-dataset biases that are particularly problematic when comparing Hi-C datasets between biological conditions (Section 4, Additional file 1). To detect chromatin interaction differences due to biology, not biases, it is critical to use a normalization method that removes the between-dataset biases.

To assess the between-dataset biases, we visualize two Hi-C datasets on a single MD plot. Visualizing replicates of Hi-C data (Gm12878 cell line) showed the presence of biases in the individually normalized datasets (Fig. 3 and Section 4, Additional file 1), suggesting that the performance of individual normalization methods may be sub-optimal when comparing multiple Hi-C datasets.

Fig. 3

MD plot data visualization and the effects of different normalization techniques. MD plots of the differences M between two replicated Hi-C datasets (GM12878 cell line, chromosome 11, 1 MB resolution, DpnII and MboI restriction enzymes) plotted vs. distance D between interacting regions. a Before normalization, b after loess joint normalization, c ChromoR, d Iterative Correction and Eigenvector decomposition (ICE), e Knight-Ruiz (KR), f Sequential Component Normalization (SCN). The general shift of the data above M = 0 is due to one of the Hi-C libraries having more total reads. The trends emphasized by the loess curve imposed on the data are due to distance dependent between-dataset biases which only HiCcompare’s joint normalization procedure can successfully remove

To account for between-dataset biases, we developed a non-parametric joint normalization method that makes no assumptions about the theoretical distribution of the chromatin interaction frequencies. It utilizes the well-known loess (locally weighted polynomial regression) smoothing algorithm – a regression-based method for fitting simple models to segments of data [23]. The main advantage of loess is that it accounts for any local irregularities between the datasets that cannot be modeled by parametric methods. Thus, loess is particularly appealing when normalizing two Hi-C datasets, as the internal biases in Hi-C data are poorly understood (Fig. 3).

The HiCcompare joint normalization procedure proceeds by first plotting the data on the MD plot, then loess regression [23] is performed with D as the predictor for M. The fitted values are then used to normalize the original IFs:

$$ left{begin{array}{l} lo{g}_2left({hat{IF}}_{1D}right)= lo{g}_2left(I{F}_{1D}right)+f(D)/2\ {} lo{g}_2left({hat{IF}}_{2D}right)= lo{g}_2left(I{F}_{2D}right)-f(D)/2end{array}right. $$

where f(D) is the predicted value from the loess regression at a distance D. The ( lo{g}_2left(hat{IF}right) ) values are then anti-logged to obtain the normalized IFs. Note that for both Hi-C datasets the average interaction frequency remains unchanged, as IF1 is increased by the factor of f(D)/2 while IF2 is decreased by the same amount. Any normalized IFs with values less than one are not considered in further analyses. The joint normalization was tested against five methods for normalizing individual Hi-C matrices, ChromoR [24], ICE [15], KR [16], SCN [14], MA [25] (Supplemental Methods, Additional file 1).

Existing Hi-C data at high resolutions (e.g., 10 kb) still suffer from a limited dynamic range of chromatin interaction frequencies, with the majority of them being small or zero, especially at large distances between interacting regions. This sparsity places limits on loess joint normalization, as it builds a rescaling model from many non-zero pairwise comparisons. A way to alleviate this limitation is to consider interactions only within a range of short interaction distances, where genomic regions interact more frequently, and the proportion of zero interaction frequencies is the lowest. Our evaluation of loess joint normalization showed it performs best at resolutions between 1 MB and 50 KB (Section 4 & Section 7, Additional file 1). The issue of sparsity limiting the usefulness of loess normalization will be alleviated as sequencing techniques continue to improve and Hi-C datasets with deeper sequencing become available.

Excluding potentially problematic regions from the joint normalization

Some between-dataset biases may occur due to large-scale genomic rearrangements and copy number variants (CNVs), a frequent case in tumor-normal comparisons [18]. Similar to removing other biases, the joint loess normalization removes CNV-driven biases by design, allowing for the detection of chromatin interaction differences within CNV regions. However, CNVs introduce large changes in chromatin interactions [17], which may be of interest to consider separately. Therefore, unless cells/tissues with normal karyotypes are compared, we provide optional functionality for the detection and removal of genomic regions containing CNVs from the joint normalization. The QDNAseq [26] R package is used to detect and exclude CNVs from the HiCcompare analysis. Alternatively, CNV regions can be detected separately and provided to HiCcompare as a BED file. Additionally, the HiCcompare package includes the ENCODE blacklisted regions for hg19 and hg38 genome assemblies, which can be excluded from further analysis.

Detecting differential chromatin interactions

After joint normalization, the chromatin interaction matrices are ready to be compared for differences. Again, the MD plot is used to represent the differences M between two normalized datasets at a distance D. The jointly normalized M values are centered around 0 and are approximately normally distributed across all distances (Supplemental Methods, Additional file 1). M values can be converted to Z-scores using the standard approach:

$$ {Z}_i=frac{M_i-overline{M}}{sigma_M} $$

where ( overline{M} ) is the mean value of all M’s on the chromosome and σM is the standard deviation of all M values on the chromosome and i is the ith interacting pair on the chromosome.

During Z-score conversion, the average expression of each interacting pair is considered. Due to the nature of M, a difference represented by an interacting pair with IFs 1 and 10 is equivalent to an interacting pair of IFs 10 and 100 with both differences producing an M value of 3.32. However, the average expression of these two differences is 5.5 and 55, respectively. Differences with higher average expression are supported by the larger number of sequencing reads and are therefore more trustworthy than the low average expression differences. Thus, we filter out differences with low average expression by setting the Z-scores to 0 when average expression (A) is less than a user set value of A (Supplemental Methods, Additional file 1). Filtering takes place such that the ( overline{M} ) and σM are calculated using only the M values remaining after filtering. The Z-scores can then be converted to p-values using the standard normal distribution.

Analyzing Hi-C data for differences necessarily involves testing of multiple hypotheses. Multiple testing correction (False Discovery Rate (FDR)) is applied on a per-distance basis by default, with an option to apply it on a chromosomal basis. If a method other than FDR is desired, all other standard multiple testing corrections are available for the user to choose from.

As there is no “gold standard” for differential chromatin interactions, we created such a priori known differences by introducing controlled changes to replicate Hi-C datasets [27]. To introduce these a priori known differences, we start with two replicates of Hi-C data from the same cell type. It is assumed that any differences in these replicates are due to noise or technical biases. Next, we randomly sample a specified number of entries in the contact matrix. These sampled entries are where the changes will be introduced. The IFs for each of these entries in the two matrices are set to their average value between the replicates, and then one of them is multiplied by a specified fold change. This introduces a true difference at an exact fold change between the two replicates. The benefit of using joint normalization vs. individually normalized datasets was quantified by the improvement in power of detecting the pre-defined chromatin interaction differences. Standard classifier performance measures (Section “Availability and requirements”, Additional file 1), summarized in the Matthews Correlation Coefficient (MCC) metric, were assessed. HiCcompare is able to detect most of the added differences with a relatively low number of false positives across the range of fold changes (Table 1, Section “Availability and requirements”, Additional file 1).

Table 1

Evaluation of the effect of normalization on differential chromatin interaction detection

2

0.847

0.823

0.835

0.768

0.748

0.149

3

0.973

0.934

0.802

0.721

0.764

0.380

4

0.995

0.98

0.953

0.881

0.868

0.532

Differential regions overlap with CTCF sites

We hypothesized that regions, detected as differentially interacting, most likely represent biologically relevant boundaries of topologically associated domains changing between two conditions. As such, we investigated whether differentially interacting regions are enriched in CTCF binding sites, an insulator protein known to bind at TAD boundaries [28]. To test that, we compared Hi-C data from GM12878 and K562 cell lines at 100 MB resolution using HiCcompare. A total of 2365 interactions were identified as interacting differentially (FDR < 0.05) which represented 2783 distinct 100 KB genomic regions. We found that a total of 130,675 CTCF binding sites overlapped with these regions. The amount of overlaps observed was significant (permutation p-value = 0.002), confirming our hypothesis that the differentially interacting regions detected by HiCcompare play an import biological role in chromatin structural organization.

Example HiCcompare analysis using mouse neuronal differentiation

As an example case for the usage of HiCcompare, we performed an analysis to compare the 3D structure of the chromatin between mouse embryonic stem cells (ESC), neural progenitor cells (NPC), and neurons. The data was obtained from a study by Fraser et al. [29] deposited on GEO [GSE59027]. The Hi-C matrices for each cell type were downloaded at 100 KB resolution and read into HiCcompare. We performed three comparisons between the cell types, ESC vs. NPC, NPC vs. neuron, and ESC vs. neuron. In each comparison, the data were normalized, low average expression interactions were filtered out, and the differences between the cell types were detected. We also performed a functional enrichment analysis of genes located in differentially interacting regions.

As expected, the ESC vs. neuron had the largest number of differentially interacting regions at 951 (FDR < 0.05). The ESC and NPC had 279 differentially interacting regions, and the NPC and neuron had only 127 differentially interacting regions. These differences expectedly suggest that the undifferentiated ESCs and fully differentiates neuronal cells have many chromatin interaction differences, while the intermediate neural progenitor cells have less differences when compared with either ESCs or neuron cells. These observations suggest that the chromatin structure plays a key role in the process of cell differentiation.

The enrichment analysis for the ESC vs. the neuron found genes enriched in protein binding function, ion channel regulator activity, and “Axon guidance” pathway among others (Additional file 2). The enrichment of these pathways outlines the ESC-to-neuron differentiation processes that are governed by changes in the 3D structure of the genome. When comparing the ESC and NPC cells, genes were found to be enriched in voltage-gated calcium channel activity, ion transporters, and serotonin metabolic processes (Additional file 3). The enrichment results between the NPC and neuron had fewer results but included IgG receptor activity and binding and cytoskeletal protein binding (Additional file 4). These results indicate that the changes in the chromatin structure contain functionally relevant genes for the cell differentiation process.

The results of this HiCcompare analysis show that our methods are capable of detecting biologically meaningful differences in chromatin conformation when comparing different cell types. Together with the results of Fraser et al. [29], the HiCcompare results indicate that the cellular differentiation process involves structural changes of the chromatin, likely leading to the changes in gene expression and the associated biological pathways.

Comparison with diffHiC

The diffHiC pipeline was designed to process raw Hi-C sequencing datasets and detect chromatin interaction differences using the generalized linear model framework developed in the edgeR package [25]. We compared the results of Hi-C data analyzed in the diffHiC paper (human prostate epithelial cells RWPE1 over-expressing the EGR protein or GFP [18]) with the results obtained by HiCcompare. Because diffHic takes unaligned Hi-C data as input it was not possible to directly compare our method to diffHic using introduced known changes. An additional point to consider for the use of diffHic is that since it is based on the negative binomial GLM methods of edgeR, it requires replicates (or multiple samples per condition) in order to more accurately estimate the negative binomial dispersion parameter. Due to the high costs and relative newness of Hi-C technology, many public datasets do not have any (or very few) replicates thus hampering the estimation of the dispersion factor.

To compare HiCcompare with diffHic we performed a HiCcompare analysis on the RWPE1 Hi-C data [18]. This was compared to the analysis performed in the diffHic paper [25]. We performed the analysis at a 1 MB resolution as described in the diffHic paper. diffHic detected a total of 5737 significant differences (FDR < 0.05), while HiCcompare tended to be more conservative, detecting 680 differences (FDR < 0.05) and 5215 differences when multiple testing correction was not applied (p-value < 0.05). Of the 680 differences, 208 overlapped with the regions detected by diffHic. Surprisingly, although diffHiC used CNV correction in their analysis, 2567 (44.7%) of the detected differentially interacting regions overlapped with CNV regions detected in our analysis, and/or blacklisted regions. diffHic tended to detect differentially interacting regions with smaller fold changes as compared to HiCcompare, and at shorter distances between interacting regions, while HiCcompare can detect differences across the full range of distances (Section 6, Additional file 1). These results suggest that detecting chromatin interaction differences represented in the MD coordinates, as implemented in HiCcompare, may be useful in detecting large chromatin interaction differences across the full range of distances, potentially having a more significant biological effect.

Comparison with FIND

The recently published FIND tool uses a spatial Poisson process to detect differences between two Hi-C experimental conditions [30]. FIND is presented as a tool for high-resolution Hi-C data and treats interactions as spatially dependent on surrounding interactions. In order to compare HiCcompare with FIND, we performed a comparative analysis between Hi-C data from K562 and GM12878 cells lines (Section 7, Additional file 1) as done in the FIND paper [30]. The maximum resolution of each Hi-C matrix was calculated using the calculate_map_resolution.sh function from Juicer [31]. Briefly, two replicates for each cell line were obtained (see Methods), and the replicate contact matrices were combined for the HiCcompare analysis. HiCcompare was used to jointly normalize the data between the cell lines and then detect differences. HiCcompare analyses were performed at 1 MB, 100 KB, 50 KB, 10 KB, and 5 KB resolutions. Additionally, the analyses of GM12878 and K562 were used to compare the run times of HiCcompare and FIND (Section 7, Additional file 1).

The number of differences detected by HiCcompare at 5 KB resolution was much lower than the number FIND detected (~ 150,000) [30]. The drop off of the number of differential interactions detected at high resolution by HiCcompare can be explained by the sparsity and the limited dynamic range of interaction frequencies at 5 KB resolution. Additionally, the large number of differences detected by FIND at 5 KB resolution are questionable given that the maximum resolution of the K562 and GM12878 data was found to be ~ 39 KB and ~ 9 KB, respectively (Section 7, Additional File 1).

The differentially interacting regions detect by HiCcompare at different resolutions were intersected with gene locations, and a KEGG pathway enrichment analysis was performed. The enrichment analysis showed that many of the differential regions contained genes involved in the immune system (Table 2). We also found that the enrichment analyses of HiCcompare-detected differences at each resolution were relatively consistent further indicating the strength of HiCcompare at detecting biologically relevant differences across data resolutions. Despite the differences in resolution of data used for differential analysis (5 kb for FIND and 50 kb – 1 Mb for HiCcompare) the enrichment analysis of HiCcompare-detected differences identified pathways related to the immune system, similar to the results of the FIND analysis. These observations suggest that both methods can detect biologically significant differences.

Table 2

Gene enrichment results for HiCcompare analyses

Systemic lupus erythematosus

3.807e-06

6.302e-17

1.025e-02

Antigen processing and presentation

3.807e-06

6.808e-01

9.974e-01

Staphylococcus aureus infection

8.170e-03

2.354e-01

7.604e-01

Viral myocarditis

8.170e-03

1.038e-01

9.657e-01

Allograft rejection

8.170e-03

1.518e-01

9.974e-01

Viral carcinogenesis

3.327e-02

3.659e-08

3.273e-01

Pathways in cancer

9.162e-01

2.236e-02

9.409e-01

To compare the performance of FIND and HiCcompare when a priori known differences were introduced we used replicated data for GM12878 cells. The GM12878 replicates are expected to contain minimal differences, thus suitable for introducing a priori controlled changes and applying both tools in order to detect them. For the data to be entered into FIND, we used the VC squared normalization method from Juicer as described in the FIND paper and the raw data was entered into HiCcompare. We performed this analysis at a resolution of 1 MB (we encountered issues due to extremely long run times of FIND when attempting comparisons at higher resolutions) with fold changes of 2, 3, and 5 for the true changes. HiCcompare successfully detected the majority of the controlled changes while FIND detected smaller differences and was missing most of the introduced controlled changes (Section 7, Additional File 1). Additionally, we found that the run time of FIND on Hi-C matrices at resolutions between 100 KB and 10 KB was extremely long (> 72 h) even when run in parallel on 16 cores, while HiCcompare was able to complete an analysis within minutes (Additional file 1: Figure 3.1). These results further strengthen the notion that HiCcompare detects large chromatin interaction differences potentially having a larger biological impact on genome structure, and does it across the full range of distances.

Preservation of A/B compartments

A/B compartments are the best known genomic structures that can be detected from Hi-C data [6]. To understand the consequences of the joint vs. individual normalization methods on the detection of A/B compartments we compared principal components defining compartments in raw vs. normalized data. The concordance of compartment detection was evaluated using three metrics: 1) the Pearson correlation coefficient between the vectors of principal components (PCs) detected from raw and normalized data, 2) the overlap of signs of PCs defining A (positive) and B (negative) compartments, and 3) the Jaccard overlap statistics. A/B compartments detected following joint normalization were the most similar to those detected in the raw data (Table 3). These results suggest that the joint HiCcompare normalization preserves properties of Hi-C data needed for the accurate detection of A/B compartments.

Table 3

Similarity between A/B compartments detected following various normalization methods

Loess vs. Raw

0.9954

0.8537

0.7971

0.7823

MA vs. Raw

0.9950

0.8539

0.7881

0.7706

ICE vs. Raw

0.9795

0.7850

0.6731

0.6277

KR vs. Raw

0.9489

0.7771

0.5945

0.5000

SCN vs. Raw

0.9309

0.8083

0.6134

0.5495

ChromoR vs. Raw

0.8093

0.6810

0.5210

0.4803

Summary and future directions

HiCcompare can be used to compare processed Hi-C libraries between two biological conditions. HiCcompare represents a user-friendly method for the scientific community to begin analyzing the differences in the 3D genome while making use of publicly available datasets. HiCcompare can also easily be integrated into the existing juicer [31], HiC-Pro [17], and other Hi-C pre-processing pipelines for those generating and analyzing new Hi-C experiments. A future extension of HiCcompare is planned to make use of Hi-C experiments where multiple replicates or samples are available for each group.

Articles You May Like

Solving the mystery of fertilizer loss from Midwest cropland
A novel data-compression technique for faster computer programs
The history of humanity in your face
Scientists use eBird data to propose optimal bird conservation plan
What makes a jellyfish?

Leave a Reply

Your email address will not be published. Required fields are marked *