Feature related multi-view nonnegative matrix factorization for identifying conserved functional modules in multiple biological networks

Bioinformatics

In this section, we first present simulation studies to demonstrate the performance of ConMod to detect conserved modules in synthetic multiple networks. We compare ConMod with four state-of-the-art methods, including NetsTensor [11], SC-ML [16], multi-view pairwise co-regularized spectral clustering (pairwiseCRSC) [15] and multi-view centroid-based co-regularized spectral clustering (centroidCRSC) [15]. NetsTensor introduced a tensor-based computational framework to identify recurrent heavy subgraphs in multiple biological networks. SC-ML modeled each graph layer as a subspace on a Grassmann manifold and then efficiently merge these subspaces find a unified clustering of the vertices. PairwiseCRSC and centroidCRSC employed a spectral clustering-based co-regularization framework for clustering across multiple views. Furthermore, to test whether ConMod is effective for finding conserved modules with meaningful biological functions, we apply ConMod to two sets of real biological networks, a set of 33 cancer type-specific gene co-expression networks and a set of human 15 brain tissue-specific protein interaction networks.

Results on synthetic networks

Simulation data

To test the performance, we first evaluate our method using synthetic networks. We generate two sets of synthetic networks that contain different types of conserved modules: (1) conserved modules are common to a given set of networks and (2) conserved modules are present only in a subset of networks and they are the overlapping parts of specific modules across different networks.

We consider the first type of synthetic multiple networks with M=30 networks and N=500 nodes. We generate five modules with 80 nodes in each module and these modules are randomly assigned into 25, 20, 15, 10 and 5 networks, respectively. In this way, each network contains up to five modules. In each network, we connect nodes with a possibility of α (0 αβ (0 βα). An example is shown in Fig. 2a. In order to introduce edge weights, we embed Gaussian noise on the networks (See more details in Additional file 1).

Fig. 2

Performance in terms of TPR, FPR and MCC with different α. a The conserved modules are common to a given set of networks. b The conserved modules are the overlapping parts of specific modules across different networks

For the second type of synthetic dataset, we consider multiple networks with M=15 networks and N=500 nodes. In each network, a module consists of two parts, a common part, in which the nodes are common to a set of networks, and a specific part, in which nodes present only in its individual network. The common parts of every module are regarded as conserved modules in this case. We set two conserved modules of this type for this synthetic dataset. A conserved module has 50 nodes and another has 40 nodes. An example is shown in Fig. 2b. Other procedures for synthetic networks construction is the same as mentioned above (See more details in Additional file 1).

In this study, we experiment on synthetic networks with α = 0.1, 0.3, 0.5 and 0.7 and β = 0.05. Lower value of α means modules are fuzzier and harder to detect.

Evaluation measures

We use true positive rate (TPR), false positive rate (FPR) and Matthew’s correlation coefficient (MCC) [22] to quantify the performance of methods, which are defined as:

$$ TPR=frac{TP}{TP+ FN} $$

(8)

$$ FPR=frac{FP}{FP+ TN} $$

(9)

$$ MCC=frac{TPtimes TN- FPtimes FN}{sqrt{left( TP+ FPright)left( TP+ FNright)left( TN+ FPright)left( TN+ FNright)}}, $$

(10)

where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives and FN is the number of false negatives. A TP decision assigns two related nodes to the same module. A TN decision assigns two unrelated nodes to different modules. An FP decision assigns two unrelated nodes to the same module. An FN decision assigns two related nodes to different modules. MCC returns a value in [−1, 1]. A value of + 1 represents a perfect prediction, 0 is no better than random prediction and − 1 indicates total disagreement between prediction and observation.

Performance

We generate synthetic datasets with different value of α. For our method, we use the parameters λs = 0.01, λp = 0.05 and θ = 2. The effects of parameters will be discussed later in more detail. All experiments are repeated 50 times on random generated datasets and the average results are reported for consistency. Figure 2 shows the examples of synthetic multiple networks with different type of conserved modules and the accuracy of each method in terms of TPR, FPR and MCC. ConMod outperforms the other methods in various value of α whenever the conserved modules are common to a given set of networks (Fig. 2a) or are the overlapping parts of specific modules across different networks (Fig. 2b). In particular, ConMod performs the best when the module structures are fuzzier (α = 0.1).

Next, we evaluate the efficiency of ConMod. We conduct the experiments on a 2.10GHz desktop with 128GB memory. Figure 3a shows the running time when varying the number of nodes and keeping the number of networks as 50. Figure 3b shows the running time when varying the number of networks and keeping each network size as 10,000. We do not compare with NetsTensor and omit the results of SC-ML, PairwiseCSRC and CentroidCSRC when the number of nodes is larger than 10,000 because of their high memory and running time cost. As can be seen from Fig. 3, the running time of ConMod is very low and is almost not affected by the number of networks, especially in large scale multiple networks. Additional figures regarding the other number of networks and nodes are put in the additional file (Additional file 1: Figure S1 and S2).

Fig. 3

Running time evaluation. a The running time when varying the number of nodes and keeping the number of networks as 50. b The running time when varying the number of networks and keeping each network size as 10,000

Conserved functional modules in cancer type-specific gene co-expression networks

In this section, we apply ConMod to multiple large-scale gene co-expression networks of 33 cancers. We aim at finding common signatures and biological functions in different cancers by identifying conserved functional modules. Such conserved gene co-expression modules can help reveal the gene expression regulatory basis for common traits in cancer [23].

We download the mRNA-sequencing data of all available 33 cancer types from The Cancer Genome Atlas (TCGA) database (https://portal.gdc.cancer.gov/). For each cancer type, we only select samples labeled as tumor. The Fragments Per Kilobase Million (FPKM) of each gene is transformed by log2(FPKM + 1). For each cancer type, coding genes with FPKM > 1 in more than 50% of all samples are selected. Then the intersection of expressed genes in all cancer types are used for constructing cancer type-specific gene co-expression networks based on Pearson’s correlation coefficient. Meaningful relations are selected based on first-order partial correlation and information theory by PCIT R package [24]. Finally, we obtain a set of 33 cancer type-specific gene co-expression networks with 7,526 genes for each network.

We compare the performance of ConMod with NetsTensor [11], SC-ML [16], pairwiseCRSC [15] and centroidCRSC [15] by assessing the biological relevance of identified conserved functional modules. Here, we perform systematic enrichment analysis for genes of each module using Gene Ontology (GO) biological process [25, 26]. We use precision, recall and f-score as the evaluation measures in this case. Precision is defined as the fraction of predicted modules that significantly overlap with reference gene sets. Recall is defined as the fraction of reference gene sets that significantly overlaps with predicted modules. F-score is defined as the harmonic mean of precision and recall. We calculate statistical significance p-value using Fisher’s exact test and raw p-values were corrected using the method of Benjamin-Hochberg [21].

Figure 4 shows the performance of ConMod and other methods in terms of precision, recall and f-score w.r.t. different number of candidate modules k. We clearly see that ConMod is more stable than other methods and performs the best in most cases. Besides, we can see that the f-score is high enough when k=150 and has no significant increase after k=150, which can provide a reference for the selection of k. Note that NetsTensor does not need to specify the number of modules in advance, however it does not perform well because of the low node coverage and high overlap between discovered modules.

Fig. 4

Precision, recall and f-score with different k in 33 cancer type-specific gene co-expression networks. Biological relevance of conserved modules is evaluated by GO biological process

Next, after parameter optimizations, we set k=150 and θ = 3.5 and obtain 150 conserved functional modules covering 7,182 genes. The average module size is 113.2. We evaluated the resulting gene modules using multiple gold-standard gene set annotations from MsigDB [27] of GSEA [28], including the biological process category of Gene Ontology (GO) [25, 26], Canonical pathways (CP), Biocarta [29], KEGG [30] and REACTOME [31]. ConMod achieves higher f-scores than other four methods using all reference sets (Fig. 5a). We find that 86 (57%) and 60 (40%) of conserved modules are significantly enriched in at least one GO biological process and KEGG pathway (BH-adjusted p-value5c and d respectively. We observe that these biological functions are related to ribosome protein, energy metabolism, cell cycle and immune response. Most of these functions are necessary to maintain a cell’s life. These modules, acting as housekeeping roles, universally expressed in different tissues. However, cancers require a great deal of DNA replication and protein synthesis. Thus, most of the conserved modules and their functions are also closely associated with cancer. In particular, two significant GO biological processes, antigen processing and presentation and interferon-gamma-mediated signaling pathway, are both essential for immune response, which is often observed to be inhibited in the tumor microenvironment [7, 32]. In addition, we test the relationship between the functional modules and cancer driver genes [33, 34]. By following a previous work [35], we utilized 2,372 genes from the Network of Cancer Genes (NCG) [36] as benchmarking cancer genes, including 711 known cancer genes from the Cancer Gene Census (CGC) [37]. We use Fisher’s exact test to validate whether the modules are significantly associated with benchmark cancer genes (BH-adjusted p-value1: Figure S3). This result indicates that the conserved functional modules identified by our method are able to reveal the characteristics of cancer.

Fig. 5

Illustration of results on a set of 33 cancer type-specific gene co-expression networks. a F-score of five methods. Gene modules found by each method are evaluated by multiple gold-standard gene set annotations. b The scatter plot for the average value of connection strength and participation coefficient of each module. Each point represents an identified module. c Top five significant biological process enriched by the identified modules. d Top five significant KEGG pathways enriched by the identified modules. e Heat map of distribution of modules in multiple cancers. Each row represents a conserved module and each column corresponds to a cancer type. If a module significantly distributes in a cancer’s network (BH-adjusted p-value< 0.01, permutation test), it will be labeled as 1. Otherwise, a module will be labeled as 0

We compute the average value of connection strength and participation coefficient respectively for each conserved module, and we observe that the two features are highly correlated (Pearson correlation coefficient r=0.83) (Fig. 5b). It is easily understood that a dense module conserved in more networks tend to has larger connection strength. After module validation, we can know how the conserved modules distribute in multiple networks (Fig. 5e). We consider that a module exists in a network if its Benjamin-Hochberg adjusted p-value< 0.01 using a permutation test. Modules that do not exist in more than half of all networks are removed. From Fig. 5e we observe that about 25% of identified modules are common in all cancers and almost all modules are conserved in more than half of these cancers. Furthermore, similar cancers can be naturally clustered together only based on the distribution of identified modules (see the hierarchical clustering for cancers in Fig. 5e), such as SKCM (Skin Cutaneous Melanoma) and UVM (Uveal Melanoma); and THYM (Thymoma) and DLBC (Lymphoid Neoplasm Diffuse Large B-cell Lymphoma). Actually, SKCM and UVM are two types of melanoma, THYM and DLBC are both originated in the lymphatic system that participates in immune response.

Here, we take module 27 and module 111 as examples. Module 27, which has the largest connection strength and participation coefficient (Fig. 5b), significant exists in all cancers (BH-adjusted p-value 3840]. In order to investigate the alterations gene expression patterns of this RP related gene module in different cancers, we compute the log2 fold-change for significantly differentially expressed genes in module 27 (Fig. 6b). 17 cancers with at least five normal samples are selected for this experiment. For each cancer, we use DESeq2 [41] to detect differentially expressed genes relative to normal samples. As shown in Fig. 6b, most genes of module 27 are significantly up-regulated in more than half cancers, especially in COAD (Colon Adenocarcinoma), LIHC (Liver Hepatocellular Carcinoma), PRAD (Portal Prostate adenocarcinoma) and three kinds of kidney cancers (KIRC (Kidney renal clear cell carcinoma), KIRP (Kidney renal papillary cell carcinoma) and KICH (Kidney Chromophobe)). Even though cancer cells require continuous ribosome biogenesis and protein translation to maintain their high proliferation rate [39], it is reported that many RP genes have been found overexpressed in cancer and their mutations have been detected in the genome of cancer cells [40, 42], for example, in prostate cancer [43, 44] and in colorectal cancer [45, 46]. Hence, targeting ribosome biogenesis of tumor cells could be an effective strategy [40].

Fig. 6

Illustration of module 27 and module 111. a Top significant biological terms enriched by the genes of module 27. The top five GO terms in biological process and the only one enriched KEGG pathway are displayed. b The heat map illustrates the log2 fold-change of gene expression in module 27 for each cancer relative to normal samples. The bar plot under the heat map shows the average log2 fold-change of all genes in module 27. Cancers with at least five normal samples are selected. c Top significant GO terms in biological process enriched by the genes of module 111. d The presentations of module 111 in DLBC and LUAD. The module mainly split into two dense sub-modules in DLBC, a small part consists of MHC class II genes and a large part consists other immune-related genes

Module 111 consist of 137 genes. Genes in this module mainly involve in antigen processing and neutrophil, leukocyte or T cell related processes, which are all closely related with cancers due to their important roles in immune system (Fig. 6c). This module, however, does not exist in THYM and DLBC (Fig. 5e). Actually, the module mainly splits into two dense sub-modules in THYM and DLBC respectively, but maintains a complete module in the rest cancers, e.g. in LUAD (Lung Adenocarcinoma) (Fig. 6d). In particular, module 111 in DLBC consist of a large sub-module and a small sub-module. The small part comprise 10 genes (CD74, HLA-DQB1, HLA-DRB1, HLA-DQA1, HLA-DRB5, HLA-DMA, HLA-DRA, HLA-DPB1, HLA-DPA1, HLA-DMB), all of which are MHC (major histocompatibility complex) class II genes in HLA (human leucocyte antigen). The separation of the two sub-modules results from the weak correlation in expression between the MHC class II genes in the small sub-module and other immune-related genes in the large sub-module, suggesting a disruption of the co-operation of these genes to exert immunity responses. Actually, DLBC is a cancer of B cells. Cancerous B cells can not normally produce MHC class II molecules, which are exported to B cell’s surface and interact with their intended T cells to initiate immune response [47].

Conserved function modules in human brain tissue-specific interaction networks

The human brain is a complex system organized by structural and functional relationships between its functional regions, such as the thalamus, brainstem and other brain tissues. Recently, multiple brain networks and their applications in neuroscience have successfully uncovered brain-associated features [48, 49]. We now aim to identify conserved protein modules across human tissue-specific networks, which may reveal important function units for brain activity.

We run ConMod on a set of 15 human brain tissue-specific protein interaction networks [50] to find conserved protein modules. There are 2,721 proteins in total. It should be noted that, different from 33 cancer type-specific networks in the above section, all networks of this dataset are unweighted and they have different number of nodes.

We compare ConMod with SC-ML and NetsTensor on this data because other methods are not suitable for the dataset in which the set of data objects is different in each network. Figure 7 shows the performance of ConMod and other methods in terms of precision, recall and f-score w.r.t. different number of candidate modules k. As is shown, ConMod outperforms SC-ML and NetsTensor in precision for all settings of k while maintaining comparable recall values. As an average, ConMod has a better performance in f-score.

Fig. 7

Precision, recall and f-score with different k in 15 brain tissue-specific protein interaction networks. Modules are evaluated by GO biological process

After parameter optimizations, we set k=120 and θ = 4 and obtained 114 conserved functional modules covering 1,414 genes. The average module size is 23.2. We evaluated these modules using multiple gold-standard gene set annotations as the same procedure mentioned in the above section. As shown in Fig. 8a, ConMod achieves higher f-score when evaluated using all reference sets. The identified conserved modules mainly relate to nervous system development, mRNA processing, etc. (Additional file 1: Figure S4). Here, we take module 7 as an example (Fig. 8c). Module 7, which has the largest connection strength and participation coefficient in this dataset (Fig. 8b), consists of seven proteins with a significant number (6) of proteins (EIF2B1, EIF2B2, EIF2B3, EIF2B4, EIF2B5 and EGFR) for glial cell development (BH-adjusted p-value = 3.94E-11). Leegwater et al. [51] reported that the gene mutations of the five subunit proteins (EIF2B1, EIF2B2, EIF2B3, EIF2B4 and EIF2B5) of EIF2B complex can lead to white matter abnormalities, a serious hereditary neurodegenerative disease. Another important gene is EGFR, which is widely distributed in glial cells of mammalian brain. EGFR activation is essential for the proliferation of multipotent neural precursors, as well as the survival, migration, and differentiation of the immature daughter cells [52].

Fig. 8

Illustration of results on 15 brain tissue-specific interaction networks. a F-score of three methods. Gene modules found by each method are evaluated by multiple gold-standard gene set annotations. b The scatter plot for the average value of connection strength and participation coefficient of each module. c A case of identified conserved module that are significant related with glial cell development (FDR = 3.94E-11)

Parameter discussion

There are four parameters in our ConMod method: the regularization parameters λs and λp in multi-view NMF, the number of modules k and the threshold θ for nodes selection. We first discuss the influence of parameters λs and λp. Following a similar approach as proposed in Ref. [14], we set λv to be the same for convenience, that is λv = λs = λp, and varying it from 10−3 to 1 on two synthetic datasets with α = 0.1 and α = 0.3. The optimal values appear when λv is around 0.01 and the accuracy is relatively stable when λv < 0.1 (Additional file 1: Figure S5). We aim at finding conserved modules, thus we let the participation coefficient has a larger effect by denoting λs = 0.01 and λp = 0.05 for all the experiments.

The selection of the parameter k has a significant effect on the results. While the choice of k is often data-dependent and is a long-standing open problem. The lower k of the reduced space is a key parameter for this study. For the synthetic datasets, we set k as the real number of modules. However, for real biological datasets that the real number of modules is unknown, we select a proper k by assessing the enrichment rate of gene modules with respect to GO biological process. A low k with a relative high f-score is selected for each datasets. We have shown in the experiments that we choose k=150 for multiple co-expression networks of cancers (Fig. 4; Additional file 1: Figure S6) and k = 120 for multiple brain-specific protein interaction networks (Fig. 7; Additional file 1: Figure S7).

The parameter θ determines the size of a module. A larger θ means a small size of module, but a more significant signal in the consensus factors. θ is generally larger than 2, because the corresponding p-value is smaller than 0.05. In our experiments we choose θ = 2 for all the synthetic datasets, θ = 3.5 for multiple co-expression networks of cancers (Additional file 1: Figure S6) and θ = 4 for multiple brain-specific protein interaction networks (Additional file 1: Figure S7). The reason for selection these values of parameter θ is that we try to keep small size of modules and a high coverage of total number of nodes while ensure a high accuracy.

Articles You May Like

Rhythms of life: circadian disruption and brain disorders across the lifespan
Einstein was wrong: Why ‘normal’ physics can’t explain reality
Modern lifestyles shaped our evolution only a few thousand years ago
As Insurers Offer Discounts For Fitness Trackers, Wearers Should Step With Caution
A jaw-dropping binary star is about to go supernova, and could produce a gamma-ray burst

Leave a Reply

Your email address will not be published. Required fields are marked *