We propose here an integrative analysis strategy to identify key regulators and pathways in cancers from the TCGA data. By comparing gene expression, genetic variation, and DNA methylation data between normal and cancer samples, we extracted the triple-evidenced genes for 13 cancers and analyzed the characteristics of these genes (Fig. 1a).
The EAGLING model expands the 450 K array methylation data 18 times
In order to expand the Illumina 450 K array DNA methylation data, we previously developed prediction models based on local methylation patterns and sequence features. In this work, we proposed a new model dubbed as EAGLING that has two steps to predict methylation levels of CpGs based on the Illumina 450 K array data. First, it finds a WGBS methylome that shares the most similar local methylation profile around the CpG L under consideration, and the methylation value of the CpG in the selected WGBS methylome is taken as an input feature; second, the methylation value of the closest CpG from Illumina 450 K array is taken as another input feature. A logistic regression model was built on these two features to predict the methylation level at L (Fig. 1b, see details in Methods). Note that this procedure is repeated for each CpG so that different CpGs may be predicted from different WGBS methylomes. Thirty-three tissues/cell lines in which both 450 K array and WGBS data were available were used to optimize the parameters. There are three major improvements over our previous models.2,3 First, DNA sequence features are not included in EAGLING, which significantly improves the speed without deteriorating the performance; second, the methylation value of the closest CpG is used because of its higher precision compared to the weighted neighbor CpGs used in our previous models;2,3 third, more training data are included (33 versus 14 tissues/cell lines), which is expected to improve the model.
We have searched for the optimal number of CpGs to represent the local methylation pattern in step 1, which is crucial to identify the WGBS methylome from which we take the methylation level for the CpG under consideration. We performed the leave-one-tissue-out cross validations using 1–10 neighbor CpGs and 5–50 Kbp flanking regions (Fig. 2a). The flanking regions confine the CpGs we considered. Only the CpGs with the required number of neighbor CpGs in the specified flanking regions were included for expansion, because their local methylation profiles could be accurately represented. We chose four CpGs each on upstream and downstream sides to represent the local methylation profile as there was no performance improvement by including more than four flanking CpGs and 30 Kbp for the flanking region size considering the balance between satisfactory performance and genome coverage (Fig. 2b). Using these parameters, the leave-one-tissue-out cross validations achieved superior performance (Fig. 2c). To show the impact of the training size on the model performance, we trained the EAGLING model using 14 (the training sample size for our previous model in reference 3) and 23 WGBS data sets (the 14 WGBS data plus another nine randomly chosen WGBS data) separately. We compared their predicted results on another 10 WGBS samples not included in the training set. We repeated this cross validation 10 times and the results are shown in Figure S1a. The EAGLING model trained using 23 WGBS data showed improved correlation coefficient (0.8441), concordance (0.8532), accuracy (89.65%), and AUC (0.8645) compared to those trained using 14 WGBS data (0.8321, 0.8375, 87.11%, and 0.8595). Importantly, not including DNA sequence features in EAGLING does not impair the prediction performance while removing the time consuming step of considering sequence features in the previous model (Figure S1b) to achieve a 10 times faster speed.
Furthermore, we applied our EAGLING model on 450 K array data of K562 and HepG2, two independent cancer cell lines from the ENCODE project. The predicted methylation levels were well correlated with the WGBS data in the same cell lines: the correlation, accuracy and AUC on K562 and HepG2 were 0.84, 84.13%, 0.84, and 0.91, 92.27%, 0.87, respectively. The correlation, accuracy and AUC in K562 and HepG2 using our previous model3 were 0.82, 81.13%, 0.80, and 0.89, 90.13%, 0.82, respectively. The superior performance further validated the EAGLING model.
Expanding the Illumina 450 K array data in the TCGA samples
Using the EAGLING model trained on the 33 tissues/cell lines (see details in Table S1), we expanded the Illumina 450 K methylation array data in 13 cancers from TCGA (LUAD, LUSC, BRCA, BLCA, COAD, KIRC, KIRP, PRAD, ESCA, LIHC, THCA, UCEC, and HNSC) that have at least 10 normal samples of Illumina 450 K array and RNA-seq data. The expanded data increased the coverage of CpGs to 18.9 times (about 30% of CpGs in the human genome). Particularly, the intergenic coverage was significantly increased from 39.02% in 450 K array to 50.94% in the expanded data and the non-CpG island coverage also increased from 69.03 to 79.98%, which is important to identify functional enhancers (Fig. 2d, g). The location distribution of the expanded data is much closer to that of all the CpGs in human genome (Fig. 2j) than the original 450 K array. Furthermore, we identified the hyper-methylation (>0.7) and hypo-methylation (<0.3) CpGs from the original 450 K array data and calculated their percentages among all the CpGs for the tumor and normal samples of the 13 cancers (Fig. 2e, f). Obviously, the ratio distributions of the expanded CpGs (Fig. 2h, i) are much closer to those of the WGBS data (Fig. 2k, l), indicating that the analysis results based on the expanded methylation data would be less biased compared to the results from the 450 K array data.
Identification of triple-evidenced genes
Using the expanded methylation data in the 13 cancers, we identified the differentially methylated genes (DMGs) between the tumor and normal samples (see Methods for detail). We also identified the differentially expressed genes (DEGs) using the RNA-seq data and genes containing somatic mutations (see details in Methods and Figure S2). As an example, the overlap between DMGs, DEGs and genes with somatic mutation of LUSC is shown in Fig. 3a. In the 13 cancers, the number of triple-evidenced genes ranges from 396 in PRAD to 1438 in LUSC (Figure S2).
Only a small portion of the triple-evidenced genes were found in more than six cancers (Fig. 3b). The top five triple-evidenced genes found most often in the 13 cancers are listed in Table S2. They were CELSR3, TNXB, TRPM2, KCNAB1, and TRIP13 that were identified as triple-evidenced genes in 11, 11, 11,10, and 10 types of cancers, respectively. We first examined the difference of their methylation and expression levels between tumor and normal samples (Fig. 3c) (difference = normal value − tumor value). CELSR3, TRPM2, and TRIP13 are over-expressed in all the 13 cancers, TNXB is under-expressed in all the 13 cancers, KCNAB1 is over-expressed in KIRC but under-expressed in the other 12 cancers. These genes show abnormal but consistent expression patterns across different cancers. The methylation level does not show clear trend though, indicating that the relationship between gene expression and their promoter methylation is complex, which is consistent with the previous studies.8,9
Four of the five genes have been reported as biomarkers, potential biomarkers or therapeutic targets. CELSR3 was found to be highly expressed in ovarian cancer,4 and hypermethylated in primary oral squamous cell carcinoma, and might be used as a biomarker in OSCC prognostication10 and small intestinal neuroendocrine tumor.5 TNXB was reported to be important for the tumorigenesis of lung adenocarcinoma,6 and was validated as a promising biomarker for early metastasis of breast cancer.7 TRPM2 was reported to be a potential target of the selective treatment of prostate cancer11 and was suggested to be a potential therapeutic target in breast cancer.12 TRIP13 promoted early steps of the DNA double-strand break repair and its presence was associated with progression in prostate cancer and squamous cell carcinoma of the head and neck.13,14 For KCNAB1, there were few reports about its function in cancer, but it was downregulated in follicular carcinoma and could be combined with other genes for the classifier construction.15
As a comparison, we also identified the triple-evidenced genes using the Illumina 450 K data (Figure S3). There are two advantages using the expanded methylation data. First, the triple-evidenced genes can be identified in more cancers. For example, the CELSR3 gene was found as the triple-evidenced gene in two cancers using the original 450 K array data but in 11 cancers using the expanded data; second, consistently, more triple-evidenced genes can be identified in a particular cancer by the expanded data than the original 450 K array data. For example, five genes (FANCI, RECQL4, TACC3, CLU, and SIK1) were reported to function in different cancers10,16,17,18,19 but they could not be identified using the Illumina 450 K array data in any of the 13 cancers; in contrast, all of them were found as triple-evidenced genes in more than six cancers using the expanded data (Table S3).
Triple-evidenced genes are enriched in particular pathways
For each of the 13 cancers, we checked the enriched pathways among the triple-evidenced genes using ingenuity pathway analysis (IPA) (with Benjamini-hochberg adjusted p-value < 0.05). Some enriched pathways are known to be important in cancer, such as MMPs (inhibition of matrix metalloproteinases), VEGF family ligand–receptor interactions, Wnt pathway, NF-kB signaling, MAPK Signaling. The top five pathways most often found in the 13 cancers are shown in Fig. 3d.
Axonal guidance signaling, which belongs to neurotransmitters and other nervous system/organismal growth and development signaling, is enriched in 11 out of 13 cancers (Figure S4). Genes included in the pathway have been implicated in cancer cell growth, survival, invasion, and angiogenesis.20 It was also reported that pancreatic cancer genomes show aberrations in the axonal guidance pathway genes.21 As an example, the triple-evidenced genes in LUSC overlapped with this pathway are marked in purple in Figure S4. The top genes shared in the 11 cancers on the pathway are marked with star shape. Some of them have been targeted by drugs to treat numerous cancers, such as marimastat for breast and lung cancer, and dabrafenib for non small-cell lung cancer.
The other four enriched pathways are hepatic fibrosis/hepatic stellate cell activation, leukocyte extravasation signaling, agranulocyte adhesion, and diapedesis and atherosclerosis signaling. All of them are involved in inflammatory process or response, and their top functions are in cell-to-cell signaling and interaction, cellular movement or immune cell trafficking. The association between the development of cancer and inflammatory is well recognized,22 and about 20% of human cancers are related to chronic inflammatory caused by infections, exposure to irritants or autoimmune disease.23,24 The details of the pathways (LUSC as an example) are shown in Figure S5–S8. Note that the number of cancers in which the enriched pathways was identified is significantly larger than that identified from the original 450 K array data (Figure S9).
Several triple-evidenced genes (MMP9, MMP11, CXCL12, MYL9) appear in three out of the five significantly enriched pathways and in more than half of the 13 cancers. All of these genes have been reported associated with cancers. The promoter methylation of CXCL12 was acted as a prognostic biomarker in prostate cancer patients25 or sporadic breast cancer.26 The low expression level of MYL9 is correlated with a significantly reduced median survival rate in colon cancer patients and might act as clinical biomarkers for the early diagnosis of colon cancer.27 MMP9 and MMP11, both of which belong to Proteins of the matrix metalloproteinase (MMP) family, were reported as tumor biomarkers28 or associated with tumor survival,29 and targeted by an inhibitor marimastat.
Diagnostic power of the triple-evidenced genes
We then investigated whether the triple-evidenced genes are useful in distinguishing cancers from normal samples. We trained a random forest model with the selected triple-evidenced genes to discriminate the pooled cancer samples from the normal ones. As there are a large number of triple evidenced genes in all the cancers, we chose those that appear in more than half of the 10 training cancer types as the candidates. Their associated gene expression and DNA methylation levels of individual CpGs in the promoters of the selected genes in the expanded data were input features; the somatic mutation information could not be included because the mutation information for each gene in every sample was not available. The model was constructed with a cross validation strategy by sampling 10 cancer samples as training data and the remaining three cancer samples as test data for 100 times. LASSO was applied to select features in constructing each random forest model. The features are presumably important if they were most often selected in the 100 cross validations. We list the 47 features selected in more than 50 times of the cross validations in Table S4, including expression of 13 genes and methylation levels of 34 CpGs. As shown in Fig. 4a, most of the test tumor samples could be correctly predicted as cancers while about 75% of the test normal samples were predicted as normal samples. Obviously, using triple-evidenced genes derived from the expanded methylation data outperformed using the original 450 K array data (t-test), particularly the specificity. The AUC of the cancer diagnosis with the triple-evidenced genes is about 0.85, which indicates that there are common differential gene expression and methylation features of the triple-evidenced genes that distinguish tumors from the normal samples.
To validate whether the triple-evidenced genes could also perform well in other datasets, firstly, we extracted gene expression data of normal and tumor tissues of liver, breast, uterus, bladder, esophagea, and colon from GENT,30 and tested the classification performances based on the triple-evidenced genes (Table 1); secondly, we extracted the expression data of five other cancers (STAD, READ, CHOL, GBM and PAAD) not included in the 13 cancers studied here from TCGA, and investigated the prediction performances based on the triple-evidenced genes (Table 2). We used the random forest model constructed from all the tumor samples with 47 features selected in more than half of the 100 cross validations. The prediction results showed satisfactory results and suggested that the triple-evidenced genes are important and robust for pan-cancer analysis.
Furthermore, we investigated whether it is possible to distinguish individual cancers. The candidate features were the combination of all the features selected in any of the 100 cross validations in the above diagnosis analysis. For the 13 cancer samples, multi-class logistic regression model was constructed based on the gene expression and the methylation levels of promoter CpGs using LASSO. The average prediction accuracy with 10-fold cross validation for 100 times was 95.27 ± 0.64%. This accurate prediction indicates that the expression and methylation features based on the triple-evidenced genes reflect the differential patterns not only between cancer and normal samples but also between different cancers. Among the 13 cancers, THCA and PRAD were with the highest accuracies (99.32 and 99.16%), while the LUSC was with the lowest accuracy (87.41%). When looking into the misclassification results, KIRP is prone to be predicted as KIRC and vice versa, LUAD is prone to be predicted as LUSC and vice versa, which is reasonable because they belong to the same tumor category. Also we found that the majority of mis-classified HNSC were predicted as LUSC, and many of the mis-classified LUSC were predicted as HNSC. The interesting results were consistent with the previous reports that patients treated for head and neck squamous cell carcinoma frequently developed second primary tumors in the lung, and they shared many common patterns.31,32
Among the top 20 features (Table S5), 13 features are gene expression and seven are CpG methylation values of the triple-evidenced genes. Some of these features have literature evidences to support their importance in discriminating cancers. For example, the gene expression of SUSD2 is the second most important feature, which is consistent with its reported variable expression in cancers, e.g. down-regulation in colon cancer33 and hepatocellular carcinoma,34 and highly expressed in breast cancer.35 Another example is CYGB expression, the fifth most important feature. CYGB shows variable expression in cancers: it is down-regulated in many cancers36 but up-regulated in lung and brain metastases, and head and neck cancer.37,38 The methylation level at a LAMA4 promoter CpG was found as the seventh most important feature; previously, the aberrant methylation at the LAMA4 promoter was observed in breast carcinoma39 and low methylation was associated with poor progression-free survival.40
Prognostic value of the triple-evidenced genes
We also investigated whether the triple-evidenced genes are useful in predicting survival rate. For the survival data of each cancer (11 cancers with sufficient samples were analyzed, see details in Methods), we applied the LASSO cox proportional hazards regression for feature selection. The candidate features include gene expression and expanded methylation data (the methylation level of all the CpGs in the promoters) of triple-evidenced genes identified in each cancer. The performance was assessed using 10-fold cross validation for 100 times. We compared the concordances (C-indexes) based on four different candidate features (expanded methylation and expression levels, only expanded methylation, only expression level, and the original 450 K methylation and expression levels of the triple-evidenced genes). As an example, the boxplots of the concordances of COAD in cross validation are shown in Fig. 4b. The concordance based on the features selected from using both the expression levels and expanded methylation data is superior to using either data alone, or the combination of the original 450 K methylation and expression levels: the p-values are 0.03 (compared with expanded methylation data), 1.2e-11 (compared with expression data) and 9.1e-6 (compared with the combination of the original 450 K methylation and expression levels).
Furthermore, we focused on the gene features frequently selected among the cross validations to cluster the tumor samples. For example, using the 19 features selected in more than 20% of the cross validations in the COAD samples (as shown in Fig. 4c), hierarchical clustering identified two obvious subgroups, which shows significantly different survival times in the Kaplan-Meier survival plot in Fig. 4d (p-value = 0.00145). The results for the other 10 cancers are shown in Figure S10–S19. In eight out of the 11 cancers, the concordances based on the features selected from the combination of both the expression levels and expanded methylation data are the highest, indicating the usefulness of the expanded methylation data in prognosis analysis. Among the most often selected features, TNXB, RRM2, CELSR3, DBNDD1, and SLC16A3 are the top five most often selected genes in the survival analysis among the 11 cancers (Table S6 and see discussion below).
Triple-evidenced genes important for both diagnosis and prognosis
Nine genes are most often selected in both diagnosis and prognosis analyses: TNXB, RRM2, CELSR3, SLC16A3, FANCI, MMP9, MMP11, SIK1, and TRIM59. Their differential expression and methylation levels between normal and cancer samples as well as the somatic mutations in the 13 cancers are shown in Fig. 5. The expression and somatic mutation patterns of the nine genes are quite consistent in the 13 cancers but the methylation patterns of their promoters vary. These nine genes are currently considered as biomarkers or potential biomarkers for diagnosis or prognosis in specific cancers. However, our analyses suggested that they are likely general biomarkers for at least the 13 cancers analyzed here.
TNXB was reported as a potential marker for prognosis in patients with stage III serous ovarian cancer.41 RRM2 was reported as independent negative prognostic marker for survival in patients with resected pancreas cancer42 and a promising prognostic biomarker and therapeutic target for ER-negative breast cancer patients.43 CELSR3 was suggested as a biomarker in OSCC prognostication,10 and prognostic marker in small intestinal neuroendocrine tumor.5 MMP11 and MMP9 were reported as breast tumor biomarkers and associated with tumor survival.28,29 For SLC16A3, there is no report on its prognostic power but studies showed that it might be an epigenetic marker for clinical outcome in clear cell renal cell carcinoma.44
It is worth noting that FANCI and SIK1 genes could not be identified as the triple-evidenced genes using the original 450 K array data. FANCI belongs to Fanconi anemia complementation group and it was a negative regulator of Akt activation that connects with the oncogenic PI3K-Akt pathway and the tumor suppressing FA pathway.45 This gene has also been linked to drug resistance in cancer treatment.46 SIK1 is stimulated by a cancer suppressor LKB1, which leads to metastatic spread and invasiveness, as well as apoptosis resistance.47 Loss of SIK1 has been found in epithelial ovarian cancer and pancreatic cancer,48 and decreased SIK1 expression is correlated with poor outcome of breast cancer treatment,49 indicating the potential application in prognosis. Our results further support the potential of using FACNI and SIK1 as prognosis marker and provide insight in broadening its application in other cancer types.
Among the nine genes, RRM2, MMP9, MMP11, and SIK1 are known drug targets. For example, they are inhibited by gemcitabine (RRM2), marimastat (MMP9 and MMP11) for treating several cancers. Our analyses suggested these inhibitors may be effective for the majority of the 13 cancers, which suggests possible broader applications of these inhibitors. For example, gemcitabine targeting RRM2 are validated for the treatment of non-small-cell lung cancer,50 ovarian cancer,51 pancreatic cancer,52 adrenocortical cancer,53 and oral squamous cell carcinoma.54 We speculate that gemcitabine can be used to treat bladder, colon, kidney, liver and prostate cancers. Furthermore, HG-9-91-01(SIK1) was reported to induce anti-inflammatory phenotype and could be used to treat certain autoimmune diseases,55,56 we speculate that it can be repurposed to treat cancers as four of the top five enriched pathways in the pan-cancers analysis are closely related to inflammatory processing or inflammation response.