Adaptively capturing the heterogeneity of expression for cancer biomarker identification


Simple simulation data

We first evaluated the proposed method on simple simulation data. The simulation data contain two groups of genes:Group I consists of G = 1000 non-differentially expressed genes between two classes of samples while group II consists of G = 1000 differentially expressed genes. For group I, the expression values of genes in all samples were randomly sampled from standard normal distribution, while for group II, the expression values of genes in the two classes follow two normal distributions with different means (zero or 0.15) and the same deviation (0.1). Considering the influence of sample size, we varied the sample size of each class n = 6, 10, 20, 50, and in each scenario, twenty data sets were randomly generated and used for avoiding randomness on algorithm evaluation.

We compared the simple GRP and aGRP models on the simulation data. To investigate the property of P(U) and P(D), we plotted P(U) against P(D) for each gene on the simulation data. Results (Additional file 1: Figure S1) show that the GRP model had a complex joint distribution of P(U) and P(D): P(U) + P(D) =1 at τ = 0.5 but P(U) + P(D) < 1 at 1 ≥ τ > 0.5, and drops as τ increases, and in contrast aGRP favored a line P(U) + P(D) =1 as expected, suggesting the more favorable performance of aGRP. To examine the asymptotical significance analysis procedure of aGRP, we then compared the resulting p-values with those empirically estimated by permutation tests with randomly shuffled sample labels. Note that we considered B = 10, 50, 100, 1000 permutations of sample labels in the permutation tests respectively to gradually approximate the null distribution. It was revealed that the permutated p-values become closer to the asymptotic estimator as B increases (See Additional file 1: Figure S2), suggesting the justification of the derived significance analysis procedure.

We then investigated the type-I errors and power of the aGRP and GRP models based on the two groups of genes, respectively. Figure 1a barplots the average type-I errors at an ad hoc p-value cutoff of 0.05 by aGRP and GRP over 20 random data sets in each scenario of sample size. From a statistical perspective, the type-I error at an ad hoc p-value cutoff of 0.05 is expected to be 0.05. From this figure, it can be seen that aGRP had type-I errors closer to 0.05 than those by any of the GRP models in all the data scenarios. Figure 1b compared the powers of aGRP and GRP in identifying the G = 1000 differentially expressed genes at an ad hoc p-value cutoff of 0.05, showing that aGRP is more powerful than the GRP models, especially when sample size is small (n = 6 and 10).

Fig. 1

Average type I errors (a) and power (b) of aGRP and GRP models in different scenarios of sample size at an ad hoc p-value cutoff of 0.05 on Simple simulation data

Simulated gene expression data

To evaluate the performance of aGRP on complex data, we next simulated gene expression data by revising the procedure in the reference [21]. The simulation data mimic real gene expression by forcedly adding hidden dependence structures, i.e., correlation background. We assumed totally G = 10,000 simulation genes and divided them into 6000 non-differentially expressed genes between “tumor” and “normal tissue” and 4000 differentially expressed genes, of which one half up-regulated in tumor and the other half down-regulated. Let n be the sample size of each class, we generated a correlation background X [G × 2n] as follows: 1) randomly forming gene clumps of size m{1, 2, 3, , 100} and clump-wise correlation ρ from U(0.5, 1). 2) generating noise vectors e.j of dimension m × 1 from N(0m,(1-ρ)Im + ρ1m1’m) for sample j, j = 1,2, …,2n, and obtaining the background values of the m genes in the clump x.j=μ + diag(ω)e.j, where μ and ω are an m × 1 vector of elements μg( sim 1000{chi}_5^2 ) and of elements ωg = eβ0/2μgβ1/2 respectively. The correlation background increases the variability of data and makes the expression patterns heterogeneous. In the experiment, we set the parameters β0 = − 5, β1 = 2, and rendered the true expression ratios of DEGs to vary among ( 1+{2}^{-1/2}{e}^{beta_0/2}{delta}_gsim Uleft(1.29,1.58right) ),δg~U(1,2). To investigate the effect of sample size, we considered the four sample sizes n = 6, 10, 20 and 50, and as a result, four simulation data scenarios were obtained. In each scenario, 20 random data sets were generated and their average results were used for algorithm evaluation to overcome randomness.

We calculated the sensitivities, specificities, areas under the ROC curve (AUCs) and accuracies of aGRP at an ad hoc p-value cutoff of 0.05 in different scenarios of the simulated gene expression data. For comparison, we also applied previous methods, GRP models, Limma [8], SAM [14] and another popular non-parametric method, Rankprod [22], to analyze the simulation data. The previous methods, Limma, SAM and Rankprod, were implemented using the R packages Limma, siggenes, RankProd from Bioconductor, respectively. Note that for Limma, the proportional parameter was set as default. Table 1 lists the average performances of aGRP and the previous methods over 20 random data sets in each simulation scenario. From this table, we can clearly see that aGRP achieved higher accuracies than all the previous methods and comparable sensitivities and AUCs with Limma in almost all the simulation scenarios, showing the best overall performances of aGRP. Especially, aGRP is more advantageous for data scenarios of small (n = 6) or large (n = 50) sample size, and the higher sensitivities suggest the superior power of detecting subtle but consistent expression changes. For the GRP model, different settings of the regulation confidence cutoff led to similar results lying between those by aGRP and another non-parameter method, RankProd, as expected. Taken together, these results demonstrate the ability of aGRP in dealing with complex expression patterns for cancer biomarker identification.

Table 1

Performance (mean ± std.%) comparison among different methods on the simulated gene expression data

n = 6


33.24 ± 1.35

89.49 ± 0.91

70.11 ± 2.24

67.79 ± 0.94


39.73 ± 3.07

95.01 ± 1.99

78.54 ± 3.18

72.9 ± 2.59


32.95 ± 0.07

82.36 ± 6.68

70.02 ± 5.14

65.4 ± 4


29.92 ± 2.13

96.85 ± 1.07

78.48 ± 3.04

69.08 ± 1.61


40.97 ± 0.05

94.06 ± 3.47

78.61 ± 2.67

71.73 ± 3.07


42.99 ± 0.02

92.86 ± 1.03

77.98 ± 3.35

70.11 ± 3.62


43.45 ± 4.3

93.16 ± 0.85

80.08 ± 2.98

73.63 ± 2.51

n = 10


56.96 ± 1.34

85.48 ± 0.31

73.22 ± 0.85

73.27 ± 0.57


57.04 ± 3.03

95.49 ± 1.28

88.32 ± 2.92

80.17 ± 1.77


51.08 ± 3.05

77.9 ± 5.75

70.73 ± 4.56

68.73 ± 3.45


47.05 ± 3.59

95.34 ± 1.65

85.42 ± 2.87

76.7 ± 0.99


51.35 ± 3.58

95.16 ± 1.68

85.85 ± 2.98

77.89 ± 1.21


51.01 ± 4.09

96.35 ± 1.18

85.87 ± 1.66

77.81 ± 1.71


56.47 ± 3.4

96.16 ± 1.06

87.36 ± 2.67

79.7 ± 1.64

n = 20


56.51 ± 1.29

85.4 ± 0.31

78.03 ± 0.92

73.84 ± 0.54


86.84 ± 1.01

95.30 ± 1.61

96.02 ± 0.43

91.06 ± 0.37


85.37 ± 0.1

92.45 ± 5.56

90.12 ± 3.73

86.46 ± 3.31


80.5 ± 0.99

95.92 ± 0.92

94.00 ± 1.03

89.65 ± 0.87


80.81 ± 1.58

96.28 ± 0.73

95.74 ± 0.85

89.97 ± 0.98


80.69 ± 1.88

96.21 ± 1.02

94.43 ± 1.02

90.13 ± 0.85


86.4 ± 1.7

95.70 ± 0.57

95.85 ± 0.5

91.75 ± 0.94

n = 50


69.93 ± 0.69

80.07 ± 1.08

83.43 ± 0.92

76.08 ± 0.58


98.94 ± 3.9

95.95 ± 0.73

99.76 ± 1.01

96.57 ± 0.44


92.97 ± 0

89.36 ± 2.85

88.35 ± 1.51

90.82 ± 1.71


97.16 ± 0.90

95.82 ± 1.01

99.51 ± 0.27

96.37 ± 0.25


98.39 ± 0.47

95.43 ± 0.73

99.73 ± 0.16

96.56 ± 0.34


97.06 ± 1.09

95.36 ± 1.04

99.54 ± 0.15

96.08 ± 0.92


98.96 ± 3.4

97.3 ± 0.85

99.85 ± 0.08

98.78 ± 0.51

Application to three real microarray data sets of lung cancer

Lung cancer is one of the most malignant tumors worldwide. We then applied the proposed method to identify gene signatures for lung adenocarcinoma (LUAD) based on three real-world lung cancer microarray datasets collected from GEO ( Selamat’s data (GSE32863), Landi’s data (GSE10072) and Su’s data (GSE7670). When generated, Selamat’s data used the HG-U133A Affymetrix chips for hybridization with 25,441 probes, Landi’s data the Illumina Human WG-6 v3.0 Expression BeadChips with 13,267 probes and Su’s data the Affymetrix Human Genome U133A array with 13,212 probes. All samples in the three datasets were divided into two classes, LUAD and normal tissue of lung (NTL). For the Selamat’s data, there are totally 117 samples, 58 of which are LUAD and 59 NTL samples; for the Landi’s data, there are totally 107 samples, 58 of which are LUAD and 49 are NTL samples; for the Su’s data, there are totally 54 paired LUAD/NTL samples. To preprocess the three datasets, we mapped probes into Entrez IDs and averaged the intensities of multiple probes matching a same Entrez ID to be the expression values of the gene, and adopted the coefficient of variation (CV) criterion with a CV cutoff of 0.05 to remove non-specific or noise genes.

We separately analyzed the three lung cancer data sets for identifying LUAD biomarkers in the experiment. To control false positive rates, the resulting p-values for each gene were corrected using the Benjamini-Hochberg (BH) procedure [21]. The previous methods, GRP, Rankprod [9], Limma [8] and SAM [14], were also applied to re-analyze these data sets for comparison. Figure 2a shows the numbers of DEGs called by these methods on each data set and the number of common DEGs across the three data sets at an ad hoc BH-adjusted p-value cutoff of 0.01. From this figure, we can clearly see that aGRP called more DEGs than those by the previous methods on almost all the three data sets and especially, most common DEGs across these data sets. This is consistent with the higher sensitivity on the simulation gene expression data (Table 1). For the GRP model, τ = 0.7 led to more DEGs than those of τ = 0.5 and 0.9 for two data sets, Landi’s and Su’s, while τ = 0.9 led to more DEGs than those of τ = 0.5 and 0.7 for Selamat’s data, implying the necessity of choosing proper τs for different data applications for the GRP model. In contrast, aGRP adaptively captured the heterogeneity of data sets to automatically reach the optimal performance.

Fig. 2

Comparison of the number of DEGs called among aGRP, GRP models and RankProd on the three LUAD data sets (a) and one HCC RNA-Seq data set (b). GRP0.5, GRP0.7 and GRP0.9 mean GRP models with τ = 05,0.7,0.9, respectively

We further investigated the DEGs more called by aGRP than the previous methods, Limma, SAM and RankProd. Figure 3a shows the histograms of fold changes (FCs) of the DEGs for each of the three methods on the lung cancer data sets. For comparison, the aGRP statistics of these DEGs calculated on the three data sets were shown in Fig. 3b. It can be clearly seen that while the FCs are small with a distribution around one, the corresponding aGRP statistics are generally large, e.g., > 0.3, reflecting the high likelihoods of being regulated between tumor and normal tissues. We then looked into the biology of these DEGs by literature survey and found that many of them are associated with cancer. For example, gene PPP1R1A with a small FC of 0.97 but a large aGRP of 0.39 on the Selamat’s data is a tumor promoter, whose depletion can significantly suppress oncogenic transformation and cell migration. Differential expression of PPP1R1A was often observed in non-small cell lung cancers and colorectal cancers [23]. Luo et al.. [24] revealed that PPP1R1A-mediated tumorigenesis and metastasis relies on PKA phosphorylation-activating PPP1R1A at Thr35 in ewing’s sarcoma. Another gene CP110 with FC = 0.95 and aGRP = − 0.32 on Landi’s data was previously reported to be involved in lung cancers [25]. The inhibition of CP110 by MiR-129-3p are associated with docetaxel resistance of breast cancer cells [26] and centrosome number in metastatic prostate cancer cells [27]. Gene LRRC42 with FC = 1.45 and aGRP = 0.50 on Su’s data was extensively observed to be significantly up-regulated in the majority of lung cancers [28]. Taken together, these results demonstrate the special power of aGRP in capturing subtle but consistent changes of gene expression for cancer biomarker identification.

Fig. 3

Distributions of FC (a) and aGRP statistics (b) of DEGs more called by aGRP than Limma, SAM or Rankprod on the three Lung cancer data sets

As described above, aGRP is featured with the ability of discerning DEGs regulated in different directions by the sign of the statistic aGRP. Totally, aGRP called 2023 common LUAD markers across the three data sets at an ad hoc BH-adjusted p-value cutoff of 0.01. We then divided the common DEGs into two categories: 1104 (Additional file 2: Table S1) with negative aGRP and 869 (Additional file 3: Table S2) with positive aGRP. According to the definition of aGRP, the former are likely down-regulated in LUAD relative to normal lung tissues as potential tumor suppressors. Take as an example TCF21 whose aGRPs are − 0.99, − 0.90 and − 0.99 on Landi’s, Selamat’s and Su’s data set respectively. Biologically, the gene encodes a transcription factor of the basic helix-loop-helix family, and has been previously reported to be a tumor suppressor in many human malignancies including lung cancer [29]. Recently, Wang et al. [30] have reported that the under-representation of TCF21 is likely derived from its hyper-methylation in LUAD. The coordinated pattern of hyper-methylation and under-expression has been observed to be tumor-specific and very frequent in all types of NSCLCs, even in early-stage disease [31]. Smith et al. [29] used restriction landmark genomic scanning to check the DNA sequence of TCF21, consolidating the epigenetic inactivation in lung and head and neck cancers. Shivapurkar et al. [32] employed DNA sequencing technique to zoom in the sequence of TCF21, revealing a short CpG-rich segment (eight specific CpG sites in the CpG island within exon 1) that is predominantly methylated in lung cancer cell lines but unmethylated in normal epithelial cells of lung. We reason that the short CpG-rich segment narrowed down may be responsible for the abnormal down-regulation of TCF21 in LUAD.

On the other hand, the 869 markers with positive aGRP may be potential onco-genes for LUAD. Take as an example COL11A1 (aGRP = 0.92, 0.75 and 0.99 on Landi’s, Selamat’s and Su’s data set respectively). Biologically, the gene is a minor fibrillar collagen involved in proliferation and migration of cells and plays roles in the tumorigenesis of human malignancies. Recently, many studies observed that COL11A1 is frequently abnormally highly expressed both in NSCLC and in recurrent NSCLC tissues and suggested it to be a clinical biomarker for diagnosing NSCLC. Using NSCLC cell lines, Shen et al [33] witnessed the functional promotion of the gene COL11A1 in cell proliferation, migration and invasion of cancer cells, where the outcome of abnormal high expression of COL11A1 can be interceded by Smad signaling [33]. In addition, COL11A1 was also observed to over-express in ovarian and pancreatic cancer and to be an indicator of poor clinical outcome of cancer treatment [34]. Another markers worthy of noticing is HMGA1 with aGRP = 0.93, 0.80 and 0.98 on Landi’s, Selamat’s and Su’s data set respectively. Biologically, the protein encoded by the gene is chromatin-associated and plays roles in the regulation of gene transcription. HMGA1 was previously reported to frequently over-express in NSCLC tissues and to be associated with the metastatic progression of cancer cells. Using immunohistochemistry, Zhang et al [35] experimentally observed that high levels of HMGA1 protein are positively correlated with the status of clinical stage and differentiation degree in NSCLC, and suggested that HMGA1 may act as a convictive biomarker for the prognostic prediction of NSCLC.

To further assess the lung cancer markers identified by aGRP, pathway analysis was done based on functional annotation clustering analysis using DAVID, which is available at As a result, DAVID reported 38 KEGG pathways (Additional file 4: Table S3) that are significantly enriched in the list of total 2023 DEGs at an ad hoc q-value cutoff of 0.1. Literature survey showed that many of these KEGG pathways are related to cancer, e.g. cell cycle (Rank 1, p-value = 1.9 × 10− 5), extracellular matrix (ECM)-receptor interaction (Rank 2, p-value = 1.6 × 10− 4), and Pathways in cancer (Rank 11, p-value = 0.006). Of them, cell cycle comprises of a series of events that take place in a cell leading to the division and duplication of DNA. The pathway, Complement and coagulation cascades (p-value = 5.1 × 10− 4), has been recently reported to dysfunction in lung cancer [36]. The analysis also reported another two lung cancer-related pathways, PI3K-Akt signaling pathway (p-value = 0.009) and small cell lung cancer (p-value = 0.017). Biologically, the former regulates many fundamental cellular functions including proliferation and growth. There exist many types of cellular stimuli or toxic insults which can activate the signaling pathway. When activated, the pathway first employs PI3K to catalyze the production of PIP3 and then PIP3 as a second messenger to activate Akt. An active Akt can phosphorylate substrates that are involved in many vital cellular processes such as apoptosis, cell cycle, and metabolism, which play important roles in tumorigenesis of cells. Accumulated evidences indicate that the PI3K-AKT signaling pathway plays an essential role in lung cancer development. For example, Tang et al. [37] experimentally observed that Phosphorylated Akt overexpression and loss of PTEN expression in non-small cell lung cancer and concluded that the activity of the pathway confers poor prognosis. Recently, many clinical strategies have been suggested to target PI3K-AKT signaling pathway for clinical treatment of lung cancer [38], including the novel anticancer reagent sulforaphene [39]. In addition, Wang et al. [40] reported the role of PI3K/AKT signaling pathway in the regulation of non-small cell lung cancer radiosensitivity after hypo-fractionated radiation therapy.

Comparison of consistency between aGRP and GRP

Both aGRP and GRP are a regulation-based statistic for cancer biomarker identification, whose absolute values and signs indicate the strength and direction of regulation respectively. In the LUAD application, each marker were identified with three values of aGRP (or GRP) derived from the three data sets. Consider the same LUAD topic of the three data sets, the consistency or similarity among the results can be used to evaluate the reasonability and reproducibility of these regulation-based statistics. For this purpose, we divided the range [0.5,1] into five intervals, [η, η + 0.1], η = 0.5,0.6,0.7,0.8,0.9, and determined the genes whose absolute aGRP/GRP fall within each interval. Figure 4a compares the proportions of common genes in the union across the three data sets in each interval between aGRP and GRPs with τ = 0.5, 0.7, 0.9. From this figure, we can clearly see that both aGRP and GRP had a tendency of the proportion of common genes gradually increasing with η, showing the reasonability of regulation-based statistics. Compared with GRP, aGRP led to the higher proportions, irrespective of interval used, suggesting the better consistency of results by aGRP. We further compared the proportions of genes with a same regulation direction in the common genes across the three data sets between aGRP and GRPs in each interval, as shown in Fig. 4b. From Fig. 4b, it can be clearly seen that aGRP achieved the proportions larger than 94.44% (at η = 0.01) on all the intervals, confirming the consistency of the results by aGRP. Although the GRP model with τ= 0.9 had all the proportions of one, the proportions of common genes obtained by it were far lower than those by aGRP in all the intervals (Fig. 4a). Taken together, these results demonstrated the robustness and reliability of aGRP in cancer biomarker identification. The advantage of aGRP should be related to the ability of adaptively capturing the heterogeneity of expression across data sets.

Fig. 4

Changes of proportions of intersection genes (a) and genes with the same regulation direction (b) by aGRP and GRP across the three LUAD data sets with η. GRP 0.5, GRP 0.7 and GRP 0.9 are for the GRP model with τ = 05,0.7,0.9, respectively

Application to RNA-seq expression data

We also evaluated the proposed method on RNA-seq expression data. Hepatocellular carcinoma (HCC) is the third leading cause of cancer-related deaths. We downloaded a HCC RNA-seq data set from the GEO database: Yang’s data (GSE77509) [41], which were measured using Illumina Hiseq 2000. All the samples in the data set consist of 17,501-gene expression profiles of 40 matched HCC patients and adjacent normal tissues. For quality control, we preprocessed the dataset by averaging the raw counts with a same Entrez ID as the expression levels of the corresponding gene. For comparison, we also applied three previous count-based method, DEGSeq [12], DESeq2 [42] and edgeR [13], besides the GRP model and Rankprod as above, to analyze the RNA-seq data in the experiment.

We first examined the similarity between the statistics of aGRP and DEGSeq on the RNA-seq data. As a result, the Spearman correlation of the aGRP statistic and log2 fold change from DEGSeq and the Spearman correlation of p-values derived from aGRP and p-values derived using DESeq are 0.86 and 0.617, respectively. Both of the correlations are not equal to zero at a significance level of < 2.2e-16 (t-test), respectively. Then, we compared the numbers of DEGs called by aGRP and the previous methods at an ad hoc BH-adjusted p-value cutoff of 0.01, as shown in Fig. 2b. From this figure, we can see that aGRP still called more DEGs than those by previous methods, GRP (0.9), Rankprod, DEGSeq, DESeq2 and edgeR, on the RNA-Seq data, consistent with the results on the simulation gene expression and the three lung cancer microarray data, confirming the especial power of aGRP in identifying subtle but consistent expression changes. Among the 7234 DEGs identified by aGRP, there are totally 3548 (Additional file 5: Table S4) and 3686 (Additional file 6: Table S5) with positive aGRP statistics and negative aGRP statistics, respectively. Literature survey shows that many of these genes are associated with HCC or other types of cancer. Among the 3548 positive aGRP DEGs, for example, MMS19 (aGRP = 0.69) is a DNA repair gene playing important role in Nucleotide Excision Repair (NER) pathway, whose single nucleotide polymorphism, rs3740526 has been reported to significantly distinguish adenocarcinoma with squamous cell carcinoma and whose expression levels are clinically related with ACT benefit of resected non-small cell lung cancer patients [43, 44]. TRIB1 (aGRP = 0.66) has been previously evidenced to be associated with tumorigeneses of various types of cancer, e.g., leukemia and colorectal cancer [45, 46]. Especially, Gendelman et al… [47] computationally inferred that TRIB1 is potentially a regulator of cell-cycle progression and survival in cancer cells and experimentally observed that the expression of TRIB1 is predictive of clinical outcome of breast cancer. DDX59 (aGRP = 0.645) has been extensively observed to be highly expressed in lung adenocarcinoma and promote DNA replication in lung cancer development [48, 49]. In addition, among the 3686 negative aGRP DEGs, hormone receptor PGRMC2 (aGRP = − 0.635) was previously reported to be a tumor suppressor and an inhibitor of migration of cancer cell [50]. Recently, Causey et al [51] also observed that the expression level of PGRMC2 is informative in clinically staging breast cancer and is potentially useful to distinguish low stage tumors from higher stages.

Articles You May Like

100 years ago today, a deadly flood of molasses tore through an American town
PAIRUP-MS: Pathway analysis and imputation to relate unknowns in profiles from mass spectrometry-based metabolite data
Solar Farms Shine a Ray of Hope on Bees and Butterflies
Highlights for January 15, 2019
Proteo-Transcriptomic Dynamics of Cellular Response to HIV-1 Infection

Leave a Reply

Your email address will not be published. Required fields are marked *