Selection of platin drug-related genes
We documented genes in the peer-reviewed literature associated with drug effectiveness or responses (Supplementary References, Section B). For cisplatin, carboplatin, and oxaliplatin, 179, 90, and 288 genes were implicated, respectively (Supplementary Table S1). Multiple factor analysis (MFA) was used to determine which genes correlated with the GI50 in breast cancer cell lines through GE, CN, or both,13 which substantially reduced the sizes of the gene sets for cisplatin (N = 39), carboplatin (N = 28), and oxaliplatin (N = 55). Genes with significant relationships to GI50 and the directions of correlations (positive or inverse) are indicated in Figs. 1–3. The diverse functions of these genes included apoptosis, DNA repair, transcription, cell growth, metabolism, immune system, signal transduction, and membrane transport. Analyses of IC50 and GE levels for cisplatin-treated bladder cancer cell lines confirmed relationships that had been identified from the GI50 values of breast cancer cell lines. IC50 values were related to GE for CFLAR, FEN1, MAPK3, MSH2, NFKB1, PNKP, PRKAA2, and PRKCA.20 Similarly, IC50 values obtained from separate bladder cell lines included in the Genomics of Drug Sensitivity in Cancer project (CancerRxGene; http://www.cancerrxgene.org; N = 17)21 correlated with GE for CFLAR, FEN1, and NFKB1, as well as ATP7B, BARD1, MAP3K1, NFKB2, SLC31A2, and SNAI1.
We performed an MFA of the GI50 values for cisplatin, carboplatin, and oxaliplatin, without considering either GE or CN. Responses to cis- and carboplatin were directly correlated (a 6.2° separation between vectors), but neither was related to the oxaliplatin response (Fig. 4). This is consistent with reports that cisplatin-resistant cell lines are generally sensitive to oxaliplatin.22,23,24
SVM-based signatures were initially derived for each platin drug using breast cancer cell line GE data. A 13-gene signature for cisplatin that predicts whether observed growth inhibition is above or below the median GI50 threshold (5.2% cross-validation misclassification rate) consisted of BARD1, BCL2L1, FAAP24, CFLAR, MAP3K1, MAPK3, NFKB1, POLQ, PRKAA2, SLC22A5, SLC31A2, TLR4, and TWIST1. A similarly derived carboplatin signature included AKT1, ATP7B, EGF, EIF3I, ERCC1, GNGT1, HRAS, MTR, NRAS, OPRM1, RAD50, RAF1, SCN10A, SGK1, TIGD1, TP53, and VEGFB (10.4% cross-validation misclassification). For oxaliplatin, the final SVM gene signature consisted of AGXT, APOBEC2, BRAF, CLCN6, FCGR2A, IGF1, MPO, MSH2, NAGK, NAT2, NFE2L2, NOTCH1, PANK3, PRSS1, and UGT1A1 (2.1% cross-validation misclassification). A cisplatin SVM generated from 17 bladder cancer cell lines in CancerRxGene resulted in 2 equally accurate signatures (with 11.8% cross-validation misclassification) consisting of either PNKP and PRKCA, or ATP7B, CFLAR, FEN1, MAPK3, NFKB1, and SLC22A11. These gene signatures were not useful for predicting patient outcomes due to the limited size of the training set.
GI50 threshold-independent modeling
In our previous studies, we set the median GI50 value as the threshold to distinguish drug resistance and sensitivity.5,6 An important question is whether the genes contributing to drug responses are consistent among different cell lines, each with its own GI50 value. Different ML gene signatures were obtained by shifting the GI50 threshold, which changed which cell lines were labeled resistant and which sensitive. After feature selection, the compositions of the gene signatures obtained at each threshold were compared. Finally, the optimized SVMs with Gaussian kernels derived at the different GI50 thresholds were combined by ensemble averaging to create a single aggregated, threshold-independent, ML-based predictive model composed of all genes that were selected in any of the threshold-specific models (i.e., a composite gene signature).
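The effect of shifting the threshold on class labels can be illustrated with a minimal sketch. The cell line names and GI50 values below are hypothetical, and two assumptions are made: GI50 is expressed on a −log10 molar scale, and a cell line whose GI50 is at or above the threshold is labeled sensitive.

```python
def label_cell_lines(gi50_by_line, threshold):
    """Label a cell line 'sensitive' if its GI50 is >= threshold, else 'resistant'."""
    return {
        line: "sensitive" if gi50 >= threshold else "resistant"
        for line, gi50 in gi50_by_line.items()
    }

# Hypothetical GI50 values (assumed -log10 molar scale); not data from this study.
gi50_by_line = {"lineA": 4.8, "lineB": 5.1, "lineC": 5.3, "lineD": 5.6, "lineE": 6.0}

# Shift the threshold and record how the class labels change at each setting.
labels_by_threshold = {
    t: label_cell_lines(gi50_by_line, t) for t in (5.0, 5.3, 5.7)
}

for t, labels in labels_by_threshold.items():
    n_sensitive = sum(1 for v in labels.values() if v == "sensitive")
    print(f"threshold {t}: {n_sensitive} sensitive, {len(labels) - n_sensitive} resistant")
```

Each relabeled set would then be passed to feature selection and SVM training separately, which is why the resulting gene signatures can differ between thresholds.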
Kinase (MAPK3 and MAP3K1) genes and apoptotic family members (BCL2 and BCL2L1) were the most common genes in the cisplatin signatures at different GI50 thresholds, with consistent representation of error-prone and base-excision DNA repair genes as well (Fig. 5a and Supplementary Table S2A). The kinases were more concentrated in signatures with lower drug sensitivity thresholds, whereas BCL2 and BCL2L1 were more ubiquitous at all levels. The error-prone polymerases POLD1 and POLQ were more frequently detected in gene signatures with lower sensitivity thresholds, while the flap endonuclease FEN1 tended to be present at high levels of resistance. Thresholded gene signatures for carboplatin-related genes commonly contained the apoptotic family member AKT1, transcription regulation genes ETS2 and TP53, as well as cell growth factors VEGFB and VEGFC, although the latter were less common at lower sensitivity thresholds (Fig. 5b). Common oxaliplatin-related genes included the transporters SLCO1B1 and GRTP1 (but not SLC47A1), transcription-related genes NFE2L2, PARP15, and CLCN6, and multiple metabolism-related genes (Fig. 5c).
GI50-thresholded ML models were also derived using the log-loss function to evaluate whether an alternative loss function for classification would yield gene signatures that differ significantly from the misclassification-based gene signatures, in both the distribution of selected genes and model accuracy on patient data. The log-loss function penalizes false classifications; its value ranges from 0 (completely accurate) to 1 (completely inaccurate; Supplementary Table S3). The overall distribution of genes across GI50 thresholds exhibited many similarities to the gene signatures derived by misclassification. For both sets of cisplatin gene signatures, BCL2, BCL2L1, and FEN1 were common in low-to-moderate GI50 thresholds, while NFKB1 was enriched at high thresholds (Fig. 5a and Supplementary Figure S1A). For carboplatin, AKT1, VEGFB, and VEGFC were similarly distributed across GI50 thresholds with both methods, although VEGFB was less frequently represented in log-loss-based gene signatures at low GI50 values (Fig. 5b and Supplementary Figure S1B). In both sets of oxaliplatin gene signatures, SIAE and SLC47A1 were present at high frequencies across all GI50 thresholds, whereas ABCG2 was present less frequently (<50% inclusion; Fig. 5c and Supplementary Figure S1C). Differences between signatures selected by minimizing log-loss and those selected by minimizing misclassification rates were also observed. EGF and ERCC1 were selected at a greater frequency at a moderate carboplatin GI50 using the log-loss function than using misclassification. Similarly, among oxaliplatin signature genes, APOBEC2, HLA-B, LTA, and MPO were selected considerably more often using the log-loss function. Therefore, while the misclassification- and log-loss-based gene signatures are not interchangeable, they are quite similar overall.
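The difference between the two selection criteria can be sketched as follows. The standard mean binary cross-entropy is assumed here as the log-loss; the exact form and scaling behind the values in Supplementary Table S3 are not given in this section, and the unscaled form shown is not bounded above by 1. The labels and predicted probabilities are hypothetical.

```python
import math

def misclassification_rate(y_true, p_pred, cutoff=0.5):
    """Fraction of samples whose thresholded prediction disagrees with the label."""
    wrong = sum(1 for y, p in zip(y_true, p_pred) if (p >= cutoff) != bool(y))
    return wrong / len(y_true)

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical labels (1 = sensitive) and predicted probabilities from two models.
y_true = [1, 1, 0, 0]
confident = [0.9, 0.8, 0.1, 0.2]
hesitant = [0.6, 0.55, 0.45, 0.4]

for name, probs in (("confident", confident), ("hesitant", hesitant)):
    print(name, misclassification_rate(y_true, probs), round(log_loss(y_true, probs), 3))
```

Both hypothetical models classify every sample correctly, so misclassification cannot distinguish them, whereas log-loss favors the model with more confident correct predictions. This is one way in which feature selection under the two criteria could diverge.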
Log-loss gene signatures were initially constructed either using (a) a modified version of the misclassification-based method, or (b) the backwards feature selection (BFS) software described by Zhao et al.25 Multiple signatures with low log-loss values can have different compositions, consistent with the possibility that various diverse gene combinations may give rise to signatures with satisfactory performance. However, these signatures often contain a larger number of gene features than the misclassification-based signatures, raising the possibility that they might be more prone to overfitting. This concern was addressed by generating gene signatures by minimizing log-loss using both methods. The median GI50-thresholded cisplatin gene signature generated using the log-loss modified software [ATP7B, BCL2L1, CDKN2C, CFLAR, ERCC2, ERCC6, FAAP24, FOS, GSTO1, GSTP1, MAP3K1, MAPK13, MAPK3, MSH2, MT2A, PNKP, POLD1, POLQ, PRKAA2, PRKCA, PRKCB, SLC22A5, SLC31A2, SNAI1, TLR4, and TP63] shares 15/19 genes with the signature generated using the BFS software25 [ATP7B, BARD1, BCL2, BCL2L1, ERCC2, FAAP24, FEN1, FOS, MAP3K1, MAPK13, MAPK3, MSH2, MT2A, NFKB1, PNKP, POLQ, PRKCB, SLC22A5, and SNAI1].
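The overlap between the two median-threshold cisplatin signatures can be checked with simple set operations on the gene lists given above. The variable names are illustrative only.

```python
# Median GI50-thresholded cisplatin signature from the log-loss modified software.
modified_signature = {
    "ATP7B", "BCL2L1", "CDKN2C", "CFLAR", "ERCC2", "ERCC6", "FAAP24", "FOS",
    "GSTO1", "GSTP1", "MAP3K1", "MAPK13", "MAPK3", "MSH2", "MT2A", "PNKP",
    "POLD1", "POLQ", "PRKAA2", "PRKCA", "PRKCB", "SLC22A5", "SLC31A2",
    "SNAI1", "TLR4", "TP63",
}

# Signature generated with the BFS software.
bfs_signature = {
    "ATP7B", "BARD1", "BCL2", "BCL2L1", "ERCC2", "FAAP24", "FEN1", "FOS",
    "MAP3K1", "MAPK13", "MAPK3", "MSH2", "MT2A", "NFKB1", "PNKP", "POLQ",
    "PRKCB", "SLC22A5", "SNAI1",
}

shared = modified_signature & bfs_signature
bfs_only = bfs_signature - modified_signature

print(f"{len(shared)}/{len(bfs_signature)} BFS genes are shared")
print("BFS-only genes:", sorted(bfs_only))
```

The four genes found only in the BFS signature are BARD1, BCL2, FEN1, and NFKB1.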
Impacts of features in gene signatures
To determine the contributions of individual genes to the overall cross-validation accuracy of a gene signature, each gene was independently excluded from every SVM signature and model accuracy was reassessed (Supplementary Table S2A, S2B, and S2C contain cis-, carbo-, and oxaliplatin gene signatures, respectively). The elimination of ERCC2, POLD1, BARD1, BCL2, PRKCA, and PRKCB consistently and significantly increased the misclassification error (average > 16% increase) in moderate threshold cisplatin SVMs (GI50 thresholds: 5.1–5.5). ERCC2 and POLD1 perform critical functions in nucleotide and base excision repair, respectively. PRKCA and PRKCB are paralogs with significant roles in signal transduction. BARD1 has been shown to reduce the expression of the apoptotic BCL2 gene in the mitochondria,26 and has a key role in genomic stability through its association with BRCA1. The genes NFKB1, NFKB2, TWIST1, TP63, PRKAA2, and MSH2 showed a high variance in increased misclassification between different gene signatures. The variance observed for these genes may be due to epistatic interactions with other biological components, including the other genes in the SVM. For example, NFKB1 and NFKB2 are jointly included in 7 SVMs generated at a moderate GI50 threshold. Possible epistasis was observed, as the removal of either of these genes, but not necessarily both, exerted a substantial impact on model misclassification rates (≥18.0% increase). The misclassification variance of NFKB1 was significantly lower in gene signatures that also contained NFKB2 than in those lacking NFKB2.
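The gene-exclusion procedure can be sketched as below. To keep the example self-contained, a nearest-centroid classifier with leave-one-out cross-validation is substituted for the Gaussian-kernel SVMs used in this study; the gene names G1 and G2 and all expression values are hypothetical.

```python
def loo_error(samples, labels, genes):
    """Leave-one-out misclassification rate of a nearest-centroid classifier."""
    errors = 0
    for i, sample in enumerate(samples):
        centroids = {}
        for cls in sorted(set(labels)):
            # Class centroid computed without the held-out sample i.
            rows = [s for j, s in enumerate(samples) if j != i and labels[j] == cls]
            centroids[cls] = {g: sum(r[g] for r in rows) / len(rows) for g in genes}
        predicted = min(
            centroids,
            key=lambda c: sum((sample[g] - centroids[c][g]) ** 2 for g in genes),
        )
        errors += predicted != labels[i]
    return errors / len(samples)

# Hypothetical expression data: G1 separates the classes, G2 does not.
g1 = [1.0, 1.2, 0.9, 1.1, 2.0, 2.2, 1.9, 2.1]
g2 = [0.5, 0.1, 0.9, 0.3, 0.4, 0.8, 0.2, 0.6]
samples = [{"G1": a, "G2": b} for a, b in zip(g1, g2)]
labels = ["sensitive"] * 4 + ["resistant"] * 4
genes = ["G1", "G2"]

baseline = loo_error(samples, labels, genes)
# Change in misclassification when each gene is excluded on its own.
impact = {
    g: loo_error(samples, labels, [x for x in genes if x != g]) - baseline
    for g in genes
}
print(baseline, impact)
```

A gene whose removal raises the error carries information the remaining genes cannot replace, whereas a gene with no impact is redundant in that particular signature.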
Derivation of gene signatures from data obtained from patients with bladder carcinoma
Gene signatures derived from cell line data were validated using data from patients with cancer. To explore whether similar gene signatures arise from patient data, we also developed SVMs using TCGA (The Cancer Genome Atlas) data from patients with bladder urothelial carcinoma treated with cisplatin and/or carboplatin, with post-treatment time to relapse serving as a surrogate criterion for different GI50 resistance thresholds (as performed in Mucaki et al.;27 Supplementary Table S4). Trends similar to those of the cell line SVMs were apparent: POLQ was frequently included in gene signatures with a recurrence threshold of a longer duration, while FEN1 was a marker of resistance when the time to relapse was shorter. However, BCL2, which is present in a majority of breast cancer cell line SVMs, was present in only one gene signature derived from TCGA data. Conversely, MSH2 was rarely selected using cell lines, yet appeared in nearly all patient-derived SVMs with >1-year recurrence. The independently derived patient SVMs could not be used for any other analyses.
Validation of cell line-based models using data from patients with cancer
GI50-thresholded models for each platin drug, which were generated with the breast cancer cell line data, produced 70 cisplatin, 83 carboplatin, and 83 oxaliplatin SVM gene signatures. Each of the thresholded gene signatures was applied to available platin-treated patient datasets to understand how the choice of GI50 threshold for training on cell line data impacted the predictive accuracy when the resulting gene signatures were assessed according to patient outcomes.28,29,30,31,32 In this study, cisplatin gene signatures were validated using data from patients with bladder cancer, carboplatin signatures were validated using data from patients with ovarian cancer, and oxaliplatin signatures were validated using data from patients with CRC. While the available data contained the necessary GE information, the clinical response metadata differed between studies. The responses of patients with bladder cancer to cisplatin were described as post-treatment survival by Als et al.,31 whereas patients with CRC treated with oxaliplatin were categorized as responders and non-responders by Tsuji et al.32 TCGA provided two different measures that we used to assess the predictive accuracy of our gene signatures—clinical response to chemotherapy and disease-free survival. Signature accuracy was similar using either measure (Supplementary Table S5A); however, recurrence and disease-free survival were used as the primary measures of responses, as these outcomes were more consistently recorded among the TCGA datasets tested. Patients in the study by Als et al.31 with a ≥5-year post-treatment survival were labeled as sensitive to treatment. The differences between these metadata may partially contribute to differences in the prediction accuracy of the thresholded SVM gene signatures.
At higher resistance thresholds for any platin drug (low GI50), where more cell lines were labeled sensitive, the positive class (disease-free survival) was correctly classified, while the negative class (recurrence) was highly misclassified (Supplementary Figures S2 and S3). The reverse was true for gene signatures derived using lower resistance thresholds (high GI50). For these reasons, SVMs generated at these extreme thresholds were not very useful at predicting patient outcomes. When used to predict recurrence in the TCGA datasets, sensitivity and specificity appeared to be maximized in gene signatures where the GI50 threshold for resistance was set near (but not necessarily at) the median (Supplementary Figure S2 and Supplementary Table S5A to S5C). While this pattern held true for the data reported by Tsuji et al.,32 oxaliplatin gene signatures, where GI50 thresholds were set above the median, better separated patients with primary and metastatic CRC (best signature predicting 92.6% of metastases and 60.7% of primary cancers; Supplementary Table S5C). Although less consistent, cisplatin gene signatures generated with thresholds above the median GI50 performed better when evaluating the patient dataset reported by Als et al. (Supplementary Figure S3).31
Gene signatures were individually evaluated for their accuracy in TCGA patients using various post-treatment recurrence times to classify resistant and sensitive patients (0.5–5 years; Supplementary Table S6A-C). The best-performing cisplatin signature (hereafter identified as Cis1; Table 1) correctly classified 71.0% of patients with bladder cancer who experienced recurrence after 18 months (N = 31) and 58.5% of disease-free patients (N = 41). The best-performing carboplatin gene signature (designated Car1; Table 1) predicted the recurrence of ovarian cancer after 4 years with an accuracy of 60.2% (N = 302) and was 61.0% accurate for disease-free patients (N = 108). For oxaliplatin, the best-performing gene signature (designated Oxa1; Table 1) correctly classified 71.6% of the disease-free TCGA patients with CRC after 1 year (N = 88) and was 54.5% accurate in predicting recurrence (N = 11). These gene signatures (based on GE measured with the Affymetrix Gene Chip Human Exon 1.0 ST arrays), TCGA sample expression data, and SVMs based on bladder cell line data (based on expression measured using the Affymetrix U133A microarray) were added to the web-based SVM calculator (http://chemotherapy.cytognomix.com; introduced in Dorman et al.6) to predict platin responses.
The TCGA bladder cancer dataset contained 19 patients treated with carboplatin (but not cisplatin), which enabled an evaluation of the specificity of cisplatin models relative to patients who were not treated with this drug. The cisplatin model that best predicted outcomes of carboplatin-treated patients with bladder cancer in TCGA was not Cis1 (the best-performing cisplatin model) but rather Cis12 at 2 years post-treatment (80% accurate for responding patients [N = 5]; 93% for recurrent patients [N = 14]). Cis12 contains 9 genes that are not present in Cis1, including ATP7B, a gene present in many of our carboplatin models. The presence of this gene may have a significant impact on the overall accuracy of Cis12 in determining the outcomes of the carboplatin-treated patients with bladder cancer. We also applied the carboplatin gene signatures to these 19 patients. The signature that best predicted their response (Car73) was 84% accurate for patients after 1 year of treatment (100% for responding patients [N = 11]; 62.5% accuracy for recurrent patients [N = 8]). Interestingly, Car73 also contains ATP7B, as does Cis12. Two additional carboplatin gene signatures were tied for overall accuracy (84%; Car9 and Car51), but more successfully predicted non-responsive patients (87.5%; 82% accuracy for responding patients). AKT1, ETS2, GNGT1, and VEGFB were shared among these carboplatin gene signatures.
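The tie in overall accuracy between Car73 and Car9/Car51 follows from the per-class results. In the sketch below, the counts of correctly classified patients are inferred from the reported percentages and class sizes (e.g., 62.5% of 8 recurrent patients is taken to be 5) and are therefore assumptions rather than reported values.

```python
def overall_accuracy(correct_responding, n_responding, correct_recurrent, n_recurrent):
    """Overall accuracy as all correct calls divided by all patients."""
    return (correct_responding + correct_recurrent) / (n_responding + n_recurrent)

# Car73: 100% of 11 responding and 62.5% of 8 recurrent patients (inferred counts).
car73 = overall_accuracy(11, 11, 5, 8)

# Car9 and Car51: about 82% of 11 responding and 87.5% of 8 recurrent (inferred counts).
car9_car51 = overall_accuracy(9, 11, 7, 8)

print(f"Car73: {car73:.1%}; Car9/Car51: {car9_car51:.1%}")
```

Under these inferred counts, each signature classifies 16 of 19 patients correctly but distributes its errors differently between the two classes.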
Distances from the hyperplane for all SVMs generated were determined for patients with a short recurrence time to evaluate the consistency of the predicted responses of TCGA patients with bladder cancer who were treated with cisplatin (<6 months, N = 10; Supplementary Figure S4). Despite showing similar levels of resistance to treatment, distances differed between patients. While these patients were expected to be indicated as highly cisplatin-resistant (hyperplane distance < 0), two patients (TCGA-XF-A9SU and TCGA-FJ-A871) were predicted to be sensitive by nearly all SVM gene signatures. Similar variations were also observed in patients with either a long recurrence time (>4 years) or no recurrence after 6 years (Supplementary Figure S5).
A threshold-independent model was generated for each platin drug by ensemble ML, in which the hyperplane distances from the SVMs derived at different GI50 thresholds were averaged to produce a composite score for each TCGA patient tested (i.e., a composite gene signature). Hyperplane distances across all 70 cisplatin gene signatures were similar, with a mean score of −0.22 and a standard deviation of 3.5 hyperplane units (hu) across the set of patient data. The ensemble model classified disease-free patients diagnosed with bladder cancer with 59% accuracy and those with recurrent disease with 47% accuracy. Limiting ensemble averaging to only cisplatin gene signatures generated at a moderate GI50 threshold (ranging from 5.10 to 5.50) did not significantly improve accuracy (44% for disease-free patients and 66% for recurrent patients; Supplementary Table S7A). For carboplatin, mean hyperplane distances were similar (−0.11 ± 3.9 hu), and ensemble ML did not produce significantly better predictions than random, regardless of the GI50 threshold interval selected (Supplementary Table S7B). For oxaliplatin, the ensemble ML model (mean = −0.12 ± 2.7 hu) was most accurate after 1 year (60% accuracy for disease-free patients and 73% for recurrent patients; Supplementary Table S7C). As with cisplatin, limiting this analysis to oxaliplatin SVM gene signatures with moderate GI50 thresholds did not significantly increase accuracy.
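Ensemble averaging of hyperplane distances can be sketched as follows. The patient identifiers and distances are hypothetical, and assigning a composite score of exactly zero to the sensitive class is an assumption; negative distances are treated as resistant, consistent with the convention used above.

```python
from statistics import mean

def ensemble_predict(distances):
    """Average signed hyperplane distances from several SVMs into one composite score."""
    score = mean(distances)
    return score, ("sensitive" if score >= 0 else "resistant")

# Hypothetical hyperplane distances (hu) for two patients across three thresholded SVMs.
distances_by_patient = {
    "patient1": [1.5, -0.5, 2.0],
    "patient2": [-3.0, 0.5, -1.0],
}

predictions = {
    patient: ensemble_predict(distances)
    for patient, distances in distances_by_patient.items()
}

for patient, (score, call) in predictions.items():
    print(f"{patient}: composite score {score:.2f} hu -> {call}")
```

Because the average weights every threshold-specific SVM equally, a few models with large distances of one sign can outweigh a majority of models with small distances of the opposite sign.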
The misclassification-based cisplatin, carboplatin, and oxaliplatin gene signatures were also evaluated with k-fold cross-validation of TCGA data from patients with bladder, ovarian, and colorectal cancer, respectively. This cross-validation approach was independent of the cell line data; namely, the genes and hyper-parameters of the signatures were retained, but the GE data used were derived exclusively from patients. Patients were evenly distributed into 5 groups with an equal (or near-equal) ratio of disease-free and recurrent patients. The majority of the cisplatin gene signatures showed an overall accuracy >50%. The cisplatin gene signature that performed best under the k-fold analysis (6-resistance level; BARD1, BCL2, BCL2L1, PRKAA2, PRKCA, PRKCB, and TWIST1) showed an overall accuracy of 71.2% (84.4% accurate for sensitive patients and 53.9% accurate for resistant patients). The accuracy of the carboplatin and oxaliplatin gene signatures did not exceed 60%. In general, treating the patient data as a held-out test set yielded higher performance estimates than training and evaluating the models on the patient data using k-fold cross-validation.
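One way to distribute patients into 5 groups with a near-equal class ratio is round-robin assignment within each class, sketched below. The class sizes (41 disease-free, 31 recurrent) are borrowed from the Cis1 evaluation purely for illustration, and the random shuffle that would normally precede the assignment is omitted to keep the example deterministic.

```python
def stratified_folds(labels, k=5):
    """Assign sample indices to k folds, dealing each class out round-robin."""
    folds = [[] for _ in range(k)]
    position = 0  # running counter keeps total fold sizes within one of each other
    for cls in sorted(set(labels)):
        for idx in (i for i, label in enumerate(labels) if label == cls):
            folds[position % k].append(idx)
            position += 1
    return folds

# Illustrative class sizes only; patient order here carries no meaning.
labels = ["disease-free"] * 41 + ["recurrent"] * 31
folds = stratified_folds(labels, k=5)

for n, fold in enumerate(folds):
    n_recurrent = sum(1 for i in fold if labels[i] == "recurrent")
    print(f"fold {n}: {len(fold)} patients, {n_recurrent} recurrent")
```

Each fold then serves once as the test set while the model is retrained on the remaining four, using the genes and hyper-parameters fixed by the cell line-derived signature.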
Predicting cisplatin responses in patients based on smoking history
Tobacco smoking is the risk factor known to contribute most to the development of bladder cancer.33 Patients with head and neck cancer who smoke while undergoing cisplatin and radiotherapy treatment have been shown to have a shorter overall survival.34 We therefore subdivided the patients based on their smoking history and tested the thresholded gene signatures (Supplementary Tables S8 and S9). Among lifelong non-smokers, Cis1 correctly predicted all patients who were recurrent after 18 months as cisplatin-resistant (N = 5). Prediction accuracy for disease-free patients was 57.1% (N = 14). Another gene signature (Cis18; Supplementary Table S8) performed equally well for non-smokers, and these two gene signatures shared the genes BCL2, BCL2L1, FAAP24, MAP3K1, MAPK13, MAPK3, and SLC31A2. The threshold-independent analysis predicted the disease-free status equally well, but recurrence was predicted less accurately (66.7%). Notably, non-smokers comprised a small subset of the patients tested (N = 19). The threshold-independent prediction of recurrence in patients with a smoking history was 46% accurate (N = 13), while disease-free patients were correctly predicted at a rate of 58% (N = 19). Recurrence in these patients was best predicted by a gene signature built at the median GI50 threshold (Cis2). Accuracy improved for both disease-free (57.7–61.9%) and recurrent patients (76.0–78.6%) when excluding patients who quit smoking more than 15 years before the diagnosis. This SVM included the CFLAR and PRKAA2 genes, which were not present in the two gene signatures that performed well for non-smokers.
To determine which genes in these gene signatures led to discordant predictions of patient outcomes, we gradually altered the expression of each signature gene until the misclassification was corrected. Alterations in the expression of MAP3K1, MAPK3, SLC22A5, and SLC31A2 corrected discordant predictions of patient outcome. Alterations in BCL2L1 expression were more likely to correct the discordant predictions of Cis1 (4 of 5) than Cis2 (2 of 4). If the expression change reached ≥3 times the highest or lowest expression of that gene and the prediction still remained unchanged across different patients, the impact of that gene on the signature was considered to be limited. By these criteria, altering the expression of PRKAA2, NFKB1, NFKB2, or TWIST1 could not correct a discordant prediction.
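The expression-alteration test can be sketched as a one-dimensional scan. A linear decision function stands in for the Gaussian-kernel SVMs, and the scan limits (one third of the lowest and 3 times the highest observed expression) are one assumed reading of the ≥3 times criterion; the gene names, weights, and expression values are hypothetical.

```python
def decision(expr, weights, bias):
    """Signed score of a linear classifier; negative is read as resistant."""
    return sum(weights[g] * expr[g] for g in weights) + bias

def flip_by_altering(expr, gene, weights, bias, lower, upper, steps=200):
    """Scan one gene over [lower, upper]; return the value nearest the original
    expression that changes the predicted class, or None if none does."""
    original_call = decision(expr, weights, bias) >= 0
    candidates = [lower + (upper - lower) * i / steps for i in range(steps + 1)]
    for value in sorted(candidates, key=lambda v: abs(v - expr[gene])):
        trial = dict(expr, **{gene: value})
        if (decision(trial, weights, bias) >= 0) != original_call:
            return value
    return None

# Hypothetical model and patient: G1 has a large weight, G2 a negligible one.
weights, bias = {"G1": 1.0, "G2": 0.01}, -5.0
patient = {"G1": 4.0, "G2": 6.0}
observed_range = {"G1": (2.0, 8.0), "G2": (3.0, 9.0)}  # lowest, highest expression

flips = {}
for gene, (low, high) in observed_range.items():
    flips[gene] = flip_by_altering(patient, gene, weights, bias, low / 3, 3 * high)
print(flips)
```

In this toy case, a modest increase in G1 reverses the prediction, whereas no value of G2 within the allowed range does, so G2 would be considered to have limited impact on the signature.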
Cytosine methylation levels of genes in cisplatin models
Tobacco smoking has a significant impact on cytosine methylation levels in the genome.35 CpG island methylation is associated with smoking in pack years in a subset of the TCGA patients with bladder urothelial carcinoma.28 We suspected that the methylation levels of the genes in the SVMs that performed best for smoking and non-smoking patients might differ and exert possible concomitant effects on GE. When ranking each gene from Cis1 by the highest methylation level and GE, 88 of 1080 patient–gene combinations showed the expected inverse correlation between methylation levels and GE (i.e., high methylation and low GE). Inverse correlations were more frequent than direct correlations (i.e., high methylation and high GE; N = 17). However, the direct correlation was more common in patients with a recent smoking history (70.5%). This pattern was also observed for Cis2, which best predicted recurrence in smokers. In cases where methylation and GE were directly correlated, we propose that smoking may alter expression through other effects, e.g., mutagenic effects, rather than solely by epigenetic inactivation through methylation.