The abundance of cheap and accurate genomic technologies has led to a multitude of approaches for classifying breast cancer patients into prognostic groups based on their transcriptomic profiles. These risk classifications are clinically useful because they allow aggressive treatment to be targeted to the patients most vulnerable to tumour recurrence or mortality, while sparing those at lower risk the associated side-effects. Prognostic signatures, also referred to as biomarkers, are a popular class of tools for converting mRNA abundance data into patient risk scores that can serve as proxies for clinical outcome. Several particularly proficient biomarkers for breast cancer have already proven to be commercially viable [1–3].
A biomarker generally consists of two parts: a gene set chosen for its association with prognosis using a supervised or unsupervised feature selection algorithm, and a risk score model that transforms the mRNA abundance levels of these genes into risk scores for a given patient cohort. Many biomarkers developed for breast cancer rely on complex algorithms for feature selection and risk score calculation [4–9]. However, it has been demonstrated that gene sets selected for prognostic ability using such methods often fail to outperform randomly chosen gene sets of the same size [10, 11]. This is concordant with previous findings that large numbers of non-overlapping gene sets are associated with breast cancer prognosis [12, 13], which appears to be a general characteristic of multiple tumour types [14, 15]. If the background or “null” level of gene set prognostic performance is relatively high, it is more challenging for feature selection algorithms to identify groups of genes that perform significantly better than random chance. The fundamental properties of multi-gene biomarkers must therefore be fully elucidated in order to identify optimal biomarkers, which is difficult to do even when the size of the gene set is fixed in advance.
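To make the comparison against random gene sets concrete, the following is a minimal sketch of the idea and not the procedure used in any of the cited studies. It assumes a mean-abundance risk score and a caller-supplied `performance` function (for example, a concordance statistic computed on a patient cohort); the names `mean_risk_score`, `empirical_p_value`, `performance`, `n_random` and `seed` are hypothetical and introduced here only for illustration.

```python
import random


def mean_risk_score(profile, gene_set):
    """Risk score for one patient: mean mRNA abundance over the gene set.

    profile: dict mapping gene name -> abundance for a single patient.
    This simple averaging model is an assumption made for illustration.
    """
    return sum(profile[gene] for gene in gene_set) / len(gene_set)


def empirical_p_value(candidate, all_genes, performance, n_random=1000, seed=0):
    """Estimate how often a random gene set of the same size performs at
    least as well as the candidate gene set.

    performance: hypothetical callable mapping a gene set to a score where
    higher means better prognostic performance.
    """
    rng = random.Random(seed)
    observed = performance(candidate)
    hits = 0
    for _ in range(n_random):
        # Draw a random gene set matched in size to the candidate.
        random_set = rng.sample(all_genes, len(candidate))
        if performance(random_set) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_random + 1)
```

A high background level of performance shows up in this sketch as a large empirical p-value: many random gene sets match or exceed the candidate, so the candidate cannot be distinguished from chance.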
Haibe-Kains et al. cast further doubt on the usefulness of multi-gene biomarkers for breast cancer outcome prognosis by suggesting that they do not consistently outperform the simplest possible model: one based on a single, well-chosen gene. They found that a simple biomarker that dichotomizes patients based on the expression of the gene AURKA fared roughly as well as much more complex methods that attempt to leverage the mRNA abundance data of many genes or even the whole genome. Haibe-Kains et al. thus put forward a “single-gene hypothesis”: there is little marginal utility in implementing multi-gene prognostic biomarkers with complex feature selection and patient risk scoring components in breast cancer. Given the other difficulties inherent in using multi-gene biomarkers outlined above, it would seem that single-gene biomarkers based on biological intuition confer an advantage in interpretability and ease of discovery without compromising prognostic performance.
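A single-gene biomarker of this kind can be sketched in a few lines. The example below is illustrative only: it assumes a median cutoff, which the text above does not specify, and the patient identifiers and abundance values are invented. The function name `dichotomize_by_gene` is hypothetical.

```python
from statistics import median


def dichotomize_by_gene(expression, cutoff=None):
    """Split patients into 'high' and 'low' groups using one gene's abundance.

    expression: dict mapping patient ID -> mRNA abundance of the gene.
    cutoff: threshold; defaults to the cohort median (an assumption).
    """
    if cutoff is None:
        cutoff = median(expression.values())
    return {
        patient: ("high" if value > cutoff else "low")
        for patient, value in expression.items()
    }


# Invented AURKA abundances for four hypothetical patients.
aurka = {"P1": 5.2, "P2": 7.9, "P3": 6.1, "P4": 8.4}
groups = dichotomize_by_gene(aurka)
```

The resulting two groups would then be compared on clinical outcome, for instance with a survival analysis, to assess how well the single gene separates patients by prognosis.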
However, in the decade since the single-gene hypothesis was formulated, there have been many advances both in genomics technology itself and in the analytical techniques used for biomarker development. The drop in the cost of measuring mRNA abundance has led to the availability of a much greater number of datasets. One of the largest of these is the Metabric dataset, comprising nearly two thousand patients. The improved fidelity of transcriptomics technologies has also led to lower levels of noise in newer expression datasets. Furthermore, bioinformaticians are continually finding new and improved ways of applying insights from the field of machine learning to biomarker development.
Given these new developments, we revisited the single-gene hypothesis, testing it on a meta-dataset of 4960 breast cancer patient expression profiles. We tested the prognostic performance of single-gene models, including AURKA, and two different types of multi-gene models on a deep sample of the biomarker population. This approach allowed us to draw generalizable insights into the relative performance of multi-gene and single-gene transcriptomic biomarkers for predicting breast cancer patient prognosis.