To perform our tests, we downloaded expression data from the GTEx consortium [13, 15] (http://www.gtexportal.org). We chose the “Whole blood” data set (with data for 23,152 genes), as it contains a large number of measurements per gene (338), which allows us to use the full data set as a reasonably accurate reference point. As an auxiliary test set for further confirmation, we also downloaded a separate data set, obtained from mouse brains, from the Gene Expression Omnibus [20] under the identifier GSE26500 [21]. This set was chosen as an independent test relative to our human set, while exhibiting a comparable number of genes (25,698) and a reasonably large number of samples (198) with which to evaluate the reference co-expression.

Based on these gene-expression data sets, we simulate three levels of low-quality data by sampling nested sets of 10, 20, or 50 measurements (hereafter referred to as points). The sampling was performed by randomizing the order of samples in the data sets and then picking the first 10, 20, or 50 points (according to the randomized order) for each gene, so that all 10 points in the lowest-quality set are also contained in the 20-point set, which in turn is entirely contained in the 50-point set. The rationale for this nested sampling approach is to evaluate the performance of wTO on a sliding scale of increasing data quality, in order to identify a potential break-even point at which it either begins or ceases to outperform the base correlation. From here on, we use the term “quality” specifically to denote the number of points in a set.
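The nested subsampling described above can be sketched as follows. This is a minimal illustration in Python with numpy, using a random stand-in for the expression matrix; the array shapes and variable names are our own and not taken from the original analysis pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a genes-x-samples expression matrix
# (the real GTEx set has 23,152 genes and 338 samples).
n_genes, n_samples = 100, 338
expr = rng.normal(size=(n_genes, n_samples))

# Randomize the sample order once, then take prefixes of that order,
# so the 10-point set is contained in the 20-point set, and so on.
order = rng.permutation(n_samples)
subsets = {q: expr[:, order[:q]] for q in (10, 20, 50)}

# The nesting holds by construction:
assert np.array_equal(subsets[10], subsets[20][:, :10])
assert np.array_equal(subsets[20], subsets[50][:, :20])
```

Because the prefixes are taken from a single shuffled order, each quality level is guaranteed to be a superset of the previous one.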

For a given set of *N* genes, the computation time of wTO is longer than that of a simple correlation analysis: *O*(*N*^{3}) as opposed to *O*(*N*^{2}). Consequently, to reduce the running time of our investigation to a manageable duration, we reduce the number of genes in the simulated low-quality sets (and the corresponding reference sets) to 1000 (out of the 23,152 available) per set. This reduction in gene-set size also allows us to create an ensemble of 20 separate groups (with no shared genes between them) of nested low-quality sets, with corresponding reference sets, in order to account for potential variability in the predictive performance of the low-quality sets. In a procedure similar to the nested sampling approach for reducing the number of data points per gene, we generated the 20 1000-gene data sets by randomizing the order of genes in our original set and dividing it into 20 consecutive sets (the remaining 3152 genes were omitted from the study). This process is equivalent to a random draw of 20 separate 1000-gene sets without replacement.
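The partitioning into disjoint 1000-gene groups amounts to shuffling the gene indices once and cutting the shuffled order into consecutive blocks. A minimal sketch (variable names are illustrative, not from the original code):

```python
import numpy as np

rng = np.random.default_rng(1)

n_genes, group_size, n_groups = 23152, 1000, 20

# Shuffle gene indices, then cut the shuffled order into 20
# consecutive blocks of 1000 genes each; the remaining 3152
# genes are discarded.  This is equivalent to drawing 20
# disjoint 1000-gene sets without replacement.
perm = rng.permutation(n_genes)
groups = [perm[i * group_size:(i + 1) * group_size] for i in range(n_groups)]

# The groups are pairwise disjoint and each holds 1000 genes.
all_idx = np.concatenate(groups)
assert len(all_idx) == len(set(all_idx.tolist())) == 20000
```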

We compute the wTO score for each pair of genes (*i*, *j*):

$$ wTO(i, j) = \frac{w_{ij} + \sum\nolimits_{k \neq i, j} w_{ik} w_{kj}}{\min\left(\sum\nolimits_{k} w_{ik}, \sum\nolimits_{k} w_{jk}\right) + 1 - w_{ij}}, $$

(1)

where *w*_{ab} denotes the absolute value of the correlation score between genes *a* and *b*.
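Equation (1) can be evaluated for all gene pairs at once from the correlation matrix, since the sum over *k* is a matrix product. The sketch below is our own vectorized implementation, not the authors' code; it assumes the diagonal (self-correlation) is excluded from the sums, which is the common convention for topological overlap.

```python
import numpy as np

def wto_matrix(corr):
    """Weighted topological overlap (Eq. 1) from a correlation matrix."""
    w = np.abs(corr)                  # w_ab = |correlation(a, b)|
    np.fill_diagonal(w, 0.0)          # exclude self-links from the sums
    # prod[i, j] = sum_k w_ik * w_kj; with a zeroed diagonal the
    # k = i and k = j terms vanish, so this equals the sum over k != i, j.
    prod = w @ w
    deg = w.sum(axis=1)               # node strengths sum_k w_ik
    denom = np.minimum.outer(deg, deg) + 1.0 - w
    return (w + prod) / denom

# Example: wTO from a random expression matrix (50 genes, 30 samples).
rng = np.random.default_rng(2)
corr = np.corrcoef(rng.normal(size=(50, 30)))
t = wto_matrix(corr)
assert t.shape == (50, 50) and np.allclose(t, t.T)
```

Note that the symmetry of the correlation matrix carries over to the wTO matrix, since every term in Eq. (1) is symmetric in *i* and *j*.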

We then evaluate the performance of the measures on the low-quality sets against the reference sets, using two rank-based tests: (A) the Jaccard index of the top 1000 pairs in each low-quality set as compared to the reference set, and (B) the Spearman correlation between the co-expression/wTO scores in the low-quality sets and the reference sets.
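Both tests operate on the upper triangle of a score matrix (each gene pair counted once). A minimal sketch of the two comparisons, with our own function names and a simple rank transform that does not average ties:

```python
import numpy as np

def top_pairs(score, n_top):
    # Indices of the n_top highest-scoring gene pairs (upper triangle).
    iu, ju = np.triu_indices_from(score, k=1)
    flat = score[iu, ju]
    keep = np.argsort(flat)[::-1][:n_top]
    return set(zip(iu[keep].tolist(), ju[keep].tolist()))

def jaccard_top(a, b, n_top=1000):
    # (A) Jaccard index of the top n_top pairs in two score matrices.
    sa, sb = top_pairs(a, n_top), top_pairs(b, n_top)
    return len(sa & sb) / len(sa | sb)

def spearman_upper(a, b):
    # (B) Spearman correlation of the upper-triangle scores
    # (simple rank transform; ties are not averaged here).
    iu, ju = np.triu_indices_from(a, k=1)
    rx = np.argsort(np.argsort(a[iu, ju]))
    ry = np.argsort(np.argsort(b[iu, ju]))
    return float(np.corrcoef(rx, ry)[0, 1])

# Sanity check: a matrix compared against itself scores perfectly.
rng = np.random.default_rng(3)
m = rng.normal(size=(20, 20))
m = (m + m.T) / 2
assert jaccard_top(m, m, n_top=10) == 1.0
assert abs(spearman_upper(m, m) - 1.0) < 1e-12
```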

Our motivation for the choice of rank-based tests is as follows: for a given threshold, the proportion of false positives within a given selection is essentially determined by the proportion of factually uncorrelated genes above a given rank in the data set. Any perturbation (e.g. due to differences in methodology or sample size) of the co-expression matrix that does not alter the order of pairs could, in principle, be counteracted by an appropriate change in the cut-off, yielding the desired reference network. On the other hand, if spuriously correlated gene pairs (low rank in the reference set) suddenly exhibit larger edge weights (i.e., higher rank in the low-quality set) than the factually correlated genes, false positives become inevitable. Identifying changes in the ranking of pairs between sets is therefore critical. The Spearman test is intended to capture such changes across the entire gene set, while the Jaccard test focuses on the most strongly co-expressed pairs, which are usually of main interest for a biological study.

To test the impact of wTO usage on network structure, we generated unweighted networks from each of the complete matrices, using the 1000-gene sets at all of the chosen data-quality levels, by retaining all edges above a given cut-off. Determining an appropriate cut-off threshold for building networks from correlation matrices is a major topic in gene co-expression analysis in particular and in network science in general, and a variety of criteria have been proposed for this purpose. One might, for instance, set the cut-off to correspond to a desired *p*-value under the null hypothesis that the data are independent. Other proposed methods, specifically suggesting bicor, Spearman, or Pearson correlation as the co-expression metric, involve choosing a cut-off such that the network follows a scale-free topology [19], or setting the cut-off near abrupt transitions in the nearest-neighbour spacing distribution [25]. A detailed investigation into the choice of cut-off and its impact is, however, outside the scope of this paper. To build networks from our complete correlation and wTO matrices, we therefore opted for the straightforward approach of selecting the top 0.2% of gene pairs (from the approximately 500,000 pairs per set). Community detection on these networks was conducted using the Louvain algorithm [26].
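The top-0.2% thresholding step can be sketched as below. This is our own illustration (function and variable names are hypothetical); if scores tie exactly at the cut-off, a few extra edges may be retained.

```python
import numpy as np

def threshold_network(score, frac=0.002):
    # Boolean adjacency matrix keeping the top `frac` of gene pairs.
    iu, ju = np.triu_indices_from(score, k=1)
    flat = score[iu, ju]
    n_keep = max(1, int(round(frac * flat.size)))
    cut = np.sort(flat)[-n_keep]      # smallest retained score
    mask = flat >= cut
    adj = np.zeros(score.shape, dtype=bool)
    adj[iu[mask], ju[mask]] = True
    return adj | adj.T                # symmetrize: undirected network

# Example: 100 genes -> 4950 pairs; 0.2% keeps about 10 edges.
rng = np.random.default_rng(4)
m = rng.normal(size=(100, 100))
m = (m + m.T) / 2
adj = threshold_network(m)
assert adj.sum() == 20                # 10 undirected edges, stored twice
```

Community detection on such an adjacency matrix could then be run with, for example, networkx's `louvain_communities`, matching the Louvain algorithm [26] used here; the paper does not specify which implementation was used.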