In this section, we report the performance of the proposed algorithm using simulated and real tumor datasets. We compared model estimates, matrices of genotypes and proportions, from the proposed algorithm to those obtained from other similar algorithms. In real tumor datasets, similar to the manual approach considered in [27], we hypothesized phylogenetic trees from the estimated matrix of genotypes. Particularly, we assumed that the set of mutations that are grouped together in a tumor subclone comprises of: all the mutations that belong to its ancestors on the tree and the mutations on the edge that connect the subclone to its parent subclone. With this simple rule, we were able to construct the possible phylogenetic trees that are consistent with the estimated matrix of genotypes. For the simulation experiments, we employed a reverse of the above rule to generate binary genotype matrices from phylogenetic trees. Finally, we compared the runtimes of the different algorithms for subclone inference.
Simulated datasets
We generated datasets for average sequencing depth r∈{50,200,1000} per locus, number of tumor subclones C∈{3,4,5}, number of tumor samples S∈{3,4,…,10} and number of genomic loci T∈{20,40,60,80,100,5000}. For a given number of tumor subclones C and number of genomic loci T, we simulated a phylogenetic tree from where the genotype matrix Z is obtained. For the phylogenetic tree simulation, we grouped the T mutations into C subclones uniformly at random. The mutations in each subclone are assumed to first appear in that particular subclone on the tree. One of the subclones is randomly selected as the root node and the rest C−1 subclones are iteratively connected to the tree. Specifically, an unattached subclone (child) and a parent subclone on the tree are randomly selected. The child subclone is attached to the parent subclone and the new set of mutations in the child subclone is a union of the mutations in the parent and the mutations in the child subclone. The mutational profiles of the subclones are the columns of the genotype matrix Z.
Given the genotype matrix, along with specific values of r and S, we generated the input data to the proposed algorithm, i.e., the matrices of variant count Y and total count V. We generated each entry of V, i.e., v_{ts} from Pois(r). We generated each entry of Y, i.e., y_{ts} as follows: sampled each column of the proportion matrix W independently from Dir([a_{0},a_{1},…,a_{4}]) (a_{0}=0.2 and a_{c}, c∈{1,…,4} randomly chosen from the set {2,4,5,6,7,8}), defined p=0.02, computed p_{ts} following (2) in the “Method” section, and sampled y_{ts} from binomial(v_{ts},p_{ts}).
$$ begin{aligned} e_{Z} ,=, frac{1}{TC} sumlimits_{t = 1}^{T} sumlimits_{c = 1}^{C} left| hat{z}_{tc} ,-, z_{tc} right|, hspace{1mm} e_{W} = frac{1}{CS} sumlimits_{c = 0}^{C} sumlimits_{s = 1}^{S} left| hat{w}_{cs} – w_{cs} right|, end{aligned} $$
and
$$ begin{aligned} e_{p_{ts}} ,=, frac{1}{TS} sumlimits_{t= 1}^{T} sumlimits_{s = 1}^{S} left| hat{p}_{ts} ,-, p_{ts} right|, hspace{1mm} text{where} hspace{1mm} hat{p}_{ts} ,=, hat{p} hat{w}_{0s} ,+, sumlimits_{c = 1}^{C} hat{z}_{tc} hat{w}_{cs}. end{aligned} $$
The problem of estimating genotype matrix and proportions matrix is a blind decomposition problem. This implies that after the analysis, we are unaware of the columns of the estimated genotype matrix that correspond to the columns of the true genotype matrix. We resolved this by computing the genotype error with every permutation of the columns of the estimated genotype matrix. We selected the permutation that resulted in the smallest error and we used the selected genotype in computing the other error values. All experiments were performed on Intel(R) Xeon(R) CPU @ 3.5GHz and a 24GB of RAM running a 64-bit Windows 7.
Genotype error (e_{Z}) and proportion error (e_{W}) computed for SeqClone, Clomial, BayClone and Cloe for T=20, C=3, S=5 and r∈{50,200,1000}
50 |
0.035 (0.005) |
0.053 (0.005) |
0.040 (0.005) |
0.071 (0.005) |
0.080 (0.007) |
0.059 (0.006) |
0.065 (0.005) |
0.064 (0.008) |
200 |
0.012 (0.002) |
0.022 (0.002) |
0.025 (0.004) |
0.046 (0.007) |
0.075 (0.009) |
0.062 (0.008) |
0.060 (0.006) |
0.052 (0.003) |
1000 |
0.002 (0.001) |
0.019 (0.002) |
0.020 (0.002) |
0.039 (0.004) |
0.060 (0.004) |
0.038 (0.005) |
0.065 (0.004) |
0.037 (0.004) |
Genotype error (e_{Z}) and proportion error (e_{W}) computed for SeqClone, Clomial, BayClone and Cloe for T=100, C=3, S=5 and r∈{50,200,1000}
50 |
0.030 (0.004) |
0.023 (0.003) |
0.055 (0.007) |
0.094 (0.006) |
0.078 (0.007) |
0.059 (0.006) |
0.041 (0.005) |
0.064 (0.008) |
200 |
0.015 (0.003) |
0.014 (0.001) |
0.050 (0.006) |
0.050 (0.006) |
0.080 (0.006) |
0.061 (0.006) |
0.080 (0.005) |
0.081 (0.004) |
1000 |
0.004 (0.001) |
0.011 (0.001) |
0.045 (0.004) |
0.051 (0.005) |
0.070 (0.006) |
0.055 (0.005) |
0.070 (0.005) |
0.066 (0.005) |
Average and standard deviation of (e_{p_{ts}}), e_{Z} and e_{W} for T∈{100,5000}, S=5, C=3, and r=1000
100 |
0.002 [0.000] |
0.010 [0.002] |
0.014 [0.001] |
5000 |
0.002 [0.001] |
0.004 [0.001] |
0.009 [0.002] |
Runtimes, e_{Z} and e_{W} for T=20, S=5, C=3, and r=1000
e |
0.005 |
0.015 |
0.050 |
0.060 |
e |
0.018 |
0.034 |
0.036 |
0.035 |
In Figs. 1, 2, 3, 4, 5 and 6, for SeqClone, we present the errorbar plots for the average and standard deviation over 20 datasets for different combinations of the number of loci, sample size, number of subclones and average sequencing depth. The standard deviation is the vertical line above and below the average value in the errorbar plots. Figures 1, 2 and 3 show how the errors vary across different sample sizes for different subclones. There is an improvement, for all the subclones, in the estimates of all model parameters when the number of tumor samples increases. Similarly, in Figs. 4, 5 and 6, estimates of model parameters improves when the average sequencing depth increases. In the first row in Table 3, we present, for SeqClone, the result of the permutations of rows of the input data. For the dataset with T = 100,C = 3,r = 1000 and S=5, we ran SeqClone with randomly selected 100 permutations of the rows of the input data matrices and we computed the average and standard deviation of the errors (row one in Table 3). In row two in Table 3, we present results for higher number of genomic loci. In particular, we present the average and standard deviation of errors over 20 runs for the datasets with T=5000, C=3, r=1000 and S=5.
Lastly, we present the runtimes and memory consumption for all the methods when performing a section of the experiments in Table 1. For the proposed algorithm, we ran the algorithm 20 times with 1000 particles. For the MCMC-based algorithms (Cloe and BayClone), we ran 30,000 chains. For Clomial, we ran 2000 iterations. The runtimes for all the methods on a 3.5Ghz Intel 8 cores running MATLAB and the associated genotype and proportion errors for the dataset from T=20, C=3, r=1000, and S=5 are in Table 4. In addition, for this particular dataset, we report the estimated sample mean and sample standard deviation of the relative frequency of variant reads that are produced as error (parameter p in “Method” section) from SeqClone and BayClone. For SeqClone, the mean is 0.019 and the standard deviation is 0.0012. Likewise, for BayClone, the mean is 0.022 and the standard deviation is 0.0011. In Fig. 7, we present the memory consumption by all the algorithms for different genomic loci (T∈{20,40,60,80,100}). In general, Clomial is the most memory efficient of all the algorithms. However, SeqClone consumes lesser memory when compared to BayClone and Cloe.
Real biological tumor samples
Next, we present the results obtained from applying the proposed algorithm to real biological tumor datasets. Particularly, we analyzed the datasets of three patients with B-cell CLL namely: CLL077, CLL006, and CLL003 [42]. Complete datasets and the data pre-processing steps are in [42]. In Additional file 1, we include the analysis results with Clomial, BayClone and Cloe.
CLL077:
CLL077: estimate of genotype matrix/mutational profile
C1 |
1 |
0 |
0 |
0 |
0 |
1 |
0 |
0 |
0 |
0 |
1 |
0 |
0 |
0 |
1 |
1 |
C2 |
1 |
1 |
0 |
0 |
0 |
1 |
1 |
1 |
0 |
1 |
1 |
1 |
0 |
0 |
1 |
1 |
C3 |
1 |
0 |
1 |
1 |
1 |
1 |
0 |
0 |
0 |
0 |
1 |
0 |
1 |
1 |
1 |
1 |
C4 |
1 |
0 |
1 |
1 |
1 |
1 |
0 |
0 |
1 |
0 |
1 |
0 |
1 |
1 |
1 |
1 |
CLL077: estimate of the proportions of subclones in the samples
C0 |
0.00 |
0.00 |
0.00 |
0.05 |
0.35 |
C1 |
0.27 |
0.15 |
0.14 |
0.18 |
0.16 |
C2 |
0.02 |
0.04 |
0.05 |
0.13 |
0.28 |
C3 |
0.35 |
0.29 |
0.41 |
0.30 |
0.12 |
C4 |
0.36 |
0.52 |
0.40 |
0.34 |
0.09 |
CLL006:
CLL006: estimate of genotype matrix/mutational profile
C1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
C2 |
1 |
1 |
0 |
1 |
0 |
1 |
1 |
1 |
0 |
1 |
1 |
C3 |
1 |
1 |
0 |
1 |
0 |
1 |
1 |
1 |
0 |
1 |
0 |
C4 |
1 |
1 |
0 |
1 |
1 |
1 |
1 |
1 |
0 |
1 |
1 |
C5 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
0 |
1 |
1 |
CLL006: estimate of the proportions of subclones in the samples
C0 |
0.00 |
0.00 |
0.00 |
0.00 |
0.00 |
C1 |
0.10 |
0.19 |
0.07 |
0.19 |
0.21 |
C2 |
0.41 |
0.09 |
0.19 |
0.18 |
0.17 |
C3 |
0.23 |
0.24 |
0.30 |
0.16 |
0.08 |
C4 |
0.09 |
0.21 |
0.19 |
0.17 |
0.27 |
C5 |
0.17 |
0.27 |
0.25 |
0.30 |
0.27 |
CLL003:
CLL003: estimate of genotype matrix/mutational profile
C1 |
1 |
0 |
0 |
1 |
0 |
0 |
1 |
1 |
1 |
1 |
1 |
0 |
1 |
1 |
1 |
0 |
1 |
1 |
1 |
0 |
C2 |
1 |
1 |
1 |
0 |
1 |
1 |
0 |
1 |
0 |
1 |
1 |
1 |
0 |
0 |
0 |
1 |
0 |
1 |
1 |
1 |
CLL003: estimate of the proportions of subclones in the samples
C0 |
0.00 |
0.00 |
0.35 |
0.00 |
0.01 |
C1 |
0.08 |
0.05 |
0.53 |
0.99 |
0.98 |
C2 |
0.92 |
0.95 |
0.12 |
0.01 |
0.01 |
Runtimes and memory consumption for simulated and real biological dataset
Simulated data |
55 (20.48) |
53 (18.50) |
93 (80.52) |
101 (75.20) |
CLL003 |
57 (20.60) |
54 (18.80) |
98 (81.00) |
102 (75.50) |