SeqClone: sequential Monte Carlo based inference of tumor subclones

Bioinformatics

In this section, we report the performance of the proposed algorithm using simulated and real tumor datasets. We compared model estimates, matrices of genotypes and proportions, from the proposed algorithm to those obtained from other similar algorithms. In real tumor datasets, similar to the manual approach considered in [27], we hypothesized phylogenetic trees from the estimated matrix of genotypes. Particularly, we assumed that the set of mutations that are grouped together in a tumor subclone comprises of: all the mutations that belong to its ancestors on the tree and the mutations on the edge that connect the subclone to its parent subclone. With this simple rule, we were able to construct the possible phylogenetic trees that are consistent with the estimated matrix of genotypes. For the simulation experiments, we employed a reverse of the above rule to generate binary genotype matrices from phylogenetic trees. Finally, we compared the runtimes of the different algorithms for subclone inference.

Simulated datasets

We generated datasets for average sequencing depth r{50,200,1000} per locus, number of tumor subclones C{3,4,5}, number of tumor samples S{3,4,…,10} and number of genomic loci T{20,40,60,80,100,5000}. For a given number of tumor subclones C and number of genomic loci T, we simulated a phylogenetic tree from where the genotype matrix Z is obtained. For the phylogenetic tree simulation, we grouped the T mutations into C subclones uniformly at random. The mutations in each subclone are assumed to first appear in that particular subclone on the tree. One of the subclones is randomly selected as the root node and the rest C−1 subclones are iteratively connected to the tree. Specifically, an unattached subclone (child) and a parent subclone on the tree are randomly selected. The child subclone is attached to the parent subclone and the new set of mutations in the child subclone is a union of the mutations in the parent and the mutations in the child subclone. The mutational profiles of the subclones are the columns of the genotype matrix Z.

Given the genotype matrix, along with specific values of r and S, we generated the input data to the proposed algorithm, i.e., the matrices of variant count Y and total count V. We generated each entry of V, i.e., vts from Pois(r). We generated each entry of Y, i.e., yts as follows: sampled each column of the proportion matrix W independently from Dir([a0,a1,…,a4]) (a0=0.2 and ac, c{1,…,4} randomly chosen from the set {2,4,5,6,7,8}), defined p=0.02, computed pts following (2) in the “Method” section, and sampled yts from binomial(vts,pts).

The proposed algorithm, Clomial [27], BayClone [28], and Cloe [26] were run on the simulated datasets. We defined the following metrics to quantify the estimation strength of the algorithms: genotype error (eZ), proportion error (eW) and success probabilities error ((e_{p_{ts}})). Mathematically, these errors are defined as

$$ begin{aligned} e_{Z} ,=, frac{1}{TC} sumlimits_{t = 1}^{T} sumlimits_{c = 1}^{C} left| hat{z}_{tc} ,-, z_{tc} right|, hspace{1mm} e_{W} = frac{1}{CS} sumlimits_{c = 0}^{C} sumlimits_{s = 1}^{S} left| hat{w}_{cs} – w_{cs} right|, end{aligned} $$

and

$$ begin{aligned} e_{p_{ts}} ,=, frac{1}{TS} sumlimits_{t= 1}^{T} sumlimits_{s = 1}^{S} left| hat{p}_{ts} ,-, p_{ts} right|, hspace{1mm} text{where} hspace{1mm} hat{p}_{ts} ,=, hat{p} hat{w}_{0s} ,+, sumlimits_{c = 1}^{C} hat{z}_{tc} hat{w}_{cs}. end{aligned} $$

The problem of estimating genotype matrix and proportions matrix is a blind decomposition problem. This implies that after the analysis, we are unaware of the columns of the estimated genotype matrix that correspond to the columns of the true genotype matrix. We resolved this by computing the genotype error with every permutation of the columns of the estimated genotype matrix. We selected the permutation that resulted in the smallest error and we used the selected genotype in computing the other error values. All experiments were performed on Intel(R) Xeon(R) CPU @ 3.5GHz and a 24GB of RAM running a 64-bit Windows 7.

In Tables 1, 2, 3 and 4 and Figs. 1, 2, 3, 4, 5, 6 and 7, we present the results obtained from analyses of simulated datasets. To compare the methods, we generated 20 datasets for every combination of number of genomic loci T, number of tumor subclones C, number of tumor samples S and average sequencing depth r. We computed the average and standard deviation of genotype error eZ and proportion error eW over all the 20 datasets. In Table 1, we present the average and standard deviation (in round parentheses) of the genotype and the proportion errors for all the methods when the number of loci T=20, number of subclones C=3, number of samples S=5 and average sequencing depth r{50,200,1000}. We excluded success probabilities error ((e_{p_{ts}})) because not all the algorithms return an estimate of p in (2) (“Method” section). Similarly, in Table 2, we show, for all the methods, the average and the standard deviation of genotype and proportion errors when T=100, C=3, S=5 and r{50,200,1000}. The proposed algorithm demonstrates a comparable and sometimes, superior performance in terms of the accuracy of the estimated genotype and proportion matrices. It should be noted that, for BayClone, the ones in the true binary genotype matrices were changed to 0.5 before the simulation and the entries of the estimated genotype matrices greater than 0 were changed to ones before computing the errors.

Fig. 1

Plot of genotype error eZ versus sample size S for T=20 loci, average sequencing depth r=1000 and C{3,4,5} subclones

Fig. 2

Plot of proportion error eW versus sample size S for T=20 loci, average sequencing depth r=1000 and C{3,4,5} subclones

Fig. 3

Plot of the error of success probability (e_{p_{ts}}) versus sample size S for T=20 loci, average sequencing depth r=1000 and C{3,4,5} subclones

Fig. 4

Plot of genotype error eZ versus sample size S for T=20 loci, C=3 subclones and average sequencing depth r{50,200,1000}

Fig. 5

Plot of proportion error eW versus sample size S for T=20 loci, C=3 subclones and average sequencing depth r{50,200,1000}

Fig. 6

Plot of the error of success probability (e_{p_{ts}}) versus sample size S for T=20 loci, C=3 subclones and average sequencing depth r{50,200,1000}

Fig. 7

Plot of consumed memory versus number of genomic loci T{20,40,60,80,100}

Table 1

Genotype error (eZ) and proportion error (eW) computed for SeqClone, Clomial, BayClone and Cloe for T=20, C=3, S=5 and r{50,200,1000}

50

0.035 (0.005)

0.053 (0.005)

0.040 (0.005)

0.071 (0.005)

0.080 (0.007)

0.059 (0.006)

0.065 (0.005)

0.064 (0.008)

200

0.012 (0.002)

0.022 (0.002)

0.025 (0.004)

0.046 (0.007)

0.075 (0.009)

0.062 (0.008)

0.060 (0.006)

0.052 (0.003)

1000

0.002 (0.001)

0.019 (0.002)

0.020 (0.002)

0.039 (0.004)

0.060 (0.004)

0.038 (0.005)

0.065 (0.004)

0.037 (0.004)

Table 2

Genotype error (eZ) and proportion error (eW) computed for SeqClone, Clomial, BayClone and Cloe for T=100, C=3, S=5 and r{50,200,1000}

50

0.030 (0.004)

0.023 (0.003)

0.055 (0.007)

0.094 (0.006)

0.078 (0.007)

0.059 (0.006)

0.041 (0.005)

0.064 (0.008)

200

0.015 (0.003)

0.014 (0.001)

0.050 (0.006)

0.050 (0.006)

0.080 (0.006)

0.061 (0.006)

0.080 (0.005)

0.081 (0.004)

1000

0.004 (0.001)

0.011 (0.001)

0.045 (0.004)

0.051 (0.005)

0.070 (0.006)

0.055 (0.005)

0.070 (0.005)

0.066 (0.005)

Table 3

Average and standard deviation of (e_{p_{ts}}), eZ and eW for T{100,5000}, S=5, C=3, and r=1000

100

0.002 [0.000]

0.010 [0.002]

0.014 [0.001]

5000

0.002 [0.001]

0.004 [0.001]

0.009 [0.002]

Table 4

Runtimes, eZ and eW for T=20, S=5, C=3, and r=1000

e

Z

0.005

0.015

0.050

0.060

e

W

0.018

0.034

0.036

0.035

In Figs. 1, 2, 3, 4, 5 and 6, for SeqClone, we present the errorbar plots for the average and standard deviation over 20 datasets for different combinations of the number of loci, sample size, number of subclones and average sequencing depth. The standard deviation is the vertical line above and below the average value in the errorbar plots. Figures 1, 2 and 3 show how the errors vary across different sample sizes for different subclones. There is an improvement, for all the subclones, in the estimates of all model parameters when the number of tumor samples increases. Similarly, in Figs. 4, 5 and 6, estimates of model parameters improves when the average sequencing depth increases. In the first row in Table 3, we present, for SeqClone, the result of the permutations of rows of the input data. For the dataset with T = 100,C = 3,r = 1000 and S=5, we ran SeqClone with randomly selected 100 permutations of the rows of the input data matrices and we computed the average and standard deviation of the errors (row one in Table 3). In row two in Table 3, we present results for higher number of genomic loci. In particular, we present the average and standard deviation of errors over 20 runs for the datasets with T=5000, C=3, r=1000 and S=5.

Lastly, we present the runtimes and memory consumption for all the methods when performing a section of the experiments in Table 1. For the proposed algorithm, we ran the algorithm 20 times with 1000 particles. For the MCMC-based algorithms (Cloe and BayClone), we ran 30,000 chains. For Clomial, we ran 2000 iterations. The runtimes for all the methods on a 3.5Ghz Intel 8 cores running MATLAB and the associated genotype and proportion errors for the dataset from T=20, C=3, r=1000, and S=5 are in Table 4. In addition, for this particular dataset, we report the estimated sample mean and sample standard deviation of the relative frequency of variant reads that are produced as error (parameter p in “Method” section) from SeqClone and BayClone. For SeqClone, the mean is 0.019 and the standard deviation is 0.0012. Likewise, for BayClone, the mean is 0.022 and the standard deviation is 0.0011. In Fig. 7, we present the memory consumption by all the algorithms for different genomic loci (T{20,40,60,80,100}). In general, Clomial is the most memory efficient of all the algorithms. However, SeqClone consumes lesser memory when compared to BayClone and Cloe.

Real biological tumor samples

Next, we present the results obtained from applying the proposed algorithm to real biological tumor datasets. Particularly, we analyzed the datasets of three patients with B-cell CLL namely: CLL077, CLL006, and CLL003 [42]. Complete datasets and the data pre-processing steps are in [42]. In Additional file 1, we include the analysis results with Clomial, BayClone and Cloe.

CLL077:

Here, we present the results obtained from analyzing the dataset from patient CLL077 with SeqClone. This dataset had 16 distinct loci probed for tumor heterogeneity. These are shown in the first row in Table 5. We present our analysis results in the main paper, and the estimates for other methods in Additional file 1. In concordance with other methods, SeqClone estimated 4 subclones as shown in Table 5, and also produced SNV profiles that are similar to those obtained from the three other methods. Also, the proportions of tumor subclones exhibit similar trend in all the 5 tumor samples across various methods. For instance, the abundance of sublone 1 in sample ‘a’ in Clomial, BayClone, Cloe and SeqClone are 0.27,0.21,0.16 and 0.27, respectively. This trend continues in all other samples except in sample ‘e’ where Clomial deviates from this normal trend, i.e., Clomial, BayClone, Cloe and SeqClone are 0.43,0.07,0.03, and 0.16, respectively. On this dataset, SeqClone produced a consistent result with other methods in estimating the SNV profiles of subclones and their proportions in all the samples (Tables 5, 6 and in Additional file 1). The constructed phylogenetic tree from the SNV profiles for CLL077 is shown Fig. 8.

Fig. 8

Phylogenetic tree from CLL077

Table 5

CLL077: estimate of genotype matrix/mutational profile

C1

1

0

0

0

0

1

0

0

0

0

1

0

0

0

1

1

C2

1

1

0

0

0

1

1

1

0

1

1

1

0

0

1

1

C3

1

0

1

1

1

1

0

0

0

0

1

0

1

1

1

1

C4

1

0

1

1

1

1

0

0

1

0

1

0

1

1

1

1

Table 6

CLL077: estimate of the proportions of subclones in the samples

C0

0.00

0.00

0.00

0.05

0.35

C1

0.27

0.15

0.14

0.18

0.16

C2

0.02

0.04

0.05

0.13

0.28

C3

0.35

0.29

0.41

0.30

0.12

C4

0.36

0.52

0.40

0.34

0.09

CLL006:

This dataset comprises of 11 genomic loci. These are shown in the first row in Table 7. We analyzed the dataset with SeqClone, and the estimates of genotype and proportions matrices are in Tables 7 and 8. The constructed phylogenetic tree is shown Fig. 9. SeqClone and BayClone estimated 5 distinct subclones, Clomial had 4 subclones and Cloe recovered 6 subclones. Details of the estimates from Clomial, BayClone and Cloe are in Additional file 1.

Fig. 9

Phylogenetic tree from CLL006

Table 7

CLL006: estimate of genotype matrix/mutational profile

C1

1

1

1

1

1

1

1

1

1

1

1

C2

1

1

0

1

0

1

1

1

0

1

1

C3

1

1

0

1

0

1

1

1

0

1

0

C4

1

1

0

1

1

1

1

1

0

1

1

C5

1

1

1

1

1

1

1

1

0

1

1

Table 8

CLL006: estimate of the proportions of subclones in the samples

C0

0.00

0.00

0.00

0.00

0.00

C1

0.10

0.19

0.07

0.19

0.21

C2

0.41

0.09

0.19

0.18

0.17

C3

0.23

0.24

0.30

0.16

0.08

C4

0.09

0.21

0.19

0.17

0.27

C5

0.17

0.27

0.25

0.30

0.27

CLL003:

The dataset from patient CLL003 has 20 distinct genomic loci. This is shown in the first row in Table 9. In this dataset, Clomial and Cloe produced 2 distinct subclones with considerably high proportions in the samples and 2 others with very small proportions across all samples. SeqClone and BayClone estimated the first 2 major subclones that dominate the 5 samples with proportions shown in Table 10 (and Additional file 1: Table S6). The constructed phylogenetic tree for CLL003 is shown in Fig. 10.

Fig. 10

Phylogenetic tree from CLL003

Table 9

CLL003: estimate of genotype matrix/mutational profile

C1

1

0

0

1

0

0

1

1

1

1

1

0

1

1

1

0

1

1

1

0

C2

1

1

1

0

1

1

0

1

0

1

1

1

0

0

0

1

0

1

1

1

Table 10

CLL003: estimate of the proportions of subclones in the samples

C0

0.00

0.00

0.35

0.00

0.01

C1

0.08

0.05

0.53

0.99

0.98

C2

0.92

0.95

0.12

0.01

0.01

Finally, we investigated the behavior of the algorithms in terms of runtime and memory consumption, when applied to simulated and real datasets of similar size: T=20 and S=5. We present the results in Table 11. Runtimes (without parentheses) are in minutes and the consumed memory (in round parentheses) are in MB.

Table 11

Runtimes and memory consumption for simulated and real biological dataset

Simulated data

55 (20.48)

53 (18.50)

93 (80.52)

101 (75.20)

CLL003

57 (20.60)

54 (18.80)

98 (81.00)

102 (75.50)

Articles You May Like

Human ESC-Derived Chimeric Mouse Models of Huntington’s Disease Reveal Cell-Intrinsic Defects in Glial Progenitor Cell Differentiation
Teenage Diver Finds Tons Of Golf Balls Rotting Off California
Erythema Gyratum Repens Associated with Anal Cancer
Kinetic analysis of multistep USP7 mechanism shows critical role for target protein in activity
Thermodynamic model of gene regulation for the Or59b olfactory receptor in <i>Drosophila</i>

Leave a Reply

Your email address will not be published. Required fields are marked *