System overview
The ToolSeg tool provides functionality for many tasks typically encountered in copy number analysis: data preprocessing, a range of segmentation algorithms, and visualization tools. The main workflow includes: 1) reading and filtering of raw sample data; 2) segmentation of allele-specific SNP array data; and 3) visualization of results. The input consists of copy number measurements from single or paired SNP array or HTS experiments. Extreme observations (outliers) normally need to be detected and appropriately modified or filtered prior to segmentation. Here, the median filtering algorithm [17] is used in the ToolSeg toolbox to preprocess the original input measurements. The DBS method is based on the Central Limit Theorem of probability theory, finding breakpoints between observation segments with well-defined expected means and variances. In DBS, the segmentation curves are generated by recursive splits at the previously found breakpoints. A set of graphical tools is also available in the toolbox to visualize the raw data and segmentation results and to compare six different segmentation algorithms in a statistically rigorous way.
Input data and preprocessing
ToolSeg requires the raw signals from high-throughput samples to be organized as a one-dimensional vector and stored as a .txt file. Detailed descriptions of the software are included in the Supplementary Material.
Before performing copy number change detection and segmentation, a challenging factor in copy number analysis must be addressed: the frequent occurrence of outliers – single probe values that differ markedly from their neighbors. Generally, such extreme observations can be due to the presence of very short segments of DNA with deviant copy numbers, to technical aberrations, or to a combination of both. Such extreme observations have a potentially harmful effect when the focus is on the detection of broader aberrations [17, 18]. In ToolSeg, a classical limit filter, Winsorization, is applied to reduce such noise; this is a typical preprocessing step that truncates extreme values in the data to reduce the effect of possible spurious outliers.
$$ f(x)=\left\{\begin{array}{ll}\widehat{\mu}-\tau\widehat{\sigma}, & x<\widehat{\mu}-\tau\widehat{\sigma}\\ \widehat{\mu}+\tau\widehat{\sigma}, & x>\widehat{\mu}+\tau\widehat{\sigma}\\ x, & \mathrm{otherwise}\end{array}\right. $$
(1)
where τ ∈ [1.5, 3] (default 2.5 in ToolSeg). Often, such simple and fast Winsorization is sufficient, as discussed in [12].
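As a concrete illustration, Eq. (1) can be sketched in a few lines of Python. This is a minimal example, not ToolSeg's actual implementation; the function name and the use of NumPy are our own choices:

```python
import numpy as np

def winsorize(x, tau=2.5):
    """Clamp observations farther than tau estimated standard deviations
    from the estimated mean, as in Eq. (1)."""
    x = np.asarray(x, dtype=float)
    mu_hat, sigma_hat = x.mean(), x.std(ddof=1)
    return np.clip(x, mu_hat - tau * sigma_hat, mu_hat + tau * sigma_hat)
```

A single extreme probe value is pulled back toward the bulk of the data, while typical observations pass through unchanged.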
Binary segmentation
Now, we discuss the basic problem of obtaining an individual segmentation for one chromosome arm in one sample. The aim of copy number change detection and segmentation is to divide a chromosome into a few contiguous segments, within each of which the copy number is considered constant.
$$ x_i=y_i+\varepsilon_i $$
where y_{i} is the unknown actual “true” copy number at the i’th locus and ε_{i} represents measurement noise, which is assumed independent and identically distributed (i.i.d.) with mean zero. A breakpoint is said to occur between probes i and i + 1 if y_{i} ≠ y_{i + 1}, i ∈ (1, n). The sequence y_{0}, …, y_{K} thus implies a segmentation with a breakpoint set {b_{1}, …, b_{K}}, where b_{1} is the first breakpoint: the probes of the first subsegment lie before b_{1}, the second subsegment lies between b_{1} and the second breakpoint b_{2}, and so on. Thus, we formulate copy number change detection as the problem of detecting the breakpoints in copy number data.
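For intuition, the model above can be simulated directly. The segment levels, breakpoint position, and noise scale below are hypothetical, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" profile y: two constant segments with one breakpoint at i = 100
y = np.concatenate([np.full(100, 0.0), np.full(100, 0.8)])

# Observed signal x_i = y_i + eps_i, with i.i.d. zero-mean noise
x = y + rng.normal(0.0, 0.2, size=y.size)
```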
Consider first the simplest case, in which there is only one segment, i.e., no copy number change on the chromosome in the sample. Given copy number signals of length n on the chromosome, x_{1}, …, x_{n}, let each x_{i} be an observation produced by an i.i.d. random variable drawn from a distribution with expected value \( \widehat{\mu} \) and finite variance \( {\widehat{\sigma}}^2 \). We then consider the standardized sums
$$ \widehat{Z}_{i,j}=\frac{\sum_{k=i}^{j-1}\left(x_k-\widehat{\mu}\right)}{\widehat{\sigma}\sqrt{j-i}},\quad 1\le i<j\le n+1, $$
(2)
where \( \widehat{\mu}=\frac{1}{j-i}\sum_{k=i}^{j-1}x_k \) is the arithmetic mean between point i and point j (not including j), and \( \widehat{\sigma} \) is the estimated standard deviation of x_{i}, \( \widehat{\sigma}=\sqrt{\frac{1}{j-i-1}\sum_{k=i}^{j-1}\left(x_k-\widehat{\mu}\right)^2} \), which will be discussed later. Furthermore, we define the test statistic
$$ \widehat{Z}=\max_{1\le i<j\le n+1,\; j-i>n_0}\left|\widehat{Z}_{i,j}\right| $$
(3)
where n_{0} is a predetermined parameter specifying the minimum length of a CNA.
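The statistic of Eq. (2) is straightforward to compute. The sketch below takes \( \widehat{\mu} \) and \( \widehat{\sigma} \) as given arguments (their estimation is discussed later in the text) and uses 0-based, half-open indices:

```python
import numpy as np

def z_stat(x, i, j, mu_hat, sigma_hat):
    """Eq. (2): standardized sum of deviations of x[i:j] from mu_hat."""
    return np.sum(x[i:j] - mu_hat) / (sigma_hat * np.sqrt(j - i))
```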
According to the central limit theorem (CLT), as the sample size (the length of an observation sequence, j − i) increases to a sufficiently large number, the arithmetic mean of independent random variables will be approximately normally distributed with mean μ and variance σ^{2}/(j − i), regardless of the underlying distribution. Therefore, under the null hypothesis of no copy number change, the test statistic \( \widehat{Z}_{i,j} \) asymptotically follows a standard normal distribution, N(0, 1). Copy number change segments measured by high-throughput sequencing data usually span hundreds, even tens of thousands, of probes. Therefore, the normality of \( \widehat{Z}_{i,j} \) is approximated with high accuracy.
$$ \wp\left(\widehat{Z}_{i,j}\right)=1-\sqrt{\frac{2}{\pi}}\int_{-\infty}^{\widehat{Z}_{i,j}}e^{-\frac{x^2}{2}}\,dx>\theta $$
(4)
We iterate over the whole segment to calculate the P-value of \( \widehat{Z} \) using the cumulative distribution function of N(0, 1). If the P-value is greater than θ, then we consider that there is no copy number change in the segment; in other words, \( \widehat{Z} \) is not far from the center of the standard normal distribution.
$$ \wp\left(\widehat{T}_{i,j}\right)=\frac{\theta}{j-i} $$
(5)
$$ \widehat{Z}=\max_{i,j}\left|\frac{\sum_{k=i}^{j-1}\left(x_k-\widehat{\mu}\right)}{\widehat{\sigma}\sqrt{j-i}}\right|\ge \widehat{T}_{i,j} $$
(6)
$$ Z_{i,j}=\frac{\sum_{k=i}^{j-1}\left(x_k-\widehat{\mu}\right)}{\widehat{T}_{i,j}\sqrt{j-i}}=\omega_{i,j}\,\varepsilon_{i,j} $$
(7)
where \( \omega_{i,j}=\left(\widehat{T}_{i,j}\sqrt{j-i}\right)^{-1} \), ω_{i, j} > 0, and ε_{i, j} is the accumulated error between the two points i and j, 1 ≤ i < j ≤ n + 1.
$$ \mathbb{Z}_{1,n+1}(p)=dist\left(\left\langle Z_{1,p},Z_{p,n+1}\right\rangle,0\right) $$
(8)
$$ \mathbb{Z}_p=\max_{1<p<n+1}\mathbb{Z}_{1,n+1}(p) $$
(9)
$$ b=\arg\max_{1<p<n+1}\mathbb{Z}_{1,n+1}(p) $$
(10)
Then, a binary segmentation procedure is performed at breakpoint b, and we apply the above algorithm recursively to the two segments x_{1}, …, x_{b − 1} and x_{b}, …, x_{n}, b ∈ (1, n).
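The recursion can be sketched as follows. This is a simplified illustration, not the full DBS procedure: it scores a split by the smaller of the two one-sided |Z| values (a stand-in for the Minkowski dist of Eq. 8), and the significance gate z_min is an illustrative constant rather than the convergence threshold discussed later in the text:

```python
import numpy as np

def best_split(x):
    """Return (p, score): split index and min-of-both-sides |Z| (simplified)."""
    n = len(x)
    if n < 2:
        return None, 0.0
    mu, sigma = x.mean(), x.std(ddof=1)
    if sigma == 0:
        return None, 0.0
    dev = np.cumsum(x - mu)                     # accumulated error at each p
    best_p, best = None, 0.0
    for p in range(1, n):
        # both sides share |sum of deviations| since the total deviation is 0
        z_left = abs(dev[p - 1]) / (sigma * np.sqrt(p))
        z_right = abs(dev[p - 1]) / (sigma * np.sqrt(n - p))
        score = min(z_left, z_right)            # favor balanced, supported splits
        if score > best:
            best, best_p = score, p
    return best_p, best

def binary_segment(x, start, end, out, z_min=3.0):
    """Recursively collect breakpoints for x[start:end]."""
    if end - start < 2:
        return
    p, score = best_split(x[start:end])
    if p is not None and score > z_min:
        out.append(start + p)
        binary_segment(x, start, start + p, out, z_min)
        binary_segment(x, start + p, end, out, z_min)
```

For a noiseless step signal, the recursion finds the single breakpoint and then stops on the two constant subsegments.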
Multiscale scanning procedure
Up to now, the above algorithm can identify the most significant breakpoints, except in one case: a short segment sandwiched between two long segments. In this case, the distance between the breakpoint at the intermediate position p and either end is far greater than 1. Thus, \( \wp\left(\widehat{T}_{1,p}\right)=\theta/p \) tends to 0, and \( \widehat{T}_{1,p} \) hardly changes as p increases. The accumulated error generated by the summing process is shared equally among the points from 1 to p. As the distance to the ends increases, Z_{i, j} changes more slowly. Thus, spike pulses and small segments embedded in long segments are suppressed. Therefore, if ℤ_{p} is less than the estimated standard deviation \( \widehat{\sigma} \) after a one-time scan of the whole segment, we cannot simply exclude the presence of a breakpoint.
From the situation above, it is clear that we cannot use fixed endpoints to detect breakpoints on a global scale. This approach works for large jumps or changes in long segments, but to detect shorter segments we need smaller windows. For these smaller segments, scale-space scanning is used. In the DBS algorithm, the second phase, a multiscale scanning stage, is started using a windowed model if a breakpoint was not found immediately in the first phase.
$$ \mathbb{Z}_p=\max_{1<p<n,\;w\in\mathcal{W}} dist\left(\left\langle Z_{p-w,p},Z_{p,p+w}\right\rangle,0\right) $$
(11)
Therefore, we can find the local maximum across these scales (window widths), which provides a list of (Z_{p − w, p}, Z_{p, p + w}, p, w) values and indicates a potential breakpoint at p at scale w. Once ℤ_{p} is greater than the estimated standard deviation \( \widehat{\sigma} \), a new candidate breakpoint has been found. The recursive procedure of the first phase is then applied to the two newly generated segments.
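A minimal sketch of the windowed scan, under our own simplifications of Eq. (11): halving window widths, and the weaker of the two flanking-window |Z| values in place of the Minkowski dist:

```python
import numpy as np

def windowed_scan(x):
    """Scan all positions with halving window widths; return (score, p, w) of the
    strongest candidate, scored by the weaker of the two flanking-window |Z|s."""
    n = len(x)
    mu, sigma = x.mean(), x.std(ddof=1)
    S = np.concatenate([[0.0], np.cumsum(x - mu)])   # prefix sums of deviations
    widths = []
    w = n // 2
    while w >= 2:                                    # W = n/2, n/4, ..., 2
        widths.append(w)
        w //= 2
    best = (0.0, None, None)
    for w in widths:
        for p in range(w, n - w + 1):
            z_left = abs(S[p] - S[p - w]) / (sigma * np.sqrt(w))
            z_right = abs(S[p + w] - S[p]) / (sigma * np.sqrt(w))
            score = min(z_left, z_right)
            if score > best[0]:
                best = (score, p, w)
    return best
```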
Analysis of time complexity in DBS
In DBS, the first phase is a binary segmentation procedure with time complexity O(n ∙ log K), where K is the number of segments in the result of the first phase and n is the length of the observation sequence to be split. Because n ≫ K, the time complexity approaches O(n). The second phase, the multiscale scanning procedure, is costly compared with a one-time global scan of the whole segment. When \( \mathcal{W} \) is a geometric sequence with common ratio 2, the time complexity of the second phase is O(n ∙ log n). When \( \mathcal{W} \) includes all integers from 1 to n, the time complexity of the second phase degenerates to O(n^{2}); in this case, the algorithmic framework of DBS is fully equivalent to the one in BACOM, which is similar to the idea used in the Circular Binary Segmentation procedure.
In simulated data sets, it is uncommon for a short segment sandwiched between two long segments to be found in the first, or first few, dichotomies of the whole segmentation process, because broader changes can be expected to be detected reasonably well. After several recursive splits are executed, the length of each subsegment is greatly reduced, and the execution time of the second phase in DBS on each subsegment shrinks accordingly. However, the second phase must be triggered once before the recursive procedure ends, so the time complexity of DBS tends to approach O(n ∙ log n). Real data are more complicated, but the cost of DBS remains O(n ∙ log n) in practice. Its time complexity is about the same as that of its predecessors, but DBS runs faster than them, as discussed later in the section “Computational Performance”.
Convergence threshold: trimmed first-order difference variance \( \widehat{\sigma} \)
Here, the average estimated standard deviation \( \widehat{\sigma} \) on each chromosome is the key to the convergence of iterative binary segmentation, and it comes from a trimmed first-order difference variance estimator [19]. Combined with simple heuristics, this method can further enhance the accuracy of \( \widehat{\sigma} \). Suppose we wish to exclude a set of potential breakpoints by computationally inexpensive means. One way to identify potential breakpoints is to use high-pass filters, i.e., filters yielding high absolute values when passing over a breakpoint. The simplest such filter uses the difference ∆x_{i} = x_{i + 1} − x_{i}, 1 ≤ i < n, at each position i. We calculate all the differences and identify approximately 2% of the probe positions as potential breakpoints; that is, the area below the 1st percentile and above the 99th percentile of all differences is taken to correspond to breakpoints. Then, we estimate the standard deviation \( {\widetilde{\sigma}}^{\prime} \) of ∆x_{i} at the remaining positions. Assuming that the variances of the segments on one chromosome do not differ greatly, the average standard deviation of each segment is \( \widehat{\sigma}={\widetilde{\sigma}}^{\prime}/\sqrt{2} \).
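This estimator is easy to reproduce. The sketch below follows the description above; the exact percentile handling in ToolSeg may differ in detail:

```python
import numpy as np

def estimate_sigma(x, trim_pct=1.0):
    """Trimmed first-order difference estimate of the noise level:
    drop differences below the 1st / above the 99th percentile (likely
    true breakpoints), then sigma_hat = std(remaining diffs) / sqrt(2)."""
    d = np.diff(x)
    lo, hi = np.percentile(d, [trim_pct, 100.0 - trim_pct])
    kept = d[(d >= lo) & (d <= hi)]
    return kept.std(ddof=1) / np.sqrt(2)
```

The √2 accounts for each difference mixing the noise of two adjacent probes: Var(x_{i + 1} − x_{i}) = 2σ².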
Note that the current \( \widehat{\sigma} \) is only used to decide whether to continue splitting iteratively. After a whole binary segmentation procedure is completed, we obtain preliminary results and a corresponding binary tree of the test statistic ℤ_{p} generated during segmentation. Furthermore, according to the binary tree, a new fine-tuned \( {\widehat{\sigma}}^{\prime} \) is generated naturally to reflect the intra-segment variance more accurately. Finally, we select those candidate breakpoints for which ℤ_{p} is greater than the given \( {\widehat{\sigma}}^{\prime} \) as the final ‘true’ breakpoints.
Distinguishing real breakpoints from false breakpoints
Then, we can generate a corresponding binary tree of the test statistic ℤ_{p}; see Fig. 1(b). The values of the root and internal nodes are the ℤ_{p} of the initial array and of the corresponding intermediate results, and the values of the leaf nodes are the ℤ_{p} of the final segments. The identifier of each node (Node ID) is its Segment ID.
$$ \eta=\min\left(\mathbb{Z}_i\mid i\notin N_{\mathrm{leaf}}\right)-\max\left(\widehat{\sigma_j}\mid j\in N_{\mathrm{leaf}}\right) $$
(12)
where ℤ_{i} is the ℤ_{p} of the corresponding segment, and \( \widehat{\sigma_j} \) is the estimated standard deviation of the corresponding segment. Because the partitioning has now been provisionally completed, we use the real local standard deviation of each segment to examine the significance level of every child node.
If η > 0, all ℤ_{p} values of non-leaf nodes are greater than all standard deviations of leaf nodes; the breakpoints corresponding to all non-leaf nodes are then real significant breakpoints, and the DBS algorithm ends immediately. Otherwise, we set
$$ {\widehat{\sigma}}^{\prime}=\max\left(\widehat{\sigma_j}\mid j\in N_{\mathrm{leaf}}\right)+\lambda $$
(13)
where λ is a safety margin between the non-leaf nodes and the leaf nodes; its default value is 0.02. In other words, we only choose the candidate breakpoints whose ℤ_{p} is greater than \( {\widehat{\sigma}}^{\prime} \) for the final result. When a false breakpoint is removed, the subsegments corresponding to its two children are merged. This pruning process is equivalent to the merging and collating of segments in other algorithms. In the following sections, we discuss the segmentation process using simulated data and real data samples.
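A sketch of this decision rule (the function and argument names are ours): given the ℤ_{p} values of the internal (non-leaf) nodes and the per-leaf standard deviations, compute η and, when needed, the raised cut \( {\widehat{\sigma}}^{\prime} \) of Eq. (13):

```python
def prune_threshold(z_internal, sigma_leaf, lam=0.02):
    """Eta (Eq. 12) compares the weakest internal-node statistic with the
    noisiest leaf; if eta <= 0, raise the cut by the safety gap lambda (Eq. 13)."""
    eta = min(z_internal) - max(sigma_leaf)
    if eta > 0:
        return max(sigma_leaf)            # all candidate breakpoints already pass
    return max(sigma_leaf) + lam          # sigma_hat_prime: keep only Z_p above this
```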
In DBS, we use absolute errors rather than squared errors to enhance the computational speed of segmentation. With absolute errors, the cost of summing a continuous region can be reduced to O(1): only one subtraction is needed, using an integral array. The integral-array algorithm derives naturally from the integral image used in 2D image processing [20]. This special data structure and algorithm, namely, the summed area array, makes generating the sum of values over a continuous subset of an array very quick and efficient.
$$ Z_{i,j}=\omega_{i,j}\sum_{k=i}^{j-1}\left(x_k-\widehat{\mu}\right)=\omega_{i,j}\left[\mathrm{S}_j-\mathrm{S}_i-\left(j-i\right)\widehat{\mu}\right] $$
(14)
where \( \omega_{i,j}=\left(\wp^{-1}\left(\theta/\left(j-i\right)\right)\sqrt{j-i}\right)^{-1} \) and ℘^{−1}(∙) is the inverse of the cumulative distribution function of N(0, 1).
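The integral array itself is just a prefix sum. A short sketch of the O(1) range-sum query used in Eq. (14):

```python
import numpy as np

def integral_array(x):
    """S[0] = 0 and S[i] = x[0] + ... + x[i-1], built once in O(n)."""
    return np.concatenate([[0.0], np.cumsum(x)])

def range_sum(S, i, j):
    """Sum of x[i:j] via a single subtraction, O(1) per query."""
    return S[j] - S[i]
```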
$$ Z_{1,p}=\omega_{1,p}\,\varepsilon_{1,p} $$
(15)
$$ Z_{p,n+1}=\omega_{p,n+1}\,\varepsilon_{p,n+1} $$
(16)
$$ \mathbb{Z}_{1,n+1}(p)=dist\left(\left\langle Z_{1,p},Z_{p,n+1}\right\rangle,0\right)=dist\left(\left\langle \omega_{1,p},\omega_{p,n+1}\right\rangle,0\right)\left|\varepsilon_{1,p}\right|={\mathfrak{D}}_p\left|\varepsilon_{1,p}\right| $$
(17)
Finally, ℤ_{1, n + 1}(p) represents an accumulated error ε_{1, p} weighted by ( {mathfrak{D}}_p ). The test statistic ℤ_{p} physically represents the largest fluctuation of ℤ_{1, n + 1}(p) on a segment.
$$ {\mathfrak{D}}_p={dist}_{k=0.5}\left(\left\langle \omega_{1,p},\omega_{p,n+1}\right\rangle,0\right)=\left({\omega_{1,p}}^{k}+{\omega_{p,n+1}}^{k}\right)^{1/k} $$
(18)
When k < 1, the Minkowski distance between 0 and 〈ω_{1, p}, ω_{p, n + 1}〉 tends toward the smaller component of ω_{1, p} and ω_{p, n + 1}. Furthermore, ω_{1, p} and ω_{p, n + 1} both lie in the interval [ω_{1, n + 1}, ω_{1, 2}], and as p moves they change in opposite directions. Thus, when ω_{1, p} equals ω_{p, n + 1}, \( {\mathfrak{D}}_p \) reaches its maximum value, at p = n/2.
From the analysis above, when p is close to either end (e.g., p is relatively small and n − p is sufficiently large), Z_{1, p} is very susceptible to the outliers between point 1 and p. In this case, the position of such a local maximum of Z_{1, p} may be a false breakpoint, while Z_{p, n + 1} is not significant because a sufficient number of probes between point p and the other end suppresses the noise. Here, the Minkowski distance (k < 1) is used to filter out these unbalanced situations; at the same time, breakpoints at balanced local extrema are found preferentially. Usually, the most significant breakpoint is found near the middle of a segment due to \( {\mathfrak{D}}_p \), which increases the performance and stability of the binary segmentation.
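A small numeric check of the k = 0.5 form in Eq. (18): for a fixed larger component, the score is maximal when the two components are balanced, so unbalanced pairs are suppressed relative to balanced ones (the function name is ours):

```python
def dist_k(a, b, k=0.5):
    """Minkowski-style score of <a, b> against 0 with exponent k (Eq. 18 form)."""
    return (a ** k + b ** k) ** (1.0 / k)

balanced = dist_k(1.0, 1.0)       # both sides equally supported
unbalanced = dist_k(0.01, 1.0)    # one side nearly unsupported
```

Both pairs share the same larger component, yet the balanced pair scores far higher (4.0 versus about 1.21).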
$$ \mathbb{Z}_p=\mathbb{Z}_{1,n+1}(b)=\max\left(\left|Z_{1,b}\right|,\left|Z_{b,n+1}\right|\right) $$
(19)
In other words, at each point, ℤ_{1, n + 1}(p) tends to the smaller value of the left and right parts to suppress the noise when searching for breakpoints; ℤ_{1, n + 1}(p) tends to the larger value for measuring the significance of breakpoints.
Algorithm DBS: Deviation binary segmentation
Input: Copy numbers x_{1}, …, x_{n}; predefined significance level θ = 0.05; filtration ratio γ = 2%; safe gap λ = 0.02;

1.
Calculate integral array by letting S_{0} = 0, and iterate for i = 1…n:
$$ \mathrm{S}_i=\mathrm{S}_{i-1}+x_i $$

2.
Estimate the standard deviation \( \widehat{\sigma} \):

a)
Calculate the differences iteratively for i = 1…n: d_{i} = x_{i + 1} − x_{i}

b)
Sort all d_{i} and exclude the values below the γ/2 percentile and above the (100 − γ/2) percentile, then calculate the estimated standard deviation \( {\widetilde{\sigma}}^{\prime} \) from the remaining values.

c)
Get \( \widehat{\sigma}={\widetilde{\sigma}}^{\prime}/\sqrt{2} \).


3.
Start binary segmentation with two fixed endpoints for the segment x_{1}, …, x_{n}, and calculate the average \( \widehat{\mu} \) on the segment;

a)
By Eqn (14) iterate Z_{1, p} and Z_{p, n + 1} for p = 1…n;


b)
Search for the index of the potential breakpoint b_{k} at which ℤ_{p} is maximal in the previous step, and calculate \( \mathbb{Z}_{b_k} \) by Eqn (19);

c)
If \( \mathbb{Z}_{b_k}>\widehat{\sigma} \), store \( \mathbb{Z}_{b_k} \) and b_{k}, then go to Step 3 and apply binary segmentation recursively to the two subsegments \( x_1,\dots,x_{b_k-1} \) and \( x_{b_k},\dots,x_n \); otherwise, the multiscale scanning will be started: enter Step 4.

4.
Start binary segmentation with various sliding windows for segment x_{1}, …, x_{n},

a)
Create a width set of sliding windows by letting \( \mathcal{W}_0=n/2 \), and iterate \( \mathcal{W}_i=\mathcal{W}_{i-1}/2 \) until \( \mathcal{W}_i \) is less than 2 or a given value.

b)
Similar to the binary segmentation above, iterate Z_{p − w, p}, Z_{p, p + w} and ℤ_{1, n + 1}(p) for p = 1…n under all sliding windows \( \mathcal{W}_i \), then find the index of the potential breakpoint b_{k} at the maximum and calculate \( \mathbb{Z}_{b_k} \).

c)
If \( \mathbb{Z}_{b_k}>\widehat{\sigma} \), store \( \mathbb{Z}_{b_k} \) and b_{k}, then go to Step 3 and recursively start a binary segmentation without windows on the two segments \( x_1,\dots,x_{b_k-1} \) and \( x_{b_k},\dots,x_n \); otherwise, terminate the recursion and return.


5.
Merge operations: calculate η and update ( {widehat{sigma}}^{prime } ), and prune child nodes corresponding to candidate breakpoints to satisfy η > λ.

6.
Sort the indices of breakpoints b_{1}, …, b_{K} and compute the segment averages:
y_{i} = average(\( x_{b_{i-1}},\dots,x_{b_i-1} \)) for i = 1…K, with b_{0} = 1.
In the algorithm, we use the data from each segment to estimate the standard deviation of the noise. Since it is well documented that copy number signals show higher variation with increasing deviation from the diploid ground state, and each segment is assumed to have a single copy number state, the segment-specific estimate of the noise level makes the algorithm robust to heterogeneous noise.
DBS is a statistical approach based on observations. As with its predecessors, there is a lower limit on the length of short segments in order to meet the conditions of the CLT. In DBS, the algorithm has therefore been extended to allow a constraint on the minimum number of probes in a segment. It is worth noting that low-density data, such as array-CGH, are not well suited to DBS, and a sufficiently large lower limit on segment length (twenty probes) is necessary.