DBS: a fast and informative segmentation algorithm for DNA copy number analysis

Bioinformatics

Systems overview

The ToolSeg tool provides functionality for many tasks typically encountered in copy number analysis: data pre-processing, a collection of segmentation algorithms, and visualization tools. The main workflow includes: 1) reading and filtering of raw sample data; 2) segmentation of allele-specific SNP array data; and 3) visualization of results. The input consists of copy number measurements from single or paired SNP-array or HTS experiments. Extreme observations (outliers) in the allele measurements normally need to be detected and appropriately modified or filtered prior to segmentation; here, the median filtering algorithm [17] is used in the ToolSeg toolbox to preprocess the original input measurements. DBS is based on the Central Limit Theorem of probability theory for finding breakpoints and observation segments with a well-defined expected mean and variance. In DBS, the segmentation curves are generated recursively by splitting at the previously found breakpoints. A set of graphical tools is also available in the toolbox to visualize the raw data and segmentation results and to compare six different segmentation algorithms in a statistically rigorous way.

Input data and preprocessing

ToolSeg requires the raw signals from high-throughput samples to be organized as a one-dimensional vector and stored as a .txt file. Detailed descriptions of the software are included in the Supplementary Material.

A challenging factor in copy number analysis is the frequent occurrence of outliers – single probe values that differ markedly from their neighbors. Generally, such extreme observations can be due to the presence of very short segments of DNA with deviant copy numbers, to technical aberrations, or to a combination of both. They have a potentially harmful effect when the focus is on the detection of broader aberrations [17, 18]. In ToolSeg, a classical limit filter, Winsorization, is applied before copy number change detection and segmentation to reduce such noise; it is a typical preprocessing step that truncates extreme values in the data to reduce the effect of possible spurious outliers.

Here, we calculate the arithmetic mean as the expected value \( \widehat{\mu} \) and the estimated standard deviation \( \widehat{\sigma} \) based on all observations on the whole genome. For the original observations, the corresponding Winsorized observations are defined as \( {x}_i^{\prime }=f\left({x}_i\right) \), where

$$ f(x)=\begin{cases}\widehat{\mu}-\tau \widehat{\sigma}, & x<\widehat{\mu}-\tau \widehat{\sigma}\\ \widehat{\mu}+\tau \widehat{\sigma}, & x>\widehat{\mu}+\tau \widehat{\sigma}\\ x, & \mathrm{otherwise}\end{cases} $$

(1)

and τ ∈ [1.5, 3] (default 2.5 in ToolSeg). Often, such simple and fast Winsorization is sufficient, as discussed in [12].
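As a concrete illustration, the following Python sketch applies the Winsorization of Eqn (1); the use of NumPy and the function name are our own choices, and \( \widehat{\mu} \) and \( \widehat{\sigma} \) are estimated from all observations as described above.

```python
import numpy as np

def winsorize(x, tau=2.5):
    """Clamp extreme observations to mu_hat +/- tau * sigma_hat, as in Eqn (1).

    A minimal sketch: mu_hat and sigma_hat are estimated from all
    observations, and tau defaults to 2.5 as in ToolSeg.
    """
    x = np.asarray(x, dtype=float)
    mu_hat = x.mean()                  # arithmetic mean of all observations
    sigma_hat = x.std(ddof=1)          # estimated standard deviation
    lo, hi = mu_hat - tau * sigma_hat, mu_hat + tau * sigma_hat
    return np.clip(x, lo, hi)          # values outside [lo, hi] are truncated

# Example: a single spurious spike is pulled back toward the bulk of the data.
signal = np.concatenate([np.random.normal(0.0, 0.1, 1000), [5.0]])
clean = winsorize(signal)
```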

Binary segmentation

Now, we discuss the basic problem of obtaining an individual segmentation for one chromosome arm in one sample. The aim of copy number change detection and segmentation is to divide a chromosome into a few continuous segments, within each of which the copy number is considered constant.

Let xi, i = 1, 2, …, n, denote the measured copy number at locus i on a chromosome. The observation xi can be thought of as the sum of two contributions:

$$ {x}_i={y}_i+{\varepsilon}_i $$

where yi is the unknown actual “true” copy number at the i’th locus and εi represents measurement noise, which is independent and identically distributed (i.i.d.) with a mean of zero. A breakpoint is said to occur between probes i and i + 1 if yi ≠ yi+1, i ∈ (1, n). The sequence y0, …, yK thus implies a segmentation with a breakpoint set {b1, …, bK}, where b1 is the first breakpoint: the probes of the first sub-segment lie before b1, the second sub-segment lies between b1 and the second breakpoint b2, and so on. Thus, we formulate copy number change detection as the problem of detecting breakpoints in copy number data.

Consider first the simplest case, in which only one segment is obtained, i.e., there is no copy number change on the chromosome in the sample. Given the copy number signals of length n on the chromosome, x1, …, xn, let xi be an observation produced by independent and identically distributed (i.i.d.) random variables drawn from a distribution with expected value \( \widehat{\mu} \) and finite variance \( {\widehat{\sigma}}^2 \).

The following defines the statistic \( {\widehat{Z}}_{i,j} \),

$$ {\widehat{Z}}_{i,j}=\frac{\sum_{k=i}^{j-1}\left({x}_k-\widehat{\mu}\right)}{\widehat{\sigma}\sqrt{j-i}},\quad 1<i<j<n+1, $$

(2)

where \( \widehat{\mu}=\frac{1}{j-i}{\sum}_{k=i}^{j-1}{x}_k \) is the arithmetic mean between point i and point j (not including j), and \( \widehat{\sigma} \) is the estimated standard deviation of xi, \( \widehat{\sigma}=\sqrt{\frac{1}{j-i-1}{\sum}_{k=i}^{j-1}{\left({x}_k-\widehat{\mu}\right)}^2} \), which will be discussed later. Furthermore, we define the test statistic

$$ \widehat{Z}=\underset{1<i<j<n+1,\; j-i>{n}_0}{\max}\left|{\widehat{Z}}_{i,j}\right| $$

(3)

where n0 is a pre-determined parameter specifying the minimum length of a CNA.

According to the central limit theorem (CLT), as the sample size (the length of an observation sequence, j − i) becomes sufficiently large, the arithmetic mean of independent random variables is approximately normally distributed with mean μ and variance σ²/(j − i), regardless of the underlying distribution. Therefore, under the null hypothesis of no copy number change, the test statistic \( {\widehat{Z}}_{ij} \) asymptotically follows a standard normal distribution, N(0, 1). Copy number change segments measured by high-throughput sequencing data usually span hundreds or even tens of thousands of probes, so the normality of \( {\widehat{Z}}_{ij} \) is approximated with high accuracy.
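For illustration, the naive Python sketch below evaluates Eqns (2)–(3) with a double loop (a much faster formulation based on an integral array is given later in Eqn (14)). Here \( \widehat{\mu} \) and \( \widehat{\sigma} \) are supplied by the caller as the segment-level estimates used in Step 3 of the Algorithm box, and the function names and the illustrative default n0 = 20 are our own.

```python
import numpy as np

def z_hat(x, i, j, mu_hat, sigma_hat):
    """Z_hat_{i,j} of Eqn (2) over the half-open window [i, j), 0-based indices."""
    window = np.asarray(x[i:j], dtype=float)
    return (window - mu_hat).sum() / (sigma_hat * np.sqrt(j - i))

def max_abs_z(x, mu_hat, sigma_hat, n0=20):
    """Test statistic Z_hat of Eqn (3): the largest |Z_hat_{i,j}| over windows longer than n0."""
    n = len(x)
    best = 0.0
    for i in range(n):
        for j in range(i + n0 + 1, n + 1):   # enforce j - i > n0
            best = max(best, abs(z_hat(x, i, j, mu_hat, sigma_hat)))
    return best
```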

Here, let θ be a predefined significance level,

$$ \wp \left({\widehat{Z}}_{ij}\right)=1-\sqrt{\frac{2}{\pi }}{\int}_{-\infty}^{{\widehat{Z}}_{ij}}{e}^{-\frac{x^2}{2}}\,dx>\theta $$

(4)

We iterate over the whole segment to calculate the P-value of \( \widehat{Z} \) using the cumulative distribution function of N(0, 1). If the P-value is greater than θ, we consider that there is no copy number change in the segment; in other words, \( \widehat{Z} \) is not far from the center of the standard normal distribution.

Furthermore, we introduce an empirical correction in which θ is divided by Li,j = j − i. In other words, the predefined significance level is a function of the length Li,j of the detected part of the segment. Here, let \( {\widehat{T}}_{i,j} \) be the cut-off threshold of \( \widehat{Z} \),

$$ \wp \left({\widehat{T}}_{i,j}\right)=\frac{\theta }{j-i} $$

(5)

with a given θ and length, a definite \( {\widehat{T}}_{i,j}={\wp}^{-1}\left(\theta /\left(j-i\right)\right) \) is obtained using the inverse of the cumulative distribution function. If \( \widehat{Z} \) is less than \( {\widehat{T}}_{i,j} \), we consider that there is no copy number change in the segment; otherwise, it is necessary to split. The criterion of segmentation is given in Eqn (6),

$$ \widehat{Z}=\underset{i,j}{\max}\left|\frac{\sum_{k=i}^{j-1}\left({x}_k-\widehat{\mu}\right)}{\widehat{\sigma}\sqrt{j-i}}\right|\ge {\widehat{T}}_{i,j} $$

(6)
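As a minimal sketch of the threshold in Eqn (5), we read \( \wp \) as the upper-tail probability of N(0, 1), so that its inverse is the normal inverse survival function; the use of SciPy and the helper name are our own assumptions.

```python
from scipy.stats import norm

def cutoff_threshold(theta, length):
    """T_hat_{i,j} = p^{-1}(theta / (j - i)) as in Eqn (5), with length = j - i.

    Assumes the P-value function is the upper tail of N(0, 1), so its
    inverse is the inverse survival function (isf).
    """
    return norm.isf(theta / length)

# The cut-off threshold grows only slowly with the window length:
for length in (10, 100, 1000, 10000):
    print(length, round(cutoff_threshold(0.05, length), 2))
```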

Since the constant parameter θ is subjectively determined, we define a new statistic Zi,j by transforming formula (2) so that it represents a normalized standard deviation weighted by a predefined significance level between the two points i and j:

$$ {Z}_{i,j}=\frac{\sum_{k=i}^{j-1}\left({x}_k-\widehat{\mu}\right)}{{\widehat{T}}_{i,j}\sqrt{j-i}}={\omega}_{i,j}{\varepsilon}_{i,j} $$

(7)

where \( {\omega}_{i,j}={\left({\widehat{T}}_{i,j}\sqrt{j-i}\right)}^{-1} \), ωi,j > 0, and εi,j is the accumulated error between the two points i and j, 1 < i < j < n + 1.

We select a point p between the start 1 and the end n of a segment. Thus, Z1,p and Zp,n+1 are the two statistics corresponding to the left side and the right side of point p in the segment, respectively, and represent the weighted deviations of these two parts. Furthermore, we define a new statistic \( \mathbb{Z}_{1,n+1}(p) \),

$$ {\mathbb{Z}}_{1,n+1}(p)= \mathrm{dist}\left(\left\langle {Z}_{1,p},{Z}_{p,n+1}\right\rangle, 0\right) $$

(8)

where dist(〈∙〉, 0) is a distance measure between the vector 〈∙〉 and 0. The Minkowski distance can be used here; this will be discussed in the later section “Selecting for the distance function”. Finally, we define a new test statistic \( \mathbb{Z}_p \),

$$ {\mathbb{Z}}_p=\underset{1<p<n+1}{\max }\ {\mathbb{Z}}_{1,n+1}(p) $$

(9)

\( \mathbb{Z}_p \) is the maximum abrupt jump of variance within the segment under the current constraints, and its position is found by iterating once over the whole segment. If \( \mathbb{Z}_p \) is greater than the estimated standard deviation \( \widehat{\sigma} \) at location p, that is, if \( \mathbb{Z}_p \) is considered significant, we obtain a new candidate breakpoint b at p.

$$ b=\arg \underset{1<p<n+1}{\max }\ {\mathbb{Z}}_{1,n+1}(p) $$

(10)

Then, a binary segmentation procedure is performed at breakpoint b, and the above algorithm is applied recursively to the two segments x1, …, xp−1 and xp, …, xn, p ∈ (1, n).
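The recursive structure just described can be sketched as follows; find_breakpoint is a placeholder for the search of Eqns (7)–(10), and all names are illustrative rather than part of the published implementation.

```python
def segment(x, start, end, sigma_hat, find_breakpoint, breakpoints):
    """Recursive deviation binary segmentation over x[start:end] (0-based, half-open).

    find_breakpoint(x, start, end) is assumed to return (b, z_b), where z_b is
    the test statistic of Eqn (9), or None when no candidate exists.
    """
    found = find_breakpoint(x, start, end)
    if found is None:
        return
    b, z_b = found
    if z_b <= sigma_hat:        # not significant: stop splitting this segment
        return
    breakpoints.append((b, z_b))
    segment(x, start, b, sigma_hat, find_breakpoint, breakpoints)   # left part
    segment(x, b, end, sigma_hat, find_breakpoint, breakpoints)     # right part
```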

Multi-scale scanning procedure

Up to now, the above algorithm has been able to identify the most significant breakpoints, except when one short segment is sandwiched between two long segments. In this case, the distance between the breakpoint at the intermediate position p and both ends is far greater than 1. Thus, \( \wp \left({\widehat{T}}_{1,p}\right)=\theta /p \) tends to 0, and \( {\widehat{T}}_{1,p} \) changes very little as p increases. The accumulated error generated by the summing process is shared equally among all points from 1 to p, so as the distance to the ends increases, the change in Zi,j becomes slower. Thus, spike pulses and small segments embedded in long segments are suppressed. Therefore, if \( \mathbb{Z}_p \) is less than the estimated standard deviation \( \widehat{\sigma} \) after a one-time scan of the whole segment, we cannot arbitrarily exclude the presence of a breakpoint.

From the situation above, it is obvious that we cannot rely only on fixed endpoints to detect breakpoints on a global scale. This approach is acceptable for large jumps or changes in long segments, but to detect shorter segments we need smaller windows. For these smaller segments, scale-space scanning is used. In the DBS algorithm, a second, multi-scale scanning phase is started using a windowed model if a breakpoint is not found immediately by the first phase.

Here, let \( \mathcal{W} \) be a width set of sliding windows, and let \( w\in \mathcal{W} \) be a window width. Thus, the two statistics above, Z1,p and Zp,n+1, are updated to Zp−w,p and Zp,p+w. The test statistic \( \mathbb{Z}_p \) is updated by a double loop in Eqn (11),

$$ {\mathbb{Z}}_p=\underset{1<p<n,\, w\in \mathcal{W}}{\max }\ \mathrm{dist}\left(\left\langle {Z}_{p-w,p},{Z}_{p,p+w}\right\rangle, 0\right) $$

(11)

Therefore, we can find the local maxima across these scales (window widths), which provides a list of (Zp−w,p, Zp,p+w, p, w) values and indicates that there is a potential breakpoint at p at scale w. Once \( \mathbb{Z}_p \) is greater than the estimated standard deviation \( \widehat{\sigma} \), a new candidate breakpoint is found, and the recursive procedure of the first phase is applied to the two new segments just generated.
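The windowed scan of Eqn (11) can be sketched as follows. The width set \( \mathcal{W} \) is generated as the geometric sequence n/2, n/4, … used in the Algorithm box, weighted_z is assumed to return the Zi,j of Eqn (7), and the two sides are combined here by taking the smaller |Z|, a simple stand-in for the Minkowski-based combination described below, which likewise favours positions supported on both sides of p.

```python
def window_widths(n, w_min=2):
    """Width set W as a geometric sequence n/2, n/4, ... down to w_min."""
    widths, w = [], n // 2
    while w >= w_min:
        widths.append(w)
        w //= 2
    return widths

def multiscale_scan(x, weighted_z, w_min=2):
    """Windowed scan of Eqn (11): returns the best (p, score, w) triple found."""
    n = len(x)
    best_p, best_score, best_w = None, 0.0, None
    for w in window_widths(n, w_min):
        for p in range(w, n - w):
            zl = weighted_z(x, p - w, p)      # left window  (p - w, p)
            zr = weighted_z(x, p, p + w)      # right window (p, p + w)
            score = min(abs(zl), abs(zr))     # stand-in for dist(<Z_l, Z_r>, 0)
            if score > best_score:
                best_p, best_score, best_w = p, score, w
    return best_p, best_score, best_w
```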

Analysis of time complexity in DBS

In DBS, the first phase is a binary segmentation procedure, and the time complexity of this phase is O(n ∙ log K), where K is the number of segments in the result of the first phase and n is the length of the observation sequence to be split. Because n ≫ K, the time complexity approaches O(n). The second phase, the multi-scale scanning procedure, is costly compared with a one-time scan on a global scale over the whole segment. When \( \mathcal{W} \) is a geometric sequence with a common ratio of 2, the time complexity of the second phase is O(n ∙ log n). When \( \mathcal{W} \) includes all integers from 1 to n, the time complexity of the second phase degenerates to O(n²). In that case, the algorithm framework of DBS is fully equivalent to the one in BACOM, which is similar to the idea used in the Circular Binary Segmentation procedure.

In the simulated data sets, it is uncommon for a short segment sandwiched between two long segments to appear in the first or first few dichotomies of the whole segmentation process, because broader changes can be expected to be detected reasonably well. After several recursive splits have been executed, the length of each sub-segment is greatly reduced, and the execution time of the second phase in DBS on each sub-segment is reduced accordingly. However, the second phase must be triggered at least once before the recursive procedure ends, so the time complexity of DBS tends toward O(n ∙ log n). Real data are more complicated, but the cost of DBS remains O(n ∙ log n) in practice. Its time complexity is about the same as that of its predecessors, yet DBS is faster than them, as we discuss later in the section “Computational Performance”.

Convergence threshold: Trimmed first-order difference variance \( \widehat{\boldsymbol{\sigma}} \)

Here, the average estimated standard deviation \( \widehat{\sigma} \) on each chromosome is the key to the convergence of the iterative binary segmentation, and it comes from a trimmed first-order difference variance estimator [19]. Combined with simple heuristics, this method can further enhance the accuracy of \( \widehat{\sigma} \). Suppose we restrict our attention to excluding a set of potential breakpoints by computationally inexpensive means. One way to identify potential breakpoints is to use high-pass filters, i.e., filters that produce high absolute values when passing over a breakpoint. The simplest such filter uses the difference ∆xi = xi+1 − xi, 1 < i < n, at each position i. We calculate all the differences and identify approximately 2% of the probe positions as potential breakpoints; in other words, the differences below the 1st percentile and above the 99th percentile are taken to correspond to breakpoints. We then estimate the standard deviation \( {\widetilde{\sigma}}^{\prime} \) of ∆xi at the remaining positions. Assuming that the variances of the segments on one chromosome do not differ greatly, the average standard deviation \( \widehat{\sigma} \) of each segment is \( \widehat{\sigma}={\widetilde{\sigma}}^{\prime}/\sqrt{2} \).
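A sketch of this estimator (Step 2 of the Algorithm box), assuming NumPy; the function name and the trim-percentage parameter are our own.

```python
import numpy as np

def trimmed_diff_sigma(x, trim_pct=2.0):
    """Trimmed first-order difference estimate of the noise level.

    Differences below the (trim_pct/2)-th percentile or above the
    (100 - trim_pct/2)-th percentile are treated as potential breakpoints
    and excluded; sigma_hat = std(remaining differences) / sqrt(2).
    """
    d = np.diff(np.asarray(x, dtype=float))
    lo, hi = np.percentile(d, [trim_pct / 2.0, 100.0 - trim_pct / 2.0])
    kept = d[(d >= lo) & (d <= hi)]
    return kept.std(ddof=1) / np.sqrt(2.0)
```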

It should be noted that the current \( \widehat{\sigma} \) is only used to determine whether to continue splitting iteratively. After the whole binary segmentation procedure is completed, we obtain preliminary results and a corresponding binary tree of the test statistic \( \mathbb{Z}_p \) generated by the segmentation. According to this binary tree, a new fine-tuned \( {\widehat{\sigma}}^{\prime} \) is then generated naturally to reflect the intra-segment variance more accurately. Finally, we select those candidate breakpoints for which \( \mathbb{Z}_p \) is greater than the given \( {\widehat{\sigma}}^{\prime} \) as the final ‘true’ breakpoints.

Determining real breakpoints or false breakpoints

Let us now analyze the specific process of obtaining \( {\widehat{\sigma}}^{\prime} \) in detail. Figure 1(a) shows an assumed complete segmentation process. After being split twice at breakpoints b1 and b2, an initial array (Segment ID 1) is divided into three segments (IDs 3, 4 and 5). \( \mathbb{Z}_1 \) and \( \mathbb{Z}_2 \) are the two local maxima \( \mathbb{Z}_p \) at the corresponding breakpoints of the two segments (IDs 1 and 2). If \( \mathbb{Z}_3 \), \( \mathbb{Z}_4 \) and \( \mathbb{Z}_5 \) within the corresponding segments are all less than the pre-calculated \( \widehat{\sigma} \), then the whole subdivision process ends.

Fig. 1

Segmentation process and binary tree of \( \mathbb{Z}_p \) in DBS. a An assumed segmentation process with two breakpoints. Row [0] is the initial sequence to be split; Row [1] shows the first breakpoint found at locus b1, and Row [2] is similar. b The corresponding binary tree of \( \mathbb{Z}_p \) generated by (a). Here, the identifier of every node (Node ID) is also the Segment ID

Then, we can generate a corresponding binary tree of the test statistic \( \mathbb{Z}_p \); see Fig. 1(b). The values of the root node and the other non-leaf nodes are the \( \mathbb{Z}_p \) of the initial array and of the corresponding intermediate results, and the values of the leaf nodes are the \( \mathbb{Z}_p \) of the resulting segments. The identifier of every node (Node ID) is the Segment ID.

We define a new distance η between the set of non-leaf nodes and the set Nleaf of leaf nodes,

$$ \eta =\min \left({\mathbb{Z}}_i \mid i\notin {N}_{\mathrm{leaf}}\right)-\max \left(\widehat{\sigma_j} \mid j\in {N}_{\mathrm{leaf}}\right) $$

(12)

where \( \mathbb{Z}_i \) is the test statistic of the corresponding segment, and \( \widehat{\sigma_j} \) is the estimated standard deviation of the corresponding segment. Because the partitioning has now been initially completed, we use the real local standard deviation of each segment to examine the significance level of every child node.

If η > 0, all \( \mathbb{Z}_p \) of the non-leaf nodes are greater than all standard deviations of the leaf nodes; the breakpoints corresponding to the non-leaf nodes are the real, significant breakpoints, and the DBS algorithm ends immediately.

If η ≤ 0, there are false breakpoints resulting in over-segmentation, whose \( \mathbb{Z}_p \) are less than the standard deviation of the leaf nodes. Thus, we update \( \widehat{\sigma} \) to \( {\widehat{\sigma}}^{\prime} \),

$$ {\widehat{\sigma}}^{\prime }=\max \left(\widehat{\sigma_j} \mid j\in {N}_{\mathrm{leaf}}\right)+\lambda $$

(13)

where λ is a safety margin between the non-leaf nodes and the leaf nodes; its default value is 0.02. In other words, we only choose the candidate breakpoints whose \( \mathbb{Z}_p \) is greater than \( {\widehat{\sigma}}^{\prime} \) as the final result. When a false breakpoint is removed, the sub-segments corresponding to its two children are merged. This pruning process is equivalent to the process of merging and collating segments in other algorithms. In the following sections, we discuss the segmentation process using simulated data and an actual data sample.
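The pruning step of Eqns (12)–(13) can be sketched as follows, assuming each tree node records its test statistic, its local standard deviation, its breakpoint position and its children; all names are illustrative.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    z: float                            # test statistic Z_p of this node's segment
    sigma: float                        # estimated standard deviation within the segment
    breakpoint: Optional[int] = None    # breakpoint position (non-leaf nodes only)
    children: List["Node"] = field(default_factory=list)

def prune(root, lam=0.02):
    """Keep breakpoints whose Z exceeds sigma' = max(leaf sigmas) + lambda (Eqn 13)."""
    nodes = []
    def walk(node):
        nodes.append(node)
        for child in node.children:
            walk(child)
    walk(root)
    leaves = [n for n in nodes if not n.children]
    internal = [n for n in nodes if n.children]
    if not internal:
        return []
    eta = min(n.z for n in internal) - max(n.sigma for n in leaves)   # Eqn (12)
    if eta > 0:
        return [n.breakpoint for n in internal]          # every breakpoint is real
    sigma_prime = max(n.sigma for n in leaves) + lam     # Eqn (13)
    return [n.breakpoint for n in internal if n.z > sigma_prime]
```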

It should be emphasized that moderate over-segmentation is helpful for avoiding the loss of real breakpoints that may exist. In actual data, the best segmentation result is obtained if and only if η tends toward zero, due to the continuity of variance within segments. We discuss this later in the section “Segmentation of actual data sample”.

In DBS, we use absolute errors rather than squared errors to enhance the computational speed of segmentation. The cost of computing the absolute error of a continuous region can be reduced to O(1): only one subtraction is needed to obtain the sum over a continuous region when an integral array is used. The integral array is naturally derived from the integral image used in 2D image processing [20]. This special data structure and algorithm, namely the summed area array, makes it very quick and efficient to generate the sum of values in a continuous subset of an array.

Here, we only need a one-dimensional summed area table. As the name suggests, the value at any point i in the summed area array is the sum of all values to the left of point i, inclusive: \( {\mathrm{S}}_i={\sum}_{k=1}^i{x}_k \). Moreover, the summed area array can be computed efficiently in a single pass over a chromosome. Once the summed area array has been computed, evaluating the sum between point i and point j requires only two array references. This allows a constant calculation time that is independent of the length of the subarray. Using this fact, the statistic Zi,j can be computed rapidly and efficiently as in Eqn (14).

$$ {Z}_{i,j}={\omega}_{i,j}{\sum}_{k=i}^{j-1}\left({x}_k-\widehat{\mu}\right)={\omega}_{i,j}\left[{\mathrm{S}}_j-{\mathrm{S}}_i-\left(j-i\right)\widehat{\mu}\right] $$

(14)

where \( {\omega}_{i,j}={\left({\wp}^{-1}\left(\theta /\left(j-i\right)\right)\sqrt{j-i}\right)}^{-1} \) and \( {\wp}^{-1}(\cdot) \) is the inverse function of the cumulative distribution function of N(0, 1).
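A sketch of the summed area array and of Eqn (14) follows, using 0-based indexing (the window x[i:j] has sum S[j] − S[i]) and reading \( {\wp}^{-1} \) as the upper-tail inverse of the standard normal distribution; the function names are our own.

```python
import numpy as np
from scipy.stats import norm

def prefix_sums(x):
    """One-dimensional summed area array: S[0] = 0 and S[i] = x_1 + ... + x_i."""
    return np.concatenate([[0.0], np.cumsum(np.asarray(x, dtype=float))])

def weighted_z(S, i, j, mu_hat, theta=0.05):
    """Z_{i,j} of Eqn (14) from only two references into the summed area array.

    omega_{i,j} = 1 / (p^{-1}(theta / (j - i)) * sqrt(j - i)), with p^{-1}
    taken here as the inverse survival function of N(0, 1).
    """
    length = j - i
    omega = 1.0 / (norm.isf(theta / length) * np.sqrt(length))
    return omega * (S[j] - S[i] - length * mu_hat)
```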

The two statistics Z1,p and Zp,n+1 are weighted standard deviations between point p and the two ends, respectively,

$$ {Z}_{1,p}={\omega}_{1,p}{\varepsilon}_{1,p} $$

(15)

$$ {Z}_{p,n+1}={\omega}_{p,n+1}{\varepsilon}_{p,n+1} $$

(16)

Because \( {\varepsilon}_{i,j}={\sum}_{k=i}^{j-1}\left({x}_k-\widehat{\mu}\right) \), it follows that ε1,p + εp,n+1 = 0. Thus,

$$ {\mathbb{Z}}_{1,n+1}(p)= \mathrm{dist}\left(\left\langle {Z}_{1,p},{Z}_{p,n+1}\right\rangle, 0\right)= \mathrm{dist}\left(\left\langle {\omega}_{1,p},{\omega}_{p,n+1}\right\rangle, 0\right)\left|{\varepsilon}_{1,p}\right|={\mathfrak{D}}_p\left|{\varepsilon}_{1,p}\right| $$

(17)

Finally, \( \mathbb{Z}_{1,n+1}(p) \) represents an accumulated error |ε1,p| weighted by \( {\mathfrak{D}}_p \). The test statistic \( \mathbb{Z}_p \) physically represents the largest fluctuation of \( \mathbb{Z}_{1,n+1}(p) \) on a segment.

There are two steps in the entire process of searching for the local maximum \( \mathbb{Z}_p \) on a segment. First, the Minkowski distance (k = 0.5) is used to find the position of breakpoint b at the local maximum by the model in Eqn (18).

$$ {\mathfrak{D}}_p={\mathrm{dist}}_{k=0.5}\left(\left\langle {\omega}_{1,p},{\omega}_{p,n+1}\right\rangle, 0\right)={\left({\omega_{1,p}}^k+{\omega_{p,n+1}}^k\right)}^{1/k} $$

(18)

When k < 1, the Minkowski distance between 0 and 〈ω1,p, ωp,n+1〉 tends toward the smaller component of ω1,p and ωp,n+1. Furthermore, ω1,p and ωp,n+1 belong to the same interval [ω1,n+1, ω1,2], and as p moves they change in opposite directions. Thus, when ω1,p is equal to ωp,n+1, \( {\mathfrak{D}}_p \) reaches its maximum value, which occurs at p = n/2.

From the analysis above, when p is close to either end (for example, p is relatively small while n − p is sufficiently large), Z1,p is very susceptible to outliers between point 1 and point p. In this case, the position of such a local maximum of Z1,p may be a false breakpoint, while Zp,n+1 is not significant because there are enough probes between point p and the other end to suppress the noise. Here, the Minkowski distance (k < 1) is used to filter out these unbalanced situations, and breakpoints at balanced local extrema are preferentially found. Usually, the most significant breakpoint is found near the middle of a segment because of \( {\mathfrak{D}}_p \), so the performance and stability of binary segmentation are increased.

Once a breakpoint at b is identified, \( \mathbb{Z}_{1,n+1}(b) \) can be calculated with the Minkowski distance (k ≥ 1). In the limit k → ∞, the Minkowski distance reduces to the maximum of the two components, giving Eqn (19),

$$ {\mathbb{Z}}_p={\mathbb{Z}}_{1,n+1}(b)=\max \left(\left|{Z}_{1,b}\right|,\left|{Z}_{b,n+1}\right|\right) $$

(19)

In other words, when searching for breakpoints, \( \mathbb{Z}_{1,n+1}(p) \) at each point tends toward the smaller value of the left and right parts in order to suppress noise; when measuring the significance of a breakpoint, it tends toward the larger value.
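The two roles of the distance can be illustrated as follows; the helper names are our own, search_score mirrors Eqns (17)–(18), and significance mirrors Eqn (19).

```python
def minkowski_dist(u, v, k):
    """Minkowski distance with exponent k between two equal-length vectors."""
    return sum(abs(a - b) ** k for a, b in zip(u, v)) ** (1.0 / k)

def search_score(omega_left, omega_right, eps_left, k=0.5):
    """D_p * |eps_{1,p}| as in Eqns (17)-(18): used to locate the breakpoint."""
    d_p = minkowski_dist((omega_left, omega_right), (0.0, 0.0), k)
    return d_p * abs(eps_left)

def significance(z_left, z_right):
    """Z_p of Eqn (19): the larger |Z| of the two sides, used to test the breakpoint."""
    return max(abs(z_left), abs(z_right))
```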

Algorithm DBS: Deviation binary segmentation

Input: Copy numbers x1, …, xn; predefined significance level θ = 0.05; filtration ratio γ = 2%; safe gap λ = 0.02.

Output: Indices of breakpoints b1, …, bK; segment averages y1, …, yK; and degrees of significance of the breakpoints \( {\mathbb{Z}}_{b_1},\dots, {\mathbb{Z}}_{b_K} \).

1. Calculate the integral array by letting S0 = 0, and iterate for i = 1…n:

$$ {\mathrm{S}}_i={\mathrm{S}}_{i-1}+{x}_i $$

2. Estimate the standard deviation \( \widehat{\sigma} \):

   a) Calculate the differences iteratively for i = 1…n − 1: di = xi+1 − xi.

   b) Sort all di and exclude the values below the γ/2 percentile and above the (100 − γ/2) percentile of the differences di, then calculate the estimated standard deviation \( {\widetilde{\sigma}}^{\prime} \) from the remaining part.

   c) Get \( \widehat{\sigma}={\widetilde{\sigma}}^{\prime}/\sqrt{2} \).

3. Start binary segmentation with two fixed endpoints for the segment x1, …, xn, and calculate the average \( \widehat{\mu} \) on the segment:

   a) By Eqn (14), iterate Z1,p and Zp,n+1 for p = 1…n; then get \( \mathbb{Z}_{1,n+1}(p) \) by Eqn (18) with k = 0.5.

   b) Search for the index of the potential breakpoint bk at which \( \mathbb{Z}_p \) is the maximum in the previous step, and calculate \( {\mathbb{Z}}_{b_k} \) by Eqn (19).

   c) If \( {\mathbb{Z}}_{b_k}>\widehat{\sigma} \), store \( {\mathbb{Z}}_{b_k} \) and bk, then go to Step 3 and apply binary segmentation recursively to the two sub-segments \( {x}_1,\dots, {x}_{b_k-1} \) and \( {x}_{b_k},\dots, {x}_n \); otherwise, start the multi-scale scanning and enter Step 4.

4. Start binary segmentation with various sliding windows for the segment x1, …, xn:

   a) Create a width set of sliding windows by letting \( {\mathcal{W}}_0=n/2 \), and iterate \( {\mathcal{W}}_i={\mathcal{W}}_{i-1}/2 \) until \( {\mathcal{W}}_i \) is less than 2 or a given value.

   b) As in the binary segmentation above, iterate Zp−w,p, Zp,p+w and \( \mathbb{Z}_{1,n+1}(p) \) for p = 1…n under all sliding windows \( {\mathcal{W}}_i \), then find the index of the potential breakpoint bk at the maximum and calculate \( {\mathbb{Z}}_{b_k} \).

   c) If \( {\mathbb{Z}}_{b_k}>\widehat{\sigma} \), store \( {\mathbb{Z}}_{b_k} \) and bk, then go to Step 3 and recursively start a binary segmentation without windows on the two segments \( {x}_1,\dots, {x}_{b_k-1} \) and \( {x}_{b_k},\dots, {x}_n \); otherwise, terminate the recursion and return.

5. Merge operations: calculate η and update \( {\widehat{\sigma}}^{\prime} \), then prune the child nodes corresponding to candidate breakpoints so that η > λ is satisfied.

6. Sort the indices of breakpoints b1, …, bK and compute the segment averages: yi = average(\( {x}_{b_{i-1}},\dots, {x}_{b_i-1} \)) for i = 1…K, with b0 = 1.
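Putting the pieces together, the following Python sketch implements Steps 1–3 (the fixed-endpoint phase) of the algorithm under stated assumptions: Step 4 (multi-scale windows) and Step 5 (merging via η and \( {\widehat{\sigma}}^{\prime} \)) are omitted for brevity, \( {\wp}^{-1} \) is read as the upper-tail inverse of N(0, 1), indexing is 0-based, and all names and defaults are illustrative rather than taken from the published implementation.

```python
import numpy as np
from scipy.stats import norm

def dbs_phase1(x, theta=0.05, gamma=2.0, n_min=20):
    """First (fixed-endpoint) phase of DBS, sketched; returns sorted (breakpoint, Z_b) pairs."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    S = np.concatenate([[0.0], np.cumsum(x)])             # Step 1: integral array

    d = np.diff(x)                                         # Step 2: trimmed-difference noise estimate
    lo, hi = np.percentile(d, [gamma / 2.0, 100.0 - gamma / 2.0])
    sigma_hat = d[(d >= lo) & (d <= hi)].std(ddof=1) / np.sqrt(2.0)

    breakpoints = []

    def weighted_z(i, j, mu_hat):                          # Eqn (14)
        length = j - i
        t_hat = norm.isf(theta / length)                   # T_hat_{i,j}, upper-tail reading
        return (S[j] - S[i] - length * mu_hat) / (t_hat * np.sqrt(length))

    def split(start, end):                                 # Step 3
        if end - start < 2 * n_min:
            return
        mu_hat = (S[end] - S[start]) / (end - start)
        best_p, best_score, best_zb = None, -1.0, 0.0
        for p in range(start + n_min, end - n_min):
            zl = weighted_z(start, p, mu_hat)
            zr = weighted_z(p, end, mu_hat)
            score = (np.sqrt(abs(zl)) + np.sqrt(abs(zr))) ** 2   # Minkowski k = 0.5, Eqn (18)
            if score > best_score:
                best_p, best_score = p, score
                best_zb = max(abs(zl), abs(zr))                  # Eqn (19)
        if best_p is not None and best_zb > sigma_hat:
            breakpoints.append((best_p, best_zb))
            split(start, best_p)                                 # left sub-segment
            split(best_p, end)                                   # right sub-segment

    split(0, n)
    return sorted(breakpoints)

# Example on synthetic data with a copy number gain between probes 500 and 800.
rng = np.random.default_rng(0)
y = np.zeros(1500)
y[500:800] = 0.8
candidate_breakpoints = dbs_phase1(y + rng.normal(0.0, 0.2, size=1500))
```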

In the algorithm, we use the data from each segment to estimate the standard deviation of the noise. Since it is well documented that copy number signals show higher variation with increasing deviation from the diploid ground state, and by assuming that each segment has a single copy number state, the segment-specific estimate of the noise level makes the algorithm robust to heterogeneous noise.

DBS is a statistical approach based on observations. As with its predecessors, there is a lower limit on the length of short segments in order to meet the conditions of the CLT. In DBS, the algorithm has also been extended to allow a constraint on the minimum number of probes in a segment. It is worth noting that low-density data, such as arrayCGH, are not suitable for DBS, and a sufficiently large lower limit on the length of a segment (twenty probes) is necessary.
