NCLcomparator: systematically post-screening non-co-linear transcripts (circular, trans-spliced, or fusion RNAs) identified from various detectors

Bioinformatics
NCLcomparator is applied to an rRNA-depleted RNA-seq dataset of HeLa cells (SRR1637089). Nine intragenic and six intergenic NCL detectors are selected and run independently with default parameters or the parameters suggested by their authors (Additional file 1: Table S1). Some tools can detect intragenic and intergenic NCL events simultaneously (e.g., NCLscan, Segemehl [68], and MapSplice [69]). In total, these tools identify 17,313 intragenic and 766 intergenic NCL events (i.e., the input NCL events; Additional file 2: Table S2). NCLcomparator first checks the alignment ambiguity of the input NCL events. Of the 17,313 intragenic NCL events, 269 have alternative co-linear explanations and 792 map to multiple positions with similar BLAT mapping scores (Table 1). Of the 766 intergenic NCL events, 184 have alternative co-linear explanations and 69 have multiple hits (Table 1). The proportions of false positives derived from ambiguous alignments vary among the compared tools and are generally higher for intergenic than for intragenic NCL events (Fig. 2a and b), consistent with previous reports that many intergenic NCL events may be false positives arising from sequencing/alignment errors or in vitro artifacts [7, 42]. In particular, more than 50% of the intergenic events identified by CRAC, ericscript, and SOAPfuse are derived from ambiguous alignments. Of note, the NCLscan results have the lowest percentages of false positives derived from ambiguous alignments among the compared tools; these percentages are consistently low (~1%) in both intragenic and intergenic NCL event detection (Fig. 2a and b). This result is consistent with previous reports that NCLscan has the highest precision compared with other tools [1, 51]. The NCL events determined to be derived from ambiguous alignments (designated as “ambiguous NCL events”) are removed. A total of 16,252 intragenic and 513 intergenic events (designated as “non-ambiguous NCL events”) are therefore retained for the following comparisons (Table 1).

Table 1

The number of intragenic and intergenic NCL events before and after screening

Screening step                                Intragenic    Intergenic
Before screening                                  17,313           766
Alignment ambiguity (ambiguous NCL events)         1,061           253
  Alternative co-linear explanation                  269           184
  Multiple hit                                       792            69
After screening (non-ambiguous NCL events)        16,252           513
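The bookkeeping behind Table 1 can be checked with a few lines of Python. This is only an illustrative sketch, not part of NCLcomparator: the dictionary layout and the function name `screen_summary` are assumptions made here, and the calculation assumes that the two ambiguity categories do not overlap, as the subtotals in Table 1 imply.

```python
# Counts copied from Table 1; the dictionary layout is illustrative only.
TABLE1 = {
    "intragenic": {"before": 17313, "alt_colinear": 269, "multiple_hit": 792},
    "intergenic": {"before": 766, "alt_colinear": 184, "multiple_hit": 69},
}


def screen_summary(counts):
    """Return (ambiguous, retained, ambiguous fraction) for one event class.

    Assumes the two ambiguity categories are disjoint, as Table 1 implies.
    """
    ambiguous = counts["alt_colinear"] + counts["multiple_hit"]
    retained = counts["before"] - ambiguous
    return ambiguous, retained, ambiguous / counts["before"]


for kind, counts in TABLE1.items():
    ambiguous, retained, frac = screen_summary(counts)
    print(f"{kind}: {ambiguous} ambiguous ({frac:.1%}), {retained} retained")
```

With the Table 1 counts, about 6% of intragenic and about 33% of intergenic input events are flagged as ambiguous, in line with the higher ambiguity reported for intergenic events.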

Fig. 2

Distributions of (a) intragenic and (b) intergenic NCL events derived from ambiguous alignments with an alternative co-linear explanation and multiple hit(s) among various tools on the rRNA depleted RNA-seq data of HeLa cells (SRR1637089)

In addition to tab-delimited text files (e.g., Additional file 2: Table S2), NCLcomparator provides figures for comparing the identification results of different tools (see Fig. 3a and b for intragenic events and Fig. 3c and d for intergenic events). We take the intragenic NCL events as an example. The number of detected events varies considerably among tools, and every tool identifies a considerable proportion of tool-specific intragenic NCL events (Fig. 3a, top). More than 40% (6450 events) of the intragenic NCL events are identified exclusively by a single tool (Fig. 3a, bottom). Moreover, even for the intragenic NCL events detected by all nine compared tools (205 events), the number of supporting NCL-junction reads (NNCL) varies among tools (Fig. 3b). These observations reflect great discrepancies in the detection results (both the number and the NNCL values of identified events) among tools. The discrepancies are even more pronounced for intergenic events than for intragenic ones: except for SOAPfuse and ericscript, every tool identifies more than 30% of tool-specific intergenic NCL events (Fig. 3c, top); more than 83% (426 out of 513) of intergenic events are identified exclusively by a single tool (Fig. 3c, bottom); and the NNCL values vary greatly among the compared tools (Fig. 3d). In addition, comparison of the cumulative distributions of τNCL shows that τNCL values are significantly higher for intergenic events than for intragenic ones (Kolmogorov-Smirnov test; Fig. 3e). These observations highlight the importance of carefully screening the NCL events, especially intergenic events, identified by currently available NCL event detectors.
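The "number of supporting tools" tally described above can be sketched as follows. This is a minimal illustration, not NCLcomparator's own code: the function names, the tool names, and the junction identifiers are hypothetical, and each tool's calls are assumed to be available as a set of event identifiers.

```python
from collections import Counter


def support_counts(calls):
    """Map each event to the number of tools that report it."""
    counter = Counter()
    for events in calls.values():
        counter.update(set(events))  # count each event once per tool
    return counter


def tool_specific_fraction(calls):
    """Fraction of distinct events reported by exactly one tool."""
    support = support_counts(calls)
    single = sum(1 for n in support.values() if n == 1)
    return single / len(support)


# Hypothetical toy input: tool name -> junction identifiers (not real data).
toy_calls = {
    "toolA": {"chr1:100|chr1:50", "chr2:10|chr2:5", "chr3:7|chr3:1"},
    "toolB": {"chr1:100|chr1:50", "chr4:9|chr4:2"},
    "toolC": {"chr1:100|chr1:50", "chr2:10|chr2:5"},
}

support = support_counts(toy_calls)
shared_by_all = [e for e, n in support.items() if n == len(toy_calls)]
print(shared_by_all, tool_specific_fraction(toy_calls))
```

In the toy input, one event is shared by all three tools and half of the distinct events are reported by a single tool. The same tally on the real calls would correspond to the kind of summary shown in the bottom panels of Fig. 3a and c.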

Fig. 3

Variation in the number of detected events and NNCL among various tools. Of note, the analysis is based on the non-ambiguous NCL events. a The coverage of identified intragenic NCL events between the compared tools (top) and the distribution of the number of supporting tools (bottom). b Boxplot of the number of supporting NCL-junction reads (NNCL) for the intragenic NCL events (205 events) identified by all nine examined circRNA detectors. c and d The results for intergenic NCL events identified by six gene-fusion detectors, corresponding to the intragenic cases in (a) and (b), respectively. For (d), the boxplot represents the 87 intergenic NCL events identified by at least two gene-fusion detectors. A zoom-in view of NNCL for SOAPfuse, ericscript, CRAC, MapSplice2, and NCLscan is shown in the middle panel of (d). The intragenic and intergenic NCL events identified by the various tools are listed in Additional file 2: Table S2. e Comparison of the cumulative distributions of τNCL for the non-ambiguous intragenic and intergenic NCL events. The P value is determined by the Kolmogorov-Smirnov test

Importantly, NCLcomparator provides two measurements, τNCL and NCLscore (see Eqs. (1) and (2), respectively), to help users select highly plausible NCL events. Since NNCL values vary remarkably between tools depending on the strictness of the filtering steps used [51, 60] (see also Fig. 3b and d), we speculate that high-confidence NCL events tend to have a large NCLscore (in other words, a large median NNCL with a small τNCL). Since ambiguous NCL events can be regarded as false positives, we examine the τNCL and NCLscore values of ambiguous and non-ambiguous NCL events. Indeed, we find a significantly negative correlation between these two measurements, in which τNCL values are significantly higher in ambiguous NCL events than in non-ambiguous ones (Fig. 4a and b), whereas the reverse trend is observed for NCLscore values (Fig. 4a and c), regardless of whether the events are intragenic or intergenic (P values determined by the Kolmogorov-Smirnov test). For τNCL, more than 95% of ambiguous events are filtered out at the thresholds set for intragenic and intergenic events (Fig. 4b). Meanwhile, for NCLscore, more than 95% of ambiguous events are removed if we set the thresholds to > −1 and > −2 for intragenic and intergenic events, respectively (Fig. 4c). These results indicate that NCLscore is a good indicator for selecting high-confidence NCL events, suggesting that NCL events with a large NCLscore are likely to be reliable and should be prioritized for further experimental validation and functional analysis.
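The threshold-based filtering described above amounts to asking what fraction of events falls at or below a cutoff. The sketch below illustrates this for NCLscore only. The function name `fraction_removed` and all score values are hypothetical and invented for illustration; the only value taken from the text is the intragenic cutoff (retain events with NCLscore > −1).

```python
def fraction_removed(scores, threshold):
    """Fraction of events whose score is not above the threshold."""
    if not scores:
        raise ValueError("scores must not be empty")
    removed = sum(1 for s in scores if s <= threshold)
    return removed / len(scores)


# Hypothetical NCLscore values, invented purely for illustration.
ambiguous_scores = [-3.2, -2.5, -1.8, -1.1, 0.4]
non_ambiguous_scores = [-1.5, -0.2, 0.8, 1.9, 3.1]

cutoff = -1  # intragenic threshold quoted in the text (keep NCLscore > -1)
print(fraction_removed(ambiguous_scores, cutoff))      # 4 of 5 removed
print(fraction_removed(non_ambiguous_scores, cutoff))  # 1 of 5 removed
```

A useful cutoff removes most ambiguous events while retaining most non-ambiguous ones, so both fractions are worth reporting side by side.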

Fig. 4

Comparisons of τNCL and NCLscore between ambiguous and non-ambiguous NCL events. a Correlation between τNCL and NCLscore for intragenic (left) and intergenic (right) NCL events. The black and red dots represent non-ambiguous and ambiguous NCL events, respectively. The correlation coefficient r and P values are determined by Pearson’s r test. b and c Comparisons of cumulative distribution of (b) τNCL and (c) NCLscore for the ambiguous and non-ambiguous NCL events in intragenic (left) and intergenic (right) detections. In (c), a zoom-in view is shown in the lower right panel. P values are determined by the Kolmogorov-Smirnov test
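The Kolmogorov-Smirnov comparisons in Figs. 3e and 4 rest on the largest vertical gap between two empirical cumulative distributions. The sketch below computes only that statistic in plain Python and does not derive a P value. The source does not say which software was used, so this is an illustration of the test statistic rather than the authors' procedure; the function name and the τNCL values are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between the ECDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    # Both ECDFs are step functions, so the largest gap occurs at a data point.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Hypothetical tau_NCL values, invented purely for illustration.
tau_ambiguous = [0.7, 0.8, 0.9, 0.95]
tau_non_ambiguous = [0.1, 0.2, 0.3, 0.75]
print(ks_statistic(tau_ambiguous, tau_non_ambiguous))
```

For a P value as well, a statistics library routine such as `scipy.stats.ks_2samp` is one option.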
