Tigmint: correcting assembly errors using linked reads from large molecules

Bioinformatics
Correcting the ABySS assembly of the human data set HG004 with Tigmint reduces the number of misassemblies identified by QUAST by 216, a reduction of 27%. While the scaffold NG50 decreases slightly from 3.65 Mbp to 3.47 Mbp, the scaffold NGA50 remains unchanged; thus in this case, correcting the assembly with Tigmint improves the correctness of the assembly without substantially reducing its contiguity. However, scaffolding the uncorrected and corrected assemblies with ARCS yields markedly different results: a 2.5-fold increase in NGA50 from 3.1 Mbp to 7.9 Mbp without Tigmint versus a more than five-fold increase in NGA50 to 16.4 Mbp with Tigmint. Further, correcting the assembly and then scaffolding yields a final assembly that is both more correct and more contiguous than the original assembly, as shown in Fig. 3 and Table 2.

Fig. 3

Assembly contiguity and correctness metrics of HG004 with and without correction using Tigmint prior to scaffolding with ARCS. The most contiguous and correct assemblies are found in the top-left. Supernova assembled linked reads only, whereas the others used paired-end and mate-pair reads

Table 2

The assembly contiguity (scaffold NG50 and NGA50) and correctness (number of misassemblies) metrics with and without correction using Tigmint prior to scaffolding with ARCS

| Sample | Assembly | NG50 (Mbp) | NGA50 (Mbp) | Misassemblies | Reduction |
|---|---|---|---|---|---|
| HG004 | ABySS | 3.65 | 3.09 | 790 | NA |
| | ABySS+Tigmint | 3.47 | 3.09 | 574 | 216 (27.3%) |
| | ABySS+ARCS | 9.91 | 7.86 | 823 | NA |
| | ABySS+Tigmint+ARCS | 26.39 | 16.43 | 641 | 182 (22.1%) |
| HG004 | DISCO+ABySS | 10.55 | 9.04 | 701 | NA |
| | DISCO+ABySS+Tigmint | 10.16 | 9.04 | 666 | 35 (5.0%) |
| | DISCO+ABySS+ARCS | 29.20 | 17.05 | 829 | NA |
| | DISCO+ABySS+Tigmint+ARCS | 35.31 | 23.68 | 804 | 25 (3.0%) |
| HG004 | DISCO+BESST | 7.01 | 6.14 | 568 | NA |
| | DISCO+BESST+Tigmint | 6.77 | 6.14 | 493 | 75 (13.2%) |
| | DISCO+BESST+ARCS | 27.64 | 15.14 | 672 | NA |
| | DISCO+BESST+Tigmint+ARCS | 33.43 | 19.40 | 603 | 69 (10.3%) |
| HG004 | Supernova | 38.48 | 12.65 | 1005 | NA |
| | Supernova+Tigmint | 17.72 | 11.43 | 923 | 82 (8.2%) |
| | Supernova+ARCS | 39.63 | 13.24 | 1052 | NA |
| | Supernova+Tigmint+ARCS | 27.35 | 12.60 | 998 | 54 (5.1%) |
| HG004 | Falcon | 4.56 | 4.21 | 3640 | NA |
| | Falcon+Tigmint | 4.45 | 4.21 | 3444 | 196 (5.4%) |
| | Falcon+ARCS | 18.14 | 9.71 | 3801 | NA |
| | Falcon+Tigmint+ARCS | 22.52 | 11.97 | 3574 | 227 (6.0%) |
| NA12878 | Canu | 7.06 | 5.40 | 1688 | NA |
| | Canu+Tigmint | 6.87 | 5.38 | 1600 | 88 (5.2%) |
| | Canu+ARCS | 19.70 | 10.12 | 1736 | NA |
| | Canu+Tigmint+ARCS | 22.01 | 10.85 | 1626 | 110 (6.3%) |
| Simulated | ABySS | 9.00 | 8.28 | 272 | NA |
| | ABySS+Tigmint | 8.61 | 8.28 | 217 | 55 (20.2%) |
| | ABySS+ARCS | 23.37 | 17.09 | 365 | NA |
| | ABySS+Tigmint+ARCS | 30.24 | 24.98 | 320 | 45 (12.3%) |
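The reduction values in Table 2 follow from the misassembly counts of each assembly with and without Tigmint. As a minimal sketch (the function name `misassembly_reduction` and the `pairs` dictionary are ours, introduced only for illustration), the arithmetic can be reproduced as follows:

```python
from typing import Dict, Tuple


def misassembly_reduction(before: int, after: int) -> Tuple[int, float]:
    """Return the absolute and the percent reduction in misassemblies."""
    removed = before - after
    return removed, 100.0 * removed / before


# Misassembly counts from Table 2: (without Tigmint, with Tigmint).
pairs: Dict[str, Tuple[int, int]] = {
    "ABySS": (790, 574),
    "ABySS+ARCS": (823, 641),
    "DISCO+BESST": (568, 493),
    "Supernova": (1005, 923),
}

reductions = {name: misassembly_reduction(b, a) for name, (b, a) in pairs.items()}
for name, (removed, percent) in reductions.items():
    print(f"{name}: {removed} ({percent:.1f}%)")
```

For the ABySS assembly this gives 216 (27.3%), matching the table. The rows that include ARCS are compared against the ARCS-scaffolded assembly without Tigmint, not against the unscaffolded assembly.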

Correcting the DISCOVARdenovo + BESST assembly reduces the number of misassemblies by 75, a reduction of 13%. Using Tigmint to correct the assembly before scaffolding with ARCS yields an increase in NGA50 of 28% over using ARCS without Tigmint. Correcting the DISCOVARdenovo + ABySS-Scaffold assembly reduces the number of misassemblies by 35 (5%), after which scaffolding with ARCS improves the NGA50 to 23.7 Mbp, 2.6 times that of the original assembly and a 40% improvement over ARCS without Tigmint. The assembly with the fewest misassemblies is DISCOVARdenovo + BESST + Tigmint. The assembly with the largest NGA50 is DISCOVARdenovo + ABySS-Scaffold + Tigmint + ARCS. Finally, DISCOVARdenovo + BESST + Tigmint + ARCS strikes a good balance between high contiguity and few misassemblies.

Correcting the Supernova assembly of the HG004 linked reads with Tigmint reduces the number of misassemblies by 82, a reduction of 8%, and after scaffolding the corrected assembly with ARCS, we see a slight (<1%) decrease in both misassemblies and NGA50 compared to the original Supernova assembly. Since the Supernova assembly is composed entirely of the linked reads, this result is concordant with our expectation of no substantial gains from using these same data to correct and scaffold the Supernova assembly.

We attempted to correct the ABySS assembly using NxRepair, which made no corrections for any value of its z-score threshold parameter T less than -2.7. Setting T = -2.4, NxRepair reduced the number of misassemblies from 790 to 611, a reduction of 179 or 23%, whereas Tigmint reduced misassemblies by 216 or 27%. NxRepair reduced the NGA50 by 34% from 3.09 Mbp to 2.04 Mbp, unlike Tigmint, which did not reduce the NGA50 of the assembly. Tigmint produced an assembly that is both more correct and more contiguous than NxRepair with T = -2.4. Smaller values of T corrected fewer errors than Tigmint, and larger values of T further decreased the contiguity of the assembly. We similarly corrected the two DISCOVARdenovo assemblies using NxRepair with T = -2.4, as shown in Fig. 4. The DISCOVARdenovo + BESST assembly corrected by Tigmint is both more correct and more contiguous than that corrected by NxRepair. The DISCOVARdenovo + ABySS-Scaffold assembly corrected by NxRepair has 16 (2.5%) fewer misassemblies than that corrected by Tigmint, but the NGA50 is reduced from 9.04 Mbp with Tigmint to 5.53 Mbp with NxRepair, a reduction of 39%.
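The statement that one corrected assembly is "both more correct and more contiguous" than another amounts to a dominance comparison on two metrics. The following is a minimal sketch of that comparison; the `Metrics` type and `dominates` function are hypothetical names of ours, and the count of 650 misassemblies for NxRepair on DISCOVARdenovo + ABySS-Scaffold is derived from the text (666 with Tigmint, 16 fewer with NxRepair) rather than reported directly.

```python
from typing import NamedTuple


class Metrics(NamedTuple):
    nga50_mbp: float      # scaffold NGA50 in Mbp (higher is better)
    misassemblies: int    # misassemblies reported by QUAST (lower is better)


def dominates(a: Metrics, b: Metrics) -> bool:
    """True if a is no worse than b on both metrics and strictly better on one."""
    no_worse = a.nga50_mbp >= b.nga50_mbp and a.misassemblies <= b.misassemblies
    better = a.nga50_mbp > b.nga50_mbp or a.misassemblies < b.misassemblies
    return no_worse and better


# Values quoted in the text for the corrected assemblies.
abyss_tigmint = Metrics(3.09, 574)
abyss_nxrepair = Metrics(2.04, 611)
disco_abyss_tigmint = Metrics(9.04, 666)
disco_abyss_nxrepair = Metrics(5.53, 650)  # derived: 666 - 16

print(dominates(abyss_tigmint, abyss_nxrepair))
print(dominates(disco_abyss_tigmint, disco_abyss_nxrepair))
print(dominates(disco_abyss_nxrepair, disco_abyss_tigmint))
```

Under this definition Tigmint dominates NxRepair for the ABySS assembly, whereas for DISCOVARdenovo + ABySS-Scaffold neither tool dominates: NxRepair leaves slightly fewer misassemblies, and Tigmint retains a much larger NGA50.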

Fig. 4

Assembly contiguity and correctness metrics of HG004 corrected with NxRepair, which uses mate pairs, and Tigmint, which uses linked reads. The most contiguous and correct assemblies are found in the top-left

The assemblies of SMS reads have contig NGA50s in the megabases. Tigmint and ARCS together more than double the scaffold NGA50 of the Canu assembly to nearly 11 Mbp and nearly triple the scaffold NGA50 of the Falcon assembly to 12 Mbp, and both assemblies have fewer misassemblies than their original assembly, as shown in Fig. 5. Thus, using Tigmint and ARCS together improves both the contiguity and correctness over the original assemblies. This result demonstrates that by using long reads in combination with linked reads, one can achieve an assembly quality that is not currently possible with either technology alone.

Fig. 5

Assemblies of Oxford Nanopore sequencing of NA12878 with Canu and PacBio sequencing of HG004 with Falcon with and without correction using Tigmint prior to scaffolding with ARCS

The alignments of the ABySS assembly to the reference genome before and after Tigmint are visualized in Fig. 6 using JupiterPlot [26], which uses Circos [27]. A number of split alignments, likely misassemblies, are visible in the assembly before Tigmint, whereas after Tigmint no such split alignments are visible.

Fig. 6

The alignments to the reference genome of the ABySS assembly of HG004 before and after Tigmint. The reference chromosomes are on the left in colour, the assembly scaffolds on the right in grey. No translocations are visible after Tigmint

The default maximum distance permitted between linked reads in a molecule is 50 kbp, which is the value used by the Long Ranger and Lariat tools of 10x Genomics. In our tests, values between 20 kbp and 100 kbp do not substantially affect the results, and values smaller than 20 kbp begin to disconnect linked reads that should be found in a single molecule. The effect of varying the window and spanning molecules parameters of Tigmint on the assembly contiguity and correctness metrics is shown in Fig. 7. When varying the spanning molecules parameter, the window parameter is fixed at 2 kbp, and when varying the window parameter, the spanning molecules parameter is fixed at 20. The assembly metrics of the ABySS, DISCOVARdenovo + ABySS-Scaffold, and DISCOVARdenovo + BESST assemblies after correction with Tigmint are rather insensitive to the spanning molecules parameter for any value up to 50 and to the window parameter for any value up to 2 kbp. The parameter values of span = 20 and window = 2000 worked well for all of the tested assembly tools.
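To make the roles of the window and spanning molecules parameters concrete, the sketch below flags windows of a scaffold that are spanned by fewer than `span` molecules. It is a simplified illustration under our own assumptions, not the implementation used by Tigmint: the function name, the half-open coordinates, the fixed `step` between windows, and the brute-force counting are all hypothetical choices made for brevity.

```python
from typing import List, Tuple


def poorly_supported_windows(
    molecules: List[Tuple[int, int]],
    scaffold_length: int,
    window: int = 2000,
    span: int = 20,
    step: int = 1000,  # hypothetical stride between windows, for illustration
) -> List[Tuple[int, int]]:
    """Return half-open windows [start, end) spanned by fewer than `span` molecules.

    A molecule (m_start, m_end) spans a window if it covers it entirely.
    """
    flagged = []
    for start in range(0, scaffold_length - window + 1, step):
        end = start + window
        spanning = sum(
            1 for m_start, m_end in molecules if m_start <= start and m_end >= end
        )
        if spanning < span:
            flagged.append((start, end))
    return flagged


# Toy scaffold of 20 kbp: 25 molecules cover the left half and 25 cover the
# right half, but no molecule crosses the junction near 10 kbp.
toy_molecules = [(0, 10_000)] * 25 + [(10_500, 20_000)] * 25
flagged = poorly_supported_windows(toy_molecules, scaffold_length=20_000)
print(flagged)
```

In this toy example only the windows that overlap the junction are flagged, which is the kind of region where a cut would be considered. A larger window requires molecules to cover more sequence to count as spanning, so it tends to flag more regions, consistent with the larger number of cuts reported for window = 2000 than for window = 1000 in the simulated data.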

Fig. 7

Effect of varying the window and span parameters on scaffold NGA50 and misassemblies of three assemblies of HG004

We simulated 434 million 2 × 250 paired-end and 350 million 2 × 125 mate-pair read pairs using wgsim of samtools, and we simulated 524 million 2 × 150 linked read pairs using LRSim [28], emulating the HG004 data set. We assembled these reads using ABySS 2.0.2, and applied Tigmint and ARCS as before. The assembly metrics are shown in Table 2. We see similar performance to the real data: a 20% reduction in misassemblies after running Tigmint, and a three-fold increase in NGA50 after Tigmint and ARCS. Since no structural rearrangements are present in the simulated data, each misassembly identified by QUAST ought to be a true misassembly, allowing us to calculate precision and recall. For the parameters used with the real data, window = 2000 and span = 20, Tigmint makes 210 cuts in scaffolds of at least 3 kbp (QUAST does not analyze shorter scaffolds), and corrects 55 misassemblies of the 272 identified by QUAST, yielding precision and recall of \(\text{PPV} = \frac{55}{210} = 0.26\) and \(\text{TPR} = \frac{55}{272} = 0.20\). Altering the window parameter to 1 kbp, Tigmint makes only 58 cuts, and yet it corrects 51 misassemblies, making its precision and recall \(\text{PPV} = \frac{51}{58} = 0.88\) and \(\text{TPR} = \frac{51}{272} = 0.19\), a marked improvement in precision with only a small decrease in recall. The scaffold NGA50 after ARCS is 24.7 Mbp, 1% less than with window = 2000. Since the final assembly metrics are similar, using a smaller value for the window size parameter may avoid unnecessary cuts. Tigmint cannot detect small-scale misassemblies, such as collapsed repeats, and relocations and inversions smaller than a typical molecule.
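The precision and recall above can be checked with a few lines of arithmetic. This is a minimal sketch; the function name `precision_recall` is ours, and it assumes, as the text does, that each corrected misassembly counts as one true-positive cut.

```python
from typing import Tuple


def precision_recall(cuts: int, corrected: int, misassemblies: int) -> Tuple[float, float]:
    """Return (PPV, TPR) for a set of cuts made in the simulated assembly."""
    ppv = corrected / cuts            # fraction of cuts that fixed a misassembly
    tpr = corrected / misassemblies   # fraction of misassemblies that were fixed
    return ppv, tpr


TOTAL_MISASSEMBLIES = 272  # identified by QUAST in the simulated ABySS assembly

# (window, cuts made, misassemblies corrected), as reported in the text.
for window, cuts, corrected in [(2000, 210, 55), (1000, 58, 51)]:
    ppv, tpr = precision_recall(cuts, corrected, TOTAL_MISASSEMBLIES)
    print(f"window={window}: PPV={ppv:.2f} TPR={tpr:.2f}")
```

The two lines printed correspond to PPV = 0.26, TPR = 0.20 for window = 2000 and PPV = 0.88, TPR = 0.19 for window = 1000.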

The primary steps of running Tigmint are mapping the reads to the assembly, determining the start and end coordinates of each molecule, and finally identifying the discrepant regions and correcting the assembly. Mapping the reads to the DISCOVARdenovo + ABySS-Scaffold assembly with BWA-MEM and concurrently sorting by barcode using Samtools [29] in a pipe required 5.5 h (wall-clock) and 17.2 GB of RAM (RSS) using 48 threads on a 24-core hyper-threaded computer. Determining the start and end coordinates of each molecule required 3.25 h and 0.08 GB RAM using a single thread. Finally, identifying the discrepant regions of the assembly, correcting the assembly, and creating a new FASTA file required 7 min and 3.3 GB RAM using 48 threads. The slowest step, mapping the reads to the assembly, could be made faster by using light-weight mapping rather than full alignment, since Tigmint needs only the positions of the reads, not their alignments. NxRepair required 74.9 GB of RAM (RSS) and 5 h 19 min of wall-clock time using a single CPU core, since it is not parallelized.
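The second step, determining the start and end coordinates of each molecule, can be illustrated with a small sketch. It assumes reads are already reduced to (barcode, scaffold, position) tuples and starts a new molecule whenever consecutive reads with the same barcode are more than a maximum distance apart (50 kbp, the default described earlier). The function name, the input representation, and the use of a single position per read are simplifications of ours, not the code used by Tigmint.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Read = Tuple[str, str, int]                 # (barcode, scaffold, position)
Molecule = Tuple[str, int, int, str, int]   # (scaffold, start, end, barcode, reads)


def molecule_extents(reads: Iterable[Read], max_dist: int = 50_000) -> List[Molecule]:
    """Group reads by barcode and scaffold, splitting at gaps larger than max_dist."""
    positions: Dict[Tuple[str, str], List[int]] = defaultdict(list)
    for barcode, scaffold, pos in reads:
        positions[(barcode, scaffold)].append(pos)

    molecules: List[Molecule] = []
    for (barcode, scaffold), pos_list in positions.items():
        pos_list.sort()
        start = prev = pos_list[0]
        count = 1
        for pos in pos_list[1:]:
            if pos - prev > max_dist:
                # Gap too large: close the current molecule and start a new one.
                molecules.append((scaffold, start, prev, barcode, count))
                start = pos
                count = 0
            prev = pos
            count += 1
        molecules.append((scaffold, start, prev, barcode, count))
    return sorted(molecules)


# Toy input with hypothetical barcodes and positions.
toy_reads = [
    ("AAAC", "s1", 1_000),
    ("AAAC", "s1", 30_000),
    ("AAAC", "s1", 200_000),  # more than 50 kbp from the previous read
    ("GGTT", "s1", 5_000),
]
molecules = molecule_extents(toy_reads)
for molecule in molecules:
    print(*molecule, sep="\t")
```

Because only the barcode and position of each read are used, this step illustrates why full alignments are not strictly needed and light-weight mapping could suffice.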

When aligning an assembly of an individual’s genome to a reference genome of its species, we expect to see breakpoints where the assembled genome differs from the reference genome. These breakpoints are caused by both misassemblies and true differences between the individual and the reference. The median number of mobile-element insertions, for example, just one class of structural variant, is estimated to be 1218 per individual [30]. Misassemblies can be corrected by inspecting the alignments of the reads to the assembly and cutting the scaffolds at positions not supported by the reads. Reported misassemblies due to true structural variation will, however, remain. For this reason, even a perfectly corrected assembly is expected to have a number of differences when compared to the reference.
