Rapid fine mapping of causative mutations from sets of unordered, contig-sized fragments of genome sequence

The process of mutation and crossing reduces the density of experimentally induced homozygous polymorphisms relative to heterozygous polymorphisms as distance from the causative mutation increases [7, 8]. We assessed the properties of fragmented Arabidopsis genome assemblies with respect to allele frequency statistics. The length distribution of scaffolds was modelled and followed a log-normal distribution (Additional file 1: Figure S1, R² = 0.796). We therefore generated 1000 simulated Arabidopsis assemblies, with fragment lengths fitted to a log-normal distribution, for both backcross and outcross bulk segregant sequence data. Variants were called against the Arabidopsis genome for both mapping experiments. These variants (as VCF files) were used in the simulations to generate both the fragmented assemblies and the associated variant calls for both mapping populations (see Methods for further details).

We defined a straightforward homozygosity enrichment score (HMES) as the adjusted ratio of homozygous to heterozygous variants: HMES = (α + ρ)/(β + ρ), where α is the number of homozygous variants on the selected fragment, β is the number of heterozygous variants on the selected fragment, and ρ is a ratio adjustment factor included to avoid a zero numerator or division by zero; we used ρ = 0.5. For each fragment, HMES serves as an estimator of the nearness of the fragment to the causative mutation.

The density of fragment HMES is plotted against a continuous genome in Fig. 1 for each of the 1000 simulations. The outcross (Fig. 1a) and backcross (Fig. 1b) data both show strong enrichment of HMES around the causative mutation, though the backcross data also show strong peaks elsewhere, a consequence of the lower density of SNPs resulting from mutagenesis. In our outcross simulations, the fragment carrying the causative mutation had a mean HMES of 12.24 (median 10.45) and on average ranked in the top 18.8% (median 13.5%) of HMES scores (Fig. 1c, d). Fragment length in our simulations ranged from 0.5 kb to 493 kb, with a median of 27.5 kb. Together, these statistics indicate that HMES is a reliable way to identify the fragments most likely to lie near or carry the causative mutation.
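As a minimal illustration of the HMES definition above, the following Python sketch computes the score from per-fragment variant counts. The function name `hmes` and the example counts are hypothetical and are not taken from CHERIPIC; only the formula and ρ = 0.5 come from the text.

```python
def hmes(n_hom: int, n_het: int, rho: float = 0.5) -> float:
    """Homozygosity enrichment score: (alpha + rho) / (beta + rho).

    n_hom (alpha): homozygous variants on the fragment.
    n_het (beta):  heterozygous variants on the fragment.
    rho:           ratio adjustment factor (0.5 in the text).
    """
    return (n_hom + rho) / (n_het + rho)


# Hypothetical counts (homozygous, heterozygous), for illustration only.
example_counts = {
    "fragment_A": (12, 0),  # many homozygous variants, no heterozygous ones
    "fragment_B": (3, 7),   # mostly heterozygous variants
    "fragment_C": (0, 0),   # no variants at all
}

scores = {name: hmes(hom, het) for name, (hom, het) in example_counts.items()}
```

With ρ = 0.5, a fragment carrying no variants scores exactly 1, which is consistent with the HMES > 1 threshold used for ranking in Fig. 1d, and a fragment with homozygous variants only still receives a finite score.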

Fig. 1

The region of the genome harbouring the candidate mutation is defined by the assembled fragments with enriched homozygosity. (a, b) Density plots of fragment HMES from 1000 iterations for (a) outcross and (b) backcross data, plotted along the entire Arabidopsis genome. Contigs from chromosomes 1 to 5 are arranged sequentially. The location of the causative mutation is highlighted with a red vertical dashed line. (c, d, e) Boxplots showing the distribution of (c) HMES, (d) HMES rank, presented as a percentage of all fragments with HMES > 1, and (e) length of the fragment carrying the candidate mutation, from 1000 iterations.

We developed a rapid HMES sorting algorithm that arranges unordered fragments into a sequence representing distance from the causative mutation, though not necessarily their original order in the genome (Algorithm 1). In broad terms, the algorithm places the fragment with the largest HMES at the centre, the second largest to its left, the third largest to its right, the fourth largest at the extreme left, and so on. The result is a rough ordering of the genomic fragments in which nearness to the centre of the ordering increases the likelihood that a fragment carries the causative mutation (Additional file 1: Figure S2).
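The centre-out arrangement described above can be sketched in a few lines of Python. This is an illustrative reading of the prose description, not a reproduction of Algorithm 1; the function name `centre_out_order` and the example scores are hypothetical.

```python
from collections import deque


def centre_out_order(scores: dict) -> list:
    """Arrange fragment ids so the highest HMES sits at the centre.

    Fragments are ranked by descending score; the top fragment is placed
    first, then subsequent fragments alternate left, right, left, ...
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    arranged = deque()
    for rank, fragment in enumerate(ranked):
        if rank % 2 == 1:
            arranged.appendleft(fragment)  # 2nd, 4th, ... go to the left end
        else:
            arranged.append(fragment)      # 1st, 3rd, 5th, ... go to the right end
    return list(arranged)


# Hypothetical HMES values, for illustration only.
example_scores = {"a": 25.0, "b": 9.0, "c": 3.0, "d": 1.0, "e": 0.2}
ordering = centre_out_order(example_scores)
```

For the five example fragments, the ranks end up arranged as 4th, 2nd, 1st, 3rd, 5th, so the score falls off on both sides of the centre. A double-ended queue keeps each placement constant-time, so the cost is dominated by the initial sort.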

The output from CHERIPIC is a tab-delimited text file containing the following information about each selected variant: HMES, allele frequency, contig length, contig ID, variant position in the contig, reference base, coverage, read bases, base qualities, sequence to the left of the variant, variant allele, and sequence to the right of the variant. The left and right sequences are provided to make marker design easy, and their length can be adjusted by the user to retrieve enough sequence information.
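A downstream script might read such a file as sketched below. The column names in `COLUMNS` are hypothetical labels that follow the field order listed above; the sketch assumes one variant per line, no header row, and exactly twelve tab-separated fields, none of which is confirmed by the text. The sample record is invented for illustration.

```python
import io

# Hypothetical labels, in the field order listed in the text.
COLUMNS = [
    "hmes", "allele_frequency", "contig_length", "contig_id",
    "position", "ref_base", "coverage", "read_bases", "base_qualities",
    "left_sequence", "variant_allele", "right_sequence",
]


def read_variants(handle):
    """Parse tab-delimited variant records into a list of dicts.

    Lines are split on tabs directly rather than with the csv module,
    because quality strings may contain quote characters.
    Assumes no header row (an assumption, not a documented fact).
    """
    records = []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        record = dict(zip(COLUMNS, fields))
        record["hmes"] = float(record["hmes"])
        records.append(record)
    return records


# Invented sample record, for illustration only.
sample = "25.0\t1.0\t27500\tcontig_42\t1033\tG\t4\t....\tII\"I\tACGTAC\tA\tTTGACA\n"
records = read_variants(io.StringIO(sample))
```

Sorting `records` by the `hmes` key in descending order would then give a simple candidate list for marker design.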
