GenESysV: a fast, intuitive and scalable genome exploration open source tool for variants generated from high-throughput sequencing projects

Bioinformatics

The advent of high-throughput sequencing technologies has greatly accelerated the identification of variants that underlie Mendelian and complex diseases [1–7]. With the cost of sequencing decreasing and sequencing accuracy improving, a growing number of research laboratories and projects have adopted these technologies to interrogate variants in anywhere from a few to thousands of human samples, in an attempt to identify variants that may underlie rare monogenic or common complex diseases. Of the millions of variants typically found in any given individual, most likely contribute only to human population diversity. Identifying the subset of variants most likely to underlie the disease or trait of interest requires domain knowledge combined with software tools that facilitate the process. The typical workflow for selecting candidate disease-causing variants starts with variant annotation using tools such as the Ensembl Variant Effect Predictor (VEP) [8] or Annovar [9]. Variants are then filtered by minor allele frequency and by other criteria such as the functional consequences for the genes and transcripts they affect, conservation scores [10, 11], predicted pathogenicity scores [12, 13], and known associations with disease phenotypes [14–16]. After these filtering steps, a short list of candidate disease-causing genes or variants can be produced and reviewed by domain experts for downstream validation.
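The filtering stage of this workflow can be illustrated with a minimal Python sketch. It is not taken from any of the tools cited here: the `AnnotatedVariant` record, the `prioritize` function, the gene names, the set of consequence terms treated as damaging, and the default thresholds are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class AnnotatedVariant:
    """Hypothetical record holding a few annotations for one variant."""
    gene: str
    consequence: str                    # Sequence Ontology style term
    minor_allele_freq: Optional[float]  # None if absent from population data
    pathogenicity: Optional[float]      # illustrative 0..1 score, higher = worse


# Consequence terms treated as potentially damaging (an illustrative choice).
DAMAGING = {
    "stop_gained",
    "frameshift_variant",
    "missense_variant",
    "splice_donor_variant",
}


def prioritize(
    variants: Iterable[AnnotatedVariant],
    max_maf: float = 0.01,
    min_score: float = 0.5,
) -> List[AnnotatedVariant]:
    """Keep rare variants with a damaging consequence and a high enough score."""
    kept = []
    for v in variants:
        # Drop variants that are common in the population.
        if v.minor_allele_freq is not None and v.minor_allele_freq > max_maf:
            continue
        # Drop variants whose predicted consequence is not in the damaging set.
        if v.consequence not in DAMAGING:
            continue
        # Drop variants with a score below the cutoff; keep unscored ones.
        if v.pathogenicity is not None and v.pathogenicity < min_score:
            continue
        kept.append(v)
    return kept


variants = [
    AnnotatedVariant("GENE_A", "missense_variant", 0.0005, 0.9),
    AnnotatedVariant("GENE_B", "synonymous_variant", 0.0002, None),
    AnnotatedVariant("GENE_C", "stop_gained", 0.20, None),
    AnnotatedVariant("GENE_D", "frameshift_variant", None, None),
]
print([v.gene for v in prioritize(variants)])  # ['GENE_A', 'GENE_D']
```

In practice the thresholds and the list of consequence terms depend on the disease model under study, which is why the final short list is still reviewed by experts.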

Because a sequencing project typically identifies a very large number of variants, retrieving variants of interest based on the above criteria generally requires writing custom scripts to process files in VCF [17] format, the de facto standard for reporting genetic variants from high-throughput sequencing or genotyping projects. Given the importance of identifying these variants, it is not surprising that a number of software tools have been developed in the past few years. These include commercial packages such as Ingenuity Variant Analysis from QIAGEN (www.qiagen.com/ingenuity), VarSeq (http://goldenhelix.com/products/VarSeq/index.html) and Sequence Miner (https://www.wuxinextcode.com), as well as several open source tools, such as GEMINI [18], BrowseVCF [19], VCF-miner [20], Mendel,MD [21] and BiERapp [22].
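To give a sense of what such a custom script involves, the following Python sketch reads the eight fixed, tab-separated columns of VCF data lines and keeps rare variants that pass the caller's filters. It is an illustration under stated assumptions: the helper names (`parse_info`, `iter_vcf_records`, `rare_passing`), the example records, and the use of an `AF` key in the INFO column are hypothetical, and real projects usually rely on a dedicated VCF library.

```python
import io


def parse_info(info_field):
    """Split a VCF INFO column into a dict; keys without '=' become flags."""
    info = {}
    if info_field == ".":
        return info
    for entry in info_field.split(";"):
        key, sep, value = entry.partition("=")
        info[key] = value if sep else True
    return info


def iter_vcf_records(handle):
    """Yield one dict per data line, skipping '#' header lines."""
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        chrom, pos, vid, ref, alt, qual, flt, info = line.split("\t")[:8]
        yield {
            "chrom": chrom,
            "pos": int(pos),
            "id": vid,
            "ref": ref,
            "alt": alt.split(","),
            "filter": flt,
            "info": parse_info(info),
        }


def rare_passing(handle, max_af=0.01):
    """Return (chrom, pos) for PASS variants whose AF values are all <= max_af."""
    hits = []
    for rec in iter_vcf_records(handle):
        if rec["filter"] != "PASS":
            continue
        # AF may hold one value per alternate allele, separated by commas.
        afs = [float(x) for x in str(rec["info"].get("AF", "0")).split(",")]
        if max(afs) <= max_af:
            hits.append((rec["chrom"], rec["pos"]))
    return hits


# Made-up records used only to exercise the functions above.
EXAMPLE_VCF = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t12345\t.\tA\tG\t50\tPASS\tAF=0.002;DB\n"
    "1\t22222\t.\tC\tT\t60\tPASS\tAF=0.35\n"
    "2\t33333\t.\tG\tA\t20\tLowQual\tAF=0.001\n"
)
print(rare_passing(io.StringIO(EXAMPLE_VCF)))  # [('1', 12345)]
```

A line-by-line scan like this is workable for a single file but has to be rerun for every new query, which is part of the motivation for loading variants into a database.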

While supporting genomics projects undertaken by the Buffalo Institute for Genomics and Data Analytics (https://www.buffalo.edu/genomics.html), we surveyed existing open source software to find a package that would meet our needs for performance, ease of use, scalability, and controlled access for users to their proprietary data (Table 1). Unfortunately, many of these open source tools are not designed as comprehensive variant exploration tools and cannot handle all of the commonly known disease models, nor are they designed for use by multiple researchers who require secure data storage and access. Furthermore, many of the existing tools lack rapid analysis capability for large cohorts consisting of thousands of samples and hundreds of millions of variants.

Table 1. Comparison of existing open-source software tools with similar functions

| Feature | GenESysV | GEMINI | BrowseVCF | VCF-Miner | Mendel,MD | BiERapp |
|---|---|---|---|---|---|---|
| Graphical User Interface | Yes | No | Yes | Yes | Yes | Yes |
| Study type | Single cohort complex disease, Case/Control, and Mendelian inheritance | Single cohort complex disease and Mendelian inheritance | Single cohort complex disease and Mendelian inheritance | Single cohort complex disease and Mendelian inheritance | Mendelian only | Single cohort complex disease, Case/Control, and Mendelian inheritance |
| Whole genome, exome or targeted study | All | All | All | All | WES or targeted study | WES or targeted study |
| Can handle studies with large numbers of samples | Yes | Yes | No | No | No | No |
| Database type | Elasticsearch | SQLite3 | Wormtable & BerkeleyDB | MongoDB | PostgreSQL | SQLite & MongoDB |
| Flag variants for further filtering | Yes | No | No | No | No | No |

To overcome these limitations, we developed GenESysV, an open source software system with an intuitive user interface. GenESysV can be deployed on a single computer or on a multi-node computer cluster, enabling researchers with varying computational skills to explore and prioritize variants in both coding and non-coding regions of the human genome. It scales to studies with thousands of samples while still delivering satisfactory data loading and querying performance. Below, we describe its design, features and performance benchmarks.
