iRODS metadata management for a cancer genome analysis workflow


Numerous NGS workflows have been adapted to HPC systems using various methods. For example, HugeSeq [18] detects and annotates genetic variations by applying a MapReduce [19] approach, NGSANE [20] uses bash scripting with extensible logging and checkpointing measures, and SIMPLEX [21] offers a fully functional VirtualBox image to reduce installation issues. While these publications describe the analysis process in detail, few of them consider the requirements of data security or the framework needed to make the results as well as the corresponding metadata available for further dissemination. The WEP [22] pipeline for whole exome processing addresses the latter shortcoming by storing result metadata in a self-developed MySQL database with a PHP-based web interface. What is missing is a comprehensive data management system that encompasses the input data, the results and the metadata within a secure and reliable framework. Even though the metadata delivers necessary information, the underlying files should also be stored in a controlled environment, so that both can be retrieved at a moment’s notice.

There are organizations that have employed iRODS for their NGS workflows, namely the Wellcome Trust Sanger Institute [23], Broad Institute, Genome Center at Washington University, Bayer HealthCare, and University of Uppsala (private communication); most recently, the University of Arizona has developed a widely integrated cloud solution for NGS data processing and analysis [24] which is partly based on iRODS. However, they mainly use iRODS to manage, store and retrieve data (e.g., alignment files). In contrast, by encompassing our workflow with iRODS, we can not only store and annotate the input data with relevant information but also parse the results and make them available for queries through self-defined metadata within a single system.

The described pipeline automation and integration with iRODS enables scientists, including those working in large organizations, to keep track of their data in an efficient and secure manner. By employing verifiable data schemata, we can enforce metadata consistency and build a hierarchical structure within iRODS’ virtual file space that places files in predefined locations. This structure provides a straightforward means to narrow down data searches and also makes the mapping of user permissions easier to manage. The ability to restrict access to certain projects or file groups is especially relevant in the clinical context, where patient data is involved. We have decided to rely on iRODS’ authentication so that it manages the contents in their entirety, rather than using it solely as a metadata provider. To this end, we have also tightened security and restricted its services to a virtual machine and a resource server within the cluster. The latter benefits from the cluster’s low-latency, high-bandwidth network.
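The idea of a verifiable schema that both enforces metadata consistency and determines a file's location can be sketched as follows. This is a minimal illustration in plain Python, not the implementation described here: the attribute names, the allowed values, the zone prefix `/exampleZone/home`, and the functions `validate` and `collection_path` are all assumptions made for the example, and no iRODS API is called.

```python
from pathlib import PurePosixPath

# Hypothetical schema: attribute name -> set of allowed values,
# or None when any non-empty string is acceptable.
SCHEMA = {
    "project": None,
    "patient_id": None,
    "tissue": {"tumor", "normal"},
    "data_type": {"fastq", "bam", "vcf"},
}


def validate(metadata):
    """Return a list of schema violations; the list is empty if valid."""
    errors = []
    for key, allowed in SCHEMA.items():
        value = metadata.get(key)
        if not isinstance(value, str) or not value:
            errors.append(f"missing attribute: {key}")
        elif allowed is not None and value not in allowed:
            errors.append(f"invalid value for {key}: {value}")
    for key in metadata:
        if key not in SCHEMA:
            errors.append(f"unknown attribute: {key}")
    return errors


def collection_path(metadata, zone="/exampleZone/home"):
    """Derive the predefined location of a file from its validated metadata."""
    errors = validate(metadata)
    if errors:
        raise ValueError("; ".join(errors))
    return str(
        PurePosixPath(
            zone,
            metadata["project"],
            metadata["patient_id"],
            metadata["tissue"],
            metadata["data_type"],
        )
    )


sample = {
    "project": "projA",
    "patient_id": "P001",
    "tissue": "tumor",
    "data_type": "bam",
}
print(collection_path(sample))
```

Because the path is computed only from validated attributes, every file with the same project and patient lands under the same branch of the hierarchy, which is what makes permissions per project or file group straightforward to assign.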

The inclusion of both input and output data with matching descriptions has resulted in a comprehensive system that allows users to retrieve analysis results and compare them with their underlying sources.

Efficient use of storage and computing resources

A common use case in cancer genomics is to compare NGS data of a tumor specimen against data of a normal tissue specimen, both collected from the same patient. Thus we often process pairs of tumor and normal data. During the course of the patient’s treatment, additional tumor specimens are sometimes collected, sequenced, and subsequently compared against the previously collected normal data (e.g., in order to understand the tumor evolution). A common problem in practice is that redundant copies of normal data files are made for each tumor data analysis; likewise, multiple redundant copies often arise when different projects work on the same input data. A proper clean-up of such redundant data is time-consuming and often omitted. While this can be addressed in principle by establishing organizational data storage and processing policies alone, the proper execution of such a policy is a clear benefit of our implementation of the complete workflow.

Additionally, in order to minimize the use of computing resources, we distinguish between two use cases for running the cancer genome analysis pipeline: processing a pair of tumor and normal data, or processing new tumor data with respect to previously analyzed normal data. In the latter case, the processing results of the normal data are retrieved from the vault and reused for the current analysis.
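The distinction between the two use cases can be illustrated with a short sketch. It assumes a hypothetical lookup table that maps normal sample identifiers to the location of their stored results; the function name `plan_analysis`, the step labels, and the example paths are invented for illustration and do not reflect the actual pipeline interface.

```python
def plan_analysis(tumor_sample, normal_sample, stored_normal_results):
    """Return the ordered processing steps for one tumor/normal comparison.

    stored_normal_results is a hypothetical dict mapping a normal sample
    identifier to the location of its previously computed results.
    """
    steps = []
    if normal_sample in stored_normal_results:
        # Normal data was analyzed before: fetch the results, do not recompute.
        steps.append(("retrieve_normal", stored_normal_results[normal_sample]))
    else:
        # First analysis for this patient: process the normal data as well.
        steps.append(("process_normal", normal_sample))
    steps.append(("process_tumor", tumor_sample))
    steps.append(("compare", tumor_sample, normal_sample))
    return steps


# Hypothetical contents of the result store.
stored = {"P001-N": "/exampleZone/home/projA/P001/normal/results"}

# New tumor specimen for a patient whose normal data is already analyzed.
print(plan_analysis("P001-T2", "P001-N", stored))

# First tumor/normal pair of another patient.
print(plan_analysis("P002-T1", "P002-N", stored))
```

In the first call the normal data is never reprocessed or copied, so both computing time and storage are saved when a further tumor specimen of the same patient arrives.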
