First step toward gene expression data integration: transcriptomic data acquisition with COMMAND>_


Workflow

COMMAND>_ is a multi-user application. From the web interface it is possible to create users and groups and to grant privileges. Admin users have unlimited access, while normal users may be restricted to working only on specific compendia and/or with a subset of functionalities. The typical workflow can be divided into three steps: i) searching and downloading experiment data, ii) parsing the downloaded files, iii) previewing and importing experiment data into the local database (Fig. 3).

Fig. 3

The flowchart of the typical workflow. Users start by searching and downloading experiments from public databases, or by uploading files for local experiments. Experiment files are then associated with parsing scripts and the parsing phase runs in the background. Once an experiment has been parsed, it can be imported into the database and, if necessary, probes can be mapped to genes

A mandatory prerequisite for these steps is to first establish the genomic background for the expression data by uploading a FASTA file with gene sequences. Users can then import experiments, starting by searching and downloading them from public databases or by uploading local files (Fig. 4a). The databases currently supported are NCBI GEO [16], SRA [17] and EBI ArrayExpress [18]. Once the search has been performed, users can select one or more experiments and start the download process. Compressed files are automatically extracted into a temporary folder.
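
COMMAND>_ performs these searches through its web interface. Purely as an illustration of the kind of query involved, the minimal Python sketch below uses the public NCBI E-utilities esearch endpoint to look up GEO series matching a term; the endpoint, the db=gds database and the GSE[ETYP] filter are standard E-utilities conventions, but the snippet is not part of COMMAND>_ itself.

```python
import requests

# Minimal sketch: search NCBI GEO DataSets (db=gds) for GSE series matching a
# term via the public E-utilities esearch endpoint. Illustrative only;
# COMMAND>_ performs this kind of search through its own interface.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def search_geo_series(term, retmax=20):
    params = {
        "db": "gds",                      # GEO DataSets database
        "term": f"{term} AND GSE[ETYP]",  # restrict results to GSE series
        "retmax": retmax,
        "retmode": "json",
    }
    response = requests.get(EUTILS, params=params, timeout=30)
    response.raise_for_status()
    return response.json()["esearchresult"]["idlist"]

if __name__ == "__main__":
    print(search_geo_series("COPD small airway epithelium"))
```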

Fig. 4

a The download experiment dialog. From this dialog it is possible to search for experiments in GEO, SRA or ArrayExpress and download all the related files. b The file assignment dialog. From this dialog users assign scripts to experiment files in order to parse the information. c The probe to gene mapping dialog. This dialog allows users to perform the alignment and the two-step filtering used to re-annotate microarrays by mapping probes to genes

The pivotal step is the assignment of downloaded files, together with parsing scripts, to entities (experiment, platforms and samples) so that only the relevant information is extracted (Fig. 4b). The scripts can be created or modified directly within the interface and are responsible for parsing input files and populating each part of the data model, i.e. measurement data and meta-data for the experiment, platforms and samples. Once scripts are assigned to downloaded files, they run independently and the results can be inspected using the preview interface. If the experiment appears to be complete, it can be imported into the database. Any error occurring while parsing or importing the experiment is reported in the system log.

When a new microarray platform is imported, its probes need to be mapped to genes. The probe to gene mapping is a fundamental process carried out by performing a BLAST+ [19] alignment followed by a two-step filtering (Fig. 4c). The alignment may take a long time for platforms with many probes, especially when using the short-blastn option, and its result cannot be used as-is for the probe to gene mapping. Bad alignments need to be filtered out in order to retain only the most plausible ones, i.e. the alignments that most likely represent the “true” mRNA-to-probe hybridization. The filtering step is usually fast and can be performed several times on the same alignment result to test different threshold choices. The two-step filtering mitigates the side effects of a simpler single filter (Fig. 5): it guarantees that probes map to genes with high similarity (restrictive alignment threshold) while also mapping unambiguously to a unique position, avoiding cross-hybridization issues in the measurements (less restrictive alignment threshold). Since probes coming from different microarrays generally differ in length, origin and sequence quality, parameters and cut-off thresholds can be adjusted to obtain the best possible results according to each platform’s characteristics and the user’s needs. The probe to gene mapping step has the advantage of enhancing data homogeneity, since all microarray platforms are annotated against the same gene list (i.e. the same genomic background represented by the FASTA file with gene sequences uploaded during the initial step). Moreover, annotating the microarray with the latest available data is often preferable, since it may improve the interpretation of the expression data [20, 21]. If probe sequences are not available, or if relying on the default annotation is more appropriate, it is possible to manually associate probes to genes using, for example, the manufacturer annotation (gene identifiers). All parameters and the re-annotation itself are stored in COMMAND>_, so that the procedure is completely reproducible.
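
The alignment itself boils down to standard BLAST+ calls. Purely as a sketch of those calls (file names and the permissive identity value below are placeholders, not COMMAND>_ defaults; the blastn-short task and the tabular output format are standard BLAST+ options), the procedure could be scripted as:

```python
import subprocess

# Illustrative sketch of the BLAST+ steps behind a probe-to-gene alignment.
# File names and thresholds are placeholders, not COMMAND>_ defaults.
GENES_FASTA = "genomic_background_genes.fasta"  # gene sequences of the genomic background
PROBES_FASTA = "platform_probes.fasta"          # probe sequences of the microarray platform

# 1) Build a nucleotide BLAST database from the gene sequences.
subprocess.run(
    ["makeblastdb", "-in", GENES_FASTA, "-dbtype", "nucl", "-out", "genes_db"],
    check=True,
)

# 2) Align the short probe sequences against it; tabular output (outfmt 6)
#    is convenient for the subsequent filtering step.
subprocess.run(
    [
        "blastn",
        "-task", "blastn-short",   # tuned for short queries such as probes
        "-query", PROBES_FASTA,
        "-db", "genes_db",
        "-outfmt", "6",
        "-perc_identity", "90",    # kept permissive here; the real filtering happens later
        "-out", "probe_alignments.tsv",
    ],
    check=True,
)
```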

Fig. 5

Advantages of two-step filtering. The five probe sequences (yellow, orange, green, purple and blue) each align twice, with different scores. Assuming that, for this example, a reasonable threshold is 95% or more, and that for a probe to be considered uniquely aligned the score difference between its two alignments should be more than 3%, we consider three different scenarios. The leftmost one (scenario a) presents the case of a single-step filtering with a threshold equal to 95%. In this case the probes selected as unique are the orange, green and purple ones. While the first two are compliant with our assumptions, the purple one should be discarded since it aligns twice with scores of 95 and 94% respectively. To avoid this situation we might adjust the threshold by raising it to 96%, as depicted in scenario b. The problem in this case is that we would miss the green probe. Using a two-step filtering, as shown in scenario c, avoids both of these unwanted situations: the first filter (at 94%) retains both purple probe alignments (which are then discarded since the alignment is not unique) as well as the blue probe, which is subsequently removed by the second filter at 95%, leaving only the orange and green probes as expected
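
To make the logic of Fig. 5 concrete, the short sketch below reproduces scenario c in plain Python: a permissive first pass keeps all plausible alignments, probes that still align in more than one place are discarded as ambiguous, and a stricter second pass is applied to the survivors. The per-probe scores are assumed values chosen to be consistent with the caption; the snippet illustrates the filtering idea and is not COMMAND>_'s actual implementation.

```python
# Two-step filtering as in Fig. 5, scenario c. Each probe aligns twice;
# the scores below are assumed values consistent with the caption.
alignments = {
    "yellow": [93.0, 90.0],
    "orange": [98.0, 80.0],
    "green":  [95.0, 85.0],
    "purple": [95.0, 94.0],
    "blue":   [94.0, 70.0],
}

SENSITIVE_THRESHOLD = 94.0  # first, permissive pass
STRICT_THRESHOLD = 95.0     # second, restrictive pass

kept = {}
for probe, scores in alignments.items():
    # Step 1: keep every alignment passing the permissive threshold.
    candidates = [s for s in scores if s >= SENSITIVE_THRESHOLD]
    # Probes still aligning in more than one place are ambiguous: discard them.
    if len(candidates) != 1:
        continue
    # Step 2: the single remaining alignment must also pass the strict threshold.
    if candidates[0] >= STRICT_THRESHOLD:
        kept[probe] = candidates[0]

print(kept)  # {'orange': 98.0, 'green': 95.0}
```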

In the case of RNA-seq platforms, these steps are not necessary, since the imported measurements (raw counts) relate directly to genes of the defined genomic background without the need for reporters such as probes, as in microarray experiments. Once FASTQ files are downloaded and associated with samples, the user needs to create the index file for the genomic background to be used by the alignment program. A FASTA file with the gene sequences imported by the user is automatically created and placed in the experiment directory to be used as the target for the index creation script. By default, FASTQ files are then trimmed using Trimmomatic [22] and expression levels are quantified using Kallisto [23]. Users who wish to use different programs can copy them to the COMMAND>_ directory and write a Python wrapper script to run them.
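
The wrapper interface expected by COMMAND>_ (arguments, output locations) is described in its documentation; as a rough sketch of the idea only, a wrapper shelling out to an alternative quantifier such as Salmon could look like the following, where the argument handling and file paths are placeholders:

```python
#!/usr/bin/env python
"""Rough sketch of a wrapper script for an alternative RNA-seq quantifier.

The argument handling and paths are placeholders; the interface actually
expected by COMMAND>_ is described in its documentation.
"""
import subprocess
import sys

def quantify(index_path, output_dir, fastq_files, threads=4):
    # Example: run Salmon in mapping-based mode on single-end reads.
    # Any other quantifier could be wrapped in the same way.
    cmd = [
        "salmon", "quant",
        "-i", index_path,
        "-l", "A",            # let Salmon infer the library type
        "-r", *fastq_files,   # single-end FASTQ files
        "-p", str(threads),
        "-o", output_dir,
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    index, out_dir, *fastqs = sys.argv[1:]
    quantify(index, out_dir, fastqs)
```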

The three steps described in the workflow are specific to gene expression data, but would be the same for other kinds of quantitative data. For example, exon or small-RNA sequencing data could easily be imported by adopting a different genomic background, i.e. by uploading a FASTA file with exon or small-RNA sequences respectively. The importance of the genomic background definition lies in the fact that it establishes exactly what is measured by the imported experiments. Considering that the genomic background should not change during data collection, not all quantitative data are equally suitable for import into COMMAND>_. For example, metagenomic experiments, in which Operational Taxonomic Units (OTUs) change from sample to sample (and even more from experiment to experiment), would not be an ideal type of quantitative data to collect.

Comparison with similar tools

To the best of our knowledge there are no other tools that offer all the options COMMAND>_ does. Nevertheless, we report the main differences between COMMAND>_ and other similar tools (see Table 1). GEOquery [24] is a package written for the R programming language (http://www.R-project.org/) that allows R users to easily connect to GEO and retrieve, parse and extract expression data ready to be used in downstream analyses. The ArrayExpress [25] R package works similarly to GEOquery, but for the ArrayExpress database. GEOmetadb [26] and SRAdb [27] both allow the user to query GEO and SRA from within the R environment, but they require downloading an SQLite file that contains the complete GEO/SRA metadata. Compendiumdb [28] is an R package used to parse and store expression information in a relational database that can be queried from within the R environment for subsequent analyses. VirtualArray [29] is another R package, used to combine raw data from diverse microarray samples (or experiments) into a single object for further analysis. It also implements several batch effect removal methods, but it is not available for the latest Bioconductor versions. Microarray retriever [30] is a web application used to query and download expression data from both GEO and ArrayExpress, but it is currently unavailable.

Table 1

Functionalities comparison between COMMAND>_ and other tools used to collect gene expression data from public databases

[Table 1 compares COMMAND>_, GEOquery, ArrayExpress, GEOmetadb, SRAdb, compendiumdb, virtualArray and Microarray retriever, scored YES/NO across nine functionalities. Notes: GEOmetadb and SRAdb require the download of an SQLite file with meta information; virtualArray also allows normalizing data and correcting for batch effects, but is available only for Bioconductor <= 2.14; Microarray retriever is currently unavailable]

The main difference between COMMAND>_ and all of these tools, except for Microarray retriever, is that they are Bioconductor packages and run within the R programming environment. The advantage is that Bioconductor is a robust and reliable environment in which different packages can be combined to perform a vast range of analyses. However, despite being a great environment for data analysis, R and Bioconductor are not meant for data retrieval, and managing large amounts of data can be problematic: without specific packages such as parallel, R programs are by default single-threaded, and data are kept entirely in RAM, so they do not easily scale to large datasets.

COMMAND>_ has been developed with the specific goal of simplifying this part. It relies on a relational database (PostgreSQL) and a task queue system (Celery) to scale easily as the number of experiments grows. In this regard, COMMAND>_ and R might be thought of as complementary tools, with R used to analyse the datasets collected using COMMAND>_. In COMMAND>_ many operations can be carried out from the Graphical User Interface (GUI), such as the re-annotation tool, which allows the user to produce an optimized annotation instead of relying on the default one. It is important to highlight that the re-annotation step allows perfect reproducibility of the analysis, since all parameters are stored within COMMAND>_. Finally, despite being a graphical tool offering a friendly user experience, COMMAND>_ gives the same flexibility as a command-line environment to handle all possible situations through its Python editor.
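
As an illustration of how a task queue decouples long-running jobs from the web interface (not COMMAND>_'s actual code: the broker URL and the run_parsing_scripts helper below are assumptions), a Celery task for a parsing job could look roughly like this:

```python
from celery import Celery

# Illustrative sketch of off-loading a long-running parsing job to a Celery
# worker so that the web interface stays responsive. The broker URL and the
# run_parsing_scripts helper are assumptions, not COMMAND>_'s internals.
app = Celery("command_tasks", broker="amqp://localhost//")

def run_parsing_scripts(experiment_id):
    # Placeholder for the real work: execute the user-assigned parsing
    # scripts on every file of the given experiment.
    print(f"parsing files of experiment {experiment_id}")

@app.task
def parse_experiment(experiment_id):
    run_parsing_scripts(experiment_id)

# The web layer only enqueues the job and returns immediately:
#   parse_experiment.delay("GSE8545")
```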

Case study

To demonstrate COMMAND>_ functionalities, we present several case studies available within the on-line documentation. Moreover, we use it here for searching, downloading, parsing, re-annotating and exporting a collection of small airway samples from patients affected by Chronic Obstructive Pulmonary Disease (COPD) [31]. The original study is a collection of 273 samples from three Affymetrix microarray experiments retrieved from the Gene Expression Omnibus (GEO): GSE8545, GSE20257 and GSE11906. We start by retrieving the GEO experiments used in the study through the “Download Experiment From Public Database” dialog, with the GSE Series ID as search term and GEO as database (Fig. 4a). Before starting to parse the experiments we need to import the gene sequences to be used for the probe mapping step.

The parsing procedure starts by selecting one experiment and pressing the “Parse/Import experiment” button. The parsing interface is divided into three collapsible sections: the top one shows the experiment data preview, the middle one contains the experiment files browser and the assignment tool used to couple parsing scripts and experiment files, while the bottom section is the Python editor. Having the original probe sequences is highly encouraged in order to take advantage of the probe to gene mapping functionality. Since probe sequences are not included in this experiment, we have to download them separately from the Affymetrix support site and upload them into COMMAND>_ using the “Upload file” button.

Once all files are in place we are ready to start assigning parsing scripts to the experiment files. Since we do not need to change any information related to the experiment entity, we start with the platform-related file, i.e. HGU133Plus2_Hs_ENSG_probe_tab. The assignment procedure is itself based on the execution of a Python script, so that a large number of files can be assigned automatically using user-defined rules. For this specific case we tell COMMAND>_ to parse the “HGU133Plus2_Hs_ENSG_probe_tab” file using the “gpr_platform.py” script. To correctly parse the platform file we have to tell “gpr_platform.py” which field names hold the probe id and the probe sequence. The sample files assignment proceeds similarly, by selecting all CEL files (we can filter by file name) and specifying “cel_sample.py” as the script to be used. This time we use the “match_entity_name.py” assignment script so that COMMAND>_ automatically couples CEL files with the corresponding samples (Fig. 4b). The last file to be assigned is the SOFT file, which contains the meta-data for all entities: experiment, platform and samples. Once again we use “assign_all.py” to assign the “soft_experiment.py”, “soft_platform.py” and “soft_sample.py” scripts to experiment, platform and samples respectively. After checking that all the assignments are correct, we are ready to run the parsing scripts. Once the parsing is done, we can inspect the results in the “Preview” section and import the experiment. The same procedure is then repeated for the remaining experiments.
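
Conceptually, what the platform parsing step has to extract from the probe_tab file is just a probe identifier and its sequence for every probe. The stand-alone sketch below shows that extraction with plain Python; the column names are those commonly found in Affymetrix probe_tab files and may need adjusting, and this is not the actual gpr_platform.py script shipped with COMMAND>_.

```python
import csv

# Stand-alone sketch of what the platform parsing step extracts from a
# probe_tab file: one (probe id, probe sequence) pair per probe.
# The column names below are assumptions and may need adjusting; this is
# not the gpr_platform.py script shipped with COMMAND>_.
PROBE_ID_FIELD = "Probe Set Name"
PROBE_SEQ_FIELD = "Probe Sequence"

def read_probe_tab(path):
    probes = {}
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            probes[row[PROBE_ID_FIELD]] = row[PROBE_SEQ_FIELD]
    return probes

if __name__ == "__main__":
    mapping = read_probe_tab("HGU133Plus2_Hs_ENSG_probe_tab")
    print(f"{len(mapping)} probes read")
```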

Once all the raw data are imported into the database, we can map the probes of the GPL570 platform to the human genes we have already imported. This fundamental step consists of two parts: the alignment and the filtering of the alignment results. For the alignment step we can choose a rather stringent identity threshold (such as 95% or 98%), since both probes and genes belong to the same species. The two-step filtering thresholds are set to 95% alignment length, 0 gaps and 3 mismatches for the sensitivity step, and 98% alignment length, 0 gaps and 1 mismatch for the specificity step (Fig. 4c). The chosen thresholds capture the idea that a probe aligning (even imperfectly) on more than one gene is unusable, while a higher minimum alignment quality is required for a probe to be considered reliable. As stated previously, this would not be possible using only a single filter (Fig. 5). In this specific case, choosing different thresholds makes little difference, since probes and genes come from the same organism. This step becomes increasingly relevant when the probes were designed for an organism different from the one used as the genomic background and for which we have the gene sequences, as might be the case for different strains of bacteria or different cultivars of crop plants.
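
As a rough illustration of how such criteria could be applied to BLAST tabular output (not COMMAND>_'s filtering code: the column order is the standard outfmt 6 default, while the file name and the probe_lengths dictionary are placeholders), the two passes might be written as:

```python
from collections import defaultdict

# Rough illustration of a two-step filter over BLAST+ tabular output
# (outfmt 6 columns start with: qseqid sseqid pident length mismatch gapopen).
# Thresholds follow the case study: 95% length / 0 gaps / 3 mismatches for the
# sensitivity pass, 98% length / 0 gaps / 1 mismatch for the specificity pass.

def load_hits(path):
    hits = defaultdict(list)
    with open(path) as handle:
        for line in handle:
            qseqid, sseqid, _pident, length, mismatch, gapopen, *_ = line.rstrip("\n").split("\t")
            hits[qseqid].append((sseqid, int(length), int(mismatch), int(gapopen)))
    return hits

def passes(hit, probe_len, min_len_frac, max_mismatch, max_gap):
    _, length, mismatch, gapopen = hit
    return (length >= min_len_frac * probe_len
            and mismatch <= max_mismatch
            and gapopen <= max_gap)

def two_step_filter(hits, probe_lengths):
    mapping = {}
    for probe, probe_hits in hits.items():
        plen = probe_lengths[probe]  # placeholder: probe lengths from the platform file
        # Sensitivity pass: 95% of the probe length, 0 gaps, up to 3 mismatches.
        sensitive = [h for h in probe_hits if passes(h, plen, 0.95, 3, 0)]
        if len(sensitive) != 1:      # ambiguous probe: discard it
            continue
        # Specificity pass: 98% of the probe length, 0 gaps, at most 1 mismatch.
        if passes(sensitive[0], plen, 0.98, 1, 0):
            mapping[probe] = sensitive[0][0]  # keep the gene (subject) id
    return mapping

# Example usage (file and lengths are placeholders):
#   mapping = two_step_filter(load_hits("probe_alignments.tsv"), probe_lengths)
```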

After the alignment process is complete, we can set the filtering parameters and run the filtering. Once the filtering is done, we can import the probe to gene mapping. Finally, we can export the resulting raw data in both TSV and HDF5 file formats.
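
Downstream analyses can then load the exported matrix directly. For instance, assuming the TSV export is a gene-by-sample matrix with gene identifiers in the first column (the actual layout and file name may differ), it can be read with pandas:

```python
import pandas as pd

# Load the exported TSV as a gene-by-sample matrix for downstream analysis.
# The file name and the assumption that gene ids sit in the first column
# are placeholders; adjust them to the actual export.
expression = pd.read_csv("copd_compendium.tsv", sep="\t", index_col=0)
print(expression.shape)         # (number of genes, number of samples)
print(expression.iloc[:5, :3])  # quick sanity check of the first rows and columns
```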
