Previous large-scale analyses of environmental datasets, such as the Earth Microbiome Project [16, 29] and the Global Ocean Sampling, produced an astounding amount of sequence data from environmental samples collected across the world. Global-scale studies can produce an overwhelming amount of data based on a consistent methodological approach, which in the case of Tara Oceans also relied on shotgun genome sequencing strategies. In general, this makes it possible to collect a large amount of genomic data together with valuable informative content, thus expanding on the targeted sequencing approach previously employed in the Global Ocean Sampling cruises. However, evaluating taxonomic diversity across huge datasets with different sampling sizes requires well-calibrated, robust and appropriate methods to obtain reasonable results. The selection of a sequencing approach is driven by a compromise between sequencing costs and scientific output: the wider information content of “shotgun” metagenomics compared with targeted sequencing may produce novel insights into the distribution of species and their roles in ecosystems, although the definition of OTUs may be inaccurate and capturing the real diversity of prokaryotic taxa may be hard [29, 31].
Rich metadata collections, provided alongside the molecular data in environmental ‘omic studies, add a further layer of complexity to the analysis. Since the number of similar initiatives is growing (e.g. the Earth Microbiome Project [16, 29], the MicroB3 network, the Tara Oceans project [5–7] and the Ocean Sampling Day initiative), we focused on computational approaches that could appropriately exploit the information content of shotgun sequencing, also in relation to environmental sampling metadata.
To explore computational frameworks that could support the appropriate organization of, and access to, these data types, integrating sequence data and environmental metadata, we designed a dedicated platform based on a document-oriented NoSQL database management system (DBMS).
We focused on the management and maintenance of information on 16S rDNA sequences, also considering the possibility of organizing collections of heterogeneous quality, since this type of data may occur more frequently in shotgun metagenomics and nevertheless represents valuable information content that deserves investigation.
The initial setup was implemented on the Tara Oceans dataset, since these data were recently released and still need a deeper and wider exploration, which can also exploit the extensive environmental dataset collected alongside them. We reconstructed putative 16S rDNA contigs by assembling miTAGs, i.e. the 16S sequence tags identified and extracted by the Tara Oceans Consortium. The MEGAHIT assembler was chosen for this purpose because of its ability to resolve 16S micro-diversity, thus potentially making it possible to discriminate among prokaryotic strains differing for identities < 97% in their 16S rDNA sequence. Moreover, the employed strategy preserved abundance information: the global abundance of major prokaryotic taxa was similar to that reported in previous works [33, 34]. Although long contigs represented < 10% of each library considered here, we obtained more than 4000 long, chimera-free contigs, which can be used for more complex queries and might allow the exploration of new hypotheses in microbial ecology [7, 35–39].
The presence of putatively chimeric sequences could be due either to misassemblies (typically caused by the combined effect of the different degrees of variability within the prokaryotic 16S locus and the relatively short length of the raw sequences, which can lead to the assembly of unrelated sequences) or to genuine gene novelty (thus representing prokaryotic 16S ribosomal genes whose phylogenetic placement is still unclear). Although this issue is beyond the scope of the present work, fully characterizing the real diversity underlying such datasets is worthy of future endeavours. New research directions involving phylogenetic diversity analyses on the present dataset might include a more thorough investigation of the taxonomic assignment of short contigs and raw reads, taking into account the composition of conserved vs. hypervariable regions of prokaryotic 16S genes, as well as the presence and characterization of mitochondrial and plastid sequences (which are currently marked as affiliating to either mitochondrial or chloroplast gene sequences according to the SILVA data description).
The BLAST server was implemented using the SequenceServer software. This approach allows us to set up an in-house BLAST service that is accessible from the web, uses standard NCBI BLAST-like input and output formats, and lets users select among different data collections. For example, exploring results from the SILVA database and from the in-house collections in parallel makes it possible to cross-check sequence relationships and to access sequence information together with environmental metadata.
The choice of MongoDB as an alternative to more conventional relational DBMS technologies follows an emerging trend in bioinformatics, as demonstrated by other similar research projects. There are several reasons for this choice: i) the huge size of the dataset, which demands replication and data sharing to guarantee safety and performance, as well as scalability; ii) the inhomogeneity of the data collections, which can be addressed much more easily using a schema-less DBMS; iii) the possibility of querying geospatial data natively; iv) the possibility of quickly re-organizing data collections, allowing for database updates and changes. Since the structure and capabilities of MongoDB meet these requirements, it was chosen as the document database for this study.
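Points ii) and iii) can be illustrated with a minimal sketch. The document layout, the field names (`contig_id`, `taxonomy`, `location`, `depth_m`, `temperature_c`) and the sample values are hypothetical and are not taken from the GLOSSary schema. Only the `$nearSphere`, `$geometry` and `$maxDistance` operators are standard MongoDB query syntax, and they require a `2dsphere` index on the queried field. To keep the example runnable without a database server, a pure-Python haversine filter stands in for the query execution.

```python
import math

# Hypothetical documents: the two records do not share the same fields,
# which a schema-less collection accepts without any migration.
documents = [
    {
        "contig_id": "contig_0001",
        "taxonomy": "Bacteria;Proteobacteria",
        "length": 1450,
        "location": {"type": "Point", "coordinates": [14.25, 40.80]},  # [lon, lat]
        "depth_m": 5,
    },
    {
        "contig_id": "contig_0002",
        "taxonomy": "Bacteria;Cyanobacteria",
        "location": {"type": "Point", "coordinates": [-140.0, -8.9]},
        "temperature_c": 27.1,  # field absent from the first document
    },
]


def near_query(lon, lat, max_km):
    """Build a MongoDB filter for documents within max_km of a point."""
    return {
        "location": {
            "$nearSphere": {
                "$geometry": {"type": "Point", "coordinates": [lon, lat]},
                "$maxDistance": max_km * 1000,  # MongoDB expects metres
            }
        }
    }


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km on a sphere of radius 6371 km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def emulate_near(docs, query):
    """Local stand-in for running the filter on a server (illustration only)."""
    spec = query["location"]["$nearSphere"]
    lon, lat = spec["$geometry"]["coordinates"]
    limit_km = spec["$maxDistance"] / 1000
    return [
        d for d in docs
        if haversine_km(lon, lat, *d["location"]["coordinates"]) <= limit_km
    ]


query = near_query(lon=14.0, lat=41.0, max_km=100)
hits = emulate_near(documents, query)
print([d["contig_id"] for d in hits])
```

With a driver such as PyMongo, the same filter dictionary would be passed to `find` on a collection carrying a `2dsphere` index on `location`. Unlike the local stand-in, the server would also return the matches sorted by distance.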
Platforms similar to the one proposed here are not new to science: complex systems, such as MG-RAST, allow researchers to upload, store, analyse and compare metagenomic samples on a global scale. Ribosomal sequence repositories such as the SILVA database also allow scientists to run queries using a proprietary alignment software to identify similar sequences (for a sequence query-based search), or simply to browse the database and download sequences of interest. However, although such web-based systems have become part of standard practice in both shotgun and targeted metagenomics, none of them allows environmental metadata, which are of paramount importance for ecological studies, to be exploited interactively. For instance, MG-RAST allows users to search for specific samples or projects by means of MIMARKS-based metadata [14, 41] supplied by data providers at submission, but the environmental metadata are not accessible through straightforward queries on sequence data. The SILVA database provides users with details on sequence data processing and production, but does not store any contextual data in the sequence dataset. The MarRef database, in contrast, provides a rich set of metadata concerning both prokaryotic species features and environmental features and allows BLAST searches, but it only hosts a limited amount of sequence data from a few reference species. The most similar, and most recent, implementation of a queryable system exploiting the Tara Oceans data is the Ocean Gene Atlas [43, 44], which allows users to compare their own sequence data with either the Tara Oceans Microbiome Reference Gene Catalog (for prokaryotes) or the Marine Atlas of Tara Oceans Unigenes (for eukaryotes). This service allows the navigation and visualization of user-defined sets of nucleotide or amino acid sequences, which can be explored based on their functional annotation.
GLOSSary, in contrast, allows taxonomy-based analyses of 16S sequences, supporting investigations of phylogenetic diversity based on this marker. To support users, the BLAST server embedded in the GLOSSary platform also allows joint searches against assembled and unassembled Tara Oceans 16S sequences and against those included in the SILVA database, thus supporting comparative analyses of the different outputs in a single step. This is a straightforward way to detect novel tags from the Tara Oceans collection that are not included in the SILVA collection.
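This kind of comparison can be sketched as a post-processing step on tabular BLAST results. The sketch assumes the standard 12-column tabular layout (`-outfmt 6` in BLAST+), a hypothetical `SILVA_` prefix on subject identifiers to tell the two collections apart, and an arbitrary 97% identity cut-off. None of these choices is taken from the GLOSSary implementation, and the function name and example rows are invented for illustration.

```python
import csv
import io

# Standard BLAST+ tabular columns (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def non_silva_only_queries(tabular_text, silva_prefix="SILVA_", min_identity=97.0):
    """Return queries with a close non-SILVA hit but no close SILVA hit.

    "Close" means percent identity >= min_identity. Subjects whose ID starts
    with silva_prefix (a hypothetical naming convention) count as SILVA.
    """
    best = {}  # query ID -> best percent identity per collection
    reader = csv.DictReader(io.StringIO(tabular_text),
                            fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        scores = best.setdefault(row["qseqid"], {"silva": 0.0, "other": 0.0})
        source = "silva" if row["sseqid"].startswith(silva_prefix) else "other"
        scores[source] = max(scores[source], float(row["pident"]))
    return sorted(
        query for query, s in best.items()
        if s["other"] >= min_identity and s["silva"] < min_identity
    )


# Invented example rows: q1 matches only the in-house collection closely,
# q2 matches both collections, q3 matches only SILVA.
rows = [
    ("q1", "TARA_contig_12", "99.2", "400", "3", "0", "1", "400", "10", "409", "1e-150", "700"),
    ("q1", "SILVA_AB000001", "91.0", "400", "36", "0", "1", "400", "5", "404", "1e-90", "450"),
    ("q2", "TARA_contig_7", "98.5", "380", "6", "0", "1", "380", "1", "380", "1e-140", "660"),
    ("q2", "SILVA_CD000002", "99.0", "380", "4", "0", "1", "380", "20", "399", "1e-145", "680"),
    ("q3", "SILVA_EF000003", "99.5", "410", "2", "0", "1", "410", "1", "410", "1e-160", "740"),
]
example = "\n".join("\t".join(r) for r in rows)

print(non_silva_only_queries(example))
```

Percent identity alone ignores alignment length and query coverage, so a real analysis would likely add filters on those columns. Queries with no hits at all do not appear in tabular output and would need to be recovered from the original query list.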
The GLOSSary platform tackles relevant issues in environmental meta “-omics”, starting from the organization of heterogeneous 16S rDNA data and their associated metadata, and favouring efficient queries on large amounts of information and their analysis through suitable graphical approaches. Instead of replicating existing frameworks, GLOSSary also allows BLAST-like sequence search and comparison integrated with a well-established reference database, as well as metadata-informed queries of prokaryotic taxa on a global scale.
Although this initial effort is presented as a framework embedding the Tara Oceans data, its underlying objective is to expand with additional datasets from similar resources, aiming at a comprehensive collection that could support the exploration of prokaryotic taxonomic diversity integrated with its environmental characterization.