What we can see from very small size sample of metagenomic sequences

Bioinformatics

Mock community

The original full sequences obtained from the mock community of 10 strains comprised about 1,220,000 reads, or 12.3 Gb. The GC content calculated from them was 53.1%.

GC content was calculated for 176 different samples, which were selected from the original full sequences by 16 different selection types for each of 11 different sample sizes from 100 to 50,000 reads. The GC content values get closer to 53.1% as the sample size increases (Fig. 1). This can be regarded as supporting evidence that a sufficiently large sample represents the nature of the original full sequences.

Fig. 1

GC content of samples. The x-axis labels give the sample sizes, with one label per selection type; each label therefore represents 4 samples made with 4 different K numbers.
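The GC content comparison between a full read set and its samples can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the two selection functions are hypothetical stand-ins for the paper's selection types, and the reads are randomly generated toy data rather than the mock community sequences.

```python
import random

def gc_content(reads):
    """Return the GC percentage over all bases in an iterable of read strings."""
    gc = total = 0
    for read in reads:
        seq = read.upper()
        gc += seq.count("G") + seq.count("C")
        total += len(seq)
    return 100.0 * gc / total if total else 0.0

def take_from_start(reads, size, offset=0):
    """Hypothetical selection type: `size` consecutive reads starting at `offset`."""
    return reads[offset:offset + size]

def take_random(reads, size, seed=0):
    """Hypothetical selection type: `size` reads drawn without replacement."""
    return random.Random(seed).sample(reads, size)

# Toy data standing in for the full read set (assumption: not the paper's data).
rng = random.Random(42)
full = ["".join(rng.choices("ACGT", weights=[23, 27, 27, 23], k=200))
        for _ in range(20000)]
full_gc = gc_content(full)

# Compare each sample's GC content with that of the full set.
for size in (100, 1000, 10000):
    sample_gc = gc_content(take_random(full, size, seed=size))
    print(size, round(sample_gc, 2), round(abs(sample_gc - full_gc), 2))
```

Running the loop with several seeds or offsets per size gives a rough feel for how the spread between samples narrows as the sample size grows.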

To analyze the taxonomy of this mock community, the number of BLAST hits in the original full sequences was counted for each of the 10 strains. The ratios of these counts to the sum of all 10 strains’ hits range from 0.014 to 0.186 (Table 1). These are the original values that we want to estimate with BLAST searches of the samples.

Table 1

Hits of BLAST searches in the original full sequences of the mock community

| Strain | BLAST hits | Ratio to sum |
|---|---|---|
| Escherichia coli KCTC 2571 | 30,658,032 | 0.1862 |
| Escherichia coli Strain W | 29,390,176 | 0.1785 |
| Staphylococcus epidermidis ATCC | 18,862,322 | 0.1146 |
| Pseudomonas stutzeri ATCC 17588 | 18,559,245 | 0.1127 |
| Klebsiella pneumoniae KCTC 2242 | 15,708,328 | 0.0954 |
| Chromobacterium violaceum ATCC 12472 | 15,466,319 | 0.0939 |
| Polaromonas naphthalenivorans CJ2 | 14,081,217 | 0.0855 |
| Corynebacterium glutamicum ATCC 13032 | 10,351,377 | 0.0629 |
| Roseobacter denitrificans OCh114 | 9,332,359 | 0.0567 |
| Arthrobacter chlorophenolicus A6 | 2,253,237 | 0.0137 |
| Sum | 164,662,612 | 1 |

To make this estimation, the ratio of the hits counted for each strain to the sum of all 10 strains’ hits was calculated for each of the same 176 samples, which were selected from the original full sequences by 16 different selection types for each of 11 different sample sizes.
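The ratio used here is simply each strain's hit count divided by the total over all 10 strains. A minimal sketch, using the hit counts reported in Table 1 for the original full sequences (the dictionary layout is an illustrative choice, not the authors' code):

```python
# BLAST hit counts per strain, as listed in Table 1.
hits = {
    "Escherichia coli KCTC 2571": 30_658_032,
    "Escherichia coli Strain W": 29_390_176,
    "Staphylococcus epidermidis ATCC": 18_862_322,
    "Pseudomonas stutzeri ATCC 17588": 18_559_245,
    "Klebsiella pneumoniae KCTC 2242": 15_708_328,
    "Chromobacterium violaceum ATCC 12472": 15_466_319,
    "Polaromonas naphthalenivorans CJ2": 14_081_217,
    "Corynebacterium glutamicum ATCC 13032": 10_351_377,
    "Roseobacter denitrificans OCh114": 9_332_359,
    "Arthrobacter chlorophenolicus A6": 2_253_237,
}

total = sum(hits.values())

# Ratio of each strain's hits to the sum of all strains' hits.
ratios = {strain: count / total for strain, count in hits.items()}

for strain, ratio in ratios.items():
    print(f"{strain}: {ratio:.4f}")
```

The same calculation applied to the hit counts of a sample yields the estimate that is compared against these original values.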

The calculation shows that the ratio values for Roseobacter get closer to 0.057, the ratio calculated from the original full sequences, as the sample size increases. Similarly, the ratio values for Arthrobacter get closer to 0.014 (Fig. 2). The ratio values calculated for the other strains show similar results. To show that the deviation between different sampling methods decreases as the sample size increases, the smallest values (Additional file 1: Table S1) and the largest values (Additional file 1: Table S2) among the ratio values calculated from the 16 samples of each sample size were tabulated. The standard deviations of the ratios calculated from the 16 samples of each sample size were also tabulated (Additional file 1: Table S3).

Fig. 2

Ratio of hits for a strain to the sum of all hits, for each sample. The x-axis labels give the sample sizes, with one label per selection type; each label therefore represents 4 samples made with 4 different K numbers.

As the sample size increases, the smallest and the largest values tend to get closer to the ratio values calculated from the original full sequences. At the same time, the standard deviation generally decreases, with a few exceptions that arise because the statistical errors are relatively large where the values are small.

Again, these results support the tendency that a sufficiently large sample has hit ratios close to those of the original full sequences.
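The narrowing of the minimum, maximum, and standard deviation with sample size can be reproduced in a small simulation. The sketch below is an assumption-laden simplification: it replaces the paper's 16 selection types with 16 independent random draws, treats each read as producing at most one hit, and uses the Roseobacter ratio of 0.0567 from Table 1 as the true proportion.

```python
import random
from statistics import stdev

ROSEOBACTER_RATIO = 0.0567  # ratio from the original full sequences (Table 1)

def simulated_ratios(p, sample_size, replicates=16, seed=0):
    """Estimate proportion p from `replicates` random samples of `sample_size` reads."""
    rng = random.Random(seed)
    return [
        sum(rng.random() < p for _ in range(sample_size)) / sample_size
        for _ in range(replicates)
    ]

spread = {}
for size in (100, 1000, 10000):
    estimates = simulated_ratios(ROSEOBACTER_RATIO, size, seed=size)
    spread[size] = stdev(estimates)
    # Smallest value, largest value, and standard deviation per sample size.
    print(size, round(min(estimates), 4), round(max(estimates), 4),
          round(spread[size], 4))
```

Under these assumptions the standard deviation shrinks roughly with the square root of the sample size, which is the behavior expected of a sampled proportion.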

This means that a small part of the original full sequences can be used to estimate the original taxonomic annotation regardless of selection type, especially for relative comparisons, such as which class is annotated most, or which of two phyla is annotated more.

Meanwhile, we can explain the difference between the results from the original full sequences and those from the samples as the general statistical error of a small sample.

For a given margin of error, we can approximate a proper sample size if we consider that estimating a taxonomic proportion of sequences is similar to a general statistical sampling problem, such as a poll to estimate the proportion of voters supporting an election candidate.

For example, as a rough approximation, assume that a given unknown set of metagenomic sequences follows a normal distribution and that the expected proportion of reads classified as a certain taxon is close to 1/2. This value is widely used when there is no initial information about the actual proportion and the start-up cost of sampling is high [16]. Under these assumptions, a simplified equation gives the sample size for a given margin of error (Eq. 1) [16]. By this calculation, the sample size for a 1% margin of error and 85% confidence is about 5000 (5184).

$$ \mathrm{n}=\frac{{\left(\mathrm{Z}_{\alpha /2}\right)}^2\cdot \frac{1}{2}\cdot \frac{1}{2}}{{\mathrm{E}}^2} $$

Eq. 1. Determining the sample size n when estimating a population proportion, where Zα/2 is the value for which the upper-tail probability of the standard normal distribution equals (1 - confidence)/2, and E is the margin of error.
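The arithmetic behind Eq. 1 can be checked with a few lines of Python. This is a sketch under the stated assumptions (expected proportion 1/2, normal approximation); the function name and its parameters are illustrative. The value 5184 corresponds to Zα/2 rounded to 1.44, while the unrounded quantile gives a slightly smaller number.

```python
from statistics import NormalDist

def sample_size(margin, confidence, p=0.5, z=None):
    """Sample size n from Eq. 1 for margin of error `margin` and a given confidence.

    If `z` is not supplied, Z_{alpha/2} is taken from the standard normal
    distribution so that the upper-tail probability is (1 - confidence) / 2.
    """
    if z is None:
        z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * z * p * (1 - p) / margin ** 2

# With Z rounded to 1.44, as the figure of 5184 implies.
n_rounded = sample_size(0.01, 0.85, z=1.44)

# With the unrounded normal quantile for 85% confidence.
n_exact = sample_size(0.01, 0.85)

print(round(n_rounded), round(n_exact))
```

Because p(1 - p) is largest at p = 1/2, this choice gives the largest n for a fixed absolute margin of error.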

If we apply this margin of error calculation to the mock community test, the calculated margin might be smaller than the actual errors, because all the ratio values of the mock community from the original full sequences are smaller than 1/2. Nevertheless, the BLAST search result from a sample made by selecting 5000 reads from the start of the original full sequences (“selection type 1” with 0 as the “K number”) of this mock community still gives a fair estimate of the ratio values (Fig. 3).

Fig. 3

Ratio of hits for a strain to the sum of all hits, from the original and from the sample with 5000 reads

We can compare this to a more general statistical sampling problem. We used a sample of 5000 reads to estimate the properties of a total of 1.22 million reads. By comparison, New York Times/CBS News conducted a poll of 1426 people for the 2016 U.S. presidential election, which had a total of 137 million voters [17].
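The comparison can be made concrete by inverting Eq. 1 to get the margin of error for a given sample size, with a finite population correction added. This is a sketch, not a calculation from the paper: the correction factor is a standard textbook addition, and the 95% confidence level used for the poll is an assumption, since the text does not state one.

```python
import math
from statistics import NormalDist

def margin_of_error(n, population, confidence, p=0.5):
    """Margin of error for a sampled proportion, with finite population correction."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    fpc = math.sqrt((population - n) / (population - 1))
    return z * math.sqrt(p * (1 - p) / n) * fpc

# 5000 reads out of 1.22 million, at the 85% confidence used with Eq. 1.
reads_margin = margin_of_error(5000, 1_220_000, 0.85)

# 1426 respondents out of 137 million voters (95% confidence is assumed here).
poll_margin = margin_of_error(1426, 137_000_000, 0.95)

print(f"reads: {reads_margin:.4f}  poll: {poll_margin:.4f}")
```

In both cases the correction factor is very close to 1, which illustrates that the margin of error is driven mainly by the sample size rather than by the fraction of the population sampled.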

MG-RAST: Applicability in full-scale metagenomics sequence data sets

The GC content values calculated from the original full sequences of 10 public MG-RAST projects, each with more than 170,000,000 reads, were compared with the GC content values calculated from samples of only 5000 reads selected from them (Table 2). In most cases, the values calculated from the samples estimate those calculated from the originals well.

Table 2

GC Contents, original vs. sample

| Original MG-RAST ID | Sample MG-RAST ID | GC content of original (%) | GC content of sample (%) |
|---|---|---|---|
| 4539528.3 | 4701886.3 | 62.975 | 62.161 |
| 4510219.3 | 4701884.3 | 53.813 | 51.365 |
| 4510173.3 | 4701887.3 | 50.512 | 50.477 |
| 4509400.3 | 4701883.3 | 62.269 | 62.468 |
| 4562385.3 | 4701888.3 | 56.816 | 57.733 |
| 4538997.3 | 4701892.3 | 58.078 | 56.902 |
| 4539575.3 | 4701885.3 | 60.177 | 60.568 |
| 4587432.3 | 4701891.3 | 52.943 | 52.68 |
| 4555915.3 | 4701890.3 | 48.254 | 48.132 |
| 4533611.3 | 4701889.3 | 56.184 | 45.012 |
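To see how close "in most cases" is, the absolute differences between the paired GC content values in Table 2 can be computed directly. The values below are copied from the table; the 2.5 percentage point tolerance is an arbitrary illustrative cutoff, not a criterion from the paper.

```python
from statistics import median

# (GC content of original, GC content of sample) pairs from Table 2, in percent.
pairs = [
    (62.975, 62.161), (53.813, 51.365), (50.512, 50.477), (62.269, 62.468),
    (56.816, 57.733), (58.078, 56.902), (60.177, 60.568), (52.943, 52.68),
    (48.254, 48.132), (56.184, 45.012),
]

# Absolute difference in percentage points for each project.
diffs = [abs(original - sample) for original, sample in pairs]

TOLERANCE = 2.5  # hypothetical cutoff, chosen only for illustration
close = sum(d < TOLERANCE for d in diffs)

print(f"median difference: {median(diffs):.3f}")
print(f"largest difference: {max(diffs):.3f}")
print(f"{close} of {len(diffs)} projects within {TOLERANCE} points")
```

One project differs by more than 11 percentage points, which is why the agreement holds in most but not all cases.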

The most annotated phyla and classes from the original MG-RAST research data were compared with those of the samples (Table 3). In 9 cases out of 10, the most annotated phylum in the MG-RAST project of the sample is the same as in the original data. Likewise, in 9 cases out of 10, the most annotated class is the same in the original MG-RAST research data and in the sample. Considering that 4 different classes appear among all the cases, these 9 out of 10 matches support the assumption that these samples can estimate the basic taxonomic information of the originals.

Table 3

Most annotated phyla and classes, original vs. sample

| Original MG-RAST ID | Sample MG-RAST ID | Phylum (original) | Phylum (sample) | Class (original) | Class (sample) |
|---|---|---|---|---|---|
| 4539528.3 | 4701886.3 | Proteobacteria | Proteobacteria | Actinobacteria (class) | Actinobacteria (class) |
| 4510219.3 | 4701884.3 | Proteobacteria | Proteobacteria | Deltaproteobacteria | Deltaproteobacteria |
| 4510173.3 | 4701887.3 | Proteobacteria | Proteobacteria | Gammaproteobacteria | Gammaproteobacteria |
| 4509400.3 | 4701883.3 | Proteobacteria | Proteobacteria | Actinobacteria (class) | Actinobacteria (class) |
| 4562385.3 | 4701888.3 | Proteobacteria | Proteobacteria | Gammaproteobacteria | Gammaproteobacteria |
| 4538997.3 | 4701892.3 | Proteobacteria | Proteobacteria | Alphaproteobacteria | Deltaproteobacteria |
| 4539575.3 | 4701885.3 | Proteobacteria | Proteobacteria | Alphaproteobacteria | Alphaproteobacteria |
| 4587432.3 | 4701891.3 | Firmicutes | Actinobacteria | Actinobacteria (class) | Actinobacteria (class) |
| 4555915.3 | 4701890.3 | Ascomycota | Ascomycota | Gammaproteobacteria | Gammaproteobacteria |
| 4533611.3 | 4701889.3 | Proteobacteria | Proteobacteria | Alphaproteobacteria | Alphaproteobacteria |

On the other hand, the numbers of annotated phyla from the samples tend to be smaller than those from the originals (Fig. 4). The numbers of annotated classes from the samples tend to be smaller still, relative to those from the originals (Fig. 5). This is because the samples did not include sequences representing all the different phyla and classes in the original data. A phylum or class that is represented by only a small number of sequences in the original has a low probability of being captured in a sample. This implies that this type of sampling cannot capture all the taxonomic diversity information.

Fig. 4

Numbers of annotated phyla, originals vs. samples

Fig. 5

Numbers of annotated classes, originals vs. samples
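The low capture probability of rare taxa can be quantified with a simple model. Assuming reads are drawn independently and at random (an idealization that the selection types used here only approximate), a taxon making up a fraction p of the reads appears at least once in a sample of n reads with probability 1 - (1 - p)^n. The function name and the example fractions are illustrative.

```python
def capture_probability(proportion, sample_size):
    """Probability that at least one of `sample_size` random reads belongs to a
    taxon that makes up `proportion` of all reads (independent draws assumed)."""
    return 1.0 - (1.0 - proportion) ** sample_size

# Hypothetical taxon abundances, evaluated for a 5000-read sample.
for proportion in (1e-2, 1e-3, 1e-4, 1e-5):
    prob = capture_probability(proportion, 5000)
    print(f"fraction {proportion:.0e}: captured with probability {prob:.3f}")
```

Under this model, taxa at or above about one read in a thousand are almost always captured by 5000 reads, while taxa at one read in a hundred thousand are usually missed.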

However, if we apply a 1% threshold to remove over-annotation and/or mis-annotation, the numbers of annotated phyla from the samples get much closer to those from the original data (Fig. 6). The numbers of annotated classes from the samples also get closer to those from the original data (Fig. 7). This supports the assumption that these samples can still estimate at least part of the taxonomic diversity information.

Fig. 6

Numbers of annotated phyla (1% threshold), originals vs. samples

Fig. 7

Numbers of annotated classes (1% threshold), originals vs. samples
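The effect of the threshold can be illustrated with a short sketch. It assumes that the 1% threshold means keeping only taxa whose annotated hits make up at least 1% of all annotated hits, which is one plausible reading of the text. The counts are invented toy numbers, not MG-RAST results.

```python
def taxa_above_threshold(counts, threshold=0.01):
    """Return the taxa whose share of all annotated hits is at least `threshold`."""
    total = sum(counts.values())
    return {taxon for taxon, count in counts.items() if count / total >= threshold}

# Hypothetical annotation counts for an original data set and a 5000-read sample.
original = {
    "Proteobacteria": 700_000, "Actinobacteria": 200_000, "Firmicutes": 90_000,
    "Chloroflexi": 9_000, "Tenericutes": 1_000,
}
sample = {
    "Proteobacteria": 3_520, "Actinobacteria": 990, "Firmicutes": 445,
    "Chloroflexi": 45,
}

# Without the threshold, the sample misses a rare phylum.
print(len(original), len(sample))

# With the threshold, both data sets report the same phyla.
print(sorted(taxa_above_threshold(original)))
print(sorted(taxa_above_threshold(sample)))
```

In this toy case the raw phylum counts differ between original and sample, but the thresholded sets agree, which mirrors the pattern described for Figs. 6 and 7.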
