Mock community
The original full sequences obtained from the mock community of 10 strains comprised about 1,220,000 reads, or 12.3 Gb. The GC content calculated from them was 53.1%.
Hits of BLAST searches in the original full sequences of the mock community

| Strain | Hits | Ratio |
|---|---|---|
| Escherichia coli KCTC 2571 | 30,658,032 | 0.1862 |
| Escherichia coli Strain W | 29,390,176 | 0.1785 |
| Staphylococcus epidermidis ATCC | 18,862,322 | 0.1146 |
| Pseudomonas stutzeri ATCC 17588 | 18,559,245 | 0.1127 |
| Klebsiella pneumoniae KCTC 2242 | 15,708,328 | 0.0954 |
| Chromobacterium violaceum ATCC 12472 | 15,466,319 | 0.0939 |
| Polaromonas naphthalenivorans CJ2 | 14,081,217 | 0.0855 |
| Corynebacterium glutamicum ATCC 13032 | 10,351,377 | 0.0629 |
| Roseobacter denitrificans OCh114 | 9,332,359 | 0.0567 |
| Arthrobacter chlorophenolicus A6 | 2,253,237 | 0.0137 |
| Sum | 164,662,612 | 1 |
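The ratio column can be reproduced from the hit counts alone. The following is a minimal sketch in Python, assuming the ratio is simply each strain's hits divided by the sum over all 10 strains; the function name `hit_ratios` is ours and only illustrative.

```python
# Hit counts copied from the table above (original full sequences).
hits = {
    "Escherichia coli KCTC 2571": 30_658_032,
    "Escherichia coli Strain W": 29_390_176,
    "Staphylococcus epidermidis ATCC": 18_862_322,
    "Pseudomonas stutzeri ATCC 17588": 18_559_245,
    "Klebsiella pneumoniae KCTC 2242": 15_708_328,
    "Chromobacterium violaceum ATCC 12472": 15_466_319,
    "Polaromonas naphthalenivorans CJ2": 14_081_217,
    "Corynebacterium glutamicum ATCC 13032": 10_351_377,
    "Roseobacter denitrificans OCh114": 9_332_359,
    "Arthrobacter chlorophenolicus A6": 2_253_237,
}


def hit_ratios(counts):
    """Return each strain's share of the total number of hits."""
    total = sum(counts.values())
    return {strain: n / total for strain, n in counts.items()}


ratios = hit_ratios(hits)
for strain, ratio in ratios.items():
    print(f"{strain}: {ratio:.4f}")
```

The same function can be applied to the hit counts of any sample to obtain the ratios that are compared with these reference values.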
For the estimation, the ratio of hit reads for each strain to the sum of the hits of all 10 strains was calculated for each of the 176 samples, which, as before, were drawn from the original full sequences using 16 selection types at each of 11 sample sizes.
As the sample size increases, both the smallest and the largest values tend to approach the ratios calculated from the original full sequences. At the same time, the standard deviation generally decreases as the sample size increases. There are a few exceptions, because the statistical errors are relatively large where the values are small.
Again, these results support the tendency that a sufficiently large sample has hit ratios close to those of the original full sequences.
This means that a small part of the original full sequences can be used to estimate the original taxonomic annotation regardless of selection type, especially for relative comparisons, such as determining which class is annotated most or whether one phylum is annotated more than another.
Meanwhile, the difference between the results from the original full sequences and those from the samples can be explained as the general statistical error of a small sample.
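The shrinking spread can be illustrated with a small simulation. This is only a sketch under stated assumptions: reads are drawn by idealized simple random sampling with replacement from the hit ratios of the original full sequences, which is not one of the 16 selection types used in this study, and the sample sizes, the number of repeats, and the seed are arbitrary choices of ours.

```python
import random
import statistics

# Hit ratios of the 10 strains in the original full sequences (table above).
true_ratios = [0.1862, 0.1785, 0.1146, 0.1127, 0.0954,
               0.0939, 0.0855, 0.0629, 0.0567, 0.0137]


def simulate_spread(sample_size, strain_index=0, repeats=100, seed=0):
    """Standard deviation of one strain's estimated ratio over repeated samples.

    Assumption: each read is assigned to a strain independently, with
    probabilities given by true_ratios (sampling with replacement).
    """
    rng = random.Random(seed)
    strains = range(len(true_ratios))
    estimates = []
    for _ in range(repeats):
        draws = rng.choices(strains, weights=true_ratios, k=sample_size)
        estimates.append(draws.count(strain_index) / sample_size)
    return statistics.stdev(estimates)


# Hypothetical sample sizes, chosen only to show the trend.
for n in (100, 1_000, 10_000):
    print(n, round(simulate_spread(n), 4))
```

Under this model the standard deviation of an estimated ratio scales roughly with the inverse square root of the sample size, so the printed values should fall as n grows.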
For a given margin of error, we can approximate an appropriate sample size if we treat the estimation of a taxonomic proportion of sequences as a general statistical sampling problem, such as a poll estimating the proportion of voters who support an election candidate.
For example, as a rough approximation, assume that a given unknown set of metagenomic sequences follows a normal distribution and that the expected proportion of reads classified as a certain taxon is close to 1/2, a widely used value when no initial information about the actual proportion is available and the start-up cost of sampling is high [16]. A simplified equation then gives the sample size for a given margin of error (Eq. 1) [16]. By this calculation, the sample size for a 1% margin of error and 85% confidence is about 5000 (5184).
$$ \mathrm{n}=\frac{{\left({\mathrm{Z}}_{\alpha /2}\right)}^2\cdot \frac{1}{2}\cdot \frac{1}{2}}{{\mathrm{E}}^2} $$
Eq. 1. Determining the sample size n for estimating a population proportion, where Zα/2 is the value of the standard normal distribution whose upper-tail probability equals (1 - confidence)/2, and E is the margin of error.
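As a worked check of Eq. 1, the sketch below evaluates the sample size in Python. The function names are ours, and the general proportion p is an illustrative extension; Eq. 1 corresponds to p = 1/2. We assume that the quoted value of 5184 comes from rounding Zα/2 for 85% confidence to 1.44.

```python
from statistics import NormalDist


def z_for_confidence(confidence):
    """Two-sided critical value Z_{alpha/2} of the standard normal distribution."""
    alpha = 1.0 - confidence
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


def sample_size(z, margin_of_error, p=0.5):
    """Eq. 1 with a general expected proportion p (p = 1/2 is the worst case)."""
    return z ** 2 * p * (1.0 - p) / margin_of_error ** 2


# Z rounded to two decimals reproduces the value quoted in the text.
print(round(sample_size(1.44, 0.01)))  # 5184

# The unrounded critical value gives a slightly smaller sample size.
print(sample_size(z_for_confidence(0.85), 0.01))
```

Because p(1 - p) is largest at p = 1/2, any other expected proportion would require fewer reads for the same margin of error and confidence.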
We can compare this with a more familiar statistical sampling problem. In our case, a sample of 5000 reads was used to estimate a total of 1.22 million reads. By comparison, a New York Times/CBS News poll surveyed 1426 people for the 2016 U.S. presidential election, which had a total of 137 million voters [17].
MG-RAST: Applicability in full-scale metagenomics sequence data sets
GC contents, original vs. sample

| Original (MG-RAST ID) | Sample (MG-RAST ID) | GC content, original (%) | GC content, sample (%) |
|---|---|---|---|
| 4539528.3 | 4701886.3 | 62.975 | 62.161 |
| 4510219.3 | 4701884.3 | 53.813 | 51.365 |
| 4510173.3 | 4701887.3 | 50.512 | 50.477 |
| 4509400.3 | 4701883.3 | 62.269 | 62.468 |
| 4562385.3 | 4701888.3 | 56.816 | 57.733 |
| 4538997.3 | 4701892.3 | 58.078 | 56.902 |
| 4539575.3 | 4701885.3 | 60.177 | 60.568 |
| 4587432.3 | 4701891.3 | 52.943 | 52.68 |
| 4555915.3 | 4701890.3 | 48.254 | 48.132 |
| 4533611.3 | 4701889.3 | 56.184 | 45.012 |
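For reference, a GC content comparison of this kind can be sketched as follows. This is a plain definition of GC content (G and C bases as a percentage of all A, C, G, and T bases) and not necessarily the exact computation performed by MG-RAST; the toy reads, the function names, and the seed are hypothetical.

```python
import random


def gc_content(reads):
    """Percentage of G and C among all A/C/G/T bases; other symbols are ignored."""
    gc = at = 0
    for read in reads:
        seq = read.upper()
        gc += seq.count("G") + seq.count("C")
        at += seq.count("A") + seq.count("T")
    return 100.0 * gc / (gc + at) if gc + at else 0.0


def random_subsample(reads, k, seed=0):
    """Draw k reads without replacement (one simple selection type)."""
    return random.Random(seed).sample(reads, k)


# Toy reads standing in for a real data set.
reads = ["ACGT", "GGCC", "ATAT", "GCGC", "AATT", "CCGG"]
print(gc_content(reads))
print(gc_content(random_subsample(reads, 3)))
```

With real data, the first value corresponds to the original column and the second to the sample column of the table above.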
Most annotated phylum and classes, original vs. sample

| Original (MG-RAST ID) | Sample (MG-RAST ID) | Phylum, original | Phylum, sample | Class, original | Class, sample |
|---|---|---|---|---|---|
| 4539528.3 | 4701886.3 | Proteobacteria | Proteobacteria | Actinobacteria (class) | Actinobacteria (class) |
| 4510219.3 | 4701884.3 | Proteobacteria | Proteobacteria | Deltaproteobacteria | Deltaproteobacteria |
| 4510173.3 | 4701887.3 | Proteobacteria | Proteobacteria | Gammaproteobacteria | Gammaproteobacteria |
| 4509400.3 | 4701883.3 | Proteobacteria | Proteobacteria | Actinobacteria (class) | Actinobacteria (class) |
| 4562385.3 | 4701888.3 | Proteobacteria | Proteobacteria | Gammaproteobacteria | Gammaproteobacteria |
| 4538997.3 | 4701892.3 | Proteobacteria | Proteobacteria | Alphaproteobacteria | Deltaproteobacteria |
| 4539575.3 | 4701885.3 | Proteobacteria | Proteobacteria | Alphaproteobacteria | Alphaproteobacteria |
| 4587432.3 | 4701891.3 | Firmicutes | Actinobacteria | Actinobacteria (class) | Actinobacteria (class) |
| 4555915.3 | 4701890.3 | Ascomycota | Ascomycota | Gammaproteobacteria | Gammaproteobacteria |
| 4533611.3 | 4701889.3 | Proteobacteria | Proteobacteria | Alphaproteobacteria | Alphaproteobacteria |
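The agreement between original and sample can be tallied directly from the table above. The short sketch below does this in Python; the variable and function names are ours, and the pairs are copied from the table in row order.

```python
# (original, sample) pairs of the most annotated phylum, in table row order.
top_phylum = [
    ("Proteobacteria", "Proteobacteria"),
    ("Proteobacteria", "Proteobacteria"),
    ("Proteobacteria", "Proteobacteria"),
    ("Proteobacteria", "Proteobacteria"),
    ("Proteobacteria", "Proteobacteria"),
    ("Proteobacteria", "Proteobacteria"),
    ("Proteobacteria", "Proteobacteria"),
    ("Firmicutes", "Actinobacteria"),
    ("Ascomycota", "Ascomycota"),
    ("Proteobacteria", "Proteobacteria"),
]

# (original, sample) pairs of the most annotated class, in table row order.
top_class = [
    ("Actinobacteria (class)", "Actinobacteria (class)"),
    ("Deltaproteobacteria", "Deltaproteobacteria"),
    ("Gammaproteobacteria", "Gammaproteobacteria"),
    ("Actinobacteria (class)", "Actinobacteria (class)"),
    ("Gammaproteobacteria", "Gammaproteobacteria"),
    ("Alphaproteobacteria", "Deltaproteobacteria"),
    ("Alphaproteobacteria", "Alphaproteobacteria"),
    ("Actinobacteria (class)", "Actinobacteria (class)"),
    ("Gammaproteobacteria", "Gammaproteobacteria"),
    ("Alphaproteobacteria", "Alphaproteobacteria"),
]


def count_matches(pairs):
    """Number of data sets whose sample agrees with the original."""
    return sum(1 for original, sample in pairs if original == sample)


print(f"phylum: {count_matches(top_phylum)}/{len(top_phylum)}")
print(f"class: {count_matches(top_class)}/{len(top_class)}")
```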