Metagenomic analysis in Bioidea - BioMeta16S package
16S analysis is widely used to identify microorganisms (bacteria and archaea) and to investigate the phylogenetic relationships between them. It involves amplifying and sequencing the 16S rRNA gene, which is highly polymorphic between species of these microorganisms. The 16S gene contains nine variable regions, V1 to V9, which are amplified using universal primers. Amplifying two 16S regions (e.g. V3V4) is sufficient to identify most bacteria; however, the longer the amplified fragment, the easier it is to distinguish microorganisms with a high degree of similarity. This is especially important for identifying bacteria down to the species level.
To identify the microbiome in diagnostic and environmental samples, we use an in-house software package that analyzes NGS reads quickly and comprehensively. The BioMeta16S package was created to identify bacteria at all taxonomic levels. Unlike the vast majority of available bioinformatics solutions, the BioMeta16S package identifies bacteria with an accuracy of nearly 90% at the lowest taxonomic level - species, and with an accuracy of nearly 100% at the genus level (this applies to medical samples - in environmental samples most organisms are not known so precisely).
This is achieved, among other things, by means of a proprietary reference sequence database. The BioMeta16SRef reference was created on the basis of data from many sources - including Greengenes and NCBI, the databases most frequently used in metagenomics - and is periodically and automatically updated with newly described or sequenced organisms. The current reference version, 1.1.2, distinguishes 19267 bacterial species possessing the 16S gene.
Software verification - case study
The effectiveness of the BioMeta16S package was verified using publicly available sequencing results for mock communities, i.e. preparations containing the genomic DNA of known bacteria. Because the organisms in each of these preparations are known, it is easy to assess how effectively the analysis identifies them.
The analysis pipeline is presented using the publicly available NGS sequencing results for HM-782D. The resulting FASTQ files are available in the NCBI database under accession number SRR2952731. According to the available information, HM-782D contains the genomic DNA of 20 bacteria, including: Acinetobacter baumannii, Actinomyces odontolyticus, Bacillus cereus, Bacteroides vulgatus, Clostridium beijerinckii, Deinococcus radiodurans, Enterococcus faecalis, Shigella sonnei*, Cutibacterium acnes, Pseudomonas aeruginosa, Rhodobacter sphaeroides, Staphylococcus aureus, Staphylococcus epidermidis, Streptococcus agalactiae, Streptococcus mutans, Streptococcus pneumoniae. To confirm the assigned taxonomy, the reference sequences (16S gene) of these bacteria were manually aligned against the NCBI database using BLAST. This analysis showed that two groups of HM-782D reference sequences should have a different taxonomy assigned. The taxonomies changed as follows: Escherichia coli to Shigella sonnei, and Listeria monocytogenes to L. welshimeri. The reason for the change is probably the continuous evolution of the bacterial 16S gene sequence database.
Results of the experiment
The analysis with the BioMeta16S package produced 24 clusters of OTU (operational taxonomic unit) sequences. Two OTU sequences were discarded from further analysis because of their very low read counts (<0.01%). Taxonomy was assigned to the other 22 OTU sequences using the BioMeta16SRef reference version 1.1.2. As a result, 18 OTU sequences were assigned a species taxonomy consistent with the reference, 3 OTU sequences were assigned an ambiguous species taxonomy, and one OTU sequence was assigned to an additional bacterial species - Streptococcus mitis, whose sequence is very similar to that of Streptococcus pneumoniae. In two cases, two OTU clusters were assigned to the same taxonomy - Lactobacillus gasseri and Staphylococcus aureus. Each of the other OTUs had its own distinct taxonomy.
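The low-abundance cut-off described above can be illustrated with a minimal Python sketch. It is not the BioMeta16S implementation: the function name, the input format (a mapping from OTU identifier to read count) and the example counts are assumptions made for illustration only. The 0.01% threshold is taken from the text.

```python
def filter_low_abundance(otu_counts, min_fraction=0.0001):
    """Split OTUs into kept and discarded by their share of all reads.

    otu_counts: dict mapping a (hypothetical) OTU id to its read count.
    min_fraction: 0.0001 corresponds to the 0.01% cut-off in the text.
    """
    total = sum(otu_counts.values())
    if total == 0:
        # Nothing to keep when there are no reads at all.
        return {}, dict(otu_counts)
    kept, discarded = {}, {}
    for otu_id, count in otu_counts.items():
        target = kept if count / total >= min_fraction else discarded
        target[otu_id] = count
    return kept, discarded


# Hypothetical read counts, not taken from the HM-782D run.
example_counts = {"OTU_1": 60000, "OTU_2": 39995, "OTU_3": 5}
kept, discarded = filter_low_abundance(example_counts)
```

In this example OTU_3 holds 5 of 100000 reads (0.005%), so it falls below the cut-off and is the only OTU discarded.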
All analyzed reads were assigned at the kingdom level (Bacteria), which allowed the sequences to be classified at every taxonomic level down to species. The reads were assigned to the reference taxonomy of HM-782D with 100% accuracy at the kingdom, phylum, class, order, family and genus levels. At the species level, reads were assigned with an accuracy of 85%. For three OTU sequences, the species-level assignment was ambiguous. Those OTU sequences matched the reference sequences of several species equally well, hence it was impossible to select one specific record. Ambiguous results of the analysis are designated as, for example: "Shigella; Other". This approach protects against incorrect annotations at the species level, while still providing 100% correct annotation at the genus level. The analyzed data did not make it possible to determine the species for the genera Shigella, Clostridium and Neisseria. The reason for this may be that the sequence is too short to separate closely related bacterial species from each other. In addition, errors at the amplification or sequencing stages can affect the number and quality of the sequences needed to identify individual species.
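The fallback from an ambiguous species to a genus-level label can be sketched as follows. This is a simplified illustration, not the BioMeta16S algorithm: the hit format (score, genus, species), the exact-tie rule and the "Unassigned" label are assumptions. Only the "Genus; Other" label format comes from the text.

```python
def assign_taxonomy(hits):
    """Return a 'Genus; species' label, or 'Genus; Other' on a species tie.

    hits: list of (score, genus, species) tuples for one OTU sequence,
    e.g. alignment scores against reference records (assumed format).
    """
    if not hits:
        return "Unassigned"
    best = max(score for score, _, _ in hits)
    # Keep every reference record that ties for the best score.
    top = {(genus, species) for score, genus, species in hits if score == best}
    if len(top) == 1:
        genus, species = next(iter(top))
        return f"{genus}; {species}"
    genera = {genus for genus, _ in top}
    if len(genera) == 1:
        # Several species of one genus match equally well.
        return f"{next(iter(genera))}; Other"
    # The tie spans several genera, so even the genus is uncertain.
    return "Other"


# Hypothetical scores for illustration.
ambiguous = assign_taxonomy([
    (100.0, "Shigella", "sonnei"),
    (100.0, "Shigella", "flexneri"),
    (98.0, "Escherichia", "coli"),
])
unique = assign_taxonomy([
    (99.5, "Bacteroides", "vulgatus"),
    (97.0, "Bacteroides", "dorei"),
])
```

Here the first call yields the genus-level label because two Shigella species tie for the best score, while the second call resolves to a single species. A real pipeline would more likely use a tolerance around the best score than an exact tie.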
To determine the frequency of each detected species, the percentage of reads per bacterial species was calculated. Most reads were assigned to Bacteroides vulgatus - 13.57% and Helicobacter pylori - 13.53%. The fewest reads were assigned to the additionally observed Streptococcus mitis species - 0.12%.
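The percentage calculation can be sketched in a few lines of Python. The function name, input structures and numbers below are hypothetical. The sketch only shows the arithmetic: reads from all OTUs that share a species label are summed before dividing by the total.

```python
from collections import Counter


def species_percentages(otu_reads, otu_species):
    """Percentage of all reads assigned to each species.

    otu_reads: dict of OTU id -> read count (assumed input format).
    otu_species: dict of OTU id -> species label.
    """
    totals = Counter()
    for otu_id, reads in otu_reads.items():
        # Several OTUs may map to the same species, so their reads add up.
        totals[otu_species[otu_id]] += reads
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {sp: 100.0 * n / grand_total for sp, n in totals.items()}


# Hypothetical counts for illustration only.
percentages = species_percentages(
    {"OTU_1": 50, "OTU_2": 25, "OTU_3": 25},
    {"OTU_1": "species A", "OTU_2": "species A", "OTU_3": "species B"},
)
```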
The 16S analysis using our BioMeta16S software and the BioMeta16SRef reference identified the HM-782D reference bacteria with 100% accuracy at the genus level and with almost 90% accuracy at the species level. However, the above description shows that effective 16S analysis is possible only under certain conditions:
- High-quality reads must be obtained. If the quality is low, nucleotide errors in the reads can artificially inflate the number of identified organisms and cause erroneous taxonomic assignments;
- The sequencing run must be set up to maximize the number of reads per sample. This is particularly important when looking for low-frequency species;
- The sequenced 16S gene product should be as long as possible. This is especially important when the sample contains closely related species. In the analysis of the HM-782D preparation described above, we came across reads that could equally well belong to several different species. The additional fragments of the 16S gene that would make it possible to distinguish these species were unfortunately not sequenced.
These conditions cannot always be met, hence it will not always be possible to assign reads at the species level - in that case, the reads are assigned to a higher taxonomic level.
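As an illustration of the first condition, the sketch below filters FASTQ records by mean base quality and read length. It is a minimal example under stated assumptions, not the filtering used by BioMeta16S: the thresholds are hypothetical, the input is assumed to be plain four-line FASTQ records, and quality strings are assumed to use the common Phred+33 encoding.

```python
def mean_phred(quality_line, offset=33):
    """Mean Phred score of one FASTQ quality string (Phred+33 assumed)."""
    if not quality_line:
        return 0.0
    return sum(ord(ch) - offset for ch in quality_line) / len(quality_line)


def filter_reads(fastq_lines, min_mean_quality=20.0, min_length=100):
    """Return (header, sequence, quality) for reads passing both thresholds.

    fastq_lines: iterable of text lines, four per record.
    The default thresholds are illustrative, not recommended values.
    """
    lines = [line.rstrip("\n") for line in fastq_lines]
    passed = []
    # Step through complete four-line records only.
    for i in range(0, len(lines) - 3, 4):
        header, seq, _separator, qual = lines[i:i + 4]
        if len(seq) >= min_length and mean_phred(qual) >= min_mean_quality:
            passed.append((header, seq, qual))
    return passed


# Two tiny hypothetical reads: 'I' encodes Phred 40, '#' encodes Phred 2.
example = ["@read1", "ACGT", "+", "IIII", "@read2", "ACGT", "+", "####"]
good_reads = filter_reads(example, min_mean_quality=20.0, min_length=4)
```

Only read1 passes here, because read2 has a mean quality of 2. In practice, dedicated trimming tools handle adapters, paired reads and per-base trimming, which this sketch ignores.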