Partners in COMPARE have developed LVQ-KNN, a new composition-based analysis tool that bins unclassified nucleotide sequence reads by provenance class: DNA or RNA.
Driven by next-generation sequencing (NGS), diagnostic metagenomics enables the detection of as yet unknown or unexpected pathogens without prior enrichment or purification. Sequence fragments are generated from all genetic material present in a sample. The resulting sequence data are then classified taxonomically to indicate which species are present in the analyzed sample.
Most commonly used methods for taxonomic classification rely on detecting sequences that are highly similar or even identical to those of known pathogens available in sequence databases. However, given the substantial genomic diversity of the viruses known to date, combined with our limited knowledge of the virosphere, similarity-based approaches will most likely not suffice to recognize strongly deviating sequences that originate from completely unknown viruses.
Based on our current knowledge, we assume that a substantial portion of the as yet unknown viruses are RNA viruses. Moreover, a substantial portion of the currently known viruses that cause severe disease are RNA viruses. Therefore, to enable the identification of novel RNA viruses, we developed a method that recognizes whether a sequence is derived from a functional RNA molecule, i.e., a viral genome, even when no similar sequence is available. This enables the detection of so far unknown RNA viruses directly from a short-read data set.
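To make the idea of composition-based binning more concrete, the following minimal Python sketch describes each read by its k-mer frequencies and assigns it the majority label of its nearest labelled neighbours. It is an illustration only and does not reproduce LVQ-KNN: the k-mer length, the Euclidean distance, the plain k-nearest-neighbour vote, and the names `kmer_profile` and `knn_label` are all assumptions made for this example.

```python
from collections import Counter
from itertools import product
import math

K = 3  # k-mer length; an illustrative choice, not taken from the paper
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def kmer_profile(read, k=K):
    """Return the relative frequency of each A/C/G/T k-mer in a read."""
    read = read.upper()
    counts = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    # Only k-mers made of A, C, G, T are counted; ambiguous bases are skipped.
    total = sum(counts[kmer] for kmer in KMERS)
    if total == 0:
        return [0.0] * len(KMERS)
    return [counts[kmer] / total for kmer in KMERS]


def knn_label(read, labelled_reads, n_neighbors=3):
    """Majority vote among the nearest labelled reads (Euclidean distance).

    labelled_reads is a list of (sequence, label) pairs; hypothetical input.
    """
    query = kmer_profile(read)
    ranked = sorted(
        labelled_reads,
        key=lambda item: math.dist(query, kmer_profile(item[0])),
    )
    votes = Counter(label for _, label in ranked[:n_neighbors])
    return votes.most_common(1)[0][0]
```

Because the read is compared by composition rather than by alignment, such a classifier can in principle assign a label even when no similar sequence exists in a database, which is the property the paragraph above relies on.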
A research paper was recently published in the Virus Bioinformatics special issue of Virus Research (DOI: 10.1016/j.virusres.2018.10.002).