Motivation

When analyzing viral metagenomic sequences, researchers often want to filter the results of a BLAST analysis by the host species of the virus. VHost-Classifier automates this procedure using a natural language processing algorithm written in Python 3. It takes as input a list of taxonomic identifiers (taxids) returned from a BLAST query of viral sequences, and bins the taxids by the evolutionary lineage of their hosts, based on string matching of the words in the viruses' English names. If VHost-Classifier cannot identify a host, it attempts to bin the sequences by the environment from which the sample originated. Because VHost-Classifier predicts the evolutionary lineage of the host from the virus name rather than by referencing taxids against a database, it is not constrained by the size of a database and can classify the hosts of newly characterized viruses.
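The name-based binning described above can be illustrated with a minimal sketch. This is not the VHost-Classifier implementation: the keyword tables, the function names `classify` and `bin_names`, and the example virus names are hypothetical, and the sketch assumes each taxid has already been resolved to its English name.

```python
import re
from collections import defaultdict

# Hypothetical keyword table (not taken from VHost-Classifier):
# a word found in a virus name -> (host Class, host Family or None).
HOST_KEYWORDS = {
    "bovine": ("Mammalia", "Bovidae"),
    "porcine": ("Mammalia", "Suidae"),
    "avian": ("Aves", None),
    "tobacco": ("Magnoliopsida", "Solanaceae"),
}

# Hypothetical fallback table used when no host word is found.
ENVIRONMENT_KEYWORDS = {"marine": "marine", "sewage": "sewage", "soil": "soil"}


def classify(name):
    """Return a (kind, label) pair for one English virus name."""
    words = re.findall(r"[a-z]+", name.lower())
    # First try to match a host word anywhere in the name.
    for word in words:
        if word in HOST_KEYWORDS:
            return ("host", HOST_KEYWORDS[word])
    # Otherwise fall back to an environment word.
    for word in words:
        if word in ENVIRONMENT_KEYWORDS:
            return ("environment", ENVIRONMENT_KEYWORDS[word])
    return ("unclassified", None)


def bin_names(names):
    """Group virus names by the classification assigned to each."""
    bins = defaultdict(list)
    for name in names:
        bins[classify(name)].append(name)
    return dict(bins)


if __name__ == "__main__":
    example = ["Bovine viral diarrhea virus 1", "Marine phage X", "Unknown virus"]
    for key, members in bin_names(example).items():
        print(key, members)
```

Because the lookup depends only on the words in the name, a newly characterized virus whose name contains a known host word is binned without any per-taxid database entry, which is the property the paragraph above highlights.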

Results

Benchmarked on a test dataset of 1000 viral taxids randomly selected from the NCBI Taxonomy database, VHost-Classifier assigned a host with 100% accuracy, to the rank of Class for >93% of viruses and to the rank of Family for >37% of viruses.

Availability and implementation

For more information about VHost-Classifier as well as implementation instructions, visit https://github.com/Kzra/VHost-Classifier.

Supplementary information

Supplementary data are available at Bioinformatics online.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)