The Global Virome, in One Network
The VIRION database is an atlas of the vertebrate-virus network. It was built by, and is curated by, an interdisciplinary team of virologists, ecologists, and data scientists as part of the Verena Consortium, an effort to predict which viruses could infect humans, which animals host them, and where they could someday emerge. VIRION is the most comprehensive database of its kind, drawing data from scientific literature and online databases, and is updated automatically with new data. Today, it includes over 20,000 species interactions that capture the viromes of one in every four mammals, one in every ten birds, and roughly 6% of vertebrates. Unlike many other databases, VIRION has undergone a fully consistent taxonomic reconciliation process using a backbone provided by NCBI. We encourage researchers to review this entire guide before using these data.
There are several versions of VIRION you can choose from. VIRION is periodically hand-compiled into a new stable version, which includes taxonomic updates to every sub-component of the dataset. The two dynamic sources (GLOBI and GenBank) are also scraped and automatically recompiled into an “up-to-date” version daily. Note that, because the CLOVER and PREDICT source datasets are static, species name changes may not be reflected in them until the entire dataset is manually recompiled, potentially creating discrepancies between these sources and the dynamic ones. If you want to reproduce the vignettes we present in the publication, you can also download the entire release of version 0.2.1.
Full database: Up-to-date // Stable
Simplified edgelist: Up-to-date // Stable
Provenance metadata: Up-to-date // Stable
Detection metadata: Up-to-date // Stable
Temporal metadata: Up-to-date // Stable
Host higher taxonomy: Up-to-date // Stable
Virus higher taxonomy: Up-to-date // Stable
You can cite the paper that accompanies the study as:
Carlson CJ, Gibb RJ, Albery GF, Brierley L, Connor R, Dallas T, Eskew EA, Fagre AC, Farrell MJ, Frank HK, Muylaert RL, Poisot T, Rasmussen AL, Ryan SJ, Seifert SN. The Global Virome in One Network (VIRION): an Atlas of Vertebrate-Virus Associations. mBio. 2022 Mar 1. DOI: 10.1128/mbio.02985-21.
If you want to cite the VIRION database directly, you can also refer to .
VIRION aggregates seven major sources of information, two of which can be dynamically updated (*):
VIRION can be used for everything from deep learning to simple biological questions. For example, if you wanted to ask which bats a betacoronavirus (like SARS-CoV or MERS-CoV) has ever been isolated from, you could run this R code:
> library(tidyverse); library(vroom)
>
> virion <- vroom("Virion/Virion.csv.gz")
>
> virion %>%
+ filter(VirusGenus == "betacoronavirus",
+ HostOrder == "chiroptera",
+ DetectionMethod == "Isolation/Observation") %>%
+ pull(Host) %>%
+ unique()
[1] "chaerephon plicatus" "pipistrellus abramus" "rhinolophus affinis"
[4] "rhinolophus ferrumequinum" "rhinolophus macrotis" "rhinolophus pearsonii"
[7] "rhinolophus sinicus" "rousettus leschenaultii" "tylonycteris pachypus"
It’s that simple! Here are a few small tips and tricks you should know:
taxize::classification("Whateverthe latinnameis", db = "ncbi"). If the issue is related to that taxonomic backbone, please label your issue ncbi-needed.
For now, VIRION lives on GitHub in a fully open and reproducible format. Downloading the data directly from this website, or cloning the repository, is the easiest way to access the data. To avoid relying on the Large File Storage system, the VIRION database itself is stored in two file formats:
Virion/Virion.csv.gz, which can be easily read as-is using the vroom package.
Separate component tables, which can be joined using the HostTaxID and VirusTaxID fields, while the metadata files can be joined by the AssocID field (which must first be separated into unique rows). For simple tasks, not every join will be needed.
Like most datasets that record host-virus associations, VIRION includes a mix of different lines of evidence, diagnostic methods, and metadata quality. Some associations will be found in every database, with every evidence standard; others will be recorded from a single serological data point with unclear attribution. VIRION can aggregate all this data for you, but it’s your job as a researcher to be thoughtful about how you use these data. Some suggested best practices:
As a starting point, you can remove any records that aren’t taxonomically resolved to the NCBI backbone (HostNCBIResolved == FALSE, VirusNCBIResolved == FALSE). We particularly suggest this for data that come from other databases that also aggregate content but use multiple taxonomic backbones, which may include invalid names that are not updated.
You should also be wary of records with a flag that indicates host identification by researchers was uncertain (HostFlagID == TRUE).
Restricting records by diagnostic method (e.g., using Nucleotide and Isolation/Observation records, but no Antibodies) or by redundancy (i.e., the number of datasets that record an association) can also lead to stronger results.
We encourage particular caution with regard to the validity of virus names. Although the NCBI and ICTV taxonomies are updated against each other, valid NCBI names are not guaranteed to be ICTV-valid species-level designations, and many may include sampling metadata. We recommend that researchers manually curate names where possible, but simple rubrics can also help filter out questionable names. For example, in the list of NCBI-accepted betacoronavirus names, eliminating all virus names that include a “/” (e.g., using stringr::str_detect()) will remove many lineage-specific records (“bat coronavirus 2265/philippines/2010”, “coronavirus n.noc/vm199/2007/nld”) and leave behind cleaner names (“alpaca coronavirus”), but won’t necessarily catch everything (“bat coronavirus ank045f”). Another option is to limit analysis to viruses that are ICTV ratified (ICTVRatified == TRUE), but this is particularly conservative, and will leave out a much larger number of valid virus names.
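The filters suggested in this list can be chained into a single pipeline. The following is a minimal sketch rather than a recommended workflow: it runs on a small invented table instead of the real database, the hosts are placeholders, the virus names are borrowed from the examples in this guide, and it assumes the flag columns are read in as logical values.

```r
library(dplyr)
library(stringr)

# Toy stand-in for the full VIRION table. Every row is invented for
# illustration; only the column names follow the fields named in this guide.
toy <- tibble::tribble(
  ~Host,    ~Virus,                                  ~HostNCBIResolved, ~VirusNCBIResolved, ~HostFlagID, ~DetectionMethod,
  "host a", "alpaca coronavirus",                    TRUE,              TRUE,               FALSE,       "Nucleotide",
  "host b", "bat coronavirus 2265/philippines/2010", TRUE,              TRUE,               FALSE,       "Nucleotide",
  "host c", "bat coronavirus ank045f",               TRUE,              TRUE,               FALSE,       "Isolation/Observation",
  "host d", "alpaca coronavirus",                    TRUE,              TRUE,               FALSE,       "Antibodies",
  "host e", "alpaca coronavirus",                    FALSE,             TRUE,               FALSE,       "Nucleotide",
  "host f", "alpaca coronavirus",                    TRUE,              TRUE,               TRUE,        "Nucleotide"
)

cleaned <- toy %>%
  # Keep only records resolved against the NCBI backbone
  filter(HostNCBIResolved, VirusNCBIResolved) %>%
  # Drop records whose host identification was flagged as uncertain
  filter(!HostFlagID) %>%
  # Keep the stricter evidence types and exclude serology
  filter(DetectionMethod %in% c("Nucleotide", "Isolation/Observation")) %>%
  # Drop lineage-style virus names that contain a "/"
  filter(!str_detect(Virus, fixed("/")))

cleaned %>% select(Host, Virus)
```

On this toy table, the pipeline keeps the "host a" and "host c" rows. As the text warns, "bat coronavirus ank045f" survives the "/" rule, so manual review is still needed. Note that dplyr::filter() also drops rows where a condition evaluates to NA, which may matter if any of these fields have missing values in the real data.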
VIRION is an open database with a CC-0 license. As such, you can do just about anything with it that you’d like. We would prefer it not be reproduced into other formats that lose intentional aspects of VIRION’s design (e.g., in other databases that drop metadata like evidence standards; as static supplemental files on studies that will never be updated; etc.), but it’s your party! That said: if you see ways to improve taxonomic corrections, add new data sources, or improve the format for credit and attribution, please contact us, so we can work together to keep improving this resource.