Check available versions of the dataset
list_deposit_versions()
deposit_summary()
#> id latest_version publication_date
#> 1 15733485 TRUE 2025-06-24
#> 2 15732796 FALSE 2025-06-24
#> 3 15723628 FALSE 2025-06-23
#> 4 15723072 FALSE 2025-06-23
#> 5 15692263 FALSE 2025-06-18
#> 6 15677843 FALSE 2025-06-16
#> 7 15677137 FALSE 2025-06-16
#> 8 15643004 FALSE 2025-06-11
#> doi_url
#> 1 https://doi.org/10.5281/zenodo.15733485
#> 2 https://doi.org/10.5281/zenodo.15732796
#> 3 https://doi.org/10.5281/zenodo.15723628
#> 4 https://doi.org/10.5281/zenodo.15723072
#> 5 https://doi.org/10.5281/zenodo.15692263
#> 6 https://doi.org/10.5281/zenodo.15677843
#> 7 https://doi.org/10.5281/zenodo.15677137
#> 8 https://doi.org/10.5281/zenodo.15643004
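The version table can also be used to pick an ID programmatically. The following is a minimal sketch, assuming the table is returned as a data frame with the columns shown above. The data frame is rebuilt by hand from the first three rows so the example runs offline, and storing `id` as character is an assumption.

```r
# Hand-built copy of the first rows of the version table shown above.
# Assumption: the real function returns a data frame with these columns.
versions <- data.frame(
  id = c("15733485", "15732796", "15723628"),
  latest_version = c(TRUE, FALSE, FALSE),
  publication_date = as.Date(c("2025-06-24", "2025-06-24", "2025-06-23")),
  stringsAsFactors = FALSE
)

# ID of the row flagged as the latest version
latest_id <- versions$id[versions$latest_version]

# DOI URL built with the same pattern as the doi_url column
doi_url <- paste0("https://doi.org/10.5281/zenodo.", latest_id)

print(latest_id)
print(doi_url)
```

The resulting ID could then be passed as the `version` argument when downloading the data.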
Get latest version of the data
# get data - by default this downloads the latest version; here we request a specific version by its Zenodo id.
get_versioned_data(version = "15692263", dir_path = "outputs")
#> Cole Brookson, Collin Schwantes, Timothée Poisot, Tad Dallas, Greg Albery, Colin J. Carlson, Cecilia A. Sanchez, Renata Muylaert, Evan Eskew, Rory Gibb, & Maxwell J Farrell. (2025). The Global Virome in One Network (VIRION): Data Package [Data set]. Zenodo. https://doi.org/10.5281/zenodo.15692263
#> deposit will download to outputs/15692263
#> outputs/15692263
# list files
fs::dir_ls("outputs/15692263")
#> outputs/15692263/datapackage.json outputs/15692263/detection.csv.gz
#> outputs/15692263/edgelist.csv outputs/15692263/provenance.csv.gz
#> outputs/15692263/taxonomy_host.csv outputs/15692263/taxonomy_virus.csv
#> outputs/15692263/temporal.csv.gz outputs/15692263/virion.csv.gz
Read the data
Now that you have the data locally, you can read it! The VIRION files are comma-delimited and use a period as the decimal marker.
virion <- vroom::vroom(file = "outputs/15692263/virion.csv.gz")
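If vroom is not available, base R can read the same kind of file. The sketch below writes a tiny gzip-compressed, comma-delimited file with toy values (not real VIRION records; `example_value` is a hypothetical column used only to show the decimal marker) and reads it back.

```r
# Write a tiny gzip-compressed CSV to a temporary file (toy values only).
toy_path <- tempfile(fileext = ".csv.gz")
con <- gzfile(toy_path, "w")
writeLines(
  c("Host,Virus,example_value",
    "host a,virus a,0.5",
    "host b,virus b,1.25"),
  con
)
close(con)

# read.csv() detects the gzip compression when it opens the file.
# sep and dec are spelled out to match the file format described above.
toy <- utils::read.csv(toy_path, sep = ",", dec = ".", stringsAsFactors = FALSE)
str(toy)
```

The same call should work on a downloaded file such as `outputs/15692263/virion.csv.gz`, although it will likely be slower than vroom on large files.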
Cite the data
Citing data increases reproducibility and incentivizes data sharing.
# by default the citation will be generated for the current working version.
# this is set when we run `get_versioned_data`
get_version_citation(style = "modern-language-association")
#> Cole Brookson, et al. The Global Virome in One Network (VIRION): Data Package. Zenodo, 18 June 2025, doi:10.5281/zenodo.15692263.
# we can cite a specific version by providing a Zenodo id
get_version_citation(zenodo_id = "15643004", style = "apa")
#> Cole Brookson, Collin Schwantes, Timothée Poisot, Tad Dallas, Greg Albery, Colin J. Carlson, Cecilia A. Sanchez, Renata Muylaert, Evan Eskew, Rory Gibb, & Maxwell J Farrell. (2025). The Global Virome in One Network (VIRION): Data Package [Data set]. Zenodo. https://doi.org/10.5281/zenodo.15643004
# sometimes you want a BibTeX entry
export_deposit_bibtex("15643004")
#> @dataset{cole_brookson_2025_15643004,
#> author = {Cole Brookson and
#> Collin Schwantes and
#> Timothée Poisot and
#> Tad Dallas and
#> Greg Albery and
#> Colin J. Carlson and
#> Cecilia A. Sanchez and
#> Renata Muylaert and
#> Evan Eskew and
#> Rory Gibb and
#> Maxwell J Farrell},
#> title = {The Global Virome in One Network (VIRION): Data
#> Package
#> },
#> month = jun,
#> year = 2025,
#> publisher = {Zenodo},
#> doi = {10.5281/zenodo.15643004},
#> url = {https://doi.org/10.5281/zenodo.15643004},
#> }
What about the data sources?
The data sources used to create VIRION are referenced directly in the deposit metadata and can be accessed by retrieving that metadata.
# retrieve the deposit metadata as JSON text and parse it into a list
metadata_json_text <- export_deposit_metadata(zenodo_id = "15692263", format = "json", verbose = FALSE)
metadata_list <- jsonlite::fromJSON(metadata_json_text)
related_identifiers <- metadata_list$metadata$related_identifiers
# keep only the items the deposit requires, i.e. its data sources
required_items_filter <- related_identifiers$relation_type$id == "requires"
required_items <- related_identifiers[required_items_filter, ]
# flatten the nested columns so the table prints cleanly
required_items$resource_type <- required_items$resource_type$id
required_items$relation_type <- "requires"
required_items |>
  kableExtra::kable() |>
  kableExtra::kable_material()
| identifier | relation_type | resource_type | scheme |
---|---|---|---|---|
3 | https://ftp.ncbi.nlm.nih.gov/genomes/Viruses/AllNuclMetadata/ accessed on 2025-06-09 | requires | dataset | url |
4 | 10.5281/zenodo.5167655 | requires | dataset | doi |
5 | https://catalog.data.gov/dataset/predict-animals-sampled-c593d | requires | dataset | url |
6 | https://ictv.global/sites/default/files/MSL/ICTV_Master_Species_List_2024_MSL40.v1.xlsx | requires | dataset | url |
7 | https://ictv.global/sites/default/files/VMR/VMR_MSL40.v1.20250307.xlsx | requires | dataset | url |
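The identifiers in the table above mix DOIs and plain URLs. As a small sketch, the `scheme` column can be used to turn every identifier into a resolvable link. The data frame here is rebuilt by hand from two rows of the table, and the `link` column is a name introduced only for this example.

```r
# Hand-built copy of two rows from the table above.
sources <- data.frame(
  identifier = c(
    "10.5281/zenodo.5167655",
    "https://catalog.data.gov/dataset/predict-animals-sampled-c593d"
  ),
  scheme = c("doi", "url"),
  stringsAsFactors = FALSE
)

# DOIs get the doi.org resolver prefix; URLs are kept as they are.
sources$link <- ifelse(
  sources$scheme == "doi",
  paste0("https://doi.org/", sources$identifier),
  sources$identifier
)

print(sources$link)
```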
What do the fields in VIRION mean?
Great question! The datapackage.json file contains field descriptions. It is fairly human-readable, but we can take a closer look using some built-in functions.
For a deeper dive, check out the frictionless package.
# this function wraps frictionless functions and extracts the name, type and description of each field, one table per file
dict_list <- get_data_dictionary(datapackage_json = "outputs/15692263/datapackage.json")
dict_list$detection_csv
#> # A tibble: 5 × 3
#> name type description
#> <chr> <chr> <chr>
#> 1 AssocID number Row number from current VIRION version
#> 2 DetectionMethod string Four harmonized categories in descending order of s…
#> 3 DetectionOriginal string Method used for determing the presence of a virus a…
#> 4 HostFlagID boolean Denotes the presence of possible uncertainty in hos…
#> 5 NCBIAccession string A unique identifier assigned to a record in sequenc…
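Because the dictionary is a named list of tables, ordinary list tools can query it. The sketch below assumes each element has `name` and `type` columns, as in the output above. It uses a small hand-built stand-in for `dict_list` so it runs without the package, then finds which files contain a given field and stacks everything into one long table.

```r
# Hand-built stand-in for dict_list (a subset of fields, descriptions omitted).
dict_list <- list(
  detection_csv = data.frame(
    name = c("AssocID", "DetectionMethod"),
    type = c("number", "string"),
    stringsAsFactors = FALSE
  ),
  taxonomy_host = data.frame(
    name = c("HostTaxID", "Host"),
    type = c("number", "string"),
    stringsAsFactors = FALSE
  )
)

# Which files contain a field called AssocID?
has_field <- vapply(dict_list, function(d) "AssocID" %in% d$name, logical(1))
files_with_assoc_id <- names(dict_list)[has_field]

# Stack all dictionaries into one table, tagging each row with its file.
full_dict <- do.call(
  rbind,
  Map(
    function(d, nm) data.frame(resource = nm, d, stringsAsFactors = FALSE),
    dict_list,
    names(dict_list)
  )
)
rownames(full_dict) <- NULL

print(files_with_assoc_id)
print(full_dict)
```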
Use purrr::map to make a full data dictionary for a given deposit.
html_tables <- purrr::map(dict_list, function(x) {
  # render one dictionary table as a styled HTML table
  kableExtra::kable(x = x) |>
    kableExtra::kable_material()
})
list_items <- names(dict_list)
tables <- sprintf("<h3>%s</h3><br>%s", list_items, html_tables) |>
  paste(collapse = "<br>")
cat(tables)
detection_csv
name | type | description |
---|---|---|
AssocID | number | Row number from current VIRION version |
DetectionMethod | string | Four harmonized categories in descending order of strength of evidence: “Isolation/Observation,” “PCR/Sequencing,” “Antibodies,” and “Not specified”. In some cases where detection method is not available via metadata, source information is used as DetectionOriginal (e.g., “NCBI Nucleotide”). |
DetectionOriginal | string | Method used for determing the presence of a virus as described in the original work |
HostFlagID | boolean | Denotes the presence of possible uncertainty in host identification, which users may want to check before proceeding any further. |
NCBIAccession | string | A unique identifier assigned to a record in sequence databases such as GenBank |
edgelist
name | type | description |
---|---|---|
HostTaxID | number | Taxonomic identification number from NCBI for host taxa. |
VirusTaxID | number | Taxonomic identification number from NCBI for virus taxa. |
AssocID | string | Row number from current VIRION version |
provenance_csv
name | type | description |
---|---|---|
AssocID | number | Row number from current VIRION version |
HostOriginal | string | Host name from original dataset |
VirusOriginal | string | Virus name from original dataset |
Database | string | Source for the record. One of EID2, Shaw, HP3, GMPD2, PREDICT, OR GenBank |
DatabaseVersion | string | For static data, a citation. For dynamic data (e.g. Genbank) the access URL and a time stamp |
ReferenceText | string | A text description of literature sources |
PMID | number | PubMed identifiers for literature sources |
taxonomy_host
name | type | description |
---|---|---|
HostTaxID | number | Taxonomic identification number from NCBI for host taxa. |
Host | string | Host species name |
HostGenus | string | Host genus name |
HostFamily | string | Host family name |
HostOrder | string | Host order name |
HostClass | string | Host class name |
HostNCBIResolved | boolean | Indicates whether or not the host taxa could harmonized with the NCBI taxonomy. |
taxonomy_virus
name | type | description |
---|---|---|
VirusTaxID | number | Taxonomic identification number from NCBI for virus taxa. |
Virus | string | Virus species name |
VirusGenus | string | Virus genus name |
VirusFamily | string | Virus family name |
VirusOrder | string | Virus order name |
VirusClass | string | Virus class name |
VirusNCBIResolved | boolean | Indicates whether or not the virus taxa could harmonized with the NCBI taxonomy. |
ICTVRatified | boolean | Indicates whether or not the virus taxa is ratified by the nternational Committee on Taxonomy of Viruses (ICTV). |
Database | string | Source for the record. One of EID2, Shaw, HP3, GMPD2, PREDICT, OR GenBank |
temporal_csv
name | type | description |
---|---|---|
AssocID | number | Row number from current VIRION version |
PublicationYear | number | For literature derived records. PublicationYear provides the year the literature source was published, accessed either from the original database’s reference description or from scraping the PubMed database. |
ReleaseYear | number | The year a given association was “released” in public information (EID2 and PREDICT) or a publicly deposited sample on GenBank. For PREDICT, all values are given as 2021, given the release of a static file at that time even though some findings may have been published or deposited in GenBank earlier. (This redundancy should be captured in overlap with GenBank and EID2.) |
ReleaseMonth | number | The month a given association was “released” to the public |
ReleaseDay | number | The day a given association was “released” to the public |
CollectionYear | number | Reports the year of actual sample collection (GenBank and Predict) |
CollectionMonth | number | Reports the month of actual sample collection (GenBank and Predict) |
CollectionDay | number | Reports the day of actual sample collection (GenBank and Predict) |
virion_csv
name | type | description |
---|---|---|
Host | string | Host species name |
Virus | string | Virus species name |
HostTaxID | number | Taxonomic identification number from NCBI for host taxa. |
VirusTaxID | number | Taxonomic identification number from NCBI for virus taxa. |
HostNCBIResolved | boolean | Indicates whether or not the host taxa could harmonized with the NCBI taxonomy. |
VirusNCBIResolved | boolean | Indicates whether or not the virus taxa could harmonized with the NCBI taxonomy. |
ICTVRatified | boolean | Indicates whether or not the virus taxa is ratified by the nternational Committee on Taxonomy of Viruses (ICTV). |
HostGenus | string | Host genus name |
HostFamily | string | Host family name |
HostOrder | string | Host order name |
HostClass | string | Host class name |
HostOriginal | string | Host name from original dataset |
VirusGenus | string | Virus genus name |
VirusFamily | string | Virus family name |
VirusOrder | string | Virus order name |
VirusClass | string | Virus class name |
VirusOriginal | string | Virus name from original dataset |
HostFlagID | boolean | Denotes the presence of possible uncertainty in host identification, which users may want to check before proceeding any further. |
DetectionMethod | string | Four harmonized categories in descending order of strength of evidence: “Isolation/Observation,” “PCR/Sequencing,” “Antibodies,” and “Not specified”. In some cases where detection method is not available via metadata, source information is used as DetectionOriginal (e.g., “NCBI Nucleotide”). |
DetectionOriginal | string | Method used for determing the presence of a virus as described in the original work |
Database | string | Source for the record. One of EID2, Shaw, HP3, GMPD2, PREDICT, OR GenBank |
DatabaseVersion | string | For static data, a citation. For dynamic data (e.g. Genbank) the access URL and a time stamp |
PublicationYear | number | For literature derived records. PublicationYear provides the year the literature source was published, accessed either from the original database’s reference description or from scraping the PubMed database. |
ReferenceText | string | A text description of literature sources |
PMID | number | PubMed identifiers for literature sources |
ReleaseYear | number | The year a given association was “released” in public information (EID2 and PREDICT) or a publicly deposited sample on GenBank. For PREDICT, all values are given as 2021, given the release of a static file at that time even though some findings may have been published or deposited in GenBank earlier. (This redundancy should be captured in overlap with GenBank and EID2.) |
ReleaseMonth | number | The month a given association was “released” to the public |
ReleaseDay | number | The day a given association was “released” to the public |
CollectionYear | number | Reports the year of actual sample collection (GenBank and Predict) |
CollectionMonth | number | Reports the month of actual sample collection (GenBank and Predict) |
CollectionDay | number | Reports the day of actual sample collection (GenBank and Predict) |
AssocID | number | Row number. Used as an id. Will be specific to a given version of the data |
DatabaseDOI | string | Persistent digital identifer for the database |
Release_Date | date | Date data were released |
Collection_Date | string | Date of actual sample collection |
NCBIAccession | string | A unique identifier assigned to a record in sequence databases such as GenBank |
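Several of the files above share an `AssocID` field, which suggests they can be combined within a single version of the data. The sketch below rests on that assumption, which the dictionary does not state outright, and uses toy rows rather than real VIRION records. Note that the dictionary lists `AssocID` as a string in the edgelist and as a number elsewhere, so the key is coerced to one type before merging.

```r
# Toy stand-ins for two of the files (values are illustrative only).
detection <- data.frame(
  AssocID = c(1, 2, 3),
  DetectionMethod = c("PCR/Sequencing", "Antibodies", "Not specified"),
  stringsAsFactors = FALSE
)
edgelist <- data.frame(
  AssocID = c("1", "2", "3"),   # listed as a string in the edgelist dictionary
  HostTaxID = c(101, 102, 103), # made-up IDs
  stringsAsFactors = FALSE
)

# Coerce the key to a common type, then join on it.
edgelist$AssocID <- as.numeric(edgelist$AssocID)
combined <- merge(detection, edgelist, by = "AssocID")

print(combined)
```

Since `AssocID` is described as specific to a given version, a join like this would only make sense between files from the same deposit.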