Check available versions of the dataset

list_deposit_versions() |>
  deposit_summary()
#>         id latest_version publication_date
#> 1 15733485           TRUE       2025-06-24
#> 2 15732796          FALSE       2025-06-24
#> 3 15723628          FALSE       2025-06-23
#> 4 15723072          FALSE       2025-06-23
#> 5 15692263          FALSE       2025-06-18
#> 6 15677843          FALSE       2025-06-16
#> 7 15677137          FALSE       2025-06-16
#> 8 15643004          FALSE       2025-06-11
#>                                   doi_url
#> 1 https://doi.org/10.5281/zenodo.15733485
#> 2 https://doi.org/10.5281/zenodo.15732796
#> 3 https://doi.org/10.5281/zenodo.15723628
#> 4 https://doi.org/10.5281/zenodo.15723072
#> 5 https://doi.org/10.5281/zenodo.15692263
#> 6 https://doi.org/10.5281/zenodo.15677843
#> 7 https://doi.org/10.5281/zenodo.15677137
#> 8 https://doi.org/10.5281/zenodo.15643004
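
The summary is an ordinary data frame, so the id of the most recent deposit can be pulled out programmatically. A minimal sketch, assuming the columns shown above; the `versions` object is rebuilt by hand from the first three rows of that output so the example runs without network access, and the column types are an assumption:

```r
# Hand-built stand-in for the summary printed above (first three rows only).
# Storing id as character is an assumption made for this illustration.
versions <- data.frame(
  id = c("15733485", "15732796", "15723628"),
  latest_version = c(TRUE, FALSE, FALSE),
  publication_date = as.Date(c("2025-06-24", "2025-06-24", "2025-06-23")),
  stringsAsFactors = FALSE
)

# Keep the id of the row flagged as the latest version.
latest_id <- versions$id[versions$latest_version]
latest_id
#> [1] "15733485"
```

Recording the id explicitly like this makes it easier to pin an analysis to one version of the data.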

Get latest version of the data

# get data - the latest version is downloaded by default; here we request a specific version by its Zenodo id.
get_versioned_data(version = "15692263", dir_path = "outputs")
#> Cole Brookson, Collin Schwantes, Timothée Poisot, Tad Dallas, Greg Albery, Colin J. Carlson, Cecilia A. Sanchez, Renata Muylaert, Evan Eskew, Rory Gibb, & Maxwell J Farrell. (2025). The Global Virome in One Network (VIRION): Data Package [Data set]. Zenodo. https://doi.org/10.5281/zenodo.15692263
#> deposit will download to outputs/15692263
#> outputs/15692263
# list files
fs::dir_ls("outputs/15692263")
#> outputs/15692263/datapackage.json   outputs/15692263/detection.csv.gz   
#> outputs/15692263/edgelist.csv       outputs/15692263/provenance.csv.gz  
#> outputs/15692263/taxonomy_host.csv  outputs/15692263/taxonomy_virus.csv 
#> outputs/15692263/temporal.csv.gz    outputs/15692263/virion.csv.gz

Read the data

Now that you have the data locally, you can read it! The VIRION files are comma-delimited and use a period as the decimal mark.


virion <- vroom::vroom(file = "outputs/15692263/virion.csv.gz")
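
If vroom is not installed, base R should be able to read the same kind of gzip-compressed, comma-delimited file. A small sketch using a temporary toy file; the column names and the two rows are invented for illustration and are not VIRION records:

```r
# Write a tiny gzip-compressed CSV to a temporary file (toy data, not VIRION).
toy_path <- tempfile(fileext = ".csv.gz")
con <- gzfile(toy_path, open = "wt")
writeLines(c("host,virus,value", "host a,virus a,1.5", "host b,virus b,2"), con)
close(con)

# read.csv() defaults match the format described above: sep = "," and dec = ".".
# gzfile() opens the compressed file; read.csv() closes the connection when done.
toy <- read.csv(gzfile(toy_path), stringsAsFactors = FALSE)
str(toy)
```

For the full virion.csv.gz file, vroom will usually be much faster; the base R route is mainly useful as a fallback.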

Cite the data

Citing data increases reproducibility and incentivizes data sharing.


# by default the citation will be generated for the current working version.
# this is set when we run `get_versioned_data`
get_version_citation(style = "modern-language-association")
#> Cole Brookson, et al. The Global Virome in One Network (VIRION): Data Package. Zenodo, 18 June 2025, doi:10.5281/zenodo.15692263.
# we can cite a specific version by providing a zenodo id
get_version_citation(zenodo_id = "15643004", style = "apa")
#> Cole Brookson, Collin Schwantes, Timothée Poisot, Tad Dallas, Greg Albery, Colin J. Carlson, Cecilia A. Sanchez, Renata Muylaert, Evan Eskew, Rory Gibb, & Maxwell J Farrell. (2025). The Global Virome in One Network (VIRION): Data Package [Data set]. Zenodo. https://doi.org/10.5281/zenodo.15643004
# sometimes you want a BibTeX entry
export_deposit_bibtex("15643004")
#> @dataset{cole_brookson_2025_15643004,
#>   author       = {Cole Brookson and
#>                   Collin Schwantes and
#>                   Timothée Poisot and
#>                   Tad Dallas and
#>                   Greg Albery and
#>                   Colin J. Carlson and
#>                   Cecilia A. Sanchez and
#>                   Renata Muylaert and
#>                   Evan Eskew and
#>                   Rory Gibb and
#>                   Maxwell J Farrell},
#>   title        = {The Global Virome in One Network (VIRION): Data
#>                    Package
#>                   },
#>   month        = jun,
#>   year         = 2025,
#>   publisher    = {Zenodo},
#>   doi          = {10.5281/zenodo.15643004},
#>   url          = {https://doi.org/10.5281/zenodo.15643004},
#> }

What about the data sources?

The data sources used to create VIRION are referenced directly in the deposit metadata and can be accessed by retrieving that metadata.


# retrieve the deposit metadata as JSON text
metadata_json_text <- export_deposit_metadata(zenodo_id = "15692263", format = "json", verbose = FALSE)

# parse the JSON text into a nested list
metadata_list <- jsonlite::fromJSON(metadata_json_text)

related_identifiers <- metadata_list$metadata$related_identifiers

# keep only the items with the "requires" relation type
required_items_filter <- related_identifiers$relation_type$id == "requires"

required_items <- related_identifiers[required_items_filter, ]

# flatten the nested columns so the table prints cleanly
required_items$resource_type <- required_items$resource_type$id

required_items$relation_type <- "requires"

required_items |>
  kableExtra::kable() |>
  kableExtra::kable_material()
  | identifier | relation_type | resource_type | scheme
3 | https://ftp.ncbi.nlm.nih.gov/genomes/Viruses/AllNuclMetadata/ accessed on 2025-06-09 | requires | dataset | url
4 | 10.5281/zenodo.5167655 | requires | dataset | doi
5 | https://catalog.data.gov/dataset/predict-animals-sampled-c593d | requires | dataset | url
6 | https://ictv.global/sites/default/files/MSL/ICTV_Master_Species_List_2024_MSL40.v1.xlsx | requires | dataset | url
7 | https://ictv.global/sites/default/files/VMR/VMR_MSL40.v1.20250307.xlsx | requires | dataset | url

What do the fields in VIRION mean?

Great question! The datapackage.json file contains field descriptions. It is fairly human-readable, but we can take a closer look using some built-in functions.

For a deeper dive, check out the frictionless package.


# this function wraps frictionless functions and extracts the field names, types, and descriptions for each resource
dict_list <- get_data_dictionary(datapackage_json = "outputs/15692263/datapackage.json")

dict_list$detection_csv
#> # A tibble: 5 × 3
#>   name              type    description                                         
#>   <chr>             <chr>   <chr>                                               
#> 1 AssocID           number  Row number from current VIRION version              
#> 2 DetectionMethod   string  Four harmonized categories in descending order of s…
#> 3 DetectionOriginal string  Method used for determining the presence of a virus…
#> 4 HostFlagID        boolean Denotes the presence of possible uncertainty in hos…
#> 5 NCBIAccession     string  A unique identifier assigned to a record in sequenc…

Use purrr::map to make a full data dictionary for a given deposit.


# render each dictionary table as an HTML table
html_tables <- purrr::map(dict_list, function(x) {
  kableExtra::kable(x = x) |>
    kableExtra::kable_material()
})

list_items <- names(dict_list)

# add a heading above each table and join everything into one HTML string
tables <- sprintf("<h3>%s</h3><br>%s", list_items, html_tables) |>
  paste(collapse = "<br>")

cat(tables)

detection_csv


name | type | description
AssocID | number | Row number from current VIRION version
DetectionMethod | string | Four harmonized categories in descending order of strength of evidence: “Isolation/Observation,” “PCR/Sequencing,” “Antibodies,” and “Not specified”. In some cases where detection method is not available via metadata, source information is used as DetectionOriginal (e.g., “NCBI Nucleotide”).
DetectionOriginal | string | Method used for determining the presence of a virus as described in the original work
HostFlagID | boolean | Denotes the presence of possible uncertainty in host identification, which users may want to check before proceeding any further.
NCBIAccession | string | A unique identifier assigned to a record in sequence databases such as GenBank

edgelist


name | type | description
HostTaxID | number | Taxonomic identification number from NCBI for host taxa.
VirusTaxID | number | Taxonomic identification number from NCBI for virus taxa.
AssocID | string | Row number from current VIRION version

provenance_csv


name | type | description
AssocID | number | Row number from current VIRION version
HostOriginal | string | Host name from original dataset
VirusOriginal | string | Virus name from original dataset
Database | string | Source for the record. One of EID2, Shaw, HP3, GMPD2, PREDICT, or GenBank
DatabaseVersion | string | For static data, a citation. For dynamic data (e.g. GenBank), the access URL and a time stamp
ReferenceText | string | A text description of literature sources
PMID | number | PubMed identifiers for literature sources

taxonomy_host


name | type | description
HostTaxID | number | Taxonomic identification number from NCBI for host taxa.
Host | string | Host species name
HostGenus | string | Host genus name
HostFamily | string | Host family name
HostOrder | string | Host order name
HostClass | string | Host class name
HostNCBIResolved | boolean | Indicates whether or not the host taxa could be harmonized with the NCBI taxonomy.

taxonomy_virus


name | type | description
VirusTaxID | number | Taxonomic identification number from NCBI for virus taxa.
Virus | string | Virus species name
VirusGenus | string | Virus genus name
VirusFamily | string | Virus family name
VirusOrder | string | Virus order name
VirusClass | string | Virus class name
VirusNCBIResolved | boolean | Indicates whether or not the virus taxa could be harmonized with the NCBI taxonomy.
ICTVRatified | boolean | Indicates whether or not the virus taxon is ratified by the International Committee on Taxonomy of Viruses (ICTV).
Database | string | Source for the record. One of EID2, Shaw, HP3, GMPD2, PREDICT, or GenBank

temporal_csv


name | type | description
AssocID | number | Row number from current VIRION version
PublicationYear | number | For literature-derived records, PublicationYear provides the year the literature source was published, accessed either from the original database’s reference description or from scraping the PubMed database.
ReleaseYear | number | The year a given association was “released” in public information (EID2 and PREDICT) or a publicly deposited sample on GenBank. For PREDICT, all values are given as 2021, given the release of a static file at that time even though some findings may have been published or deposited in GenBank earlier. (This redundancy should be captured in overlap with GenBank and EID2.)
ReleaseMonth | number | The month a given association was “released” to the public
ReleaseDay | number | The day a given association was “released” to the public
CollectionYear | number | Reports the year of actual sample collection (GenBank and PREDICT)
CollectionMonth | number | Reports the month of actual sample collection (GenBank and PREDICT)
CollectionDay | number | Reports the day of actual sample collection (GenBank and PREDICT)

virion_csv


name | type | description
Host | string | Host species name
Virus | string | Virus species name
HostTaxID | number | Taxonomic identification number from NCBI for host taxa.
VirusTaxID | number | Taxonomic identification number from NCBI for virus taxa.
HostNCBIResolved | boolean | Indicates whether or not the host taxa could be harmonized with the NCBI taxonomy.
VirusNCBIResolved | boolean | Indicates whether or not the virus taxa could be harmonized with the NCBI taxonomy.
ICTVRatified | boolean | Indicates whether or not the virus taxon is ratified by the International Committee on Taxonomy of Viruses (ICTV).
HostGenus | string | Host genus name
HostFamily | string | Host family name
HostOrder | string | Host order name
HostClass | string | Host class name
HostOriginal | string | Host name from original dataset
VirusGenus | string | Virus genus name
VirusFamily | string | Virus family name
VirusOrder | string | Virus order name
VirusClass | string | Virus class name
VirusOriginal | string | Virus name from original dataset
HostFlagID | boolean | Denotes the presence of possible uncertainty in host identification, which users may want to check before proceeding any further.
DetectionMethod | string | Four harmonized categories in descending order of strength of evidence: “Isolation/Observation,” “PCR/Sequencing,” “Antibodies,” and “Not specified”. In some cases where detection method is not available via metadata, source information is used as DetectionOriginal (e.g., “NCBI Nucleotide”).
DetectionOriginal | string | Method used for determining the presence of a virus as described in the original work
Database | string | Source for the record. One of EID2, Shaw, HP3, GMPD2, PREDICT, or GenBank
DatabaseVersion | string | For static data, a citation. For dynamic data (e.g. GenBank), the access URL and a time stamp
PublicationYear | number | For literature-derived records, PublicationYear provides the year the literature source was published, accessed either from the original database’s reference description or from scraping the PubMed database.
ReferenceText | string | A text description of literature sources
PMID | number | PubMed identifiers for literature sources
ReleaseYear | number | The year a given association was “released” in public information (EID2 and PREDICT) or a publicly deposited sample on GenBank. For PREDICT, all values are given as 2021, given the release of a static file at that time even though some findings may have been published or deposited in GenBank earlier. (This redundancy should be captured in overlap with GenBank and EID2.)
ReleaseMonth | number | The month a given association was “released” to the public
ReleaseDay | number | The day a given association was “released” to the public
CollectionYear | number | Reports the year of actual sample collection (GenBank and PREDICT)
CollectionMonth | number | Reports the month of actual sample collection (GenBank and PREDICT)
CollectionDay | number | Reports the day of actual sample collection (GenBank and PREDICT)
AssocID | number | Row number. Used as an id. Will be specific to a given version of the data
DatabaseDOI | string | Persistent digital identifier for the database
Release_Date | date | Date data were released
Collection_Date | string | Date of actual sample collection
NCBIAccession | string | A unique identifier assigned to a record in sequence databases such as GenBank
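
The dictionary lists HostTaxID and VirusTaxID in both the edgelist and the taxonomy tables, which suggests the smaller files can be joined back together on those ids. The sketch below assumes the id fields act as join keys (the dictionary implies this but does not state it) and uses invented toy values rather than real VIRION records:

```r
# Toy tables that borrow field names from the data dictionary (values invented).
edgelist <- data.frame(
  AssocID = c(1, 2, 3),
  HostTaxID = c(10, 10, 20),
  VirusTaxID = c(100, 200, 100)
)
taxonomy_host <- data.frame(
  HostTaxID = c(10, 20),
  Host = c("host a", "host b"),
  stringsAsFactors = FALSE
)
taxonomy_virus <- data.frame(
  VirusTaxID = c(100, 200),
  Virus = c("virus x", "virus y"),
  stringsAsFactors = FALSE
)

# Attach host and virus names to each association via the taxon ids.
named_edges <- merge(edgelist, taxonomy_host, by = "HostTaxID")
named_edges <- merge(named_edges, taxonomy_virus, by = "VirusTaxID")

# merge() reorders rows, so restore the original association order.
named_edges <- named_edges[order(named_edges$AssocID), ]
named_edges[, c("AssocID", "Host", "Virus")]
```

With the real files, the same pattern would start from the tables read out of edgelist.csv, taxonomy_host.csv, and taxonomy_virus.csv; check the key types first, since the dictionary reports AssocID as a string in the edgelist and a number elsewhere.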