Skip to contents

Summary

This document provides background information about data standards, json-schemas in general and the structure of the Wildlife Disease Data standard specifically.

What is a JSON-schema

A JSON-schema is a human and machine readable document that defines a data standard by describing the structure, properties, and constraints for a dataset. For those of us more accustomed to thinking about spreadsheet files and data frames, a property is roughly equivalent to a field or column. The JSON-schema defines the rules around the type of data used in particular property (character, numeric, logical, etc), and its values (e.g. massUnits must be one of kg, mg, or g; latitude must be between -90 and 90; sampleID must be unique). The schema also describes how those fields should be combined into a coherent whole (i.e. the structure of the dataset).

In a JSON-schema, fields can have parent child relationships. A field may itself be schema. For example, the data property in this standard defines a data object that is a flat table with constraints, types, and/or requirements. In this way, JSON-schema allows for the construction of modular schema documents that can leverage existing schemas (e.g. darwin core, or datacite).

Once we have created a schema, we can then validate data against it. The validation process happens via a validation engine and tells us if the data conform to the standard. If the data do not conform, then the validation engine tells us precisely where the data are non-conformant and what the data standard expected to see.

For more detailed information see JSON-Schema.org

Wildlife Disease Data Standard (WDDS) Structure

The Wildlife Disease Data Standard is composed of two sub-schemas (1) disease_data and (2) project_metadata.

disease_data describes the structure and contents of the wildlife disease data. It has certain required fields and is extensible. This data should be stored as a tidy dataset in a flat file like a CSV. This component of the standard relies heavily on the Darwin Core data standard.

project_metadata describes the structure and contents of the descriptive metadata. That is, metadata about the project that enables discovery, identification, and attribution. This component of the standard relies heavily on the Data Cite Metadata Schema.

Researchers may validate their data against each sub-schema separately, or use them in tandem to validate an entire data package. The term “data package” refers to a list or JSON object that contains both the disease_data and project_metadata components.

Important vocabulary

Property: synonymous with field or column in a table. A property corresponds to a particular attribute (e.g. age, collectedBy, latitude, etc) of the data.
Required: A property is must be included for a given schema or object within a schema. Type: Type of data. Common values include array, object, string, number, integer, null, and boolean.
Array: A comma separated group of values. Similar to a vector in R but a little more flexible.
Array Items: Array items define acceptable values for an array.
- minItems - how many items must be present in the array - minimum - inclusive - smallest value allowed in an array - maximum - inclusive - largest value allowed in an array - enum - controlled vocabulary for an array

Terms

Below is a list of terms for the data standard. It is created directly from the wdds_schema.json file and reflects the structure of that file. disease_data and project_metadata are top level properties of the schema, with many sub-properties. Any item marked as REQUIRED must be included.

disease_data

Type: object
Description: REQUIRED Wildlife disease data. Stored in tidy form.
Required Fields: sampleID, latitude, longitude, sampleCollectionMethod, hostIdentification, detectionTarget, detectionMethod, detectionOutcome, parasiteIdentification
Reference: schemas/disease_data.json

  • sampleID

    Type: array
    Description: REQUIRED A researcher-generated unique ID for the sample: usually a unique string of both characters and integers (e.g., OS BZ19-114 to indicate an oral swab taken from animal BZ19-114; see worked example below), to avoid conflicts that can arise when datasets are merged with number-only notation for samples. Ideally, sample names should be kept consistent across all online databases and physical resources (e.g., museum collections or project-specific sample archives).
    Array Items
    • type: string, null
    • minItems: 1
  • animalID

    Type: array
    Description: A researcher-generated unique ID for the individual animal from which the sample was collected: usually a unique string of both characters and integers (e.g., BZ19-114 to indicate animal 114 sampled in 2019 in Belize). Ideally, animal names should again be kept consistent across online databases and physical resources.
    Array Items
    • type: string, null
    • minItems: 1
  • latitude

    Type: array
    Description: REQUIRED Latitude of the collection site in decimal format. See http://rs.tdwg.org/dwc/terms/decimalLatitude
    Array Items
    • type: number, null
    • minItems: 1
    • maximum: 90
    • minimum: -90
  • longitude

    Type: array
    Description: REQUIRED Longitude of the collection site in decimal format. See http://rs.tdwg.org/dwc/terms/decimalLongitude
    Array Items
    • type: number, null
    • minItems: 1
    • maximum: 180
    • minimum: -180
  • spatialUncertainty

    Type: array
    Description: Coordinate uncertainty from GPS recordings, post-hoc digitization, or systematic alterations (e.g., jittering or rounding) expressed in meters. See http://rs.tdwg.org/dwc/terms/coordinateUncertaintyInMeters
    Array Items
    • type: number, null
    • minItems: 1
    • minimum: 0
  • collectionDay

    Type: array
    Description: The day of the month on which the specimen was collected. See http://rs.tdwg.org/dwc/terms/day
    Array Items
    • type: integer, null
    • minItems: 1
    • minimum: 1
    • maximum: 31
  • collectionMonth

    Type: array
    Description: The month in which the specimen was collected. See http://rs.tdwg.org/dwc/terms/month
    Array Items
    • type: integer, null
    • minItems: 1
    • minimum: 1
    • maximum: 12
  • collectionYear

    Type: array
    Description: The year in which the specimen was collected. See http://rs.tdwg.org/dwc/terms/year
    Array Items
    • type: integer, null
  • sampleCollectionMethod

    Type: array
    Description: REQUIRED The technique used to acquire the sample and/or the tissue from which the sample was acquired (e.g. visual inspection; swab; wing punch; necropsy).
    Example Values: visual inspection, swab, wing punch, necropsy
    Array Items
    • type: string
    • minItems: 1
  • sampleMaterial

    Type: array
    Description: Organic material from the animal body (i.e., tissue or fluid) extracted into a sample (e.g., “liver”; “blood”; “skin”; “whole organism”).
    Example Values: liver, blood, skin, whole organism
    Array Items
    • type: string, null
    • minItems: 1
  • sampleCollectionBodyPart

    Type: array
    Description: Part of the animal body that samples are generated or collected from (e.g., “rectum”; “wing”).
    Example Values: rectum, wing
    Array Items
    • type: string, null
    • minItems: 1
  • hostIdentification

    Type: array
    Description: REQUIRED The Linnaean classification of the animal from which the sample was collected, reported at the lowest possible level (ideally, species binomial name: e.g., Odocoileus virginianus or Ixodes scapularis). As necessary, researchers may also include an additional field indicating when uncertainty exists in the identification of the host organism (see Adding new fields). See http://rs.tdwg.org/dwc/terms/scientificName
    Array Items
    • type: string, null
    • minItems: 1
    • not: [HOMOhomo]{4} [SAPIENSsapiens]{7}
  • organismSex

    Type: array
    Description: The sex of the individual animal from which the sample was collected. See http://rs.tdwg.org/dwc/terms/sex
    Example Values: male, female, hermaphrodite
    Array Items
    • type: string, null
    • minItems: 1
  • liveCapture

    Type: array
    Description: Whether the individual animal from which the sample was collected was alive at the time of capture. Should be TRUE or FALSE; lethal sampling should be recorded as TRUE as this field describes the organism at the time of capture.
    Array Items
    • type: boolean, null
    • minItems: 1
  • healthNotes

    Type: array
    Description: Any additional (unstructured) notes about the state of the animal, such as disease presentation.
    Array Items
    • type: string, null
    • minItems: 1
  • hostLifeStage

    Type: array
    Description: The life stage of the animal from which the sample was collected (as appropriate for the organism) (e.g., juvenile, adult). See http://rs.tdwg.org/dwc/terms/lifeStage
    Example Values: juvenile, adult, larva
    Array Items
    • type: string, null
    • minItems: 1
  • age

    Type: array
    Description: The numeric age of the animal from which the sample was collected, at the time of sample collection, if known (e.g., in monitored populations).
    Array Items
    • type: number, null
    • minItems: 1
    • minimum: 0
  • ageUnits

    Type: array
    Description: The units in which age is measured (usually years).
    Array Items
    • type: string, null
    • enum: years, months, days, hours, minutes, seconds
    • minItems: 1
  • mass

    Type: array
    Description: The mass of the animal from which the sample was collected, at the time of sample collection.
    Array Items
    • type: number, null
    • minItems: 1
    • minimum: 0
  • massUnits

    Type: array
    Description: The units that mass is recorded in (e.g., kg).
    Example Values: kg, g, mg, kilogram, milligram
    Array Items
    • type: string, null
    • minItems: 1
  • length

    Type: array
    Description: The numeric length of the animal from which the sample was collected, at the time of sample collection.
    Array Items
    • type: number, null
    • minItems: 1
    • minimum: 0
  • lengthMeasurement

    Type: array
    Description: The axis of measurement for the organism being measured (e.g., snout-vent length or just SVL; wing length; primary feather).
    Example Values: snout-vent length, intertegular distance, primary feather
    Array Items
    • type: string, null
    • minItems: 1
  • lengthUnits

    Type: array
    Description: The units that length is recorded in (e.g., meters).
    Example Values: mm, meters, cm, km
    Array Items
    • type: string, null
    • minItems: 1
  • organismQuantity

    Type: array
    Description: A number or enumeration value for the quantity of organisms. See http://rs.tdwg.org/dwc/terms/organismQuantity
    Example Values: 1, 1.4, 12
    Array Items
    • type: number, null
    • minItems: 1
    • minimum: 0
  • organismQuantityUnits

    Type: array
    Description: The units that organism quantity is recorded in (e.g. “individuals”). See http://rs.tdwg.org/dwc/iri/organismQuantityType
    Example Values: individual, biomass, Braun-Blanquet scale
    Array Items
    • type: string, null
    • minItems: 1
  • detectionTarget

    Type: array
    Description: REQUIRED The taxonomic identity of the parasite being screened for in the sample. This will often be coarser than the identity of a specific parasite identified in the sample: for example, in a study screening for novel bat coronaviruses, the entire family Coronaviridae might be the target; in a parasite dissection, the targets might be Acanthocephala, Cestoda, Nematoda, and Trematoda. For deep sequencing approaches (e.g., metagenomic and metatranscriptomic viral discovery), researchers should report each alignment target used as a new test to maximize reporting of negative data, or alternatively, select a subset that reflect specific study objectives and the focus of analysis (e.g., specific viral families). See http://rs.tdwg.org/dwc/terms/associatedOccurrences
    Array Items
    • type: string, null
    • minItems: 1
  • detectionMethod

    Type: array
    Description: REQUIRED The type of test performed to detect the parasite or parasite-specific antibody (e.g., ‘qPCR’, ‘ELISA’)
    Array Items
    • type: string, null
    • minItems: 1
  • forwardPrimerSequence

    Type: array
    Description: The sequence of the forward primer used for parasite detection (e.g., for a pan-coronavirus primer: 5’ CDCAYGARTTYTGYTCNCARC 3’). (Strongly encouraged if applicable, e.g., for PCR.)
    Example Values: 5’ CDCAYGARTTYTGYTCNCARC 3’
    Array Items
    • type: string, null
    • minItems: 1
  • reversePrimerSequence

    Type: array
    Description: The sequence of the reverse primer used for parasite detection (e.g., 5’ RHGGRTANGCRTCWATDGC 3’). (Strongly encouraged if applicable, e.g., for PCR.)
    Example Values: 5’ RHGGRTANGCRTCWATDGC 3’
    Array Items
    • type: string, null
    • minItems: 1
  • geneTarget

    Type: array
    Description: The parasite gene targeted by the primer (e.g. “RdRp” for PCR.).
    Example Values: RdRp
    Array Items
    • type: string, null
    • minItems: 1
  • primerCitation

    Type: array
    Description: Citation(s) for the primer(s) (ideally doi, or other permanent identifier for a work, e.g. PMID).
    Example Values: https://doi.org/10.1016/j.virol.2007.06.009, Complete genome sequence of bat coronavirus HKU2 from Chinese horseshoe bats revealed a much smaller spike gene with a different evolutionary lineage from the rest of the genome, PMC7103351, https://openalex.org/works/w2036144053
    Array Items
    • type: string, null
    • minItems: 1
  • probeTarget

    Type: array
    Description: Antibody or antigen targeted for detection. (Strongly encouraged if applicable, e.g., for ELISA.)
    Array Items
    • type: string, null
    • minItems: 1
  • probeType

    Type: array
    Description: Antibody or antigen used for detection. (Strongly encouraged if applicable, e.g., for ELISA.)
    Array Items
    • type: string, null
    • minItems: 1
  • probeCitation

    Type: array
    Description: Citation(s) for the probe(s) (ideally doi, or other permanent identifier for a work, e.g. PMID).
    Array Items
    • type: string, null
    • minItems: 1
  • detectionOutcome

    Type: array
    Description: REQUIRED The test result (i.e., positive, negative, or inconclusive). To avoid ambiguity, these specific values are suggested over numeric values (0 or 1). See http://rs.tdwg.org/dwc/terms/occurrenceStatus
    Array Items
    • type: string, null
    • minItems: 1
  • detectionMeasurement

    Type: array
    Description: Any numeric measurement of parasite detection that is more detailed than simple positive or negative results (e.g., viral titer, parasite counts, sequence reads).
    Array Items
    • type: number, null
    • minItems: 1
  • detectionMeasurementUnits

    Type: array
    Description: Units for quantitative measurements of parasite intensity or test results (e.g., Ct, TCID50/mL, or parasite count).
    Example Values: Ct, TCID50/mL, parasite count
    Array Items
    • type: string, null
    • minItems: 1
  • parasiteIdentification

    Type: array
    Description: REQUIRED The identity of a parasite detected by the test, if any, reported to the lowest possible taxonomic level, either as a Linnaean binomial classification or within the convention of a relevant taxonomic authority (e.g., Borrelia burgdorferi or Zika virus). Parasite identification may be more specific than detection target.
    Example Values: Zika virus, Borrelia burgdorferi, Onchocerca volvulus
    Array Items
    • type: string, null
    • minItems: 1
  • parasiteID

    Type: array
    Description: A researcher-generated unique ID for an individual parasite (primarily useful in nested cases where this ID is used as an animal ID in another row, such as pathogen testing of a blood-feeding arthropod removed from a vertebrate host).
    Example Values: 001, TICK201923
    Array Items
    • type: string, null
    • minItems: 1
  • parasiteLifeStage

    Type: array
    Description: The life stage of the parasite from which the sample was collected (as appropriate for the organism) (e.g., juvenile, adult).
    Example Values: juvenile, adult, sporozoite
    Array Items
    • type: string, null
    • minItems: 1
  • genbankAccession

    Type: array
    Description: The GenBank accession for any parasite genetic sequence(s), if appropriate. Accession numbers or other identifiers for related data stored on another platform should be added in a different field (e.g. GISAID Accession, Immport Accession). See http://rs.tdwg.org/dwc/terms/otherCatalogNumbers
    Example Values: U49845 | U49846, U11111
    Array Items
    • type: string, null
    • minItems: 1

project_metadata

Type: object
Description: REQUIRED Metadata for a project that largely follows the Datacite data standard.
Required Fields: methodology, creators, titles, publicationYear, language, descriptions, fundingReferences
Reference: schemas/project_metadata.json