Summary
This document provides background information about data standards, JSON-schemas in general, and the structure of the Wildlife Disease Data Standard specifically.
What is a JSON-schema?
A JSON-schema is a human- and machine-readable document that defines a
data standard by describing the structure, properties, and constraints
for a dataset. For those of us more accustomed to thinking about
spreadsheet files and data frames, a property is roughly equivalent to a
field or column. The JSON-schema defines the rules around the type of
data used in a particular property (character, numeric, logical, etc.)
and its values (e.g., massUnits must be one of kg, mg, or g; latitude
must be between -90 and 90; sampleID must be unique). The schema also
describes how those fields should be combined into a coherent whole
(i.e., the structure of the dataset).
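As a rough illustration of the rules described above, the fragment below writes three property definitions as a Python dictionary and serializes them to JSON. It is a sketch, not an excerpt from the WDDS schema files: the massUnits vocabulary mirrors the example in the text, and `uniqueItems` is the standard JSON Schema keyword assumed here to express the sampleID uniqueness rule.

```python
import json

# Illustrative property definitions; not copied from the WDDS schema files.
schema_fragment = {
    "type": "object",
    "properties": {
        # Values restricted to a controlled vocabulary.
        "massUnits": {
            "type": "array",
            "items": {"type": "string", "enum": ["kg", "mg", "g"]},
        },
        # Values restricted to an inclusive numeric range.
        "latitude": {
            "type": "array",
            "items": {"type": "number", "minimum": -90, "maximum": 90},
        },
        # Values that must not repeat within the column.
        "sampleID": {
            "type": "array",
            "items": {"type": "string"},
            "uniqueItems": True,
        },
    },
}

# A schema is itself plain JSON, so it can be stored and shared as a file.
print(json.dumps(schema_fragment, indent=2))
```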
In a JSON-schema, fields can have parent-child relationships. A field
may itself be a schema. For example, the data property in this standard
defines a data object that is a flat table with constraints, types,
and/or requirements. In this way, JSON-schema allows for the
construction of modular schema documents that can leverage existing
schemas (e.g., Darwin Core or DataCite).
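One way to picture this modularity is a parent schema that points at child schemas through the standard JSON Schema `$ref` keyword. The sketch below is a toy: the file names mirror the Reference entries given in the Terms section of this document, the sub-schema contents are placeholders, and `resolve_refs` is a hypothetical helper that looks references up in an in-memory dictionary, whereas real validation engines resolve them as URIs.

```python
# In-memory stand-ins for the referenced files; contents are placeholders.
sub_schemas = {
    "schemas/disease_data.json": {"type": "object", "required": ["sampleID"]},
    "schemas/project_metadata.json": {"type": "object", "required": ["creators"]},
}

# A parent schema whose properties are themselves schemas.
top_level = {
    "type": "object",
    "properties": {
        "disease_data": {"$ref": "schemas/disease_data.json"},
        "project_metadata": {"$ref": "schemas/project_metadata.json"},
    },
}


def resolve_refs(node, registry):
    """Return a copy of node with each {"$ref": key} replaced by registry[key]."""
    if isinstance(node, dict):
        if "$ref" in node:
            return resolve_refs(registry[node["$ref"]], registry)
        return {key: resolve_refs(value, registry) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve_refs(item, registry) for item in node]
    return node


resolved = resolve_refs(top_level, sub_schemas)
print(resolved["properties"]["disease_data"]["required"])
```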
Once we have created a schema, we can then validate data against it. The validation process happens via a validation engine and tells us if the data conform to the standard. If the data do not conform, then the validation engine tells us precisely where the data are non-conformant and what the data standard expected to see.
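To make the idea of a validation report concrete, here is a deliberately small, hand-rolled type check in Python. It is not the validation engine used with this standard; the function names and the sample data are hypothetical, and only the type names come from JSON Schema. The point is the shape of the output: each message says where the problem is and what was expected.

```python
# Map JSON Schema type names to Python types (a simplification).
JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "null": type(None),
}


def matches(value, type_name):
    """True if value has the given JSON Schema type."""
    # bool is a subclass of int in Python, but true/false are not JSON numbers.
    if type_name in ("number", "integer") and isinstance(value, bool):
        return False
    return isinstance(value, JSON_TYPES[type_name])


def type_errors(column, values, allowed_types):
    """Report each entry of a column whose type is not allowed."""
    errors = []
    for row, value in enumerate(values):
        if not any(matches(value, name) for name in allowed_types):
            expected = " or ".join(allowed_types)
            errors.append(f"{column}[{row}]: expected {expected}, got {value!r}")
    return errors


# Hypothetical data and expected types for two columns.
expected_types = {"latitude": ["number", "null"], "liveCapture": ["boolean", "null"]}
data = {"latitude": [17.25, "north", None], "liveCapture": [True, 1]}

report = []
for column, allowed in expected_types.items():
    report.extend(type_errors(column, data[column], allowed))

for line in report:
    print(line)
```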
For more detailed information, see JSON-Schema.org.
Wildlife Disease Data Standard (WDDS) Structure
The Wildlife Disease Data Standard is composed of two sub-schemas: (1)
disease_data and (2) project_metadata.
disease_data describes the structure and contents of the wildlife
disease data. It has certain required fields and is extensible. These
data should be stored as a tidy dataset in a flat file such as a CSV.
This component of the standard relies heavily on the Darwin Core data
standard.
project_metadata describes the structure and contents of the
descriptive metadata, that is, metadata about the project that enables
discovery, identification, and attribution. This component of the
standard relies heavily on the DataCite Metadata Schema.
Researchers may validate their data against each sub-schema
separately, or use them in tandem to validate an entire data package.
The term “data package” refers to a list or JSON object that contains
both the disease_data and project_metadata components.
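The following sketch shows one possible way to assemble such a data package in Python from a tidy CSV. The CSV text, coordinates, and project title are made up for illustration, and the step that pivots rows into one array per column is an assumption based on the array-typed properties in the Terms section, not a description of any particular tool.

```python
import csv
import io
import json

# A tiny tidy table, one row per test (values are illustrative only).
csv_text = "sampleID,latitude,longitude\nOS BZ19-114,17.25,-88.77\n"

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Pivot the rows into one array per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# csv reads everything as text, so numeric columns are converted explicitly.
for name in ("latitude", "longitude"):
    columns[name] = [float(value) for value in columns[name]]

# The package holds both components side by side (metadata left incomplete here).
data_package = {
    "disease_data": columns,
    "project_metadata": {"titles": [{"title": "Example project"}]},
}

print(json.dumps(data_package, indent=2))
```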
Important vocabulary
Property: synonymous with field or column in a
table. A property corresponds to a particular attribute (e.g., age,
collectedBy, latitude) of the data.
Required: A property that must be included for a given
schema or object within a schema.
Type: Type of data. Common values include array, object, string, number,
integer, null, and boolean.
Array: A comma-separated group of values. Similar to a
vector in R but a little more flexible.
Array Items: Array items define acceptable values for
an array.
- minItems: the minimum number of items that must be present in the array
- minimum: the smallest value allowed in an array (inclusive)
- maximum: the largest value allowed in an array (inclusive)
- enum: the controlled vocabulary for an array
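The four array keywords above can be sketched as a single Python function. `check_array` is a hypothetical helper written for this illustration, not part of any library, and skipping null entries is an assumption that follows the many item types in this standard that also allow null.

```python
def check_array(values, min_items=None, minimum=None, maximum=None, enum=None):
    """Return a list of messages for entries that break the given rules."""
    errors = []
    # minItems: the array must contain at least this many entries.
    if min_items is not None and len(values) < min_items:
        errors.append(f"expected at least {min_items} item(s), got {len(values)}")
    for index, value in enumerate(values):
        if value is None:
            continue  # assumption: null entries are allowed and skipped
        # enum: the value must come from the controlled vocabulary.
        if enum is not None and value not in enum:
            errors.append(f"item {index}: {value!r} is not in {enum}")
        # minimum and maximum are inclusive bounds.
        if minimum is not None and value < minimum:
            errors.append(f"item {index}: {value!r} is below the minimum {minimum}")
        if maximum is not None and value > maximum:
            errors.append(f"item {index}: {value!r} is above the maximum {maximum}")
    return errors


months = check_array([1, 12, 13, None], min_items=1, minimum=1, maximum=12)
empty = check_array([], min_items=1)
units = check_array(
    ["years", "weeks"],
    enum=["years", "months", "days", "hours", "minutes", "seconds"],
)

for message in months + empty + units:
    print(message)
```

Because the bounds are inclusive, 1 and 12 pass the month check and only 13 is reported.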
Terms
Below is a list of terms for the data standard. It is created
directly from the wdds_schema.json file and reflects the
structure of that file. disease_data and project_metadata are
top-level properties of the schema, with many sub-properties. Any item
marked as REQUIRED must be included.
disease_data
Type: object
Description: REQUIRED Wildlife disease data. Stored in tidy form.
Required Fields: sampleID, latitude, longitude,
sampleCollectionMethod, hostIdentification, detectionTarget,
detectionMethod, detectionOutcome, parasiteIdentification
Reference: schemas/disease_data.json
-
sampleID
Type: array
Description: REQUIRED A researcher-generated unique ID for the sample: usually a unique string of both characters and integers (e.g., OS BZ19-114 to indicate an oral swab taken from animal BZ19-114; see worked example below), to avoid conflicts that can arise when datasets are merged with number-only notation for samples. Ideally, sample names should be kept consistent across all online databases and physical resources (e.g., museum collections or project-specific sample archives).
Array Items-
type: string, null
- minItems: 1
-
type: string, null
-
animalID
Type: array
Description: A researcher-generated unique ID for the individual animal from which the sample was collected: usually a unique string of both characters and integers (e.g., BZ19-114 to indicate animal 114 sampled in 2019 in Belize). Ideally, animal names should again be kept consistent across online databases and physical resources.
Array Items-
type: string, null
- minItems: 1
-
type: string, null
-
latitude
Type: array
Description: REQUIRED Latitude of the collection site in decimal format. See http://rs.tdwg.org/dwc/terms/decimalLatitude
Array Items-
type: number, null
-
minItems: 1
-
maximum: 90
- minimum: -90
-
type: number, null
-
longitude
Type: array
Description: REQUIRED Longitude of the collection site in decimal format. See http://rs.tdwg.org/dwc/terms/decimalLongitude
Array Items-
type: number, null
-
minItems: 1
-
maximum: 180
- minimum: -180
-
type: number, null
-
spatialUncertainty
Type: array
Description: Coordinate uncertainty from GPS recordings, post-hoc digitization, or systematic alterations (e.g., jittering or rounding) expressed in meters. See http://rs.tdwg.org/dwc/terms/coordinateUncertaintyInMeters
Array Items-
type: number, null
-
minItems: 1
- minimum: 0
-
type: number, null
-
collectionDay
Type: array
Description: The day of the month on which the specimen was collected. See http://rs.tdwg.org/dwc/terms/day
Array Items-
type: integer, null
-
minItems: 1
-
minimum: 1
- maximum: 31
-
type: integer, null
-
collectionMonth
Type: array
Description: The month in which the specimen was collected. See http://rs.tdwg.org/dwc/terms/month
Array Items-
type: integer, null
-
minItems: 1
-
minimum: 1
- maximum: 12
-
type: integer, null
-
collectionYear
Type: array
Description: The year in which the specimen was collected. See http://rs.tdwg.org/dwc/terms/year
Array Items- type: integer, null
-
sampleCollectionMethod
Type: array
Description: REQUIRED The technique used to acquire the sample and/or the tissue from which the sample was acquired (e.g. visual inspection; swab; wing punch; necropsy).
Example Values: visual inspection, swab, wing punch, necropsy
Array Items-
type: string
- minItems: 1
-
type: string
-
sampleMaterial
Type: array
Description: Organic material from the animal body (i.e., tissue or fluid) extracted into a sample (e.g., “liver”; “blood”; “skin”; “whole organism”).
Example Values: liver, blood, skin, whole organism
Array Items-
type: string, null
- minItems: 1
-
type: string, null
-
sampleCollectionBodyPart
Type: array
Description: Part of the animal body that samples are generated or collected from (e.g., “rectum”; “wing”).
Example Values: rectum, wing
Array Items-
type: string, null
- minItems: 1
-
type: string, null
-
hostIdentification
Type: array
Description: REQUIRED The Linnaean classification of the animal from which the sample was collected, reported at the lowest possible level (ideally, species binomial name: e.g., Odocoileus virginianus or Ixodes scapularis). As necessary, researchers may also include an additional field indicating when uncertainty exists in the identification of the host organism (see Adding new fields). See http://rs.tdwg.org/dwc/terms/scientificName
Array Items-
type: string, null
-
minItems: 1
- not: [HOMOhomo]{4} [SAPIENSsapiens]{7}
-
type: string, null
-
organismSex
Type: array
Description: The sex of the individual animal from which the sample was collected. See http://rs.tdwg.org/dwc/terms/sex
Example Values: male, female, hermaphrodite
Array Items-
type: string, null
- minItems: 1
-
type: string, null
-
liveCapture
Type: array
Description: Whether the individual animal from which the sample was collected was alive at the time of capture. Should be TRUE or FALSE; lethal sampling should be recorded as TRUE as this field describes the organism at the time of capture.
Array Items-
type: boolean, null
- minItems: 1
-
type: boolean, null
-
healthNotes
Type: array
Description: Any additional (unstructured) notes about the state of the animal, such as disease presentation.
Array Items-
type: string, null
- minItems: 1
-
type: string, null
-
hostLifeStage
Type: array
Description: The life stage of the animal from which the sample was collected (as appropriate for the organism) (e.g., juvenile, adult). See http://rs.tdwg.org/dwc/terms/lifeStage
Example Values: juvenile, adult, larva
Array Items-
type: string, null
- minItems: 1
-
type: string, null
-
age
Type: array
Description: The numeric age of the animal from which the sample was collected, at the time of sample collection, if known (e.g., in monitored populations).
Array Items-
type: number, null
-
minItems: 1
- minimum: 0
-
type: number, null
-
ageUnits
Type: array
Description: The units in which age is measured (usually years).
Array Items-
type: string, null
-
enum: years, months, days, hours, minutes, seconds
- minItems: 1
-
type: string, null
-
mass
Type: array
Description: The mass of the animal from which the sample was collected, at the time of sample collection.
Array Items-
type: number, null
-
minItems: 1
- minimum: 0
-
type: number, null
-
massUnits
Type: array
Description: The units that mass is recorded in (e.g., kg).
Example Values: kg, g, mg, kilogram, milligram
Array Items-
type: string, null
- minItems: 1
-
type: string, null
-
length
Type: array
Description: The numeric length of the animal from which the sample was collected, at the time of sample collection.
Array Items-
type: number, null
-
minItems: 1
- minimum: 0
-
type: number, null
-
lengthMeasurement
Type: array
Description: The axis of measurement for the organism being measured (e.g., snout-vent length or just SVL; wing length; primary feather).
Example Values: snout-vent length, intertegular distance, primary feather
Array Items-
type: string, null
- minItems: 1
-
type: string, null
-
lengthUnits
Type: array
Description: The units that length is recorded in (e.g., meters).
Example Values: mm, meters, cm, km
Array Items-
type: string, null
- minItems: 1
-
type: string, null
-
organismQuantity
Type: array
Description: A number or enumeration value for the quantity of organisms. See http://rs.tdwg.org/dwc/terms/organismQuantity
Example Values: 1, 1.4, 12
Array Items-
type: number, null
-
minItems: 1
- minimum: 0
-
type: number, null
-
organismQuantityUnits
Type: array
Description: The units that organism quantity is recorded in (e.g. “individuals”). See http://rs.tdwg.org/dwc/iri/organismQuantityType
Example Values: individual, biomass, Braun-Blanquet scale
Array Items-
type: string, null
- minItems: 1
-
type: string, null
-
detectionTarget
Type: array
Description: REQUIRED The taxonomic identity of the parasite being screened for in the sample. This will often be coarser than the identity of a specific parasite identified in the sample: for example, in a study screening for novel bat coronaviruses, the entire family Coronaviridae might be the target; in a parasite dissection, the targets might be Acanthocephala, Cestoda, Nematoda, and Trematoda. For deep sequencing approaches (e.g., metagenomic and metatranscriptomic viral discovery), researchers should report each alignment target used as a new test to maximize reporting of negative data, or alternatively, select a subset that reflects specific study objectives and the focus of analysis (e.g., specific viral families). See http://rs.tdwg.org/dwc/terms/associatedOccurrences
Array Items-
type: string, null
- minItems: 1
-
type: string, null
-
detectionMethod
Type: array
Description: REQUIRED The type of test performed to detect the parasite or parasite-specific antibody (e.g., ‘qPCR’, ‘ELISA’).
Array Items-
type: string, null
- minItems: 1
-
type: string, null
-
forwardPrimerSequence
Type: array
Description: The sequence of the forward primer used for parasite detection (e.g., for a pan-coronavirus primer: 5’ CDCAYGARTTYTGYTCNCARC 3’). (Strongly encouraged if applicable, e.g., for PCR.)
Example Values: 5’ CDCAYGARTTYTGYTCNCARC 3’
Array Items-
type: string, null
- minItems: 1
-
type: string, null
-
reversePrimerSequence
Type: array
Description: The sequence of the reverse primer used for parasite detection (e.g., 5’ RHGGRTANGCRTCWATDGC 3’). (Strongly encouraged if applicable, e.g., for PCR.)
Example Values: 5’ RHGGRTANGCRTCWATDGC 3’
Array Items-
type: string, null
- minItems: 1
-
type: string, null
-
geneTarget
Type: array
Description: The parasite gene targeted by the primer (e.g., “RdRp” for PCR).
Example Values: RdRp
Array Items-
type: string, null
- minItems: 1
-
type: string, null
-
primerCitation
Type: array
Description: Citation(s) for the primer(s) (ideally doi, or other permanent identifier for a work, e.g. PMID).
Example Values: https://doi.org/10.1016/j.virol.2007.06.009, Complete genome sequence of bat coronavirus HKU2 from Chinese horseshoe bats revealed a much smaller spike gene with a different evolutionary lineage from the rest of the genome, PMC7103351, https://openalex.org/works/w2036144053
Array Items-
type: string, null
- minItems: 1
-
type: string, null
-
probeTarget
Type: array
Description: Antibody or antigen targeted for detection. (Strongly encouraged if applicable, e.g., for ELISA.)
Array Items-
type: string, null
- minItems: 1
-
type: string, null
-
probeType
Type: array
Description: Antibody or antigen used for detection. (Strongly encouraged if applicable, e.g., for ELISA.)
Array Items-
type: string, null
- minItems: 1
-
type: string, null
-
probeCitation
Type: array
Description: Citation(s) for the probe(s) (ideally doi, or other permanent identifier for a work, e.g. PMID).
Array Items-
type: string, null
- minItems: 1
-
type: string, null
-
detectionOutcome
Type: array
Description: REQUIRED The test result (i.e., positive, negative, or inconclusive). To avoid ambiguity, these specific values are suggested over numeric values (0 or 1). See http://rs.tdwg.org/dwc/terms/occurrenceStatus
Array Items-
type: string, null
- minItems: 1
-
type: string, null
-
detectionMeasurement
Type: array
Description: Any numeric measurement of parasite detection that is more detailed than simple positive or negative results (e.g., viral titer, parasite counts, sequence reads).
Array Items-
type: number, null
- minItems: 1
-
type: number, null
-
detectionMeasurementUnits
Type: array
Description: Units for quantitative measurements of parasite intensity or test results (e.g., Ct, TCID50/mL, or parasite count).
Example Values: Ct, TCID50/mL, parasite count
Array Items-
type: string, null
- minItems: 1
-
type: string, null
-
parasiteIdentification
Type: array
Description: REQUIRED The identity of a parasite detected by the test, if any, reported to the lowest possible taxonomic level, either as a Linnaean binomial classification or within the convention of a relevant taxonomic authority (e.g., Borrelia burgdorferi or Zika virus). Parasite identification may be more specific than detection target.
Example Values: Zika virus, Borrelia burgdorferi, Onchocerca volvulus
Array Items-
type: string, null
- minItems: 1
-
type: string, null
-
parasiteID
Type: array
Description: A researcher-generated unique ID for an individual parasite (primarily useful in nested cases where this ID is used as an animal ID in another row, such as pathogen testing of a blood-feeding arthropod removed from a vertebrate host).
Example Values: 001, TICK201923
Array Items-
type: string, null
- minItems: 1
-
type: string, null
-
parasiteLifeStage
Type: array
Description: The life stage of the parasite from which the sample was collected (as appropriate for the organism) (e.g., juvenile, adult).
Example Values: juvenile, adult, sporozoite
Array Items-
type: string, null
- minItems: 1
-
type: string, null
-
genbankAccession
Type: array
Description: The GenBank accession for any parasite genetic sequence(s), if appropriate. Accession numbers or other identifiers for related data stored on another platform should be added in a different field (e.g. GISAID Accession, Immport Accession). See http://rs.tdwg.org/dwc/terms/otherCatalogNumbers
Example Values: U49845 | U49846, U11111
Array Items-
type: string, null
- minItems: 1
-
type: string, null
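As a compact illustration of two disease_data rules listed above, the sketch below checks that a table contains every required field and flags host names that match the pattern printed under hostIdentification. The table contents are invented, both function names are hypothetical, and reading the `not` entry as a regular expression that host names must not match is an assumption. Note that the pattern uses character classes, so it matches any four letters drawn from "HOMOhomo" followed by a space and seven letters drawn from "SAPIENSsapiens", not only the literal text "Homo sapiens".

```python
import re

# Required fields as listed for disease_data.
REQUIRED_FIELDS = [
    "sampleID", "latitude", "longitude", "sampleCollectionMethod",
    "hostIdentification", "detectionTarget", "detectionMethod",
    "detectionOutcome", "parasiteIdentification",
]

# Pattern as printed under hostIdentification; treating it as a regular
# expression that host names must NOT match is an assumption.
HUMAN_PATTERN = re.compile(r"[HOMOhomo]{4} [SAPIENSsapiens]{7}")


def missing_required(table):
    """Return the required fields that are absent from the table."""
    return [field for field in REQUIRED_FIELDS if field not in table]


def human_host_rows(table):
    """Return row indexes whose host name matches the excluded pattern."""
    hosts = table.get("hostIdentification", [])
    return [
        index
        for index, host in enumerate(hosts)
        if isinstance(host, str) and HUMAN_PATTERN.search(host)
    ]


# Hypothetical, deliberately incomplete table (one array per column).
table = {
    "sampleID": ["OS BZ19-114", "OS BZ19-115"],
    "latitude": [17.25, 17.25],
    "longitude": [-88.77, -88.77],
    "hostIdentification": ["Odocoileus virginianus", "Homo sapiens"],
    "detectionOutcome": ["negative", "positive"],
}

print(missing_required(table))
print(human_host_rows(table))
```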
project_metadata
Type: object
Description: REQUIRED Metadata for a project that largely follows the DataCite data standard.
Required Fields: methodology, creators, titles,
publicationYear, language, descriptions, fundingReferences
Reference: schemas/project_metadata.json
-
methodology
Type: object
Description: REQUIRED A broad categorization of how data were collected.
Properties:-
eventBased
Type: boolean
Description: Whether or not research was conducted in response to a known or suspected infectious disease outbreak, observed animal morbidity or mortality, etc.
-
archival
Type: boolean
Description: Whether samples were from an archival source (e.g., museum collections, biobanks).
-
-
creators
Type: array
Description: REQUIRED The full names of the creators. Should be in the format familyName, givenName.
Array Items-
name
Type: string
Description: REQUIRED DataCite name
-
nameType
Type: string
Description: DataCite nameType
-
givenName
Type: string
Description: DataCite givenName
-
familyName
Type: string
Description: DataCite familyName
-
nameIdentifiers
Type: array
Description: DataCite nameIdentifiers
Array Items-
nameIdentifier
Type: string
Description: REQUIRED DataCite nameIdentifier
-
nameIdentifierScheme
Type: string
Description: REQUIRED DataCite nameIdentifierScheme
-
schemeUri
Type: string
Description: DataCite schemeUri
-
-
affiliation
Type: array
Description: DataCite affiliation
Array Items-
name
Type: string
Description: REQUIRED DataCite name
-
affiliationIdentifier
Type: string
Description: DataCite affiliationIdentifier
-
affiliationIdentifierScheme
Type: string
Description: DataCite affiliationIdentifierScheme
-
schemeUri
Type: string
Description: DataCite schemeUri
-
-
lang
Type: string
Description: DataCite lang
-
-
titles
Type: array
Description: REQUIRED A name or title by which a resource is known.
Array Items-
title
Type: string
Description: REQUIRED DataCite title
-
titleType
Type: string
Description: DataCite titleType
-
lang
Type: string
Description: DataCite lang
-
-
identifier
Type: array
Description: A unique string that identifies a resource.
Array Items-
identifier
Type: string
Description: REQUIRED DataCite identifier
-
identifierType
Type: string
Description: DataCite identifierType
-
-
subjects
Type: array
Description: Subject, keyword, classification code, or key phrase describing the resource.
Array Items-
subject
Type: string
Description: REQUIRED DataCite subject
-
subjectScheme
Type: string
Description: DataCite subjectScheme
-
schemeUri
Type: string
Description: DataCite schemeUri
-
valueUri
Type: string
Description: DataCite valueUri
-
classificationCode
Type: string
Description: DataCite classificationCode
-
lang
Type: string
Description: DataCite lang
-
-
publicationYear
Type: string
Description: REQUIRED The year when the data were or will be made publicly available.
-
rights
Type: array
Description: Any rights information for this resource.
Array Items-
rights
Type: string
Description: DataCite rights
-
rightsUri
Type: string
Description: DataCite rightsUri
-
rightsIdentifier
Type: string
Description: DataCite rightsIdentifier
-
rightsIdentifierScheme
Type: string
Description: DataCite rightsIdentifierScheme
-
schemeUri
Type: string
Description: DataCite schemeUri
-
lang
Type: string
Description: DataCite lang
-
-
descriptions
Type: array
Description: REQUIRED All additional information that does not fit in any of the other categories. May be used for technical information or detailed information associated with a scientific instrument.
Array Items-
description
Type: string
Description: REQUIRED DataCite description
-
descriptionType
Type: string
Description: REQUIRED DataCite descriptionType
-
lang
Type: string
Description: DataCite lang
-
-
language
Type: string
Description: REQUIRED The primary language of the resource.
-
fundingReferences
Type: array
Description: REQUIRED Name and other identifying information of a funding provider.
Array Items-
funderName
Type: string
Description: REQUIRED DataCite funderName
-
funderIdentifier
Type: string
Description: DataCite funderIdentifier
-
funderIdentifierType
Type: string
Description: DataCite funderIdentifierType
-
awardNumber
Type: string
Description: DataCite awardNumber
-
awardUri
Type: string
Description: DataCite awardUri
-
awardTitle
Type: string
Description: DataCite awardTitle
-
-
relatedIdentifiers
Type: array
Description: DataCite relatedIdentifiers
Array Items-
relationType
Type: string
Description: REQUIRED DataCite relationType
-
relatedMetadataScheme
Type: string
Description: DataCite relatedMetadataScheme
-
schemeUri
Type: string
Description: DataCite schemeUri
-
schemeType
Type: string
Description: DataCite schemeType
-
resourceTypeGeneral
Type: string
Description: DataCite resourceTypeGeneral
-
relatedIdentifier
Type: string
Description: REQUIRED DataCite relatedIdentifier
-
relatedIdentifierType
Type: string
Description: REQUIRED DataCite relatedIdentifierType
-
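The required entries in project_metadata sit at two levels: the top-level fields, and the keys marked REQUIRED inside items of some arrays. The sketch below covers both for a subset of the arrays listed above. The metadata values are placeholders, `metadata_problems` is a hypothetical helper, and the check looks only at presence, not at types or controlled vocabularies.

```python
# Top-level required fields as listed for project_metadata.
REQUIRED_TOP_LEVEL = [
    "methodology", "creators", "titles", "publicationYear",
    "language", "descriptions", "fundingReferences",
]

# Keys marked REQUIRED inside the items of some array fields (a subset).
REQUIRED_IN_ITEMS = {
    "creators": ["name"],
    "titles": ["title"],
    "descriptions": ["description", "descriptionType"],
    "fundingReferences": ["funderName"],
}


def metadata_problems(metadata):
    """List missing top-level fields and missing keys inside array items."""
    problems = [
        f"missing {field}" for field in REQUIRED_TOP_LEVEL if field not in metadata
    ]
    for field, keys in REQUIRED_IN_ITEMS.items():
        for index, item in enumerate(metadata.get(field, [])):
            for key in keys:
                if key not in item:
                    problems.append(f"{field}[{index}] is missing {key}")
    return problems


# Placeholder metadata; the description entry deliberately omits its type.
project_metadata = {
    "methodology": {"eventBased": False, "archival": False},
    "creators": [{"name": "Doe, Jane", "givenName": "Jane", "familyName": "Doe"}],
    "titles": [{"title": "Example wildlife disease survey"}],
    "publicationYear": "2024",
    "language": "en",
    "descriptions": [{"description": "Placeholder abstract."}],
    "fundingReferences": [{"funderName": "Example Funder"}],
}

print(metadata_problems(project_metadata))
```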