schemata

HDR UK Schemata - Dataset V2.1

1. HDR UK Dataset Schema - YAML - JSON

The latest version of the specification required for datasets to be onboarded onto the Gateway is shown in this repository and comprises the following:

2. Dataset Properties Breakdown

Below is the breakdown of the HDR UK V2 Dataset Schema by its properties and sub-properties, as defined in the JSON Schema. Each property from 1-7 has its own schema describing its sub-properties, including the data type of each and whether it is a required field.

0. Metadata: Properties generated when a dataset is entered into the system.

1. summary: Summary metadata must be completed by Data Custodians onboarding metadata into the Innovation Gateway MVP.

3. coverage: This information includes attributes for geographical and temporal coverage, cohort details etc. to enable a deeper understanding of the dataset content so that researchers can make decisions about the relevance of the underlying data.

4. provenance: Provenance information allows researchers to understand data within the context of its origins and can be an indicator of quality, authenticity and timeliness.

5. accessibility: Accessibility information allows researchers to understand access, usage, limitations, formats, standards and linkage or interoperability with toolsets.

7. observations: Multiple observations about the dataset may be provided and users are expected to provide at least one observation (1..*). We will be supporting the schema.org observation model (https://schema.org/Observation) with default values. Users will be encouraged to provide their own statistical populations as the project progresses.

8. structuralMetadata: Descriptions and details about the tables and columns within a dataset.
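
As a rough illustration of the breakdown above, the sketch below checks that a dataset record carries the top-level properties named in this section and that `observations` holds at least one entry (1..*). The function name `check_top_level` and the sample record are hypothetical, and the check is deliberately shallow: it only reports presence, whereas the JSON Schema in this repository remains the authority on required fields and data types.

```python
# Top-level property names taken from the breakdown in this README.
# Only the properties named in this section are listed here.
TOP_LEVEL_PROPERTIES = [
    "summary",
    "coverage",
    "provenance",
    "accessibility",
    "observations",
    "structuralMetadata",
]


def check_top_level(dataset: dict) -> list:
    """Return a list of human-readable problems (empty if none found).

    Hypothetical helper: a shallow presence check, not a replacement
    for validating against the published JSON Schema.
    """
    problems = []
    for name in TOP_LEVEL_PROPERTIES:
        if name not in dataset:
            problems.append(f"missing property: {name}")

    # The README states at least one observation is expected (1..*).
    if "observations" in dataset:
        observations = dataset["observations"]
        if not isinstance(observations, list) or len(observations) == 0:
            problems.append("observations must contain at least one entry")
    return problems


# Made-up record for illustration only.
example = {
    "summary": {"title": "Example dataset"},
    "coverage": {},
    "provenance": {},
    "accessibility": {},
    "observations": [],
}
for problem in check_top_level(example):
    print(problem)
```

For full validation, the record would instead be checked against the JSON Schema published in this repository.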

3. Metadata Quality Scoring

Once a dataset is onboarded onto the Gateway, a quality check is run on its corresponding JSON metadata to produce a weighted quality score, which is based on weighted field completeness and weighted field error percentage. The weight of each field can be found here (https://github.com/HDRUK/datasets/tree/master/config/weights) and details of the quality score calculation can be found here (https://github.com/HDRUK/datasets/tree/master/reports#how-scores-are-calculated).
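
To make the idea of a weighted score concrete, here is a minimal sketch. The field names, the weights, and in particular the way the two components are combined (a simple average of completeness and 100 minus the error percentage) are assumptions made for illustration; the linked report describes the calculation actually used.

```python
def weighted_completeness(record: dict, weights: dict) -> float:
    """Percentage of total weight carried by fields that are filled in."""
    total = sum(weights.values())
    filled = sum(
        weight
        for field, weight in weights.items()
        if record.get(field) not in (None, "", [], {})
    )
    return 100.0 * filled / total


def weighted_error_percent(error_fields: set, weights: dict) -> float:
    """Percentage of total weight carried by fields that failed validation."""
    total = sum(weights.values())
    errored = sum(w for field, w in weights.items() if field in error_fields)
    return 100.0 * errored / total


def quality_score(completeness: float, error_percent: float) -> float:
    # Assumption: combine the two components with a plain average.
    return (completeness + (100.0 - error_percent)) / 2.0


# Hypothetical weights and record, not the values in config/weights.
weights = {"title": 5, "abstract": 3, "keywords": 2}
record = {"title": "Example dataset", "abstract": "", "keywords": ["demo"]}
errors = {"keywords"}  # fields assumed to have failed schema validation

completeness = weighted_completeness(record, weights)  # 70.0
error_percent = weighted_error_percent(errors, weights)  # 20.0
print(quality_score(completeness, error_percent))  # 75.0
```

Because each field contributes in proportion to its weight, leaving a heavily weighted field empty lowers the score more than leaving a lightly weighted one empty.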

Based on the weighted quality score, a dataset is given a medallion rating as follows: