Terminology

Terminology used in this document

datum
A specific value, encoded in a particular way, that travels through the data pipeline.
config.yaml
A file (potentially with a different name) used by the data pipeline to allow users to override default pipeline behaviour. See the File API specification for more details.
metadata.yaml
A file used by the data pipeline to describe available data files, listing their associated metadata. See the File API specification for more details.
access.yaml
A file (potentially with a different name) generated by the data pipeline API to record file access. See the File API specification for more details.

Metadata

data_product
Identifies which kind of quantity a datum represents (e.g. “human/mixing-matrix”). Path-formatted to permit structure in the filename scheme (defined below). The desired data_product is typically specified in model code, and it is a core part of the data identifiers used in config.yaml, metadata.yaml, and access.yaml.
version
A semantic version (semver) string identifying a version of a data_product. If no version is specified, the file API selects the most recent one.
component
Identifies a part of a data_product.
filename
Specifies the path to a file, typically relative to the data root. Only required on read, and typically inferred from metadata.yaml.
extension
Specifies the extension of a file. Required on write to generate a standard filename. Typically provided by a datatype API.
run_id
Specifies a unique identifier for a model run. Required on write to generate a standard filename. Typically generated by the file API.
verified_hash
Specifies a “verified good” SHA1 hash for a file. Used by the file API to verify file contents. Typically defined in metadata.yaml.
calculated_hash
Specifies the SHA1 hash computed by the file API for a file. Typically only defined in access.yaml.
max_warning
Specifies the maximum known warning level for a particular datum. Could be used by the file API to filter “bad” data (currently not supported). Typically defined in metadata.yaml.
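
Two of the behaviours described above, choosing the most recent version and checking a file against its verified_hash, can be illustrated with a minimal Python sketch. This is not the file API's implementation: the function names sha1_of_file, hash_matches and latest_version are hypothetical, and the version comparison assumes plain MAJOR.MINOR.PATCH strings with no pre-release or build tags.

```python
import hashlib
import os
import tempfile


def sha1_of_file(path, chunk_size=65536):
    """Return the SHA1 hex digest of a file (the calculated_hash)."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read in chunks so large data files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_matches(path, verified_hash):
    """Compare the calculated hash of a file with its verified_hash."""
    return sha1_of_file(path) == verified_hash.strip().lower()


def latest_version(versions):
    """Pick the most recent version from MAJOR.MINOR.PATCH strings.

    Assumption: no pre-release or build metadata is present.
    """
    # Compare numerically so that 0.10.0 ranks above 0.9.0.
    return max(versions, key=lambda v: tuple(int(part) for part in v.split(".")))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.h5")
        with open(path, "wb") as handle:
            handle.write(b"abc")
        print(hash_matches(path, sha1_of_file(path)))
    print(latest_version(["0.9.0", "0.10.0", "0.2.1"]))
```

Comparing versions as tuples of integers rather than as strings is what keeps 0.10.0 from sorting below 0.9.0.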

Filenames

{data root}/{data_product}…/{run_id}.{extension}

e.g. {data root}/human/mixing-matrix/12345.h5
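
As a rough illustration of how a standard filename could be assembled on write, the sketch below joins the fields from the scheme above. The helper standard_filename is hypothetical, and it leaves out whatever the elided "…" part of the scheme stands for, which matches the example path but may not match the full scheme.

```python
from pathlib import PurePosixPath


def standard_filename(data_root, data_product, run_id, extension):
    """Build {data root}/{data_product}/{run_id}.{extension}.

    Hypothetical helper: the elided part of the scheme is omitted,
    as in the example path above.
    """
    if not data_product or data_product.startswith("/"):
        raise ValueError("data_product must be a relative, path-formatted name")
    # Accept "h5" or ".h5" so callers need not care about the leading dot.
    ext = extension.lstrip(".")
    return str(PurePosixPath(data_root) / data_product / f"{run_id}.{ext}")


if __name__ == "__main__":
    print(standard_filename("data", "human/mixing-matrix", "12345", "h5"))
```

Because data_product is path-formatted, it maps directly onto nested directories under the data root.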