These notes are just extracted from Zulip, so are more stream of consciousness than specifications, but they attempt to describe what a command line tool might do that carried out syncing between the local and remote registries, and what else we might want it to do. Given some of the functionality described here is already provided by the download and upload scripts in the data_pipeline_api repo, it probably makes sense for this to be written in python.
There is an alternative to this command line solution, which would be for the Django interface to the local registry to provide the same (or similar) functionality. This would allow the commands to be access through the web interface, or through the REST API and (optionally) also through the native language interfaces. This would not preclude a command line option as well, because it would simply be a wrapper around the REST API.
Distinction between modelling API and backend #
Individual languages communicate with the local registry and the local data store, but communicate no further afield. They need to be able to parse the
config.yaml file to understand the relationship between the api requests they receive and the requests they need to make of the registry, however.
All of the syncing:
- between the remote registry and the local registry
- to populate the local data store by downloading data from remote data stores
- to create remote storage locations for local data that is being pushed to the remote registry
is handled by an as-yet-unwritten command line tool, which (for the sake of convenience) I have been calling
fdp. This will act a little bit like
git. I’m currently imagining it will look like:
fdp pull config.yamlwill pull all of the registry entries relevant to this config file into the local registry, and download any files that are referred to in
register:blocks to the local data store.
fdp run config.yamlwill run the submission script referred to in the
fdp commit <some-token> | <path/to/file>will mark a code run or a specific file (and all of its provenance) to be synced back to the remote registry (once we’ve worked out how to refer to code runs in an intuitive way?)
fdp pushwill sync back any committed changes to the remote registry and upload any associated data to the remote data store (but we still haven’t determined how to specify what to store where)
Other commands that might be useful:
fdp statusreport on overall sync status of registry
fdp status <some-token> | <path/to/file>report on sync status of a specific thing (e.g. a code run) or a specific file
fdp updatepull the most recent registry entries (and associated data) for anything in the current registry
fdp gc [--files [--all]]maybe to remove anything from the registry that is not in the provenance history of an unsynced file? Add
--filesto also delete the associated files from the local data store unless they have restricted accessibility. Add
--allto also delete the files from the local data store even if they have restricted accessibility.
fdp delete [--force] <some-token>or
fdp delete [--force] <path/to/file>delete a specific thing (e.g. a code run) or a specific file from the registry and local data store, adding
--forceif it is unsynced.
fdp info <some-token> | <path/to/file>- return metadata or registry information on a specific registry entry or file
fdp code will handle syncing between two registries using two REST APIs, one at each end, which (I think?) is a significantly harder problem, and so is probably best only solved once though!
We can’t see a way of easily allowing the functionality we need without having a local registry. I think the idea is that the local registry is going to be as standalone as possible. At the moment it’s a single command to install it, and a second to start it up, a third to stop it. Ideally
fdp might be able to check if it’s running and start it up and stop it on its own I guess, so the only step would be to install it.
In principle the “local” registry could be on a separate computer and exist for multiple users (say inside a safe haven or on HPC), but it would have to have access to the same filesystem as the machine you’re running on otherwise we get back into issues of remote data access again.
Specialising config.yaml files #
One thing we haven’t quite resolved is the distinction between the state of the world when we do
fdp pull config.yaml, and the state of the world when we do
fdp run config.yaml. A few things may have happened:
- the local registry may have been garbage collected using
fdp gc, and now:
- stuff is missing that is required by the code run
- the remote registry may have been updated and:
2. we haven’t checked, so the local registry is out of date
3. we have run another
fdp pull other-config.yamland so some, but possibly not all, of the local registry entries may have been updated
- the remote registry may be updated during a run (possibly by calling
fdp pull other-config.yamlfor a different run), and now: 4. queries made later in a run are inconsistent with queries made earlier in a run. Note this has to happen because we want to be able to write then read something just created during a run - this is one of the requirements that necessitates having a local registry
The first two of these are probably easy to handle (crash and ignore, respectively!), but the last two are more tricky. Those may (will?) have issues for the reproducibility of the code run, because there may be no way of being able to reproduce the state of the registry at the point at which it was run. I wonder whether
fdp run config.yaml should actually generate a fully qualified
config.yaml file specifically for that run, that lists the exact version and data product, etc. with no globbing or text replacement left for the API code to worry about?
Alternatively this could happen when we call
fdp pull config.yaml to generate
fdp run config-2353563.yaml could be executed on the fully qualified version of the file…
config-2353563.yaml was just the name for a file generated by
config.yaml which removes all calculations and lookups, so that
human/disease/SARS-CoV-2/* becomes a whole series of entries, one for each data product, which specifies namespace, data product and version so you know exactly what the right version was at the point you ran
If there’s a working version of the
config.yaml that will reproducibly generate the right aliases no matter what you subsequently do to the registry, then it makes things much simpler (and because it is what the API code is actually using, we know that it will generate the right aliases).
fdp pull config.yaml uses the original. Maybe
fdp run config.yaml creates the new one and starts up the job, or maybe you have a specific
fdp generate config.yaml config-working.yaml to generate the working one? The modelling code should never require an internet connection though.
On a related note to above, the specialised
config-working.yaml file should have all of the
run_metadata defaults explicitly stated so that changing your config won’t change the run…
Going back to the lack of internet connectivity though, generating the working yaml file doesn’t require an internet connection though. It’s just saying “given the current state of the local registry, what does this
config.yaml file really mean in practice if I use it?”.
Local data storage #
How about this? You setup a new local registry, go to /Users/johnsmith/datastore and run
fdp config "johnsmith" to set this directory as your data store. This will generate an API token for the local registry and add an entry to
.scrc/.users noting the fact that the johnsmith username / namespace is associated with /Users/johnsmith/datastore… or do something fancier to specify the local registry url, or even set passwords? Your
config.yaml now contains a username / namespace rather than a list of defaults associated with your account.
If someone wants to share your datastore, that’s fine, they run
fdp config "otherperson" from the same working directory and their details are added to
fdp config might also generate a specific login node, recorded in