Optionally support a libsql connection URI which will be used to track jobs as they are processed by twine-writerd or twine-cli.
A job consists of:
- A UUID to identify it
- Optionally, a parent UUID
- A URI to identify it (which may simply be a urn:uuid: representation of the job UUID if nothing else is suitable; otherwise it will be the canonical source or target URI, depending upon the processing pipeline; workflow components may update it accordingly during processing)
- Timestamps for added and updated
- A status: WAITING, ACTIVE, ABORTED (by the user), COMPLETE, FAILED, or ERRORS (partial failure)
- A status annotation (free-text) which may be set to indicate the failure reason
- If active, the cluster/instance details of the node processing the job (preserved for diagnosis once set)
- Processing item
- x of y progress indicators (particularly for bulk ingests from filesystem sources)
UUIDs should, where possible, be taken from the source if it incorporates one into its identification, or generated on the fly if this is not possible.
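The attributes above suggest a single job-tracking table. A minimal sketch of such a schema, using Python's sqlite3 module as a stand-in for libsql (which speaks the SQLite dialect); all table and column names here are illustrative assumptions, not the actual libtwine schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE jobs (
    uuid       TEXT PRIMARY KEY,           -- job UUID (from the source if available)
    parent     TEXT REFERENCES jobs(uuid), -- optional parent job UUID
    uri        TEXT NOT NULL,              -- canonical source/target URI, or urn:uuid:
    added      TEXT NOT NULL,              -- timestamp when the job was created
    updated    TEXT NOT NULL,              -- timestamp of the last update
    status     TEXT NOT NULL CHECK (status IN
                 ('WAITING','ACTIVE','ABORTED','COMPLETE','FAILED','ERRORS')),
    annotation TEXT,                       -- free-text status annotation (failure reason)
    node       TEXT,                       -- cluster/instance processing the job
    item       TEXT,                       -- item currently being processed
    progress_x INTEGER,                    -- x of y progress
    progress_y INTEGER
);
""")
```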
A job stack should be maintained internally to libtwine in order to track parent/child relationships, rather than requiring it to be made explicit.
As an example, an ingest of N-Quads from a file, processing with spindle-correlate might yield the following:
- A job is created in state WAITING with a newly-generated UUID and a file:/// URI
- The N-Quads are parsed and the number of graphs determined; the job is updated to state ACTIVE, with progress set to 0 of number-of-graphs
- For each graph that is correlated by Spindle, progress is updated, and a new child job is created in state WAITING, using the Spindle-generated UUID and URI
- Once processing of the N-Quads is complete, the job status is updated to COMPLETE
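The ingest lifecycle above can be sketched as a sequence of SQL statements against a simplified jobs table; this uses sqlite3 in place of libsql, and every function and column name is an illustrative assumption:

```python
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE jobs (
    uuid TEXT PRIMARY KEY, parent TEXT, uri TEXT, status TEXT,
    progress_x INTEGER, progress_y INTEGER)""")

def ingest(path, graphs):
    """Model the ingest of a parsed N-Quads file; `graphs` is a list of
    (spindle_uuid, spindle_uri) pairs produced by correlation."""
    # 1. Create the job in WAITING with a fresh UUID and a file:/// URI
    job = str(uuid.uuid4())
    db.execute("INSERT INTO jobs (uuid, uri, status) VALUES (?, ?, 'WAITING')",
               (job, "file://" + path))
    # 2. Graphs counted: move to ACTIVE with progress 0 of number-of-graphs
    db.execute("UPDATE jobs SET status='ACTIVE', progress_x=0, progress_y=? "
               "WHERE uuid=?", (len(graphs), job))
    # 3. Each correlated graph bumps progress and spawns a WAITING child job
    for done, (child_uuid, child_uri) in enumerate(graphs, start=1):
        db.execute("INSERT INTO jobs (uuid, parent, uri, status) "
                   "VALUES (?, ?, ?, 'WAITING')", (child_uuid, job, child_uri))
        db.execute("UPDATE jobs SET progress_x=? WHERE uuid=?", (done, job))
    # 4. All graphs processed: the parent job is COMPLETE
    db.execute("UPDATE jobs SET status='COMPLETE' WHERE uuid=?", (job,))
    return job
```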
As spindle-generate later processes its queue of items, it performs the following:
- A job is created in state WAITING using the Spindle-generated UUID and URI; if it already exists, its parentage is preserved (thus, if the job originated from an ingest as described above, the proxy-generation step maintains the parent-child relationship, allowing for ready visualisation)
- As the proxy is generated, its status is updated accordingly
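The "create if absent, preserve parentage if present" step maps naturally onto an upsert. A hedged sketch using SQLite's INSERT ... ON CONFLICT (which libsql also supports, being SQLite-compatible); the function and column names are assumptions:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (uuid TEXT PRIMARY KEY, parent TEXT, uri TEXT, status TEXT)")

def begin_generation(job_uuid, uri):
    """Register a proxy-generation job. If the job already exists (e.g. it
    was created as a child during ingest), its parent column is left
    untouched; otherwise a new, parentless job is created."""
    db.execute("""INSERT INTO jobs (uuid, uri, status) VALUES (?, ?, 'WAITING')
                  ON CONFLICT(uuid) DO UPDATE
                  SET status='WAITING', uri=excluded.uri""",
               (job_uuid, uri))
```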
With this arrangement, a small number of relatively simple SQL queries can result in progress tracking and volumetrics across a processing cluster.
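As an example of the kind of simple query this enables, a one-statement volumetrics summary (per-status job counts), again using sqlite3 as a stand-in for libsql over an assumed schema:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (uuid TEXT PRIMARY KEY, status TEXT)")
db.executemany("INSERT INTO jobs VALUES (?, ?)",
               [("a", "COMPLETE"), ("b", "WAITING"), ("c", "WAITING")])

# One GROUP BY gives cluster-wide counts per state
rows = db.execute(
    "SELECT status, COUNT(*) FROM jobs GROUP BY status ORDER BY status").fetchall()
print(rows)  # → [('COMPLETE', 1), ('WAITING', 2)]
```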
Open question: how would Twine know when to preserve versus replace the parent of a job?
Perhaps it could be as simple as user action (i.e., twine-cli) taking precedence over an ongoing process: a queue-driven twine-writerd would only set the parent of a job if it is newly created, whereas twine-cli would always override it. Both would create an overarching job for their processing runs, whether from a file or a queue.
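That precedence rule is small enough to sketch directly; here `user_action=True` models twine-cli (always overrides the parent) and `False` models a queue-driven twine-writerd (sets the parent only when the job is newly created). This is a speculative sketch of the open question, not settled behaviour:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (uuid TEXT PRIMARY KEY, parent TEXT, status TEXT)")

def set_parent(job_uuid, parent_uuid, user_action):
    # Ensure the job row exists; a fresh row starts with parent = NULL
    db.execute("INSERT OR IGNORE INTO jobs (uuid, status) VALUES (?, 'WAITING')",
               (job_uuid,))
    if user_action:
        # twine-cli: user action takes precedence, always override
        db.execute("UPDATE jobs SET parent=? WHERE uuid=?",
                   (parent_uuid, job_uuid))
    else:
        # twine-writerd: only set the parent on a newly-created job
        db.execute("UPDATE jobs SET parent=? WHERE uuid=? AND parent IS NULL",
                   (parent_uuid, job_uuid))
```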
Tracked as RESDATA-1279