Optionally support a libsql connection URI which will be used to track jobs as they are processed by twine-writerd or twine-cli.
A job consists of:
- A UUID to identify it
- Optionally, a parent UUID
- A URI to identify it (which may simply be a urn:uuid: representation of the job UUID if nothing else is suitable; otherwise it will be the canonical source or target URI, depending upon the processing pipeline; workflow components may update it accordingly during processing)
- Timestamps for added and updated
- A status: WAITING, ACTIVE, ABORTED (by the user), COMPLETE, FAILED, or ERRORS (partial failure)
- A status annotation (free-text) which may be set to indicate the failure reason
- If active, the cluster/instance details of the node processing the job (preserved for diagnosis once set)
- Processing item
- x of y progress indicators (particularly for bulk ingests from filesystem sources)
UUIDs should, where possible, be taken from the source if it incorporates one into its identification, or generated on the fly if this is not possible.
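The attributes above suggest a single job-tracking table. A minimal sketch of such a schema, using Python's sqlite3 module as a stand-in for libsql (which speaks the SQLite dialect); all table and column names here are illustrative assumptions, not the actual libtwine schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE jobs (
    uuid       TEXT PRIMARY KEY,           -- job UUID (from the source if available)
    parent     TEXT REFERENCES jobs(uuid), -- optional parent job UUID
    uri        TEXT NOT NULL,              -- canonical source/target URI, or urn:uuid:
    added      TEXT NOT NULL,              -- timestamp when the job was created
    updated    TEXT NOT NULL,              -- timestamp of the last update
    status     TEXT NOT NULL CHECK (status IN
                 ('WAITING','ACTIVE','ABORTED','COMPLETE','FAILED','ERRORS')),
    annotation TEXT,                       -- free-text status annotation (failure reason)
    node       TEXT,                       -- cluster/instance processing the job
    item       TEXT,                       -- item currently being processed
    progress_x INTEGER,                    -- x of y progress
    progress_y INTEGER
);
""")
```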
A job stack should be maintained internally to libtwine in order to track parent/child relationships, rather than requiring it to be made explicit.
As an example, an ingest of N-Quads from a file, processing with spindle-correlate might yield the following:
- A job is created in state WAITING with a newly-generated UUID and a file:/// URI
- The N-Quads are parsed and the number of graphs determined; the job is updated to state ACTIVE, with progress set to 0 of number-of-graphs
- For each graph that is correlated by Spindle, progress is updated, and a new child job is created in state WAITING, using the Spindle-generated UUID and URI
- Once processing of the N-Quads is complete, the job status is updated to COMPLETE
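The ingest lifecycle above can be sketched as a sequence of SQL statements against a simplified jobs table; this uses sqlite3 in place of libsql, and every function and column name is an illustrative assumption:

```python
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE jobs (
    uuid TEXT PRIMARY KEY, parent TEXT, uri TEXT, status TEXT,
    progress_x INTEGER, progress_y INTEGER)""")

def ingest(path, graphs):
    """Model the ingest of a parsed N-Quads file; `graphs` is a list of
    (spindle_uuid, spindle_uri) pairs produced by correlation."""
    # 1. Create the job in WAITING with a fresh UUID and a file:/// URI
    job = str(uuid.uuid4())
    db.execute("INSERT INTO jobs (uuid, uri, status) VALUES (?, ?, 'WAITING')",
               (job, "file://" + path))
    # 2. Graphs counted: move to ACTIVE with progress 0 of number-of-graphs
    db.execute("UPDATE jobs SET status='ACTIVE', progress_x=0, progress_y=? "
               "WHERE uuid=?", (len(graphs), job))
    # 3. Each correlated graph bumps progress and spawns a WAITING child job
    for done, (child_uuid, child_uri) in enumerate(graphs, start=1):
        db.execute("INSERT INTO jobs (uuid, parent, uri, status) "
                   "VALUES (?, ?, ?, 'WAITING')", (child_uuid, job, child_uri))
        db.execute("UPDATE jobs SET progress_x=? WHERE uuid=?", (done, job))
    # 4. All graphs processed: the parent job is COMPLETE
    db.execute("UPDATE jobs SET status='COMPLETE' WHERE uuid=?", (job,))
    return job
```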
As spindle-generate later processes its queue of items, it performs the following:
- A job is created in state WAITING using the Spindle-generated UUID and URI; if it already exists, its parentage is preserved (thus, if the job originated from an ingest as described above, the proxy-generation step maintains the parent-child relationship, allowing for ready visualisation)
- As the proxy is generated, its status is updated accordingly
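The "create if absent, preserve parentage if present" step maps naturally onto an upsert. A hedged sketch using SQLite's INSERT ... ON CONFLICT (which libsql also supports, being SQLite-compatible); the function and column names are assumptions:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (uuid TEXT PRIMARY KEY, parent TEXT, uri TEXT, status TEXT)")

def begin_generation(job_uuid, uri):
    """Register a proxy-generation job. If the job already exists (e.g. it
    was created as a child during ingest), its parent column is left
    untouched; otherwise a new, parentless job is created."""
    db.execute("""INSERT INTO jobs (uuid, uri, status) VALUES (?, ?, 'WAITING')
                  ON CONFLICT(uuid) DO UPDATE
                  SET status='WAITING', uri=excluded.uri""",
               (job_uuid, uri))
```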
With this arrangement, a small number of relatively simple SQL queries can result in progress tracking and volumetrics across a processing cluster.
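As an example of the kind of simple query this enables, a one-statement volumetrics summary (per-status job counts), again using sqlite3 as a stand-in for libsql over an assumed schema:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (uuid TEXT PRIMARY KEY, status TEXT)")
db.executemany("INSERT INTO jobs VALUES (?, ?)",
               [("a", "COMPLETE"), ("b", "WAITING"), ("c", "WAITING")])

# One GROUP BY gives cluster-wide counts per state
rows = db.execute(
    "SELECT status, COUNT(*) FROM jobs GROUP BY status ORDER BY status").fetchall()
print(rows)  # → [('COMPLETE', 1), ('WAITING', 2)]
```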
Open question: how would Twine know when to preserve versus replace the parent of a job?
Perhaps it could be as simple as user action (i.e., twine-cli) taking precedence over an ongoing process: a queue-driven twine-writerd would only set the parent of a job if it is newly created, whereas twine-cli would always override it. Both would create an overarching job for their processing runs, whether from a file or a queue.
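That precedence rule is small enough to sketch directly; here `user_action=True` models twine-cli (always overrides the parent) and `False` models a queue-driven twine-writerd (sets the parent only when the job is newly created). This is a speculative sketch of the open question, not settled behaviour:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (uuid TEXT PRIMARY KEY, parent TEXT, status TEXT)")

def set_parent(job_uuid, parent_uuid, user_action):
    # Ensure the job row exists; a fresh row starts with parent = NULL
    db.execute("INSERT OR IGNORE INTO jobs (uuid, status) VALUES (?, 'WAITING')",
               (job_uuid,))
    if user_action:
        # twine-cli: user action takes precedence, always override
        db.execute("UPDATE jobs SET parent=? WHERE uuid=?",
                   (parent_uuid, job_uuid))
    else:
        # twine-writerd: only set the parent on a newly-created job
        db.execute("UPDATE jobs SET parent=? WHERE uuid=? AND parent IS NULL",
                   (parent_uuid, job_uuid))
```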
Tracked as RESDATA-1279