Description
Background
Currently, a lagging scheduler is switched off by the other instances if it has not written a heartbeat to the database table scheduler_instance for a configurable period (e.g. 20 seconds).
First, one other instance (the first to notice) sets the status of the lagging scheduler to DEACTIVATED in the scheduler_instance table. Then all workflows assigned to the lagging scheduler are unassigned, so that the remaining schedulers can distribute them among themselves, and the lagging scheduler is deleted from the scheduler_instance table.
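The deactivation sequence described above can be sketched roughly as follows. This is only an in-memory illustration, not the project's implementation: the class `LaggingSchedulerSketch`, the two maps standing in for the scheduler_instance table and the workflow assignments, and the method `removeLaggingInstances` are all hypothetical names chosen for this example.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model; the real system keeps this state in the database.
class LaggingSchedulerSketch {
    enum Status { ACTIVE, DEACTIVATED }

    static final class Instance {
        Status status = Status.ACTIVE;
        Instant lastHeartbeat;
        Instance(Instant lastHeartbeat) { this.lastHeartbeat = lastHeartbeat; }
    }

    // Stand-in for the scheduler_instance table: scheduler id -> row.
    final Map<Long, Instance> schedulerInstances = new HashMap<>();
    // Stand-in for workflow assignments: workflow name -> owning scheduler id (null = unassigned).
    final Map<String, Long> workflowOwner = new HashMap<>();

    // Returns the ids of the instances that were deactivated and removed.
    List<Long> removeLaggingInstances(Instant now, Duration lagThreshold) {
        List<Long> lagging = new ArrayList<>();
        // Step 1: mark every instance whose heartbeat is older than the threshold as DEACTIVATED.
        for (Map.Entry<Long, Instance> e : schedulerInstances.entrySet()) {
            Duration sinceHeartbeat = Duration.between(e.getValue().lastHeartbeat, now);
            if (sinceHeartbeat.compareTo(lagThreshold) > 0) {
                e.getValue().status = Status.DEACTIVATED;
                lagging.add(e.getKey());
            }
        }
        for (Long id : lagging) {
            // Step 2: unassign the workflows of the lagging instance.
            for (Map.Entry<String, Long> w : workflowOwner.entrySet()) {
                if (id.equals(w.getValue())) {
                    w.setValue(null);
                }
            }
            // Step 3: delete the lagging instance.
            schedulerInstances.remove(id);
        }
        return lagging;
    }
}
```

In the real system these steps run against the database and concurrently on several instances, so the sketch ignores transactions and the race over which instance notices first.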
If the lagging scheduler comes back online later, having in fact been deactivated wrongly, it switches itself off because it cannot find its own id in the scheduler_instance table. The scheduler instance can then be reactivated manually by calling the /admin/startManager/ endpoint.
If the database goes down, no scheduler instance can update its heartbeat, so only one of them will survive once the database comes back online. This is unnecessary. A scheduler should be able to switch itself back on by registering itself under a new scheduler instance id in the scheduler_instance table.
Unfortunately, this doesn't solve (but also doesn't exacerbate) the problem that a workflow could be launched twice in parallel. Any job submission already started on the wrongly deactivated instance will still finish. If that submission has not completed yet, and the instance that took over the workflow has already started a new job submission for it, then two parallel job submissions (and executions) may occur.
Feature
When it catches the SchedulerInstanceAlreadyDeactivatedException, the scheduler instance should not shut itself off. Instead, it should clean up its sensors and obtain a new scheduler instance id.
Proposed Solution
Instead of calling stopManager() when catching the exception, only call:
sensors.cleanUpSensors()
workflowBalancer.resetSchedulerInstanceId()
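A minimal sketch of how the proposed handling might look is given here. Only the names `cleanUpSensors`, `resetSchedulerInstanceId` and `SchedulerInstanceAlreadyDeactivatedException` come from this issue. The interfaces, the `runIteration` helper and its boolean return value are assumptions made for illustration and do not reflect the actual class structure of the project.

```java
// Hypothetical stand-ins for the real components; only the method names are taken from the issue.
class SchedulerRecoverySketch {
    static class SchedulerInstanceAlreadyDeactivatedException extends RuntimeException {}

    interface Sensors { void cleanUpSensors(); }

    interface WorkflowBalancer { void resetSchedulerInstanceId(); }

    // Runs one scheduler iteration. Returns true if it completed normally,
    // false if the instance was found to be deactivated and was reset.
    static boolean runIteration(Runnable iteration, Sensors sensors, WorkflowBalancer workflowBalancer) {
        try {
            iteration.run();
            return true;
        } catch (SchedulerInstanceAlreadyDeactivatedException e) {
            // Previously: stopManager(). Proposed: stay alive, drop the current sensors,
            // and forget the old instance id so that a new one is registered later.
            sensors.cleanUpSensors();
            workflowBalancer.resetSchedulerInstanceId();
            return false;
        }
    }
}
```

With this shape the manager loop keeps running after the exception, and the next iteration would presumably register the instance under a fresh id, which is the behavior the feature asks for.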