Skip to content

Conversation

@khushboobhatia01
Copy link
Contributor

This pull request adds a new abandon_wait_config which will allow user to configure wait period before worker starts abandoning incomplete executions during shutdown. This will ensure short lived executions can complete instead of just abandoning.

Why is this change needed?

  • In K8S HA environment we have a terminationGracePeriod config which waits for x seconds before killing the pod.
    We plan to set this config, so that short lived executions can complete without any issue. With the current implementation actionrunner will directly start abandoning incomplete executions and terminationGracePeriod config won't be of much use.

@pull-request-size pull-request-size bot added the size/M PR that changes 30-99 lines. Good size to review. label Aug 26, 2021
@Kami
Copy link
Member

Kami commented Aug 27, 2021

Thanks for the contribution.

The change indeed looks reasonable, but perhaps also a corresponding documentation PR may be in order - aka in what kind of situations users may want to set this value to something larger than 0 (basically what you described as your motivation for the change - in case users have short running executions or similar).

@khushboobhatia01
Copy link
Contributor Author

Thanks for the contribution.

The change indeed looks reasonable, but perhaps also a corresponding documentation PR may be in order - aka in what kind of situations users may want to set this value to something larger than 0 (basically what you described as your motivation for the change - in case users have short running executions or similar).

Thanks @Kami for the review. I've added the new config details in the CHANGELOG.

@amanda11 amanda11 added this to the 3.6.0 milestone Sep 14, 2021
@amanda11
Copy link
Contributor

amanda11 commented Oct 1, 2021

@khushboobhatia01 / @m4dcoder Given the discussions about a clean shutdown behaviour (#5367, is this PR still targetted for the 3.6.0 release.
Are we aiming to put in this as a workaround or wait until we get a full solution instead?

If we are going to wait, then we should probably remove it from the 3.6 milestone and project.
If we want to put this in to improve the situation, then let us know if review is required.

@cognifloyd
Copy link
Member

I think a pause is good, but only

  • if it stops consuming new actions first, and
  • if there there is something running to wait for.

So, I don't think this should go in 3.6

@khushboobhatia01
Copy link
Contributor Author

I think a pause is good, but only

  • if it stops consuming new actions first, and
  • if there there is something running to wait for.

So, I don't think this should go in 3.6

Agree.
Current implementation just abandons all ongoing executions which is not graceful.

We could think of a similar approach as discussed for workflow engine graceful shutdown #5373)

  • On shutdown, the worker should remove itself from the service registry.
  • If there are other workers in the service registry, then do the shutdown sequence. The shutdown sequence should be something like stop accepting incoming requests and wait until this worker completes processing any requests.
  • Else if there are no other worker in the service registry, then the worker needs to stop accepting requests, pause all running action executions, and then proceed with rest of the shutdown sequence.
  • On worker startup, any action executions paused by the shutdown should be resumed.

In this case abandonment will happen only for orphaned action executions stuck in some running/canceling/pausing state

Thoughts @cognifloyd @m4dcoder?

@khushboobhatia01
Copy link
Contributor Author

@khushboobhatia01 / @m4dcoder Given the discussions about a clean shutdown behaviour (#5367, is this PR still targetted for the 3.6.0 release. Are we aiming to put in this as a workaround or wait until we get a full solution instead?

If we are going to wait, then we should probably remove it from the 3.6 milestone and project. If we want to put this in to improve the situation, then let us know if review is required.

@amanda11 I think this PR is still open for discussion. I had created this MR before the workflow engine graceful shutdown discussion.

@cognifloyd
Copy link
Member

Yeah. We should use use the same approach for workflow engine and actionrunner.

As an interim PR, I think getting the services to add/remove self from registry would be a good start. Everything at once would probably be more difficult to review & merge.

@khushboobhatia01
Copy link
Contributor Author

Yeah. We should use use the same approach for workflow engine and actionrunner.

As an interim PR, I think getting the services to add/remove self from registry would be a good start. Everything at once would probably be more difficult to review & merge.

Yes, will start with that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

service: action runner size/M PR that changes 30-99 lines. Good size to review. status:under discussion

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants