Skip to content

Fail WFT if Local Activity execution experienced an Error#1591

Merged
Spikhalskiy merged 1 commit intotemporalio:masterfrom
Spikhalskiy:issue-1533-1
Jan 9, 2023
Merged

Fail WFT if Local Activity execution experienced an Error#1591
Spikhalskiy merged 1 commit intotemporalio:masterfrom
Spikhalskiy:issue-1533-1

Conversation

@Spikhalskiy
Copy link
Copy Markdown
Contributor

@Spikhalskiy Spikhalskiy commented Jan 8, 2023

What was changed

Now if a Local Activity throws an Error, it will lead to WFT failure, not a failure of activity execution.

Closes #1533

@Spikhalskiy Spikhalskiy force-pushed the issue-1533-1 branch 3 times, most recently from dd1c6e1 to b0ddc72 Compare January 9, 2023 18:00
@Spikhalskiy Spikhalskiy marked this pull request as ready for review January 9, 2023 18:13
Copy link
Copy Markdown
Member

@Sushisource Sushisource left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me -- this behavior should probably be added to the docstring for newLocalActivityStub / wherever else you can start LAs from Workflow

*
* @param workflowTask task to handle
* @return an object that can be used to build workflow task completion or failure response
* @throws Throwable if processing experienced unrecoverable unexpected issues
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this say "unrecoverable"? Since the idea is to let it try again with WFT failure? Unrecoverable for some smaller scope, but recoverable in the bigger sense.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it may be unclear with this wording. I actually mean "Unrecoverable on this worker in the context of this Workflow Task". Because you are totally correct, it doesn't mean that it's unrecoverable for the workflow, hence WFT failure. I will add a clarification.

@Spikhalskiy
Copy link
Copy Markdown
Contributor Author

Spikhalskiy commented Jan 9, 2023

@Sushisource Java devs pretty much never throw Errors, so I feel it's over-communication and just an implied correct behavior. If Error is ever thrown, the WFT is invalidated. But it may be worth mentioning in LA section of our Developer Guide for Java.

@Spikhalskiy Spikhalskiy enabled auto-merge (squash) January 9, 2023 18:36
@bergundy
Copy link
Copy Markdown
Member

bergundy commented Jan 9, 2023

I think I'm missing context here, why is this the intended behavior?

@Spikhalskiy
Copy link
Copy Markdown
Contributor Author

Spikhalskiy commented Jan 9, 2023

@bergundy it's explained in the original issue and in the test comment:

 * If Local Activity throws an {@link Error}, it should immediately fail Workflow Task. Java Error
 * signals a problem with the Worker and shouldn't lead to a failure of a Local Activity execution
 * or a Workflow.

Users wouldn't expect a failure of a Workflow just because their worker is OutOfMemory. Currently, pre this PR, it leads to a final Local Activity failure and likely a Workflow failure.

@Spikhalskiy Spikhalskiy merged commit 8b14ee9 into temporalio:master Jan 9, 2023
@Spikhalskiy Spikhalskiy deleted the issue-1533-1 branch January 9, 2023 18:47
@bergundy
Copy link
Copy Markdown
Member

What about IOError, why fail the WFT in this case?
There are probably other exceptions to the rule here.

@Spikhalskiy
Copy link
Copy Markdown
Contributor Author

Spikhalskiy commented Jan 10, 2023

@bergundy
There is no IOError in Java, there is IOException. No IO problems can ever be Errors and meet the specification, because they are always recoverable. Other than ones that render an application unusable and in a broken state.
There are no exceptions to this rule, any Error happening in Local Activity is an unrecoverable failure of the application instance by definition of Error: https://docs.oracle.com/en/java/javase/14/docs/api/java.base/java/lang/Error.html
Check out "Direct Known Subclasses" section to get an idea of what kind of failures are considered Errors in Java.

IOException will be treated as it should after this change: a failure of an LA attempt or an LA failure if we are out of retries.

@bergundy
Copy link
Copy Markdown
Member

Looks like there is an IOError class that is thrown in extreme cases https://docs.oracle.com/javase/7/docs/api/java/io/IOError.html

According to Stackoverflow, it usually would be thrown from Console methods and it's much more typical to get an IOException.

Is AssertionError supposed to fail the workflow task? I'm not certain.

Anyways, I'm not sure why we need this special behavior for Errors but it seems like it should be relatively harmless to merge.

@Spikhalskiy
Copy link
Copy Markdown
Contributor Author

Spikhalskiy commented Jan 10, 2023

@bergundy Yeah, I forgot about this class. IOError is thrown if the file system is broken. Or some interface is completely inaccessible and will not be accessible. For example, your app is reading from a console, but the app has no more STDIN (you will get it with Ctrl + D I believe). Anyway, an Error means that the app can't do its job anymore. If a Worker can't do its job correctly, it should not be able to fail a Workflow or an Activity execution.

@bergundy
Copy link
Copy Markdown
Member

Might as well throw a critical failure from the worker in this case instead of failing a WFT.
This behavior might affect low latency applications negatively preventing workflows from completing in a timely fashion.

@Spikhalskiy
Copy link
Copy Markdown
Contributor Author

@bergundy Yes, this may be added, while it's tricky. Most java frameworks don't do it. The problem is that there is no reliable way to do anything after an Error, including shutting down a worker. There is also no easy standard way to notify user's code about it.
But we still should at least try to explicitly fail WFT, and actually specifically for low-latency applications usecase. Otherwise, the Server will have to wait until WFT timeout.

@bergundy
Copy link
Copy Markdown
Member

I agree not waiting for WFT timeout is better here but the server also throttles WFT retries with backoff so failing local activity and reporting to the workflow ASAP is the better outcome in (what I believe is) most latency sensitive cases.

@Spikhalskiy
Copy link
Copy Markdown
Contributor Author

Spikhalskiy commented Jan 12, 2023

@bergundy

  1. I'm not sure that even for most latency-sensitive usecases a failure is better than a delay. In some - sure, but it's not the rule. Some applications will be fine with registering a client timeout and leaving the workflow executing. Alternatively, see 3.
  2. After an Error, reliable reporting of Local Activity / Workflow Failure (sending of Workflow Task Completion) is not possible by Error spec in Java. In this PR I made sure Error always reliably leads to either a WFT timeout or an explicit WFT failure (which is quite consistent and leads to the same effects) for the Server and Workflow.
  3. To correctly address the situation you describe, a low latency app that prefers a failure over a wait should set scheduleToClose on the workflow. In case of Error, an activity didn't really fail. It wasn't able to execute because of the Worker's broken operational state. And a timeout is the real failure reason from the application point of view. Failure of a worker is an implementation detail of Temporal. Workflow scheduleToClose is also needed for a different reason for such a use-case: Imagine a WFT that is not picked up for any reason, for example, all workers are down.

Additionally addressing the backoff point, for low latency mode of operation we probably will have to reconsider WFT failure backoff. WFT failure in low latency mode should cause immediate scheduling if the user chooses so.

@bergundy
Copy link
Copy Markdown
Member

Let's discuss off PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Error in a local activity should fail the WorkflowTask instead of failing an activity execution

3 participants