fix: harden supervisor recovery and stuck scan#207
fix: harden supervisor recovery and stuck scan#207liobrasil wants to merge 14 commits intoholyfuchs/supervisor-fixfrom
Conversation
| return true | ||
| } | ||
|
|
||
| if status == FlowTransactionScheduler.Status.Executed |
There was a problem hiding this comment.
This is the same issue we faced in onflow/FlowYieldVaultsEVM#70
The problem with the fix is that if the transaction panics, lastRebalanceTimestamp is not updated. This makes the rebalancer permanantely stuck because it's Executed and the lastRebalanceTimestamp was never updated.
You might want to consider a grace period based fix.
| /// A transaction is considered active when it is: | ||
| /// - still `Scheduled`, or | ||
| /// - already marked `Executed` by FlowTransactionScheduler, but the AutoBalancer has not | ||
| /// yet advanced its last rebalance timestamp past that transaction's scheduled time. | ||
| /// | ||
| /// The second case matters because FlowTransactionScheduler flips status to `Executed` | ||
| /// before the handler actually runs. Without treating that in-flight window as active, | ||
| /// the Supervisor can falsely classify healthy vaults as stuck and recover them twice. | ||
| /// |
There was a problem hiding this comment.
Could you explain this?
I am not sure I understand.
The status can be Executed even though it is not executed?
There was a problem hiding this comment.
yes, Executed can be set before the handler actually runs. In FlowTransactionScheduler, the scheduler marks a tx as Executed optimistically before the handler logic has actually finished running. The contract says this directly here:
https://github.com/onflow/flow-core-contracts/blob/27e0eb625ebe056c78cf42d6feaa6ce00a8e06c9/contracts/FlowTransactionScheduler.cdc#L1169-L1186
https://github.com/onflow/flow-core-contracts/blob/27e0eb625ebe056c78cf42d6feaa6ce00a8e06c9/contracts/FlowTransactionScheduler.cdc#L250-L264
| yieldVaultID: uniqueID.id, | ||
| handlerCap: handlerCap, | ||
| scheduleCap: scheduleCap, | ||
| participatesInStuckScan: recurringConfig != nil |
There was a problem hiding this comment.
The contract mentions its:
A registry of all yield vault IDs that participate in scheduled rebalancing
Would we ever have an instance where this is not the case?
To me it seems the better approach to ensure that SchedulerRegistry only has Vaults that are getting scheduled and we shouldn't add the ones which aren't.
There was a problem hiding this comment.
Yes, we could do that. It would make the semantics cleaner: SchedulerRegistry would contain only vaults that are currently recurring/scheduled.
I kept this PR narrower because that would be a broader refactor. Today registration follows the vault lifecycle, not the recurring-config lifecycle, so changing the global registry to recurring-only would mean adding/removing entries whenever recurring config is enabled/disabled, and updating the related admin/recovery flows to match.
So I agree your approach is valid, but I’d treat it as a separate design change. In this PR I only made the stuck-scan ordering recurring-only. I’ll update the comments to make that distinction explicit.
67c0e5e to
eab7ad0
Compare
eab7ad0 to
999cd1d
Compare
Summary
Scope Note
FlowYieldVaultsAutoBalancers,FlowYieldVaultsSchedulerRegistry, and the supervisor regression testsVerification