fix: harden supervisor recovery and stuck scan #207

Open
liobrasil wants to merge 14 commits into holyfuchs/supervisor-fix from lionel/fix-supervisor-scan-recovery

Conversation

@liobrasil
Contributor

@liobrasil liobrasil commented Mar 11, 2026

Summary

  • add the final supervisor regression coverage for duplicate recoveries and mixed recurring/non-recurring scan sets
  • fix duplicate recovery churn by treating recently executed recurring transactions as active for a short optimistic-execution grace period
  • restrict stuck-scan ordering to recurring participants and lazily prune stale non-recurring entries during candidate walks
  • align scheduler docs/comments with the recurring-only scan behavior

Scope Note

  • the core supervisor fix is in FlowYieldVaultsAutoBalancers, FlowYieldVaultsSchedulerRegistry, and the supervisor regression tests
  • the remaining docs/comment updates are alignment-only and do not change supervisor behavior

Verification

  • flow test cadence/tests/scheduler_mixed_population_regression_test.cdc
  • flow test cadence/tests/scheduled_supervisor_test.cdc

return true
}

if status == FlowTransactionScheduler.Status.Executed
Member

This is the same issue we faced in onflow/FlowYieldVaultsEVM#70.
The problem with this fix is that if the transaction panics, lastRebalanceTimestamp is never updated. The rebalancer then stays permanently stuck: its status is Executed, but lastRebalanceTimestamp never advances, so the transaction keeps looking active.
You might want to consider a grace-period-based fix.
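
A grace-period check along these lines could look roughly like the sketch below. It is only an illustration under stated assumptions: the function name `isActive`, its parameters, the 30-second window, and the `>=` comparison against the scheduled time are all hypothetical and are not taken from the PR's code.

```cadence
// Hypothetical helper for illustration only; names, parameters, and the
// comparison rules are assumptions, not the PR's actual implementation.
access(all) fun isActive(
    isScheduled: Bool,
    isExecuted: Bool,
    scheduledTimestamp: UFix64,
    lastRebalanceTimestamp: UFix64,
    now: UFix64,
    gracePeriod: UFix64
): Bool {
    // A transaction that is still scheduled is always active.
    if isScheduled {
        return true
    }
    // Any status other than Scheduled or Executed is not active.
    if !isExecuted {
        return false
    }
    // The handler finished: the rebalance timestamp reached the scheduled time.
    if lastRebalanceTimestamp >= scheduledTimestamp {
        return false
    }
    // Executed but not yet reflected in the timestamp: treat as in flight
    // only while the grace window is open. Written as an addition so the
    // UFix64 arithmetic cannot underflow when `now` is before the scheduled time.
    return now <= scheduledTimestamp + gracePeriod
}

access(all) fun main(): [Bool] {
    let grace: UFix64 = 30.0
    return [
        // Inside the window: still counted as active.
        isActive(isScheduled: false, isExecuted: true, scheduledTimestamp: 100.0, lastRebalanceTimestamp: 40.0, now: 110.0, gracePeriod: grace),
        // Window expired and timestamp never advanced (e.g. the handler panicked): recoverable.
        isActive(isScheduled: false, isExecuted: true, scheduledTimestamp: 100.0, lastRebalanceTimestamp: 40.0, now: 200.0, gracePeriod: grace)
    ]
}
```

With a bounded window, a handler that panics can only delay recovery by the grace period instead of blocking it indefinitely.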

Comment on lines +87 to +95
/// A transaction is considered active when it is:
/// - still `Scheduled`, or
/// - already marked `Executed` by FlowTransactionScheduler, but the AutoBalancer has not
/// yet advanced its last rebalance timestamp past that transaction's scheduled time.
///
/// The second case matters because FlowTransactionScheduler flips status to `Executed`
/// before the handler actually runs. Without treating that in-flight window as active,
/// the Supervisor can falsely classify healthy vaults as stuck and recover them twice.
///
Member

Could you explain this?
I am not sure I understand.
The status can be Executed even though it is not executed?

Contributor Author

Yes, Executed can be set before the handler actually runs. FlowTransactionScheduler marks a transaction as Executed optimistically, before the handler logic has finished running. The contract says this directly here:

https://github.com/onflow/flow-core-contracts/blob/27e0eb625ebe056c78cf42d6feaa6ce00a8e06c9/contracts/FlowTransactionScheduler.cdc#L1169-L1186
https://github.com/onflow/flow-core-contracts/blob/27e0eb625ebe056c78cf42d6feaa6ce00a8e06c9/contracts/FlowTransactionScheduler.cdc#L250-L264

yieldVaultID: uniqueID.id,
handlerCap: handlerCap,
scheduleCap: scheduleCap,
participatesInStuckScan: recurringConfig != nil
Member

The contract describes itself as:
A registry of all yield vault IDs that participate in scheduled rebalancing

Would we ever have a case where that does not hold?
To me the better approach seems to be ensuring that SchedulerRegistry only contains vaults that are actually being scheduled, rather than also adding the ones that aren't.

Contributor Author

Yes, we could do that. It would make the semantics cleaner: SchedulerRegistry would contain only vaults that are currently recurring/scheduled.

I kept this PR narrower because that would be a broader refactor. Today registration follows the vault lifecycle, not the recurring-config lifecycle, so changing the global registry to recurring-only would mean adding/removing entries whenever recurring config is enabled/disabled, and updating the related admin/recovery flows to match.

So I agree your approach is valid, but I’d treat it as a separate design change. In this PR I only made the stuck-scan ordering recurring-only. I’ll update the comments to make that distinction explicit.
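
To make the distinction concrete, the recurring-only walk with lazy pruning can be pictured roughly as in the sketch below. This is a simplified in-memory model under assumptions: `WalkResult`, `walkCandidates`, the array-based scan order, and the `isRecurring` lookup are hypothetical names for illustration and do not mirror the registry's real storage layout.

```cadence
// Hypothetical model for illustration only; not the registry's actual code.
access(all) struct WalkResult {
    // Recurring vault IDs selected for the stuck check in this walk.
    access(all) let candidates: [UInt64]
    // Scan order after dropping the stale entries that this walk visited.
    access(all) let prunedOrder: [UInt64]

    init(candidates: [UInt64], prunedOrder: [UInt64]) {
        self.candidates = candidates
        self.prunedOrder = prunedOrder
    }
}

access(all) fun walkCandidates(
    scanOrder: [UInt64],
    isRecurring: {UInt64: Bool},
    limit: Int
): WalkResult {
    var candidates: [UInt64] = []
    var kept: [UInt64] = []
    for id in scanOrder {
        if candidates.length >= limit {
            // Not visited in this walk: keep the entry untouched, stale or not.
            kept.append(id)
            continue
        }
        // IDs missing from the lookup are treated as non-recurring.
        let recurring = isRecurring[id] ?? false
        if recurring {
            candidates.append(id)
            kept.append(id)
        }
        // A visited non-recurring entry is stale and is simply not kept.
    }
    return WalkResult(candidates: candidates, prunedOrder: kept)
}

access(all) fun main(): WalkResult {
    let order: [UInt64] = [1, 2, 3, 4, 5]
    let recurring: {UInt64: Bool} = {1: true, 2: false, 3: true, 4: false, 5: true}
    // Selects vaults 1 and 3, prunes 2, and leaves 4 for a later walk.
    return walkCandidates(scanOrder: order, isRecurring: recurring, limit: 2)
}
```

In this model the global registry can still hold non-recurring vaults; only the scan ordering drops them, and only as the walk reaches them.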

Contributor Author

Check this commit: 9284c7e

@liobrasil liobrasil force-pushed the lionel/fix-supervisor-scan-recovery branch from 67c0e5e to eab7ad0 March 19, 2026 16:28
@liobrasil liobrasil force-pushed the lionel/fix-supervisor-scan-recovery branch from eab7ad0 to 999cd1d March 19, 2026 18:22
@liobrasil liobrasil requested review from a team and nvdtf March 19, 2026 21:03
3 participants