fix(registry): support canister migration from/to deleted subnet in registry#10470
fix(registry): support canister migration from/to deleted subnet in registry#10470mraszyk wants to merge 7 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR updates the registry canister’s canister-migration logic to tolerate migrations involving subnets that have been deleted from the registry/routing table, and adds regression tests to lock in the behavior.
Changes:
- Adjust routing-table mutation so migrating to a target subnet that is absent from the routing table does not leave any ranges assigned to that (deleted) target subnet.
- Relax payload validation to allow migrations to target subnets that no longer exist in the registry.
- Add/extend unit tests covering migration when the source subnet is “deleted” (unrouted gap) and when the target subnet does not exist.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| rs/registry/canister/src/mutations/routing_table.rs | Skips persisting assignments to a target subnet that has no existing ranges in the routing table. |
| rs/registry/canister/src/mutations/do_migrate_canisters.rs | Removes target-subnet existence validation and adds tests for deleted-source / deleted-target migration scenarios. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
This pull request changes code owned by the Governance team. Therefore, make sure that
you have considered the following (for Governance-owned code):
-
Update
unreleased_changelog.md(if there are behavior changes, even if they are
non-breaking). -
Are there BREAKING changes?
-
Is a data migration needed?
-
Security review?
How to Satisfy This Automatic Review
-
Go to the bottom of the pull request page.
-
Look for where it says this bot is requesting changes.
-
Click the three dots to the right.
-
Select "Dismiss review".
-
In the text entry box, respond to each of the numbered items in the previous
section, declare one of the following:
-
Done.
-
$REASON_WHY_NO_NEED. E.g. for
unreleased_changelog.md, "No
canister behavior changes.", or for item 2, "Existing APIs
behave as before.".
Brief Guide to "Externally Visible" Changes
"Externally visible behavior change" is very often due to some NEW canister API.
Changes to EXISTING APIs are more likely to be "breaking".
If these changes are breaking, make sure that clients know how to migrate, how to
maintain their continuity of operations.
If your changes are behind a feature flag, then, do NOT add entrie(s) to
unreleased_changelog.md in this PR! But rather, add entrie(s) later, in the PR
that enables these changes in production.
Reference(s)
For a more comprehensive checklist, see here.
GOVERNANCE_CHECKLIST_REMINDER_DEDUP
|
I believe this PR is not worth an entry in the changelog... |
|
✅ No security or compliance issues detected. Reviewed everything up to 183afe3. Security Overview
Detected Code Changes
|
| .collect::<Result<Vec<_>, _>>()?; | ||
|
|
||
| // Intentionally do not validate that the target subnet exists: the migration orchestrator | ||
| // may still attempt to migrate canisters to/from a subnet that has been deleted. |
There was a problem hiding this comment.
migrating to a deleted subnet is confusing (even migrating from a deleted canister is confusing, but less). Can you elaborate (in this comment)? Like, what is the eventual result?
| // Verify that migrate_canisters succeeds when the target subnet does not exist | ||
| // (has no key in subnet_list and no ranges in routing_table). | ||
| // | ||
| // We simulate this by using a subnet ID that is not in the routing table to begin with. |
There was a problem hiding this comment.
This describes the whole test, right? Such comments should appear BEFORE the test, just like in all other cases: when a (non-inline) comment talks about a thing, the comment appears before that thing (in this case, the test), not within it.
Looks like in the case of tests (e.g. test_basic_migrate_canisters), such comments are not triple slash, even though tests are technically functions.
| } | ||
|
|
||
| #[test] | ||
| fn test_migrate_canisters_succeeds_if_target_subnet_does_not_exist() { |
There was a problem hiding this comment.
(We discussed in this in Slack, but just so that this lives in the record of GitHub...)
It seems like such requests should be blocked, not result in canister deletion, because that would be a surprising data loss, the worst kind of surprise.
There was a problem hiding this comment.
Subnet deletion is (currently) a manual process and we don't expect to actually delete a subnet while a canister migration is ongoing (instead we expect to disable the canister migration API first and only delete a subnet once there's no ongoing canister migration). However, e.g., if a subnet is permanently stuck, it might be impossible to successfully complete an ongoing canister migration and in this case, we'd like to still be able to delete such a subnet. This leaves us with two options:
- adjust canister migration so that it can gracefully abort canister migration to a subnet that is stuck/deleted (that would very likely result in substantial engineering overhead and risk);
- adjust canister migration so that performing canister migration from/to a stuck/deleted subnet (transparently) behaves as if canister migration completed before the subnet got stuck/deleted.
The second option requires less engineering overhead which seems like a good trade-off given that it is rather unlikely that we'll need this in practice.
|
I would say this is a fix, not chore. Why: IIUC, people are already allowed to migrate when deletion is happening, but if they do that, the system ends up in an inconsistent state. Whereas, this change makes it so that such migrations do NOT result in inconsistent state. Again, chore is for if you don't really care if this gets into production in a timely manner. But it seems to me, that you do care. If you care, don't use chore, because we more or less ignore chore PRs when considering whether an upgrade should be done. Also, scope to registry. |
This PR supports canister migration from/to deleted subnet in the registry by:
The motivation for this change is to ensure that the state machine in the migration canister does not get stuck if a subnet is deleted while a canister migration is ongoing. In such a case, the canister migration is supposed to complete with the same outcome as if the canister migration completed before subnet deletion (which is infeasible to always ensure in practice though, e.g., if the deleted subnet was stuck).