cpu-seq: Legalize the identity transition by hawkw · Pull Request #2064 · oxidecomputer/hubris

hawkw · 2025-05-11T17:53:34Z

Currently, the `cpu-seq` API has the sequencer return an `IllegalTransition` error in the case where the requested power state is the same as the current power state. This is also the error variant returned when the requested state differs from the current state, but a transition between those two states cannot be performed. This is unfortunate, as the caller cannot distinguish between cases where the system *is* in the requested state and everything is basically fine, and cases where the caller asked the sequencer to do something that will never succeed.

This commit changes the `cpu-seq` Idol API to return a `Transition` type from the `set_power_state` and `set_power_state_with_reason` IPCs that indicates whether or not a power state change occurred. Sequencer implementations are changed to return `Ok(Transition::NoChange)` instead of `Err(IllegalTransition)` when no state change occurred because the current and requested states are the same.

I considered alternatively having the `IllegalTransition` error return the current state so that the caller can distinguish between illegal transitions that are totally disallowed and cases where no transition occurs. However, this requires changing the entire IPC from using `zerocopy` to `hubpack`, or adding an empty byte of padding to every *other* `SeqError` variant so that we can use the support for returning `Complex` error types with `zerocopy` I added in oxidecomputer/idolatry#59. That gets a bit trickier, as we also convert the `SeqError` code into a `u32` to send it upstack, so I figured keeping it as a C-like enum was better than adding logic to convert the error into a `u32` error code rather than just `as`-casting it. Also, it felt a bit wrong for this to be an "error", in my opinion. Doing nothing because we were already in the requested state feels more like a different type of "success" to me...

This is a first step towards fixing oxidecomputer/management-gateway-service#270. Fully fixing that will also require adding a way to indicate to MGS whether a state transition occurred or not, which also requires a change to the `gateway-sp-messages` crate.

The `mock-gimlet-seq-server` and `grapefruit-seq-server` had grown an amount of repeated logic between `set_state` and `set_state_with_reason` that I didn't love; I'd prefer them both to just call `set_state_impl`. This commit changes them to just do that.
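To make the API change concrete, here is a minimal sketch of the shape described above. It is not the code from this PR: the variant names `Done` and `NoChange` follow the names used in this thread at review time, while `check_transition`, `error_code`, and the reduced `PowerState` enum with only the A2/A0 transitions modeled are assumptions made for illustration.

```rust
/// Reduced set of power states for this sketch (the real enum has more).
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub enum PowerState {
    A2,
    A1,
    A0,
}

/// Successful outcomes of a set-power-state request.
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub enum Transition {
    /// The sequencer moved to the requested state.
    Done,
    /// The sequencer was already in the requested state.
    NoChange,
}

/// Kept as a C-like (fieldless) enum so it can still be `as`-cast to `u32`.
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub enum SeqError {
    IllegalTransition,
}

/// Hypothetical helper: identity requests succeed with `NoChange`, A2 <-> A0
/// are the only state changes modeled here, and anything else is illegal.
pub fn check_transition(
    current: PowerState,
    requested: PowerState,
) -> Result<Transition, SeqError> {
    match (current, requested) {
        (cur, req) if cur == req => Ok(Transition::NoChange),
        (PowerState::A2, PowerState::A0) | (PowerState::A0, PowerState::A2) => {
            Ok(Transition::Done)
        }
        _ => Err(SeqError::IllegalTransition),
    }
}

/// Because `SeqError` carries no data, sending it upstack stays a plain cast.
pub fn error_code(err: SeqError) -> u32 {
    err as u32
}
```

A caller that only wants the system to end up in a given state can treat both `Ok` variants as success, while a caller that cares about races can match on `Transition::NoChange` explicitly.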

hubris.testout.0

task/host-sp-comms/src/main.rs

jgallagher

➕ to Matt's comments; LGTM otherwise.

labbott · 2025-05-12T16:50:42Z

drv/grapefruit-seq-server/src/main.rs

+
+        // Note that we don't use `Self::set_state_impl` here, as that will
+        // first attempt to get the current power state from `jefe`, and we
+        // haven't set it yet!
+        server.jefe.set_state(PowerState::A2 as u32);


Was this an underlying bug?

No, it's just a consequence of the refactoring I did in 40a867b. Previously the mock sequencer implementations had a `set_state_impl` function that just called `jefe.set_state`, and a `validate_state_change` function that checked the transition against the current state. I combined these into `set_state_impl` in 40a867b. I thought this was worth doing mostly because the need to determine whether or not a change is a no-op, and to only call the `jefe.set_state` IPC if it isn't a no-op (to accurately simulate the real sequencer's behavior with respect to Jefe), made the IPC server's `set_power_state` and `set_power_state_with_reason` repeat a bit more logic, and I thought centralizing it was better. On the other hand, having a `set_state_impl` that's called both in initialization and in the IPC methods was basically only abstracting over the cast from `PowerState` to `u32`, which seemed less important to have an abstraction for.

Honestly I think this might be explained a bit better by looking at that diff, rather than my rambly attempt to describe it.
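For readers without the diff at hand, the refactoring described in this reply can be sketched roughly as follows. This is an illustrative model, not the mock server's actual code: `FakeJefe`, `MockSeq`, `Reason`, and the transition table are hypothetical stand-ins, and only the overall shape (direct `jefe.set_state` at initialization, one shared `set_state_impl` for both IPC entry points) comes from the discussion above.

```rust
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub enum PowerState { A2, A1, A0 }

#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub enum Transition { Done, NoChange }

#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub enum SeqError { IllegalTransition }

/// Hypothetical placeholder for the reason passed to the `_with_reason` IPC.
#[derive(Copy, Clone, Debug)]
pub enum Reason { Other }

/// Stand-in for the Jefe client: stores the last state and counts IPC calls.
#[derive(Default)]
pub struct FakeJefe {
    pub state: Option<u32>,
    pub set_state_calls: usize,
}

impl FakeJefe {
    pub fn set_state(&mut self, state: u32) {
        self.state = Some(state);
        self.set_state_calls += 1;
    }
    pub fn get_state(&self) -> Option<u32> {
        self.state
    }
}

pub struct MockSeq { pub jefe: FakeJefe }

impl MockSeq {
    pub fn new() -> Self {
        let mut server = MockSeq { jefe: FakeJefe::default() };
        // Initialization bypasses `set_state_impl`: that helper reads the
        // current state from Jefe first, and nothing has been set yet.
        server.jefe.set_state(PowerState::A2 as u32);
        server
    }

    /// Shared logic: validate against the current state and only call the
    /// `jefe.set_state` IPC when the request is not a no-op.
    fn set_state_impl(&mut self, requested: PowerState) -> Result<Transition, SeqError> {
        let (a2, a0) = (PowerState::A2 as u32, PowerState::A0 as u32);
        match (self.jefe.get_state(), requested as u32) {
            (Some(cur), req) if cur == req => Ok(Transition::NoChange),
            (Some(cur), req) if (cur == a2 && req == a0) || (cur == a0 && req == a2) => {
                self.jefe.set_state(req);
                Ok(Transition::Done)
            }
            _ => Err(SeqError::IllegalTransition),
        }
    }

    pub fn set_state(&mut self, state: PowerState) -> Result<Transition, SeqError> {
        self.set_state_impl(state)
    }

    pub fn set_state_with_reason(
        &mut self,
        state: PowerState,
        _reason: Reason,
    ) -> Result<Transition, SeqError> {
        self.set_state_impl(state)
    }
}
```

In this model a repeated request for the current state leaves the Jefe call count untouched, which is the behavior the mock is meant to simulate.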

drv/cpu-seq-api/src/lib.rs

Presently, the `SetPowerState` MGS request returns an `IllegalTransition` error when the system is already in the desired power state. This is generally not the desired behavior: the `IllegalTransition` error code represents both power state transitions that will *never* succeed, and cases where the SP was already in the desired power state and we didn't have to do anything (which is a success from the perspective of a majority of callers). In the cases where this should be treated as an error, it's useful to be able to disambiguate between "you tried to request a transition that is never allowed" and "your request failed because it raced with another request", and in most other cases where callers just want to ensure that the system is in the desired power state, it should be treated as a success. See #270 for details.

This change modifies the gateway-SP protocol to allow the SP to communicate more accurately about no-op power state transitions. `SpHandler`'s `set_power_state` method now returns a `PowerStateTransition` type that indicates whether or not a change occurred, and the `SpResponse` message enum has grown a new `PowerStateUnchanged` variant. We now construct either that variant or the `PowerStateSet` variant (née `SetPowerStateAck`; it's okay to rename variants already in the protocol, as it won't change their encoding on the wire) based on whether or not the handler actually indicates that a change occurred. In `gateway-sp-comms`, we then re-construct the `PowerStateTransition` enum from the two possible `SpResponse` messages.

Encoding the no-op case as a new `SpResponse` message was chosen because it's the most backwards-compatible way to do this: in previous versions, the `SpResponse::SetPowerStateAck` message would only be sent if the state transitioned, and an error would be sent otherwise. Adding a new variant for the "no op" case preserves the semantics of the existing message. Alternatively, we could have added a new `SpResponse` variant that holds an additional enum or bool, and deprecated the `SetPowerStateAck` message, but that felt unfortunate, as it would leave behind a deprecated enum variant that would still need to be handled until it was eventually removed. Also, with two separate variants, we can encode both messages as a single byte instead of two.

On the SP side, oxidecomputer/hubris#2064 lays the groundwork for this by changing the sequencer IPC interface to also distinguish no-op power state transitions from `IllegalTransition` errors. An additional change will be necessary to integrate that with the new `gateway-messages` messages in this branch, if the Hubris PR merges first.
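The two-variant encoding described here can be sketched as below. This is a simplified model rather than the `gateway-messages` definitions: the real `SpResponse` has many more variants, the `Changed`/`Unchanged` variant names on `PowerStateTransition` and the `Error` placeholder are assumptions, and `response_for` and `transition_from` are hypothetical helper names.

```rust
/// Outcome reported by the handler (variant names assumed for illustration).
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub enum PowerStateTransition {
    Changed,
    Unchanged,
}

/// Heavily reduced stand-in for the protocol's response enum.
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub enum SpResponse {
    /// Formerly `SetPowerStateAck`: still means "the state transitioned".
    PowerStateSet,
    /// New variant: the SP was already in the requested state.
    PowerStateUnchanged,
    /// Placeholder for every other response in this sketch.
    Error,
}

/// SP side: pick the response variant from the handler's outcome.
pub fn response_for(transition: PowerStateTransition) -> SpResponse {
    match transition {
        PowerStateTransition::Changed => SpResponse::PowerStateSet,
        PowerStateTransition::Unchanged => SpResponse::PowerStateUnchanged,
    }
}

/// Client side: rebuild the outcome from the two possible responses.
/// Any other response is not a power-state acknowledgement.
pub fn transition_from(response: SpResponse) -> Option<PowerStateTransition> {
    match response {
        SpResponse::PowerStateSet => Some(PowerStateTransition::Changed),
        SpResponse::PowerStateUnchanged => Some(PowerStateTransition::Unchanged),
        SpResponse::Error => None,
    }
}
```

Because the outcome is carried entirely by which variant is sent, neither message needs a payload, which is the single-byte property mentioned above.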

mkeeter

This looks reasonable to me!

I also tested `humility dashboard` and `humility hiffy -f Sequencer.set_state`, and both work out of the box:

➜  hubris jj:(nsn (empty)) h hiffy -c Sequencer.set_state -a state=A0
humility: attached via ST-Link V3
Sequencer.set_state() => NoChange
➜  hubris jj:(nsn (empty)) h hiffy -c Sequencer.set_state -a state=A2
humility: attached via ST-Link V3
Sequencer.set_state() => Done
➜  hubris jj:(nsn (empty)) h hiffy -c Sequencer.set_state -a state=A2
humility: attached via ST-Link V3
Sequencer.set_state() => NoChange
➜  hubris jj:(nsn (empty)) h hiffy -c Sequencer.set_state -a state=A1
humility: attached via ST-Link V3
Sequencer.set_state() => Err(IllegalTransition)

azaleacolburn · 2025-05-12T20:48:02Z

🏳️‍⚧️

Basically, this commit glues together the changes from #2064 and oxidecomputer/management-gateway-service#390. The `SpHandler::set_power_state` implementation for compute sleds in the `control-plane-agent` task now reports the outcome of power state transitions using the new `gateway_messages::PowerStateTransition` type added in oxidecomputer/management-gateway-service#390. This fixes oxidecomputer/management-gateway-service#270, and closes oxidecomputer/management-gateway-service#271. We'll need to update MGS proper (in the Omicron repo) to pick up the new `gateway-messages` changes, as well.

PR #2064 was _supposed_ to change the Gimlet and Cosmo CPU sequencer API so that requesting an idempotent power state transition (i.e. A0 -> A0) returns `Ok(Transition::Unchanged)` rather than `Err(SeqError::IllegalTransition)`. Unfortunately, it doesn't do that in all cases. This is because the match arm in which we detect no-op power state transitions tests that the current and requested `PowerState`s are _equal_. But the `PowerState` enum has more variants than just "A2" and "A0": it also represents _substates_ of A0 and A2, such as `A0PlusHP` and `A2PlusFans`. In particular, when a compute sled is fully up and running, it's actually *not* in `PowerState::A0`, it's in `PowerState::A0PlusHP`, because the NIC hotplug controller is enabled. The `PowerState::A0` and `PowerState::A0PlusHP` enum variants are not equal, so when a sled that's actually in `A0PlusHP` is told to go to `A0`, it doesn't match and incorrectly falls through to the error case.

I'm pretty sure the reason we didn't realize this earlier is that, while I did test both A2->A2 and A0->A0 transitions when I was testing PR #2064, I would send the A0->A0 request more or less as soon as the system reached A0. I didn't wait for the host OS to come up before testing it, so the system was still in `PowerState::A0` and not `PowerState::A0PlusHP`. Whoopsie.

This commit makes this operation idempotent in those cases by treating `A0PlusHP->A0` and `A2PlusFans->A2` transitions as idempotent successes rather than `IllegalTransition`s. I did *not* change the behavior in A0 substates that indicate a CPU reset condition (`A0Reset` and `A0Thermtrip`), as in those cases, we require an explicit transition back to A2 before the system will return to A0. Fixes #2271
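The substate handling described above might look roughly like the following. The `PowerState` variant names and `Transition::Unchanged` come from the description; the `Changed` variant name, the `check_transition` helper, and the exact set of arms that perform a real state change are assumptions for illustration, not the sequencer's actual match.

```rust
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub enum PowerState {
    A2,
    A2PlusFans,
    A0,
    A0PlusHP,
    A0Thermtrip,
    A0Reset,
}

#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub enum Transition {
    Changed,
    Unchanged,
}

#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub enum SeqError {
    IllegalTransition,
}

/// Hypothetical transition check that treats "healthy" substates as already
/// being in their parent state, instead of comparing states for equality.
pub fn check_transition(
    current: PowerState,
    requested: PowerState,
) -> Result<Transition, SeqError> {
    use PowerState::*;
    match (current, requested) {
        // Idempotent requests: exact match, or a healthy substate of the
        // requested state (hotplug enabled in A0, fans running in A2).
        (A2 | A2PlusFans, A2) | (A0 | A0PlusHP, A0) => Ok(Transition::Unchanged),
        // Real state changes (this set of arms is an assumption).
        (A2 | A2PlusFans, A0) => Ok(Transition::Changed),
        (A0 | A0PlusHP | A0Thermtrip | A0Reset, A2) => Ok(Transition::Changed),
        // Everything else is illegal, including `A0Reset -> A0` and
        // `A0Thermtrip -> A0`, which need an explicit trip through A2 first.
        _ => Err(SeqError::IllegalTransition),
    }
}
```

With plain equality, `check_transition(A0PlusHP, A0)` would have fallen into the final arm; listing the substates in the idempotent arm is what restores the intended behavior.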

hawkw added 2 commits May 11, 2025 10:40

hawkw requested review from jgallagher and mkeeter May 11, 2025 17:53

hawkw self-assigned this May 11, 2025

hawkw marked this pull request as ready for review May 11, 2025 17:54

mkeeter reviewed May 12, 2025

hubris.testout.0 (outdated; resolved)

mkeeter reviewed May 12, 2025

task/host-sp-comms/src/main.rs (outdated; resolved)

jgallagher reviewed May 12, 2025

hawkw added 2 commits May 12, 2025 09:21

remove accidentally committed file

7c1df4d

Fix wrong trace

0cb13b9

hawkw requested review from jgallagher and mkeeter May 12, 2025 16:27

labbott reviewed May 12, 2025

that also wasn't meant to be committed

86d014f

mkeeter reviewed May 12, 2025

drv/cpu-seq-api/src/lib.rs (outdated; resolved)

hawkw mentioned this pull request May 12, 2025

Make no-op SetPowerState calls a success oxidecomputer/management-gateway-service#390

Merged

Actually we don't care about discriminants

6454bb2

mkeeter approved these changes May 12, 2025

Merge branch 'master' into eliza/illegal-transition

f3f71ac

hawkw enabled auto-merge (squash) May 12, 2025 20:11

s/NoChange/Unchanged/g

d1b21a3

hawkw merged commit 707a16d into master May 12, 2025
135 checks passed

hawkw deleted the eliza/illegal-transition branch May 12, 2025 20:26

hawkw mentioned this pull request May 12, 2025

Report power state transitions to MGS #2065

Merged

labbott mentioned this pull request May 14, 2025

gimlet-seq-server: Allow A0 -> A0 and A2 -> A2 transitions #1466

Closed

hawkw mentioned this pull request Jun 6, 2025

Make SingleSp::set_power_state idempotent oxidecomputer/management-gateway-service#271

Closed

hawkw mentioned this pull request Oct 15, 2025

CPU sequencer A0->A0 power state transitions are not always idempotent #2271

Closed

hawkw mentioned this pull request Oct 15, 2025

cpu_seq: fix set_power_state idempotency for substates #2272

Merged

hawkw merged 8 commits into master from eliza/illegal-transition

5 participants
