cpu-seq: Legalize the identity transition#2064
Conversation
Currently, the `cpu-seq` API has the sequencer return an `IllegalTransition` error in the case where the requested power state is the same as the current power state. This is also the error variant returned when the requested state differs from the current state, but a transition between those two states cannot be performed. This is unfortunate, as the caller cannot distinguish between cases where the system *is* in the requested state and everything is basically fine, and cases where the caller asked the sequencer to do something that will never succeed. This commit changes the `cpu-seq` Idol API to return a `Transition` type from the `set_power_state` and `set_power_state_with_reason` IPCs that indicates whether or not a power state change occurred. Sequencer implementations are changed to return `Ok(Transition::NoChange)` instead of `Err(IllegalTransition)` when no state change occurred because the current and requested states are the same. I considered alternatively having the `IllegalTransition` error return the current state so that the caller can distinguish between illegal transitions that are totally disallowed and cases where no transition occurs. However, this requires changing the entire IPC from using `zerocopy` to `hubpack`, or adding an empty byte of padding to every *other* `SeqError` variant so that we can use the support for returning `Complex` error types with `zerocopy` I added in oxidecomputer/idolatry#59. That gets a bit trickier, as we also convert the `SeqError` code into a `u32` to send it upstack, so I figured keeping it as a C-like enum was better than adding logic to convert the error into a `u32` error code rather than just `as` casting it. Also, it felt a bit wrong for this to be an "error", in my opinion. Doing nothing because we were already in the requested state feels more like a different type of "success" to me... This is a first step towards fixing oxidecomputer/management-gateway-service#270. Fully fixing that will also require adding a way to indicate to MGS whether a state transition occurred or not, which also requires a change to the `gateway-sp-messages` crate.
The `mock-gimlet-seq-server` and `grapefruit-seq-server` had grown an amount of repeated logic between `set_state` and `set_state_with_reason` that I didn't love --- I`d prefer them both to just call `set_state_impl`. This commit changes them to just do that.
jgallagher
left a comment
There was a problem hiding this comment.
➕ to Matt's comments; LGTM otherwise.
|
|
||
| // Note that we don't use `Self::set_state_impl` here, as that will | ||
| // first attempt to get the current power state from `jefe`, and we | ||
| // haven't set it yet! | ||
| server.jefe.set_state(PowerState::A2 as u32); |
There was a problem hiding this comment.
Was this an underlying bug?
There was a problem hiding this comment.
No, it's just a consequence of the refactoring I did in 40a867b. Previously the mock sequencer implementations had a a set_state_impl function that just called jefe.set_state, and a validate_state_change function that checked the transition against the current state. I combined this into set_state_impl in 40a867b. I thought this was worth doing mostly becausethe need to determine whether or not a change is a no-op and only call the jefe.set_state IPC if it isn't a no-op (to accurately simulate the real sequencer's behavior WRT Jefe) made the IPC server's set_power_state and set_power_state_with_reason repeat a bit more logic and I thought centralizing it was better. On the other hand, having a set_state_impl that's called both in initialization and in the IPC methods was basically only abstracting over the cast from PowerState to u32, which seemed less important to have an abstraction for.
Honestly I think this might be explained a bit better by looking at that diff, rather than my rambly attempt to describe it.
Presently, the `SetPowerState` MGS request returns an `IllegalTransition` error when the system is already in the desired power state. This is generally not the desired behavior: the `IllegalTransition` error code represents both power state transitions that will *never* succeed, and cases where the SP was already in the desired power state and we didn't have to do anything (which is a success from the perspective of a maohority of callers). In the cases where this should be treated as an error, it's useful to be able to disambiguate between "you tried to request a transition that is never allowed" and "your request failed because it raced with another request", and in most other cases where callers just want to ensure that the system is in the desired power state, it should be treated as a success. See #270 for details. This change modifies the gateway-SP protocol to allow the SP to communicate more accurately about no-op power state transitions. `SpHandler`'s `set_power_state` method now returns a `PowerStateTransition` type that indicates whether or not a change occurred, and the `SpResponse` message enum has grown a new `PowerStateUnchanged` variant. We now construct either that variant or the `PowerStateSet` variant (née `SetPowerStateAck`, it's okay to rename variants already in the protocol as it won't change their encoding on the wire) based on whether or not the handler actually indicates that a change occurred. In `gateway-sp-comms`, we then re-construct the `PowerStateTransition` enum fro mthe two possible `SpResposne` messages. Encoding the no-op case as a new `SpResponse` message was chosen because it's the most backwards-compatible way to do this: in previous versions, the `SpResponse::SetPowerStateAck` message would only be sent if the state transitioned, and an error would be sent otherwise. Adding a new variant for the "no op" case preserves the semantics of the existing message. Alternatively, we could have added a new `SpResponse` variant that holds an additional enum or bool, and deprecate the `SetPowerStateAck` message, but that felt unfortunate, as it left behind a deprecated enum variant that would still need to be handled until it was eventually removed. Also, with two separate variants, we can encode both messages as a single byte instead of two. On the SP side, oxidecomputer/hubris#2064 lays the groundwork for this by changing the sequencer IPC interface to also indicate no-op power state transitions from `IllegalTransition` errors. An additional change will be necessary to integrate that with the new `gateway-messages` messages in this branch, if the Hubris PR merges first.
mkeeter
left a comment
There was a problem hiding this comment.
This looks reasonable to me!
I also tested humility dashboard and humility hiffy -f Sequencer.set_state, and they all work out of the box:
➜ hubris jj:(nsn (empty)) h hiffy -c Sequencer.set_state -a state=A0
humility: attached via ST-Link V3
Sequencer.set_state() => NoChange
➜ hubris jj:(nsn (empty)) h hiffy -c Sequencer.set_state -a state=A2
humility: attached via ST-Link V3
Sequencer.set_state() => Done
➜ hubris jj:(nsn (empty)) h hiffy -c Sequencer.set_state -a state=A2
humility: attached via ST-Link V3
Sequencer.set_state() => NoChange
➜ hubris jj:(nsn (empty)) h hiffy -c Sequencer.set_state -a state=A1
humility: attached via ST-Link V3
Sequencer.set_state() => Err(IllegalTransition)Presently, the `SetPowerState` MGS request returns an `IllegalTransition` error when the system is already in the desired power state. This is generally not the desired behavior: the `IllegalTransition` error code represents both power state transitions that will *never* succeed, and cases where the SP was already in the desired power state and we didn't have to do anything (which is a success from the perspective of a majority of callers). In the cases where this should be treated as an error, it's useful to be able to disambiguate between "you tried to request a transition that is never allowed" and "your request failed because it raced with another request", and in most other cases where callers just want to ensure that the system is in the desired power state, it should be treated as a success. See #270 for details. This change modifies the gateway-SP protocol to allow the SP to communicate more accurately about no-op power state transitions. `SpHandler`'s `set_power_state` method now returns a `PowerStateTransition` type that indicates whether or not a change occurred, and the `SpResponse` message enum has grown a new `PowerStateUnchanged` variant. We now construct either that variant or the `PowerStateSet` variant (née `SetPowerStateAck`, it's okay to rename variants already in the protocol as it won't change their encoding on the wire) based on whether or not the handler actually indicates that a change occurred. In `gateway-sp-comms`, we then re-construct the `PowerStateTransition` enum fro mthe two possible `SpResposne` messages. Encoding the no-op case as a new `SpResponse` message was chosen because it's the most backwards-compatible way to do this: in previous versions, the `SpResponse::SetPowerStateAck` message would only be sent if the state transitioned, and an error would be sent otherwise. Adding a new variant for the "no op" case preserves the semantics of the existing message. Alternatively, we could have added a new `SpResponse` variant that holds an additional enum or bool, and deprecate the `SetPowerStateAck` message, but that felt unfortunate, as it left behind a deprecated enum variant that would still need to be handled until it was eventually removed. Also, with two separate variants, we can encode both messages as a single byte instead of two. On the SP side, oxidecomputer/hubris#2064 lays the groundwork for this by changing the sequencer IPC interface to also indicate no-op power state transitions from `IllegalTransition` errors.
|
🏳️⚧️ |
Basically, this commit glues together the changes from #2064 and oxidecomputer/management-gateway-service#390. The `SpHandler::set_power_state` implementation for compute sleds in the `control-plane-agent` task now report the outcome of power state transitions using the new `gateway_messages::PowerStateTransition` type added in oxidecomputer/management-gateway-service#390. This fixes oxidecomputer/management-gateway-service#270, and closes oxidecomputer/management-gateway-service#271. We'll need to update MGS proper (in the Omicron repo) to pick up the new `gateway-messages` changes, as well.
Basically, this commit glues together the changes from #2064 and oxidecomputer/management-gateway-service#390. The `SpHandler::set_power_state` implementation for compute sleds in the `control-plane-agent` task now report the outcome of power state transitions using the new `gateway_messages::PowerStateTransition` type added in oxidecomputer/management-gateway-service#390. This fixes oxidecomputer/management-gateway-service#270, and closes oxidecomputer/management-gateway-service#271. We'll need to update MGS proper (in the Omicron repo) to pick up the new `gateway-messages` changes, as well.
Basically, this commit glues together the changes from #2064 and oxidecomputer/management-gateway-service#390. The `SpHandler::set_power_state` implementation for compute sleds in the `control-plane-agent` task now report the outcome of power state transitions using the new `gateway_messages::PowerStateTransition` type added in oxidecomputer/management-gateway-service#390. This fixes oxidecomputer/management-gateway-service#270, and closes oxidecomputer/management-gateway-service#271. We'll need to update MGS proper (in the Omicron repo) to pick up the new `gateway-messages` changes, as well.
Basically, this commit glues together the changes from #2064 and oxidecomputer/management-gateway-service#390. The `SpHandler::set_power_state` implementation for compute sleds in the `control-plane-agent` task now report the outcome of power state transitions using the new `gateway_messages::PowerStateTransition` type added in oxidecomputer/management-gateway-service#390. This fixes oxidecomputer/management-gateway-service#270, and closes oxidecomputer/management-gateway-service#271. We'll need to update MGS proper (in the Omicron repo) to pick up the new `gateway-messages` changes, as well.
Basically, this commit glues together the changes from #2064 and oxidecomputer/management-gateway-service#390. The `SpHandler::set_power_state` implementation for compute sleds in the `control-plane-agent` task now report the outcome of power state transitions using the new `gateway_messages::PowerStateTransition` type added in oxidecomputer/management-gateway-service#390. This fixes oxidecomputer/management-gateway-service#270, and closes oxidecomputer/management-gateway-service#271. We'll need to update MGS proper (in the Omicron repo) to pick up the new `gateway-messages` changes, as well.
Basically, this commit glues together the changes from #2064 and oxidecomputer/management-gateway-service#390. The `SpHandler::set_power_state` implementation for compute sleds in the `control-plane-agent` task now report the outcome of power state transitions using the new `gateway_messages::PowerStateTransition` type added in oxidecomputer/management-gateway-service#390. This fixes oxidecomputer/management-gateway-service#270, and closes oxidecomputer/management-gateway-service#271. We'll need to update MGS proper (in the Omicron repo) to pick up the new `gateway-messages` changes, as well.
PR #2064 was _supposed_ to change the Gimlet and Cosmo CPU sequencer API so that requesting an idempotent power state transition (i.e. A0 -> A0) returns `Ok(Transition::Unchanged)` rather than `Err(SeqError::IllegalTransition)`. Unfortunately, it doesn't always do that in some cases. This is because the match arm in which we detect no-op power state transitions tests that the current and requested `PowerState`s are _equal_. But, unfortunately, the `PowerState` enum has more variants than just "A2" and "A0" --- it also represents _substates_ of A0 and A2, such as `A0PlusHP` and `A2PlusFans`. In particular, when a compute sled is fully up and running, it's actually *not* in `PowerState::A0`, it's in `PowerState::A0PlusHP`, because the NIC hotplug controller is enabled. The `PowerState::A0` and `PowerState::A0PlusHP` enum variants are not equal, so when a sled that's actually in `A0PlusHP` is told to go to `A0`, it doesn't match and falls through to the error case incorrectly. I'm pretty sure the reason we didn't realize this earlier is because, while I did test both A2->A2 and A0->A0 transitions when I was testing PR #2064, I would send the A0->A0 request more or less as soon as the system reached A0. I didn't wait for the host OS to come up before testing it, so the system was still in `PowerState::A0` and not `PowerState::A0PlusHP`. Whoopsie. This commit makes this operation idempotent in those cases by treating `A0PlusHP->A0` and `A2PlusFans->A2` transitions as idempotent successes rather than `IllegalTransition`s. I did *not* change the behavior in A0 substates that indicate a CPU reset condition (`A0Reset` and `A0Thermtrip`), as in those cases, we require an explicit transition back to A2 before the system will return to A0. Fixes #2271
PR #2064 was _supposed_ to change the Gimlet and Cosmo CPU sequencer API so that requesting an idempotent power state transition (i.e. A0 -> A0) returns `Ok(Transition::Unchanged)` rather than `Err(SeqError::IllegalTransition)`. Unfortunately, it doesn't always do that in some cases. This is because the match arm in which we detect no-op power state transitions tests that the current and requested `PowerState`s are _equal_. But, unfortunately, the `PowerState` enum has more variants than just "A2" and "A0" --- it also represents _substates_ of A0 and A2, such as `A0PlusHP` and `A2PlusFans`. In particular, when a compute sled is fully up and running, it's actually *not* in `PowerState::A0`, it's in `PowerState::A0PlusHP`, because the NIC hotplug controller is enabled. The `PowerState::A0` and `PowerState::A0PlusHP` enum variants are not equal, so when a sled that's actually in `A0PlusHP` is told to go to `A0`, it doesn't match and falls through to the error case incorrectly. I'm pretty sure the reason we didn't realize this earlier is because, while I did test both A2->A2 and A0->A0 transitions when I was testing PR #2064, I would send the A0->A0 request more or less as soon as the system reached A0. I didn't wait for the host OS to come up before testing it, so the system was still in `PowerState::A0` and not `PowerState::A0PlusHP`. Whoopsie. This commit makes this operation idempotent in those cases by treating `A0PlusHP->A0` and `A2PlusFans->A2` transitions as idempotent successes rather than `IllegalTransition`s. I did *not* change the behavior in A0 substates that indicate a CPU reset condition (`A0Reset` and `A0Thermtrip`), as in those cases, we require an explicit transition back to A2 before the system will return to A0. Fixes #2271
Currently, the
cpu-seqAPI has the sequencer return anIllegalTransitionerror in the case where the requested power state isthe same as the current power state. This is also the error variant
returned when the requested state differs from the current state, but a
transition between those two states cannot be performed. This is
unfortunate, as the caller cannot distinguish between cases where the
system is in the requested state and everything is basically fine, and
cases where the caller asked the sequencer to do something that will
never succeed.
This commit changes the
cpu-seqIdol API to return aTransitiontypefrom the
set_power_stateandset_power_state_with_reasonIPCs thatindicates whether or not a power state change occurred. Sequencer
implementations are changed to return
Ok(Transition::NoChange)insteadof
Err(IllegalTransition)when no state change occurred because thecurrent and requested states are the same.
I considered alternatively having the
IllegalTransitionerror returnthe current state so that the caller can distinguish between illegal
transitions that are totally disallowed and cases where no transition
occurs. However, this requires changing the entire IPC from using
zerocopytohubpack, or adding an empty byte of padding to everyother
SeqErrorvariant so that we can use the support for returningComplexerror types withzerocopyI added inoxidecomputer/idolatry#59. That gets a bit trickier, as we also convert
the
SeqErrorcode into au32to send it upstack, so I figuredkeeping it as a C-like enum was better than adding logic to convert the
error into a
u32error code rather than justascasting it. Also,it felt a bit wrong for this to be an "error", in my opinion. Doing
nothing because we were already in the requested state feels more like a
different type of "success" to me...
This is a first step towards fixing
oxidecomputer/management-gateway-service#270. Fully fixing that will
also require adding a way to indicate to MGS whether a state transition
occurred or not, which also requires a change to the
gateway-sp-messagescrate.