Skip to content

cpu-seq: Legalize the identity transition#2064

Merged
hawkw merged 8 commits intomasterfrom
eliza/illegal-transition
May 12, 2025
Merged

cpu-seq: Legalize the identity transition#2064
hawkw merged 8 commits intomasterfrom
eliza/illegal-transition

Conversation

@hawkw
Copy link
Member

@hawkw hawkw commented May 11, 2025

Currently, the cpu-seq API has the sequencer return an
IllegalTransition error in the case where the requested power state is
the same as the current power state. This is also the error variant
returned when the requested state differs from the current state, but a
transition between those two states cannot be performed. This is
unfortunate, as the caller cannot distinguish between cases where the
system is in the requested state and everything is basically fine, and
cases where the caller asked the sequencer to do something that will
never succeed.

This commit changes the cpu-seq Idol API to return a Transition type
from the set_power_state and set_power_state_with_reason IPCs that
indicates whether or not a power state change occurred. Sequencer
implementations are changed to return Ok(Transition::NoChange) instead
of Err(IllegalTransition) when no state change occurred because the
current and requested states are the same.

I considered alternatively having the IllegalTransition error return
the current state so that the caller can distinguish between illegal
transitions that are totally disallowed and cases where no transition
occurs. However, this requires changing the entire IPC from using
zerocopy to hubpack, or adding an empty byte of padding to every
other SeqError variant so that we can use the support for returning
Complex error types with zerocopy I added in
oxidecomputer/idolatry#59. That gets a bit trickier, as we also convert
the SeqError code into a u32 to send it upstack, so I figured
keeping it as a C-like enum was better than adding logic to convert the
error into a u32 error code rather than just as casting it. Also,
it felt a bit wrong for this to be an "error", in my opinion. Doing
nothing because we were already in the requested state feels more like a
different type of "success" to me...

This is a first step towards fixing
oxidecomputer/management-gateway-service#270. Fully fixing that will
also require adding a way to indicate to MGS whether a state transition
occurred or not, which also requires a change to the
gateway-sp-messages crate.

hawkw added 2 commits May 11, 2025 10:40
Currently, the `cpu-seq` API has the sequencer return an
`IllegalTransition` error in the case where the requested power state is
the same as the current power state. This is also the error variant
returned when the requested state differs from the current state, but a
transition between those two states cannot be performed. This is
unfortunate, as the caller cannot distinguish between cases where the
system *is* in the requested state and everything is basically fine, and
cases where the caller asked the sequencer to do something that will
never succeed.

This commit changes the `cpu-seq` Idol API to return a `Transition` type
from the `set_power_state` and `set_power_state_with_reason` IPCs that
indicates whether or not a power state change occurred. Sequencer
implementations are changed to return `Ok(Transition::NoChange)` instead
of `Err(IllegalTransition)` when no state change occurred because the
current and requested states are the same.

I considered alternatively having the `IllegalTransition` error return
the current state so that the caller can distinguish between illegal
transitions that are totally disallowed and cases where no transition
occurs. However, this requires changing the entire IPC from using
`zerocopy` to `hubpack`, or adding an empty byte of padding to every
*other* `SeqError` variant so that we can use the support for returning
`Complex` error types with `zerocopy` I added in
oxidecomputer/idolatry#59. That gets a bit trickier, as we also convert
the `SeqError` code into a `u32` to send it upstack, so I figured
keeping it as a C-like enum was better than adding logic to convert the
error into a `u32` error code rather than just `as` casting it. Also,
it felt a bit wrong for this to be an "error", in my opinion. Doing
nothing because we were already in the requested state feels more like a
different type of "success" to me...

This is a first step towards fixing
oxidecomputer/management-gateway-service#270. Fully fixing that will
also require adding a way to indicate to MGS whether a state transition
occurred or not, which also requires a change to the
`gateway-sp-messages` crate.
The `mock-gimlet-seq-server` and `grapefruit-seq-server` had grown an
amount of repeated logic between `set_state` and `set_state_with_reason`
that I didn't love --- I`d prefer them both to just call
`set_state_impl`. This commit changes them to just do that.
@hawkw hawkw requested review from jgallagher and mkeeter May 11, 2025 17:53
@hawkw hawkw self-assigned this May 11, 2025
@hawkw hawkw marked this pull request as ready for review May 11, 2025 17:54
Copy link
Contributor

@jgallagher jgallagher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

➕ to Matt's comments; LGTM otherwise.

@hawkw hawkw requested review from jgallagher and mkeeter May 12, 2025 16:27
Comment on lines +99 to +103

// Note that we don't use `Self::set_state_impl` here, as that will
// first attempt to get the current power state from `jefe`, and we
// haven't set it yet!
server.jefe.set_state(PowerState::A2 as u32);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this an underlying bug?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it's just a consequence of the refactoring I did in 40a867b. Previously the mock sequencer implementations had a a set_state_impl function that just called jefe.set_state, and a validate_state_change function that checked the transition against the current state. I combined this into set_state_impl in 40a867b. I thought this was worth doing mostly becausethe need to determine whether or not a change is a no-op and only call the jefe.set_state IPC if it isn't a no-op (to accurately simulate the real sequencer's behavior WRT Jefe) made the IPC server's set_power_state and set_power_state_with_reason repeat a bit more logic and I thought centralizing it was better. On the other hand, having a set_state_impl that's called both in initialization and in the IPC methods was basically only abstracting over the cast from PowerState to u32, which seemed less important to have an abstraction for.

Honestly I think this might be explained a bit better by looking at that diff, rather than my rambly attempt to describe it.

hawkw added a commit to oxidecomputer/management-gateway-service that referenced this pull request May 12, 2025
Presently, the `SetPowerState` MGS request returns an
`IllegalTransition` error when the system is already in the desired
power state. This is generally not the desired behavior: the
`IllegalTransition` error code represents both power state transitions
that will *never* succeed, and cases where the SP was already in the
desired power state and we didn't have to do anything (which is a
success from the perspective of a maohority of callers). In the cases
where this should be treated as an error, it's useful to be able to
disambiguate between "you tried to request a transition that is never
allowed" and "your request failed because it raced with another
request", and in most other cases where callers just want to ensure that
the system is in the desired power state, it should be treated as a
success. See #270 for details.

This change modifies the gateway-SP protocol to allow the SP to
communicate more accurately about no-op power state transitions.
`SpHandler`'s `set_power_state` method now returns a
`PowerStateTransition` type that indicates whether or not a change
occurred, and the `SpResponse` message enum has grown a new
`PowerStateUnchanged` variant. We now construct either that variant or
the `PowerStateSet` variant (née `SetPowerStateAck`, it's okay to rename
variants already in the protocol as it won't change their encoding on
the wire) based on whether or not the handler actually indicates that a
change occurred. In `gateway-sp-comms`, we then re-construct the
`PowerStateTransition` enum fro mthe two possible `SpResposne` messages.

Encoding the no-op case as a new `SpResponse` message was chosen because
it's the most backwards-compatible way to do this: in previous versions,
the `SpResponse::SetPowerStateAck` message would only be sent if the
state transitioned, and an error would be sent otherwise. Adding a new
variant for the "no op" case preserves the semantics of the existing
message. Alternatively, we could have added a new `SpResponse` variant
that holds an additional enum or bool, and deprecate the
`SetPowerStateAck` message, but that felt unfortunate, as it left behind
a deprecated enum variant that would still need to be handled until it
was eventually removed. Also, with two separate variants, we can encode
both messages as a single byte instead of two.

On the SP side, oxidecomputer/hubris#2064 lays the groundwork for this
by changing the sequencer IPC interface to also indicate no-op
power state transitions from `IllegalTransition` errors. An additional
change will be necessary to integrate that with the new
`gateway-messages` messages in this branch, if the Hubris PR merges
first.
Copy link
Collaborator

@mkeeter mkeeter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks reasonable to me!

I also tested humility dashboard and humility hiffy -f Sequencer.set_state, and they all work out of the box:

hubris jj:(nsn (empty)) h hiffy -c Sequencer.set_state -a state=A0
humility: attached via ST-Link V3
Sequencer.set_state() => NoChangehubris jj:(nsn (empty)) h hiffy -c Sequencer.set_state -a state=A2
humility: attached via ST-Link V3
Sequencer.set_state() => Donehubris jj:(nsn (empty)) h hiffy -c Sequencer.set_state -a state=A2
humility: attached via ST-Link V3
Sequencer.set_state() => NoChangehubris jj:(nsn (empty)) h hiffy -c Sequencer.set_state -a state=A1
humility: attached via ST-Link V3
Sequencer.set_state() => Err(IllegalTransition)

hawkw added a commit to oxidecomputer/management-gateway-service that referenced this pull request May 12, 2025
Presently, the `SetPowerState` MGS request returns an
`IllegalTransition` error when the system is already in the desired
power state. This is generally not the desired behavior: the
`IllegalTransition` error code represents both power state transitions
that will *never* succeed, and cases where the SP was already in the
desired power state and we didn't have to do anything (which is a
success from the perspective of a majority of callers). In the cases
where this should be treated as an error, it's useful to be able to
disambiguate between "you tried to request a transition that is never
allowed" and "your request failed because it raced with another
request", and in most other cases where callers just want to ensure that
the system is in the desired power state, it should be treated as a
success. See #270 for details.

This change modifies the gateway-SP protocol to allow the SP to
communicate more accurately about no-op power state transitions.
`SpHandler`'s `set_power_state` method now returns a
`PowerStateTransition` type that indicates whether or not a change
occurred, and the `SpResponse` message enum has grown a new
`PowerStateUnchanged` variant. We now construct either that variant or
the `PowerStateSet` variant (née `SetPowerStateAck`, it's okay to rename
variants already in the protocol as it won't change their encoding on
the wire) based on whether or not the handler actually indicates that a
change occurred. In `gateway-sp-comms`, we then re-construct the
`PowerStateTransition` enum fro mthe two possible `SpResposne` messages.

Encoding the no-op case as a new `SpResponse` message was chosen because
it's the most backwards-compatible way to do this: in previous versions,
the `SpResponse::SetPowerStateAck` message would only be sent if the
state transitioned, and an error would be sent otherwise. Adding a new
variant for the "no op" case preserves the semantics of the existing
message. Alternatively, we could have added a new `SpResponse` variant
that holds an additional enum or bool, and deprecate the
`SetPowerStateAck` message, but that felt unfortunate, as it left behind
a deprecated enum variant that would still need to be handled until it
was eventually removed. Also, with two separate variants, we can encode
both messages as a single byte instead of two.

On the SP side, oxidecomputer/hubris#2064 lays the groundwork for this
by changing the sequencer IPC interface to also indicate no-op
power state transitions from `IllegalTransition` errors.
@hawkw hawkw enabled auto-merge (squash) May 12, 2025 20:11
@hawkw hawkw merged commit 707a16d into master May 12, 2025
135 checks passed
@hawkw hawkw deleted the eliza/illegal-transition branch May 12, 2025 20:26
@azaleacolburn
Copy link

🏳️‍⚧️

hawkw added a commit that referenced this pull request May 12, 2025
Basically, this commit glues together the changes from
#2064 and
oxidecomputer/management-gateway-service#390. The
`SpHandler::set_power_state` implementation for compute sleds in the
`control-plane-agent`
task now report the outcome of power state transitions using the new
`gateway_messages::PowerStateTransition` type added in oxidecomputer/management-gateway-service#390.

This fixes oxidecomputer/management-gateway-service#270, and closes
oxidecomputer/management-gateway-service#271.

We'll need to update MGS proper (in the Omicron repo) to pick up the new
`gateway-messages` changes, as well.
hawkw added a commit that referenced this pull request Jun 6, 2025
Basically, this commit glues together the changes from
#2064 and
oxidecomputer/management-gateway-service#390. The
`SpHandler::set_power_state` implementation for compute sleds in the
`control-plane-agent`
task now report the outcome of power state transitions using the new
`gateway_messages::PowerStateTransition` type added in oxidecomputer/management-gateway-service#390.

This fixes oxidecomputer/management-gateway-service#270, and closes
oxidecomputer/management-gateway-service#271.

We'll need to update MGS proper (in the Omicron repo) to pick up the new
`gateway-messages` changes, as well.
lzrd pushed a commit that referenced this pull request Jun 9, 2025
Basically, this commit glues together the changes from
#2064 and
oxidecomputer/management-gateway-service#390. The
`SpHandler::set_power_state` implementation for compute sleds in the
`control-plane-agent`
task now report the outcome of power state transitions using the new
`gateway_messages::PowerStateTransition` type added in oxidecomputer/management-gateway-service#390.

This fixes oxidecomputer/management-gateway-service#270, and closes
oxidecomputer/management-gateway-service#271.

We'll need to update MGS proper (in the Omicron repo) to pick up the new
`gateway-messages` changes, as well.
lzrd pushed a commit that referenced this pull request Jun 16, 2025
Basically, this commit glues together the changes from
#2064 and
oxidecomputer/management-gateway-service#390. The
`SpHandler::set_power_state` implementation for compute sleds in the
`control-plane-agent`
task now report the outcome of power state transitions using the new
`gateway_messages::PowerStateTransition` type added in oxidecomputer/management-gateway-service#390.

This fixes oxidecomputer/management-gateway-service#270, and closes
oxidecomputer/management-gateway-service#271.

We'll need to update MGS proper (in the Omicron repo) to pick up the new
`gateway-messages` changes, as well.
lzrd pushed a commit that referenced this pull request Jun 17, 2025
Basically, this commit glues together the changes from
#2064 and
oxidecomputer/management-gateway-service#390. The
`SpHandler::set_power_state` implementation for compute sleds in the
`control-plane-agent`
task now report the outcome of power state transitions using the new
`gateway_messages::PowerStateTransition` type added in oxidecomputer/management-gateway-service#390.

This fixes oxidecomputer/management-gateway-service#270, and closes
oxidecomputer/management-gateway-service#271.

We'll need to update MGS proper (in the Omicron repo) to pick up the new
`gateway-messages` changes, as well.
lzrd pushed a commit that referenced this pull request Jun 18, 2025
Basically, this commit glues together the changes from
#2064 and
oxidecomputer/management-gateway-service#390. The
`SpHandler::set_power_state` implementation for compute sleds in the
`control-plane-agent`
task now report the outcome of power state transitions using the new
`gateway_messages::PowerStateTransition` type added in oxidecomputer/management-gateway-service#390.

This fixes oxidecomputer/management-gateway-service#270, and closes
oxidecomputer/management-gateway-service#271.

We'll need to update MGS proper (in the Omicron repo) to pick up the new
`gateway-messages` changes, as well.
hawkw added a commit that referenced this pull request Oct 15, 2025
PR #2064 was _supposed_ to change the Gimlet and Cosmo CPU sequencer API
so that requesting an idempotent power state transition (i.e. A0 -> A0)
returns `Ok(Transition::Unchanged)` rather than
`Err(SeqError::IllegalTransition)`. Unfortunately, it doesn't always do
that in some cases.

This is because the match arm in which we detect no-op power state
transitions tests that the current and requested `PowerState`s are
_equal_. But, unfortunately, the `PowerState` enum has more variants
than just "A2" and "A0" --- it also represents _substates_ of A0 and A2,
such as `A0PlusHP` and `A2PlusFans`.

In particular, when a compute sled is fully up and running, it's
actually *not* in `PowerState::A0`, it's in `PowerState::A0PlusHP`,
because the NIC hotplug controller is enabled. The `PowerState::A0` and
`PowerState::A0PlusHP` enum variants are not equal, so when a sled
that's actually in `A0PlusHP` is told to go to `A0`, it doesn't match
and falls through to the error case incorrectly.

I'm pretty sure the reason we didn't realize this earlier is because,
while I did test both A2->A2 and A0->A0 transitions when I was testing
PR #2064, I would send the A0->A0 request more or less as soon as the
system reached A0. I didn't wait for the host OS to come up before
testing it, so the system was still in `PowerState::A0` and not
`PowerState::A0PlusHP`. Whoopsie.

This commit makes this operation idempotent in those cases by treating
`A0PlusHP->A0` and `A2PlusFans->A2` transitions as idempotent successes
rather than `IllegalTransition`s. I did *not* change the behavior in A0
substates that indicate a CPU reset condition (`A0Reset` and
`A0Thermtrip`), as in those cases, we require an explicit transition
back to A2 before the system will return to A0.

Fixes #2271
hawkw added a commit that referenced this pull request Oct 15, 2025
PR #2064 was _supposed_ to change the Gimlet and Cosmo CPU sequencer API
so that requesting an idempotent power state transition (i.e. A0 -> A0)
returns `Ok(Transition::Unchanged)` rather than
`Err(SeqError::IllegalTransition)`. Unfortunately, it doesn't always do
that in some cases.

This is because the match arm in which we detect no-op power state
transitions tests that the current and requested `PowerState`s are
_equal_. But, unfortunately, the `PowerState` enum has more variants
than just "A2" and "A0" --- it also represents _substates_ of A0 and A2,
such as `A0PlusHP` and `A2PlusFans`.

In particular, when a compute sled is fully up and running, it's
actually *not* in `PowerState::A0`, it's in `PowerState::A0PlusHP`,
because the NIC hotplug controller is enabled. The `PowerState::A0` and
`PowerState::A0PlusHP` enum variants are not equal, so when a sled
that's actually in `A0PlusHP` is told to go to `A0`, it doesn't match
and falls through to the error case incorrectly.

I'm pretty sure the reason we didn't realize this earlier is because,
while I did test both A2->A2 and A0->A0 transitions when I was testing
PR #2064, I would send the A0->A0 request more or less as soon as the
system reached A0. I didn't wait for the host OS to come up before
testing it, so the system was still in `PowerState::A0` and not
`PowerState::A0PlusHP`. Whoopsie.

This commit makes this operation idempotent in those cases by treating
`A0PlusHP->A0` and `A2PlusFans->A2` transitions as idempotent successes
rather than `IllegalTransition`s. I did *not* change the behavior in A0
substates that indicate a CPU reset condition (`A0Reset` and
`A0Thermtrip`), as in those cases, we require an explicit transition
back to A2 before the system will return to A0.

Fixes #2271
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants