Skip to content

Conversation

@ujfalusi
Copy link
Collaborator

@ujfalusi ujfalusi commented Jan 15, 2024

When the system is suspended while audio is active, the
sof_ipc4_pcm_hw_free() is invoked to reset the pipelines since during
suspend the DSP is turned off, streams will be re-started after resume.

If the firmware crashes during while audio is running (or when we reset
the stream before suspend) then the sof_ipc4_set_multi_pipeline_state()
will fail with IPC error and the state change is interrupted.
This will cause misalignment between the kernel and firmware state on next
DSP boot resulting errors returned by firmware for IPC messages, eventually
failing the audio resume.
On stream close the errors are ignored so the kernel state will be
corrected on the next DSP boot, so the second boot after the DSP panic.

If sof_ipc4_trigger_pipelines() is called from sof_ipc4_pcm_hw_free() then
state parameter is SOF_IPC4_PIPE_RESET and only in this case.

Treat a forced pipeline reset similarly to how we treat a pcm_free by
ignoring error on state sending to allow the kernel's state to be
consistent with the state the firmware will have after the next boot.

Link: thesofproject/sof#8721
Signed-off-by: Peter Ujfalusi [email protected]

This was referenced Jan 15, 2024
@ujfalusi
Copy link
Collaborator Author

Most if not all CI PR tests have a DSP panic on TGL machines with audio+suspend, similar to thesofproject/sof#8721

In all cases we cannot recover after that DSP crash since the kernel is left in a broken state and the next audio will fail with firmware errors.

I know, this is not elegant, but on the cleanup path this is the only place where a DSP panic can break the execution of the cleanup and we did had not once seen similar issues.

@ujfalusi
Copy link
Collaborator Author

@ranj063, @bardliao, do you have better idea what we can do?

bardliao
bardliao previously approved these changes Jan 16, 2024
dbaluta
dbaluta previously approved these changes Jan 16, 2024
* widgets will be correct for the next boot.
*/
if (sdev->fw_state != SOF_FW_CRASHED ||
!(cmd == SNDRV_PCM_TRIGGER_STOP && state == SOF_IPC4_PIPE_RESET))
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking only for state == SOF_IPC4_PIPE_RESET is enough for the workaround.

I'll update the patch and the commit message for more details I have gathered since yesterday.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ujfalusi can we do something like if (state == RESET && ret != -ETIMEDOUT) instead?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the firmware crashed before (during audio playback/capture) then the error is not going to be timeout, but -ENODEV and we need to ignore that as well.

@marc-hb
Copy link
Collaborator

marc-hb commented Jan 16, 2024

@ranj063, others could we trigger a DSP panic using some fancy new /sys/debug/ interface? Then maybe we could test this daily with sof-test.
It would not be perfect because too "deterministic", triggering the panic at more or less always the same time but it would still much better than no testing at all?

EDIT, this is what made me think about this:

@ujfalusi ujfalusi dismissed stale reviews from dbaluta and bardliao via 34a57bd January 17, 2024 10:10
@ujfalusi ujfalusi force-pushed the peter/sof/pr/ipc4-fw-panic-on-stop-01 branch from e2b5ccd to 34a57bd Compare January 17, 2024 10:10
When the system is suspended while audio is active, the
sof_ipc4_pcm_hw_free() is invoked to reset the pipelines since during
suspend the DSP is turned off, streams will be re-started after resume.

If the firmware crashes during while audio is running (or when we reset
the stream before suspend) then the sof_ipc4_set_multi_pipeline_state()
will fail with IPC error and the state change is interrupted.
This will cause misalignment between the kernel and firmware state on next
DSP boot resulting errors returned by firmware for IPC messages, eventually
failing the audio resume.
On stream close the errors are ignored so the kernel state will be
corrected on the next DSP boot, so the second boot after the DSP panic.

If sof_ipc4_trigger_pipelines() is called from sof_ipc4_pcm_hw_free() then
state parameter is SOF_IPC4_PIPE_RESET and only in this case.

Treat a forced pipeline reset similarly to how we treat a pcm_free by
ignoring error on state sending to allow the kernel's state to be
consistent with the state the firmware will have after the next boot.

Link: thesofproject/sof#8721
Signed-off-by: Peter Ujfalusi <[email protected]>
@ujfalusi ujfalusi changed the title ASoC: SOF: ipc4-pcm: Workaround for firmware crash on stream stop ASoC: SOF: ipc4-pcm: Workaround for crashed firmware on system suspend Jan 17, 2024
@ujfalusi ujfalusi marked this pull request as ready for review January 17, 2024 10:10
@ujfalusi
Copy link
Collaborator Author

Changes since v1:

  • ignore IPC errors when sof_ipc4_trigger_pipelines() is called via sof_ipc4_pcm_hw_free() (state == SOF_IPC4_PIPE_RESET)
  • Update comment and commit message
  • ready for review

@ujfalusi
Copy link
Collaborator Author

@ranj063, others could we trigger a DSP panic using some fancy new /sys/debug/ interface?

Just start audio playback ;)

Then maybe we could test this daily with sof-test. It would not be perfect because too "deterministic", triggering the panic at more or less always the same time but it would still much better than no testing at all?

I rather not spend time on introducing such an interface which can be abused on real systems. You can already send arbitrary messages to the DSP which might crash it, so...

EDIT, this is what made me think about this:

* [ASoC: SOF: debug: Introduce debug option to test DSP ops #4726](https://github.com/thesofproject/linux/pull/4726)

That PR was trying to introduce completely different interface and if that is 'alive' then the driver is effectively disabled, you cannot test how the driver would cope with a DSP panic at a given time.

@marc-hb
Copy link
Collaborator

marc-hb commented Jan 17, 2024

I rather not spend time on introducing such an interface which can be abused on real systems.

Once you have any sort of debug interface I'm pretty sure all security bets are already off.

You can already send arbitrary messages to the DSP which might crash it,

Fuzzing is supposed to catch most of that. Do you have a specific example? cc: @andyross

PR #4726 was trying to introduce completely different interface and if that is 'alive' then the driver is effectively disabled, you cannot test how the driver would cope with a DSP panic at a given time.

Forgot that sorry.

@plbossart plbossart requested review from bardliao and dbaluta January 18, 2024 17:12
@plbossart plbossart merged commit c89b9db into thesofproject:topic/sof-dev Jan 19, 2024
@ujfalusi ujfalusi deleted the peter/sof/pr/ipc4-fw-panic-on-stop-01 branch December 13, 2024 08:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants