@keyonjie keyonjie commented Nov 14, 2019

This is to fix #1492

We should mask the interrupt as soon as we get it in the handler; otherwise
we may get an interrupt storm when the irq thread has no chance to be
scheduled.

Signed-off-by: Keyon Jie [email protected]

@keyonjie
Author

I have only tested this on BSW, so let's wait for the on-device test results on Baytrail from the CI.

Member

@plbossart plbossart left a comment

I am not even going to look at this as is.
You are changing the entire IPC without a shred of explanation and using registers in completely different ways.
Please.
Let's be serious, shall we?

SHIM_IMRX_DONE);

if ((ipcx & SHIM_BYT_IPCX_DONE) &&
(imrx & SHIM_IMRX_DONE)) {
Member

why are you changing this?

SHIM_IMRX_BUSY);

if ((ipcd & SHIM_BYT_IPCD_BUSY) &&
(imrx & SHIM_IMRX_BUSY)) {
Member

why are you changing this?

Collaborator

@keyonjie I get the rationale for masking the IPCs before waking the thread, but you did not answer @plbossart's question about why you are changing this.

In my opinion, this check for imrx & SHIM_IMRX_BUSY or imrx & SHIM_IMRX_DONE in the thread is meaningless. With your changes, you would never get to the thread if they weren't true.

snd_sof_dsp_update_bits64_unlocked(sdev, BYT_DSP_BAR,
SHIM_IMRX,
SHIM_IMRX_DONE,
SHIM_IMRX_DONE);
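
For readers less familiar with the update-bits pattern in the snippet above, here is a minimal user-space sketch of what such a call is generally understood to do, assuming the usual read-modify-write semantics (only the bits selected by the mask are replaced). The helper name `update_bits64` and the `EX_IMRX_*` bit positions are hypothetical and chosen for illustration only; this is not the driver's implementation, and the values are not taken from the SOF headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative bit positions only; NOT taken from the driver headers. */
#define EX_IMRX_BUSY (1ULL << 0)
#define EX_IMRX_DONE (1ULL << 1)

/*
 * Hypothetical model of a read-modify-write "update bits" helper:
 * the bits selected by mask are replaced with the matching bits of value.
 * Returns true if the register content changed.
 */
static bool update_bits64(uint64_t *reg, uint64_t mask, uint64_t value)
{
	uint64_t old, next;

	assert(reg != NULL);

	old = *reg;
	next = (old & ~mask) | (value & mask);
	*reg = next;

	return next != old;
}
```

Under these assumptions, passing the same bit as both mask and value sets it (the source becomes masked), and passing the bit as mask with a value of 0 clears it again (the source is unmasked).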
Member

I am not touching this PR without a document explaining the ideas.

@keyonjie
Author

keyonjie commented Nov 14, 2019

> I am not even going to look at this as is.
> You are changing the entire IPC without a shred of explanation and using registers in completely different ways.
> Please.
> Let's be serious, shall we?

Sorry, I foresaw this challenge, but I thought I had explained it clearly in the commit message.

Let me think more about how to make it more straightforward.

We should mask the interrupt as soon as we get it in the handler; otherwise
we may get an interrupt storm when the irq thread has no chance to be
scheduled.

Signed-off-by: Keyon Jie <[email protected]>
@ranj063
Collaborator

ranj063 commented Nov 15, 2019

@keyonjie why not disable IPC interrupt (IMRX/IMRD?) in the handler and disable in the thread just like we do for HDA?

@keyonjie
Author

> @keyonjie why not disable IPC interrupt (IMRX/IMRD?) in the handler and disable in the thread just like we do for HDA?

Yes and no here.

"why not disable IPC interrupt (IMRX/IMRD?) in the handler?"

Yes for this, my PR is changing this to align with SKL+.

"and disable in the thread"

No for this in my PR. On BYT there is no common bit for enabling/disabling the IPC interrupt like HDA_DSP_ADSPIS_IPC in the HDA case, which means we have to use IMRX for that.

@ranj063
Collaborator

ranj063 commented Nov 15, 2019

> No for this in my PR. On BYT there is no common bit for enabling/disabling the IPC interrupt like HDA_DSP_ADSPIS_IPC in the HDA case, which means we have to use IMRX for that.

And is that a problem?

@plbossart
Member

@keyonjie We've worked with BYT for a very long time, so the bar to change anything is very very high now.
I don't see why we would re-interpret the meaning of registers.
If something is broken we fix it, but we don't reinvent the IPC.

@keyonjie
Author

Hi @plbossart @ranj063, I just filed an issue here:
#1492

The issue is laid out there in a straightforward way.

@keyonjie
Author

> @keyonjie We've worked with BYT for a very long time, so the bar to change anything is very very high now.
> I don't see why we would re-interpret the meaning of registers.
> If something is broken we fix it, but we don't reinvent the IPC.

Understood, sorry. I was working on SKL+ at that time and didn't pay much attention to the BYT refinements.

Please see the issue I filed here: #1492
It is obvious we have critical issues here, and this is why the system hangs during module unload/reload on BSW.

@plbossart
Member

> > @keyonjie We've worked with BYT for a very long time, so the bar to change anything is very very high now.
> > I don't see why we would re-interpret the meaning of registers.
> > If something is broken we fix it, but we don't reinvent the IPC.

> Understood, sorry. I was working on SKL+ at that time and didn't pay much attention to the BYT refinements.

> Please see the issue I filed here: #1492
> It is obvious we have critical issues here, and this is why the system hangs during module unload/reload on BSW.

I've run the module load/unload tests on Baytrail before, with no issues.
Can you please retest on a regular Baytrail device, e.g. Minnowboard, to make sure your assessment is indeed correct?

@keyonjie
Author

keyonjie commented Nov 15, 2019

@plbossart as I commented in issue #1492, both CHT and the Minnowboard carry the same risk I described there: we get multiple entries into the interrupt handler for a single interrupt, even though module unload/reload looks fine on them.

That's why we need to change the IPC interrupt handling here.

See the dmesg logs below:

On CHT:

[ 6.469040] sof-audio-acpi 808622A8:00: ipc tx: 0x30030000
[ 6.469065] Keyon: byt_irq_handler, 192, isr:0x1
[ 6.469091] Keyon: byt_irq_handler, 192, isr:0x1
[ 6.469106] Keyon: byt_irq_thread, 210, imrx:0x0, ipcx:0x4000000000000000, ipcd:0x70028800
[ 6.469113] Keyon: byt_irq_handler, 192, isr:0x1
[ 6.469119] Keyon: byt_irq_thread, 216
[ 6.469156] Keyon: byt_irq_handler, 192, isr:0x0
[ 6.469171] Keyon: byt_irq_thread, 210, imrx:0x0, ipcx:0x0, ipcd:0x70028800
[ 6.469175] sof-audio-acpi 808622A8:00: ipc tx succeeded: 0x30030000
[ 6.469191] sof-audio-acpi 808622A8:00: sink PGA3.0 control none source BUF3.0
[ 6.469202] sof-audio-acpi 808622A8:00: ipc tx: 0x30030000
[ 6.469223] Keyon: byt_irq_handler, 192, isr:0x1
[ 6.469252] Keyon: byt_irq_handler, 192, isr:0x1
[ 6.469268] Keyon: byt_irq_thread, 210, imrx:0x0, ipcx:0x4000000000000000, ipcd:0x70028800
[ 6.469276] Keyon: byt_irq_handler, 192, isr:0x1
[ 6.469282] Keyon: byt_irq_thread, 216
[ 6.469318] Keyon: byt_irq_handler, 192, isr:0x0
[ 6.469335] sof-audio-acpi 808622A8:00: ipc tx succeeded: 0x30030000
[ 6.469352] sof-audio-acpi 808622A8:00: sink BUF3.1 control none source PGA3.0
[ 6.469355] Keyon: byt_irq_thread, 210, imrx:0x0, ipcx:0x0, ipcd:0x70028800
[ 6.469364] sof-audio-acpi 808622A8:00: ipc tx: 0x30030000
[ 6.469392] Keyon: byt_irq_handler, 192, isr:0x1
[ 6.469420] Keyon: byt_irq_handler, 192, isr:0x1
[ 6.469447] Keyon: byt_irq_handler, 192, isr:0x1
[ 6.469461] Keyon: byt_irq_thread, 210, imrx:0x0, ipcx:0x4000000000000000, ipcd:0x70028800
[ 6.469468] Keyon: byt_irq_handler, 192, isr:0x1
[ 6.469474] Keyon: byt_irq_thread, 216
[ 6.469485] Keyon: byt_irq_handler, 192, isr:0x1
[ 6.469529] Keyon: byt_irq_thread, 210, imrx:0x0, ipcx:0x0, ipcd:0x70028800

On Minnow board:

[ 12.833394] sof-audio-acpi 80860F28:01: booting DSP firmware
[ 12.835286] Keyon: byt_irq_handler, 192, isr:0x2
[ 12.835380] Keyon: byt_irq_handler, 192, isr:0x2
[ 12.835415] Keyon: byt_irq_handler, 192, isr:0x2
[ 12.835432] Keyon: byt_irq_handler, 192, isr:0x2
[ 12.835449] Keyon: byt_irq_handler, 192, isr:0x2
[ 12.835466] Keyon: byt_irq_handler, 192, isr:0x2
[ 12.835484] Keyon: byt_irq_handler, 192, isr:0x2
[ 12.835509] Keyon: byt_irq_handler, 192, isr:0x2
[ 12.835532] Keyon: byt_irq_handler, 192, isr:0x2
[ 12.835550] Keyon: byt_irq_handler, 192, isr:0x2
[ 12.835559] Keyon: byt_irq_thread, 210, imrx:0x0, ipcx:0x4000, ipcd:0x8000000070028800
[ 12.835567] Keyon: byt_irq_handler, 192, isr:0x2
[ 12.835591] Keyon: byt_irq_thread, 243
[ 12.835597] Keyon: byt_irq_handler, 192, isr:0x2
[ 12.835619] sof-audio-acpi 80860F28:01: ipc rx: 0x70000000
[ 12.835628] sof-audio-acpi 80860F28:01: ipc: DSP is ready 0x70000000 offset 0x144000
[ 12.835657] sof-audio-acpi 80860F28:01: Firmware info: version 1:1:0-f004d
[ 12.835665] sof-audio-acpi 80860F28:01: Firmware: ABI 3:11:0 Kernel ABI 3:11:0
[ 12.835673] sof-audio-acpi 80860F28:01: Firmware debug build 1 on Nov 14 2019-21:16:33 - options:
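
One way to put a number on the logs above is to count how many `byt_irq_handler` lines appear before the first `byt_irq_thread` line. The following is a minimal, self-contained sketch of such a counter; the function names are hypothetical, and it assumes only the line format of the debug prints shown in this comment (substring matches on `byt_irq_handler` and `byt_irq_thread`).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if the first len bytes of line contain needle, else 0. */
static int line_contains(const char *line, size_t len, const char *needle)
{
	size_t n = strlen(needle);

	for (size_t i = 0; i + n <= len; i++)
		if (memcmp(line + i, needle, n) == 0)
			return 1;
	return 0;
}

/*
 * Count log lines mentioning byt_irq_handler that appear before the
 * first line mentioning byt_irq_thread. Lines are separated by '\n'.
 */
static int handler_lines_before_thread(const char *log)
{
	int count = 0;

	assert(log != NULL);

	while (*log) {
		const char *end = strchr(log, '\n');
		size_t len = end ? (size_t)(end - log) : strlen(log);

		if (line_contains(log, len, "byt_irq_thread"))
			break;
		if (line_contains(log, len, "byt_irq_handler"))
			count++;

		log += len;
		if (*log == '\n')
			log++;
	}
	return count;
}
```

Fed the dmesg text as a single string, this returns the number of handler entries seen before the thread first ran. It does not try to attribute entries to individual IPC messages.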

@plbossart
Member

@keyonjie Can we first try and understand the initial blocker, which was to deal with module load/unload? There does not seem to be any correlation nor causation between IPC issues and module removals.

@keyonjie
Author

> @keyonjie Can we first try and understand the initial blocker, which was to deal with module load/unload? There does not seem to be any correlation nor causation between IPC issues and module removals.

Per my understanding, there could be. I believe the hang we see there is caused by an IPC interrupt storm. The fact that applying these changes fixes the hang proves that.

And to say the least, I actually can't bear this kind of obvious risk in the code, can you?

@plbossart
Member

> > @keyonjie Can we first try and understand the initial blocker, which was to deal with module load/unload? There does not seem to be any correlation nor causation between IPC issues and module removals.

> Per my understanding, there could be. I believe the hang we see there is caused by an IPC interrupt storm. The fact that applying these changes fixes the hang proves that.

> And to say the least, I actually can't bear this kind of obvious risk in the code, can you?

In God we trust, others bring data.

Your theory of IPC rain storm may be right, but why doesn't it impact other platforms then? It needs to be backed by either experimental evidence or demonstration that the hang is removed with your fixes AND there is no regression on other platforms.

I am not going to change one line of IPC code without careful explanation and extensive testing. The code works, maybe not optimally, on all other platforms, and has done so for many months, so the 'obvious risk' is not so obvious, sorry. The last 'obvious' IPC fix was reverted due to other issues which exposed an obvious lack of extensive testing. Not going to happen twice, sorry.

It's fine to experiment, it's a completely different story to submit a PR and ask that it be merged. This PR does not have a 'Draft' or 'TEST' or 'RFC' status, so I will push back if I don't feel comfortable with its maturity.

@keyonjie
Author

> > > @keyonjie Can we first try and understand the initial blocker, which was to deal with module load/unload? There does not seem to be any correlation nor causation between IPC issues and module removals.

> > Per my understanding, there could be. I believe the hang we see there is caused by an IPC interrupt storm. The fact that applying these changes fixes the hang proves that.
> > And to say the least, I actually can't bear this kind of obvious risk in the code, can you?

> In God we trust, others bring data.

> Your theory of IPC rain storm may be right, but why doesn't it impact other platforms then? It needs to be backed by either experimental evidence or demonstration that the hang is removed with your fixes AND there is no regression on other platforms.

> I am not going to change one line of IPC code without careful explanation and extensive testing. The code works, maybe not optimally, on all other platforms, and has done so for many months, so the 'obvious risk' is not so obvious, sorry. The last 'obvious' IPC fix was reverted due to other issues which exposed an obvious lack of extensive testing. Not going to happen twice, sorry.

> It's fine to experiment, it's a completely different story to submit a PR and ask that it be merged. This PR does not have a 'Draft' or 'TEST' or 'RFC' status, so I will push back if I don't feel comfortable with its maturity.

I totally understand your point and I appreciate it. IPC is a crucial part, so any change to it should pass stress testing; this is the right attitude to guarantee our code quality.

Let me change the subject and call for more testing on BYT/CHT/BSW.

@keyonjie keyonjie changed the title ASoC: SOF: Intel: BYT: refine the IPC interrupt handling [Call for Test]ASoC: SOF: Intel: BYT: refine the IPC interrupt handling Nov 15, 2019
@keyonjie
Author

keyonjie commented Nov 15, 2019

@mengdonglin @keqiaozhang Can I ask for stress/extensive testing on BYT/CHT/BSW with this PR applied? It is not expected to fix issues on BYT/CHT, so no regression with it applied is good enough for me.

@fredoh9
Collaborator

fredoh9 commented Nov 21, 2019

I ran a one-hour stress test and don't see any regressions.

@keyonjie
Author

@fredoh9 Thank you. @plbossart @ranj063 The test on my side looks good on cht and bsw also.

@keyonjie keyonjie changed the title [Call for Test]ASoC: SOF: Intel: BYT: refine the IPC interrupt handling ASoC: SOF: Intel: BYT: refine the IPC interrupt handling Nov 21, 2019
@keyonjie keyonjie requested a review from plbossart November 24, 2019 04:54
@keyonjie
Author

keyonjie commented Dec 1, 2019

@plbossart @ranj063 Can you consider taking this one, please? The logs in #1492 show the risk without this, and testing shows no regression with the PR.

@plbossart
Member

> @plbossart @ranj063 Can you consider taking this one, please? The logs in #1492 show the risk without this, and testing shows no regression with the PR.

no. I don't have a full picture of why this is needed and what your fix accomplishes and what it actually fixes, and if we have similar issues with other platforms.

@keyonjie
Author

keyonjie commented Dec 3, 2019

> > @plbossart @ranj063 Can you consider taking this one, please? The logs in #1492 show the risk without this, and testing shows no regression with the PR.

> no. I don't have a full picture of why this is needed and what your fix accomplishes and what it actually fixes, and if we have similar issues with other platforms.

Why this is needed:
With today's code on BYT, byt_irq_handler() is entered multiple times for a single IPC interrupt. Because we don't mask the interrupt in the handler, the handler can be entered again (for the same IPC interrupt) before the interrupt thread gets to mask it, and the new entry preempts the thread. The test code and test results demonstrate this.
Leaving aside the consequences and risk, "byt_irq_handler() is entered multiple times for a single IPC interrupt" goes against our original design intention, doesn't it?

What my fix accomplishes:
By moving the masking of the SHIM_IMRX_BUSY/DONE bits from the interrupt thread to the interrupt handler, the handler is not entered a second time for the same IPC interrupt, and the interrupt thread is not wrongly scheduled more than once.

What it actually fixes:
On BSW, I hit an issue where the interrupt handler sometimes keeps being invoked, the irq thread never gets to run, and the whole system hangs.

Similar issues with other platforms?
I believe we need similar fixes for BDW/HSW, as there is no GIE to disable there. For cAVS platforms we don't need this fix, as we disable GIE in the interrupt handler on first entry.
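
The argument above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not driver code: it assumes the interrupt line stays asserted while a source is pending and unmasked, and that the hard handler is re-entered each time the line is sampled while asserted. The names `ipc_source`, `line_asserted`, `handler_entries` and the `thread_delay` parameter are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one IPC source: a pending bit and its mask bit. */
struct ipc_source {
	bool pending;	/* e.g. a DONE or BUSY bit raised by the DSP */
	bool masked;	/* e.g. the matching IMRX mask bit */
};

/* Assumption: the line is asserted while the source is pending and unmasked. */
static bool line_asserted(const struct ipc_source *s)
{
	return s->pending && !s->masked;
}

/*
 * Count hard-handler entries for ONE pending IPC event.
 * mask_in_handler: true if the handler masks the source before waking
 *                  the thread, false if masking is left to the thread.
 * thread_delay:    how many extra times the line is sampled before the
 *                  thread finally gets CPU time.
 */
static int handler_entries(bool mask_in_handler, int thread_delay)
{
	struct ipc_source s = { .pending = true, .masked = false };
	int entries = 0;

	assert(thread_delay >= 0);

	for (int sample = 0; sample <= thread_delay; sample++) {
		if (!line_asserted(&s))
			continue;	/* line is quiet, no handler entry */
		entries++;		/* hard handler runs */
		if (mask_in_handler)
			s.masked = true;	/* masked before the thread wakes */
	}
	return entries;
}
```

In this model, with `thread_delay` set to 9, leaving the masking to the thread gives 10 handler entries for one event, while masking in the handler gives exactly 1. If the thread is never scheduled, the first count grows without bound, which is the storm scenario described above.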

@plbossart
Member

> > > @plbossart @ranj063 Can you consider taking this one, please? The logs in #1492 show the risk without this, and testing shows no regression with the PR.

> > no. I don't have a full picture of why this is needed and what your fix accomplishes and what it actually fixes, and if we have similar issues with other platforms.

> Why this is needed:
> With today's code on BYT, byt_irq_handler() is entered multiple times for a single IPC interrupt. Because we don't mask the interrupt in the handler, the handler can be entered again (for the same IPC interrupt) before the interrupt thread gets to mask it, and the new entry preempts the thread. The test code and test results demonstrate this.
> Leaving aside the consequences and risk, "byt_irq_handler() is entered multiple times for a single IPC interrupt" goes against our original design intention, doesn't it?

> What my fix accomplishes:
> By moving the masking of the SHIM_IMRX_BUSY/DONE bits from the interrupt thread to the interrupt handler, the handler is not entered a second time for the same IPC interrupt, and the interrupt thread is not wrongly scheduled more than once.

> What it actually fixes:
> On BSW, I hit an issue where the interrupt handler sometimes keeps being invoked, the irq thread never gets to run, and the whole system hangs.

> Similar issues with other platforms?
> I believe we need similar fixes for BDW/HSW, as there is no GIE to disable there. For cAVS platforms we don't need this fix, as we disable GIE in the interrupt handler on first entry.

ok, let me make my point very simple:

please work on D0i3 support as your first priority.
When it's complete, we can go back to IPC - and you will have to request QA support.

Multi-tasking across completely unrelated platforms is inefficient. I will not even look at baytrail IPC until D0i3 is complete, merged and fully tested.

@keyonjie
Author

keyonjie commented Dec 4, 2019

> ok, let me make my point very simple:

> please work on D0i3 support as your first priority.
> When it's complete, we can go back to IPC - and you will have to request QA support.

> Multi-tasking across completely unrelated platforms is inefficient. I will not even look at baytrail IPC until D0i3 is complete, merged and fully tested.

That's true, thanks for your clear point.

@plbossart plbossart added the Unclear No agreement on problem statement and resolution label Feb 4, 2020
@plbossart
Member

replaced by PR #2138

@plbossart plbossart closed this May 22, 2020
Labels

Unclear No agreement on problem statement and resolution

Development

Successfully merging this pull request may close these issues.

BYT: byt_irq_handler() is entered multiple times for a single IPC interrupt

4 participants