drivers/timers/arch_alarm: Revert removal of ndelay_accurate#17221
Conversation
This reverts the removal of ndelay_accurate from apache#14450, since as mentioned in apache#17011, this fails to consider the `sim` architecture where CONFIG_BOARD_LOOPSPERMSEC was set to 0 because of reliance on the accurate implementations of the up_delay functions. All the commit did was remove a more accurate implementation in favour of a less accurate one. Signed-off-by: Matteo Golin <matteo.golin@gmail.com>
db1aa9e to
88b302d
Compare
|
@jlaitine FYI |
cederom
left a comment
There was a problem hiding this comment.
Thank you @linguini1 for the PR and @xiaoxiang781216 for the hints in #17011 :-)
One remark here, if we revert this commit there may be a short window of time where timers will be again inaccurate until fixes come from #17011? Or is it safe / necessary to merge in order to proceed with #17011 ?
This change should make the timers more accurate, by adding back the accurate Ultimately I think both ways have trade-offs, but the PR #14450 only says "Fixes many issues where up_udelay is used", not really citing any drivers/functionality that broke. However, concretely, removing this accurate |
Did you even read the summary of PR #14450 ? It doesn't "only say" what you claim, but explain the problem that the PR fixes. The ndelay_accurate was not very accurate when it quantized the delay to systick length. This caused e.g. udelay(1) to delay for close to 10000 microseconds instead of busylooping for 1 microsecond, on a system with 10ms tick. For any hardware driver initialization, where one needed to wait for a few hundred nanoseconds, the udelay(1) was commontly used. With the "accurate" ndelay everything just blew up. So when you revert that, please make sure that it doesn't happen again. I made a fix for mpfs to use architecture specific delays, which really check the time from a timer when busylooping, and that is of course the way to go. But not using the current time from oneshot_current, which calculated it from the tick! For other platforms, such as arm64, the issue will be back. |
|
Thanks @jlaitine for the valuable feedback! Yes @linguini1 is working on fixing this timing issues that you mention globally.. as you too know this area could you please take a look at #17011 and provide some hints is this the right direction? :-) |
Sorry @jlaitine , I didn't mean to imply that's all your PR said. I should have worded a bit better. What I mean is, your PR points out this flaw with the "accurate timer", namely that it's not very accurate for short sleep times where the busy wait excels. I agree with your assessment entirely. But, the PR doesn't mention any affected drivers or subsystems that broke (to my knowledge, writing this on the bus from memory), rather just that the delay time is suboptimal. However, in this PR, I do reintroduce a sleep method that is suboptimal for short delays like you mentioned, but it's actively fixing the SIM architecture which does not respect delays at all, broken.
You're right, I want a better solution than just reverting. Right now that seems like the lesser evil to me (worse short delay times for respected delays on sim). Do you have any ideas for how we can improve the implementation in general that would help both scenarios (i.e more accurate than both the old implementation and the busy wait)?
arm64 should also have a good, accurate timer like mpfs to fix this, right? Do you think that @anchao 's recent replacement of up_delay functions with nxsched_delay would resolve this problem? Please let me know what you think, I really value your input and I didn't mean to be reductive about your PR. |
The accuracy lose is introduced by #15929. Before this PR, the accuracy is same as the hardware timer.
it's a very bad idea to add tick variant in #7033, @Fix-Point take a long time to fix this problem and provide a better solution, which summary in this year workshop, please watch it: so, it's better to revert your change(both #15929 and #14450) to unblock other arch(e.g. sim and many other arch/chip) which never implement tick variant. |
|
So the intent is to now put back the systick drift, which will randomly break the hardware drivers by waking up the tick based sleeps too early? The change #15929 was done because in I2C and SPI drivers on MPFS platforms the nxsem_tickwait_uninterruptible randombly woke up too early (asked to sleep at least for 1 tick but randomly woke up on the next tick start). And the same issue occured also on arm64, and on every platform I tried to use. If currently the only broken architecture is "sim", why not add specific up_ndelay and up_udelay for "sim", and not break every other arch? It is easy to add a architecture specific variant of those functions (like I did in #16485 for mpfs). Accurate versions of up_ndelay etc. should just be done by the arch (hence the name up_*), because that is the place where all the architecture specific details of the timers are known. I am seeing many very strange things around tick timer, for example this doesn't make sense to me: #15938 . If someone is measuring / sleeping long times by using systick, he is doing something very wrong. The systick accuracy should not matter, but systick time should be constant. It should not vary! Instead of making systick time vary, one should have a constant defining the exact systick time after all the HW timer dependent roundings, and use that for calculations (how many ticks are needed for a certain time). Anyhow, I am glad to hear that there are improvements coming! Unfortunately I was unable to participate the workshop this year, I will look into the presentation for sure! Before that, below is a quick draft of how I imagined the timers should be architected. Again, I am not pushing any changes or anything, below is just an opinion. I really need to look into the @Fix-Point 's work, I am sure it will be good :) Here any driver could register for accurate periodic callbacks to the common TimerDriver, and the systick could be just one of these. |
|
I don't think we should revert #15929 based on what @jlaitine has said. Sleeping for a tick too long is one thing, but delays should never return before the delay is over. I would be willing to do an architecture-specific implementation of Anyways, to give context on the bigger picture for why I'm reverting this:
So ideally, there would be a way to modify the delay in TLDR; the busy-wait is bad and prone to user-error. The old method sleeps longer than it should and isn't great, but it fixes things and a new method is coming soon. Delay functions should never wake up early, but can wake up late. |
|
Can this PR be merged? And we can open an issue to track this sub-optimal solution so it can be replaced when @Fix-Point has their changes merged? |
|
Just please verify that 1) the udelay busyloops only appriximately the correct time and 2) tick based timeouts don't randomly timeout too early. These would be fatal bugs, and were the reasons why all those previous fixes were done. Otherwise, I don't have a religion on this; I am just tired of debugging issues caused by misbehaving basic timers... |
Okay, in that case I will mark this as a draft PR so merging can be held until I have some proper verification that the reverted commit won't cause timeouts to happen too early. @jlaitine would you want to see some kind of empirical test on multiple architectures, or just some empirical test on the simulator + logical verification that this 'accurate' method can never time out too early? I am not as well-versed on the systick drift issue as you would be, I'm sure. If you have any pointers to where to look for the underlying reason of the systick drift, I'm all ears. I figured the system tick wouldn't have such an issue but I guess that was naive. |
I'll try to do my best to help in testing. For the previous related PRs I did some simple tests, I'll try to find and gather them and try this PR out on some risc-v and arm64 targets at least (as those are the most important for me at the time being) - and report how it behaves. It will take a day or two before I can put effort in this, as I am currently in the middle of something else. Anyhow, I'll try to be more helpful in the future. |
xiaoxiang781216
left a comment
There was a problem hiding this comment.
@linguini1 @jlaitine after #17345, the accuracy of oneshot time doesn't depend on CONFIGUSEC_PER_TICK anymore, so this patch could move forward.
|
Waiting to merge for @jlaitine's feedback/blessing. |
Thanks for pinging, I will run some tests with real HW today and let you know! |
|
This is working fine on MPFS (risc-v 64-bit ) and IMX9 (arm64) platforms, thanks for giving me time to look into this! I finally also got time to check how the count-based oneshot driver works, and now I believe the implementation for systick is correct (#17345). The earlier timer drift issue is not a problem any more with this approach. |
Summary
This reverts the removal of ndelay_accurate from #14450, since as mentioned in #17011, this fails to consider the
simarchitecture where CONFIG_BOARD_LOOPSPERMSEC was set to 0 because of reliance on the accurate implementations of the up_delay functions. All the commit did was remove a more accurate implementation in favour of a less accurate one.Impact
Fixes delays not being respected at all (0 delay) on simulation configurations.
Testing
I made the following modification to the
cpuhogapplication:On the
masterbranch of NuttX, when running thecpuhogapplication insim, we see that thetv_secmember of the timespect structure has not increased by one second after the delay. The sim architecture does not respect delays at all (since LOOPSPERMSEC is 0 in almost every sim config, so busy-wait never runs).Now I run the application again with the changes from this PR.
BOARD_LOOPSPERMSECis still configured to 0, so busy-waits would not be respected.Here is the log output, which shows that after calling
up_mdelayfor 1001ms, thetv_secmember of the timespec structure has increased by one second, as expected: