drivers/timers/arch_alarm: Revert removal of ndelay_accurate #17221

Merged
xiaoxiang781216 merged 1 commit into apache:master from linguini1:revert-ndelay-accurate
Nov 25, 2025

Conversation

@linguini1
Contributor

@linguini1 linguini1 commented Oct 21, 2025

Summary

This reverts the removal of ndelay_accurate in #14450. As mentioned in #17011, that removal fails to consider the sim architecture, where CONFIG_BOARD_LOOPSPERMSEC is set to 0 because sim relies on the accurate implementations of the up_delay functions. All the commit did was remove a more accurate implementation in favour of a less accurate one.

Impact

Fixes delays not being respected at all (0 delay) on simulation configurations.

Testing

I made the following modification to the cpuhog application:

diff --git a/examples/cpuhog/cpuhog_main.c b/examples/cpuhog/cpuhog_main.c
index 6a45ddad0..fccccd8ce 100644
--- a/examples/cpuhog/cpuhog_main.c
+++ b/examples/cpuhog/cpuhog_main.c
@@ -86,10 +86,17 @@ int main(int argc, FAR char *argv[])
 {
   int id = -1;
   char buf[256];
   int fd = -1;

+  struct timespec before;
+  struct timespec after;
+  clock_gettime(CLOCK_MONOTONIC, &before);
+  up_mdelay(1001);
+  clock_gettime(CLOCK_MONOTONIC, &after);
+  printf("Before: %lu, after: %lu\n", before.tv_sec, after.tv_sec);
+
   if (!g_state.initialized)
     {
       sem_init(&g_state.sem, 0, 1);
       mkfifo(CPUHOG_FIFO_FNAME, 0666);
       g_state.count = 0;

On the master branch of NuttX, when running the cpuhog application in sim, we see that the tv_sec member of the timespec structure has not increased by one second after the delay. The sim architecture does not respect delays at all (LOOPSPERMSEC is 0 in almost every sim config, so the busy-wait never runs).

NuttShell (NSH) NuttX-12.11.0
nsh> cpuhog
Before: 1, after: 1
cpuhog initialized
cpuhog 0: consumer
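
To illustrate why a loops-per-millisecond value of 0 produces no delay at all, here is a minimal sketch of a calibration-based busy-wait. It is not the NuttX source: `busy_mdelay` is a hypothetical stand-in for up_mdelay, and its `loops_per_msec` parameter plays the role of CONFIG_BOARD_LOOPSPERMSEC.

```c
#include <stdint.h>

/* Hypothetical calibration-based busy-wait (NOT the NuttX up_mdelay()).
 * loops_per_msec stands in for CONFIG_BOARD_LOOPSPERMSEC.
 * Returns the number of spin iterations performed.
 */

uint64_t busy_mdelay(unsigned int milliseconds, unsigned int loops_per_msec)
{
  volatile unsigned int i;   /* volatile so the loop is not optimized away */
  uint64_t iterations = 0;

  while (milliseconds-- > 0)
    {
      /* With loops_per_msec == 0 this inner loop never executes */

      for (i = 0; i < loops_per_msec; i++)
        {
          iterations++;
        }
    }

  return iterations;
}
```

With `loops_per_msec` set to 0, `busy_mdelay(1001, 0)` performs no spin iterations and returns almost immediately, which matches the "Before: 1, after: 1" output above.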

Now I run the application again with the changes from this PR. BOARD_LOOPSPERMSEC is still configured to 0, so busy-waits would not be respected.

Here is the log output, which shows that after calling up_mdelay for 1001ms, the tv_sec member of the timespec structure has increased by one second, as expected:

NuttShell (NSH) NuttX-12.11.0
nsh> cpuhog
Before: 2, after: 3
cpuhog initialized
cpuhog 0: consumer
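
The test above compares whole seconds only. As a possible refinement, the sketch below computes the elapsed time in milliseconds from the two timestamps. `timespec_diff_ms` is a hypothetical helper and is not part of the cpuhog example or of this PR.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: elapsed time between two timestamps, in whole
 * milliseconds (truncated toward zero).
 */

int64_t timespec_diff_ms(const struct timespec *before,
                         const struct timespec *after)
{
  int64_t sec  = (int64_t)after->tv_sec - (int64_t)before->tv_sec;
  int64_t nsec = (int64_t)after->tv_nsec - (int64_t)before->tv_nsec;

  /* Combine into nanoseconds first so a negative nsec difference borrows
   * correctly from the seconds.
   */

  return (sec * 1000000000 + nsec) / 1000000;
}
```

Printing this value around `up_mdelay(1001)` would show directly whether roughly 1001 ms elapsed, rather than inferring it from a change in tv_sec.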

@github-actions github-actions Bot added the Area: Drivers, Area: OS Components, and Size: M labels Oct 21, 2025
This reverts the removal of ndelay_accurate from apache#14450, since as
mentioned in apache#17011, this fails to consider the `sim` architecture
where CONFIG_BOARD_LOOPSPERMSEC was set to 0 because of reliance on the
accurate implementations of the up_delay functions. All the commit did
was remove a more accurate implementation in favour of a less accurate
one.

Signed-off-by: Matteo Golin <matteo.golin@gmail.com>
@linguini1 linguini1 force-pushed the revert-ndelay-accurate branch from db1aa9e to 88b302d Compare October 21, 2025 01:01
@acassis
Contributor

acassis commented Oct 21, 2025

@jlaitine FYI

Contributor

@cederom cederom left a comment

Thank you @linguini1 for the PR and @xiaoxiang781216 for the hints in #17011 :-)

One remark here, if we revert this commit there may be a short window of time where timers will be again inaccurate until fixes come from #17011? Or is it safe / necessary to merge in order to proceed with #17011 ?

@linguini1
Contributor Author

> Thank you @linguini1 for the PR and @xiaoxiang781216 for the hints in #17011 :-)
>
> One remark here, if we revert this commit there may be a short window of time where timers will be again inaccurate until fixes come from #17011? Or is it safe / necessary to merge in order to proceed with #17011 ?

This change should make the timers more accurate, by adding back the accurate ndelay. It was less accurate after this removal because the implementation exclusively used busy-waiting, which is worse. The only difference is I imagine busy-waiting has better granularity than the accurate ndelay on some systems where the tick is large, assuming the busy-wait is not interrupted by other tasks, etc. I think if we want to improve the accuracy of ndelay then its implementation needs to change to something that's not busy waiting. We could maybe introduce some ifdef so if the architecture doesn't have good granularity on ndelay, it can default back to the busy-wait implementation? Might need a mailing list discussion after this is merged.

Ultimately I think both ways have trade-offs, but the PR #14450 only says "Fixes many issues where up_udelay is used", not really citing any drivers/functionality that broke. However, concretely, removing this accurate ndelay function breaks delays entirely on sim, so no delays are respected.

@jlaitine
Contributor

jlaitine commented Oct 22, 2025

> > Thank you @linguini1 for the PR and @xiaoxiang781216 for the hints in #17011 :-)
> >
> > One remark here, if we revert this commit there may be a short window of time where timers will be again inaccurate until fixes come from #17011? Or is it safe / necessary to merge in order to proceed with #17011 ?
>
> This change should make the timers more accurate, by adding back the accurate ndelay. It was less accurate after this removal because the implementation exclusively used busy-waiting, which is worse. The only difference is I imagine busy-waiting has better granularity than the accurate ndelay on some systems where the tick is large, assuming the busy-wait is not interrupted by other tasks, etc. I think if we want to improve the accuracy of ndelay then its implementation needs to change to something that's not busy waiting. We could maybe introduce some ifdef so if the architecture doesn't have good granularity on ndelay, it can default back to the busy-wait implementation? Might need a mailing list discussion after this is merged.
>
> Ultimately I think both ways have trade-offs, but the PR #14450 only says "Fixes many issues where up_udelay is used", not really citing any drivers/functionality that broke. However, concretely, removing this accurate ndelay function breaks delays entirely on sim, so no delays are respected.

Did you even read the summary of PR #14450? It doesn't "only say" what you claim, but explains the problem that the PR fixes. The ndelay_accurate was not very accurate, because it quantized the delay to the systick length. This caused e.g. udelay(1) to delay for close to 10000 microseconds instead of busy-looping for 1 microsecond, on a system with a 10 ms tick. For any hardware driver initialization where one needed to wait for a few hundred nanoseconds, udelay(1) was commonly used. With the "accurate" ndelay everything just blew up.
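
The quantization described here can be worked through with a small sketch. It assumes the delay is rounded up to whole ticks; the function names are hypothetical and `usec_per_tick` stands in for CONFIG_USEC_PER_TICK.

```c
#include <stdint.h>

/* Round a microsecond delay up to whole system ticks, as a tick-quantized
 * delay has to do. Hypothetical helper; may overflow for delays close to
 * UINT32_MAX.
 */

uint32_t usec_to_ticks_roundup(uint32_t usec, uint32_t usec_per_tick)
{
  return (usec + usec_per_tick - 1) / usec_per_tick;
}

/* The requested delay after it has been rounded up to whole ticks */

uint32_t quantized_delay_usec(uint32_t usec, uint32_t usec_per_tick)
{
  return usec_to_ticks_roundup(usec, usec_per_tick) * usec_per_tick;
}
```

With a 10 ms tick, `quantized_delay_usec(1, 10000)` is 10000, so a 1 microsecond request becomes a delay of up to one full tick.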

So when you revert that, please make sure that it doesn't happen again.

I made a fix for mpfs to use architecture specific delays, which really check the time from a timer when busylooping, and that is of course the way to go. But not using the current time from oneshot_current, which calculated it from the tick! For other platforms, such as arm64, the issue will be back.
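
As a rough illustration of a delay that checks a timer while busy-looping, the sketch below polls a free-running counter until enough counts have passed. It is not the mpfs code from #16485: `timer_ndelay`, `read_counter_t` and the simulated `fake_counter_read` are hypothetical, and a real port would read a hardware counter register instead.

```c
#include <stdint.h>

/* Hypothetical accessor type for a free-running hardware counter */

typedef uint64_t (*read_counter_t)(void);

/* Simulated counter for demonstration: advances one count per read */

static uint64_t g_fake_count;

uint64_t fake_counter_read(void)
{
  return g_fake_count++;
}

/* Busy-loop until at least `nanoseconds` worth of counts have elapsed on a
 * counter running at `counter_hz`. Returns the number of counts waited.
 * The multiplication may overflow for very long delays.
 */

uint64_t timer_ndelay(read_counter_t read_counter, uint64_t counter_hz,
                      uint64_t nanoseconds)
{
  /* Convert the delay to counts, rounding up */

  uint64_t counts = (nanoseconds * counter_hz + 999999999ull) /
                    1000000000ull;
  uint64_t start  = read_counter();
  uint64_t now    = start;

  /* Unsigned subtraction stays correct across counter wraparound. A real
   * counter may already be partway through a count at `start`; add one
   * count if the delay must never be short.
   */

  while (now - start < counts)
    {
      now = read_counter();
    }

  return now - start;
}
```

Because the loop measures elapsed counts rather than iterations, it needs no per-board calibration and its resolution is that of the counter, not of the system tick.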

@cederom
Contributor

cederom commented Oct 22, 2025

Thanks @jlaitine for the valuable feedback! Yes, @linguini1 is working on fixing the timing issues that you mention globally. As you also know this area, could you please take a look at #17011 and provide some hints on whether this is the right direction? :-)

@linguini1
Contributor Author

> Did you even read the summary of PR #14450? It doesn't "only say" what you claim, but explains the problem that the PR fixes. The ndelay_accurate was not very accurate, because it quantized the delay to the systick length. This caused e.g. udelay(1) to delay for close to 10000 microseconds instead of busy-looping for 1 microsecond, on a system with a 10 ms tick. For any hardware driver initialization where one needed to wait for a few hundred nanoseconds, udelay(1) was commonly used. With the "accurate" ndelay everything just blew up.

Sorry @jlaitine, I didn't mean to imply that's all your PR said. I should have worded that a bit better. What I mean is, your PR points out this flaw with the "accurate timer", namely that it's not very accurate for short sleep times where the busy wait excels. I agree with your assessment entirely. But the PR doesn't mention any affected drivers or subsystems that broke (to my knowledge, writing this on the bus from memory), rather just that the delay time is suboptimal. In this PR, I do reintroduce a sleep method that is suboptimal for short delays, like you mentioned, but it actively fixes the sim architecture, which is currently broken and does not respect delays at all.

> So when you revert that, please make sure that it doesn't happen again.

You're right, I want a better solution than just reverting. Right now that seems like the lesser evil to me (worse short delay times in exchange for respected delays on sim). Do you have any ideas for how we can improve the implementation in general that would help both scenarios (i.e. more accurate than both the old implementation and the busy wait)?

> I made a fix for mpfs to use architecture specific delays, which really check the time from a timer when busylooping, and that is of course the way to go. But not using the current time from oneshot_current, which calculated it from the tick! For other platforms, such as arm64, the issue will be back.

arm64 should also have a good, accurate timer like mpfs to fix this, right? Do you think that @anchao 's recent replacement of up_delay functions with nxsched_delay would resolve this problem?

Please let me know what you think, I really value your input and I didn't mean to be reductive about your PR.

@xiaoxiang781216
Contributor

xiaoxiang781216 commented Oct 23, 2025

> > > Thank you @linguini1 for the PR and @xiaoxiang781216 for the hints in #17011 :-)
> > >
> > > One remark here, if we revert this commit there may be a short window of time where timers will be again inaccurate until fixes come from #17011? Or is it safe / necessary to merge in order to proceed with #17011 ?
> >
> > This change should make the timers more accurate, by adding back the accurate ndelay. It was less accurate after this removal because the implementation exclusively used busy-waiting, which is worse. The only difference is I imagine busy-waiting has better granularity than the accurate ndelay on some systems where the tick is large, assuming the busy-wait is not interrupted by other tasks, etc. I think if we want to improve the accuracy of ndelay then its implementation needs to change to something that's not busy waiting. We could maybe introduce some ifdef so if the architecture doesn't have good granularity on ndelay, it can default back to the busy-wait implementation? Might need a mailing list discussion after this is merged.
> >
> > Ultimately I think both ways have trade-offs, but the PR #14450 only says "Fixes many issues where up_udelay is used", not really citing any drivers/functionality that broke. However, concretely, removing this accurate ndelay function breaks delays entirely on sim, so no delays are respected.
>
> Did you even read the summary of PR #14450? It doesn't "only say" what you claim, but explains the problem that the PR fixes. The ndelay_accurate was not very accurate, because it quantized the delay to the systick length. This caused e.g. udelay(1) to delay for close to 10000 microseconds instead of busy-looping for 1 microsecond, on a system with a 10 ms tick. For any hardware driver initialization where one needed to wait for a few hundred nanoseconds, udelay(1) was commonly used. With the "accurate" ndelay everything just blew up.

The accuracy loss was introduced by #15929. Before that PR, the accuracy was the same as the hardware timer.

> So when you revert that, please make sure that it doesn't happen again.
>
> I made a fix for mpfs to use architecture specific delays, which really check the time from a timer when busylooping, and that is of course the way to go. But not using the current time from oneshot_current, which calculated it from the tick! For other platforms, such as arm64, the issue will be back.

It was a very bad idea to add the tick variant in #7033. @Fix-Point took a long time to fix this problem and provide a better solution, which is summarized in this year's workshop; please watch it:
https://www.youtube.com/watch?v=tqWwKLCD0dU&t=19710s

So it's better to revert your changes (both #15929 and #14450) to unblock other architectures (e.g. sim and many other archs/chips) which never implemented the tick variant.
@Fix-Point is preparing the final solution, which provides the same accuracy as the hardware together with an efficient implementation.

@jlaitine
Contributor

So the intent is to now put back the systick drift, which will randomly break the hardware drivers by waking up the tick-based sleeps too early? The change #15929 was done because in I2C and SPI drivers on MPFS platforms the nxsem_tickwait_uninterruptible randomly woke up too early (asked to sleep at least for 1 tick but randomly woke up on the next tick start). And the same issue occurred also on arm64, and on every platform I tried to use.

If currently the only broken architecture is "sim", why not add specific up_ndelay and up_udelay for "sim", and not break every other arch? It is easy to add an architecture-specific variant of those functions (like I did in #16485 for mpfs).

Accurate versions of up_ndelay etc. should just be done by the arch (hence the name up_*), because that is the place where all the architecture specific details of the timers are known.

I am seeing many very strange things around tick timer, for example this doesn't make sense to me: #15938 . If someone is measuring / sleeping long times by using systick, he is doing something very wrong. The systick accuracy should not matter, but systick time should be constant. It should not vary! Instead of making systick time vary, one should have a constant defining the exact systick time after all the HW timer dependent roundings, and use that for calculations (how many ticks are needed for a certain time).
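
One way to read the "constant tick time" suggestion is sketched below: derive the tick period in timer counts once, compute the exact tick duration from it, and use that constant whenever a duration is converted to ticks. All names are hypothetical, and the 32768 Hz timer in the usage note is only an example value.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ull

/* Timer counts per tick, rounded to the nearest value the hardware timer
 * can actually count.
 */

uint64_t tick_period_counts(uint64_t timer_hz, uint64_t nominal_tick_usec)
{
  return (timer_hz * nominal_tick_usec + 500000) / 1000000;
}

/* Exact tick duration in nanoseconds implied by that count period.
 * Rounded down, so conversions below err toward waiting longer.
 */

uint64_t tick_period_nsec(uint64_t timer_hz, uint64_t counts)
{
  return counts * NSEC_PER_SEC / timer_hz;
}

/* Number of whole ticks needed to cover at least `nsec` */

uint64_t ticks_for_nsec(uint64_t nsec, uint64_t tick_nsec)
{
  return (nsec + tick_nsec - 1) / tick_nsec;
}
```

For a 32768 Hz timer and a nominal 10000 microsecond tick, the period is 328 counts, which is 10009765 ns rather than 10000000 ns, so one second needs 100 ticks of the real period. Every tick then has the same length, and the rounding error is accounted for once in the constant.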

Anyhow, I am glad to hear that there are improvements coming! Unfortunately I was unable to participate in the workshop this year, I will look into the presentation for sure!

Before that, below is a quick draft of how I imagined the timers should be architected. Again, I am not pushing any changes or anything, below is just an opinion. I really need to look into the @Fix-Point 's work, I am sure it will be good :)

Arch/TimerSource
- provide accurate up_ndelay etc.
- registers to TimerDriver (at boot)
- provides timer compare interrupt for TimerDriver

    ^
    |
    |

Common/TimerDriver
- list of timer sources (or just one per architecture)
- sorted queue of bucketed callbacks
- selects TimerSource and sets it to interrupt at next callback bucket expiration
- at interrupt, calls all callbacks of the bucket
- provides coarse up_ndelay etc. if accurate ones are not available for the architecture
- provides interface for registering oneshot and periodic callbacks

    ^
    |
    |

Common/SystickDriver
- registers a periodic callback to TimerDriver


Here any driver could register for accurate periodic callbacks to the common TimerDriver, and the systick could be just one of these.
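
The layering in this draft could look roughly like the sketch below, in which a common timer driver keeps a queue of callbacks sorted by expiry and the system tick is just one periodic client. This is only an illustration of the idea: every name is hypothetical, the "buckets" are simplified to one entry per callback, and the compare interrupt is modelled as a plain function call.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_TIMERS 8

typedef void (*timer_cb_t)(void *arg);

struct timer_entry
{
  uint64_t   expiry;   /* Absolute expiry time, in timer counts */
  uint64_t   period;   /* 0 for oneshot, otherwise the re-arm interval */
  timer_cb_t cb;
  void      *arg;
};

struct timer_driver
{
  struct timer_entry queue[MAX_TIMERS];  /* Kept sorted by expiry */
  size_t             count;
};

/* Insert an entry, keeping the queue sorted. Returns 0, or -1 if full. */

int timer_register(struct timer_driver *drv, uint64_t expiry,
                   uint64_t period, timer_cb_t cb, void *arg)
{
  size_t i;

  if (drv->count >= MAX_TIMERS)
    {
      return -1;
    }

  /* Shift later entries up by one to open a slot */

  for (i = drv->count; i > 0 && drv->queue[i - 1].expiry > expiry; i--)
    {
      drv->queue[i] = drv->queue[i - 1];
    }

  drv->queue[i].expiry = expiry;
  drv->queue[i].period = period;
  drv->queue[i].cb     = cb;
  drv->queue[i].arg    = arg;
  drv->count++;
  return 0;
}

/* What the compare interrupt would do: run every callback that expired at
 * or before `now`, re-arm periodic ones, and return the next expiry to
 * program into the timer source (UINT64_MAX if the queue is empty).
 */

uint64_t timer_expire(struct timer_driver *drv, uint64_t now)
{
  while (drv->count > 0 && drv->queue[0].expiry <= now)
    {
      struct timer_entry e = drv->queue[0];
      size_t i;

      /* Pop the head of the queue */

      for (i = 1; i < drv->count; i++)
        {
          drv->queue[i - 1] = drv->queue[i];
        }

      drv->count--;
      e.cb(e.arg);

      if (e.period > 0)
        {
          timer_register(drv, e.expiry + e.period, e.period, e.cb, e.arg);
        }
    }

  return drv->count > 0 ? drv->queue[0].expiry : UINT64_MAX;
}

/* The system tick as one periodic client: it only counts ticks */

void systick_cb(void *arg)
{
  (*(uint64_t *)arg)++;
}
```

Registering `systick_cb` with a period of 10 counts and servicing the queue late, at count 35, delivers three ticks and asks for the next interrupt at count 40. Re-arming from the previous expiry rather than from `now` keeps the tick period constant even when the interrupt is serviced late.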

@linguini1
Contributor Author

I don't think we should revert #15929 based on what @jlaitine has said. Sleeping for a tick too long is one thing, but delays should never return before the delay is over.

I would be willing to do an architecture-specific implementation of up_delay for sim, but I guess I'm still concerned that if sim has been broken by this change for all this time, I wonder if anything else broke and has been unnoticed.

Anyways, to give context on the bigger picture for why I'm reverting this:

  1. CONFIG_BOARD_LOOPSPERMSEC controls all the busy-wait delays, which before #17204 (sched/sleep: replace all Signal-based sleep implement to Scheduled sleep) were pretty much everywhere in the code
  2. CONFIG_BOARD_LOOPSPERMSEC had some small default value, so users who forgot to calibrate their board and pick a sane value (or users who didn't know the option existed) would have very incorrect delay times
  3. I introduced PR #17011 (arch: Remove default value for BOARD_LOOPSPERMSEC) to have an invalid default value that would get caught at compile time, so the user is always notified to calibrate their board
  4. Some boards shouldn't need this option at all (like sim setting it to 0) since they have timer-based delay implementations which are better
  5. Therefore, I want to put back the timer-based delay in arch_alarm.c since those architectures should be able to have a delay which isn't busy-wait based (since busy-waiting is bad performance-wise and accuracy-wise). This narrows the amount of architectures relying on this bad busy-loop, without needing lots of unique implementations per-architecture.
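
The compile-time check described in step 3 could take a form like the sketch below. The sentinel value, the `LOOPSPERMSEC_UNCALIBRATED` name and the fallback definition of 5000 are assumptions made for illustration; they are not taken from #17011.

```c
/* Hypothetical sentinel standing in for the invalid Kconfig default */

#define LOOPSPERMSEC_UNCALIBRATED (-1)

/* Example calibrated value so that this sketch compiles on its own; in a
 * real build the value would come from the board configuration.
 */

#ifndef CONFIG_BOARD_LOOPSPERMSEC
#  define CONFIG_BOARD_LOOPSPERMSEC 5000
#endif

/* Fail the build instead of silently using a meaningless delay loop */

#if CONFIG_BOARD_LOOPSPERMSEC == LOOPSPERMSEC_UNCALIBRATED
#  error "CONFIG_BOARD_LOOPSPERMSEC is not calibrated for this board"
#endif

int board_loops_per_msec(void)
{
  return CONFIG_BOARD_LOOPSPERMSEC;
}
```

A board with a timer-based delay would not need the option at all, which is the case step 4 describes for sim.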

So ideally, there would be a way to modify the delay in arch_alarm.c so that it's timer-based and is better than the flawed old version of ndelay_accurate you removed before. I believe this is the new method that Xiang Xiao is talking about, but it's not upstreamed just yet. In the meantime, I put back the old way to fix sim (and possibly other architectures) to get closer to the goal of removing reliance on the busy-wait. I think drivers using up_delay functions should not be sensitive to the delay being longer than anticipated (within reason), but having an implementation that is more accurate would improve performance. That's I guess on hold until this new fix Xiang mentioned is upstreamed.

TL;DR: the busy-wait is bad and prone to user error. The old method sleeps longer than it should and isn't great, but it fixes things and a new method is coming soon. Delay functions should never wake up early, but can wake up late.
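
The "never early, possibly late" rule can be made concrete with a small model of a tick-based wait. It assumes ticks fire at exact multiples of the tick length and that a sleeper wakes on the Nth tick after it starts; both function names are hypothetical, and the extra tick is a common general technique rather than a description of what NuttX does.

```c
#include <stdint.h>

/* Absolute wake-up time of a sleeper that starts at `start_usec` and waits
 * for `nticks` tick interrupts, with ticks at multiples of `tick_usec`.
 */

uint64_t wake_time_usec(uint64_t start_usec, uint64_t nticks,
                        uint64_t tick_usec)
{
  uint64_t current_tick = start_usec / tick_usec;
  return (current_tick + nticks) * tick_usec;
}

/* Ticks to wait so that the elapsed time is never shorter than
 * `delay_usec`: round up, then add one tick to cover the partially
 * elapsed current tick.
 */

uint64_t safe_ticks(uint64_t delay_usec, uint64_t tick_usec)
{
  return (delay_usec + tick_usec - 1) / tick_usec + 1;
}
```

Starting 10 microseconds before a tick boundary with a 10 ms tick, a one-tick wait returns after only 10 microseconds. Waiting `safe_ticks(10000, 10000)`, which is 2 ticks, returns after 10010 microseconds: late, but never early.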

@linguini1
Contributor Author

Can this PR be merged? And we can open an issue to track this sub-optimal solution so it can be replaced when @Fix-Point has their changes merged?

@jlaitine
Contributor

Just please verify that 1) the udelay busy-loops only for approximately the correct time and 2) tick-based timeouts don't randomly time out too early. These would be fatal bugs, and were the reasons why all those previous fixes were done.

Otherwise, I don't have a religion on this; I am just tired of debugging issues caused by misbehaving basic timers...

@linguini1
Contributor Author

> Just please verify that 1) the udelay busy-loops only for approximately the correct time and 2) tick-based timeouts don't randomly time out too early. These would be fatal bugs, and were the reasons why all those previous fixes were done.
>
> Otherwise, I don't have a religion on this; I am just tired of debugging issues caused by misbehaving basic timers...

Okay, in that case I will mark this as a draft PR so merging can be held until I have some proper verification that the reverted commit won't cause timeouts to happen too early. @jlaitine would you want to see some kind of empirical test on multiple architectures, or just some empirical test on the simulator + logical verification that this 'accurate' method can never time out too early? I am not as well-versed on the systick drift issue as you would be, I'm sure. If you have any pointers to where to look for the underlying reason of the systick drift, I'm all ears. I figured the system tick wouldn't have such an issue but I guess that was naive.

@linguini1 linguini1 marked this pull request as draft October 28, 2025 18:54
@jlaitine
Contributor

jlaitine commented Oct 29, 2025

> > Just please verify that 1) the udelay busy-loops only for approximately the correct time and 2) tick-based timeouts don't randomly time out too early. These would be fatal bugs, and were the reasons why all those previous fixes were done.
> >
> > Otherwise, I don't have a religion on this; I am just tired of debugging issues caused by misbehaving basic timers...
>
> Okay, in that case I will mark this as a draft PR so merging can be held until I have some proper verification that the reverted commit won't cause timeouts to happen too early. @jlaitine would you want to see some kind of empirical test on multiple architectures, or just some empirical test on the simulator + logical verification that this 'accurate' method can never time out too early? I am not as well-versed on the systick drift issue as you would be, I'm sure. If you have any pointers to where to look for the underlying reason of the systick drift, I'm all ears. I figured the system tick wouldn't have such an issue but I guess that was naive.

I'll try to do my best to help in testing. For the previous related PRs I did some simple tests, I'll try to find and gather them and try this PR out on some risc-v and arm64 targets at least (as those are the most important for me at the time being) - and report how it behaves. It will take a day or two before I can put effort in this, as I am currently in the middle of something else. Anyhow, I'll try to be more helpful in the future.

Contributor

@xiaoxiang781216 xiaoxiang781216 left a comment

@linguini1 @jlaitine after #17345, the accuracy of oneshot time no longer depends on CONFIG_USEC_PER_TICK, so this patch can move forward.

@linguini1 linguini1 marked this pull request as ready for review November 24, 2025 23:30
@linguini1
Contributor Author

Waiting to merge for @jlaitine's feedback/blessing.

@jlaitine
Contributor

> Waiting to merge for @jlaitine's feedback/blessing.

Thanks for pinging, I will run some tests with real HW today and let you know!

@jlaitine
Contributor

This is working fine on MPFS (RISC-V 64-bit) and IMX9 (arm64) platforms, thanks for giving me time to look into this!

I finally also got time to check how the count-based oneshot driver works, and now I believe the implementation for systick is correct (#17345). The earlier timer drift issue is not a problem any more with this approach.

@xiaoxiang781216 xiaoxiang781216 merged commit 3edf6de into apache:master Nov 25, 2025
40 checks passed

Labels

Area: Drivers, Area: OS Components, Size: M

6 participants