DAOS-18552 pool: Fix a PS start-stop race #17564

Merged
daltonbohning merged 1 commit into master from liw/pool-svc-stop-wa
Mar 12, 2026

@liw liw commented Feb 17, 2026

The following race happened during a pool create operation, triggered by abnormally slow VMs:

ds_rsvc_start
  start
    pool_svc_alloc_cb
      ds_pool_lookup: OK
....VM slowness causes start timeout, which triggers stop....
                            ds_pool_stop
                              pool->sp_stopping = 1
                              ds_pool_svc_stop: none
  insert
                              wait for ds_pool references: hang

This patch is a quick fix that prevents ds_rsvc_start from inserting a PS into the hash table if the ds_pool is stopping, so that ds_pool_stop won't hang. Manual testing shows that such a pool create operation now retries and succeeds transparently.
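
The interleaving above can be replayed in a small single-threaded model. This is only an illustrative sketch, not the DAOS code: every `toy_*` name, the error code, and the `guard` switch are hypothetical, and the real implementation involves ULTs, a hash table, and reference-counted ds_pool objects. The model assumes that an allocated PS pins its pool with one reference and that stop can only finish once the reference count reaches zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a ds_pool: a reference count and a stopping flag. */
struct toy_pool {
	int  refs;
	bool stopping;
};

/* Hypothetical stand-in for a pool service (PS) that pins its pool. */
struct toy_svc {
	struct toy_pool *pool;
	bool             in_table;
};

#define TOY_ERR_STOPPING (-1) /* made-up error code for this sketch */

/* "start" step: look up the pool and take a reference on it. */
static void toy_svc_alloc(struct toy_svc *svc, struct toy_pool *pool)
{
	pool->refs++;
	svc->pool     = pool;
	svc->in_table = false;
}

/* Roll back toy_svc_alloc: drop the pool reference held by the PS. */
static void toy_svc_free(struct toy_svc *svc)
{
	assert(svc->pool != NULL && svc->pool->refs > 0);
	svc->pool->refs--;
	svc->pool     = NULL;
	svc->in_table = false;
}

/* "insert" step: with the guard on, refuse to insert if the pool is stopping. */
static int toy_svc_insert(struct toy_svc *svc, bool guard)
{
	if (guard && svc->pool->stopping)
		return TOY_ERR_STOPPING;
	svc->in_table = true;
	return 0;
}

/*
 * Replay the interleaving from the description and return the number of pool
 * references left when stop starts waiting. Nonzero means stop would hang.
 */
int toy_race(bool guard)
{
	struct toy_pool pool = {.refs = 0, .stopping = false};
	struct toy_svc  svc;

	toy_svc_alloc(&svc, &pool); /* start: pool lookup succeeds */

	pool.stopping = true;       /* stop: mark the pool as stopping */
	if (svc.in_table)           /* stop: no PS in the table yet, so nothing to stop */
		toy_svc_free(&svc);

	if (toy_svc_insert(&svc, guard) != 0) /* start resumes and tries to insert */
		toy_svc_free(&svc);               /* guarded path: roll the start back */

	return pool.refs;           /* stop: waits until this reaches zero */
}
```

In this model `toy_race(false)` leaves one reference behind, which corresponds to the hang, while `toy_race(true)` leaves none because the failed insert rolls the start back.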

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@github-actions

Ticket title is 'rebuild/container_rf.py:RbldContRfTest.test_rebuild_with_container_rf - pool create failed: DER_BUSY(-1012): Device or resource busy'
Status is 'In Progress'
Labels: 'ci_master_weekly,weekly_test'
https://daosio.atlassian.net/browse/DAOS-18552

@liw liw force-pushed the liw/pool-svc-stop-wa branch from 8d0122c to 4ccf463 Compare February 17, 2026 06:43
The following race happened during a pool create operation, triggered by
abnormally slow VMs:

  ds_rsvc_start
    start
      pool_svc_alloc_cb
        ds_pool_lookup: OK
  ....VM slowness causes start timeout, which triggers stop....
                            ds_pool_stop
                              pool->sp_stopping = 1
                              ds_pool_svc_stop: none
    insert
                              wait for ds_pool references: hang

This patch is a quick fix that prevents ds_rsvc_start from inserting a
PS to the hash table if the ds_pool is stopping, so that ds_pool_stop
won't hang. Manual testing shows that such a pool create operation will
now retry and succeed transparently.

Signed-off-by: Li Wei <liwei@hpe.com>
@liw liw force-pushed the liw/pool-svc-stop-wa branch from 4ccf463 to 0ec1e95 Compare February 17, 2026 07:40
@liw liw marked this pull request as ready for review February 18, 2026 00:16
@liw liw requested review from a team as code owners February 18, 2026 00:16
@liw liw requested review from kccain and liuxuezhao February 18, 2026 00:17
    rc = rsvc_class(class)->sc_insert(svc);
    if (rc != 0) {
        D_DEBUG(DB_MD, "%s: sc_insert: " DF_RC "\n", svc->s_name, DP_RC(rc));
        goto err_svc_started;

Unusual goto, though I see the reasoning: it avoids duplicating the stop call here, and it avoids emitting an inaccurate D_DEBUG log when it is sc_insert that failed rather than d_hash_rec_insert.

Yeah, it's a bit unusual, though I occasionally use this method. If I revise this PR again, I'll change this to just duplicate the stop call.
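
For readers following this thread, the two styles under discussion can be sketched as below. This is a hypothetical illustration of the general trade-off, not the code in this PR: the `toy_*` functions, the counters, and the label placement are all assumptions, and the actual layout of the error path is not shown in the excerpt above. Return codes are assumed to be zero on success and negative on failure.

```c
#include <assert.h>

static int toy_stop_calls; /* counts how often the shared cleanup ran */
static int toy_log_second; /* counts "second step failed" log lines */

static void toy_stop(void)
{
	toy_stop_calls++;
}

/* Style A: a goto into the second failure branch shares one stop call. */
int toy_start_goto(int first_rc, int second_rc)
{
	int rc;

	assert(first_rc <= 0 && second_rc <= 0);

	rc = first_rc;
	if (rc != 0)
		goto err_started; /* skips the second step's log line */

	rc = second_rc;
	if (rc != 0) {
		toy_log_second++; /* only accurate when the second step failed */
err_started:
		toy_stop();
		return rc;
	}
	return 0;
}

/* Style B: each failure branch carries its own copy of the stop call. */
int toy_start_dup(int first_rc, int second_rc)
{
	int rc;

	assert(first_rc <= 0 && second_rc <= 0);

	rc = first_rc;
	if (rc != 0) {
		toy_stop();
		return rc;
	}

	rc = second_rc;
	if (rc != 0) {
		toy_log_second++;
		toy_stop();
		return rc;
	}
	return 0;
}
```

Both variants behave the same from the caller's point of view. Style A keeps a single cleanup site at the cost of a jump into a nested block, while Style B reads linearly but repeats the stop call.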

@liw liw requested a review from wangshilong March 2, 2026 11:06
@liw liw requested a review from a team March 4, 2026 04:05
@daltonbohning daltonbohning removed the request for review from a team March 6, 2026 16:18
@daltonbohning

Removing gatekeeper until merge approval is granted

@liw liw requested a review from a team March 12, 2026 14:05
liw commented Mar 12, 2026

Looks like this PR is now approved for 2.8.

@daltonbohning daltonbohning merged commit 7b9f55d into master Mar 12, 2026
42 checks passed
@daltonbohning daltonbohning deleted the liw/pool-svc-stop-wa branch March 12, 2026 14:29
liw added a commit that referenced this pull request Mar 13, 2026
daltonbohning pushed a commit that referenced this pull request Mar 18, 2026