DAOS-18552 pool: Fix a PS start-stop race #17564
Ticket title is 'rebuild/container_rf.py:RbldContRfTest.test_rebuild_with_container_rf - pool create failed: DER_BUSY(-1012): Device or resource busy'
The following race happened during a pool create operation, triggered by
abnormally slow VMs:
  ds_rsvc_start
    start
      pool_svc_alloc_cb
        ds_pool_lookup: OK
  ....VM slowness causes start timeout, which triggers stop....
                                ds_pool_stop
                                  pool->sp_stopping = 1
                                  ds_pool_svc_stop: none
    insert
                                  wait for ds_pool references: hang
This patch is a quick fix that prevents ds_rsvc_start from inserting a
PS into the hash table if the ds_pool is stopping, so that ds_pool_stop
won't hang. Manual testing shows that such a pool create operation will
now retry and succeed transparently.
Signed-off-by: Li Wei <liwei@hpe.com>
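The race and the guard described above can be illustrated with a minimal, single-threaded sketch. It is not the DAOS code: every `sketch_*` name, the struct layouts, and the `stop_races_in` parameter are hypothetical stand-ins, and only `sp_stopping` and the DER_BUSY value (-1012, from the ticket title) come from this PR. The sketch assumes that the pool service holds a ds_pool reference from allocation onward, and that a failed insert undoes the start so the reference is released.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Value taken from the ticket title: DER_BUSY(-1012). */
#define SKETCH_DER_BUSY 1012

/* Hypothetical stand-in for ds_pool; only the fields the race touches. */
struct sketch_pool {
	bool sp_stopping; /* set by the stop path before it waits for references */
	int  sp_ref;      /* references currently held on the pool */
};

/* Hypothetical stand-in for a pool service (PS) holding a pool reference. */
struct sketch_svc {
	struct sketch_pool *s_pool;
	bool                s_in_table; /* true once "inserted" into the hash table */
};

/* Start step: look up the pool and take a reference (ds_pool_lookup: OK). */
static void sketch_svc_alloc(struct sketch_svc *svc, struct sketch_pool *pool)
{
	pool->sp_ref++;
	svc->s_pool = pool;
	svc->s_in_table = false;
}

/* Undo step: drop the reference taken by sketch_svc_alloc. */
static void sketch_svc_free(struct sketch_svc *svc)
{
	assert(svc->s_pool != NULL && svc->s_pool->sp_ref > 0);
	svc->s_pool->sp_ref--;
	svc->s_pool = NULL;
}

/* Insert step with the guard: refuse if the pool is already stopping. */
static int sketch_svc_insert(struct sketch_svc *svc)
{
	if (svc->s_pool->sp_stopping)
		return -SKETCH_DER_BUSY;
	svc->s_in_table = true;
	return 0;
}

/*
 * Start: alloc, then insert. stop_races_in simulates the stop path setting
 * sp_stopping between the two steps, as in the trace above.
 */
static int sketch_svc_start(struct sketch_svc *svc, struct sketch_pool *pool,
			    bool stop_races_in)
{
	int rc;

	sketch_svc_alloc(svc, pool);
	if (stop_races_in)
		pool->sp_stopping = true;
	rc = sketch_svc_insert(svc);
	if (rc != 0)
		sketch_svc_free(svc); /* release the reference so stop can finish */
	return rc;
}
```

With `stop_races_in` set, `sketch_svc_start` returns -1012 and leaves `sp_ref` at 0, so a stop path waiting for references would not block. Without the guard, the service would be inserted after the stop path had already looked for it, and its reference would never be dropped.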
    rc = rsvc_class(class)->sc_insert(svc);
    if (rc != 0) {
        D_DEBUG(DB_MD, "%s: sc_insert: " DF_RC "\n", svc->s_name, DP_RC(rc));
        goto err_svc_started;
Unusual goto, though I see the reasoning: it avoids duplicating the stop call here, and it avoids emitting an inaccurate D_DEBUG log when it is sc_insert that failed rather than d_hash_rec_insert.
Yeah, it's a bit unusual, though I occasionally use this method. If I revise this PR again, I'll change this to just duplicate the stop call.
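For readers unfamiliar with the pattern under discussion, here is a rough sketch of a goto that jumps into an existing error branch so that one stop call serves two failure paths. The label placement is an assumption inferred from the review comment, not taken from the patch; all `sketch_*` names and the counters are hypothetical, and `fprintf` stands in for D_DEBUG.

```c
#include <stdio.h>

static int sketch_stop_calls; /* how many times the shared stop call ran */
static int sketch_hash_logs;  /* how many times the hash-insert failure was logged */

/* Hypothetical stand-in for the call that stops a started service. */
static void sketch_stop(void)
{
	sketch_stop_calls++;
}

/*
 * insert_rc simulates the result of sc_insert; hash_rc simulates the result
 * of d_hash_rec_insert. Both are passed in so the sketch stays deterministic.
 */
static int sketch_start(int insert_rc, int hash_rc)
{
	int rc;

	rc = insert_rc;
	if (rc != 0) {
		fprintf(stderr, "sc_insert: %d\n", rc);
		/* Jump past the hash-insert log straight to the shared stop call. */
		goto err_svc_started;
	}

	rc = hash_rc;
	if (rc != 0) {
		fprintf(stderr, "d_hash_rec_insert: %d\n", rc);
		sketch_hash_logs++;
err_svc_started:
		sketch_stop();
		return rc;
	}

	return 0;
}
```

The alternative mentioned in the reply would call `sketch_stop()` directly in the first branch and return, at the cost of repeating one line.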
Removing gatekeeper until merge approval is granted
Looks like this PR is now approved for 2.8.