Walkthrough

Increased RPC timeout values across multiple remote calls from 5000 ms to 60000 ms in the data store service client. Updated retry logic to use a configurable retry limit and corrected an error message in the flush-data closure handler. Adjusted the logging level for PersistKV success paths.
Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

The changes involve multiple timeout adjustments following a consistent pattern (reducing review complexity), but they require careful verification of the controller-handling semantics and of the dynamic retry-limit logic introduced in the closure handler, which adds interpretive burden.
Pre-merge checks

❌ Failed checks (1 warning) · ✅ Passed checks (2 passed)
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
data_store_service_client_closure.h (1)
1680-1736: BatchWriteRecordsClosure does not retry on NOT_OWNER; add retry like the other closures.

Currently need_retry is only set for transport failures. When the server returns REQUESTED_NODE_NOT_OWNER (on both the local and the RPC-success paths), we log and handle the error but never retry. The other closures (Read/Flush/Delete/Drop/Scan) do retry on NOT_OWNER, so this can drop writes during leadership changes.
Proposed minimal fix:
```diff
@@
-        else
-        {
-            // TODO(lzx): handle error.
-            result_ = response_.result();
-        }
+        else
+        {
+            result_ = response_.result();
+            auto err_code = static_cast<::EloqDS::remote::DataStoreError>(
+                result_.error_code());
+            if (err_code ==
+                ::EloqDS::remote::DataStoreError::REQUESTED_NODE_NOT_OWNER)
+            {
+                ds_service_client_->HandleShardingError(result_);
+                need_retry = true;
+            }
+        }
@@
-        {
-            auto err_code = static_cast<::EloqDS::remote::DataStoreError>(
-                result_.error_code());
-
-            if (err_code ==
-                ::EloqDS::remote::DataStoreError::REQUESTED_NODE_NOT_OWNER)
-            {
-                ds_service_client_->HandleShardingError(result_);
-                // TODO(lzx): retry.
-            }
-        }
+        {
+            auto err_code = static_cast<::EloqDS::remote::DataStoreError>(
+                result_.error_code());
+            if (err_code ==
+                ::EloqDS::remote::DataStoreError::REQUESTED_NODE_NOT_OWNER)
+            {
+                ds_service_client_->HandleShardingError(result_);
+                need_retry = true;
+            }
+        }
```

Optionally clear state before retry for symmetry with the other closures:
```diff
-    self_guard.Release();
+    self_guard.Release();
+    response_.Clear();
+    cntl_.Reset();
```
🧹 Nitpick comments (2)
data_store_service_client_closure.h (1)
1750-1756: Unify the timeout source: remove the hardcoded 5000 ms here.

PrepareRequest still sets cntl_.set_timeout_ms(5000) while callers set 60000 ms before dispatch. This dual setting is confusing and easy to regress. Let callers own the timeout (consistent with the other closures), or set it to 60000 here.
Apply one of the following diffs (preferred: remove local setting):
```diff
-    cntl_.Reset();
-    cntl_.set_timeout_ms(5000);
+    cntl_.Reset();
+    // Timeout is set by the caller in DataStoreServiceClient::BatchWriteRecordsInternal.
```

Or, if you want the closure to own it:
```diff
-    cntl_.Reset();
-    cntl_.set_timeout_ms(5000);
+    cntl_.Reset();
+    cntl_.set_timeout_ms(60000);
```

data_store_service_client.cpp (1)
3216-3216: RPC timeouts aligned to 60000 ms, consistent with the PR goal.

Read/DeleteRange/FlushData/DropTable/ScanNext/ScanClose/BatchWriteRecords now use 60 s timeouts. Looks good.
One follow-up: BatchWriteRecordsClosure::PrepareRequest still sets 5000 ms; remove it or align it to 60000 ms to avoid conflicting settings (see the header comment).
Also applies to: 3270-3270, 3318-3318, 3370-3370, 3438-3438, 3491-3491, 3810-3811
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
data_store_service_client.cpp (8 hunks)
data_store_service_client_closure.h (2 hunks)
🔇 Additional comments (2)
data_store_service_client_closure.h (1)
981-983: Corrected error message, good catch.

The log now correctly references FlushData. No further action needed.
data_store_service_client.cpp (1)
406-406: Raised the success log to INFO.

Acceptable; the visibility increase seems intentional.