Walkthrough

Increased RPC timeout values across multiple remote calls from 5000 ms to 60000 ms in the data store service client. Updated retry logic to use a configurable retry limit and corrected an error message in the flush-data closure handler. Adjusted the logging level for PersistKV success paths.
Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

The changes involve multiple timeout adjustments following a consistent pattern (reducing review complexity), but they require careful verification of the controller-handling semantics and of the dynamic retry-limit logic introduced in the closure handler, which adds interpretive burden.
Pre-merge checks

❌ Failed checks (1 warning) · ✅ Passed checks (2 passed)
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
data_store_service_client_closure.h (1)
1680-1736: BatchWriteRecordsClosure does not retry on NOT_OWNER; add retry like the other closures.

Currently need_retry is only set for transport failures. When the server returns REQUESTED_NODE_NOT_OWNER (on both the local and the RPC-success paths), we log and handle the error but never retry. The other closures (Read/Flush/Delete/Drop/Scan) do retry on NOT_OWNER, so this can drop writes during leadership changes.
Proposed minimal fix:
```diff
@@
-        else
-        {
-            // TODO(lzx): handle error.
-            result_ = response_.result();
-        }
+        else
+        {
+            result_ = response_.result();
+            auto err_code = static_cast<::EloqDS::remote::DataStoreError>(
+                result_.error_code());
+            if (err_code ==
+                ::EloqDS::remote::DataStoreError::REQUESTED_NODE_NOT_OWNER)
+            {
+                ds_service_client_->HandleShardingError(result_);
+                need_retry = true;
+            }
+        }
@@
-        {
-            auto err_code = static_cast<::EloqDS::remote::DataStoreError>(
-                result_.error_code());
-
-            if (err_code ==
-                ::EloqDS::remote::DataStoreError::REQUESTED_NODE_NOT_OWNER)
-            {
-                ds_service_client_->HandleShardingError(result_);
-                // TODO(lzx): retry.
-            }
-        }
+        {
+            auto err_code = static_cast<::EloqDS::remote::DataStoreError>(
+                result_.error_code());
+            if (err_code ==
+                ::EloqDS::remote::DataStoreError::REQUESTED_NODE_NOT_OWNER)
+            {
+                ds_service_client_->HandleShardingError(result_);
+                need_retry = true;
+            }
+        }
```

Optionally clear state before retry for symmetry with the other closures:
```diff
-    self_guard.Release();
+    self_guard.Release();
+    response_.Clear();
+    cntl_.Reset();
```
🧹 Nitpick comments (2)
data_store_service_client_closure.h (1)
1750-1756: Unify the timeout source: remove the hardcoded 5000 ms here.

PrepareRequest still sets cntl_.set_timeout_ms(5000) while callers set 60000 ms before dispatch. This dual setting is confusing and easy to regress. Let callers own the timeout (consistent with the other closures), or set it to 60000 here.
Apply one of the following diffs (preferred: remove local setting):
```diff
-    cntl_.Reset();
-    cntl_.set_timeout_ms(5000);
+    cntl_.Reset();
+    // Timeout is set by the caller in DataStoreServiceClient::BatchWriteRecordsInternal.
```

Or, if you want the closure to own it:
```diff
-    cntl_.Reset();
-    cntl_.set_timeout_ms(5000);
+    cntl_.Reset();
+    cntl_.set_timeout_ms(60000);
```

data_store_service_client.cpp (1)
3216-3216: RPC timeouts aligned to 60000 ms, consistent with the PR goal.

Read/DeleteRange/FlushData/DropTable/ScanNext/ScanClose/BatchWriteRecords now use 60 s timeouts. Looks good.
One follow-up: BatchWriteRecordsClosure::PrepareRequest still sets 5000 ms; remove it or align it to 60000 ms to avoid conflicting settings (see the header comment).
Also applies to: 3270-3270, 3318-3318, 3370-3370, 3438-3438, 3491-3491, 3810-3811
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
data_store_service_client.cpp (8 hunks)
data_store_service_client_closure.h (2 hunks)
🔇 Additional comments (2)
data_store_service_client_closure.h (1)
981-983: Corrected error message, good catch.

The log now correctly references FlushData. No further action needed.
data_store_service_client.cpp (1)
406-406: Raised the success log to INFO.

Acceptable; the visibility increase seems intentional.