Set timeout ts of rpc request to 1 minute#117

Merged
lzxddz merged 3 commits into main from update_timeout_ts
Oct 27, 2025

Conversation

Collaborator

@lzxddz lzxddz commented Oct 24, 2025

Summary by CodeRabbit

  • Bug Fixes

    • Corrected error message in data flush operation failure reporting for better diagnostics.
  • Improvements

    • Extended operation timeout thresholds to improve reliability for longer-running remote operations.
    • Enhanced retry mechanism with configurable retry limits for better operational flexibility.

Contributor

coderabbitai bot commented Oct 24, 2025

Walkthrough

Increased RPC timeout values across multiple remote calls from 5000 ms to 60000 ms in the data store service client. Updated retry logic to use a configurable retry limit and corrected an error message in the flush data closure handler. Adjusted logging level for PersistKV success paths.

Changes

  • RPC Timeout and Logging Updates — data_store_service_client.cpp
    Increased timeout values from 5000 ms to 60000 ms for ReadInternal, DeleteRangeInternal, FlushDataInternal, DropTableInternal, ScanNextInternal, and ScanCloseInternal. Changed PersistKV success logging from DLOG(INFO) to LOG(INFO). Added a local brpc::Controller with a 60000 ms timeout in BatchWriteRecordsInternal.
  • Closure Handler Updates — data_store_service_client_closure.h
    Corrected the error message in FlushDataClosure::Run from "Failed for DeleteRange RPC request" to "Failed for FlushData RPC request". Broadened the retry condition in BatchWriteRecordsClosure::Run from retry_count_ < 2 to retry_count_ < ds_service_client_->retry_limit_.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

The changes involve multiple timeout adjustments following a consistent pattern (reducing review complexity), but require careful verification of controller handling semantics and the dynamic retry limit logic introduced in the closure handler, which adds interpretive burden.

Possibly related PRs

Poem

🐰 A rabbit hops through the timeout gate,
From five to sixty—no more to wait!
Retries now flexible, errors now clear,
The data flows faster, without a fear! 🚀

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 10.00%, below the required 80.00% threshold. Resolution: run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed — Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed — The title "Set timeout ts of rpc request to 1 minute" directly reflects the central change: raising RPC timeout values from 5000 ms to 60000 ms (1 minute) across multiple remote calls in data_store_service_client.cpp. Secondary modifications (logging adjustments, retry logic changes, an error message correction) are also present, but the title appropriately emphasizes the most significant change. The abbreviation "ts" is slightly informal, yet the message remains clear.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
data_store_service_client_closure.h (1)

1680-1736: BatchWriteRecordsClosure does not retry on NOT_OWNER; add retry like other closures.

Currently need_retry is only set for transport failures. When the server returns REQUESTED_NODE_NOT_OWNER (both local and RPC-success paths), we log/handle but never retry. Other closures (Read/Flush/Delete/Drop/Scan) do retry on NOT_OWNER. This can drop writes during leadership changes.

Proposed minimal fix:

```diff
@@
-            else
-            {
-                // TODO(lzx): handle error.
-                result_ = response_.result();
-            }
+            else
+            {
+                result_ = response_.result();
+                auto err_code = static_cast<::EloqDS::remote::DataStoreError>(result_.error_code());
+                if (err_code == ::EloqDS::remote::DataStoreError::REQUESTED_NODE_NOT_OWNER) {
+                    ds_service_client_->HandleShardingError(result_);
+                    need_retry = true;
+                }
+            }
@@
-        {
-            auto err_code = static_cast<::EloqDS::remote::DataStoreError>(
-                result_.error_code());
-
-            if (err_code ==
-                ::EloqDS::remote::DataStoreError::REQUESTED_NODE_NOT_OWNER)
-            {
-                ds_service_client_->HandleShardingError(result_);
-                // TODO(lzx): retry.
-            }
-        }
+        {
+            auto err_code = static_cast<::EloqDS::remote::DataStoreError>(result_.error_code());
+            if (err_code == ::EloqDS::remote::DataStoreError::REQUESTED_NODE_NOT_OWNER) {
+                ds_service_client_->HandleShardingError(result_);
+                need_retry = true;
+            }
+        }
```

Optionally clear state before retry for symmetry with other closures:

```diff
-            self_guard.Release();
+            self_guard.Release();
+            response_.Clear();
+            cntl_.Reset();
```

🧹 Nitpick comments (2)
data_store_service_client_closure.h (1)

1750-1756: Unify timeout source — remove hardcoded 5000 ms here.

PrepareRequest still sets cntl_.set_timeout_ms(5000) while callers set 60000 ms before dispatch. This dual-setting is confusing and easy to regress. Let callers own the timeout (consistent with other closures) or set it to 60000 here.

Apply one of the following diffs (preferred: remove local setting):

```diff
-        cntl_.Reset();
-        cntl_.set_timeout_ms(5000);
+        cntl_.Reset();
+        // Timeout is set by the caller in DataStoreServiceClient::BatchWriteRecordsInternal.
```

Or, if you want the closure to own it:

```diff
-        cntl_.Reset();
-        cntl_.set_timeout_ms(5000);
+        cntl_.Reset();
+        cntl_.set_timeout_ms(60000);
```

data_store_service_client.cpp (1)

3216-3216: RPC timeouts aligned to 60000 ms — consistent with PR goal.

Read/DeleteRange/FlushData/DropTable/ScanNext/ScanClose/BatchWriteRecords now use 60s timeouts. Looks good.

One follow-up: BatchWriteRecordsClosure::PrepareRequest still sets 5000 ms; remove or align to avoid conflicting settings (see header comment).

Also applies to: 3270-3270, 3318-3318, 3370-3370, 3438-3438, 3491-3491, 3810-3811

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5d17e25 and 3e0ae87.

📒 Files selected for processing (2)
  • data_store_service_client.cpp (8 hunks)
  • data_store_service_client_closure.h (2 hunks)
🔇 Additional comments (2)
data_store_service_client_closure.h (1)

981-983: Corrected error message — good catch.

The log now correctly references FlushData. No further action.

data_store_service_client.cpp (1)

406-406: Raised success log to INFO.

Acceptable; visibility increase seems intentional.

@lzxddz lzxddz self-assigned this Oct 24, 2025
@lzxddz lzxddz merged commit 804a34d into main Oct 27, 2025
1 check passed
@lzxddz lzxddz deleted the update_timeout_ts branch October 27, 2025 03:16
