Skip to content

feat(sdk): support per-request timeout to TCP/QUIC/WebSocket clients#3429

Open
chengxilo wants to merge 2 commits into
apache:masterfrom
chengxilo:client-timeout
Open

feat(sdk): support per-request timeout to TCP/QUIC/WebSocket clients#3429
chengxilo wants to merge 2 commits into
apache:masterfrom
chengxilo:client-timeout

Conversation

@chengxilo

Copy link
Copy Markdown
Contributor

Which issue does this PR address?

Closes #3419

Rationale

Clients had no deadline on individual request, so a stalled server could block a caller forever.

What changed?

send_raw would be blocked forever if server doesn't response. I added a request_timeout field (IggyDuration, default 300s) to TcpClientConfig, QuicClientConfig, and WebSocketClientConfig. Each client's send_raw now wraps its I/O in tokio::time::timeout

Local Execution

  • Passed
  • Pre-commit hooks ran

AI Usage

  1. Claude Opus 4.6
  2. implementation
  3. manually reviewed
  4. Yes

@github-actions

github-actions Bot commented Jun 6, 2026

Copy link
Copy Markdown

Thanks for the PR. It is labeled S-waiting-on-review and queued for review.

Slash commands (own line, regular comment) move it around the queue:

  • /ready - back to S-waiting-on-review after addressing feedback
  • /author - flip to S-waiting-on-author while you finish changes
  • /request-review @user-or-team - request a reviewer

See CONTRIBUTING.md for details.

@github-actions github-actions Bot added the S-waiting-on-review PR is waiting on a reviewer label Jun 6, 2026
@codecov

codecov Bot commented Jun 6, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 74.03846% with 54 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.23%. Comparing base (d254604) to head (07fba5f).
⚠️ Report is 4 commits behind head on master.

Files with missing lines Patch % Lines
core/sdk/src/quic/quic_client.rs 58.33% 11 Missing and 4 partials ⚠️
core/sdk/src/clients/client_builder.rs 0.00% 12 Missing ⚠️
core/sdk/src/websocket/websocket_client.rs 78.57% 6 Missing and 6 partials ⚠️
core/sdk/src/tcp/tcp_client.rs 80.00% 5 Missing and 6 partials ⚠️
core/common/src/types/args/mod.rs 25.00% 1 Missing and 2 partials ⚠️
core/sdk/src/client_provider.rs 66.66% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #3429      +/-   ##
============================================
- Coverage     74.43%   72.23%   -2.20%     
  Complexity      943      943              
============================================
  Files          1245     1245              
  Lines        121293   116658    -4635     
  Branches      97599    92992    -4607     
============================================
- Hits          90285    84269    -6016     
- Misses        28054    29208    +1154     
- Partials       2954     3181     +227     
Components Coverage Δ
Rust Core 72.88% <74.03%> (-2.69%) ⬇️
Java SDK 58.44% <ø> (ø)
C# SDK 69.28% <ø> (-0.64%) ⬇️
Python SDK 81.06% <ø> (ø)
PHP SDK 83.57% <ø> (ø)
Node SDK 91.22% <ø> (-0.13%) ⬇️
Go SDK 40.20% <ø> (ø)
Files with missing lines Coverage Δ
core/common/src/error/iggy_error.rs 100.00% <ø> (ø)
...es/configuration/quic_config/quic_client_config.rs 100.00% <100.00%> (ø)
...guration/quic_config/quic_client_config_builder.rs 36.60% <100.00%> (+7.60%) ⬆️
...ypes/configuration/tcp_config/tcp_client_config.rs 100.00% <100.00%> (ø)
...figuration/tcp_config/tcp_client_config_builder.rs 71.55% <100.00%> (+3.51%) ⬆️
...ration/websocket_config/websocket_client_config.rs 62.35% <100.00%> (+0.90%) ⬆️
...ebsocket_config/websocket_client_config_builder.rs 37.96% <100.00%> (+7.75%) ⬆️
core/sdk/src/client_provider.rs 48.38% <66.66%> (+0.36%) ⬆️
core/common/src/types/args/mod.rs 45.51% <25.00%> (-0.54%) ⬇️
core/sdk/src/tcp/tcp_client.rs 75.08% <80.00%> (+0.93%) ⬆️
... and 3 more

... and 117 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

server_name: "localhost".to_string(),
auto_login: AutoLogin::Disabled,
heartbeat_interval: IggyDuration::from_str("5s").unwrap(),
request_timeout: IggyDuration::from_str("300s").unwrap(),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

300s is IMHO overkill. make it 10s or so.

Err(IggyError::NotConnected)
};

tokio::time::timeout(request_timeout.get_duration(), io)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on timeout the io future is dropped while it still holds the stream lock, mid-write or after-write-before-read. tcp and websocket use one persistent stream with no request id, so responses are matched purely by arrival order. RequestTimeout isn't in the send_raw_with_response reconnect set, so the stream is kept and reused: the next send_raw reads the previous request's late response, or if the timeout hit mid-write, a partial frame is left on the wire. result is a permanent response/frame desync until the process restarts. quic avoids this because it opens a fresh open_bi stream per request.

adding RequestTimeout to the reconnect set alone isn't enough - the !reconnection.enabled and (under vsr) auto_login == Disabled early returns fire before disconnect(), which is the only thing that resets the stream. simplest robust fix: on timeout, reset the stream (*stream.lock().await = None) inside send_raw before returning, independent of the reconnect path.

}
};

tokio::time::timeout(self.config.request_timeout.get_duration(), io)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same desync as the tcp path - the timeout drops io while it holds the stream lock, no request id on the wire, RequestTimeout not in the reconnect set, so the stream is reused and every later response is shifted by one. worse here because this io runs inline with no tokio::spawn, so it's also exposed to caller cancellation, not just the internal timeout. needs the same fix: reset the stream on timeout before returning.

let mut consensus_session = consensus_session
.lock()
.expect("consensus session mutex poisoned");
crate::vsr::encode_request_header(&mut consensus_session, code, &payload)?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

under the vsr feature this advances the shared session request counter before the write happens. if the timeout fires after the id is taken but before the request lands, the counter has moved but the server never saw the request - the next replicated request uses id n+1, the primary sees a gap and silently drops it (and everything after), bricking replicated writes on that session. the counter lives in the shared ConsensusSession, not the stream, so quic is not immune to this one even though it dodges the frame desync. same teardown fixes it: reset the vsr session on timeout (the disconnect path's reset_vsr_session) inside send_raw, since the early returns skip it.

Err(IggyError::NotConnected)
};

tokio::time::timeout(request_timeout.get_duration(), io)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IggyDuration::from_str maps "0", "unlimited", "disabled" and "none" to Duration::ZERO, and tokio::time::timeout(ZERO, io) elapses on the first pending poll - so setting request_timeout to any of those makes every request fail instantly instead of disabling the timeout. same at the tcp and websocket sites. guard with is_zero() (already on IggyDuration) and skip the timeout wrap when it's zero.

tls_validate_certificate: true,
reconnection: connection_string.options().reconnection().to_owned(),
heartbeat_interval: connection_string.options().heartbeat_interval(),
request_timeout: IggyDuration::from_str("300s").unwrap(),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this hardcodes 300s instead of reading from the connection string like the field right above it (heartbeat_interval uses connection_string.options()). same in the quic and websocket From<ConnectionString> impls. so request_timeout is settable via the builder and cli but silently pinned to 300s for anyone configuring through a connection string. either parse it from the options or document that it's not supported there.

let request_timeout = self.config.request_timeout;
#[cfg(feature = "vsr")]
let consensus_session = self.consensus_session.clone();
// SAFETY: we run code holding the `stream` lock in a task so we can't be cancelled while holding the lock.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment is no longer accurate - the new tokio::time::timeout below drops the lock-holding io from inside the task, which is exactly the cancellation it says can't happen. spawn only shields against caller cancellation. worth correcting when the timeout teardown lands. same at the quic site.

keep_alive_interval: args.quic_keep_alive_interval,
max_idle_timeout: args.quic_max_idle_timeout,
validate_certificate: args.quic_validate_certificate,
request_timeout: IggyDuration::from_str(&args.request_timeout).unwrap(),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from_str(&args.request_timeout).unwrap() panics on a bad --request-timeout value since it's a free-form string with no validation. also at lines 152 and 186. matches the existing pattern for the other duration args here so it's not new, but it adds another panic point on user input - a clap value_parser on the arg would reject bad values cleanly instead.

tls_ca_file: None,
tls_validate_certificate: true,
heartbeat_interval: IggyDuration::from_str("5s").unwrap(),
request_timeout: IggyDuration::from_str("300s").unwrap(),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

heads up this is a behavior change - before this, requests had no deadline, now they default to 300s. anything that legitimately takes longer than 300s on a connection now returns RequestTimeout (and trips the desync above). worth a changelog / release-note callout so users who relied on unbounded requests aren't surprised.

/// Interval of heartbeats sent by the client
pub heartbeat_interval: IggyDuration,
/// Per-request timeout for send/receive operations.
pub request_timeout: IggyDuration,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adding a pub field here is a breaking change for anyone constructing this struct with a literal without ..Default::default(), since it isn't #[non_exhaustive]. fine for now while the crate is pre-1.0, but if you plan to keep adding config fields it's worth marking the three client config structs #[non_exhaustive] once and steering people to the builders.

@github-actions github-actions Bot added S-waiting-on-author PR is waiting on author response and removed S-waiting-on-review PR is waiting on a reviewer labels Jun 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

S-waiting-on-author PR is waiting on author response

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support per-request timeout to TCP/QUIC/WebSocket clients

2 participants