feat(sdk): support per-request timeout to TCP/QUIC/WebSocket clients#3429
feat(sdk): support per-request timeout to TCP/QUIC/WebSocket clients#3429chengxilo wants to merge 2 commits into
Conversation
|
Thanks for the PR. It is labeled Slash commands (own line, regular comment) move it around the queue:
See CONTRIBUTING.md for details. |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #3429 +/- ##
============================================
- Coverage 74.43% 72.23% -2.20%
Complexity 943 943
============================================
Files 1245 1245
Lines 121293 116658 -4635
Branches 97599 92992 -4607
============================================
- Hits 90285 84269 -6016
- Misses 28054 29208 +1154
- Partials 2954 3181 +227
🚀 New features to boost your workflow:
|
| server_name: "localhost".to_string(), | ||
| auto_login: AutoLogin::Disabled, | ||
| heartbeat_interval: IggyDuration::from_str("5s").unwrap(), | ||
| request_timeout: IggyDuration::from_str("300s").unwrap(), |
There was a problem hiding this comment.
300s is IMHO overkill. make it 10s or so.
| Err(IggyError::NotConnected) | ||
| }; | ||
|
|
||
| tokio::time::timeout(request_timeout.get_duration(), io) |
There was a problem hiding this comment.
on timeout the io future is dropped while it still holds the stream lock, mid-write or after-write-before-read. tcp and websocket use one persistent stream with no request id, so responses are matched purely by arrival order. RequestTimeout isn't in the send_raw_with_response reconnect set, so the stream is kept and reused: the next send_raw reads the previous request's late response, or if the timeout hit mid-write, a partial frame is left on the wire. result is a permanent response/frame desync until the process restarts. quic avoids this because it opens a fresh open_bi stream per request.
adding RequestTimeout to the reconnect set alone isn't enough - the !reconnection.enabled and (under vsr) auto_login == Disabled early returns fire before disconnect(), which is the only thing that resets the stream. simplest robust fix: on timeout, reset the stream (*stream.lock().await = None) inside send_raw before returning, independent of the reconnect path.
| } | ||
| }; | ||
|
|
||
| tokio::time::timeout(self.config.request_timeout.get_duration(), io) |
There was a problem hiding this comment.
same desync as the tcp path - the timeout drops io while it holds the stream lock, no request id on the wire, RequestTimeout not in the reconnect set, so the stream is reused and every later response is shifted by one. worse here because this io runs inline with no tokio::spawn, so it's also exposed to caller cancellation, not just the internal timeout. needs the same fix: reset the stream on timeout before returning.
| let mut consensus_session = consensus_session | ||
| .lock() | ||
| .expect("consensus session mutex poisoned"); | ||
| crate::vsr::encode_request_header(&mut consensus_session, code, &payload)? |
There was a problem hiding this comment.
under the vsr feature this advances the shared session request counter before the write happens. if the timeout fires after the id is taken but before the request lands, the counter has moved but the server never saw the request - the next replicated request uses id n+1, the primary sees a gap and silently drops it (and everything after), bricking replicated writes on that session. the counter lives in the shared ConsensusSession, not the stream, so quic is not immune to this one even though it dodges the frame desync. same teardown fixes it: reset the vsr session on timeout (the disconnect path's reset_vsr_session) inside send_raw, since the early returns skip it.
| Err(IggyError::NotConnected) | ||
| }; | ||
|
|
||
| tokio::time::timeout(request_timeout.get_duration(), io) |
There was a problem hiding this comment.
IggyDuration::from_str maps "0", "unlimited", "disabled" and "none" to Duration::ZERO, and tokio::time::timeout(ZERO, io) elapses on the first pending poll - so setting request_timeout to any of those makes every request fail instantly instead of disabling the timeout. same at the tcp and websocket sites. guard with is_zero() (already on IggyDuration) and skip the timeout wrap when it's zero.
| tls_validate_certificate: true, | ||
| reconnection: connection_string.options().reconnection().to_owned(), | ||
| heartbeat_interval: connection_string.options().heartbeat_interval(), | ||
| request_timeout: IggyDuration::from_str("300s").unwrap(), |
There was a problem hiding this comment.
this hardcodes 300s instead of reading from the connection string like the field right above it (heartbeat_interval uses connection_string.options()). same in the quic and websocket From<ConnectionString> impls. so request_timeout is settable via the builder and cli but silently pinned to 300s for anyone configuring through a connection string. either parse it from the options or document that it's not supported there.
| let request_timeout = self.config.request_timeout; | ||
| #[cfg(feature = "vsr")] | ||
| let consensus_session = self.consensus_session.clone(); | ||
| // SAFETY: we run code holding the `stream` lock in a task so we can't be cancelled while holding the lock. |
There was a problem hiding this comment.
this comment is no longer accurate - the new tokio::time::timeout below drops the lock-holding io from inside the task, which is exactly the cancellation it says can't happen. spawn only shields against caller cancellation. worth correcting when the timeout teardown lands. same at the quic site.
| keep_alive_interval: args.quic_keep_alive_interval, | ||
| max_idle_timeout: args.quic_max_idle_timeout, | ||
| validate_certificate: args.quic_validate_certificate, | ||
| request_timeout: IggyDuration::from_str(&args.request_timeout).unwrap(), |
There was a problem hiding this comment.
from_str(&args.request_timeout).unwrap() panics on a bad --request-timeout value since it's a free-form string with no validation. also at lines 152 and 186. matches the existing pattern for the other duration args here so it's not new, but it adds another panic point on user input - a clap value_parser on the arg would reject bad values cleanly instead.
| tls_ca_file: None, | ||
| tls_validate_certificate: true, | ||
| heartbeat_interval: IggyDuration::from_str("5s").unwrap(), | ||
| request_timeout: IggyDuration::from_str("300s").unwrap(), |
There was a problem hiding this comment.
heads up this is a behavior change - before this, requests had no deadline, now they default to 300s. anything that legitimately takes longer than 300s on a connection now returns RequestTimeout (and trips the desync above). worth a changelog / release-note callout so users who relied on unbounded requests aren't surprised.
| /// Interval of heartbeats sent by the client | ||
| pub heartbeat_interval: IggyDuration, | ||
| /// Per-request timeout for send/receive operations. | ||
| pub request_timeout: IggyDuration, |
There was a problem hiding this comment.
adding a pub field here is a breaking change for anyone constructing this struct with a literal without ..Default::default(), since it isn't #[non_exhaustive]. fine for now while the crate is pre-1.0, but if you plan to keep adding config fields it's worth marking the three client config structs #[non_exhaustive] once and steering people to the builders.
Which issue does this PR address?
Closes #3419
Rationale
Clients had no deadline on individual request, so a stalled server could block a caller forever.
What changed?
send_rawwould be blocked forever if server doesn't response. I added a request_timeout field (IggyDuration, default 300s) to TcpClientConfig, QuicClientConfig, and WebSocketClientConfig. Each client'ssend_rawnow wraps its I/O intokio::time::timeoutLocal Execution
AI Usage