Skip to content

redis: handle EXEC transient errors#973

Merged
bootjp merged 5 commits into
mainfrom
fix/jepsen-redis
Jun 18, 2026
Merged

redis: handle EXEC transient errors#973
bootjp merged 5 commits into
mainfrom
fix/jepsen-redis

Conversation

@bootjp

@bootjp bootjp commented Jun 18, 2026

Copy link
Copy Markdown
Owner

Summary

  • promote transient leader failures from proxied Redis EXEC to a top-level NOTLEADER error
  • keep ordinary per-command EXEC errors inside the EXEC result array, matching Redis protocol behavior
  • teach the Jepsen Redis workload to record EXEC array errors as :info instead of parsing them as read values
  • make DynamoDB single-item one-phase dedup default-on, with ELASTICKV_DYNAMODB_ONEPHASE_DEDUP=0 / WithDynamoOnePhaseTxnDedup(false) as rollback switches
  • keep scheduled Jepsen coverage split between pinned dedup-on and explicit dedup-off control baselines

Root Cause

The scheduled Redis Jepsen run hit leader churn during MULTI/EXEC. A transient leadership error could surface as an EXEC array element, and the Jepsen client then passed that exception object into parse-read, crashing the worker instead of recording an operation outcome.

DynamoDB one-phase dedup was still opt-in even though the probe-aware FSM reader has already shipped everywhere. That left the safe retry path behind an unnecessary adapter option.

Validation

  • lein test
  • go test ./adapter -run 'Test(IsTransientLeaderRedisError|WriteRedisError|HandleProxyTxnError|HandleProxyTxnCommandError)$'
  • go test ./adapter -run 'TestRedis_MultiExec|TestRedisTxnMULTIEXECRetriesOnCoordinatorConflict|TestExecDedup'
  • go test ./adapter -run 'TestItemWriteDedup'
  • go test ./adapter -run 'Test(Dynamo|ItemWriteDedup)'
  • git diff --check

Note: go test ./adapter was started as a broader smoke test but produced no output for over 5 minutes, so it was interrupted; the targeted DynamoDB/default-on coverage above completed successfully.

Summary by CodeRabbit

  • New Features

    • DynamoDB single-item write deduplication now enabled by default, reducing unnecessary transaction retries
    • Redis adapter enhanced to properly report and handle leadership-related errors to clients
  • Tests

    • Added test coverage for deduplication default behavior and configuration flexibility
    • Added tests validating Redis error handling for leadership failures

@coderabbitai

coderabbitai Bot commented Jun 18, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@bootjp, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 52 minutes and 4 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c129bcfe-7ba9-4ba1-be5a-a0937d3fdd6d

📥 Commits

Reviewing files that changed from the base of the PR and between 10157b2 and eef57cc.

📒 Files selected for processing (1)
  • adapter/dynamodb_onephase_dedup_test.go
📝 Walkthrough

Walkthrough

Flips DynamoDB one-phase transaction deduplication from opt-in (== "1") to default-on (!= "0"), adds env-opt-out and option-override tests, and updates CI workflows and design docs. Separately, fixes Redis NOTLEADER double-prefixing, promotes per-command pipeline leader errors to top-level Redis replies, and refactors Jepsen workload error handling with new helper functions and tests.

Changes

DynamoDB One-Phase Dedup Default-On

Layer / File(s) Summary
Default-on behavior change in NewDynamoDBServer
adapter/dynamodb.go, adapter/dynamodb_item_write.go
onePhaseTxnDedup initialization changed from == "1" to != "0"; doc comments updated for default-on and rollback switch semantics in both files.
Tests for default-on, env opt-out, option-override
adapter/dynamodb_onephase_dedup_test.go
newDedupItemWriteServer refactored to use direct option; TestItemWriteDedup_DefaultOn, TestItemWriteDedup_EnvOptOut, TestItemWriteDedup_OptionOverridesEnv added.
CI workflow env-var control and design doc
.github/workflows/jepsen-test-scheduled-dedup.yml, .github/workflows/jepsen-test-scheduled.yml, docs/design/2026_06_03_partial_dynamodb_onephase_dedup.md
Dedup-scheduled workflow comments updated to pinned-on semantics; general scheduled workflow adds ELASTICKV_DYNAMODB_ONEPHASE_DEDUP=0 for legacy baseline; design doc revised for default-on, M1/M2 milestones, and decision log.

Redis NOTLEADER Error Promotion

Layer / File(s) Summary
writeRedisError fix and per-command leader error promotion
adapter/redis.go
writeRedisError preserves existing NOTLEADER prefix; handleProxyTxnCommandError added to scan pipeline command handles and promote transient leader errors; handleProxyTxnError extended to treat transient leader errors as terminal.
Redis error prefix and proxy transaction tests
adapter/redis_error_prefix_test.go
captureConn gains WriteArray capture; no-double-prefix regression case added; TestHandleProxyTxnError and TestHandleProxyTxnCommandError added.
Jepsen Redis workload refactor and tests
jepsen/src/elastickv/redis_workload.clj, jepsen/test/elastickv/redis_workload_test.clj
Private helpers extracted for exception detection, error labeling, and result completion; invoke! delegates to apply-multi-op!/apply-single-op!; new test file validates read parsing and NOTLEADER-to-:info conversion.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant proxyTransactionToLeader
  participant handleProxyTxnError
  participant handleProxyTxnCommandError
  participant writeRedisError

  Client->>proxyTransactionToLeader: MULTI/EXEC pipeline
  proxyTransactionToLeader->>proxyTransactionToLeader: execute pipelined commands
  proxyTransactionToLeader->>handleProxyTxnError: check pipeline-level error
  alt transient leader error
    handleProxyTxnError->>writeRedisError: write NOTLEADER reply (no double-prefix)
    writeRedisError-->>Client: -NOTLEADER <msg>
  else TxFailedErr
    handleProxyTxnError-->>Client: null-array
  else no terminal error
    proxyTransactionToLeader->>handleProxyTxnCommandError: scan per-command result handles
    alt per-command transient leader error
      handleProxyTxnCommandError->>writeRedisError: promote to top-level NOTLEADER
      writeRedisError-->>Client: -NOTLEADER <msg>
    else ordinary command error
      handleProxyTxnCommandError-->>proxyTransactionToLeader: stop=false
      proxyTransactionToLeader-->>Client: EXEC array response
    end
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • bootjp/elastickv#753: Directly overlaps with this PR's writeRedisError and adapter/redis_error_prefix_test.go changes around prefixing transient leader errors with NOTLEADER.
  • bootjp/elastickv#789: Both PRs modify adapter/redis.go and adapter/redis_error_prefix_test.go to ensure gRPC-wrapped transient leader failures are correctly classified as top-level NOTLEADER replies.
  • bootjp/elastickv#920: Directly related as the prior implementation of onePhaseTxnDedup gating in adapter/dynamodb.go that this PR flips to default-on.

Suggested labels

codex

🐇 The dedup gate swings open wide,
No more "gated off" to hide!
NOTLEADER errors, no double crown,
Jepsen helpers refactored down.
Default-on, rollback to zero,
This rabbit's a dedup hero! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 37.50% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title 'redis: handle EXEC transient errors' directly describes the main technical change: improved handling of transient errors in Redis EXEC operations, which is the primary focus of the PR.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/jepsen-redis

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request improves error handling for Redis transactions by promoting transient leadership failures returned by proxied EXEC targets to top-level EXEC errors, preventing clients from treating leadership churn as command data. It also prevents double-prefixing of NOTLEADER errors and updates the Jepsen workload to correctly parse and label transaction errors. Feedback on the changes suggests removing an unreachable nil guard (cmd == nil) in handleProxyTxnCommandError to avoid dead defensive code.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread adapter/redis.go
Comment on lines +2062 to +2066
for _, cmd := range cmds {
if cmd == nil {
continue
}
err := cmd.Err()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

According to the repository's general rules, we should avoid adding dead defensive code, such as unreachable nil guards, if upstream checks or callers already guarantee that the variables are non-nil. Since execTxPipeline always populates cmds with non-nil *redis.Cmd instances, the cmd == nil check is unreachable and can be safely removed.

	for _, cmd := range cmds {
		err := cmd.Err()
References
  1. Avoid adding dead defensive code, such as unreachable nil guards, if upstream checks or callers already guarantee that the variables are non-nil.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

対応しました。handleProxyTxnCommandError から到達不能な cmd == nil guard を削除しました。

Validation:

  • go test ./adapter -run 'Test(IsTransientLeaderRedisError|WriteRedisError|HandleProxyTxnError|HandleProxyTxnCommandError)$'
  • git diff --check

bootjp commented Jun 18, 2026

Copy link
Copy Markdown
Owner Author

DynamoDB single-item one-phase dedup を default-on に変更し、ELASTICKV_DYNAMODB_ONEPHASE_DEDUP=0 / WithDynamoOnePhaseTxnDedup(false) を rollback switch として残しました。通常 scheduled Jepsen は explicit dedup-off baseline、dedup workflow は pinned-on に更新しています。

Validation:

  • go test ./adapter -run 'TestItemWriteDedup'
  • go test ./adapter -run 'Test(Dynamo|ItemWriteDedup)'
  • git diff --check

@codex review

bootjp commented Jun 18, 2026

Copy link
Copy Markdown
Owner Author

Gemini の inline finding に対応し、handleProxyTxnCommandError の到達不能な cmd == nil guard を削除しました。

Validation:

  • go test ./adapter -run 'Test(IsTransientLeaderRedisError|WriteRedisError|HandleProxyTxnError|HandleProxyTxnCommandError)$'
  • git diff --check

@codex review

@bootjp bootjp marked this pull request as ready for review June 18, 2026 11:59

bootjp commented Jun 18, 2026

Copy link
Copy Markdown
Owner Author

PR を ready for review に切り替えました。最新 HEAD は 10157b2153a9c2051f3caae1144180ce000d3993 です。

@codex review

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
adapter/dynamodb_onephase_dedup_test.go (1)

250-266: ⚡ Quick win

Consider adding test for option-override-to-disable.

The new tests cover default-on, env opt-out, and option-override-to-enable. For completeness, consider adding a test case where the env var enables dedup (or is unset) but the option disables it:

func TestItemWriteDedup_OptionOverridesEnvToDisable(t *testing.T) {
	t.Setenv("ELASTICKV_DYNAMODB_ONEPHASE_DEDUP", "1")
	server := NewDynamoDBServer(nil, store.NewMVCCStore(), newDedupTestCoordinator(store.NewMVCCStore(), 0, false), WithDynamoOnePhaseTxnDedup(false))
	require.False(t, server.onePhaseTxnDedup)
}

This validates that the option can override in both directions, not just enable when env disables.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@adapter/dynamodb_onephase_dedup_test.go` around lines 250 - 266, Add a new
test function called TestItemWriteDedup_OptionOverridesEnvToDisable to complete
the test coverage for bidirectional option override behavior. In this test, set
the environment variable ELASTICKV_DYNAMODB_ONEPHASE_DEDUP to "1" to enable
dedup, then create a NewDynamoDBServer with the
WithDynamoOnePhaseTxnDedup(false) option to override and disable it. Assert that
server.onePhaseTxnDedup is False, validating that the option parameter can
override the environment variable to disable dedup, not just enable it.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@adapter/dynamodb_onephase_dedup_test.go`:
- Around line 250-266: Add a new test function called
TestItemWriteDedup_OptionOverridesEnvToDisable to complete the test coverage for
bidirectional option override behavior. In this test, set the environment
variable ELASTICKV_DYNAMODB_ONEPHASE_DEDUP to "1" to enable dedup, then create a
NewDynamoDBServer with the WithDynamoOnePhaseTxnDedup(false) option to override
and disable it. Assert that server.onePhaseTxnDedup is False, validating that
the option parameter can override the environment variable to disable dedup, not
just enable it.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: cefa2c14-ec60-47bb-a4c8-fbf602cfa329

📥 Commits

Reviewing files that changed from the base of the PR and between ddf29af and 10157b2.

📒 Files selected for processing (10)
  • .github/workflows/jepsen-test-scheduled-dedup.yml
  • .github/workflows/jepsen-test-scheduled.yml
  • adapter/dynamodb.go
  • adapter/dynamodb_item_write.go
  • adapter/dynamodb_onephase_dedup_test.go
  • adapter/redis.go
  • adapter/redis_error_prefix_test.go
  • docs/design/2026_06_03_partial_dynamodb_onephase_dedup.md
  • jepsen/src/elastickv/redis_workload.clj
  • jepsen/test/elastickv/redis_workload_test.clj

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex Review: Didn't find any major issues. More of your lovely PRs please.

Reviewed commit: 10157b2153

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

bootjp commented Jun 18, 2026

Copy link
Copy Markdown
Owner Author

CodeRabbit の nitpick に対応し、WithDynamoOnePhaseTxnDedup(false) が env ELASTICKV_DYNAMODB_ONEPHASE_DEDUP=1 を上書きして disable できることをテスト追加しました。

Validation:

  • go test ./adapter -run 'TestItemWriteDedup'
  • git diff --check

@codex review

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Delightful!

Reviewed commit: eef57ccb1b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@bootjp bootjp merged commit ebdf8eb into main Jun 18, 2026
9 checks passed
@bootjp bootjp deleted the fix/jepsen-redis branch June 18, 2026 15:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant