Skip to content

cloudidentity: retry GroupMembership on 409 operation aborted (#23743)#71

Open
jbbqqf wants to merge 11 commits into
mainfrom
feat/23743-cloudidentity-membership-retry-409
Open

cloudidentity: retry GroupMembership on 409 operation aborted (#23743)#71
jbbqqf wants to merge 11 commits into
mainfrom
feat/23743-cloudidentity-membership-retry-409

Conversation

@jbbqqf
Copy link
Copy Markdown
Owner

@jbbqqf jbbqqf commented May 9, 2026

Summary

Retry google_cloud_identity_group_membership Create on 409 "The operation was aborted." — a race condition Cloud Identity raises when several memberships are added against the same group concurrently. The same retry predicate is already applied to google_iap_client for the same class of error.

Fixes hashicorp/terraform-provider-google#23743 — see hashicorp/terraform-provider-google#23743

Why

The reporter on issue #23743 described a CI pipeline that creates four service accounts and adds each to the same Cloud Identity group; ~75% of attempts fail with:

Error: Error creating GroupMembership: googleapi: Error 409: The operation was aborted.

The Cloud Identity backend serializes membership writes against the group's roster; concurrent writes hit a documented "operation was aborted" race that is transient. The existing fix for the identical-shape race on google_iap_client is the transport_tpg.IapClient409Operation retry predicate (matches 409 + body operation was aborted, case-insensitive). Wiring that same predicate into GroupMembership is the minimal correct change.

Note: a different 409 path exists — code 4003 "already exists" — which can surface when the provider's first POST succeeded server-side but the response was lost; the existing create_ignore_already_exists virtual field is intended for that path. This PR does not change behaviour there.

GCP API reference:

What changed

This is an mmv1-generated resource. One YAML touched:

 mmv1/products/cloudidentity/GroupMembership.yaml | 3 +++

The diff adds:

error_retry_predicates:

  - 'transport_tpg.IapClient409Operation'

The blank line and the transport_tpg. prefix match the existing pattern in mmv1/products/iap/Client.yaml.

Edge cases tested

# Scenario HCL excerpt Expected Verified by
1 Single membership (control) one resource per apply succeeds, no retry unchanged path
2 Concurrent membership writes (typical race) for_each of N service accounts on one group retry path triggers; eventual success predicate is the same one already battle-tested for google_iap_client (issue thread on TPG and IAP repo for that retry)
3 Edge: 409 'already exists' (code 4003) rerun after partial success unchanged — predicate body match is operation was aborted, not already exists; still routed through create_ignore_already_exists virtual field grep on retry predicate body

Test protocol

Test Result Notes
YAML diff review pass mirrors mmv1/products/iap/Client.yaml
Predicate body match confirmed IapClient409Operation matches gerr.Code == 409 && lower(body).Contains("operation was aborted") — same string the issue reporter quotes

This is a transport-retry change on a hand-written-test resource; reproducing the race deterministically requires a second concurrent provider invocation against the same Cloud Identity group, which our smoke harness can't easily orchestrate. The retry predicate itself is unit-tested upstream (error_retry_predicates_test.go) and the predicate function is unchanged. The change is fully captured by the YAML diff.

Resources

Disclosure

This PR was implemented with assistance from Claude Code as part of a focused contribution batch. The diff was reviewed manually against the existing IAP precedent and the linked issue body. The author (a human) reviewed the diff before opening this PR.

Note for reviewer

IapClient409Operation is named for IAP because that is where it was first introduced, but the function body is generic ("409 + 'operation was aborted'"). A small rename pass to e.g. Concurrency409Aborted could be done as a follow-up; not in scope for this PR.

jcromanu and others added 11 commits May 8, 2026 16:43
…hip (#23743)

Users report frequent 409 'operation was aborted' errors on
google_cloud_identity_group_membership when several memberships are
created concurrently against the same group (the same race the Cloud
Identity API has documented as concurrent-roster mutation). The current
behaviour bubbles the 409 up to the user without a retry, breaking
CI pipelines that batch-add service accounts.

Wire the existing IapClient409Operation retry predicate into the
GroupMembership resource via error_retry_predicates. The predicate
matches 409 + body 'operation was aborted' (case-insensitive) and is
the same retry already applied to google_iap_client, where the same
race exists.

This does not affect the 409 'already exists' (code 4003) path that
the create_ignore_already_exists virtual field is meant to handle.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

cloud_identity_group_membership giving "Error 409: The operation was aborted." very frequently

8 participants