Uns 611 clubbed notification dispatch by kirtimanmishrazipstack · Pull Request #1951 · Zipstack/unstract

kirtimanmishrazipstack · 2026-05-07T18:18:31Z

What

Why

How

Can this PR break any existing features. If yes, please list possible items. If no, please explain why. (PS: Admins do not merge the PR without this section filled)

Database Migrations

Env Config

Relevant Docs

Related Issues or PRs

Dependencies Versions

Notes on Testing

Screenshots

Checklist

I have read and understood the Contribution Guidelines.

…o CORS (#1938) * UN-3439 [FIX] Accept wildcard subdomain origins in SocketIO and Django CORS Production socket connections were failing for `*.env.us-central.unstract.com` because python-socketio does exact-string comparison on `cors_allowed_origins`, so a literal `*` pattern silently rejected every real subdomain. - Add `CORS_ALLOWED_ORIGIN_REGEXES` derived from `WEB_APP_ORIGIN_URL_WITH_WILD_CARD`. - Wire SocketIO via `_RegexOrigin` whose `__eq__` does the regex match — single list entry covers all wildcard subdomains, no library subclass needed. - Normalize `WEB_APP_ORIGIN_URL` through `urlparse` so trailing slashes / paths in env are stripped (also fixes the `…com//oauth-status/` double-slash). - Add startup guard for malformed env values. Resolves item #1 of UN-3439. Items #2/#3 (decoupling indexing from Socket.io, fallback) are owned separately. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * UN-3439 [FIX] Address PR review: canonical origin, fullmatch, unhashable RegexOrigin, tests Addresses five review comments on #1938: 1. coderabbitai (Major) — RFC 6454 canonicalization. Browsers serialize `Origin` headers with a lowercase host and no explicit default ports; `parsed_url.netloc` preserved both, so `https://APP.EXAMPLE.COM:443` would silently fail to match the browser's `https://app.example.com`. Switch to `parsed_url.hostname` + drop default ports, and reject non-http(s) schemes at startup. 2. greptile (P2) — `re.fullmatch` instead of `re.match`. With `re.match` plus `$`, a candidate ending in `\n` matches because `$` is allowed before an optional trailing newline. `fullmatch` removes the ambiguity. 3. self — `_RegexOrigin.__hash__` violated `a == b ⇒ hash(a) == hash(b)` (one fixed pattern hash vs. many matching strings). Today this is masked because python-socketio uses linear `__eq__` on a list, but if the allow-list is ever wrapped in a set, every legitimate subdomain would silently be rejected — exactly the failure mode UN-3439 closes. Make instances unhashable so the contract can't be broken. 4. self — No regression tests. Add `backend/utils/tests/test_cors_origin.py` (33 cases) covering: regex match/no-match, lookalike spoofing, scheme mismatch, trailing-newline rejection, non-string equality protocol, unhashability, ReDoS bounds, URL normalization (case, default ports, trailing slash, paths, queries), startup-guard rejections (empty, no-scheme, non-browser-scheme, no-host), and end-to-end via the same `RegexOrigin` path SocketIO uses. 5. self — Over-clever wildcard-to-regex builder. The `split('*').join(re.escape, ...)` construction generalised to N wildcards but the input has exactly one; replace with a direct rf-string that's self-evident on review. Refactor for testability: extract `RegexOrigin` and `normalize_web_app_origin` into `backend/utils/cors_origin.py` (Django-free, importable from settings and tests). Settings now delegates to one helper call; `log_events.py` imports `RegexOrigin`. No behavioural change beyond what each comment fixes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * UN-3439 [FIX] Address SonarCloud quality gate The Sonar quality gate failed with C reliability + 5 security hotspots, all on the new test file: - S905 (Bug, Major) — `{ro}` flagged as no-side-effect statement (Sonar doesn't see the implicit `__hash__` call). Drove the C reliability rating. Fix: use `len({ro})` so the side effect is via an explicit function call; test still asserts the same `TypeError`. - S5727 (Code Smell, Critical) — `assert ro != None` is tautological and doesn't exercise `__eq__`. Switch to `(ro == None) is False` which directly tests that `NotImplemented` falls back to identity-equality. - S5332 × 5 (Hotspots) — `http://` and `ftp://` literals in test data. These are intentional inputs proving the rejection logic. Annotate with `# NOSONAR` and an explanatory comment so the hotspots can be marked reviewed. No production code changed; tests still 33/33 passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * UN-3439 [FIX] Remove last S5727 code smell — test __eq__ via dunder Sonar S5727 correctly inferred that ``ro == None`` is statically always False (NotImplemented falls back to identity), making the assertion look tautological. The intent is to lock the protocol contract: ``__eq__`` must return the ``NotImplemented`` sentinel for non-strings. Test that directly via ``ro.__eq__(None) is NotImplemented`` instead of going through ``==``. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * UN-3439 [FIX] Address remaining CodeRabbit nits — port validation, ReDoS bound Two minor follow-ups from the second CodeRabbit pass: - `parsed.port` is a property that raises ValueError on malformed/out-of-range inputs (e.g. `:abc`, `:99999`). That bypassed our normalized config-error message and surfaced as a stack trace. Wrap the access and re-raise with the same actionable text. Adds two test cases (`https://example.com:abc`, `https://example.com:99999`) to lock the new behaviour. - The 50ms ReDoS timing bound is too tight for noisy CI runners. Loosen to 500ms — still orders of magnitude below what catastrophic backtracking would produce. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Change csp to report only * [HOTFIX] Bool-parse ENABLE_HIGHLIGHT_API_DEPLOYMENT env var (v0.161.4) (#1939) [HOTFIX] Bool-parse ENABLE_HIGHLIGHT_API_DEPLOYMENT env var (#1937) [FIX] Bool-parse ENABLE_HIGHLIGHT_API_DEPLOYMENT env var os.environ.get returns the raw string when the variable is set, so ENABLE_HIGHLIGHT_API_DEPLOYMENT="False" was truthy in Python (any non-empty string is truthy). Wrap in CommonUtils.str_to_bool so "False" / "false" / "0" actually evaluate to False. The setting is consumed by the cloud configuration plugin's spec default (ConfigSpec.default in plugins/configuration/cloud_config.py) on cloud and on-prem builds. With this fix, an admin who explicitly sets the env var to a falsy string sees highlight data stripped as expected. Co-authored-by: vishnuszipstack <117254672+vishnuszipstack@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Deepak K <89829542+Deepak-Kesavan@users.noreply.github.com> Co-authored-by: vishnuszipstack <117254672+vishnuszipstack@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@jaseemjaskp

…ation workflow (#1941) * UN-3448 [FIX] Add --system flag to uv pip install in uv-lock-automation workflow Modern uv requires uv pip install to run inside a virtual environment OR with the explicit --system flag. The workflow currently has neither, so it errors out: error: No virtual environment found for Python 3.12.9; run `uv venv` to create an environment, or pass `--system` to install into a non-virtual environment This breaks every PR that touches a pyproject.toml (the workflow's paths filter triggers on those). Last successful run was 2026-04-01, before a behaviour change in uv or astral-sh/setup-uv@v7. The --system flag is exactly what the error message suggests and is correct here — we install pip into the runner's system Python; the downstream uv-lock.sh script creates its own venvs as needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * UN-3448 [FIX] Remove vestigial `uv pip install` line per review Per @jaseemjaskp's review: the pre-step `uv pip install ... pip` does nothing useful for this workflow. The downstream uv-lock.sh script uses uv sync at line 74, which manages its own venvs internally and never invokes pip directly: $ grep -rn 'pip' docker/scripts/uv-lock-gen/ docker/scripts/uv-lock-gen/uv-lock.sh:2:set -o pipefail Only match is pipefail (shell option), no real pip references. Removing the line entirely is cleaner than papering over with --system. The line was likely copy-pasted from a sibling workflow that legitimately needed pip in the system Python. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [HOTFIX] Use importlib.util.find_spec for pluggable worker discovery (#1918) * [FIX] Use importlib.util.find_spec for pluggable worker discovery _verify_pluggable_worker_exists() previously checked for the literal file `pluggable_worker/<name>/worker.py` on disk, which breaks when the plugin has been compiled to a .so (Nuitka, Cython, or any C extension) — the module is perfectly importable but the pre-check rejects it because only the .py extension is considered. Replace the filesystem check with importlib.util.find_spec(), which is Python's standard way to ask "is this module resolvable by the import system?". It honors every registered finder — source .py, compiled .so, bytecode .pyc, namespace packages, zipimports — so the function now matches what its docstring claims: verifying the module can be loaded, not that a specific file extension is present. Behavior is preserved for existing deployments: - Images with no `pluggable_worker/<name>/` subpackage → find_spec raises ModuleNotFoundError (ImportError subclass) → returns False. - Images with source .py → find_spec resolves the .py → returns True. - Images with compiled .so → find_spec resolves the .so → returns True. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [FIX] Handle ValueError from find_spec in pluggable worker verification Greptile-flagged edge case: importlib.util.find_spec() can raise ValueError (not just ImportError) when sys.modules has a partially initialised module entry with __spec__ = None from a prior failed import. Broaden the except to catch both. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [FIX] Resolve api-deployment worker directory from enum import path worker.py:452 did worker_type.value.replace("-", "_") to derive the on-disk dir name. All WorkerType enum values already use underscores, so the replace was a no-op; for API_DEPLOYMENT whose dir is "api-deployment" (hyphen), it resolved to "api_deployment" and the os.path.exists() check failed. Boot then logged a spurious "❌ Worker directory not found: /app/api_deployment" at ERROR level. The task registration path (builder + celery autodiscover via to_import_path) is unaffected, so this was purely log noise — but noise at ERROR level that masks real failures in log scans. Fix: derive the directory from the authoritative to_import_path() which already handles the hyphen case (api_deployment -> api-deployment). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [HOTFIX] Add IAM Role / Instance Profile auth mode to AWS Bedrock adapter (#1944) * [FEAT] Allow Bedrock to fall through to boto3's default credential chain Match the S3/MinIO connector pattern: when AWS access keys are left blank on the Bedrock LLM and embedding adapter forms, drop them from the kwargs dict so boto3's default credential chain handles authentication. This unlocks IAM role / instance profile / IRSA / AWS Profile scenarios on hosts that already have ambient AWS credentials (e.g. EKS workers with IRSA, EC2 with an instance profile). - llm1/static/bedrock.json: clarify access-key descriptions to mention IRSA and instance profile (already non-required at v0.163.2 base). - embedding1/static/bedrock.json: drop aws_access_key_id and aws_secret_access_key from top-level required; same description fix; expose aws_profile_name for parity with the LLM form. - base1.py: AWSBedrockLLMParameters and AWSBedrockEmbeddingParameters now strip empty access-key values from the validated kwargs before returning, so empty strings don't override boto3's default chain. AWSBedrockEmbeddingParameters fields gain explicit None defaults and an aws_profile_name field. Backward-compatible: existing adapters with access keys filled in continue to work unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [FEAT] Add Authentication Type selector to Bedrock adapter form Add an explicit `auth_type` selector with two options, making the auth choice clear to users: - "Access Keys" (default): existing flow, keys required - "IAM Role / Instance Profile (on-prem AWS only)": no fields; relies on boto3's default credential chain (IRSA on EKS, task role on ECS, instance profile on EC2). Description on the selector explicitly notes this option is only for AWS-hosted Unstract deployments. The form-only auth_type field is stripped before LiteLLM validation in both AWSBedrockLLMParameters.validate() and AWSBedrockEmbeddingParameters. validate(). Empty access keys continue to be stripped so boto3 falls through to the default chain even when the access_keys arm is selected without values (matches the S3/MinIO connector pattern). Backward-compatible: legacy adapters without auth_type behave as "Access Keys" mode (the default), and existing keys are forwarded unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [REVIEW] Address Bedrock auth_type review feedback Fixes the P0/P1 issues raised by greptile-apps and jaseemjaskp on PR #1944. Behaviour fixes: - Stale-key leak in IAM Role mode: switching an existing adapter from Access Keys to IAM Role would carry truthy stored access keys through the strip-empty-only loop, so boto3 silently authenticated with the old long-lived credentials instead of falling through to the host's IRSA / instance-profile identity. Both LLM and embedding paths were affected. - Silent acceptance of unknown auth_type: a typo (e.g. "access_key") or a malformed payload from a non-UI client passed through the dict comprehension untouched, with no enum guard. - Cross-field validation gap: explicit Access Keys mode with blank or whitespace-only values silently fell through to the default credential chain instead of surfacing the misconfiguration. Implementation: - Add a module-level _resolve_bedrock_aws_credentials helper used by both AWSBedrockLLMParameters.validate() and AWSBedrock EmbeddingParameters.validate(), so the auth-type contract is expressed once. - Validates auth_type against an allowlist (None | "access_keys" | "iam_role"); raises ValueError on anything else. - iam_role: unconditionally drops aws_access_key_id and aws_secret_access_key. - access_keys (explicit): requires non-blank values; raises ValueError if either is empty or whitespace-only. - Legacy (auth_type absent): retains the lenient strip behaviour so pre-PR adapter configurations continue to deserialise unchanged. - Restore aws_region_name as required (no `= None` default) on AWSBedrockEmbeddingParameters; only credentials may legitimately be absent. - Drop the orphan aws_profile_name field from embedding1/static/bedrock.json: it was added for parity with the LLM form but lives outside the auth_type oneOf and contradicts the selector's "no further input" semantics. The LLM form already had aws_profile_name pre-PR and is left alone for backwards compatibility. Tests: - New tests/test_bedrock_adapter.py covers 15 cases across LLM and embedding adapters: legacy-no-auth-type, explicit access_keys with valid/blank/whitespace keys, iam_role with stale/no keys, unknown auth_type rejection, cross-field validation, and preservation of unrelated params (model_id, aws_profile_name, region, thinking). Skipped (P2 nice-to-have): - Comment-scope clarification, MinIO reference rewording, validate-mutates-caller'\''s-dict, and the LLM form description nit about aws_profile_name visibility. These don'\''t change behaviour and can be addressed in a follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> --------- Co-authored-by: Chandrasekharan M <117059509+chandrasekharan-zipstack@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Athul <89829560+athul-rs@users.noreply.github.com>

…fy-on-API-deployment-failures

coderabbitai · 2026-05-07T18:18:40Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 85f0737f-49b7-494f-b12d-71a49b0d4a9a

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch UNS-611-clubbed-notification-dispatch

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

Generate code and open pull requests
Plan features and break down work
Investigate incidents and troubleshoot customer tickets together
Automate recurring tasks and respond to alerts with triggers
Summarize progress and report instantly

Built for teams:

Shared memory across your entire org—no repeating context
Per-thread sandboxes to safely plan and execute work
Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

sonarqubecloud · 2026-05-07T18:20:04Z

Quality Gate failed

Failed conditions
2 Security Hotspots

See analysis details on SonarQube Cloud

greptile-apps · 2026-05-07T18:24:26Z

Greptile Summary

This PR introduces "clubbed" (batched) notification dispatch for pipelines and API deployments. When a Notification is set to BATCHED delivery mode, execution events are buffered in a new NotificationBuffer table and flushed as a single aggregated webhook message per (org, webhook_url, auth_sig) group on a configurable cadence, rather than firing one request per execution.

Adds the NotificationBuffer model, a DeliveryMode enum, and a partial DB index on PENDING rows; a new migration covers both the new table and the delivery_mode column on Notification.
Two new internal endpoints (buffer/enqueue/, buffer/process/) handle worker-side buffering and backend-side flush; a new process_notification_buffer.py script and a scheduler.sh update drive the flush loop alongside the existing log-history consumer.
Frontend gains a delivery-mode selector on the notification form (with auto-IMMEDIATE when "notify on failures" is checked) and a batch-interval admin UI in Platform Settings.

Confidence Score: 3/5

The clubbed dispatch path has a window where a transient DB error after broker publish can leave rows PENDING while notifications have already been sent, leading to duplicates on the next flush tick.

The _dispatch_group function calls celery_app.send_task() inside transaction.atomic() before rows are marked DISPATCHED. A transient DB failure in the .update() call rolls back the transaction while the broker has already received the task — rows stay PENDING and are re-dispatched on the next tick, sending duplicate notifications. The settings serializer also accepts zero or negative intervals with no validation, collapsing the batch window for every future enqueue.

backend/notification_v2/internal_api_views.py (the _dispatch_group flush transaction) and backend/notification_v2/serializers.py (the settings serializer validation) warrant a second look before merging.

Important Files Changed

Filename	Overview
backend/notification_v2/internal_api_views.py	Core flush logic: adds `enqueue_notification_buffer` and `process_notification_buffer` endpoints; `_dispatch_group` calls `celery_app.send_task()` inside `transaction.atomic()` — a duplicate-dispatch race exists if the subsequent `.update()` fails before the transaction commits.
backend/notification_v2/serializers.py	Adds `NotificationSettingsSerializer` and `delivery_mode` field; `club_interval_seconds` is missing a `min_value=1` guard, allowing a zero or negative interval to be persisted and silently collapse batch windows.
backend/notification_v2/helper.py	Adds `enqueue`, `dispatch_with_delivery_mode`, `split_by_delivery_mode`, and auth-sig helpers; implementation is sound, BATCHED routing correctly branches before reaching the immediate-dispatch path.
workers/shared/patterns/notification/helper.py	Adds `_route_notification` / `_enqueue_to_buffer` branching; BATCHED failures are silently dropped by design — no metric/counter emitted beyond an error log, which can obscure backend outages.
backend/notification_v2/models.py	Adds `delivery_mode` to `Notification` and new `NotificationBuffer` model with correct partial index on `(org, url, auth_sig, flush_after)` scoped to PENDING; schema looks correct.
backend/notification_v2/tasks.py	New `mark_buffer_dead_letter` Celery errback task; signature correctly matches Celery's link_error `(request, exc, traceback)` positional convention with `buffer_row_ids` pre-bound as a kwarg.
backend/notification_v2/clubbed_renderer.py	New renderer for batched messages; `build_envelope` caps at `MAX_BATCH_SIZE=500` matching `_PROCESS_BUFFER_CAP`; Slack display truncation at 25 events with overflow footer looks correct.
backend/notification_v2/migrations/0003_add_notification_buffer.py	Migration adds `delivery_mode` column and `NotificationBuffer` table with partial index; defaults and choices align with model and enum definitions.
workers/log_consumer/process_notification_buffer.py	Thin HTTP wrapper that triggers the backend's buffer flush endpoint every scheduler tick; non-fatal failure handling is correct — logs and returns False without crashing the scheduler loop.
workers/log_consumer/scheduler.sh	Adds a second task slot for the notification buffer flush alongside log history; `run_task` helper catches individual failures without aborting the loop; `-e` removed from `set` flags intentionally.

Prompt To Fix All With AI

Fix the following 4 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 4
backend/notification_v2/internal_api_views.py:429-472
**Celery task dispatched inside `transaction.atomic()` — duplicate notifications on commit failure**

`celery_app.send_task()` is called while the `transaction.atomic()` block is still open. If `send_task` succeeds but the subsequent `.update(status=DISPATCHED)` raises (e.g. a transient DB error), the entire transaction rolls back: rows remain `PENDING` while the broker has already received — and may have already executed — the webhook task. On the very next flush tick these rows are re-selected as `PENDING` and dispatched again, sending duplicate notifications.

The standard fix is to emit the Celery task only after the transaction commits, via `transaction.on_commit()`. One safe restructuring: perform the `.update()` first inside the transaction, then schedule the task via `on_commit` so that the broker is only notified once the DB state is durable.

### Issue 2 of 4
backend/notification_v2/serializers.py:11-13
No `min_value` guard on `club_interval_seconds`. A caller can `PATCH` a value of `0` or a negative integer, which passes serializer validation and is stored in the configuration table. `get_org_club_interval_seconds` returns it as-is, so `flush_after = now + timedelta(seconds=0_or_negative)` — the new row is immediately past its window and will be dispatched on the very next tick, effectively bypassing batching. The rendered message would also display `interval_minutes: 0`.

```suggestion
    # Backend ConfigSpec accepts any int; a floor of 1 prevents a zero or
    # negative value from collapsing the batch window to "immediate".
    club_interval_seconds = serializers.IntegerField(min_value=1)
```

### Issue 3 of 4
workers/shared/patterns/notification/helper.py:56-85
**Silent BATCHED notification drop on backend unavailability**

When `_enqueue_to_buffer` returns `False`, `_route_notification` returns immediately with no fallback and no metric. The execution event is lost — it won't appear in any future clubbed message nor as an immediate notification. The comment acknowledges this is intentional ("we DON'T fall back"), but there is no counter, no structured metric log, and no dead-letter record created on the worker side. During a prolonged backend outage, BATCHED notifications can silently disappear with only a single `logger.error` line as evidence.

### Issue 4 of 4
backend/notification_v2/internal_api_views.py:499-506
**`_PROCESS_BUFFER_CAP` referenced before it is defined in the module**

`_PROCESS_BUFFER_CAP` is used in `_dispatch_group` (line ~406) but defined at the module level only after that function's definition. Python resolves it at call time rather than definition time, so this doesn't raise a `NameError` at import, but any reader or static analyzer that encounters the reference first will be confused. Moving the constant above `_dispatch_group` (or at the top of the file with the other module-level constants) is the conventional placement.

_{Reviews (1): Last reviewed commit: "batch notification" | Re-trigger Greptile}

greptile-apps · 2026-05-07T18:24:30Z

+        try:
+            celery_app.send_task(
+                "send_webhook_notification",
+                args=[
+                    first_notification.url,
+                    body,
+                    headers,
+                    settings.NOTIFICATION_TIMEOUT,
+                ],
+                kwargs={
+                    "max_retries": first_notification.max_retries,
+                    "retry_delay": 10,
+                    "platform": platform,
+                },
+                queue="notifications",
+                link_error=celery_app.signature(
+                    "notification_v2.mark_buffer_dead_letter",
+                    kwargs={"buffer_row_ids": buffer_ids},
+                ),
+            )
+        except Exception:
+            # Broker hiccup — leave rows PENDING for the next tick rather
+            # than mark them DEAD_LETTER. `exception` keeps stack context.
+            logger.exception(
+                "Broker dispatch failed for group org=%s url_hash=%s",
+                org_id,
+                webhook_url_hash(webhook_url),
+            )
+            return 0, 0
+
+        now = timezone.now()
+        NotificationBuffer.objects.filter(id__in=buffer_ids).update(
+            status=BufferStatus.DISPATCHED.value,
+            dispatched_at=now,
+        )
+        logger.info(
+            "metric=notification_batch_dispatched_total platform=%s result=success "
+            "org_id=%s webhook_url_hash=%s rows=%d",
+            platform,
+            org_id,
+            webhook_url_hash(webhook_url),
+            len(rows),
+        )
+        return len(rows), len(rows)


Celery task dispatched inside transaction.atomic() — duplicate notifications on commit failure

celery_app.send_task() is called while the transaction.atomic() block is still open. If send_task succeeds but the subsequent .update(status=DISPATCHED) raises (e.g. a transient DB error), the entire transaction rolls back: rows remain PENDING while the broker has already received — and may have already executed — the webhook task. On the very next flush tick these rows are re-selected as PENDING and dispatched again, sending duplicate notifications.

The standard fix is to emit the Celery task only after the transaction commits, via transaction.on_commit(). One safe restructuring: perform the .update() first inside the transaction, then schedule the task via on_commit so that the broker is only notified once the DB state is durable.

Prompt To Fix With AI

This is a comment left during a code review. Path: backend/notification_v2/internal_api_views.py Line: 429-472 Comment: **Celery task dispatched inside `transaction.atomic()` — duplicate notifications on commit failure** `celery_app.send_task()` is called while the `transaction.atomic()` block is still open. If `send_task` succeeds but the subsequent `.update(status=DISPATCHED)` raises (e.g. a transient DB error), the entire transaction rolls back: rows remain `PENDING` while the broker has already received — and may have already executed — the webhook task. On the very next flush tick these rows are re-selected as `PENDING` and dispatched again, sending duplicate notifications. The standard fix is to emit the Celery task only after the transaction commits, via `transaction.on_commit()`. One safe restructuring: perform the `.update()` first inside the transaction, then schedule the task via `on_commit` so that the broker is only notified once the DB state is durable. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-05-07T18:24:30Z

+    # No min/max here: mfbt is silent on bounds. Backend ConfigSpec accepts
+    # any int; constraining is a follow-up if/when product gives a number.
+    club_interval_seconds = serializers.IntegerField()


No min_value guard on club_interval_seconds. A caller can PATCH a value of 0 or a negative integer, which passes serializer validation and is stored in the configuration table. get_org_club_interval_seconds returns it as-is, so flush_after = now + timedelta(seconds=0_or_negative) — the new row is immediately past its window and will be dispatched on the very next tick, effectively bypassing batching. The rendered message would also display interval_minutes: 0.

Suggested change

# No min/max here: mfbt is silent on bounds. Backend ConfigSpec accepts

# any int; constraining is a follow-up if/when product gives a number.

club_interval_seconds = serializers.IntegerField()

# Backend ConfigSpec accepts any int; a floor of 1 prevents a zero or

# negative value from collapsing the batch window to "immediate".

club_interval_seconds = serializers.IntegerField(min_value=1)

Prompt To Fix With AI

This is a comment left during a code review. Path: backend/notification_v2/serializers.py Line: 11-13 Comment: No `min_value` guard on `club_interval_seconds`. A caller can `PATCH` a value of `0` or a negative integer, which passes serializer validation and is stored in the configuration table. `get_org_club_interval_seconds` returns it as-is, so `flush_after = now + timedelta(seconds=0_or_negative)` — the new row is immediately past its window and will be dispatched on the very next tick, effectively bypassing batching. The rendered message would also display `interval_minutes: 0`. ```suggestion # Backend ConfigSpec accepts any int; a floor of 1 prevents a zero or # negative value from collapsing the batch window to "immediate". club_interval_seconds = serializers.IntegerField(min_value=1) ``` How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-05-07T18:24:31Z

+            timeout=10,
+        )
+        logger.info(
+            "Enqueued BATCHED notification %s for pipeline %s execution %s",
+            notification["id"],
+            payload.pipeline_id,
+            payload.execution_id,
+        )
+        return True
+    except Exception as e:
+        logger.error(
+            "Failed to enqueue BATCHED notification %s for pipeline %s: %s",
+            notification["id"],
+            payload.pipeline_id,
+            e,
+        )
+        return False
+
+
+def _route_notification(
+    api_client: Any,
+    notification: dict[str, Any],
+    payload: NotificationPayload,
+) -> None:
+    """IMMEDIATE -> existing worker queue; BATCHED -> backend enqueue endpoint.
+
+    Defaults to IMMEDIATE when delivery_mode is missing so older backend
+    builds (pre-UNS-611) keep working unchanged.
+    """
+    if notification.get("notification_type") != "WEBHOOK":


Silent BATCHED notification drop on backend unavailability

When _enqueue_to_buffer returns False, _route_notification returns immediately with no fallback and no metric. The execution event is lost — it won't appear in any future clubbed message nor as an immediate notification. The comment acknowledges this is intentional ("we DON'T fall back"), but there is no counter, no structured metric log, and no dead-letter record created on the worker side. During a prolonged backend outage, BATCHED notifications can silently disappear with only a single logger.error line as evidence.

Prompt To Fix With AI

This is a comment left during a code review. Path: workers/shared/patterns/notification/helper.py Line: 56-85 Comment: **Silent BATCHED notification drop on backend unavailability** When `_enqueue_to_buffer` returns `False`, `_route_notification` returns immediately with no fallback and no metric. The execution event is lost — it won't appear in any future clubbed message nor as an immediate notification. The comment acknowledges this is intentional ("we DON'T fall back"), but there is no counter, no structured metric log, and no dead-letter record created on the worker side. During a prolonged backend outage, BATCHED notifications can silently disappear with only a single `logger.error` line as evidence. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-05-07T18:24:33Z

+        .annotate(earliest_flush=Min("flush_after"))
+        .filter(earliest_flush__lte=now)
+    )
+
+    dispatched_groups = 0
+    dispatched_rows = 0
+    for group in groups:
+        try:


_PROCESS_BUFFER_CAP referenced before it is defined in the module

_PROCESS_BUFFER_CAP is used in _dispatch_group (line ~406) but defined at the module level only after that function's definition. Python resolves it at call time rather than definition time, so this doesn't raise a NameError at import, but any reader or static analyzer that encounters the reference first will be confused. Moving the constant above _dispatch_group (or at the top of the file with the other module-level constants) is the conventional placement.

Prompt To Fix With AI

This is a comment left during a code review. Path: backend/notification_v2/internal_api_views.py Line: 499-506 Comment: **`_PROCESS_BUFFER_CAP` referenced before it is defined in the module** `_PROCESS_BUFFER_CAP` is used in `_dispatch_group` (line ~406) but defined at the module level only after that function's definition. Python resolves it at call time rather than definition time, so this doesn't raise a `NameError` at import, but any reader or static analyzer that encounters the reference first will be confused. Moving the constant above `_dispatch_group` (or at the top of the file with the other module-level constants) is the conventional placement. How can I resolve this? If you propose a fix, please make it concise.

muhammad-ali-e and others added 6 commits May 5, 2026 10:48

Merge branch 'main' of github.com:Zipstack/unstract into UN-3056-Noti…

968fdbd

…fy-on-API-deployment-failures

batch notification

796e630

kirtimanmishrazipstack marked this pull request as ready for review May 7, 2026 18:19

kirtimanmishrazipstack merged commit aad0fa9 into UN-3056-Notify-on-API-deployment-failures May 7, 2026
4 checks passed

kirtimanmishrazipstack deleted the UNS-611-clubbed-notification-dispatch branch May 7, 2026 18:19

greptile-apps Bot reviewed May 7, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uns 611 clubbed notification dispatch#1951

Uns 611 clubbed notification dispatch#1951
kirtimanmishrazipstack merged 6 commits into
UN-3056-Notify-on-API-deployment-failuresfrom
UNS-611-clubbed-notification-dispatch

kirtimanmishrazipstack commented May 7, 2026

Uh oh!

coderabbitai Bot commented May 7, 2026 •

edited

Loading

Review skipped

Uh oh!

Uh oh!

sonarqubecloud Bot commented May 7, 2026

Uh oh!

greptile-apps Bot commented May 7, 2026

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Uh oh!

greptile-apps Bot May 7, 2026

Uh oh!

greptile-apps Bot May 7, 2026

Uh oh!

greptile-apps Bot May 7, 2026

Uh oh!

greptile-apps Bot May 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

kirtimanmishrazipstack commented May 7, 2026

What

Why

How

Can this PR break any existing features. If yes, please list possible items. If no, please explain why. (PS: Admins do not merge the PR without this section filled)

Database Migrations

Env Config

Relevant Docs

Related Issues or PRs

Dependencies Versions

Notes on Testing

Screenshots

Checklist

Uh oh!

coderabbitai Bot commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

Uh oh!

sonarqubecloud Bot commented May 7, 2026

Quality Gate failed

Uh oh!

greptile-apps Bot commented May 7, 2026

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Uh oh!

greptile-apps Bot May 7, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 7, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 7, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 7, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

coderabbitai Bot commented May 7, 2026 •

edited

Loading