Skip to content

Support delaying definition recompute to the background#1495

Merged
qtomlinson merged 44 commits intoclearlydefined:masterfrom
qtomlinson:qt/recompute-picked
Apr 4, 2026
Merged

Support delaying definition recompute to the background#1495
qtomlinson merged 44 commits intoclearlydefined:masterfrom
qtomlinson:qt/recompute-picked

Conversation

@qtomlinson
Copy link
Copy Markdown
Collaborator

@qtomlinson qtomlinson commented Mar 17, 2026

Based on Application Insights logging, "definition not available" events occur at 7K–10K per hour (data window: 12–18 hours). These definitions are currently computed on demand, increasing response time.

This PR adds a delayed compute path: when a definition is missing from the blob store, the service immediately returns an empty placeholder (a schema-valid definition with no tools), queues a background compute job, and triggers harvest if needed.

UpgradeHandler is extended into RecomputeHandler using a policy-composed architecture with two independent paths:

  • Upgrade path (upgradePolicy): evaluates existing definitions against schema rules and queues upgrade work when stale. Unchanged from existing DefinitionVersionCheck / DefinitionQueueUpgrader behavior. Queue errors are swallowed — returning a stale definition that will be retried on the next request is acceptable. This is the existing upgrade workflow.
  • Compute path (computePolicy): handles requests where no stored definition exists(MissingDefinitionComputePolicy)
    • OnDemandComputePolicy — computes synchronously (existing behavior, default).

    • DelayedComputePolicy — returns a placeholder immediately and enqueues background compute. Queue errors propagate to the caller — a missing definition with no queued work is worse than a failed request.

      Delayed queue processing uses missing-only semantics (SkipUpgradePolicy) so the background processor backfills missing definitions only, not stale ones. A shared in-memory cache is passed to both processing paths to prevent duplicate concurrent compute.

The provider is selected via DEFINITION_UPGRADE_PROVIDER: onDemand (default) or delayed. Backward-compatible aliases versionCheck and upgradeQueue are preserved.

Class names and environment variables related to upgrade are minimally changed and can be updated in a follow-up if needed.

Test Cases

In addition to the unit tests and integration tests added, the following end to end test cases were completed.

Step Test Result
Pass 0: Default file-based dev setup, on-demand compute (Regression)
0.1 Default dev: full harvest + recompute (no harvest data) PASS
0.2 Default dev: fast recompute (harvest data exists) PASS
0.3 Default dev: existing definition returned as-is PASS
0.4 Default dev: force recompute works PASS
0.5 Default dev: stale schema triggers blocking recompute PASS
Pass 1: Azurite on-demand compute (Regression)
1.1 On-demand: full harvest + recompute PASS
1.2 On-demand: fast recompute (harvest exists) PASS
1.3 On-demand: force recompute works PASS
Pass 2: Azurite upgrade-queue mode (Regression)
2.1 Queue mode: current definition returned as-is PASS
2.2 Queue mode: stale definition triggers upgrade queue PASS
2.3 Queue mode: upgrade processor recomputes PASS
2.4 Queue mode: stale def + queue unavailable → stale returned (not error) PASS
Pass 3: Azurite delayed compute path (new)
3.1 Delayed: missing definition returns placeholder + message queued PASS
3.2 Delayed: background processor computes real definition PASS
3.3 Delayed: full harvest + recompute via background PASS
3.4 Delayed: existing definition not re-queued PASS
3.5 Delayed: force=true bypasses delayed path PASS
3.6 Delayed: stale → upgrade queue, not compute queue PASS
Pass 4: Operational resilience (Azurite, new)
4.1 Operational: queue unavailable → request fails PASS
4.2 Operational: restart - queued message not lost PASS
4.3 Operational: duplicate requests - one compute runs PASS

@qtomlinson qtomlinson changed the title Qt/recompute picked Support delaying definition recompute in the background Mar 17, 2026
- Rename queueComputePolicy files/classes to delayedComputePolicy / DelayedComputePolicy
- Update imports, tests, and tsconfig includes to new names
- Add upgrade.service.delayed alias to delayed recompute factory
- Keep upgradeQueue for backward compatibility (deprecated in typings)
- Add existingDefinitionPolicy to treat existing definitions as valid on compute queue processing
- Wire delayed compute setupProcessing to use existingDefinitionPolicy
- Add tests for skip-on-existing and recompute-on-missing compute-queue behavior
- Include new policy files in tsconfig includes
- Create a shared in-memory lock cache in RecomputeHandler
- Pass shared cache to both upgrade and delayed compute setupProcessing paths
- Extend process.setup to accept optional sharedCache parameter
- Keep existingDefinitionPolicy for compute missing-only semantics
- Align JS and d.ts signatures for optional shared cache injection
@qtomlinson qtomlinson force-pushed the qt/recompute-picked branch from c0d5a98 to 195f946 Compare March 28, 2026 04:15
The previous name implied a domain concept about definition existence.
The class is a null-behavior upgrade validator that skips staleness
checks entirely — SkipUpgradePolicy better expresses that intent.

- Rename files and class in providers/upgrade/
- Update import in delayedComputePolicy
- Update tsconfig includes
defaultFactory only uses logger from its options object; any other
properties (e.g. queue passed from app wiring) were silently ignored.
Destructure { logger } in the parameter to make this explicit at the
call boundary, and narrow the type declaration accordingly.
…terface

The UpgradeHandler interface declares validate returning Promise<Definition | null>.
SkipUpgradePolicy was returning undefined on the falsy path instead of null,
causing a minor type mismatch. Changed return to null to align with the interface.
- Convert OnDemandComputePolicy from plain object to named class
  so getPolicyName() returns a meaningful name instead of "Object"
- Remove createOnDemandComputePolicy and createDelayedComputePolicy
  factory functions as they were thin wrappers over new
- Update recomputeHandler and tests to use constructors directly
The compute queue processor uses SkipUpgradePolicy, which ignores
definition contents entirely. Only coordinates are read from the
message, so the empty _meta field was dead weight.
Pass a plain DefinitionVersionChecker instead of relying on setup()'s
default parameter, making it clear the processor intentionally uses
a version checker (not the queue upgrader itself) to avoid re-queuing
stale definitions during background processing.
…andler

RecomputeHandler is the authoritative owner of the shared cache, so its TTL
constant belongs there. DefinitionUpgrader.defaultTtlSeconds made private
(_defaultTtlSeconds) as it is an internal fallback only. Removes the implicit
coupling where changing the upgrader TTL would silently affect the handler cache.
- DelayedComputePolicy constructor now requires options (removes misleading default stub)
- OnDemandComputePolicy.setupProcessing is now async (consistent with interface and sibling class)
- SkipUpgradePolicy drops currentSchema getter/setter (irrelevant since it skips staleness checks)
- Update .d.ts files to match
@qtomlinson qtomlinson changed the title Support delaying definition recompute in the background Support delaying definition recompute to the background Mar 31, 2026
- Resolve conflict in definitionServiceTest.ts: keep recomputeHandler
  with full mock (initialize, setupProcessing, compute) while adopting
  TS types from master's upgradeHandler refactor
- Add types/chai-as-promised.d.ts to fix rejectedWith TS type errors
  introduced by master's test-to-TypeScript migration
…ation

The Function type is required to match chai-as-promised's API signature
for error constructors passed to rejectedWith.
Replace direct DefinitionVersionChecker/DefinitionQueueUpgrader
instantiation with defaultFactory/delayedFactory to test the real
production wiring. Move on-demand-specific tests out of the shared
helper since they don't apply to the queued path. Extract duplicated
passthrough recompute handler into createPassthroughRecomputeHandler.
Internal methods _queueCompute and _constructMessage should not be
part of the public type surface.
… Promise<Definition>

No implementation returns undefined — _cast() at the call site would crash if it did.
…tlSeconds from public API

sharedCache: The 'shared' qualifier is an implementation detail of
RecomputeHandler, which creates one cache and passes it to both policies to
coordinate locking. The receiving layers (setupProcessing, setup,
DefinitionUpgrader) don't care whether the cache is shared — cache is the
right name there.

defaultTtlSeconds: _defaultTtlSeconds is a private field in DefinitionUpgrader.
It was incorrectly declared as a public static in process.d.ts. Removed from
the type declaration; the default TTL remains in the implementation.
@qtomlinson qtomlinson force-pushed the qt/recompute-picked branch from 29d46d8 to b80f4c9 Compare April 1, 2026 23:17
@qtomlinson qtomlinson marked this pull request as ready for review April 1, 2026 23:20
@qtomlinson qtomlinson requested a review from JamieMagee April 1, 2026 23:21
Comment thread business/definitionService.js
Comment thread providers/upgrade/process.js
Comment thread providers/upgrade/defVersionCheck.js Outdated
Comment thread providers/upgrade/delayedComputePolicy.js
Comment thread business/definitionService.js
Comment thread providers/upgrade/azureQueueConfig.js Outdated
Comment thread providers/upgrade/recomputeHandler.js
Comment thread tsconfig.json
Comment on lines +183 to +191
"providers/upgrade/computePolicy.d.ts",
"providers/upgrade/skipUpgradePolicy.d.ts",
"providers/upgrade/skipUpgradePolicy.js",
"providers/upgrade/onDemandComputePolicy.d.ts",
"providers/upgrade/onDemandComputePolicy.js",
"providers/upgrade/delayedComputePolicy.d.ts",
"providers/upgrade/delayedComputePolicy.js",
"providers/upgrade/recomputeHandler.d.ts",
"providers/upgrade/recomputeHandler.js",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️ Thank you!

Comment thread providers/index.d.ts Outdated
Copy link
Copy Markdown
Contributor

@JamieMagee JamieMagee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The upgrade/compute policy split is the right call here. I like that defaultFactory reproduces existing behavior without touching any of the new code paths, so delayed is purely opt-in.

I'd want the unbounded while(lock) spin in process.js addressed before merge. If a compute hangs, that loop stalls the queue processor forever. Also, the delayed path enqueues a new message for every request hitting the same missing coordinate. The consumer deduplicates, but at 7K-10K req/hr you're generating a lot of queue trash that gets dequeued, locked, checked, and thrown away. Even a quick computeLock.get() check before enqueuing would help.

Suppress duplicate queue messages for the same coordinate within a single
service instance. Before enqueuing, check an in-memory cache keyed by
coordinates.toString(); if already enqueued, skip and return the placeholder.

Cache is memory-only (not Redis): the goal is burst suppression within one
instance, not cross-instance dedup. Cross-instance correctness is already
handled by the consumer side — the shared Redis definition cache serializes
processing, and SkipUpgradePolicy skips recompute when a definition already
exists. Adding Redis here would introduce hot-path latency for a benefit
already provided elsewhere.

TTL is 5 minutes (matching the spin-lock TTL), well above the p95 compute
window (~6s). Cache key uses coordinates.toString() without toLowerCase() —
casing normalization is definitionService's concern only.

Cache is injectable via options.enqueueCache for testing.
The delayed compute path is meant to defer on-demand behavior to the
background, not introduce new retry semantics. The on-demand path computes
exactly once — if it fails, the next request retries. The delayed path
should mirror that: single attempt, demand-driven recovery.

Previously, a processing error left the message in the queue to be retried
up to 5 times (Azure visibility timeout). Now the message is always deleted
in a finally block. If no definition exists, the next HTTP request will
re-enqueue.
- Drop `void [...]` in defVersionCheck.setupProcessing; `_` prefix is sufficient
  to signal unused parameters
- Add @deprecated JSDoc + TODO to versionCheck and upgradeQueue legacy aliases
  in providers/index.d.ts
- Add comment in definitionService.js explaining force=true bypasses curations
  intentionally
- Add comment in recomputeHandler.setupProcessing clarifying it resolves after
  the first polling batch; queue consumers continue running in the background
The compute queue is expected to handle ~10K requests/hour when delayed
compute is enabled.

Azure Storage Queue caps getMessages at 32 per call, so there is no benefit
to making this configurable. Hardcoding to 32 maximizes throughput and
removes the accidental dependency on DEFINITION_UPGRADE_DEQUEUE_BATCH_SIZE,
which was inherited from the upgrade queue default and would have silently
under-provisioned the compute queue in any environment that sets that var.
Based on compute stats (p50: 245ms, p85: ~1min, p95: 16.35min over 20 days;
p95: 11min over 7 days), the previous 10-min visibilityTimeout and 5-min TTLs
were too short for the tail of the distribution.

- visibilityTimeout (both queues): 10 min → 20 min, covers p95 compute duration
- Spin lock TTL (DefinitionUpgrader): 5 min → 20 min, aligned with visibilityTimeout
  to prevent same-instance duplicate computes while a slow compute is still running
- Pre-enqueue cache TTL (DelayedComputePolicy): 5 min → 20 min, consistent with
  the compute window all three values are anchored to
@qtomlinson
Copy link
Copy Markdown
Collaborator Author

Follow-up: consolidate _upgradeLock into computeLock by refactoring computeAndStoreIfNecessary to accept a policy predicate.

@qtomlinson qtomlinson merged commit 17df9ab into clearlydefined:master Apr 4, 2026
4 checks passed
@qtomlinson qtomlinson deleted the qt/recompute-picked branch April 4, 2026 23:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants