Support delaying definition recompute to the background#1495
Support delaying definition recompute to the background#1495qtomlinson merged 44 commits intoclearlydefined:masterfrom
Conversation
… compute policy path
- Rename queueComputePolicy files/classes to delayedComputePolicy / DelayedComputePolicy - Update imports, tests, and tsconfig includes to new names - Add upgrade.service.delayed alias to delayed recompute factory - Keep upgradeQueue for backward compatibility (deprecated in typings)
- Add existingDefinitionPolicy to treat existing definitions as valid on compute queue processing - Wire delayed compute setupProcessing to use existingDefinitionPolicy - Add tests for skip-on-existing and recompute-on-missing compute-queue behavior - Include new policy files in tsconfig includes
- Create a shared in-memory lock cache in RecomputeHandler - Pass shared cache to both upgrade and delayed compute setupProcessing paths - Extend process.setup to accept optional sharedCache parameter - Keep existingDefinitionPolicy for compute missing-only semantics - Align JS and d.ts signatures for optional shared cache injection
c0d5a98 to
195f946
Compare
The previous name implied a domain concept about definition existence. The class is a null-behavior upgrade validator that skips staleness checks entirely — SkipUpgradePolicy better expresses that intent. - Rename files and class in providers/upgrade/ - Update import in delayedComputePolicy - Update tsconfig includes
defaultFactory only uses logger from its options object; any other
properties (e.g. queue passed from app wiring) were silently ignored.
Destructure { logger } in the parameter to make this explicit at the
call boundary, and narrow the type declaration accordingly.
…terface The UpgradeHandler interface declares validate returning Promise<Definition | null>. SkipUpgradePolicy was returning undefined on the falsy path instead of null, causing a minor type mismatch. Changed return to null to align with the interface.
- Convert OnDemandComputePolicy from plain object to named class so getPolicyName() returns a meaningful name instead of "Object" - Remove createOnDemandComputePolicy and createDelayedComputePolicy factory functions as they were thin wrappers over new - Update recomputeHandler and tests to use constructors directly
The compute queue processor uses SkipUpgradePolicy, which ignores definition contents entirely. Only coordinates are read from the message, so the empty _meta field was dead weight.
Pass a plain DefinitionVersionChecker instead of relying on setup()'s default parameter, making it clear the processor intentionally uses a version checker (not the queue upgrader itself) to avoid re-queuing stale definitions during background processing.
…andler RecomputeHandler is the authoritative owner of the shared cache, so its TTL constant belongs there. DefinitionUpgrader.defaultTtlSeconds made private (_defaultTtlSeconds) as it is an internal fallback only. Removes the implicit coupling where changing the upgrader TTL would silently affect the handler cache.
- DelayedComputePolicy constructor now requires options (removes misleading default stub) - OnDemandComputePolicy.setupProcessing is now async (consistent with interface and sibling class) - SkipUpgradePolicy drops currentSchema getter/setter (irrelevant since it skips staleness checks) - Update .d.ts files to match
- Resolve conflict in definitionServiceTest.ts: keep recomputeHandler with full mock (initialize, setupProcessing, compute) while adopting TS types from master's upgradeHandler refactor - Add types/chai-as-promised.d.ts to fix rejectedWith TS type errors introduced by master's test-to-TypeScript migration
…ation The Function type is required to match chai-as-promised's API signature for error constructors passed to rejectedWith.
Replace direct DefinitionVersionChecker/DefinitionQueueUpgrader instantiation with defaultFactory/delayedFactory to test the real production wiring. Move on-demand-specific tests out of the shared helper since they don't apply to the queued path. Extract duplicated passthrough recompute handler into createPassthroughRecomputeHandler.
Internal methods _queueCompute and _constructMessage should not be part of the public type surface.
… Promise<Definition> No implementation returns undefined — _cast() at the call site would crash if it did.
…tlSeconds from public API sharedCache: The 'shared' qualifier is an implementation detail of RecomputeHandler, which creates one cache and passes it to both policies to coordinate locking. The receiving layers (setupProcessing, setup, DefinitionUpgrader) don't care whether the cache is shared — cache is the right name there. defaultTtlSeconds: _defaultTtlSeconds is a private field in DefinitionUpgrader. It was incorrectly declared as a public static in process.d.ts. Removed from the type declaration; the default TTL remains in the implementation.
29d46d8 to
b80f4c9
Compare
| "providers/upgrade/computePolicy.d.ts", | ||
| "providers/upgrade/skipUpgradePolicy.d.ts", | ||
| "providers/upgrade/skipUpgradePolicy.js", | ||
| "providers/upgrade/onDemandComputePolicy.d.ts", | ||
| "providers/upgrade/onDemandComputePolicy.js", | ||
| "providers/upgrade/delayedComputePolicy.d.ts", | ||
| "providers/upgrade/delayedComputePolicy.js", | ||
| "providers/upgrade/recomputeHandler.d.ts", | ||
| "providers/upgrade/recomputeHandler.js", |
JamieMagee
left a comment
There was a problem hiding this comment.
The upgrade/compute policy split is the right call here. I like that defaultFactory reproduces existing behavior without touching any of the new code paths, so delayed is purely opt-in.
I'd want the unbounded while(lock) spin in process.js addressed before merge. If a compute hangs, that loop stalls the queue processor forever. Also, the delayed path enqueues a new message for every request hitting the same missing coordinate. The consumer deduplicates, but at 7K-10K req/hr you're generating a lot of queue trash that gets dequeued, locked, checked, and thrown away. Even a quick computeLock.get() check before enqueuing would help.
Suppress duplicate queue messages for the same coordinate within a single service instance. Before enqueuing, check an in-memory cache keyed by coordinates.toString(); if already enqueued, skip and return the placeholder. Cache is memory-only (not Redis): the goal is burst suppression within one instance, not cross-instance dedup. Cross-instance correctness is already handled by the consumer side — the shared Redis definition cache serializes processing, and SkipUpgradePolicy skips recompute when a definition already exists. Adding Redis here would introduce hot-path latency for a benefit already provided elsewhere. TTL is 5 minutes (matching the spin-lock TTL), well above the p95 compute window (~6s). Cache key uses coordinates.toString() without toLowerCase() — casing normalization is definitionService's concern only. Cache is injectable via options.enqueueCache for testing.
The delayed compute path is meant to defer on-demand behavior to the background, not introduce new retry semantics. The on-demand path computes exactly once — if it fails, the next request retries. The delayed path should mirror that: single attempt, demand-driven recovery. Previously, a processing error left the message in the queue to be retried up to 5 times (Azure visibility timeout). Now the message is always deleted in a finally block. If no definition exists, the next HTTP request will re-enqueue.
- Drop `void [...]` in defVersionCheck.setupProcessing; `_` prefix is sufficient to signal unused parameters - Add @deprecated JSDoc + TODO to versionCheck and upgradeQueue legacy aliases in providers/index.d.ts - Add comment in definitionService.js explaining force=true bypasses curations intentionally - Add comment in recomputeHandler.setupProcessing clarifying it resolves after the first polling batch; queue consumers continue running in the background
The compute queue is expected to handle ~10K requests/hour when delayed compute is enabled. Azure Storage Queue caps getMessages at 32 per call, so there is no benefit to making this configurable. Hardcoding to 32 maximizes throughput and removes the accidental dependency on DEFINITION_UPGRADE_DEQUEUE_BATCH_SIZE, which was inherited from the upgrade queue default and would have silently under-provisioned the compute queue in any environment that sets that var.
Based on compute stats (p50: 245ms, p85: ~1min, p95: 16.35min over 20 days; p95: 11min over 7 days), the previous 10-min visibilityTimeout and 5-min TTLs were too short for the tail of the distribution. - visibilityTimeout (both queues): 10 min → 20 min, covers p95 compute duration - Spin lock TTL (DefinitionUpgrader): 5 min → 20 min, aligned with visibilityTimeout to prevent same-instance duplicate computes while a slow compute is still running - Pre-enqueue cache TTL (DelayedComputePolicy): 5 min → 20 min, consistent with the compute window all three values are anchored to
|
Follow-up: consolidate _upgradeLock into computeLock by refactoring computeAndStoreIfNecessary to accept a policy predicate. |
Based on Application Insights logging, "definition not available" events occur at 7K–10K per hour (data window: 12–18 hours). These definitions are currently computed on demand, increasing response time.
This PR adds a delayed compute path: when a definition is missing from the blob store, the service immediately returns an empty placeholder (a schema-valid definition with no tools), queues a background compute job, and triggers harvest if needed.
UpgradeHandleris extended intoRecomputeHandlerusing a policy-composed architecture with two independent paths:upgradePolicy): evaluates existing definitions against schema rules and queues upgrade work when stale. Unchanged from existingDefinitionVersionCheck/DefinitionQueueUpgraderbehavior. Queue errors are swallowed — returning a stale definition that will be retried on the next request is acceptable. This is the existing upgrade workflow.computePolicy): handles requests where no stored definition exists(MissingDefinitionComputePolicy)OnDemandComputePolicy— computes synchronously (existing behavior, default).DelayedComputePolicy— returns a placeholder immediately and enqueues background compute. Queue errors propagate to the caller — a missing definition with no queued work is worse than a failed request.Delayed queue processing uses missing-only semantics (
SkipUpgradePolicy) so the background processor backfills missing definitions only, not stale ones. A shared in-memory cache is passed to both processing paths to prevent duplicate concurrent compute.The provider is selected via
DEFINITION_UPGRADE_PROVIDER:onDemand(default) ordelayed. Backward-compatible aliasesversionCheckandupgradeQueueare preserved.Class names and environment variables related to upgrade are minimally changed and can be updated in a follow-up if needed.
Test Cases
In addition to the unit tests and integration tests added, the following end to end test cases were completed.