feat(tdbg): schedule dedup and force-CAN commands #19
```go
attrs := resp.History.Events[0].GetWorkflowExecutionStartedEventAttributes()
if attrs == nil {
	return errors.New("first event is not WorkflowExecutionStarted")
}
```
I think we'd actually want to look at the latest memo, not the first (so basically the top event that has a full schedule in it). Otherwise, we'd potentially miss an update, aside from the inadvertent accumulation of duplicates.
The memo isn't enough, right? I would look from back to front for the last update signal payload. (And then replay any patch signals over that)
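The back-to-front scan suggested here could be sketched roughly as below. The `Event` type and the signal names ("update", "patch") are simplified stand-ins for illustration, not the real history protos or the scheduler's actual signal names:

```go
package main

import "fmt"

// Event is a simplified stand-in for a history event that may carry a
// signal name and payload.
type Event struct {
	SignalName string
	Payload    string
}

// lastUpdateWithPatches scans events back to front for the most recent
// "update" signal, then collects any later "patch" signals to replay
// over it, in order.
func lastUpdateWithPatches(events []Event) (base string, patches []string) {
	idx := -1
	for i := len(events) - 1; i >= 0; i-- {
		if events[i].SignalName == "update" { // assumed signal name
			idx = i
			base = events[i].Payload
			break
		}
	}
	if idx < 0 {
		return "", nil
	}
	for _, e := range events[idx+1:] {
		if e.SignalName == "patch" { // assumed signal name
			patches = append(patches, e.Payload)
		}
	}
	return base, patches
}

func main() {
	events := []Event{
		{SignalName: "update", Payload: "spec-v1"},
		{SignalName: "patch", Payload: "p1"},
		{SignalName: "update", Payload: "spec-v2"},
		{SignalName: "patch", Payload: "p2"},
	}
	base, patches := lastUpdateWithPatches(events)
	fmt.Println(base, patches) // prints: spec-v2 [p2]
}
```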
```go
// query size limit), deduplicates the spec, and (if execute) deletes the broken
// schedule and recreates it with the clean spec. Use when the workflow is too
// degraded to process an update signal.
func RunDedupRecreate(ctx context.Context, cl sdkclient.Client, namespace, scheduleID, outDir string, execute bool) error {
```
I'm assuming that the way we'll run this is with a small hand-evaluated file of schedules to target, versus "entire namespace", correct? One thing that might be useful as a failsafe would be to check and see if the schedule "looks" stuck (e.g., if the top events aren't "timer started" after "WFT completion"). I don't want to go overkill on a purpose-built tool like this, but I think the usefulness of the failsafe scales with the volume of input we plan to feed into it. WDYT?
Yes, we'd only run this on a subset of schedules. The easiest indicator will be whether the schedule can respond to a describe-schedule request. For dedup without recreate and execute, it will iterate over all schedules in a namespace and describe them; it will return an error (probably not one that's easy to parse) if it cannot describe the schedule. We can use recreate on those.
We can use the piped input to run this on a larger subset.
Adds schedutil, a CLI tool with two commands:
dedup - prints the current spec as JSON, deduplicates StructuredCalendar and Interval entries client-side, then sends an UpdateSchedule. Works without any server-side fix.
force-can - sends force-continue-as-new to the scheduler workflow.
Three targeting modes:
--schedule-id <id> single schedule
--ids-file <file> file of IDs, one per line ('-' for stdin)
(neither) all schedules in the namespace
Namespace-wide and file modes require --yes; without it the command
lists affected schedules and exits.
Flags and env vars mirror tdbg (TEMPORAL_CLI_ADDRESS, TLS certs,
TEMPORAL_CLI_NAMESPACE, TEMPORAL_CONTEXT_TIMEOUT).
Co-authored-by: Lina Jodoin <lina.jodoin@temporal.io>
…n mode

Without --execute the command always describes each schedule, writes before/after JSON files to a temp directory, and exits without applying any changes. With --execute the same files are written and the changes are applied.

- Renames --yes to --execute
- RunDedup writes <ns>_<sid>-before.json and <ns>_<sid>-after.json to a shared temp dir regardless of the execute flag
- RunForceCAN prints what would be signalled in dry-run mode
- ForEachSchedule drops the yes/execute param — callers own that logic
- Adds dry-run integration tests for both dedup and force-can
Removes --ids-file. Without --schedule-id, reads IDs one per line from stdin if piped, otherwise operates on all schedules in the namespace. Standard Unix behavior, no extra flag needed.
When a schedule workflow is too degraded to respond to queries or signals (spec too large), --recreate reads the schedule state directly from workflow history, deduplicates the spec, deletes the broken workflow, and recreates the schedule fresh.

Reads StartScheduleArgs from the WorkflowExecutionStarted event using payloads.Decode (the binary/protobuf encoding used by the frontend). Deduplicates StructuredCalendar and Interval entries using proto.Equal directly on the proto types — no normalization needed, since all entries from workflow history are in identical wire form. Preserves the full schedule state (action, policies, paused status, remaining actions), since StartScheduleArgs carries the current mutable state via ContinueAsNew arguments.

Adds unit tests for the proto dedup helpers and functional tests for both dry-run and execute modes.
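The order-preserving, pairwise-equality dedup described above can be sketched generically. The real helpers apply this shape with proto.Equal over the StructuredCalendar and Interval proto entries; here a plain struct and `==` stand in so the sketch is self-contained:

```go
package main

import "fmt"

// dedupBy removes duplicates from items, preserving first-occurrence
// order, using eq for pairwise comparison. The quadratic scan is fine
// for spec-sized inputs. The real tool uses proto.Equal as eq.
func dedupBy[T any](items []T, eq func(a, b T) bool) []T {
	var out []T
	for _, it := range items {
		dup := false
		for _, kept := range out {
			if eq(it, kept) {
				dup = true
				break
			}
		}
		if !dup {
			out = append(out, it)
		}
	}
	return out
}

// interval is a toy stand-in for an IntervalSpec proto entry.
type interval struct{ every, offset string }

func main() {
	specs := []interval{
		{"1h", "0s"}, {"1h", "0s"}, {"30m", "0s"}, {"1h", "0s"},
	}
	clean := dedupBy(specs, func(a, b interval) bool { return a == b })
	fmt.Println(len(clean)) // prints: 2
}
```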
What changed?
- Adds schedule dedup and schedule force-can subcommands to tdbg. Code lives in tools/schedutil/ (library) and tools/tdbg/schedule_dedup_commands.go (CLI wiring). All operations go through the workflowservice gRPC API directly — no SDK client.
- dedup targets a single schedule (--schedule-id), piped IDs from stdin, or all schedules in the namespace. Without --execute it writes before/after JSON artifacts and exits without modifying anything.
- dedup --recreate handles schedules too degraded to process an update: reads StartScheduleArgs from workflow history, deduplicates proto-level StructuredCalendar entries, then (with --execute) deletes and recreates the schedule via CreateSchedule. The high watermark resets to now; actions that would have fired during the degraded period will not fire.
- force-can sends a force-continue-as-new signal to the scheduler workflow. Supports the same targeting model and dry-run toggle as dedup.
- Dedup equality uses proto.Equal after normalizing range defaults (End, Step) to match cleanSpec semantics. Entries that differ only in proto default representation are treated as equal; entries with different Comment fields are treated as distinct.
- Adds FlagExecute and FlagRecreate to tools/tdbg/flags.go.

How did you test it?
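The range-default normalization mentioned above (End, Step) can be illustrated with a toy Range type. The default rules shown (End defaults to Start, Step defaults to 1) are assumptions about the calendar-range semantics the normalization targets, not the actual proto definitions:

```go
package main

import "fmt"

// Range mirrors the shape of a calendar range: Start..End every Step.
type Range struct{ Start, End, Step int32 }

// normalizeRange fills in assumed default values so that two ranges
// that mean the same thing compare equal before deduplication:
// End defaults to Start, Step defaults to 1.
func normalizeRange(r Range) Range {
	if r.End == 0 {
		r.End = r.Start
	}
	if r.Step == 0 {
		r.Step = 1
	}
	return r
}

func main() {
	a := normalizeRange(Range{Start: 5})
	b := normalizeRange(Range{Start: 5, End: 5, Step: 1})
	// Same meaning, different default representation: equal after
	// normalization.
	fmt.Println(a == b) // prints: true
}
```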
- Unit tests (tools/schedutil/schedutil_test.go)
- Functional tests (tests/schedutil_test.go)

Sample commands
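Illustrative invocations assembled from the flags described in this PR; exact flag spelling and global-flag placement are assumptions:

```
# Dry-run dedup of a single schedule (writes before/after JSON, changes nothing)
tdbg --namespace my-ns schedule dedup --schedule-id my-schedule

# Apply the dedup
tdbg --namespace my-ns schedule dedup --schedule-id my-schedule --execute

# Recreate a schedule too degraded to process an update signal
tdbg --namespace my-ns schedule dedup --schedule-id my-schedule --recreate --execute

# Pipe a hand-curated list of IDs; force-continue-as-new each scheduler workflow
cat ids.txt | tdbg --namespace my-ns schedule force-can --execute
```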