CLI tool that takes a GitHub issue number, SSHs into a remote GPU machine, and launches an agent (Claude/Codex/Cursor/pi) to autonomously investigate and fix the bug. The agent produces a report and a diff that you can review and turn into a PR.
No install step needed — just use uv run:
cd pt_job_queue
uv run ptq --help

For development (tests, web dashboard):
uv run --extra dev pytest
uv run ptq web

This assumes you have uv installed; otherwise, install it with curl -LsSf https://astral.sh/uv/install.sh | sh.
git clone git@github.com:drisspg/pt_job_queue.git
# This is the one intentional local PyTorch build for the seed workspace.
uv run ptq setup --local --build
uv run ptq run -m "tell me a story" --agent codex

To build the seed workspace from a non-default base, pass the target ref explicitly:
uv run ptq setup --local --build --onto upstream/viable/strict

For new PyTorch issues, prefer an issue-scoped workspace instead of reusing the shared local seed workspace. This keeps ghstack/rebase/build work for one issue from mutating generated ignored files that another issue's fast-path provisioning expects to match.
ISSUE=143260
WS="$HOME/.ptq_workspaces/pytorch-$ISSUE"
uv run ptq setup --local --workspace "$WS" --build
uv run ptq run --issue "$ISSUE" --local --workspace "$WS" --agent pi --no-follow
uv run ptq takeover "$ISSUE"

Use --build intentionally for a fresh isolated PyTorch workspace: it creates the built base venv that later worktree provisioning relies on for the fast clone path. Do not let an unprepared or mismatched workspace discover that problem by falling back during ptq worktree.
If the issue job/worktree already exists, skip setup and worktree creation; enter the recorded workspace instead:
uv run ptq list
uv run ptq takeover JOB_ID

ptq takeover reads the workspace from the job record, so you do not repeat --workspace there.
# Remote GPU machine (auto-detects CUDA version)
uv run ptq setup my-gpu-box --build
# Remote with explicit CUDA version
uv run ptq setup my-gpu-box --cuda cu130 --build
# Local (for testing/development)
uv run ptq setup --local --cpu --build

This creates a workspace with:
- A uv-managed venv with PyTorch nightly
- transformer_nuggets installed from https://github.com/drisspg/transformer_nuggets once torch is importable
- A pytorch source clone at the matching nightly commit
- Helper scripts for applying fixes to site-packages
When --build is used, setup resets the seed checkout to origin/main by default, then performs a full checkout nuke before editable install (git clean -dfx + submodule sync/update) to avoid stale CMake/Ninja graphs after upstream file moves. Use --onto REF to build from another ref, for example --onto upstream/viable/strict.
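The reset-and-nuke sequence described above can be sketched as an ordered list of git commands. This is an illustration only, not ptq's actual implementation: the function name `seed_reset_commands` is hypothetical, and the use of `git reset --hard` and of the `--recursive` and `--init` submodule flags is an assumption. Only `git clean -dfx`, the submodule sync/update step, and the `origin/main` default come from the description.

```python
def seed_reset_commands(onto: str = "origin/main") -> list[list[str]]:
    """Return the git commands for a clean seed checkout, in order.

    Hypothetical helper: the exact commands ptq runs may differ.
    """
    return [
        # Move the seed checkout to the target ref (assumed to be a hard reset).
        ["git", "reset", "--hard", onto],
        # Remove untracked and ignored files, including stale CMake/Ninja state.
        ["git", "clean", "-dfx"],
        # Re-sync submodule URLs and check out the pinned submodule commits.
        ["git", "submodule", "sync", "--recursive"],
        ["git", "submodule", "update", "--init", "--recursive"],
    ]


# Print the sequence for a non-default base, as with --onto.
for cmd in seed_reset_commands("upstream/viable/strict"):
    print(" ".join(cmd))
```

The commands are returned as argument lists rather than executed, so the order can be inspected or logged before anything destructive runs.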
Speed up C++ rebuilds: Install system NCCL to skip building it from source (~5 min savings per rebuild):
sudo apt install -y libnccl-dev

Then add to ~/.ptq/config.toml:
[build.env]
USE_SYSTEM_NCCL = "1"

# On a remote machine
uv run ptq worktree flex-attn --machine my-gpu-box
# Locally (default when no --machine)
uv run ptq worktree my-fix --local
# Only when debugging provisioning output
uv run ptq worktree stride-fix --machine my-gpu-box -v

Creates a PyTorch git worktree with a ready-to-use venv, without launching an agent. Useful when you want to work in the worktree yourself or defer agent launch. Run ptq setup ... first: worktree assumes the workspace already exists and has a compatible built base venv. The command prints the shell command to enter the worktree.
Before creating a new worktree, run uv run ptq list; if a job or named worktree already exists for the issue, use uv run ptq takeover JOB_ID instead. -v/--verbose only streams provisioning output. It does not make worktree creation faster and should not be part of the default new-issue path.
Later, launch an agent in the same worktree by name:
uv run ptq run flex-attn -m "optimize the CPU codegen"

The worktree shows up in ptq list and can be cleaned with ptq clean like any other job.
# On a remote machine
uv run ptq run --issue 174923 --machine my-gpu-box
# Locally
uv run ptq run --issue 174923 --local
# Run in background (don't stream output)
uv run ptq run --issue 174923 --machine my-gpu-box --no-follow
# Ad-hoc task (no issue, just a message)
uv run ptq run --machine my-gpu-box -m "Optimize the flex attention CPU codegen"
# Issue + extra context
uv run ptq run --issue 174923 --machine my-gpu-box -m "Focus on the stride logic"
# Use a preset template from the prompt library
uv run ptq run --issue 174923 --machine my-gpu-box -p diagnose_and_plan
# Preset + extra instructions (appends your -m text)
uv run ptq run --issue 174923 --machine my-gpu-box -p fix_and_verify -m "focus only on scaled_mm path"
# Use a different agent
uv run ptq run --issue 174923 --machine my-gpu-box --agent cursor --model gpt-5.3-codex-xhigh-fast
# Use first-class thinking control when the backend supports it
uv run ptq run --agent pi --model openai-codex/gpt-5.5 --thinking high -m "triage the repro"

The agent will:
- Reproduce the bug using a repro script extracted from the issue
- Read pytorch source to find the root cause
- Apply a minimal Python-only fix
- Test the fix by copying edits to site-packages and re-running the repro
- Write report.md and fix.diff
Re-running the same issue reuses the existing worktree and preserves prior edits. Each run gets its own log (claude-1.log, claude-2.log, ...). Different issues run concurrently via separate git worktrees. Fresh workspaces still need an explicit ptq setup ... first.
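The per-run log numbering can be sketched as follows. This is a minimal illustration, not ptq's code: the helper name `next_log_name` is hypothetical, and it assumes each log is named after the agent that produced it, which the claude-1.log example suggests but does not confirm.

```python
import re
import tempfile
from pathlib import Path


def next_log_name(job_dir: Path, agent: str) -> str:
    """Return the log file name for the next run of `agent` in `job_dir`."""
    pattern = re.compile(rf"^{re.escape(agent)}-(\d+)\.log$")
    # Collect the run numbers of existing logs for this agent.
    runs = [int(m.group(1)) for p in job_dir.iterdir() if (m := pattern.match(p.name))]
    # The first run is 1; later runs continue from the highest existing number.
    return f"{agent}-{max(runs, default=0) + 1}.log"


with tempfile.TemporaryDirectory() as tmp:
    job_dir = Path(tmp)
    (job_dir / "claude-1.log").touch()
    (job_dir / "claude-2.log").touch()
    print(next_log_name(job_dir, "claude"))  # claude-3.log
    print(next_log_name(job_dir, "codex"))   # codex-1.log
```

Taking the maximum existing number rather than counting files keeps the numbering monotonic even if an older log has been deleted.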
uv run ptq web
# or on a custom port
uv run ptq web --port 9000

The web UI lets you:
- Launch jobs (issue-based or ad-hoc) with agent/model/thinking/machine selection
- Fill the message box from a built-in prompt library for Repro Only, Diagnose And Plan, and Fix And Verify
- Monitor live logs via streaming
- View reports, diffs, and worklogs
- Follow up on stopped jobs with steering messages
- Take Over — copies an SSH command that drops you into the job's worktree with the venv activated
- Create PRs directly from the UI
Add a screenshot at docs/assets/web-ui.png and this README will render it automatically.
The prompt library is backed by ~/.ptq/config.toml.
Per-agent model defaults live there too. For backends with first-class reasoning controls, you can set thinking separately from the model:
[models.pi]
default = "openai-codex/gpt-5.5"
thinking = "high"
[models.claude]
default = "opus"
thinking = "high"
[models.codex]
default = "gpt-5.4"
thinking = "high"

Cursor currently encodes reasoning level in the model name itself, so PTQ continues to treat Cursor thinking as model-driven.
- Built-ins are always available and can be overridden under [prompt_library.builtin.<name>]
- User presets can be added under [prompt_library.custom.<name>]
List everything available from CLI with:
ptq presets

# Peek at the agent's worklog
uv run ptq peek 174923
# Peek with recent log activity
uv run ptq peek 174923 --log 30
# List all jobs with running/stopped status
uv run ptq list
# Watch PR jobs and CI follow-up status
uv run ptq monitor
uv run ptq monitor --watch
# Run bounded CI supervision over monitor rows and print worker prompts
uv run ptq supervise
uv run ptq supervise --prompts
uv run ptq supervise --watch --interval 300
# Open the monitor in a two-pane Herdr workspace
uv run ptq monitor --herdr
# Open an interactive Herdr workspace for a job
uv run ptq open JOB_ID

The agent maintains a worklog.md with entries after each significant step, so you can check progress without streaming the full output.
ptq monitor is a mergedog-style PR monitor for PTQ-created PR jobs and stopped jobs that look ready for PR creation. It shows one row per relevant job and classifies the next phase. It colors the PR column by open/draft/closed state, prints ptq takeover JOB_ID-equivalent shell-entry commands for opening Herdr workspaces, and marks actively merging PRs as landing even if checks are red. On terminals with OSC-8 hyperlink support, the Issue and PR cells are clickable GitHub links. Failing CI review rows point at the local triage helper:
~/dotfiles/scripts/github_ci_triage PR_URL

When a landing attempt has stopped and Dr. CI clearly reports only unrelated, flaky, or broken-trunk failures, the monitor prints a PyTorchBot merge-ignore command instead:
gh pr comment PR_URL --body '@pytorchbot merge -i'

Use uv run ptq monitor --all to include all jobs, even if they do not have a recorded PR or ready PR artifacts yet. uv run ptq monitor --watch uses Rich's alternate-screen live view so resizing a Herdr pane or terminal redraws cleanly instead of leaving wrapped table fragments in scrollback. uv run ptq open JOB_ID creates an interactive Herdr workspace using uv run ptq takeover JOB_ID as the source of truth for where to start.
uv run ptq supervise adds a read-only triage layer above the raw monitor table. For failing CI rows it fetches the latest Dr. CI comment, runs ~/dotfiles/scripts/github_ci_triage PR_URL, saves the transcript under agent_space/supervisor/JOB_ID/, and classifies the row as needs fix, unrelated CI, merge-ignore candidate, or needs human review. merge-ignore candidate only implies @pytorchbot merge -i for PRs that are actively landing or whose landing attempt just stopped. The monitor/supervisor use explicit Dr. CI/HUD AI verdict text when available, but not badge images alone. Use uv run ptq supervise --prompts to print a worker prompt that tells a Pi/subagent exactly how to gather logs, apply the trust boundary, and classify each failure without editing code or posting comments.
Each PTQ job also gets a prime.md handoff file in the job directory. For a fresh manual Pi in an opened job workspace, start from the job directory and load @prime.md; it points the subagent at PTQ_CONTEXT.md, system_prompt.md, worklog.md, report.md, and the source repo AGENTS.md before editing.
The main driver skill lives at .agents/skills/driver/SKILL.md, and .pi/prompts/driver.md provides /driver in interactive Pi. Use /driver in your primary Herdr driver pane to coordinate PTQ workspaces.
The monitor operator skill lives at .agents/skills/monitor/SKILL.md, and .pi/prompts/monitor.md provides /monitor in interactive Pi. In the operator pane, use /monitor or start Pi with --skill .agents/skills/monitor so it uses the PTQ monitor workflow, CI triage helper, and optional HUD checks.
# By issue number (uses most recent job)
uv run ptq results 174923
# By full job ID
uv run ptq results 20260214-174923

Fetches report.md, fix.diff, worklog.md, and the run log from the remote.
uv run ptq apply 174923 --pytorch-path ~/meta/pytorch

Creates a branch ptq/{issue_number}, applies the diff, and prints next steps for creating a PR.
# Check status of a specific job
uv run ptq status 174923
# Kill a specific agent
uv run ptq kill 174923
# Kill all agents on a machine (tracked + zombie processes)
uv run ptq prune my-gpu-box
# Kill all local agents
uv run ptq prune --local

# Remove all jobs on a machine
uv run ptq clean my-gpu-box
# Keep the 3 most recent
uv run ptq clean my-gpu-box --keep 3
# Clean local workspace
uv run ptq clean --local

Removes job directories and prunes git worktrees.
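The effect of --keep can be sketched as a split of job IDs into kept and removed sets. This is a simplified illustration: `split_for_clean` is a hypothetical helper, and it assumes job IDs start with a YYYYMMDD date (as in 20260214-174923) so that sorting them as strings approximates recency. The other job IDs below are made up, and ptq may order jobs by a recorded timestamp instead.

```python
def split_for_clean(job_ids: list[str], keep: int = 0) -> tuple[list[str], list[str]]:
    """Return (kept, removed) job IDs, keeping the `keep` most recent."""
    # Date-prefixed IDs sort chronologically as strings; newest first.
    newest_first = sorted(job_ids, reverse=True)
    return newest_first[:keep], newest_first[keep:]


jobs = ["20260210-100", "20260214-174923", "20260212-200"]  # last two are hypothetical
kept, removed = split_for_clean(jobs, keep=1)
print(kept)     # ['20260214-174923']
print(removed)  # ['20260212-200', '20260210-100']
```

With the default `keep=0`, every job lands in the removed list, matching a plain ptq clean.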
| Flag | Command | Default | Description |
|---|---|---|---|
| --cuda | setup | auto-detect | CUDA tag (cu124, cu126, cu128, cu130) |
| --cpu | setup | | Use CPU-only PyTorch (macOS/testing) |
| --machine | run, worktree | | Remote machine hostname |
| --local | setup, run, worktree, clean, prune | | Use local workspace instead of SSH |
| --follow/--no-follow | run | follow | Stream agent output to terminal |
| --agent | run | claude | Agent (claude, codex, cursor, pi) |
| --model | run | opus | Model name (agent-specific) |
| --thinking | run | agent default | Reasoning/thinking level when supported by the agent |
| --max-turns | run | 100 | Max agent turns |
| -m/--message | run | | Ad-hoc task or extra context for an issue |
| -p/--preset | run | | Prompt preset key/title from prompt library |
| --workspace | setup, run, worktree, prune | ~/ptq_workspace | Custom workspace path |
| --onto | setup, rebase | origin/main | Target ref for resetting the seed checkout or rebasing a job |
| --keep | clean | 0 | Number of recent jobs to keep |
| --log | peek | 0 | Number of log lines to show |
- Add a [repos.<name>] section to ~/.ptq/config.toml:
[repos.torchtitan]
github_repo = "pytorch/torchtitan"
clone_url = "https://github.com/pytorch/torchtitan.git"
dir_name = "torchtitan"
smoke_test_import = "torchtitan"
repro_import_hint = "import torchtitan"

- Create prompt templates in prompts/:
  - prompts/investigate_<name>.md: issue investigation prompt
  - prompts/adhoc_<name>.md: freeform task prompt
The prompt templates are where the real work is — they teach the agent about the repo's build system, directory layout, debugging tools, and testing conventions. See the existing investigate.md and investigate_torchtitan.md for examples.
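The template naming convention can be sketched as a small path helper. This is a minimal sketch: `prompt_template` is a hypothetical name, and treating the key "pytorch" as the repo that uses the unsuffixed investigate.md and adhoc.md files is an assumption based on the existing template names.

```python
from pathlib import Path


def prompt_template(repo: str, adhoc: bool = False, prompts_dir: Path = Path("prompts")) -> Path:
    """Return the template path for a repo profile and task type."""
    kind = "adhoc" if adhoc else "investigate"
    # PyTorch templates carry no suffix; other repos use _<name>.
    suffix = "" if repo == "pytorch" else f"_{repo}"
    return prompts_dir / f"{kind}{suffix}.md"


print(prompt_template("torchtitan"))           # issue investigation prompt
print(prompt_template("pytorch", adhoc=True))  # freeform task prompt
```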
Optional profile fields (all default to false/null):
| Field | Description |
|---|---|
| uses_custom_worktree_tool | Use tools/create_worktree.py instead of git worktree add |
| needs_cpp_build | Run C++ rebuild after worktree creation |
| lint_cmd | Lint command to run before PRs |
pt_job_queue/
├── pyproject.toml
├── ptq/
│ ├── cli.py # Thin Typer CLI adapter
│ ├── ssh.py # SSH/SCP + local subprocess backends
│ ├── issue.py # GitHub issue fetching via gh
│ ├── agent.py # Prompt construction + text utilities
│ ├── agents.py # Agent protocol + claude/codex/cursor/pi
│ ├── config.py # Config loading (~/.ptq/config.toml)
│ ├── workspace.py # Remote workspace setup
│ ├── domain/
│ │ ├── models.py # JobRecord, RunRequest, JobStatus, errors
│ │ └── policies.py # Job ID generation
│ ├── infrastructure/
│ │ ├── job_repository.py # JSON persistence (~/.ptq/jobs.json)
│ │ └── backends.py # Backend factory functions
│ ├── application/
│ │ ├── run_service.py # Launch/rerun orchestration
│ │ ├── worktree_service.py # Worktree + venv provisioning
│ │ ├── job_service.py # Status/kill/clean/list
│ │ ├── artifact_service.py # Results fetching + diff apply
│ │ └── pr_service.py # PR creation workflow
│ └── web/
│ ├── app.py # FastAPI app factory
│ ├── deps.py # Template + status helpers
│ ├── routes.py # Thin web route adapter
│ ├── static/style.css # Dark-theme styles
│ └── templates/ # Jinja2 templates (Pico CSS + htmx)
├── prompts/
│ ├── investigate.md # PyTorch issue investigation prompt
│ ├── adhoc.md # PyTorch freeform task prompt
│ ├── investigate_torchtitan.md # TorchTitan issue investigation prompt
│ └── adhoc_torchtitan.md # TorchTitan freeform task prompt
└── scripts/
└── rebuild.sh
~/ptq_workspace/
├── .venv/ # uv-managed, PyTorch nightly
├── pytorch/ # Source clone at nightly commit
├── scripts/apply_to_site_pkgs.sh # Copies edits to site-packages
└── jobs/
└── 20260214-174923/ # Per-issue job directory
├── pytorch/ # git worktree (isolated)
├── system_prompt.md
├── repro.py
├── claude-1.log # Per-run logs
├── claude-2.log
├── worklog.md # Agent progress log
├── report.md
└── fix.diff
