Skip to content

feat(cli): llama.cpp-compatible --ngl/--device GPU selection flags#244

Merged
pekkah merged 3 commits into
masterfrom
feat/cli-ngl-device-flags
Jun 14, 2026
Merged

feat(cli): llama.cpp-compatible --ngl/--device GPU selection flags#244
pekkah merged 3 commits into
masterfrom
feat/cli-ngl-device-flags

Conversation

@pekkah

@pekkah pekkah commented Jun 14, 2026

Copy link
Copy Markdown
Owner

Reimplements stale PR #144 fresh on current master (that branch had conflicts in hot files; the core here is a clean new GpuDevice helper).

What

Adds llama.cpp flag aliases and single-GPU device selection to both the run and image commands.

  • --ngl / --gpu-layers — aliases for the existing -g / --n-gpu-layers. -g is kept (non-breaking; cli: match llama.cpp GPU flag conventions (--ngl, --device) and align flag names #144 had dropped it).
  • --model-draft (alias of --draft-model) and --repeat-penalty (alias of --rep-penalty) — llama.cpp parity.
  • --device <auto|none|cpu|INDEX|NAMEINDEX> — e.g. 0, CUDA0, Vulkan1. Parsing lives in the new GpuDevice.Resolve:
    • pins CUDA via CUDA_VISIBLE_DEVICES (process-wide; only set when the user hasn't already constrained it),
    • returns the index for Vulkan's explicit physical-device selector,
    • none / cpu forces the CPU path (overrides --ngl),
    • single-device only — comma-separated multi-GPU lists are rejected (SharpInference has no tensor/row split).

VulkanBackend gains VulkanBackend(int deviceIndex = -1) + SelectPhysicalDevice(int) with a bounds-checked, compute-queue-checked explicit-index path. The default -1 preserves the prior discrete-GPU-preferred auto-select, so all existing callers (server, tests, FLUX) are byte-for-byte unchanged.

Verification

  • Full solution builds clean (TreatWarningsAsErrors, 0 warnings).
  • CLI smoke tests of every --device parse branch:
    • --device foo → "expected a device index…"
    • --device 0,1 → "multi-device split is not supported"
    • --device=-1 → rejected (won't mis-parse as device 1)
    • --device CUDA3 / --device none → valid, proceeds to model load
    • --ngl, --model-draft, --repeat-penalty recognized and shown in --help.

🤖 Generated with Claude Code

Reimplements the stale PR #144 fresh on master. Adds llama.cpp flag
aliases and single-GPU device selection to both `run` and `image`:

- `--ngl`/`--gpu-layers` as aliases for the existing `-g`/`--n-gpu-layers`
  (kept `-g` — non-breaking, unlike #144 which dropped it).
- `--model-draft` (alias of `--draft-model`) and `--repeat-penalty`
  (alias of `--rep-penalty`) for llama.cpp parity.
- `--device <auto|none|cpu|INDEX|NAMEINDEX>` (e.g. `0`, `CUDA0`,
  `Vulkan1`). Parsing lives in the new `GpuDevice.Resolve`: it pins CUDA
  via CUDA_VISIBLE_DEVICES (process-wide, only when the user hasn't set
  it) and returns the index for Vulkan's explicit physical-device
  selector. `none`/`cpu` forces the CPU path, overriding `--ngl`.
  Single-device only (no multi-GPU split) — comma lists are rejected.

VulkanBackend gains `VulkanBackend(int deviceIndex = -1)` +
`SelectPhysicalDevice(int)` with an explicit-index path (bounds-checked,
compute-queue-checked); the default `-1` preserves the prior
discrete-GPU-preferred auto-select, so existing callers are unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for the --device option to pin a single GPU or disable GPU offloading via --device none, mirroring llama.cpp behavior. It adds a GpuDevice helper to parse the option, updates the CLI commands to support the new flag and its aliases, and updates VulkanBackend to select the specified physical device. The feedback suggests simplifying the device string parsing logic in GpuDevice.cs and adding a status message in RunZImage when the upscaler falls back to CPU under --device none to match the behavior in RunFlux.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +45 to +57
// Strip an optional leading backend name (CUDA/Vulkan/GPU/…) and read the trailing index.
int i = value.Length;
while (i > 0 && char.IsDigit(value[i - 1])) i--;
var digits = value[i..];
// The part before the index must be empty or a plain backend name (letters only) —
// this rejects things like "-1" (which would otherwise parse as device 1).
bool prefixOk = true;
for (int j = 0; j < i; j++)
if (!char.IsLetter(value[j])) { prefixOk = false; break; }
if (!prefixOk || digits.Length == 0 || !int.TryParse(digits, out int index) || index < 0)
throw new InvalidOperationException(
$"--device '{device}': expected a device index (0, 1, …), a named device " +
"(CUDA0, Vulkan1), 'auto', or 'none'.");

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic to parse the trailing index and validate the prefix can be simplified and made more robust. Currently, the loop checks if each character in the prefix is a letter, but we can use int.TryParse on the trailing digits and check if the remaining prefix is entirely alphabetic more cleanly. Additionally, we should ensure that we handle cases where there are no digits at all or if the index parsing fails due to overflow (which is already handled by int.TryParse returning false, but we can make the error message more specific). Let's refactor this parsing logic to be more readable and robust.

        // Strip an optional leading backend name (CUDA/Vulkan/GPU/…) and read the trailing index.
        int i = value.Length;
        while (i > 0 && char.IsDigit(value[i - 1])) i--;
        var digits = value[i..];
        
        bool prefixOk = true;
        for (int j = 0; j < i; j++)
        {
            if (!char.IsLetter(value[j]))
            {
                prefixOk = false;
                break;
            }
        }

        if (!prefixOk || digits.Length == 0 || !int.TryParse(digits, out int index) || index < 0)
        {
            throw new InvalidOperationException(
                $"--device '{device}': expected a device index (0, 1, …), a named device " +
                "(CUDA0, Vulkan1), 'auto', or 'none'.");
        }

// default (-1) → CUDA → Vulkan → CPU fallback
string backendChoice = (s.Backend ?? "auto").ToLowerInvariant();
bool forceCpu = s.NGpuLayers == 0 || backendChoice == "cpu";
bool forceCpu = s.NGpuLayers == 0 || backendChoice == "cpu" || deviceNone;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

In RunZImage, when deviceNone is true, the main compute backend correctly falls back to CPU. However, unlike RunFlux (which explicitly logs [dim]Upscaler backend:[/] CPU (--device none)), RunZImage silently falls back to CPU for the upscaler without printing any status message. We should add a check for deviceNone or gpu == null to print a clear status message when the upscaler falls back to CPU.

                        bool forceCpu    = s.NGpuLayers == 0 || backendChoice == 

pekkah and others added 2 commits June 14, 2026 12:19
Addresses PR review (silent-failure-hunter + pr-test-analyzer):

- ImageCommand swallowed an explicit `--device N` failure into a silent
  CPU fallback (Z-Image printed the actively-wrong "no GPU detected";
  FLUX printed nothing at all). Now the backend-creation fallback
  `catch` blocks are gated on `deviceIndex < 0` (auto-select only), so an
  explicit device error propagates to the outer handler and is reported.
  Auto-select behavior is unchanged.
- Warn when `--device none` contradicts an explicit GPU request
  (`--ngl/-g` in run, `--backend cuda|vulkan` in image) instead of
  silently dropping to CPU.
- Add SharpInference.Tests.Cli with 25 GpuDevice.Resolve cases covering
  every parse branch (incl. the `-1`-must-not-mis-parse-as-device-1
  guard and int overflow) and the CUDA_VISIBLE_DEVICES pin/preserve side
  effect. Exposed via InternalsVisibleTo; xunit collection parallelism is
  off so the env-mutating cases stay deterministic.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
RunZImage silently used the CPU upscaler when no GPU backend was
selected (incl. --device none), while RunFlux logged it. Print the
matching "Upscaler backend: CPU (--device none)" / "CPU" line.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@pekkah

pekkah commented Jun 14, 2026

Copy link
Copy Markdown
Owner Author

Addressed the review feedback across three commits:

Silent-failure-hunter (CRITICAL #1/#2): An explicit --device N that was invalid/absent was being swallowed into a silent CPU fallback in ImageCommand (Z-Image printed the actively-wrong "no GPU detected"; FLUX printed nothing). Fixed by gating every backend-creation fallback catch on deviceIndex < 0 — auto-select still falls back, but an explicitly-named device now propagates its error to the outer handler and is reported. Verified: image --device 9 --backend vulkanError: --device 9: only 2 Vulkan device(s) present (valid indices 0..1).

Silent-failure-hunter (#5): --device none contradicting an explicit GPU request now prints a Note: (run: vs --ngl/-g; image: vs --backend) instead of silently dropping to CPU.

pr-test-analyzer (#9): Added SharpInference.Tests.Cli (25 cases) covering every GpuDevice.Resolve parse branch — including the -1-must-not-mis-parse-as-device-1 guard and int overflow — plus the CUDA_VISIBLE_DEVICES pin/preserve side effect. Exposed via InternalsVisibleTo; collection parallelism is off so the env-mutating cases are deterministic.

gemini (ImageCommand.cs:249): Z-Image now logs the upscaler CPU fallback (Upscaler backend: CPU (--device none) / CPU) to match FLUX.

gemini (GpuDevice.cs:57): Declined — the suggested refactor is byte-for-byte identical to the current loop (same digit-strip + letters-only prefix check). The existing logic is now locked down by the 25 unit tests above (the -1/0x1F/CUDA/overflow rejection cases specifically).

silent-failure-hunter (#3, CUDA index validation): Not adding a CUDA-side device-count check. GpuDevice.Resolve runs before backend selection and can't know whether the target is CUDA or Vulkan, so a CUDA-count check there would wrongly reject a valid Vulkan-only index. After the #1/#2 fix every invalid explicit device does surface an error (Vulkan validates in-ctor; an explicit CUDA index that doesn't exist fails loudly at cublasCreate). The residual is message wording on the CUDA path, not a silent failure — left as a known limitation.

@pekkah pekkah merged commit ee1fd37 into master Jun 14, 2026
1 check passed
@pekkah pekkah deleted the feat/cli-ngl-device-flags branch June 14, 2026 09:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant