Validate Gemma 4 SWA long-context correctness against llama.cpp (cross-tool oracle) #165

Description

@pekkah

Follow-up to #164 (commit 48a5622), which fixed a latent bug: Gemma 4 SWA KV caches were window-sized but indexed absolutely, so output was silently wrong beyond the 512-token sliding window. The fix (a real pos % cacheSize ring) is validated internally — chunked-vs-per-token parity on prompts whose final-token window spans a chunk boundary (so the observed logit depends on cross-chunk + ring-wrapped reads), plus a >5K-token coherence run.
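
For readers unfamiliar with the bug class, here is a minimal sketch of ring indexing for a window-sized cache. It is not sharpi's code: `RingCache`, `reference_window`, and the toy window size of 4 are hypothetical names and values chosen for illustration, and the only detail taken from the fix is the `pos % cacheSize` slot mapping.

```python
class RingCache:
    """Toy sliding-window cache holding the most recent `size` entries."""

    def __init__(self, size):
        self.size = size
        self.slots = [None] * size

    def write(self, pos, value):
        # Fold the absolute position into the window-sized buffer.
        # Indexing with `pos` directly would run past the buffer once pos >= size.
        self.slots[pos % self.size] = value

    def window(self, pos):
        # Entries visible to the token at `pos`, oldest first.
        start = max(0, pos - self.size + 1)
        return [self.slots[p % self.size] for p in range(start, pos + 1)]


def reference_window(history, pos, size):
    # Ground truth: slice the full, unbounded history.
    return history[max(0, pos - size + 1): pos + 1]


size = 4  # stand-in for the 512-token window
cache = RingCache(size)
history = []
for pos in range(10):
    value = f"kv{pos}"
    history.append(value)
    cache.write(pos, value)
    # The ring must agree with the full history at every position,
    # including after the write index has wrapped.
    assert cache.window(pos) == reference_window(history, pos, size)
```

The loop runs well past the window size, so the comparison covers reads that wrap around the end of the buffer, which is the case the original absolute indexing got wrong.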

What's missing: an independent cross-tool check vs llama.cpp at long context. It was skipped during #164 because llama-cli hung in interactive mode and the chat-template token-parity matching is a known rabbit hole (see CLAUDE/memory).

Task: greedy-decode a >4096-token prompt on both sharpi (sharpi-cli.exe -g -1) and C:\p\llama.cpp-cuda\bin\llama-cli.exe with matched tokenization, compare the first ~20 generated tokens, and document the result. Verify token ids via llama-tokenize, and drive llama-cli non-interactively (pipe EOF / -no-cnv confirmed effective). Low priority: the internal oracle already covers the regression, so this is defense-in-depth for a correctness fix.
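
The comparison step could be sketched as below. This assumes the token ids from each tool have already been collected into plain lists; `first_divergence` and the sample ids are hypothetical, and nothing here invokes either CLI.

```python
def first_divergence(ours, theirs, limit=20):
    """Return the index of the first differing token within `limit`, or None if they agree."""
    a, b = ours[:limit], theirs[:limit]
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One run ending early within the limit also counts as a divergence.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


# Made-up token ids, for illustration only.
sharpi_ids = [101, 7, 42, 9, 13]
llama_ids = [101, 7, 42, 8, 13]

idx = first_divergence(sharpi_ids, llama_ids)
if idx is None:
    print("generated tokens match")
else:
    print(f"first mismatch at generated token {idx}: {sharpi_ids[idx]} vs {llama_ids[idx]}")
```

The same helper can be run on the prompt token ids first, with `limit` set to the prompt length, so that a tokenization mismatch is caught before any generated tokens are compared.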

Labels: bug