
server : support multi-modal context checkpoints#19849

Merged
ggerganov merged 15 commits into master from pr/19747-alt
Feb 25, 2026

Conversation

@ggerganov
Member

@ggerganov ggerganov commented Feb 24, 2026

alt #19747
fix #19690
fix #19858

This continues the work from #19747 to enable multi-modal checkpointing in llama-server. It avoids changes to libllama, and it fixes the ambiguity of n_past, which conflated "number of tokens" with "next position", by tracking the position separately in a new pos_next variable.

@timkhronos
Contributor

I think it might make more sense to post this here instead of in the original PR:

@ggerganov These 3 changes should fix the incorrect checkpoint reuse I mentioned regarding #19849

in server-task.h, add a new pos_end to checkpoint struct:

struct server_prompt_checkpoint {
    llama_pos pos_min;
    llama_pos pos_max;
    llama_pos pos_end; // full prompt extent when the checkpoint was created

    std::vector<uint8_t> data;
};

in server-context.cpp, at checkpoint creation:

auto & cur = slot.prompt.checkpoints.emplace_back(server_prompt_checkpoint{
    /*.pos_min = */ pos_min,
    /*.pos_max = */ pos_max,
    /*.pos_end = */ slot.prompt.tokens.pos_next(),
    /*.data   = */ std::vector<uint8_t>(checkpoint_size),
});

and also in server-context.cpp, when checking if we have valid checkpoints:

const llama_pos prompt_end = slot.task->tokens.pos_next();

const auto it = std::find_if(
    slot.prompt.checkpoints.rbegin(),
    slot.prompt.checkpoints.rend(),
    [&](const auto & cur) {
        if (cur.pos_end > prompt_end) {
            return false;
        }
        // guarantee that a checkpoint will result in at least one token being processed [TAG_PROMPT_LOGITS]
        return cur.pos_min < pos_min_thold;
    }
);

@ggerganov ggerganov marked this pull request as ready for review February 24, 2026 20:19
@ggerganov ggerganov requested a review from ngxson as a code owner February 24, 2026 20:19
@ggerganov
Member Author

@ngxson PTAL, I think this is good to merge.

llama_pos pos_min;
llama_pos pos_max;

int n_tokens;
Collaborator

should we change this to int64_t to match what pos_next(int64_t) expects?

@ggerganov ggerganov merged commit d7d826b into master Feb 25, 2026
77 of 78 checks passed
@ggerganov ggerganov deleted the pr/19747-alt branch February 25, 2026 13:14
neilopet added a commit to neilopet/llama.cpp that referenced this pull request Feb 26, 2026
PR ggml-org#19849 (multi-modal context checkpoints) introduced an off-by-one
in server_tokens::size_up_to_pos: max_pos + 1 overflows the sequence
position by one token, causing "Invalid input batch" HTTP 500 errors
on every request after the first.

Change max_pos + 1 to max_pos to match the expected 0-based position
semantics.

Ref: ggml-org#19901
bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request Mar 2, 2026
* Modify llama-memory-hybrid-iswa.cpp

* Modify llama-memory-recurrent.cpp

* Modify server-common.cpp

* Modify server-common.h

* Modify server-context.cpp

* Modify server-task.h

* Added comment to llama-memory-hybrid-iswa.cpp

* Remove comment from server-context.cpp

* Stylistic fix server-context.cpp

* Fix an issue when seqrm isn't called in server-context.cpp

* cont : alternative impl

* cont : cleanup

* cont : n_tokens -> int64_t

---------

Co-authored-by: timkhronos <timkhronos@gmail.com>
ArberSephirotheca pushed a commit to ArberSephirotheca/llama.cpp that referenced this pull request Mar 3, 2026