
server : support multi-modal context checkpoints#19849

Merged
ggerganov merged 15 commits into master from pr/19747-alt
Feb 25, 2026

Conversation

@ggerganov
Member

@ggerganov ggerganov commented Feb 24, 2026

alt #19747
fix #19690
fix #19858

This continues the work from #19747 to enable multi-modal checkpointing in llama-server. It avoids changes to libllama, and it fixes the ambiguity of n_past, which conflated "number of tokens" with "next position", by tracking the position separately in a new pos_next variable.

@timkhronos
Contributor

I think it might make more sense to post this here instead of in the original PR:

@ggerganov These 3 changes should fix the incorrect checkpoint reuse I mentioned regarding #19849

in server-task.h, add a new pos_end to checkpoint struct:

struct server_prompt_checkpoint {
    llama_pos pos_min;
    llama_pos pos_max;
    llama_pos pos_end; // full prompt extent when the checkpoint was created

    std::vector<uint8_t> data;
};

in server-context.cpp, at checkpoint creation:

auto & cur = slot.prompt.checkpoints.emplace_back(server_prompt_checkpoint{
    /*.pos_min = */ pos_min,
    /*.pos_max = */ pos_max,
    /*.pos_end = */ slot.prompt.tokens.pos_next(),
    /*.data   = */ std::vector<uint8_t>(checkpoint_size),
});

and also in server-context.cpp, when checking if we have valid checkpoints:

const llama_pos prompt_end = slot.task->tokens.pos_next();

const auto it = std::find_if(
    slot.prompt.checkpoints.rbegin(),
    slot.prompt.checkpoints.rend(),
    [&](const auto & cur) {
        if (cur.pos_end > prompt_end) {
            return false;
        }
        // guarantee that a checkpoint will result in at least one token being processed [TAG_PROMPT_LOGITS]
        return cur.pos_min < pos_min_thold;
    }
);

@ggerganov ggerganov marked this pull request as ready for review February 24, 2026 20:19
@ggerganov ggerganov requested a review from ngxson as a code owner February 24, 2026 20:19
@ggerganov
Member Author

@ngxson PTAL, I think this is good to merge.

llama_pos pos_min;
llama_pos pos_max;

int n_tokens;
Collaborator

should we change this to int64_t to match what pos_next(int64_t) expects?

@ggerganov ggerganov merged commit d7d826b into master Feb 25, 2026
77 of 78 checks passed
@ggerganov ggerganov deleted the pr/19747-alt branch February 25, 2026 13:14
neilopet added a commit to neilopet/llama.cpp that referenced this pull request Feb 26, 2026
PR ggml-org#19849 (multi-modal context checkpoints) introduced an off-by-one
in server_tokens::size_up_to_pos: max_pos + 1 overflows the sequence
position by one token, causing "Invalid input batch" HTTP 500 errors
on every request after the first.

Change max_pos + 1 to max_pos to match the expected 0-based position
semantics.

Ref: ggml-org#19901
bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request Mar 2, 2026
* Modify llama-memory-hybrid-iswa.cpp

* Modify llama-memory-recurrent.cpp

* Modify server-common.cpp

* Modify server-common.h

* Modify server-context.cpp

* Modify server-task.h

* Added comment to llama-memory-hybrid-iswa.cpp

* Remove comment from server-context.cpp

* Stylistic fix server-context.cpp

* Fix an issue when seqrm isn't called in server-context.cpp

* cont : alternative impl

* cont : cleanup

* cont : n_tokens -> int64_t

---------

Co-authored-by: timkhronos <timkhronos@gmail.com>
ArberSephirotheca pushed a commit to ArberSephirotheca/llama.cpp that referenced this pull request Mar 3, 2026