
server : enable multi-modal prompt caching #19877

Merged

ggerganov merged 1 commit into master from gg/server-mtmd-prompt-cache on Feb 25, 2026

Conversation

@ggerganov (Member)

target #19849
cont #16391

We can now clone server_tokens, so this PR re-enables host-memory prompt caching for multi-modal cases.
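
As a rough sketch of why cloneability matters here (all type and member names below are illustrative assumptions, not the actual server code): once the token container has value semantics, a slot's prompt, including its multi-modal chunks, can be deep-copied into a host-memory cache and restored on a later matching request.

```cpp
// Minimal illustration only -- not the llama.cpp implementation.
// The hypothetical server_tokens_sketch stands in for server_tokens.
#include <string>
#include <unordered_map>
#include <vector>

struct server_tokens_sketch {
    std::vector<int>         text_tokens;   // regular text token ids
    std::vector<std::string> media_chunks;  // opaque multi-modal chunks (e.g. image data)
    // Value semantics: the implicitly generated copy constructor performs
    // the deep "clone" that previously blocked caching multi-modal prompts.
};

struct prompt_cache_sketch {
    std::unordered_map<std::string, server_tokens_sketch> entries;

    // Snapshot a processed prompt into host memory.
    void save(const std::string & key, const server_tokens_sketch & prompt) {
        entries[key] = prompt; // deep copy
    }

    // Return the cached prompt for this key, or nullptr on a miss.
    const server_tokens_sketch * load(const std::string & key) const {
        auto it = entries.find(key);
        return it == entries.end() ? nullptr : &it->second;
    }
};
```

Per the description above, the substantive change is simply that server_tokens can now be cloned, which is the operation host-memory prompt caching needs in the multi-modal case.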

@andyceo commented Feb 25, 2026

I switched to gg/server-mtmd-prompt-cache, ran git rebase master, and built the app.

I can confirm that prompt caching now works! Note that I ran with no-mmproj-offload = on and did not try without it.

VRAM usage is quite unstable for qwen3.5-27B, but that is a separate issue. Thank you!

Base automatically changed from pr/19747-alt to master on February 25, 2026 13:14
ggerganov force-pushed the gg/server-mtmd-prompt-cache branch from f94fc71 to dc4d447 on February 25, 2026 13:15
ggerganov merged commit f20469d into master on Feb 25, 2026
75 of 76 checks passed
ggerganov deleted the gg/server-mtmd-prompt-cache branch on February 25, 2026 13:15

@BVEsun commented Feb 26, 2026

Thank you very much, ggerganov, for your effort in troubleshooting this problem.

I am using the Windows 10 b8157 release with CUDA 12.4.

After this PR, text prompt processing is being handled by the CPU instead of the GPU, which is causing a significant slowdown.

bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request on Mar 2, 2026
ArberSephirotheca pushed a commit to ArberSephirotheca/llama.cpp that referenced this pull request on Mar 3, 2026