
server : host-memory prompt caching#16391

Merged
ggerganov merged 17 commits into master from gg/prompt-cache-ext on Oct 9, 2025
Conversation


@ggerganov ggerganov commented Oct 2, 2025

target #16440
rel #16117

Initial version of automatic memory offloading to host memory, using extended logic to minimize prompt reprocessing. The host-memory prompt cache acts as a set of "extra slots" against which we can calculate prefix similarity and decide to hot-swap one into the llama_context if doing so would reduce processing. The cache is stored in regular RAM.

The amount of RAM used for caching prompts has two limits:

  • Maximum size in bytes (controlled with the new --cache-ram, -cram CLI arg)
  • Maximum number of cached tokens (by default, equal to --context-size)
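The prefix-similarity idea can be sketched as follows. This is an illustrative Python sketch, not the actual C++ implementation; all names are hypothetical:

```python
# Hypothetical sketch: pick the cached prompt that shares the longest token
# prefix with the new prompt, so hot-swapping it into the context minimizes
# how many tokens need to be reprocessed.

def common_prefix_len(a: list[int], b: list[int]) -> int:
    """Number of leading tokens the two token sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def best_cache_entry(new_prompt: list[int], cached_prompts: list[list[int]]):
    """Return (index, prefix_len) of the most similar cache entry,
    or (None, 0) if the cache is empty or nothing matches."""
    best_idx, best_len = None, 0
    for i, cached in enumerate(cached_prompts):
        p = common_prefix_len(new_prompt, cached)
        if p > best_len:
            best_idx, best_len = i, p
    return best_idx, best_len
```

If the best prefix is longer than what the current slot already holds, swapping that entry in reduces the number of tokens that must be processed from scratch.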

The server logs provide detailed prompt cache information each time the cache is updated:

(screenshot: prompt cache statistics in the server logs)
  • A small QoL improvement is that update_slots() now also logs the old and new prompt for each task around n_past (up to 10 tokens), so we can better understand what caused the particular choice of the n_past value for the new task.

  • Setting the LLAMA_SERVER_SLOTS_DEBUG=1 environment variable makes the /slots endpoint produce more detailed output, containing the prompt and the generated text of the current or last task. This is useful for debugging.

Note: mtmd workarounds are starting to cause some headaches. For example server_tokens is not copyable which complicates the cache logic and makes the prompt caching feature incompatible with mtmd.

Usage

# use 8192 MiB of host RAM for caching prompts
llama-server ... --cache-ram 8192

# use as much host RAM as is available (i.e. no limit)
llama-server ... -cram -1

# disable prompt caching in RAM
llama-server ... -cram 0
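The two limits described above can be sketched as a simple eviction policy. This is a hypothetical illustration; the class, names, and eviction order are not llama.cpp's actual implementation:

```python
# Hypothetical sketch: a prompt cache with two caps, evicting oldest entries
# first whenever either the byte limit or the token limit is exceeded.

from collections import deque

class PromptCache:
    def __init__(self, max_bytes: int, max_tokens: int):
        self.max_bytes = max_bytes    # --cache-ram limit in bytes
        self.max_tokens = max_tokens  # by default, the context size
        self.entries = deque()        # (tokens, size_bytes), oldest on the left

    def add(self, tokens: list[int], size_bytes: int):
        self.entries.append((tokens, size_bytes))
        # Evict oldest entries until both limits are satisfied.
        while self.entries and (self._total_bytes() > self.max_bytes
                                or self._total_tokens() > self.max_tokens):
            self.entries.popleft()

    def _total_bytes(self) -> int:
        return sum(s for _, s in self.entries)

    def _total_tokens(self) -> int:
        return sum(len(t) for t, _ in self.entries)
```

With -cram -1 the byte limit is effectively disabled and only the token cap applies; with -cram 0 nothing is ever retained.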

Server refactor

  • Replace server_slot members with a single server_task
  • Remove server_slot.n_predict
  • Remove prompt truncation logic (obsolete and not useful anymore)
  • slot.task is now a const pointer to reflect that the task parameters should not change once passed to the slot
  • Bump default context checkpoints from 3 to 8

TODOs

  • Set memory limit for the host-memory cache from CLI
  • Clean up implementation
  • Test with agentic workflows
  • Multi-slot tests
  • Fix progress report

@ggerganov ggerganov force-pushed the gg/prompt-cache-ext branch 2 times, most recently from 0787f03 to 5c0cec4 on October 3, 2025 at 18:49
@ggerganov ggerganov force-pushed the gg/prompt-cache-ext branch from 5c0cec4 to 1440ec5 on October 7, 2025 at 07:40
@ggerganov ggerganov changed the base branch from master to gg/server-checkpoints-improve on October 7, 2025 at 07:41
@github-actions github-actions bot added the python python script changes label Oct 7, 2025
@ggerganov ggerganov mentioned this pull request on Oct 7, 2025
@ggerganov ggerganov force-pushed the gg/prompt-cache-ext branch from 9de8392 to cf7dd4b on October 7, 2025 at 15:09
@ggerganov ggerganov (Member, Author) commented:

Looking for feedback on how this new logic performs in different use cases. I've been testing it with the llama.vscode agent and it significantly improves the experience, since we can now use a single server slot without thrashing the prompt cache.

The current implementation should work with any model (dense, MoE, SWA, SSM, etc.). I think the default settings should be good for most use cases, though we'll probably add some options to adjust cache limits if needed.

Pay attention to these new messages in the logs:

(screenshot: new prompt cache messages in the server logs)

Interested in testing agentic use cases, such as Claude Code and similar, where we have a single large context with various auxiliary calls (keyword extraction, summarization, etc.) interleaved. The expectation is that prompt reprocessing should be significantly reduced in such cases.

Base automatically changed from gg/server-checkpoints-improve to master on October 8, 2025 at 07:57
@ggerganov ggerganov force-pushed the gg/prompt-cache-ext branch from 65e8991 to 264d2c3 on October 8, 2025 at 08:24
@ggerganov ggerganov marked this pull request as ready for review on October 8, 2025 at 12:53
@ggerganov ggerganov requested a review from ngxson as a code owner on October 8, 2025 at 12:53
@dark-penguin commented:

Fantastic work indeed! Now local orchestrators can actually be usable!

The RAM size that is used for caching prompts has 2 limits:

  • Max size in bytes (controlled with new --cache-ram, -cram CLI arg)
  • Max number of cached tokens (by default, equal to --context-size)

What argument can I use to make the "max number of cached tokens" limit larger than its default (equal to --context-size)?

Just to make sure I understand everything correctly:

  • Let's say I have 9GB KV cache (96k tokens) that fits in VRAM, I have 64GB RAM, and I'm making 10k-sized prompts (so, a little under 1GB for each prompt).
  • By default, I can have 8 prompts cached, because 8GB is the default -cram size
  • With -cram -1, I can have 9 prompts cached, because now we are limited by "--context-size 96000"
  • If I can find which argument to use to remove that limit, I can have over 60 prompts cached, as long as I have enough free RAM

Is this all correct?
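The arithmetic in the question above can be checked with a quick sketch. All numbers here are the commenter's assumptions (9 GiB KV cache, 96k-token context, 10k-token prompts, 8192 MiB default -cram), not measured values:

```python
# Numbers from the comment above (assumptions, not measurements):
kv_cache_gib = 9        # KV cache size that fits in VRAM
ctx_tokens = 96_000     # --context-size
prompt_tokens = 10_000  # per-prompt size

bytes_per_token = kv_cache_gib * 1024**3 / ctx_tokens   # ~100 KiB per token
prompt_mib = prompt_tokens * bytes_per_token / 1024**2  # ~960 MiB per prompt

# Byte limit: with an assumed 8192 MiB default -cram, ~8 prompts fit.
prompts_by_bytes = int(8192 // prompt_mib)

# With -cram -1, only the token limit applies: 96k tokens / 10k per prompt.
prompts_by_tokens = ctx_tokens // prompt_tokens

print(prompts_by_bytes, prompts_by_tokens)  # prints: 8 9
```

So the "8 prompts by default, 9 with -cram -1" figures in the question are internally consistent under those assumptions.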

Anico2 added a commit to Anico2/llama.cpp that referenced this pull request Jan 15, 2026
* minor : code style

* server : fix prompt similarity calculation

* server : initial host-memory prompt caching

* cont

* server : refactor

* cont

* cont : make the server task of the slot const

* cont : minor [no ci]

* server : cache prompts and checkpoints only for completion tasks

* server : improve prompt caching logic

* cont : fix check for number of cached prompts [no ci]

* server : improve caching logic, add -cram CLI arg

* server : print prompt mismatch info

* cont : better naming [no ci]

* server : improve prompt cache loading logic

* server : add option to debug the slot contents (ggml-org#16482)

* server : add option to debug the slot contents

* Update tools/server/server.cpp

---------

Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>

* server : add option to disable prompt cache

---------

Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>

Labels: examples, python (python script changes), server
