
server : host-memory prompt caching#16391

Merged
ggerganov merged 17 commits into master from gg/prompt-cache-ext on Oct 9, 2025
Conversation


@ggerganov ggerganov commented Oct 2, 2025

target #16440
rel #16117

Initial version of automatic memory offloading to host memory, using extended logic to minimize prompt reprocessing. The host-memory prompt cache acts as a set of "extra slots" against which we can calculate prefix similarity and decide to hot-swap one into the llama_context if doing so would reduce processing. The cache is stored in regular RAM.

The amount of RAM used for caching prompts has two limits:

  • Maximum size in bytes (controlled with the new --cache-ram, -cram CLI arg)
  • Maximum number of cached tokens (by default, equal to --context-size)
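The prefix-similarity idea can be sketched as follows. This is an illustrative Python sketch, not the actual C++ implementation; all names are hypothetical:

```python
# Hypothetical sketch: pick the cached prompt that shares the longest token
# prefix with the new prompt, so hot-swapping it into the context minimizes
# how many tokens need to be reprocessed.

def common_prefix_len(a: list[int], b: list[int]) -> int:
    """Number of leading tokens the two token sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def best_cache_entry(new_prompt: list[int], cached_prompts: list[list[int]]):
    """Return (index, prefix_len) of the most similar cache entry,
    or (None, 0) if the cache is empty or nothing matches."""
    best_idx, best_len = None, 0
    for i, cached in enumerate(cached_prompts):
        p = common_prefix_len(new_prompt, cached)
        if p > best_len:
            best_idx, best_len = i, p
    return best_idx, best_len
```

If the best prefix is longer than what the current slot already holds, swapping that entry in reduces the number of tokens that must be processed from scratch.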

The server logs provide detailed prompt cache information each time the cache is updated:

(screenshot: prompt cache statistics in the server logs)
  • A small QoL improvement is that update_slots() now also logs the old and new prompt for each task around n_past (up to 10 tokens), so we can better understand what caused the particular choice of the n_past value for the new task.

  • Setting the LLAMA_SERVER_SLOTS_DEBUG=1 environment variable makes the /slots endpoint produce more detailed output, containing the prompt and the generated text of the current or last task. This is useful for debugging.

Note: mtmd workarounds are starting to cause some headaches. For example server_tokens is not copyable which complicates the cache logic and makes the prompt caching feature incompatible with mtmd.

Usage

# use 8192 MiB of host RAM for caching prompts
llama-server ... --cache-ram 8192

# use as much host RAM as is available (i.e. no limit)
llama-server ... -cram -1

# disable prompt caching in RAM
llama-server ... -cram 0
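The two limits described above can be sketched as a simple eviction policy. This is a hypothetical illustration; the class, names, and eviction order are not llama.cpp's actual implementation:

```python
# Hypothetical sketch: a prompt cache with two caps, evicting oldest entries
# first whenever either the byte limit or the token limit is exceeded.

from collections import deque

class PromptCache:
    def __init__(self, max_bytes: int, max_tokens: int):
        self.max_bytes = max_bytes    # --cache-ram limit in bytes
        self.max_tokens = max_tokens  # by default, the context size
        self.entries = deque()        # (tokens, size_bytes), oldest on the left

    def add(self, tokens: list[int], size_bytes: int):
        self.entries.append((tokens, size_bytes))
        # Evict oldest entries until both limits are satisfied.
        while self.entries and (self._total_bytes() > self.max_bytes
                                or self._total_tokens() > self.max_tokens):
            self.entries.popleft()

    def _total_bytes(self) -> int:
        return sum(s for _, s in self.entries)

    def _total_tokens(self) -> int:
        return sum(len(t) for t, _ in self.entries)
```

With -cram -1 the byte limit is effectively disabled and only the token cap applies; with -cram 0 nothing is ever retained.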

Server refactor

  • Replace server_slot members with a single server_task
  • Remove server_slot.n_predict
  • Remove prompt truncation logic (obsolete and not useful anymore)
  • slot.task is now a const pointer to reflect that the task parameters should not change once passed to the slot
  • Bump default context checkpoints from 3 to 8

TODOs

  • Set memory limit for the host-memory cache from CLI
  • Clean up implementation
  • Test with agentic workflows
  • Multi-slot tests
  • Fix progress report

@ggerganov ggerganov force-pushed the gg/prompt-cache-ext branch 2 times, most recently from 0787f03 to 5c0cec4 on October 3, 2025 at 18:49
@ggerganov ggerganov force-pushed the gg/prompt-cache-ext branch from 5c0cec4 to 1440ec5 on October 7, 2025 at 07:40
@ggerganov ggerganov changed the base branch from master to gg/server-checkpoints-improve on October 7, 2025 at 07:41
@github-actions github-actions bot added the python python script changes label Oct 7, 2025
@ggerganov ggerganov mentioned this pull request on Oct 7, 2025
@ggerganov ggerganov force-pushed the gg/prompt-cache-ext branch from 9de8392 to cf7dd4b on October 7, 2025 at 15:09
@ggerganov ggerganov (Member, Author) commented:

Looking for feedback on how this new logic performs in different use cases. I've been testing it with the llama.vscode agent and it significantly improves the experience, since we can now use a single server slot without thrashing the prompt cache.

The current implementation should work with any model (dense, MoE, SWA, SSM, etc.). I think the default settings should be good for most use cases, though we'll probably add some options to adjust cache limits if needed.

Pay attention to these new messages in the logs:

(screenshot: new prompt cache messages in the server logs)

Interested in testing agentic use cases, such as Claude Code and similar, where we have a single large context with various auxiliary calls (keyword extraction, summarization, etc.) interleaved. The expectation is that prompt reprocessing should be significantly reduced in such cases.

Base automatically changed from gg/server-checkpoints-improve to master on October 8, 2025 at 07:57
@ggerganov ggerganov force-pushed the gg/prompt-cache-ext branch from 65e8991 to 264d2c3 on October 8, 2025 at 08:24
@ggerganov ggerganov marked this pull request as ready for review on October 8, 2025 at 12:53
@ggerganov ggerganov requested a review from ngxson as a code owner on October 8, 2025 at 12:53
@dark-penguin commented:

Fantastic work indeed! Now local orchestrators can actually be usable!

The RAM size that is used for caching prompts has 2 limits:

  • Max size in bytes (controlled with new --cache-ram, -cram CLI arg)
  • Max number of cached tokens (by default, equal to --context-size)

What argument can I use to make the "max number of cached tokens" limit larger than its default (equal to --context-size)?

Just to make sure I understand everything correctly:

  • Let's say I have 9GB KV cache (96k tokens) that fits in VRAM, I have 64GB RAM, and I'm making 10k-sized prompts (so, a little under 1GB for each prompt).
  • By default, I can have 8 prompts cached, because 8GB is the default -cram size
  • With -cram -1, I can have 9 prompts cached, because now we are limited by "--context-size 96000"
  • If I can find which argument to use to remove that limit, I can have over 60 prompts cached, as long as I have enough free RAM

Is this all correct?
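The arithmetic in the question above can be checked with a quick sketch. All numbers here are the commenter's assumptions (9 GiB KV cache, 96k-token context, 10k-token prompts, 8192 MiB default -cram), not measured values:

```python
# Numbers from the comment above (assumptions, not measurements):
kv_cache_gib = 9        # KV cache size that fits in VRAM
ctx_tokens = 96_000     # --context-size
prompt_tokens = 10_000  # per-prompt size

bytes_per_token = kv_cache_gib * 1024**3 / ctx_tokens   # ~100 KiB per token
prompt_mib = prompt_tokens * bytes_per_token / 1024**2  # ~960 MiB per prompt

# Byte limit: with an assumed 8192 MiB default -cram, ~8 prompts fit.
prompts_by_bytes = int(8192 // prompt_mib)

# With -cram -1, only the token limit applies: 96k tokens / 10k per prompt.
prompts_by_tokens = ctx_tokens // prompt_tokens

print(prompts_by_bytes, prompts_by_tokens)  # prints: 8 9
```

So the "8 prompts by default, 9 with -cram -1" figures in the question are internally consistent under those assumptions.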

Anico2 added a commit to Anico2/llama.cpp that referenced this pull request Jan 15, 2026
* minor : code style

* server : fix prompt similarity calculation

* server : initial host-memory prompt caching

* cont

* server : refactor

* cont

* cont : make the server task of the slot const

* cont : minor [no ci]

* server : cache prompts and checkpoints only for completion tasks

* server : improve prompt caching logic

* cont : fix check for number of cached prompts [no ci]

* server : improve caching logic, add -cram CLI arg

* server : print prompt mismatch info

* cont : better naming [no ci]

* server : improve prompt cache loading logic

* server : add option to debug the slot contents (ggml-org#16482)

* server : add option to debug the slot contents

* Update tools/server/server.cpp

---------

Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>

* server : add option to disable prompt cache

---------

Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>

Labels: examples, python (python script changes), server
