Name and Version
/home/sysops/llama.cpp/build/bin/llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: Tesla P40, compute capability 6.1, VMM: yes
Device 1: Tesla P40, compute capability 6.1, VMM: yes
version: 7062 (9b17d74)
built with gcc-14 (Ubuntu 14.2.0-19ubuntu2) 14.2.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
CUDA
Hardware
2 x Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
Device 0 [Tesla P40] PCIe GEN 3@ 8x
Device 1 [Tesla P40] PCIe GEN 3@16x
Models
hf.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:UD-Q4_K_XL
Problem description & steps to reproduce
When using the llama.cpp server (via the OpenAI-compatible API, /v1/chat/completions), if the total number of tokens in the prompt exceeds the --ctx-size limit, the server stops processing the request and returns an HTTP 400 Bad Request error with the message "the request exceeds the available context size, try increasing it."
I expected the server, especially when using the --context-shift flag, to handle the context limit by automatically truncating/shifting the oldest messages in the conversation history, similar to how other platforms (like Ollama, which uses a client-side approach) manage chat history for long contexts.
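For reference, the client-side approach mentioned above (what Ollama-style clients do) amounts to trimming the oldest non-system messages before each request. A minimal sketch follows (Python; count_tokens is a hypothetical rough estimator, not part of llama.cpp, and a real client would want to tokenize via the server instead):

```python
# Hedged sketch of client-side history trimming, assuming a rough token estimator.
# count_tokens() is hypothetical; for accurate budgeting a real client should ask
# the server to tokenize the text rather than guess from character counts.

def count_tokens(message: dict) -> int:
    # Rough heuristic: ~4 characters per token. An assumption, not exact.
    return max(1, len(message.get("content", "")) // 4)

def trim_history(messages: list[dict], ctx_size: int) -> list[dict]:
    """Drop the oldest non-system messages until the history fits ctx_size."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    def total(msgs: list[dict]) -> int:
        return sum(count_tokens(m) for m in msgs)

    # Always keep the final (newest) message; drop from the oldest end first.
    while len(rest) > 1 and total(system + rest) > ctx_size:
        rest.pop(0)
    return system + rest
```

This only works around the problem on the client side; the request here is for the server itself to handle it.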
Steps to Reproduce
- Set up the Server: Run llama.cpp using the following command (with a model, e.g., a Qwen model):
"$LLAMA_SERVER_PATH" \
--host 0.0.0.0 \
--port 10000 \
--threads -1 \
--ctx-size 131072 \
--context-shift \
--alias "Qwen3-Coder-30B-A3B" \
-m "$MODEL_PATH" \
--api-key 1234567890 \
--jinja \
--temp 0.7 \
--min-p 0.01 \
--top-p 0.80 \
--top-k 20 \
--repeat-penalty 1.05 \
--flash-attn on \
--batch-size 4096 \
--ubatch-size 2048 \
--threads 32 \
--metrics
- Send Long Chat Requests: Continuously send chat completion requests to http://0.0.0.0:10000/v1/chat/completions using a client (like a custom application running on VSCode/Cline) that sends the full conversation history with each request.
- Observe Failure: Once the cumulative token count of the conversation history (task.n_tokens) exceeds the configured limit (--ctx-size 131072), the server fails the request; a minimal client sketch that reproduces this is shown after this list.
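For completeness, here is a minimal reproduction sketch (Python, using the requests package) against the server started above. The filler size is an assumption intended to push the prompt well past 131072 tokens; the model alias and API key match the launch command:

```python
# Minimal reproduction sketch. Assumes the server command above is running and
# the `requests` package is installed. The amount of filler is a rough guess;
# adjust it so the tokenized prompt exceeds --ctx-size (131072).
import requests

URL = "http://0.0.0.0:10000/v1/chat/completions"
HEADERS = {"Authorization": "Bearer 1234567890"}

long_history = [
    # ~700k characters of filler, which should tokenize to well over 131072 tokens.
    {"role": "user", "content": "filler " * 100_000},
    {"role": "user", "content": "Summarize the conversation so far."},
]

resp = requests.post(
    URL,
    headers=HEADERS,
    json={"model": "Qwen3-Coder-30B-A3B", "messages": long_history},
    timeout=600,
)
print(resp.status_code)  # 400 once the prompt exceeds the context size
print(resp.json())       # "the request exceeds the available context size, ..."
```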
Observed Log Output (Server Side)
The server log shows the failure when the request size exceeds the context:
slot update_slots: id 1 | task 21 | new prompt, n_ctx_slot = 131072, n_keep = 15000, task.n_tokens = 133046
srv send_error: task id = 21, error: the request exceeds the available context size, try increasing it
slot release: id 1 | task 21 | stop processing: n_tokens = 0, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 10.2.0.153 400
Expected Behavior
The server should either:
- Automatically truncate the oldest messages (excluding the system prompt and the final user message) to fit the context window (--ctx-size 131072) and process the request successfully.
- Or, if --context-shift is intended for this use case, it should be enabled and working to manage the context window without returning a 400 error.
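For illustration, the kind of token-level shifting expected from the second option could look like the sketch below. This is a conceptual sketch only, not llama.cpp's actual implementation; n_keep mirrors the value visible in the log (n_keep = 15000):

```python
# Conceptual illustration of token-level context shifting: keep the first n_keep
# tokens (system prompt / protected prefix) and discard the oldest tokens after
# that prefix until the sequence fits the context window. Not llama.cpp code.

def shift_context(tokens: list[int], n_ctx: int, n_keep: int) -> list[int]:
    if len(tokens) <= n_ctx:
        return tokens
    n_discard = len(tokens) - n_ctx           # how many tokens must be dropped
    kept_prefix = tokens[:n_keep]             # e.g. n_keep = 15000 as in the log
    shifted_tail = tokens[n_keep + n_discard:]  # oldest post-prefix tokens removed
    return kept_prefix + shifted_tail           # length is now exactly n_ctx
```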
First Bad Commit
No response
Relevant log output
srv log_server_r: request: POST /v1/chat/completions 10.2.0.153 400
srv log_server_r: request: GET /v1/models 10.2.0.153 200
srv params_from_: Chat format: Hermes 2 Pro
slot get_availabl: id 3 | task -1 | selected slot by LCP similarity, sim_best = 1.000 (> 0.100 thold), f_keep = 0.978
slot launch_slot_: id 3 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 3 | task 12 | processing task
slot update_slots: id 3 | task 12 | new prompt, n_ctx_slot = 131072, n_keep = 15000, task.n_tokens = 314
slot update_slots: id 3 | task 12 | need to evaluate at least 1 token for each active slot (n_past = 314, task.n_tokens() = 314)
slot update_slots: id 3 | task 12 | n_past was set to 313
slot update_slots: id 3 | task 12 | n_tokens = 313, memory_seq_rm [313, end)
slot update_slots: id 3 | task 12 | prompt processing progress, n_tokens = 314, batch.n_tokens = 1, progress = 1.000000
slot update_slots: id 3 | task 12 | prompt done, n_tokens = 314, batch.n_tokens = 1
slot print_timing: id 3 | task 12 |
prompt eval time = 21.27 ms / 1 tokens ( 21.27 ms per token, 47.01 tokens per second)
eval time = 134.18 ms / 8 tokens ( 16.77 ms per token, 59.62 tokens per second)
total time = 155.45 ms / 9 tokens
slot release: id 3 | task 12 | stop processing: n_tokens = 321, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 10.2.0.153 200
srv params_from_: Chat format: Hermes 2 Pro
slot get_availabl: id 1 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 1 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 1 | task 21 | processing task
slot update_slots: id 1 | task 21 | new prompt, n_ctx_slot = 131072, n_keep = 15000, task.n_tokens = 133046
srv send_error: task id = 21, error: the request exceeds the available context size, try increasing it
slot release: id 1 | task 21 | stop processing: n_tokens = 0, truncated = 0
srv update_slots: no tokens to decode
srv update_slots: all slots are idle
srv stop: cancel task, id_task = 21
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 10.2.0.153 400