Conversation
This is correct - we always alternate between conventional and speculative passes. It's definitely not optimal, but it improves flexibility for regular sampling: it allows changing the speculative parameters and even disabling speculation per request, while keeping the logic quite simple. It should be possible to improve this by keeping track of which slots are speculating on each iteration and skipping those slots when adding tokens to the conventional batch. It might be a good idea to implement this separately to avoid huge changes to the logic in a single PR. |
|
Generally we should try to minimize the changes. On first look, I think the path that involves minimal changes is:
Extracting the MTP logits during
Currently, I am not sure which way is better. The first requires a new API call, while the second might break some existing assumptions (not sure if that's the case yet). In any case, you can avoid this until you get the implementation working with a reasonable speedup. After that, we can discuss further how to best refactor the implementation. |
I don't see an issue with adding a new API for this, and it would be easier to use. |
|
Out of curiosity, is the API for this expected to be flexible enough that we could jump off of it to add things like Medusa / Eagle style (or IBM Accelerator) self-speculative decoding heads? I'm pretty sure they work fairly similarly (depending on the final output embeddings of the current token).

Another note: after some consideration, I think the expected speedup of the MTP module will depend a lot on the hardware the model's running on, particularly because it's an MoE model. While the next-token prediction depends only on the current state, if we're doing self-speculative decoding, that's additional forward passes. Those forward passes aren't guaranteed to have the same expert usage patterns, meaning the speedup should be some function of the tokens predicted and the expert re-use coefficient for the tokens verified.

So, just noting that if it's implemented and there's not a 2x or 3x increase in T/s, it may not be a skill issue on the part of a contributor, but due to the mathematical nature of the calculation. For people running franken setups with attention / KV cache on GPU and MoE FFNs on CPU, it's possible that using previously unused experts in the verification sweep may result in a weird situation where the parallel verification process is actually memory-bandwidth bound.

Not to discourage the implementation of this, I just wanted to give a heads up so nobody's dejected if the theoretical speedups can't be hit. There should still be at least some speedup, though. |
|
Thanks all for the suggestions. Will definitely look to refactor into something nicer once correctness can be established.
Yeah, I'd generally recommend that people temper their expectations with this. Especially given that these three models only have one MTP head, the theoretical performance gain is hard-bounded by 2x on the top end, and that's assuming a perfectly efficient implementation and 100% draft acceptance. In the absence of actual data from a working prototype... I'd probably guess that the implementation after this PR will be on the order of a 40% speedup, then up to 80% after completing this:
Optimistically, I hope to have an ugly but working prototype done sometime today. |
|
I've gotten to the point where I can get the MTP head to output stuff, but managing the KV cache with an external call to a separate MTP graph adds an unbelievable amount of complexity: I think we need to do a forward pass for the MTP layer not just when we're sampling, but for every decode token we run. This goes against the scheduling/batching that we're doing (we'd probably have to add some form of per-token callback). I think I'll take the principled approach suggested by @ggerganov above and just create a single augmented graph. On the plus side, from this previous attempt I'm pretty confident the MTP subgraph itself is correct, so it wasn't a total waste of time. 🤪 I'll commit the old branch in a sec in case it ever winds up being useful, but I kind of doubt it (outside of as a reference for constructing the MTP subgraph) |
|
On second thought, building a single augmented graph also doesn't work, because we need the main model's sampled token in the MTP subgraph. We could make some shortcut assumptions, like greedy-sampling in the MTP subgraph, but as soon as we fail to match the actual main-model sampled token for the first time, the MTP layer's KV cache is invalid. Something along the lines of the original approach might work; management of the MTP subgraph's KV cache could be made easier by using cparams.embeddings = true and LLAMA_POOLING_TYPE_NONE, decoding an entire batch, then running the entire batch through the MTP head (discarding outputs) to keep its cache up to date. |
|
This commit sort of works, in the sense that it outputs tokens but
Still enough to make me optimistic that this general approach can be refined into a real implementation and avoids the challenges noted in my last two comments. |
|
Okay, I believe this commit "works" in that both the main model and MTP outputs seem correct under my informal test conditions. The model is now about as coherent as the base model, at least in basic conversations and solving AIME problems. The typical draft acceptance rate is ~70% with my samplers; it usually gets higher than that for simple responses, code, and math, and lower for creative writing. It would probably be good to compare against the vLLM implementation to see if that's around expected (I don't have enough VRAM to run either variant in vLLM, sadly). This implementation is still far from done:
but despite all of that, I think it works as a proof of concept. Even with the implementation as rough as it is right now, it's giving roughly a 20-25% performance uplift |
|
Tried to run it in an RP scenario (using a Q4 quant) and got a 0.07 to 0.11 acceptance rate on swipes (one time unexpectedly got 0.18) (temp = 0.8, min-p = 0.05, top-p = 0.95). So yeah, we probably shouldn't expect too much from it. |
This sounds like I have a bug with token caching and/or KV shifting on my end; so far I haven't tested in any setting outside of the server webui itself. My caching behavior is almost certainly wrong outside of optimal scenarios right now. Things are still very WIP. I'll test later, but I'd probably expect 50-60% draft acceptance for RP once everything is working, for reference. |
|
Upon a bit of testing on my end in RP/creative-writing scenarios, I can't find any obvious correctness issues with this prototype's cache management; I think the draft head is just unusually bad in an RP setting. I did replace the odd logit-hack-then-standard-sample workflow with a greedy sample, which I think is simpler, less hacky, and provides a slight boost to acceptance rates. But I'm still not getting much more than ~40% draft acceptance in RP scenarios with heavy use of advanced samplers like DRY and XTC. I think it'd be good to know if the vLLM implementation yields similar results. Aside from that, I'll start refactoring. The last correctness issue, which I think may also be the thorniest, is that the MTP cache updates need the embeddings of the main decode pass in order to run, but we can also only store one ubatch's worth of embeddings at once, so the KV cache update step needs to be moved into |
|
Not an expert by any means, but XTC is unlikely to work well with MTP as it's excluding the top choice(s). Have you tried it without XTC? DRY shouldn't be much of an issue, but there possibly is some impact there as well. |
|
Is work on this still progressing in the background? If not, then what kind of work still remains to be done? Is it mainly cleanup and refactoring? If so, could another developer try their hands at this? I am eagerly awaiting MTP support and it would be a shame to see no progress on this after an initial working implementation has been done. I have no experience with llama.cpp in particular, but if it's "only" cleanup work that needs no deep knowledge of ML libraries, I might be able to support. |
@SamuelOliveirads is still working on this in a PR on this branch, see F1LM1#3. It's primarily a refactor and some optimizations, but to make my crappy prototype work in a way that is reasonably maintainable/extensible is not that trivial. I've been pretty busy the last few weeks but I don't think we're too far off from a reasonable implementation |
|
That is great to hear! Thank you a lot for your effort so far! I had already guessed that it might not be possible to contribute without deeper knowledge, but was willing to give it a try anyway. It's great to hear that @SamuelOliveirads is already working on it! |
|
The latest commits successfully integrate the MTP. The next major step is optimization. Please note: as of now, this branch is expected to be slower than the baseline without MTP. It should be considered a development preview, not a performance-ready feature. To tackle the performance issues, a lot of work is needed, particularly in areas where I'm still building my expertise. I've gathered my detailed findings and ideas for optimization in a separate discussion here: F1LM1#4 This is an open invitation for anyone with experience in these areas to take a look and share insights. Any help would be greatly appreciated!
|
|
I haven't made much progress on the optimization front over the last week, aside from the small graph reuse improvement mentioned in PR #4. Speaking of which, @ggerganov, I would love to get your thoughts on the strategic goals for this PR. As it stands, the MTP implementation is functionally correct and could be tested, but it's not expected to be faster than the baseline. Considering the challenges in optimizing it, do you think it's a good idea to merge this version now? It could serve as a foundation, allowing other maintainers to contribute in other areas, like further optimizations or adapting the MTP logic to other models. If that's the best approach, I can focus on adding the necessary user parameters and preparing the PR for wider testing and review. However, if you feel that improving the performance is essential before merging, I would greatly appreciate any help or information you could share regarding the current bottlenecks I've identified. |
|
I've been following this PR for quite a while, and thank you for the enormous work you have done! I believe I saw somewhere that the MTP acceptance rate is very sensitive to the quantization used. It seems standard practice is to keep the MTP layers at FP8 or even FP16 (link). But a quick look at some available quants shows they usually quantize the MTP layers at the same level as the main weights. So I guess improving this may lead to better acceptance rates when using quants. |
That's a very plausible point, as vision models often suffer from the same issue, which is why most of them are kept at FP16. As for MTP, it's complicated to assess in llama.cpp since we don't fully support it yet, and I'm not aware whether other backends have already tested and documented how quantization affects MTP performance. I still want to thank you for the link. Looking at NVIDIA's TensorRT-LLM backend provided some inspiration on how to solve certain optimization problems. However, it's not a simple task. The ideal scenario would be to perform all MTP operations in a single model pass:
This is complicated because the MTP has a dependency on the previous hidden state. To fix this, we could apply a loop over the MTP graph to get the previous hidden state and generate one token at a time. This, in turn, creates another problem: if I try to generate 4 tokens, llama.cpp expects me to run the graph only once with a batch of 4 tokens, not 4 times with one token each. I'm currently studying the architecture and gathering some ideas, running tests to see if I can find a simpler approach. |
|
Please rebase (looks like it will be a bit of work, ignore if you already are working on it). :) |
commit 912ed2cd9339d1b2875d98744ca5b51fa62e581e
Author: samuel <samueloliveira32df@gmail.com>
Date: Sun Dec 7 23:00:29 2025 -0300
speculative (feat): implement recursive MTP drafting for GLM-4.5
commit bdf72d9
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Dec 6 16:10:16 2025 -0300
sampling (feat): optimize speculative drafting with fast-path selection
commit a91980a
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Dec 6 15:18:19 2025 -0300
mtp (chore): clean old code
commit 6de0ecf
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Dec 6 14:40:13 2025 -0300
mtp (feat): add mtp arg
commit ea77394
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Dec 6 13:47:54 2025 -0300
mtp-graph (fix): move llama_get_logits_ith outside the loop
commit 15dff20
Merge: 171346c cae85fe
Author: samuel <samueloliveira32df@gmail.com>
Date: Thu Oct 16 13:44:41 2025 -0300
Merge branch 'glm4-mtp-batch' of https://github.com/SamuelOliveirads/llama.cpp into glm4-mtp-graph-cache
commit cae85fe
Author: samuel <samueloliveira32df@gmail.com>
Date: Thu Oct 16 13:42:31 2025 -0300
mtp-batch(fix): avoid logits for mtp kv cache operations
commit 171346c
Author: samuel <samueloliveira32df@gmail.com>
Date: Sun Oct 12 16:33:01 2025 -0300
mtp-graph(feat): Reactivate graph reuse only for main model path
commit 0127c6b
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Oct 11 22:20:54 2025 -0300
mtp-batch(chore): Remove final MTP debug logs and dead code
commit 4bcc9e2
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Oct 11 18:51:22 2025 -0300
mtp-batch(fix): Correctly advance cache head and add MTP documentation
commit b4cbe03
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Oct 11 18:37:40 2025 -0300
mtp-batch(chore): Fix logit flags for speculative sampling and remove debug logs
commit a99709d
Author: samuel <samueloliveira32df@gmail.com>
Date: Fri Oct 10 17:24:34 2025 -0300
mtp-batch(refactor): Extract decode context and MTP input logic into helper methods
commit 913af8f
Author: samuel <samueloliveira32df@gmail.com>
Date: Fri Oct 10 16:44:28 2025 -0300
mtp-batch(refactor): Replace MTP boolean flags with an explicit operation enum
commit 6f74ba3
Author: samuel <samueloliveira32df@gmail.com>
Date: Thu Oct 9 22:27:18 2025 -0300
mtp-batch (fix): prevent mtp draft from polluting the cache
commit 5e1d719
Author: samuel <samueloliveira32df@gmail.com>
Date: Thu Oct 9 15:21:23 2025 -0300
mtp-batch (feat): Create and manage sinfo for MTP
commit febd823
Author: samuel <samueloliveira32df@gmail.com>
Date: Sun Oct 5 14:43:40 2025 -0300
mtp-batch (wip): fix how to warmup kv cache for MTP
commit 67c6c06
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Sep 27 19:42:32 2025 -0300
mtp-batch (wip): Isolate MTP graph to prevent host embedding buffer corruption
commit 75dc25e
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Sep 27 17:17:00 2025 -0300
mtp-batch (wip): organize batch for mtp cache
commit 3da7e7f
Author: samuel <samueloliveira32df@gmail.com>
Date: Tue Sep 23 22:45:11 2025 -0300
mtp-batch (fix): warm mtp cache for small batch size
commit df64508
Author: samuel <samueloliveira32df@gmail.com>
Date: Sun Sep 21 21:55:41 2025 -0300
mtp-batch (wip): merge glm graphs
commit 042eb8a
Author: samuel <samueloliveira32df@gmail.com>
Date: Sun Sep 21 21:29:00 2025 -0300
mtp-batch (wip): merge mtp and model graph
commit 1318b2d
Author: samuel <samueloliveira32df@gmail.com>
Date: Sun Sep 14 10:22:59 2025 -0300
mtp-batch (wip): move mtp execution to batch format
commit c6237c7
Merge: 9fab53e 8742ce0
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Sat Sep 13 02:57:01 2025 -0400
Merge pull request #1 from SamuelOliveirads/glm4-moe-mtp
feat: implemented sampling for MTP
commit 8742ce0
Author: samuel <samueloliveira32df@gmail.com>
Date: Sat Sep 6 00:21:18 2025 -0300
feat: apply logits + greedy sampler
commit 5a5bce8
Author: samuel <samueloliveira32df@gmail.com>
Date: Wed Sep 3 17:56:14 2025 -0300
fix: add sample acceptance
commit 07670a2
Author: samuel <samueloliveira32df@gmail.com>
Date: Wed Sep 3 13:25:21 2025 -0300
feat: implemented sampling for MTP
commit 9fab53e
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Tue Sep 2 17:14:09 2025 -0400
fixed mtp kv cache update step in cases where prompt size > n_batch and n_ubatch
commit 98bc0c6
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Tue Aug 26 01:26:51 2025 -0400
replace standard sampler with greedy sampler for mtp draft
commit 471e026
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Tue Aug 19 23:10:56 2025 -0400
fixed vram leak
commit d72f9d5
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Tue Aug 19 01:50:34 2025 -0400
kludge-y kv cache management of mtp layer
commit 382135a
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Sun Aug 17 21:54:45 2025 -0400
fixed mtp kv cache update sequencing after prompt processing
commit 6870f97
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Sun Aug 17 04:59:36 2025 -0400
added proper KV cache management for MTP layers and slightly refactored
commit 6e9bafc
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Fri Aug 15 23:13:56 2025 -0400
failed attempt to implement MTP; outputs tokens but KV cache management is unreasonable
commit cf0f7c0
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Wed Aug 13 02:21:17 2025 -0400
broad thrust of the mtp implementation
commit 03231da
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Tue Aug 12 01:03:59 2025 -0400
add model member function to build mtp graph, to be called from speculative.cpp
commit 1f477b3
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Mon Aug 11 20:54:45 2025 -0400
make nextn weights loadable without a crash
commit e434f87
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Mon Aug 11 01:21:47 2025 -0400
some work towards building mtp layer graph
commit db60623
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date: Sun Aug 10 23:52:54 2025 -0400
added getter for nextn layer count and server slot has_mtp property
GLM-4.6 models exclude specific MTP tensors (`embed_tokens` and `shared_head_head`), implying weight tying with the main model. Previously, this caused a crash when building the graph. This commit adds a fallback mechanism to use the main model's token embeddings and output head when the MTP-specific tensors are missing.
Adds a new `mtp` boolean to `llama_model_params`. When set to false (default):
1. The loader skips loading MTP-specific tensors (NextN layers) using `TENSOR_SKIP`.
2. The KV cache size calculation excludes the MTP layer (`n_layer_kv_from_start`).

This reduces VRAM usage and load time for users running GLM-4.5/4.6 in standard generation mode.
Removes heavy penalty checks (repetition, frequency, presence, DRY) from `common_sampler_sample_speculative`. The specialized speculative sampler now uses a pure ArgMax (Greedy) approach. This significantly reduces CPU overhead during the drafting phase, which improves overall tokens per second.
|
@CISC It should be ready now. |
|
@SamuelOliveirads Hi. Just trying to test the branch, and I got these warnings when building for HIP. Maybe they will help to polish the PR. |
|
Hi there. Could anyone provide a complete example command for llama-server? |
Thanks, I will take some time to look into these warnings.
@CHNtentes Sure! Please keep in mind that this PR is about implementing the MTP architecture and is not necessarily fully optimized yet. I have seen some users getting the same or even lower performance than the baseline. You will need to use three arguments when loading the model:
Fine-tuning Advice
Results will also vary depending on your use case; tasks like coding (where you can use greedy decoding) will likely give better results than creative writing. |
Thanks for your reply. I'll run some tests once the GLM 4.7 GGUFs are downloaded. Is this a totally different implementation? |
The architecture is the same; it uses Eagle, as that is what GLM requires. For comparison, it's something like:
Want to replicate the recommended params from Z.ai? Just use: |
|
I tried this PR with https://huggingface.co/unsloth/GLM-4.7-GGUF/tree/main/UD-Q4_K_XL and the following command as a baseline:

baseline decode speed:
-mtp --draft 1:
-mtp --draft 2:
-mtp --draft 3:
-mtp --draft 2 --draft-p-min 0.85:

It seems the feature is working, but the speed is indeed slower than without MTP. I'll wait for future optimizations. |
|
After some tests I found it crashes for some models, like Xiaomi MiMo. |
Someone can correct me if I'm mistaken, but this is an implementation of MTP for GLM; MiMo will presumably require an extension/modification of this PR for its MTP implementation to work. |
You are right, but the application should not crash; it should ignore unsupported models. |
CISC left a comment
Let's start small and hopefully get the ball rolling...
| add_opt(common_arg( | ||
| {"-mtp", "--multi-token-prediction"}, | ||
| string_format("Activate multi-token-prediction (if supported) (default: %s)", params.mtp ? "true" : "false"), | ||
| [](common_params & params) { | ||
| params.mtp = true; | ||
| } | ||
| )); |
| add_opt(common_arg( | |
| {"-mtp", "--multi-token-prediction"}, | |
| string_format("Activate multi-token-prediction (if supported) (default: %s)", params.mtp ? "true" : "false"), | |
| [](common_params & params) { | |
| params.mtp = true; | |
| } | |
| )); | |
| add_opt(common_arg( | |
| {"-mtp", "--multi-token-prediction"}, | |
| {"-no-mtp", "--no-multi-token-prediction"}, | |
| string_format("whether to use multi-token-prediction (if supported) (default: %s)", params.mtp ? "true" : "false"), | |
| [](common_params & params, bool value) { | |
| params.mtp = value; | |
| } | |
| )); |
src/llama-context.cpp
Outdated
| const llama_memory_context_i * mctx, | ||
| llm_graph_type gtype) const { | ||
| const llama_memory_context_i * mctx, | ||
| llm_graph_type gtype, | ||
| const llama_mtp_params & mtp_params) const { |
Saw this a couple of times now; please don't change these, and align properly. The idea is that you should be able to quickly and easily get an overview of the variable names thanks to the vertical alignment.
src/llama-context.h
Outdated
|
|
| mutable int32_t n_reused = 0; // number of times the previous graph was reused | ||
| }; | ||
| }; No newline at end of file |
Beware of missing EOF newlines; this and others will fail CI.
| if (!res.empty()) { | ||
| std::string idxs_str; | ||
| for (const auto& vec : res.idxs) { | ||
| if (!vec.empty()) { | ||
| if (vec.size() > 8) { | ||
| idxs_str += " [" + std::to_string(vec.front()) + "..." + std::to_string(vec.back()) + " (" + std::to_string(vec.size()) + " cells)]"; | ||
| } else { | ||
| idxs_str += " ["; | ||
| for(size_t i = 0; i < vec.size(); ++i) { | ||
| idxs_str += std::to_string(vec[i]) + (i == vec.size() - 1 ? "" : ", "); | ||
| } | ||
| idxs_str += "]"; | ||
| } | ||
| } | ||
| } | ||
| } |
Leftover debug, but no logging anymore?
include/llama.h
Outdated
| * @brief Removes KV cache metadata for a specified sequence and token range. | ||
| * This makes the physical cells logically available again without deleting the tensor data. | ||
| */ | ||
| LLAMA_API void llama_kv_cache_seq_rm(struct llama_context * ctx, llama_seq_id seq_id, llama_pos p0, llama_pos p1); |
I genuinely don't think this array of APIs will be accepted by other maintainers. IMO it breaks a lot of the patterns that we explicitly established in CONTRIBUTING.md (have you even read it before pushing this PR?)
- It doesn't make sense to make a breaking change to `llama_batch` by adding `mtp_params` to it. `llama_set_causal_attn()` and `llama_set_embeddings()` are already used for a similar purpose.
- Why `llama_kv_cache_seq_rm`? What's wrong with `llama_memory_seq_rm()`?
- What is an `sinfo`? Nowhere in this file explains it. It doesn't even have a public struct.
- `llama_set_draft_input_hidden_state` indicates that we have to manually copy embeddings from the main LLM to the MTP layers. This doesn't resolve the core issue brought up by Georgi's comment. Plus, it breaks the API naming convention of `llama_<module>_<verb>`.
Just a reminder that supporting MTP is NOT hard, but designing an API to support all MTP models is hard.
Unless this PR invests more thought and work into designing a "universal" API that supports most MTP models, I don't think we can consider merging it.
|
@F1LM1, do you want to work on these fixes, or should I handle them? |
Sure, I'll have time to look at this later this week |
This is very much a draft/proof of concept I'm playing with, just one idea for an MTP implementation. Planning to test on GLM-4.5 because it's the only model out there that we've preserved NextN tensors for.
From what I can tell
So implementation-wise it seems like
- `mtp_speculative_gen_draft` in speculative.cpp that is vastly simplified, and branch into it in server.cpp when a slot has MTP (versus `common_speculative_gen_draft`).
- `ctx_dft` in this case as well. It's a bit hacky, but I was thinking we could just have `ctx_dft = ctx` and then have both normal and MTP passes write over the shared `ctx` logits. I think this minimizes required code changes elsewhere.

This is my first time (1) working with ML stuff outside of Python, (2) attempting to contribute, so patience is appreciated :)