
server: implement GLM-style MTP #15225

Open
F1LM1 wants to merge 8 commits into ggml-org:master from F1LM1:glm4-moe-mtp

Conversation

@F1LM1

@F1LM1 F1LM1 commented Aug 11, 2025

This is very much a draft/proof of concept I'm playing with, just one idea for an MTP implementation. Planning to test on GLM-4.5 because it's the only model out there that we've preserved NextN tensors for.

From what I can tell

  • the three models with MTP implemented in vLLM right now are all "DeepseekV3-style,"
  • they only have one MTP head, which predicts token at position n+2,
  • the MTP layers take as input the output embedding from the last conventional layer and their own input embedding.

So implementation-wise it seems like

  • we should try to reuse the existing speculative decode functionality (including nice stuff like main model KV cache management, various samplers, etc.),
  • but a lot of the full draft model functionality is redundant/harmful, like context/cache management for the draft model, vocab matching,
  • it probably makes sense to write a new function like mtp_speculative_gen_draft in speculative.cpp that is vastly simplified and branch into it in server.cpp when a slot has MTP (versus common_speculative_gen_draft).
  • AFAICT it looks like the server.cpp loop currently alternates between conventional forward pass and draft, which in the MTP case will probably sabotage performance gains (since our max throughput is only 1.5 tok/pass assuming zero rejections, instead of 2 tok/pass). Let me know if this isn't the case!—but if it is, should probably avoid doing non-speculative decodes after the first response token.
  • It doesn't make sense to have to manage a distinct ctx_dft in this case either. It's a bit hacky, but I was thinking we could just have ctx_dft = ctx and then have both normal and MTP passes write over the shared ctx logits. I think this minimizes required code changes elsewhere.
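To make the 1.5 vs. 2 tok/pass figures concrete, here's a toy throughput model (my own sketch; the function names are made up, not llama.cpp code):

```cpp
#include <cassert>

// Expected tokens produced per decode pass with a single MTP head.
// `accept` is the draft acceptance probability in [0, 1].
// A speculative pass verifies 1 drafted token, so it yields 1 + accept
// tokens on average; a conventional pass always yields exactly 1.
double tokens_per_speculative_pass(double accept) {
    return 1.0 + accept;
}

// If the server loop alternates conventional and speculative passes,
// the average is pulled down toward 1: at 100% acceptance you get
// (1 + 2) tokens over 2 passes = 1.5 tok/pass instead of 2.
double tokens_per_pass_alternating(double accept) {
    return (1.0 + tokens_per_speculative_pass(accept)) / 2.0;
}
```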

This is my first time (1) working with ML stuff outside of python (2) attempting to contribute, so patience is appreciated :)

@ggerganov ggerganov added the hot label Aug 12, 2025
@ggerganov
Member

AFAICT it looks like the server.cpp loop currently alternates between conventional forward pass and draft, which in the MTP case will probably sabotage performance gains (since our max throughput is only 1.5 tok/pass assuming zero rejections, instead of 2 tok/pass). Let me know if this isn't the case!—but if it is, should probably avoid doing non-speculative decodes after the first response token.

This is correct - we always alternate between conventional and speculative passes. It's definitely not optimal, but it improves flexibility for regular sampling: it allows changing the speculative parameters, and even disabling speculation per request, while the logic stays quite simple.

It should be possible to improve this by keeping track of which slots are speculating on each iteration and skipping their tokens in the conventional batch. It might be a good idea to implement this separately to avoid huge changes to the logic in a single PR.
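A minimal sketch of the bookkeeping this would involve (hypothetical Slot struct and build_conventional_batch helper, not the actual server.cpp types):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified slot state; the real server.cpp slots
// carry much more than this.
struct Slot {
    int32_t token       = 0;     // next token to decode
    bool    speculating = false; // currently running a speculative pass
};

// Collect tokens for the conventional batch, skipping slots that are
// speculating this iteration so their tokens are not decoded twice.
std::vector<int32_t> build_conventional_batch(const std::vector<Slot> & slots) {
    std::vector<int32_t> batch;
    for (const auto & slot : slots) {
        if (!slot.speculating) {
            batch.push_back(slot.token);
        }
    }
    return batch;
}
```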

@ggerganov
Member

Generally we should try to minimize the changes to llama.h, since changing/extending the public API requires a lot of effort.

On first look, I think the path that involves minimal changes is:

  • Add an int n_mtp flag to llama_context_params (default = 1: MTP disabled; 2: predict logits for one additional token; 3: predict logits for two additional tokens; etc.)
  • Use this flag during graph build to determine if the MTP heads should be appended to the graph
  • Keep the conventional logits in the t_logits tensor in llm_graph_result
  • Add a new tensor t_logits_mtp (or whatever is more appropriate) in llm_graph_result and use it to store the MTP results
  • In llama_decode() extract the t_logits_mtp data when available, following the same logic as for t_logits

Extracting the MTP logits during llama_decode() can be done in 2 ways:

  • Create separate buffer in the llama_context to store them and add a new llama_get_logits_mtp_ith() API that works with that new buffer in a similar way as the existing llama_get_logits_ith()
  • Reuse the existing logits buffer by expanding it from [n_outputs][n_vocab] to [n_outputs][n_mtp*n_vocab]. This would avoid the need to add llama_get_logits_mtp_ith(), and we can generalize the existing llama_get_logits_ith() by taking into account the value of n_mtp.

Currently, I am not sure which way is better. The first requires a new API call, while the second might break some existing assumptions (not sure if that's the case yet).
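For concreteness, a sketch of how the expanded buffer in the second option could be indexed (logits_offset is a hypothetical helper, not part of llama.h):

```cpp
#include <cassert>
#include <cstddef>

// Index into a logits buffer laid out as [n_outputs][n_mtp * n_vocab]:
// for output row `i`, chunk k = 0 holds the conventional logits and
// chunks k = 1..n_mtp-1 hold the MTP heads' logits. With n_mtp = 1
// (MTP disabled) this degenerates to the current [n_outputs][n_vocab]
// layout. Sketch only, not llama.cpp code.
size_t logits_offset(size_t i, size_t k, size_t n_vocab, size_t n_mtp) {
    return i * (n_mtp * n_vocab) + k * n_vocab;
}
```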

In any case, you can avoid this until you get the implementation working with a reasonable speedup. After that, we can discuss further how to best refactor the implementation.

@slaren
Member

slaren commented Aug 13, 2025

Currently, I am not sure which way is better. The first requires a new API call, while the second might break some existing assumptions (not sure if that's the case yet).

I don't see an issue with adding a new API for this, and it would be easier to use.

@Juahyori

Out of curiosity, is the API for this expected to be flexible enough that we could jump off of it to add things like Medusa/Eagle-style (or IBM Accelerator) self-speculative decoding heads?

I'm pretty sure they work fairly similarly (depending on the final output embeddings of the current token).

Another note:

After some consideration, I think the expected speedup of the MTP module will depend a lot on the hardware the model's running on, particularly because it's an MoE model. While next-token prediction depends only on the current state, self-speculative decoding requires additional forward passes. Those forward passes aren't guaranteed to have the same expert usage patterns, so the speedup should be some function of the number of tokens predicted and the expert-reuse coefficient for the tokens verified.

So, just noting that if it's implemented and there's not a 2x or 3x increase in T/s, it may not be a skill issue on the part of a contributor, but due to the mathematical nature of the calculation.

For people running franken setups with Attention / KV Cache on GPU and MoE FFNs on CPU, it's possible that using previously unused experts in the verification sweep may result in a weird situation where the parallel verification process is actually memory bandwidth bound.

Not to discourage the implementation of this, I just wanted to give a heads up so nobody's dejected if the theoretical speedups can't be hit. There should still be at least some speedup, though.
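As a toy model of this effect (my own sketch; moe_speedup and its cost model are assumptions, not measurements):

```cpp
#include <cassert>

// Toy model (mine, not from the thread): one MTP head drafts one token
// per verification pass. `accept` is the draft acceptance probability;
// `reuse` is the fraction of experts the 2-token verification pass
// shares with a plain 1-token pass (0 = all different, 1 = identical).
// On a memory-bandwidth-bound MoE setup, the verification pass then
// costs roughly 1 + (1 - reuse) single-pass units of weight traffic.
double moe_speedup(double accept, double reuse) {
    const double tokens = 1.0 + accept;         // expected tokens per pass
    const double cost   = 1.0 + (1.0 - reuse);  // relative pass cost
    return tokens / cost;
}
```

Under this model, even perfect acceptance yields no speedup when the verification sweep touches an entirely disjoint set of experts.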

@F1LM1
Author

F1LM1 commented Aug 14, 2025

Thanks all for the suggestions. Will definitely look to refactor into something nicer once correctness can be established.
Right now, still trying to get the graph to compute. Turns out reusing the main model's ctx comes with a handful of issues, like the scheduler being weird and not releasing properly.

So, just noting that if it's implemented and there's not a 2x or 3x increase in T/s, it may not be a skill issue on the part of a contributor, but due to the mathematical nature of the calculation.

For people running franken setups with Attention / KV Cache on GPU and MoE FFNs on CPU, it's possible that using previously unused experts in the verification sweep may result in a weird situation where the parallel verification process is actually memory bandwidth bound.

Not to discourage the implementation of this, I just wanted to give a heads up so nobody's dejected if the theoretical speedups can't be hit. There should still be at least some speedup, though.

Yeah, I'd generally recommend that people temper their expectations with this. Especially since these three models only have one MTP head, the theoretical performance gain is hard-bounded at 2x on the top end, and that assumes a perfectly efficient implementation and 100% draft acceptance.

In the absence of actual data from a working prototype... I'd probably guess that the implementation after this PR will be on the order of 40% speedup, then up to 80% after completing this:

It should be possible to improve this by keeping track of which slots are speculating on each iteration and skipping their tokens in the conventional batch. It might be a good idea to implement this separately to avoid huge changes to the logic in a single PR.

Optimistically, I hope to have an ugly but working prototype done sometime today.

@F1LM1
Author

F1LM1 commented Aug 16, 2025

I've gotten to the point where I can get the MTP head to output stuff, but managing the KV cache with an external call to a separate MTP graph adds an unbelievable amount of complexity: I think we need to do a forward pass for the MTP layer not just when we're sampling, but for every token we decode. This goes against the scheduling/batching that we're doing (we'd probably have to add some form of per-token callback to ggml_backend_sched_compute_splits, which just feels like it cannot possibly be the correct approach). It's adding a crazy amount of API bloat as well, as already discussed above.

Think I'll take the principled approach suggested by @ggerganov above and just create a single augmented graph. But on the plus side, from this previous attempt I'm pretty confident the MTP subgraph itself is correct, so it wasn't a total waste of time. 🤪

I'll commit the old branch in a sec in case it ever winds up being useful, but I kind of doubt it (outside of serving as a reference for constructing the MTP subgraph).

@F1LM1
Author

F1LM1 commented Aug 16, 2025

On second thought, building a single augmented graph also doesn't work, because we need the main model's sampled token in the MTP subgraph. We could make some shortcut assumptions, like "greedy sample" in the MTP subgraph, but as soon as we fail to match the actual main model sampled token for the first time, the MTP layer's KV cache is invalid.

Something along the lines of the original approach might work: management of the MTP subgraph's KV cache could be made easier by using cparams.embeddings = true and LLAMA_POOLING_TYPE_NONE, decoding an entire batch, then running the entire batch through the MTP head (discarding outputs) to keep its cache up to date.

@F1LM1
Author

F1LM1 commented Aug 17, 2025

This commit sort of works, in the sense that it outputs tokens but

  • I can't guarantee that I didn't break things in the multi-slot case,
  • the model seems a bit... dumb, like it's still generally coherent but something feels a bit off. I might be subtly messing with the main model's KV cache in some way,
  • the draft acceptance rate is lower than I'd expect, somewhere around 60% rather than >80%. I suspect this is really another symptom of the same issue causing the second point.

Still enough to make me optimistic that this general approach can be refined into a real implementation and avoids the challenges noted in my last two comments.

@F1LM1
Author

F1LM1 commented Aug 19, 2025

Okay, I believe this commit "works" in that the main model and MTP outputs both seem correct under my informal test conditions. The model is now about as coherent as the base model, at least in basic conversations and solving AIME problems. The typical draft acceptance rate is ~70% with my samplers. It usually gets higher than that for simple responses, code, and math; lower for creative writing. Would probably be good to compare against the vLLM implementation to see if that's around expected (I don't have enough VRAM to run either variant in vLLM, sadly).

This implementation is still far from done:

  • inefficient (all MTP batches are size 1, still alternating between single-token and double-token samples)
  • kludgey
  • fragile (probably buggy in multi-slot case)
  • bad stylistically
  • there seems to be a memory leak
  • doesn't accept prompt >n_ubatch

but despite all of that, I think it works as a proof of concept. Even with the implementation as poor as it is right now, it's giving something like a 20-25% performance uplift.

@ghost

ghost commented Aug 20, 2025

Tried to run it in RP scenario (using Q4 quant), got from 0.07 to 0.11 acceptance rate on swipes (one time unexpectedly got 0.18) (t=0.8, min p 0.05, top P 0.95). So yeah, probably we shouldn't expect too much from it.

@F1LM1
Author

F1LM1 commented Aug 20, 2025

Tried to run it in RP scenario (using Q4 quant), got from 0.07 to 0.11 acceptance rate on swipes (one time unexpectedly got 0.18) (t=0.8, min p 0.05, top P 0.95). So yeah, probably we shouldn't expect too much from it.

This sounds like I have a bug with token caching and/or KV shifting on my end, so far haven't been testing in any setting outside of the server webui itself. My caching behavior is almost certainly wrong outside of optimal scenarios right now. Things are still very WIP.

I'll test later but I'd probably expect 50-60% draft acceptance for RP once everything is working, for reference.

@F1LM1
Author

F1LM1 commented Aug 26, 2025

Upon a bit of testing on my end in RP/creative writing scenarios, I can't find any obvious correctness issues with the cache management of this prototype; I think the draft head is just unusually bad in an RP setting. I did replace the weird logit-hack-then-standard-sample workflow with a greedy sample, which I think is simpler, less hacky, and provides a slight boost to acceptance rates. But I'm still not getting much more than ~40% draft acceptance in RP scenarios with heavy use of advanced samplers like DRY and XTC. I think it'd be good to know if the vLLM implementation yields similar results.

Aside from that, I'll start refactoring. The last correctness issue, which I think may also be the thorniest, is that the MTP cache updates need the embeddings from the main decode pass in order to run, but we can only store one ubatch's worth of embeddings at a time, so the KV cache update step needs to be moved into llama_context::decode, or else n_ubatch needs to be set very large (which kills memory usage). However, this only covers maintaining the KV cache during prompt processing; after that, we need access to our actual sampled outputs in order to keep updating the MTP head, so those updates have to run after common_sampler_sample. Thankfully, at that point n_ubatch stops being a constraining factor, so it's possible. And of course, things in general need to be cleaned up significantly.
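For reference, the greedy draft selection mentioned above boils down to an argmax over the MTP head's logits; a minimal sketch (greedy_draft is a hypothetical name, and the real branch goes through llama.cpp's sampler API):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Greedy draft selection: take the argmax of the MTP head's logits.
// No penalties or truncation samplers are applied to the draft; the
// main model's sampler still decides whether the draft is accepted.
int32_t greedy_draft(const std::vector<float> & logits) {
    return (int32_t) std::distance(
        logits.begin(),
        std::max_element(logits.begin(), logits.end()));
}
```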

@Stealt91

Not an expert by any means, but XTC is unlikely to work well with MTP as it's excluding the top choice(s). Have you tried it without XTC? DRY shouldn't be much of an issue, but there possibly is some impact there as well.

@Stealt91

Stealt91 commented Oct 4, 2025

Is work on this still progressing in the background? If not, then what kind of work still remains to be done? Is it mainly cleanup and refactoring? If so, could another developer try their hands at this? I am eagerly awaiting MTP support and it would be a shame to see no progress on this after an initial working implementation has been done.

I have no experience with llama.cpp in particular, but if it's "only" cleanup work that needs no deep knowledge of ML libraries, I might be able to support.

@F1LM1
Author

F1LM1 commented Oct 5, 2025

Is work on this still progressing in the background? If not, then what kind of work still remains to be done? Is it mainly cleanup and refactoring? If so, could another developer try their hands at this? I am eagerly awaiting MTP support and it would be a shame to see no progress on this after an initial working implementation has been done.

I have no experience with llama.cpp in particular, but if it's "only" cleanup work that needs no deep knowledge of ML libraries, I might be able to support.

@SamuelOliveirads is still working on this in a PR on this branch, see F1LM1#3. It's primarily a refactor and some optimizations, but making my crappy prototype work in a way that is reasonably maintainable/extensible is not that trivial. I've been pretty busy the last few weeks, but I don't think we're too far off from a reasonable implementation.

@Stealt91

Stealt91 commented Oct 5, 2025

That is great to hear! Thank you a lot for your effort so far! I had already guessed that it might not be possible to contribute without deeper knowledge, but was willing to give it a try anyway. It's great to hear that @SamuelOliveirads is already working on it!

@SamuelOliveirads

The latest commits successfully integrate the MTP into the llama_decode architecture, maintaining the good output quality from before.

The next major step is optimization. Please note: as of now, this branch is expected to be slower than the baseline without MTP. It should be considered a development preview, not a performance-ready feature.

To tackle the performance issues, a lot of work is needed, particularly in areas where I'm still building my expertise. I've gathered my detailed findings and ideas for optimization in a separate discussion here: F1LM1#4

This is an open invitation for anyone with experience in the following topics to take a look and share any insights. Any help would be greatly appreciated!

  • Graph reuse and caching strategies
  • GGML backend allocation and scheduling
  • Optimizing graph construction for batches with more than one token

@SamuelOliveirads

I haven't made much progress on the optimization front over the last week, aside from the small graph reuse improvement mentioned in PR #4.

Speaking of which, @ggerganov, I would love to get your thoughts on the strategic goals for this PR. As it stands, the MTP implementation is functionally correct and could be tested, but it's not expected to be faster than the baseline.

Considering the challenges in optimizing it, do you think it's a good idea to merge this version now? It could serve as a foundation, allowing other maintainers to contribute in other areas, like further optimizations or adapting the MTP logic to other models.

If that's the best approach, I can focus on adding the necessary user parameters and preparing the PR for wider testing and review. However, if you feel that improving the performance is essential before merging, I would greatly appreciate any help or information you could share regarding the current bottlenecks I've identified.

@wishstudio
Contributor

I've been following this PR for quite a while and thank you for the enormous work you have done!

I believe I saw somewhere that the MTP acceptance rate is very sensitive to the quantization used. It seems standard practice is to keep the MTP layers at FP8 or even FP16 (link). But a quick look at some of the available quants shows they usually quantize the MTP layers at the same level as the rest of the model.

So I guess improving this may lead to better acceptance rates when using quants.

@SamuelOliveirads

I've been following this PR for quite a while and thank you for the enormous work you have done!

I believe I saw somewhere that the MTP acceptance rate is very sensitive to the quantization used. It seems standard practice is to keep the MTP layers at FP8 or even FP16 (link). But a quick look at some of the available quants shows they usually quantize the MTP layers at the same level as the rest of the model.

So I guess improving this may lead to better acceptance rates when using quants.

That's a very plausible point, as vision models often suffer from the same issue, which is why most of them are kept at FP16. As for MTP, it's complicated to assess in llama.cpp since we don't fully support it yet, and I'm not aware of other backends having tested and documented how quantization affects MTP performance.

I still want to thank you for the link. Looking at NVIDIA's TensorRT-LLM backend provided some inspiration on how to solve certain optimization problems. However, it's not a simple task. The ideal scenario would be to perform all MTP operations in a single model pass:

Update MTP context + draft token 1 + ... + draft token n.

This is complicated because the MTP has a dependency on the previous hidden state. To fix this, we could apply a loop over the MTP graph to get the previous hidden state and generate one token at a time. This, in turn, creates another problem: if I try to generate 4 tokens, llama.cpp expects me to run the graph only once with a batch of 4 tokens, not 4 times with one token each.

I'm currently studying the architecture and gathering some ideas, running tests to see if I can find a simpler approach.
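The one-token-at-a-time dependency above can be sketched as a loop that threads the hidden state through a stand-in step function (all names here are hypothetical, not the actual graph code):

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// The MTP head depends on the previous hidden state, so drafting n
// tokens means n sequential single-token passes rather than one
// batch-of-n pass. `MtpStep` stands in for one pass through the MTP
// graph: (hidden_state, token) -> (next_hidden_state, next_token).
struct MtpStepResult {
    std::vector<float> hidden;
    int32_t            token;
};

using MtpStep = std::function<MtpStepResult(const std::vector<float> &, int32_t)>;

std::vector<int32_t> draft_recursive(const MtpStep & mtp_step,
                                     std::vector<float> hidden,
                                     int32_t last_token,
                                     int n_draft) {
    std::vector<int32_t> draft;
    for (int i = 0; i < n_draft; ++i) {
        // Each iteration consumes the hidden state produced by the
        // previous one - this is the serial dependency that prevents
        // a single batched graph invocation.
        MtpStepResult r = mtp_step(hidden, last_token);
        hidden     = r.hidden;
        last_token = r.token;
        draft.push_back(r.token);
    }
    return draft;
}
```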

@F1LM1 F1LM1 requested a review from CISC as a code owner December 21, 2025 08:02
@CISC
Collaborator

CISC commented Dec 21, 2025

Please rebase (looks like it will be a bit of work, ignore if you already are working on it). :)

SamuelOliveirads and others added 6 commits December 21, 2025 17:23
commit 912ed2cd9339d1b2875d98744ca5b51fa62e581e
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sun Dec 7 23:00:29 2025 -0300

    speculative (feat): implement recursive MTP drafting for GLM-4.5

commit bdf72d9
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sat Dec 6 16:10:16 2025 -0300

    sampling (feat): optimize speculative drafting with fast-path selection

commit a91980a
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sat Dec 6 15:18:19 2025 -0300

    mtp (chore): clean old code

commit 6de0ecf
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sat Dec 6 14:40:13 2025 -0300

    mtp (feat): add mtp arg

commit ea77394
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sat Dec 6 13:47:54 2025 -0300

    mtp-graph (fix): move llama_get_logits_ith outside the loop

commit 15dff20
Merge: 171346c cae85fe
Author: samuel <samueloliveira32df@gmail.com>
Date:   Thu Oct 16 13:44:41 2025 -0300

    Merge branch 'glm4-mtp-batch' of https://github.com/SamuelOliveirads/llama.cpp into glm4-mtp-graph-cache

commit cae85fe
Author: samuel <samueloliveira32df@gmail.com>
Date:   Thu Oct 16 13:42:31 2025 -0300

    mtp-batch(fix): avoid logits for mtp kv cache operations

commit 171346c
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sun Oct 12 16:33:01 2025 -0300

    mtp-graph(feat): Reactivate graph reuse only for main model path

commit 0127c6b
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sat Oct 11 22:20:54 2025 -0300

    mtp-batch(chore): Remove final MTP debug logs and dead code

commit 4bcc9e2
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sat Oct 11 18:51:22 2025 -0300

    mtp-batch(fix): Correctly advance cache head and add MTP documentation

commit b4cbe03
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sat Oct 11 18:37:40 2025 -0300

    mtp-batch(chore): Fix logit flags for speculative sampling and remove debug logs

commit a99709d
Author: samuel <samueloliveira32df@gmail.com>
Date:   Fri Oct 10 17:24:34 2025 -0300

    mtp-batch(refactor): Extract decode context and MTP input logic into helper methods

commit 913af8f
Author: samuel <samueloliveira32df@gmail.com>
Date:   Fri Oct 10 16:44:28 2025 -0300

    mtp-batch(refactor): Replace MTP boolean flags with an explicit operation enum

commit 6f74ba3
Author: samuel <samueloliveira32df@gmail.com>
Date:   Thu Oct 9 22:27:18 2025 -0300

    mtp-batch (fix): prevent mtp draft from polluting the cache

commit 5e1d719
Author: samuel <samueloliveira32df@gmail.com>
Date:   Thu Oct 9 15:21:23 2025 -0300

    mtp-batch (feat): Create and manage sinfo for MTP

commit febd823
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sun Oct 5 14:43:40 2025 -0300

    mtp-batch (wip): fix how to warmup kv cache for MTP

commit 67c6c06
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sat Sep 27 19:42:32 2025 -0300

    mtp-batch (wip): Isolate MTP graph to prevent host embedding buffer corruption

commit 75dc25e
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sat Sep 27 17:17:00 2025 -0300

    mtp-batch (wip): organize batch for mtp cache

commit 3da7e7f
Author: samuel <samueloliveira32df@gmail.com>
Date:   Tue Sep 23 22:45:11 2025 -0300

    mtp-batch (fix): warm mtp cache for small batch size

commit df64508
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sun Sep 21 21:55:41 2025 -0300

    mtp-batch (wip): merge glm graphs

commit 042eb8a
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sun Sep 21 21:29:00 2025 -0300

    mtp-batch (wip): merge mtp and model graph

commit 1318b2d
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sun Sep 14 10:22:59 2025 -0300

    mtp-batch (wip): move mtp execution to batch format

commit c6237c7
Merge: 9fab53e 8742ce0
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date:   Sat Sep 13 02:57:01 2025 -0400

    Merge pull request #1 from SamuelOliveirads/glm4-moe-mtp

    feat: implemented sampling for MTP

commit 8742ce0
Author: samuel <samueloliveira32df@gmail.com>
Date:   Sat Sep 6 00:21:18 2025 -0300

    feat: apply logits + greedy sampler

commit 5a5bce8
Author: samuel <samueloliveira32df@gmail.com>
Date:   Wed Sep 3 17:56:14 2025 -0300

    fix: add sample acceptance

commit 07670a2
Author: samuel <samueloliveira32df@gmail.com>
Date:   Wed Sep 3 13:25:21 2025 -0300

    feat: implemented sampling for MTP

commit 9fab53e
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date:   Tue Sep 2 17:14:09 2025 -0400

    fixed mtp kv cache update step in cases where prompt size > n_batch and n_ubatch

commit 98bc0c6
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date:   Tue Aug 26 01:26:51 2025 -0400

    replace standard sampler with greedy sampler for mtp draft

commit 471e026
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date:   Tue Aug 19 23:10:56 2025 -0400

    fixed vram leak

commit d72f9d5
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date:   Tue Aug 19 01:50:34 2025 -0400

    kludge-y kv cache management of mtp layer

commit 382135a
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date:   Sun Aug 17 21:54:45 2025 -0400

    fixed mtp kv cache update sequencing after prompt processing

commit 6870f97
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date:   Sun Aug 17 04:59:36 2025 -0400

    added proper KV cache management for MTP layers and slightly refactored

commit 6e9bafc
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date:   Fri Aug 15 23:13:56 2025 -0400

    failed attempt to implement MTP; outputs tokens but KV cache management is unreasonable

commit cf0f7c0
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date:   Wed Aug 13 02:21:17 2025 -0400

    broad thrust of the mtp implementation

commit 03231da
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date:   Tue Aug 12 01:03:59 2025 -0400

    add model member function to build mtp graph, to be called from speculative.cpp

commit 1f477b3
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date:   Mon Aug 11 20:54:45 2025 -0400

    make nextn weights loadable without a crash

commit e434f87
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date:   Mon Aug 11 01:21:47 2025 -0400

    some work towards building mtp layer graph

commit db60623
Author: Aaron Lee <lee.aaron.65@gmail.com>
Date:   Sun Aug 10 23:52:54 2025 -0400

    added getter for nextn layer count and server slot has_mtp property
GLM-4.6 models exclude specific MTP tensors (`embed_tokens` and `shared_head_head`), implying weight tying with the main model. Previously, this caused a crash when building the graph.

This commit adds a fallback mechanism to use the main model's token embeddings and output head when the MTP-specific tensors are missing.
Adds a new `mtp` boolean to `llama_model_params`. When set to false (default):
1. The loader skips loading MTP-specific tensors (NextN layers) using `TENSOR_SKIP`.
2. The KV cache size calculation excludes the MTP layer (`n_layer_kv_from_start`).

This reduces VRAM usage and load time for users running GLM-4.5/4.6 in standard generation mode.
Removes heavy penalty checks (repetition, frequency, presence, DRY) from
`common_sampler_sample_speculative`.

The specialized speculative sampler now uses a pure ArgMax (Greedy) approach.
This significantly reduces CPU overhead during the drafting phase, which
improves overall tokens per second.
@SamuelOliveirads

@CISC It should be ready now.

@MikeLP

MikeLP commented Dec 23, 2025

@SamuelOliveirads Hi. Just trying to test the branch, and got these warnings when building for HIP.

HIPCXX="$(hipconfig -l)/clang" \
HIP_PATH="$(hipconfig -R)" \
cmake -S . -B build \
  -DGGML_HIP=ON -DGGML_HIPBLAS=ON \
  -DGGML_HIP_ROCWMMA_FATTN=ON \
  -DGGML_NATIVE=ON \
  -DBUILD_SHARED_LIBS=OFF \
  -DLLAMA_BUILD_EXAMPLES=OFF \
  -DLLAMA_BUILD_TESTS=OFF \
  -DLLAMA_BUILD_TOOLS=ON \
  -DLLAMA_BUILD_SERVER=ON \
  -DLLAMA_CURL=ON \
  -DGGML_RPC=ON \
  -DCMAKE_BUILD_TYPE=Release  \
  -DCMAKE_EXE_LINKER_FLAGS="-no-pie" \
&& cmake --build build --config Release -- -j 48
CMAKE_BUILD_TYPE=Release
-- ccache found, compilation results will be cached. Disable with GGML_CCACHE=OFF.
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- GGML_SYSTEM_ARCH: x86
-- Including CPU backend
-- x86 detected
-- Adding CPU backend variant ggml-cpu: -march=native 
-- HIP and hipBLAS found
-- Including HIP backend
-- Using RPC backend
-- Including RPC backend
-- ggml version: 0.9.4
-- ggml commit:  43c023c85
-- Configuring done (0.2s)
-- Generating done (0.1s)
-- Build files have been written to: /home/iyanello/Projects/ML/llama.cpp/build
[  0%] Building CXX object common/CMakeFiles/build_info.dir/build-info.cpp.o
[  1%] Built target llama-qwen2vl-cli
[  1%] Built target llama-llava-cli
[  2%] Built target llama-gemma3-cli
[  4%] Built target cpp-httplib
[  4%] Building C object ggml/src/CMakeFiles/ggml-base.dir/ggml.c.o
[  5%] Built target llama-minicpmv-cli
[  5%] Building C object ggml/src/CMakeFiles/ggml-base.dir/ggml-alloc.c.o
[  5%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/ggml.cpp.o
[  6%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/ggml-opt.cpp.o
[  6%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/ggml-backend.cpp.o
[  6%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/ggml-threading.cpp.o
[  6%] Building C object ggml/src/CMakeFiles/ggml-base.dir/ggml-quants.c.o
[  7%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/gguf.cpp.o
[  7%] Built target build_info
[  7%] Linking CXX static library libggml-base.a
[  7%] Built target ggml-base
[  7%] Building CXX object ggml/src/ggml-rpc/CMakeFiles/ggml-rpc.dir/ggml-rpc.cpp.o
[  7%] Building CXX object ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/llamafile/sgemm.cpp.o
[  7%] Linking CXX static library libggml-rpc.a
[  7%] Built target ggml-rpc
[  7%] Building HIP object ggml/src/ggml-hip/CMakeFiles/ggml-hip.dir/__/ggml-cuda/ggml-cuda.cu.o
[  8%] Building HIP object ggml/src/ggml-hip/CMakeFiles/ggml-hip.dir/__/ggml-cuda/mmvq.cu.o
[  8%] Building HIP object ggml/src/ggml-hip/CMakeFiles/ggml-hip.dir/__/ggml-cuda/quantize.cu.o
[  8%] Building HIP object ggml/src/ggml-hip/CMakeFiles/ggml-hip.dir/__/ggml-cuda/mmq.cu.o
[  8%] Building HIP object ggml/src/ggml-hip/CMakeFiles/ggml-hip.dir/__/ggml-cuda/topk-moe.cu.o
[  9%] Building HIP object ggml/src/ggml-hip/CMakeFiles/ggml-hip.dir/__/ggml-cuda/template-instances/mmq-instance-iq1_s.cu.o
[  9%] Building HIP object ggml/src/ggml-hip/CMakeFiles/ggml-hip.dir/__/ggml-cuda/template-instances/mmq-instance-iq2_s.cu.o
[  9%] Building HIP object ggml/src/ggml-hip/CMakeFiles/ggml-hip.dir/__/ggml-cuda/template-instances/mmq-instance-iq2_xxs.cu.o
[  9%] Building HIP object ggml/src/ggml-hip/CMakeFiles/ggml-hip.dir/__/ggml-cuda/template-instances/mmq-instance-iq3_xxs.cu.o
[  9%] Building HIP object ggml/src/ggml-hip/CMakeFiles/ggml-hip.dir/__/ggml-cuda/template-instances/mmq-instance-iq2_xs.cu.o
[ 10%] Building HIP object ggml/src/ggml-hip/CMakeFiles/ggml-hip.dir/__/ggml-cuda/template-instances/mmq-instance-iq3_s.cu.o
[ 10%] Building HIP object ggml/src/ggml-hip/CMakeFiles/ggml-hip.dir/__/ggml-cuda/template-instances/mmq-instance-iq4_nl.cu.o
[ 11%] Building HIP object ggml/src/ggml-hip/CMakeFiles/ggml-hip.dir/__/ggml-cuda/template-instances/mmq-instance-mxfp4.cu.o
[ 11%] Building HIP object ggml/src/ggml-hip/CMakeFiles/ggml-hip.dir/__/ggml-cuda/template-instances/mmq-instance-iq4_xs.cu.o
[ 11%] Building HIP object ggml/src/ggml-hip/CMakeFiles/ggml-hip.dir/__/ggml-cuda/template-instances/mmq-instance-q2_k.cu.o
[ 12%] Building HIP object ggml/src/ggml-hip/CMakeFiles/ggml-hip.dir/__/ggml-cuda/template-instances/mmq-instance-q4_0.cu.o
[ 12%] Building HIP object ggml/src/ggml-hip/CMakeFiles/ggml-hip.dir/__/ggml-cuda/template-instances/mmq-instance-q3_k.cu.o
[ 12%] Building HIP object ggml/src/ggml-hip/CMakeFiles/ggml-hip.dir/__/ggml-cuda/template-instances/mmq-instance-q4_1.cu.o
[ 12%] Building HIP object ggml/src/ggml-hip/CMakeFiles/ggml-hip.dir/__/ggml-cuda/template-instances/mmq-instance-q4_k.cu.o
[ 13%] Building HIP object ggml/src/ggml-hip/CMakeFiles/ggml-hip.dir/__/ggml-cuda/template-instances/mmq-instance-q5_1.cu.o
[ 13%] Building HIP object ggml/src/ggml-hip/CMakeFiles/ggml-hip.dir/__/ggml-cuda/template-instances/mmq-instance-q5_0.cu.o
[ 13%] Building HIP object ggml/src/ggml-hip/CMakeFiles/ggml-hip.dir/__/ggml-cuda/template-instances/mmq-instance-q6_k.cu.o
[ 13%] Building HIP object ggml/src/ggml-hip/CMakeFiles/ggml-hip.dir/__/ggml-cuda/template-instances/mmq-instance-q5_k.cu.o
[ 13%] Building HIP object ggml/src/ggml-hip/CMakeFiles/ggml-hip.dir/__/ggml-cuda/template-instances/mmq-instance-q8_0.cu.o
[ 13%] Linking CXX static library libggml-cpu.a
[ 17%] Built target ggml-cpu
[ 18%] Linking HIP static library libggml-hip.a
[ 42%] Built target ggml-hip
[ 42%] Built target ggml
[ 42%] Linking CXX executable ../../bin/rpc-server
[ 43%] Building CXX object src/CMakeFiles/llama.dir/llama-adapter.cpp.o
[ 43%] Building CXX object src/CMakeFiles/llama.dir/llama.cpp.o
[ 43%] Building CXX object src/CMakeFiles/llama.dir/llama-arch.cpp.o
[ 43%] Building CXX object src/CMakeFiles/llama.dir/llama-context.cpp.o
[ 44%] Building CXX object src/CMakeFiles/llama.dir/llama-batch.cpp.o
[ 44%] Building CXX object src/CMakeFiles/llama.dir/llama-chat.cpp.o
[ 44%] Building CXX object src/CMakeFiles/llama.dir/llama-cparams.cpp.o
[ 44%] Building CXX object src/CMakeFiles/llama.dir/llama-grammar.cpp.o
[ 45%] Building CXX object src/CMakeFiles/llama.dir/llama-graph.cpp.o
[ 45%] Building CXX object src/CMakeFiles/llama.dir/llama-impl.cpp.o
[ 45%] Building CXX object src/CMakeFiles/llama.dir/llama-hparams.cpp.o
[ 46%] Building CXX object src/CMakeFiles/llama.dir/llama-kv-cache-iswa.cpp.o
[ 46%] Building CXX object src/CMakeFiles/llama.dir/llama-kv-cache.cpp.o
[ 47%] Building CXX object src/CMakeFiles/llama.dir/llama-memory-hybrid.cpp.o
[ 47%] Building CXX object src/CMakeFiles/llama.dir/llama-memory-recurrent.cpp.o
[ 47%] Building CXX object src/CMakeFiles/llama.dir/llama-memory.cpp.o
[ 48%] Building CXX object src/CMakeFiles/llama.dir/llama-model-loader.cpp.o
[ 48%] Building CXX object src/CMakeFiles/llama.dir/llama-model-saver.cpp.o
[ 48%] Building CXX object src/CMakeFiles/llama.dir/llama-model.cpp.o
[ 48%] Building CXX object src/CMakeFiles/llama.dir/llama-quant.cpp.o
[ 48%] Building CXX object src/CMakeFiles/llama.dir/llama-sampling.cpp.o
[ 49%] Building CXX object src/CMakeFiles/llama.dir/llama-vocab.cpp.o
[ 50%] Building CXX object src/CMakeFiles/llama.dir/models/afmoe.cpp.o
[ 50%] Building CXX object src/CMakeFiles/llama.dir/models/apertus.cpp.o
[ 50%] Building CXX object src/CMakeFiles/llama.dir/models/arcee.cpp.o
[ 50%] Building CXX object src/CMakeFiles/llama.dir/models/arctic.cpp.o
[ 51%] Building CXX object src/CMakeFiles/llama.dir/models/arwkv7.cpp.o
[ 51%] Building CXX object src/CMakeFiles/llama.dir/models/baichuan.cpp.o
[ 51%] Building CXX object src/CMakeFiles/llama.dir/models/bailingmoe.cpp.o
[ 52%] Building CXX object src/CMakeFiles/llama.dir/models/bert.cpp.o
[ 52%] Building CXX object src/CMakeFiles/llama.dir/models/bailingmoe2.cpp.o
[ 52%] Building CXX object src/CMakeFiles/llama.dir/models/bitnet.cpp.o
[ 53%] Building CXX object src/CMakeFiles/llama.dir/models/chameleon.cpp.o
[ 53%] Building CXX object src/CMakeFiles/llama.dir/models/bloom.cpp.o
[ 53%] Building CXX object src/CMakeFiles/llama.dir/models/codeshell.cpp.o
[ 53%] Building CXX object src/CMakeFiles/llama.dir/models/chatglm.cpp.o
[ 53%] Building CXX object src/CMakeFiles/llama.dir/models/cogvlm.cpp.o
[ 54%] Building CXX object src/CMakeFiles/llama.dir/models/cohere2-iswa.cpp.o
[ 54%] Building CXX object src/CMakeFiles/llama.dir/models/command-r.cpp.o
[ 55%] Building CXX object src/CMakeFiles/llama.dir/models/dbrx.cpp.o
[ 55%] Building CXX object src/CMakeFiles/llama.dir/models/deci.cpp.o
[ 55%] Building CXX object src/CMakeFiles/llama.dir/models/deepseek2.cpp.o
[ 55%] Building CXX object src/CMakeFiles/llama.dir/models/deepseek.cpp.o
[ 55%] Building CXX object src/CMakeFiles/llama.dir/models/dots1.cpp.o
[ 55%] Building CXX object src/CMakeFiles/llama.dir/models/ernie4-5-moe.cpp.o
[ 56%] Building CXX object src/CMakeFiles/llama.dir/models/dream.cpp.o
[ 56%] Building CXX object src/CMakeFiles/llama.dir/models/ernie4-5.cpp.o
[ 56%] Building CXX object src/CMakeFiles/llama.dir/models/exaone.cpp.o
[ 57%] Building CXX object src/CMakeFiles/llama.dir/models/exaone4.cpp.o
[ 57%] Built target rpc-server
[ 57%] Building CXX object src/CMakeFiles/llama.dir/models/falcon-h1.cpp.o
[ 57%] Building CXX object src/CMakeFiles/llama.dir/models/falcon.cpp.o
[ 58%] Building CXX object src/CMakeFiles/llama.dir/models/gemma-embedding.cpp.o
[ 58%] Building CXX object src/CMakeFiles/llama.dir/models/gemma.cpp.o
[ 58%] Building CXX object src/CMakeFiles/llama.dir/models/gemma2-iswa.cpp.o
[ 58%] Building CXX object src/CMakeFiles/llama.dir/models/gemma3.cpp.o
[ 59%] Building CXX object src/CMakeFiles/llama.dir/models/gemma3n-iswa.cpp.o
[ 59%] Building CXX object src/CMakeFiles/llama.dir/models/glm4-moe.cpp.o
[ 59%] Building CXX object src/CMakeFiles/llama.dir/models/glm4.cpp.o
[ 59%] Building CXX object src/CMakeFiles/llama.dir/models/gpt2.cpp.o
[ 60%] Building CXX object src/CMakeFiles/llama.dir/models/gptneox.cpp.o
[ 60%] Building CXX object src/CMakeFiles/llama.dir/models/granite-hybrid.cpp.o
[ 60%] Building CXX object src/CMakeFiles/llama.dir/models/granite.cpp.o
[ 61%] Building CXX object src/CMakeFiles/llama.dir/models/grok.cpp.o
[ 61%] Building CXX object src/CMakeFiles/llama.dir/models/hunyuan-dense.cpp.o
[ 61%] Building CXX object src/CMakeFiles/llama.dir/models/grovemoe.cpp.o
[ 61%] Building CXX object src/CMakeFiles/llama.dir/models/hunyuan-moe.cpp.o
[ 62%] Building CXX object src/CMakeFiles/llama.dir/models/internlm2.cpp.o
[ 62%] Building CXX object src/CMakeFiles/llama.dir/models/jais.cpp.o
[ 63%] Building CXX object src/CMakeFiles/llama.dir/models/lfm2.cpp.o
[ 63%] Building CXX object src/CMakeFiles/llama.dir/models/jamba.cpp.o
[ 63%] Building CXX object src/CMakeFiles/llama.dir/models/llada-moe.cpp.o
[ 63%] Building CXX object src/CMakeFiles/llama.dir/models/llada.cpp.o
[ 63%] Building CXX object src/CMakeFiles/llama.dir/models/llama-iswa.cpp.o
[ 64%] Building CXX object src/CMakeFiles/llama.dir/models/llama.cpp.o
[ 64%] Building CXX object src/CMakeFiles/llama.dir/models/mamba.cpp.o
[ 64%] Building CXX object src/CMakeFiles/llama.dir/models/minicpm3.cpp.o
[ 64%] Building CXX object src/CMakeFiles/llama.dir/models/minimax-m2.cpp.o
[ 65%] Building CXX object src/CMakeFiles/llama.dir/models/modern-bert.cpp.o
[ 65%] Building CXX object src/CMakeFiles/llama.dir/models/mpt.cpp.o
[ 65%] Building CXX object src/CMakeFiles/llama.dir/models/nemotron-h.cpp.o
[ 66%] Building CXX object src/CMakeFiles/llama.dir/models/nemotron.cpp.o
[ 66%] Building CXX object src/CMakeFiles/llama.dir/models/neo-bert.cpp.o
[ 66%] Building CXX object src/CMakeFiles/llama.dir/models/olmo.cpp.o
[ 66%] Building CXX object src/CMakeFiles/llama.dir/models/olmo2.cpp.o
[ 67%] Building CXX object src/CMakeFiles/llama.dir/models/olmoe.cpp.o
[ 67%] Building CXX object src/CMakeFiles/llama.dir/models/openai-moe-iswa.cpp.o
[ 67%] Building CXX object src/CMakeFiles/llama.dir/models/openelm.cpp.o
[ 67%] Building CXX object src/CMakeFiles/llama.dir/models/orion.cpp.o
[ 68%] Building CXX object src/CMakeFiles/llama.dir/models/pangu-embedded.cpp.o
[ 68%] Building CXX object src/CMakeFiles/llama.dir/models/phi2.cpp.o
[ 68%] Building CXX object src/CMakeFiles/llama.dir/models/phi3.cpp.o
[ 69%] Building CXX object src/CMakeFiles/llama.dir/models/plamo.cpp.o
[ 69%] Building CXX object src/CMakeFiles/llama.dir/models/plamo2.cpp.o
[ 69%] Building CXX object src/CMakeFiles/llama.dir/models/plm.cpp.o
[ 70%] Building CXX object src/CMakeFiles/llama.dir/models/qwen.cpp.o
[ 70%] Building CXX object src/CMakeFiles/llama.dir/models/qwen2.cpp.o
[ 70%] Building CXX object src/CMakeFiles/llama.dir/models/qwen2vl.cpp.o
[ 70%] Building CXX object src/CMakeFiles/llama.dir/models/qwen2moe.cpp.o
[ 71%] Building CXX object src/CMakeFiles/llama.dir/models/qwen3.cpp.o
[ 71%] Building CXX object src/CMakeFiles/llama.dir/models/qwen3vl.cpp.o
[ 71%] Building CXX object src/CMakeFiles/llama.dir/models/qwen3vl-moe.cpp.o
[ 71%] Building CXX object src/CMakeFiles/llama.dir/models/qwen3moe.cpp.o
[ 72%] Building CXX object src/CMakeFiles/llama.dir/models/qwen3next.cpp.o
[ 72%] Building CXX object src/CMakeFiles/llama.dir/models/refact.cpp.o
[ 72%] Building CXX object src/CMakeFiles/llama.dir/models/rnd1.cpp.o
[ 72%] Building CXX object src/CMakeFiles/llama.dir/models/rwkv6-base.cpp.o
[ 73%] Building CXX object src/CMakeFiles/llama.dir/models/rwkv6.cpp.o
[ 73%] Building CXX object src/CMakeFiles/llama.dir/models/rwkv6qwen2.cpp.o
[ 73%] Building CXX object src/CMakeFiles/llama.dir/models/rwkv7-base.cpp.o
[ 74%] Building CXX object src/CMakeFiles/llama.dir/models/rwkv7.cpp.o
[ 74%] Building CXX object src/CMakeFiles/llama.dir/models/seed-oss.cpp.o
[ 74%] Building CXX object src/CMakeFiles/llama.dir/models/smallthinker.cpp.o
[ 74%] Building CXX object src/CMakeFiles/llama.dir/models/smollm3.cpp.o
[ 75%] Building CXX object src/CMakeFiles/llama.dir/models/stablelm.cpp.o
[ 75%] Building CXX object src/CMakeFiles/llama.dir/models/starcoder.cpp.o
[ 75%] Building CXX object src/CMakeFiles/llama.dir/models/starcoder2.cpp.o
/home/iyanello/Projects/ML/llama.cpp/src/llama-batch.cpp: In member function ‘bool llama_batch_allocr::init(const llama_batch&, const llama_vocab&, const llama_memory_i*, uint32_t, uint32_t, bool)’:
/home/iyanello/Projects/ML/llama.cpp/src/llama-batch.cpp:298:22: warning: variable ‘ok’ set but not used [-Wunused-but-set-variable]
  298 |                 bool ok = true;
      |                      ^~
/home/iyanello/Projects/ML/llama.cpp/src/llama-batch.cpp: In function ‘llama_batch llama_batch_get_one(llama_token*, int32_t)’:
/home/iyanello/Projects/ML/llama.cpp/src/llama-batch.cpp:872:5: warning: missing initializer for member ‘llama_batch::mtp_params’ [-Wmissing-field-initializers]
  872 |     };
      |     ^
[ 75%] Building CXX object src/CMakeFiles/llama.dir/models/t5-dec.cpp.o
[ 76%] Building CXX object src/CMakeFiles/llama.dir/models/t5-enc.cpp.o
[ 76%] Building CXX object src/CMakeFiles/llama.dir/models/wavtokenizer-dec.cpp.o
[ 76%] Building CXX object src/CMakeFiles/llama.dir/models/xverse.cpp.o
[ 77%] Building CXX object src/CMakeFiles/llama.dir/models/mistral3.cpp.o
[ 77%] Building CXX object src/CMakeFiles/llama.dir/models/graph-context-mamba.cpp.o
/home/iyanello/Projects/ML/llama.cpp/src/llama-context.cpp: In member function ‘std::unique_ptr<llama_memory_context_i> llama_context::mtp_memory_batch(const llama_batch&)’:
/home/iyanello/Projects/ML/llama.cpp/src/llama-context.cpp:1576:19: warning: unused variable ‘n_vocab’ [-Wunused-variable]
 1576 |     const int64_t n_vocab = vocab.n_tokens();
      |                   ^~~~~~~
/home/iyanello/Projects/ML/llama.cpp/src/llama-context.cpp: In member function ‘bool llama_context::prepare_mtp_graph_inputs(llm_graph_result*, const llama_ubatch&, const llama_mtp_params&)’:
/home/iyanello/Projects/ML/llama.cpp/src/llama-context.cpp:3220:22: warning: variable ‘op_type’ set but not used [-Wunused-but-set-variable]
 3220 |         const char * op_type;
      |                      ^~~~~~~
/home/iyanello/Projects/ML/llama.cpp/src/llama-context.cpp:3206:26: warning: unused parameter ‘ubatch’ [-Wunused-parameter]
 3206 |     const llama_ubatch & ubatch,
      |     ~~~~~~~~~~~~~~~~~~~~~^~~~~~
[ 77%] Linking CXX static library libllama.a
[ 77%] Built target llama
[ 77%] Building CXX object tools/mtmd/CMakeFiles/mtmd.dir/mtmd.cpp.o
[ 77%] Building CXX object tools/mtmd/CMakeFiles/mtmd.dir/mtmd-helper.cpp.o
[ 77%] Building CXX object common/CMakeFiles/common.dir/arg.cpp.o
[ 78%] Building CXX object common/CMakeFiles/common.dir/chat-parser-xml-toolcall.cpp.o
[ 78%] Building CXX object common/CMakeFiles/common.dir/chat-parser.cpp.o
[ 78%] Building CXX object common/CMakeFiles/common.dir/chat.cpp.o
[ 78%] Building CXX object common/CMakeFiles/common.dir/common.cpp.o
[ 78%] Building CXX object common/CMakeFiles/common.dir/chat-peg-parser.cpp.o
[ 79%] Building CXX object common/CMakeFiles/common.dir/console.cpp.o
[ 79%] Building CXX object common/CMakeFiles/common.dir/download.cpp.o
[ 80%] Building CXX object common/CMakeFiles/common.dir/json-schema-to-grammar.cpp.o
[ 80%] Building CXX object common/CMakeFiles/common.dir/llguidance.cpp.o
[ 80%] Building CXX object common/CMakeFiles/common.dir/log.cpp.o
[ 80%] Building CXX object common/CMakeFiles/common.dir/ngram-cache.cpp.o
[ 81%] Building CXX object common/CMakeFiles/common.dir/peg-parser.cpp.o
[ 81%] Building CXX object common/CMakeFiles/common.dir/preset.cpp.o
[ 81%] Building CXX object common/CMakeFiles/common.dir/sampling.cpp.o
[ 82%] Building CXX object common/CMakeFiles/common.dir/speculative.cpp.o
[ 82%] Building CXX object common/CMakeFiles/common.dir/regex-partial.cpp.o
/home/iyanello/Projects/ML/llama.cpp/common/sampling.cpp: In function ‘llama_token common_sampler_sample_speculative(common_sampler*, llama_context*, int)’:
/home/iyanello/Projects/ML/llama.cpp/common/sampling.cpp:681:38: warning: ‘int32_t llama_n_vocab(const llama_vocab*)’ is deprecated: use llama_vocab_n_tokens instead [-Wdeprecated-declarations]
  681 |     const int n_vocab = llama_n_vocab(llama_model_get_vocab(llama_get_model(ctx)));
      |                         ~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /home/iyanello/Projects/ML/llama.cpp/common/sampling.h:3,
                 from /home/iyanello/Projects/ML/llama.cpp/common/sampling.cpp:1:
/home/iyanello/Projects/ML/llama.cpp/src/../include/llama.h:521:34: note: declared here
  521 |     DEPRECATED(LLAMA_API int32_t llama_n_vocab    (const struct llama_vocab * vocab), "use llama_vocab_n_tokens instead");
      |                                  ^~~~~~~~~~~~~
/home/iyanello/Projects/ML/llama.cpp/src/../include/llama.h:29:36: note: in definition of macro ‘DEPRECATED’
   29 | #    define DEPRECATED(func, hint) func __attribute__((deprecated(hint)))
      |                                    ^~~~
/home/iyanello/Projects/ML/llama.cpp/common/sampling.cpp:678:18: warning: unused variable ‘params’ [-Wunused-variable]
  678 |     const auto & params = gsmpl->params;
      |                  ^~~~~~
/home/iyanello/Projects/ML/llama.cpp/tools/mtmd/mtmd-helper.cpp: In constructor ‘decode_embd_batch::decode_embd_batch(float*, int32_t, int, int)’:
/home/iyanello/Projects/ML/llama.cpp/tools/mtmd/mtmd-helper.cpp:144:9: warning: missing initializer for member ‘llama_batch::mtp_params’ [-Wmissing-field-initializers]
  144 |         };
      |         ^
/home/iyanello/Projects/ML/llama.cpp/tools/mtmd/mtmd-helper.cpp: In member function ‘llama_batch decode_embd_batch::get_view(int, int)’:
/home/iyanello/Projects/ML/llama.cpp/tools/mtmd/mtmd-helper.cpp:224:9: warning: missing initializer for member ‘llama_batch::mtp_params’ [-Wmissing-field-initializers]
  224 |         };
      |         ^
[ 83%] Linking CXX static library libmtmd.a
[ 87%] Built target mtmd
[ 87%] Linking CXX static library libcommon.a
[ 87%] Built target common
[ 87%] Building CXX object tools/gguf-split/CMakeFiles/llama-gguf-split.dir/gguf-split.cpp.o
[ 87%] Building CXX object tools/batched-bench/CMakeFiles/llama-batched-bench.dir/batched-bench.cpp.o
[ 88%] Building CXX object tools/perplexity/CMakeFiles/llama-perplexity.dir/perplexity.cpp.o
[ 88%] Building CXX object tools/completion/CMakeFiles/llama-completion.dir/completion.cpp.o
[ 88%] Building CXX object tools/llama-bench/CMakeFiles/llama-bench.dir/llama-bench.cpp.o
[ 88%] Building CXX object tools/imatrix/CMakeFiles/llama-imatrix.dir/imatrix.cpp.o
[ 88%] Building CXX object tools/tokenize/CMakeFiles/llama-tokenize.dir/tokenize.cpp.o
[ 88%] Building CXX object tools/export-lora/CMakeFiles/llama-export-lora.dir/export-lora.cpp.o
[ 88%] Building CXX object tools/tts/CMakeFiles/llama-tts.dir/tts.cpp.o
[ 88%] Building CXX object tools/quantize/CMakeFiles/llama-quantize.dir/quantize.cpp.o
[ 88%] Building CXX object tools/mtmd/CMakeFiles/llama-mtmd-cli.dir/mtmd-cli.cpp.o
[ 89%] Building CXX object tools/cvector-generator/CMakeFiles/llama-cvector-generator.dir/cvector-generator.cpp.o
[ 90%] Building CXX object tools/fit-params/CMakeFiles/llama-fit-params.dir/fit-params.cpp.o
[ 90%] Building CXX object tools/run/CMakeFiles/llama-run.dir/run.cpp.o
[ 90%] Building CXX object tools/server/CMakeFiles/server-context.dir/server-task.cpp.o
[ 91%] Building CXX object tools/server/CMakeFiles/server-context.dir/server-queue.cpp.o
[ 91%] Building CXX object tools/server/CMakeFiles/server-context.dir/server-common.cpp.o
[ 91%] Building CXX object tools/server/CMakeFiles/server-context.dir/server-context.cpp.o
[ 91%] Linking CXX executable ../../bin/llama-tokenize
[ 91%] Linking CXX executable ../../bin/llama-gguf-split
[ 91%] Linking CXX executable ../../bin/llama-fit-params
/home/iyanello/Projects/ML/llama.cpp/tools/batched-bench/batched-bench.cpp: In lambda function:
/home/iyanello/Projects/ML/llama.cpp/tools/batched-bench/batched-bench.cpp:88:13: warning: missing initializer for member ‘llama_batch::mtp_params’ [-Wmissing-field-initializers]
   88 |             };
      |             ^
[ 92%] Linking CXX executable ../../bin/llama-batched-bench
[ 92%] Built target llama-tokenize
[ 92%] Built target llama-gguf-split
[ 92%] Built target llama-fit-params
[ 92%] Linking CXX executable ../../bin/llama-export-lora
[ 92%] Built target llama-batched-bench
[ 92%] Linking CXX executable ../../bin/llama-quantize
[ 92%] Linking CXX executable ../../bin/llama-cvector-generator
[ 92%] Linking CXX executable ../../bin/llama-mtmd-cli
[ 92%] Linking CXX executable ../../bin/llama-completion
[ 92%] Built target llama-export-lora
[ 92%] Built target llama-quantize
[ 92%] Built target llama-cvector-generator
[ 92%] Built target llama-mtmd-cli
[ 92%] Built target llama-completion
/home/iyanello/Projects/ML/llama.cpp/tools/perplexity/perplexity.cpp: In function ‘bool decode_helper(llama_context*, llama_batch&, std::vector<float>&, int, int)’:
/home/iyanello/Projects/ML/llama.cpp/tools/perplexity/perplexity.cpp:674:9: warning: missing initializer for member ‘llama_batch::mtp_params’ [-Wmissing-field-initializers]
  674 |         };
      |         ^
[ 92%] Linking CXX executable ../../bin/llama-perplexity
[ 92%] Built target llama-perplexity
[ 93%] Linking CXX executable ../../bin/llama-run
[ 93%] Built target llama-run
[ 94%] Linking CXX executable ../../bin/llama-imatrix
[ 94%] Built target llama-imatrix
[ 94%] Linking CXX executable ../../bin/llama-bench
[ 95%] Linking CXX executable ../../bin/llama-tts
[ 95%] Built target llama-bench
[ 95%] Built target llama-tts
/home/iyanello/Projects/ML/llama.cpp/tools/server/server-context.cpp: In member function ‘void server_context_impl::update_slots()’:
/home/iyanello/Projects/ML/llama.cpp/tools/server/server-context.cpp:2568:13: warning: missing initializer for member ‘llama_batch::mtp_params’ [-Wmissing-field-initializers]
 2568 |             };
      |             ^
[ 96%] Linking CXX static library libserver-context.a
[ 96%] Built target server-context
[ 96%] Building CXX object tools/cli/CMakeFiles/llama-cli.dir/cli.cpp.o
[ 96%] Building CXX object tools/server/CMakeFiles/llama-server.dir/server-http.cpp.o
[ 97%] Building CXX object tools/server/CMakeFiles/llama-server.dir/server.cpp.o
[ 97%] Building CXX object tools/server/CMakeFiles/llama-server.dir/server-models.cpp.o
[ 98%] Building CXX object tools/server/CMakeFiles/llama-server.dir/server-task.cpp.o
[ 98%] Building CXX object tools/server/CMakeFiles/llama-server.dir/server-queue.cpp.o
[ 98%] Building CXX object tools/server/CMakeFiles/llama-server.dir/server-common.cpp.o
[ 98%] Building CXX object tools/server/CMakeFiles/llama-server.dir/server-context.cpp.o
/home/iyanello/Projects/ML/llama.cpp/tools/server/server-context.cpp: In member function ‘void server_context_impl::update_slots()’:
/home/iyanello/Projects/ML/llama.cpp/tools/server/server-context.cpp:2568:13: warning: missing initializer for member ‘llama_batch::mtp_params’ [-Wmissing-field-initializers]
 2568 |             };
      |             ^
[ 99%] Linking CXX executable ../../bin/llama-cli
[ 99%] Built target llama-cli
[100%] Linking CXX executable ../../bin/llama-server
[100%] Built target llama-server

Maybe it will help to polish the PR

@CHNtentes

Hi there. Could anyone provide a complete example command for llama-server?

@SamuelOliveirads

Maybe it will help to polish the PR

Thanks, I will take some time to look into these warnings.

Hi there. Could anyone provide a complete example command for llama-server?

@CHNtentes Sure! Please keep in mind that this PR is about implementing the MTP architecture and is not necessarily fully optimized yet. I have seen some users getting the same or even lower performance than the baseline.

You will need to use three arguments when loading the model:

  1. -mtp (or --multi-token-prediction): To activate the MTP layers.
  2. --draft (or --draft-n, --draft-max): Borrowed from the common speculative decoder. This controls how many tokens MTP will draft before sending them to the main model for validation. The standard value (if I recall correctly) is 16.
  3. --draft-p-min: Also borrowed. This is a confidence threshold for the drafted token. If the confidence is lower than this value, MTP stops drafting immediately. The standard value is 0.75.

Fine-tuning Advice
Many factors can impact performance, but here is some general advice:

  • GPU Offload: If you are using offloading, ensure the MTP layer is fully loaded onto your GPU. This is typically layer 92 for GLM-4.6 and layer 46 for GLM-4.5 Air.
  • Metrics: Use the "draft acceptance rate" to gauge efficiency. This metric appears in your log at the end of the interaction, alongside prompt and generation eval time.
  • Tuning Strategy: Start with a low draft-n (1, 2, or 3) and increase it until you see diminishing returns. Then, adjust draft-p-min to see if you can gain more speed. In my experience, using a draft-n between 2 and 3 with a p-min of 0.85 yields better results than simply setting draft-n to 10, where many tokens end up being discarded (wasting resources).

Results will also vary depending on your use case; tasks like coding (where you can use greedy decoding) will likely give better results than creative writing.
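Putting the three flags together, a full server invocation might look like the sketch below (the model path, `-ngl` value, and quant are placeholders for illustration, not recommendations):

```shell
# hypothetical example; adjust the model path and offload count for your setup
./build/bin/llama-server \
    -m ./models/GLM-4.5-Air-Q4_K_M.gguf \
    -ngl 99 \
    -fa on \
    -mtp \
    --draft 2 \
    --draft-p-min 0.85
```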

@CHNtentes

Thanks for your reply. I'll run some tests once the GLM 4.7 GGUFs are downloaded.
I noticed the parameters officially provided by Z.ai for sglang look very different:

  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \

Is this a totally different implementation?

@SamuelOliveirads

Is this a total different implementation?

The architecture is the same; it uses Eagle, as that is what GLM requires.

For comparison, it's something like:

  • -mtp → --speculative-algorithm EAGLE
  • --draft → --speculative-num-steps

We have --draft-p-min (which as far as I know they don't support), while they have --speculative-eagle-topk. We could support Top-K in a later PR focused on sampling.

Want to replicate the recommended params from Z.ai? Just use: -mtp --draft 3
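For a side-by-side view, the two stacks would be launched roughly like this (the sglang launcher line is an illustrative, untested sketch; only the speculative flags come from the thread above, and the model paths are placeholders):

```shell
# sglang with Z.ai's recommended EAGLE settings (illustrative invocation)
python -m sglang.launch_server --model-path zai-org/GLM-4.6 \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4

# rough llama.cpp equivalent with this PR
./build/bin/llama-server -m GLM-4.6.gguf -mtp --draft 3
```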

@CHNtentes

I tried this PR with https://huggingface.co/unsloth/GLM-4.7-GGUF/tree/main/UD-Q4_K_XL and following command as baseline:

export CUDA_VISIBLE_DEVICES="1,2,3"

./build/bin/llama-server \
	--model /mnt/home/ltg/GLM-4.7-GGUF/GLM-4.7-UD-Q4_K_XL-00001-of-00005.gguf \
	--alias "GLM-4.7-UD-Q4_K_XL" \
	-ngl 999 \
	--temp 0.6 \
	--top-p 0.95 \
	--min-p 0.0 \
	--top-k 40 \
	-c 163840 \
	-n 512 \
	--host 0.0.0.0 \
	--port 8001 \
	--jinja \
	-ub 2048 \
	-fit off \
	-fa on

baseline decode speed:

eval time =   14448.65 ms /   512 tokens (   28.22 ms per token,    35.44 tokens per second)

-mtp --draft 1:

eval time =   22181.16 ms /   512 tokens (   43.32 ms per token,    23.08 tokens per second)
draft acceptance rate = 0.74658 (  218 accepted /   292 generated)

-mtp --draft 2:

eval time =   21044.67 ms /   512 tokens (   41.10 ms per token,    24.33 tokens per second)
draft acceptance rate = 0.63951 (  259 accepted /   405 generated)

-mtp --draft 3:

eval time =   22309.56 ms /   512 tokens (   43.57 ms per token,    22.95 tokens per second)
draft acceptance rate = 0.48313 (  272 accepted /   563 generated)

-mtp --draft 2 --draft-p-min 0.85:

eval time =   21562.22 ms /   512 tokens (   42.11 ms per token,    23.75 tokens per second)
draft acceptance rate = 0.62750 (  251 accepted /   400 generated)

It seems the feature is working, but generation is indeed slower than without MTP. I'll wait for future optimizations.

@MikeLP

MikeLP commented Dec 30, 2025

After some tests I found it crashes for some models, like Xiaomi MiMo.

@AbdullahMPrograms

AbdullahMPrograms commented Dec 31, 2025

After some tests I found it crashes for some models, like Xiaomi MiMo.

Someone can correct me if I'm mistaken, but this is an implementation of MTP for GLM; MiMo will presumably require an extension/modification of this PR for its MTP implementation to work.

@MikeLP

MikeLP commented Dec 31, 2025

Someone can correct me if I'm mistaken, but this is an implementation of MTP for GLM; MiMo will presumably require an extension/modification of this PR for its MTP implementation to work.

You are right, but the application should not crash; it should ignore unsupported models.


@CISC CISC left a comment


Let's start small and hopefully get the ball rolling...

Comment on lines +3217 to +3223
add_opt(common_arg(
    {"-mtp", "--multi-token-prediction"},
    string_format("Activate multi-token-prediction (if supported) (default: %s)", params.mtp ? "true" : "false"),
    [](common_params & params) {
        params.mtp = true;
    }
));

Suggested change
-add_opt(common_arg(
-    {"-mtp", "--multi-token-prediction"},
-    string_format("Activate multi-token-prediction (if supported) (default: %s)", params.mtp ? "true" : "false"),
-    [](common_params & params) {
-        params.mtp = true;
-    }
-));
+add_opt(common_arg(
+    {"-mtp", "--multi-token-prediction"},
+    {"-no-mtp", "--no-multi-token-prediction"},
+    string_format("whether to use multi-token-prediction (if supported) (default: %s)", params.mtp ? "true" : "false"),
+    [](common_params & params, bool value) {
+        params.mtp = value;
+    }
+));

Comment on lines +1508 to +1553
-    const llama_memory_context_i * mctx,
-    llm_graph_type gtype) const {
+    const llama_memory_context_i * mctx,
+    llm_graph_type gtype,
+    const llama_mtp_params & mtp_params) const {

I've seen this a couple of times now; please don't change these lines, and align them properly. The idea is that you should quickly and easily be able to get an overview of the variable names thanks to the vertical alignment.


mutable int32_t n_reused = 0; // number of times the previous graph was reused
};
}; No newline at end of file

Beware of missing EOF newlines, this and others will fail CI.

Comment on lines +919 to +934
if (!res.empty()) {
    std::string idxs_str;
    for (const auto & vec : res.idxs) {
        if (!vec.empty()) {
            if (vec.size() > 8) {
                idxs_str += " [" + std::to_string(vec.front()) + "..." + std::to_string(vec.back()) + " (" + std::to_string(vec.size()) + " cells)]";
            } else {
                idxs_str += " [";
                for (size_t i = 0; i < vec.size(); ++i) {
                    idxs_str += std::to_string(vec[i]) + (i == vec.size() - 1 ? "" : ", ");
                }
                idxs_str += "]";
            }
        }
    }
}
}

Leftover debug, but no logging anymore?

include/llama.h Outdated
* @brief Removes KV cache metadata for a specified sequence and token range.
* This makes the physical cells logically available again without deleting the tensor data.
*/
LLAMA_API void llama_kv_cache_seq_rm(struct llama_context * ctx, llama_seq_id seq_id, llama_pos p0, llama_pos p1);

@ngxson ngxson Jan 5, 2026


I genuinely don't think this array of APIs will be accepted by other maintainers. IMO it breaks a lot of the patterns that we explicitly established in CONTRIBUTING.md (have you even read it before pushing this PR?)

  1. It doesn't make sense to make a breaking change to llama_batch by adding mtp_params to it. llama_set_causal_attn() and llama_set_embeddings() are already used for similar purposes.
  2. Why llama_kv_cache_seq_rm? What's wrong with llama_memory_seq_rm()?
  3. What is an sinfo? Nowhere in this file is it explained. It doesn't even have a public struct.
  4. llama_set_draft_input_hidden_state indicates that we have to manually copy embeddings from the main LLM to the MTP layers. This doesn't resolve the core issue brought up by Georgi's comment. Plus, it breaks the API naming convention of llama_<module>_<verb>.

Just a reminder: supporting MTP is NOT hard, but designing an API to support all MTP models is hard.

Unless this PR invests more thought and work into designing a "universal" API that supports most MTP models, I don't think we can consider merging it.

@SamuelOliveirads

@F1LM1, do you want to work on these fixes, or should I handle them?

@F1LM1
Author

F1LM1 commented Jan 6, 2026

@F1LM1, do you want to work on these fixes, or should I handle them?

Sure, I'll have time to look at this later this week


Labels

examples · hot (Something that is hot) · model (Model specific) · server
