fix(DSV3): parity between native DeepseekV3MoE and remote official implementation#45441
Conversation
|
[For maintainers] Suggested jobs to run (before merge) run-slow: deepseek_v3, exaone_moe, glm4_moe, glm4_moe_lite, glm4v_moe, glm_moe_dsa, nemotron_h, solar_open |
vasqu
left a comment
There was a problem hiding this comment.
I am inclined to fix this for deepseek v3 itself as it is indeed added within its remote code; however, even if it is a bug, we need to be careful for other models as that may now be their expected behavior. core cc @ArthurZucker @Cyrilvallez
Maybe we should ping the other model maintainers but would like to hear the opinion of at least one other core maintainer on this
|
Thanks a lot for the PR, very nice find! Wanna make sure we go right and don't rush it as it affects quite a few models |
|
Having a look! |
There was a problem hiding this comment.
For dots, indeed if there is no diff feel free to inherit!
This is minor:
- the scores for choice are only used to compute the index, not the weights
- the
router_logits_for_choicecan probably be 0 or negative (with bias), I thought it would be super rare? what you are trying to avoid is that "natural" 0 could be part of the topk vs the masking? makes sense
IMO does not warrant a patch, but you tell me if this is super crucial!
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
Imho, since some models inherit from DSV3 and also DSV3.2 does -inf, I believe it was still a good thing to match 1:1.
|
…3PreTrainedModel Changes made possible from correct masking huggingface#45441
* init commit with v14 * CI fixes: missing docstring * CI fix: removed Mixtral unused arg * CI fix: shard attn_sinks (TP) * notes * refactor: using copy of gpt-oss eager attn func for sink path #45141 * v15: polish * fix: exact MiMo layer pattern when fallback * v16: re-inheriting from `Gemma3RotaryEmbedding` * docs: applied suggestion (Qwen3MoE as template) * refactor: re-using attributes pt 1 * refactor: `qk_head_dim` to `head_dim` * refactor: dropping `layernorm_epsilon` mapping, re-use `rms_norm_eps` (same val) * refactor: `MiMoV2FlashForCausalLM` inherits from `DeepseekV3ForCausalLM` * refactor(`super` pt1): rely on parent init / remove redundant overrides * refactor: re-using attributes pt2 * refactor: super pt2 * refactor(MiMoV2FlashPreTrainedModel): removed some flags * refactor(attn): replaced dual attn with eager optional sink * cleaned tests * nit style * add conversion script pt1 * fix: broken config with hub `routed_scaling_factor` hparam * conversion script pt 2 (glm4 style) * refactor: removed redundant init.normal_(module.weight ,...) handled by super * regenerate mapping for the new dynamic auto mapping * docs: removed comment (already explained in PR comment) * cleaning tests * refactor(position_embeddings): keep optim with gemm3 style notation * style * dropped model and tokenizer from convert script * docs: comment fix Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com> * docs: added comment Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com> * docs : add suggestion comment Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com> * applied suggestion: removed ep_plan Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com> * suggestion: removed no op `self.standardize_rope_params()` Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com> * refactor: config inherits from glm4moe * refactor: unconditional SWA mask creation Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com> * fix: missing decorator Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com> * refactor: sync rot_emb optim with gemmas * refactor 2: unconditional SWA mask creation * refactor: `compute_default_rope_parameters` to match other model patterns * refactor: attn inherits from `Qwen2Attention` * style: switching MLP inheritance for cleaner import * suggestion: switching Experts inheritance to `DeepseekV3NaiveMoe` Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com> * cleaning imports after Experts suggestion * refactor: MoE rework to inherit DSV3 logic + pretrainedModel from DSV3PreTrainedModel Changes made possible from correct masking #45441 * removed `test_convert_config_from_hub_format` * suggestion(test): removed dead flag from copy paste Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com> * refactor: dropping `test_router_group_mask_uses_negative_infinity` since DSV3 is patched * test: cleaned config * test: removing gpt-oss copy paste test * test: fix training kwarg Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com> * tests: removed additional tests * refactor: renamed eager func to `eager_attention_forward_with_optional_sink` * fix: Values rescaling in attn (MiMo specific) * fix: attn sinks allow grad for training * fix suggestion * fix: dropped FA2 and flex attn * test: added mimo integration test * suggestion: apply batched suggestions Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com> * suggestion: additional suggestions applied * suggestion: switch `MiMoV2FlashDecoderLayer` inheritance to `Glm4MoeLiteDecoderLayer` * suggestion: always apply Values rescaling * re-enable backends: FA2/3/4 and flex * test: bumped v head dim =16, head_dim = 32 to mimic decoupling * suggestion: test removed rope dict * CI fix * partial revert: see if cause of TP CI fail * partial revert 2: see if cause of TP CI fail * suggestion: update attn flags Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com> * refactor: revert eager attn func name * refactor: switched `MiMoV2FlashModel` inheritance to `LagunaModel` * test: comment + cuda CC version * make fix-repo * adjust integration test for CI + small nits * style * fix repo * docs: origin of attn values rescaling * added mimo entry in TOKENIZER_MAPPING_NAMES * fix CI: upd date * fix CI: make fix-repo with updated main branch * make fix repo * fix new date format * CI fix: adding back explicit ep_plan + matching DSV3 refactor update * CI fix: matching new DSV3 gate signature * move / copy to internal testing --------- Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com> Co-authored-by: vasqu <antonprogamer@gmail.com>

What does this PR do?
Please see fix #45440 for more details
Discussed with @vasqu
Also fixed via regen/inheritance:
Warning
Edit: The test passes for me I’m getting back my string (with and without fix).
I realized it makes sense, biases are zero init, registered as a buffer but the model for the test is untrained, so
router_logits_for_choice = router_logits + self.gate.e_score_correction_biasis always the same, fix or not.This test, for other models, would only fail if the model is loaded from a pretrained checkpoint (non zero biases)
transformers/tests/models/deepseek_v3/test_modeling_deepseek_v3.py
Lines 398 to 409 in 5b565a5
So I reran the DSV3 forward pass to get the new hardcoded string for myself and retest but maybe this needs to be run and tested on your side too?
def test_compile_static_cache(self):test?I greedily flagged 3 other models with
masked_fill(~score_mask.bool(), 0.0)but I'm not blindly touching these for now,need to verify their logic first (if using loss free load balancing) + what their remote are doing. If the remote is
wrong, not sure we should change the native implementation or keep it wrong with the remote.
This is likely for a follow-up PR if these need to be fixed too.
Code Agent Policy
The Transformers repo is currently being overwhelmed by a large number of PRs and issue comments written by
code agents. We are currently bottlenecked by our ability to review and respond to them. As a result,
we ask that new users do not submit pure code agent PRs at this time.
You may use code agents in drafting or to help you diagnose issues. We'd also ask autonomous "OpenClaw"-like agents
not to open any PRs or issues for the moment.
PRs that appear to be fully agent-written will probably be closed without review, and we may block users who do this
repeatedly or maliciously.
This is a rapidly-evolving situation that's causing significant shockwaves in the open-source community. As a result,
this policy is likely to be updated regularly in the near future. For more information, please read
CONTRIBUTING.md.Before submitting
Pull Request section?
to it if that's the case.
documentation guidelines, and
here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.