fix(DSV3): parity between native `DeepseekV3MoE` and remote official implementation by casinca · Pull Request #45441 · huggingface/transformers

casinca · 2026-04-14T17:51:08Z

What does this PR do?

Please see fix #45440 for more details

Discussed with @vasqu

Also fixed via regen/inheritance:

exaone-moe
glm4-moe
glm4v-moe
glm4-moe-lite
glm_moe_dsa
nemotron_h
solar_open

Warning

Edit: The test passes for me: I get the expected strings back both with and without the fix.
This makes sense: the biases are zero-initialized and registered as a buffer, and the test model is untrained, so `router_logits_for_choice = router_logits + self.gate.e_score_correction_bias` is always the same, fix or not.
For other models, this test would only fail if the model were loaded from a pretrained checkpoint (non-zero biases).
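The reasoning above can be checked on a toy example. This is only a sketch, not the Transformers implementation: `route` is a hypothetical pure-Python helper for a single token, and it assumes sigmoid scoring, group scores computed as the sum of the two best biased scores per group, and a `fill_value` written into experts outside the selected groups. With all-zero biases every biased score is a sigmoid output and therefore positive, so filling with 0.0 or with -inf should select the same experts.

```python
import math

NEG_INF = float("-inf")


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def route(logits, bias, n_group, topk_group, top_k, fill_value):
    """Toy single-token, group-limited top-k routing (hypothetical helper)."""
    # Assumption: scores are sigmoid outputs; the bias is only added for selection.
    scores_for_choice = [sigmoid(x) + b for x, b in zip(logits, bias)]
    per_group = len(logits) // n_group
    # Assumption: a group's score is the sum of its two best biased scores.
    group_scores = [
        sum(sorted(scores_for_choice[g * per_group:(g + 1) * per_group], reverse=True)[:2])
        for g in range(n_group)
    ]
    kept = set(sorted(range(n_group), key=lambda g: group_scores[g], reverse=True)[:topk_group])
    # Experts outside the kept groups receive fill_value (0.0 or -inf).
    masked = [s if i // per_group in kept else fill_value for i, s in enumerate(scores_for_choice)]
    order = sorted(range(len(masked)), key=lambda i: masked[i], reverse=True)
    return sorted(order[:top_k])


logits = [2.0, 1.0, -1.0, -2.0, 0.5, 0.0, -0.5, -3.0]
zero_bias = [0.0] * len(logits)  # mimics an untrained model with a zero-initialized buffer

with_zero = route(logits, zero_bias, n_group=4, topk_group=2, top_k=3, fill_value=0.0)
with_neg_inf = route(logits, zero_bias, n_group=4, topk_group=2, top_k=3, fill_value=NEG_INF)
print(with_zero, with_neg_inf)  # the same experts are selected either way
```

Because the two fill values only differ once a biased score can be zero or negative, a test on a zero-bias checkpoint cannot distinguish them.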

transformers/tests/models/deepseek_v3/test_modeling_deepseek_v3.py

Lines 398 to 409 in 5b565a5

    
    @slow
    @require_torch_accelerator
    @pytest.mark.torch_compile_test
    def test_compile_static_cache(self):
        NUM_TOKENS_TO_GENERATE = 40
        # https://github.com/huggingface/transformers/pull/38562#issuecomment-2939209171
        # The reason why the output is gibberish is because the testing model bzantium/tiny-deepseek-v3 is not trained
        # one. Since original DeepSeek-V3 model is too big to debug and test, there was no testing with the original one.
        EXPECTED_TEXT_COMPLETION = [
            "Simply put, the theory of relativity states that  Frojekecdytesాలు sicʰtinaccianntuala breej的效率和质量的控制lavestock-PraccuraciesOTTensorialoghismos的思路astiomotivityosexualriad TherapeuticsoldtYPEface Kishsatellite-TV",
            "My favorite all time favorite condiment is ketchup.ieden沟渠係室温 Fryrok般地Segmentation Cycle/physicalwarenkrautempsాలు蹈梗 Mesomac一等asan lethality suspended Causewaydreamswith Fossilsdorfాలు蹈 ChristiansenHOMEbrew",
        ]

So I reran the DSV3 forward pass to get the new hardcoded strings and retested on my side, but maybe this needs to be run and tested on your side too?

    EXPECTED_TEXT_COMPLETION = [
        "Simply put, the theory of relativity states that aportersh455elike injection tactics-altitude蹲在那儿 >Loregefruitakosdeckingredientsuchtroni李世umontיםplicitlyShadowoldtriad Therapeutics不减-ste 的希望和价值 >kerretteylesheetzimnasium的品质 Talm",
        "My favorite all time favorite condiment is ketchup. Lan overhead excite-ment好用>cileriaceaeagnainesogaslipadicSiggleESHalseawarriorsrattieri佐iented >Parrheta-counterousseanatysisoglCTSinkeheilbronnenlaceslide tactauralick",
    ]

There may also be a cascading effect on the inheriting models and their own `test_compile_static_cache` tests?

I greedily flagged 3 other models that use `masked_fill(~score_mask.bool(), 0.0)`, but I'm not touching them blindly for now:
I first need to verify their logic (whether they use loss-free load balancing) and what their remote implementations do. If the remote is
wrong, I'm not sure whether we should change the native implementation or keep it wrong to match the remote.

This is likely for a follow-up PR if these need to be fixed too.

Code Agent Policy

The Transformers repo is currently being overwhelmed by a large number of PRs and issue comments written by
code agents. We are currently bottlenecked by our ability to review and respond to them. As a result,
we ask that new users do not submit pure code agent PRs at this time.
You may use code agents in drafting or to help you diagnose issues. We'd also ask autonomous "OpenClaw"-like agents
not to open any PRs or issues for the moment.

PRs that appear to be fully agent-written will probably be closed without review, and we may block users who do this
repeatedly or maliciously.

This is a rapidly-evolving situation that's causing significant shockwaves in the open-source community. As a result,
this policy is likely to be updated regularly in the near future. For more information, please read CONTRIBUTING.md.

I confirm that this is not a pure code agent PR.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline,
Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link
to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the
documentation guidelines, and
here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

github-actions · 2026-04-14T17:52:17Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: deepseek_v3, exaone_moe, glm4_moe, glm4_moe_lite, glm4v_moe, glm_moe_dsa, nemotron_h, solar_open

vasqu

I am inclined to fix this for deepseek v3 itself as it is indeed added within its remote code; however, even if it is a bug, we need to be careful for other models as that may now be their expected behavior. core cc @ArthurZucker @Cyrilvallez

Maybe we should ping the other model maintainers but would like to hear the opinion of at least one other core maintainer on this

vasqu · 2026-04-14T20:09:33Z

Thanks a lot for the PR, very nice find! Wanna make sure we go right and don't rush it as it affects quite a few models

casinca · 2026-04-14T21:51:13Z

Concerning the 3 other models in the repo that explicitly use `masked_fill(~score_mask.bool(), 0.0)`:

There's Mistral4 (doesn't use loss-free load balancing) and DSV2 (can't be affected, since loss-free load balancing was introduced in DSV3), but dots1 seems to use it.

Probably another PR for this one? It inherits from DeepSeekV3 with a comment saying `# main diff with deepseekv3`, but I don't see any diff with DSV3 (left)? It could have been a direct `pass` like the others, I guess.

ArthurZucker · 2026-04-20T13:18:40Z

Having a look!

ArthurZucker

For dots, indeed if there is no diff feel free to inherit!

This is minor:

the scores for choice are only used to compute the index, not the weights
the router_logits_for_choice can probably be 0 or negative (with bias), I thought it would be super rare? what you are trying to avoid is that "natural" 0 could be part of the topk vs the masking? makes sense

IMO does not warrant a patch, but you tell me if this is super crucial!

HuggingFaceDocBuilderDev · 2026-04-22T04:46:29Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

casinca · 2026-04-22T09:39:14Z

> For dots, indeed if there is no diff feel free to inherit!
>
> This is minor:
> 1. the scores for choice are only used to compute the index, not the weights
> 2. the `router_logits_for_choice` can probably be 0 or negative (with bias), I thought it would be super rare? what you are trying to avoid is that "natural" 0 could be part of the topk vs the masking? makes sense
>
> IMO does not warrant a patch, but you tell me if this is super crucial!

yes, weights use unbiased scores.
Following your question, I just checked the distribution of the biases and, funnily enough, the minimum bias across all layers is +2.02. So for DSV3 specifically it's not even rare but impossible xD: 0 acts the same as -inf, so there is no behavior change. But that might not be the case for all other models.
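The "impossible" part follows from a one-line bound. This is a sketch under the assumption that the unbiased scores come from a sigmoid (as in the DeepSeek-V3 gate); the symbols $z_i$, $s_i$ and $b_i$ are notation introduced here for the raw logit, the unbiased score and the correction bias of expert $i$, and 2.02 is the minimum bias quoted above.

```latex
s_i = \sigma(z_i) \in (0, 1), \qquad b_i \ge 2.02
\quad\Longrightarrow\quad
s_i + b_i > 2.02 > 0 \quad \text{for every expert } i .
```

Under that assumption every unmasked expert has a strictly positive score for choice, so it outranks a masked 0.0 exactly as it would outrank a masked $-\infty$.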

It's not just about ties (natural 0 vs. masked 0): it was more about preventing a valid, unmasked expert with a negative score from being outranked by a masked one (whose masked score is 0.0).
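A small, self-contained illustration of that failure mode follows. It is a hedged sketch rather than the library code: `top_k_after_mask` is a hypothetical helper, the scores and the group mask are made-up values, and Python's stable sort stands in for `torch.topk` (whose tie-breaking among equal values may differ).

```python
NEG_INF = float("-inf")


def top_k_after_mask(scores_for_choice, keep_mask, top_k, fill_value):
    """Fill masked-out experts with fill_value, then pick the top_k indices (hypothetical helper)."""
    masked = [s if keep else fill_value for s, keep in zip(scores_for_choice, keep_mask)]
    order = sorted(range(len(masked)), key=lambda i: masked[i], reverse=True)
    return sorted(order[:top_k])


# Made-up biased scores: experts 2 and 3 are valid (unmasked) but negative.
scores_for_choice = [0.9, 0.8, -0.2, -0.3, 0.5, 0.4, 0.3, 0.2]
# Only the first group (experts 0-3) survived group selection.
keep_mask = [True, True, True, True, False, False, False, False]

with_zero = top_k_after_mask(scores_for_choice, keep_mask, top_k=3, fill_value=0.0)
with_neg_inf = top_k_after_mask(scores_for_choice, keep_mask, top_k=3, fill_value=NEG_INF)

print(with_zero)     # includes a masked expert, because 0.0 > -0.2
print(with_neg_inf)  # only experts from the kept group
```

With a 0.0 fill, a masked expert takes the third slot ahead of the valid expert 2; with -inf, the selection stays inside the kept group.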

Imho, since some models inherit from DSV3 and DSV3.2 also uses -inf, I believe it was still a good thing to match 1:1.
Initially it was just to be rigorous for the MiMo-V2 impl and to match their remote modeling file (which was taken from the remote DSV3)

Will open a PR for dots1, thanks.

…3PreTrainedModel Changes made possible from correct masking huggingface#45441

* init commit with v14
* CI fixes: missing docstring
* CI fix: removed Mixtral unused arg
* CI fix: shard attn_sinks (TP)
* notes
* refactor: using copy of gpt-oss eager attn func for sink path #45141
* v15: polish
* fix: exact MiMo layer pattern when fallback
* v16: re-inheriting from `Gemma3RotaryEmbedding`
* docs: applied suggestion (Qwen3MoE as template)
* refactor: re-using attributes pt 1
* refactor: `qk_head_dim` to `head_dim`
* refactor: dropping `layernorm_epsilon` mapping, re-use `rms_norm_eps` (same val)
* refactor: `MiMoV2FlashForCausalLM` inherits from `DeepseekV3ForCausalLM`
* refactor(`super` pt1): rely on parent init / remove redundant overrides
* refactor: re-using attributes pt2
* refactor: super pt2
* refactor(MiMoV2FlashPreTrainedModel): removed some flags
* refactor(attn): replaced dual attn with eager optional sink
* cleaned tests
* nit style
* add conversion script pt1
* fix: broken config with hub `routed_scaling_factor` hparam
* conversion script pt 2 (glm4 style)
* refactor: removed redundant init.normal_(module.weight ,...) handled by super
* regenerate mapping for the new dynamic auto mapping
* docs: removed comment (already explained in PR comment)
* cleaning tests
* refactor(position_embeddings): keep optim with gemm3 style notation
* style
* dropped model and tokenizer from convert script
* docs: comment fix
  Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
* docs: added comment
  Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
* docs : add suggestion comment
  Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
* applied suggestion: removed ep_plan
  Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
* suggestion: removed no op `self.standardize_rope_params()`
  Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
* refactor: config inherits from glm4moe
* refactor: unconditional SWA mask creation
  Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
* fix: missing decorator
  Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
* refactor: sync rot_emb optim with gemmas
* refactor 2: unconditional SWA mask creation
* refactor: `compute_default_rope_parameters` to match other model patterns
* refactor: attn inherits from `Qwen2Attention`
* style: switching MLP inheritance for cleaner import
* suggestion: switching Experts inheritance to `DeepseekV3NaiveMoe`
  Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
* cleaning imports after Experts suggestion
* refactor: MoE rework to inherit DSV3 logic + pretrainedModel from DSV3PreTrainedModel Changes made possible from correct masking #45441
* removed `test_convert_config_from_hub_format`
* suggestion(test): removed dead flag from copy paste
  Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
* refactor: dropping `test_router_group_mask_uses_negative_infinity` since DSV3 is patched
* test: cleaned config
* test: removing gpt-oss copy paste test
* test: fix training kwarg
  Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
* tests: removed additional tests
* refactor: renamed eager func to `eager_attention_forward_with_optional_sink`
* fix: Values rescaling in attn (MiMo specific)
* fix: attn sinks allow grad for training
* fix suggestion
* fix: dropped FA2 and flex attn
* test: added mimo integration test
* suggestion: apply batched suggestions
  Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
* suggestion: additional suggestions applied
* suggestion: switch `MiMoV2FlashDecoderLayer` inheritance to `Glm4MoeLiteDecoderLayer`
* suggestion: always apply Values rescaling
* re-enable backends: FA2/3/4 and flex
* test: bumped v head dim =16, head_dim = 32 to mimic decoupling
* suggestion: test removed rope dict
* CI fix
* partial revert: see if cause of TP CI fail
* partial revert 2: see if cause of TP CI fail
* suggestion: update attn flags
  Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
* refactor: revert eager attn func name
* refactor: switched `MiMoV2FlashModel` inheritance to `LagunaModel`
* test: comment + cuda CC version
* make fix-repo
* adjust integration test for CI + small nits
* style
* fix repo
* docs: origin of attn values rescaling
* added mimo entry in TOKENIZER_MAPPING_NAMES
* fix CI: upd date
* fix CI: make fix-repo with updated main branch
* make fix repo
* fix new date format
* CI fix: adding back explicit ep_plan + matching DSV3 refactor update
* CI fix: matching new DSV3 gate signature
* move / copy to internal testing

---------

Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
Co-authored-by: vasqu <antonprogamer@gmail.com>

casinca added 2 commits April 14, 2026 17:37

fix(DeepseekV3MoE): correct expert masking with negative bias

7361e20

Merge branch 'main' into dsv3-fix

5a50d3f

vasqu reviewed Apr 14, 2026

View reviewed changes

casinca mentioned this pull request Apr 16, 2026

Add Xiaomi MiMo-V2 #45144

Merged

6 tasks

ArthurZucker approved these changes Apr 22, 2026

View reviewed changes

ArthurZucker enabled auto-merge April 22, 2026 04:36

ArthurZucker added this pull request to the merge queue Apr 22, 2026

Merged via the queue into huggingface:main with commit cb0addd Apr 22, 2026
21 checks passed

casinca deleted the dsv3-fix branch April 22, 2026 09:39

casinca mentioned this pull request Apr 22, 2026

refactor(Dots1): drop Dots1MoE override to pass (inherits from DSV3 MoE) #45572

Merged

6 tasks

casinca added a commit to casinca/transformers that referenced this pull request Apr 24, 2026

refactor: MoE rework to inherit DSV3 logic + pretrainedModel from DSV…

d60e067

…3PreTrainedModel Changes made possible from correct masking huggingface#45441

evalstate mentioned this pull request Apr 28, 2026

Cumulative defect fixes from recent Transformers PRs evalstate/transformers#41

Open

ArthurZucker merged 2 commits into huggingface:main from casinca:dsv3-fix
