🚨 [`Attn`] Remove all old mask APIs from modeling by vasqu · Pull Request #43924 · huggingface/transformers

vasqu · 2026-02-11T17:35:45Z

As per the title, this removes the last remnants of old mask API usage, so everything now relies on the new mask API (apart from a few exceptions that use 3D masks, such as idefics).

It also adds a deprecation cycle so the old API can be removed from modeling utils.

vasqu · 2026-02-11T17:36:08Z

run-slow: big_bird,blip_2,bridgetower,clap,flava,ibert,instructblip,instructblipvideo,tapas,vilt

github-actions · 2026-02-11T17:37:26Z

This comment contains run-slow, running the specified jobs:

models: ["models/big_bird", "models/blip_2", "models/bridgetower", "models/clap", "models/flava", "models/ibert", "models/instructblip", "models/instructblipvideo", "models/tapas", "models/vilt"]
quantizations: []

HuggingFaceDocBuilderDev · 2026-02-11T17:45:40Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

github-actions · 2026-02-11T18:33:02Z

CI Results

Workflow Run ⚙️

Commit Info

Context	Commit	Description
RUN	07369579	merge commit
PR	d295b5c6	branch commit
main	ae05b2ae	base commit

Model CI Report

❌ 31 new failed tests from this PR 😭

blip_2:
tests/models/blip_2/test_modeling_blip_2.py::Blip2ForConditionalGenerationDecoderOnlyTest::test_eager_matches_sdpa_inference_00_fp16_pad_left_sdpa_kernels
tests/models/blip_2/test_modeling_blip_2.py::Blip2ForConditionalGenerationDecoderOnlyTest::test_eager_matches_sdpa_inference_01_fp16_pad_left
tests/models/blip_2/test_modeling_blip_2.py::Blip2ForConditionalGenerationDecoderOnlyTest::test_eager_matches_sdpa_inference_02_fp16_pad_left_no_attn_mask_sdpa_kernels
tests/models/blip_2/test_modeling_blip_2.py::Blip2ForConditionalGenerationDecoderOnlyTest::test_eager_matches_sdpa_inference_03_fp16_pad_left_no_attn_mask
tests/models/blip_2/test_modeling_blip_2.py::Blip2ForConditionalGenerationDecoderOnlyTest::test_eager_matches_sdpa_inference_04_fp16_pad_right_sdpa_kernels
tests/models/blip_2/test_modeling_blip_2.py::Blip2ForConditionalGenerationDecoderOnlyTest::test_eager_matches_sdpa_inference_05_fp16_pad_right
tests/models/blip_2/test_modeling_blip_2.py::Blip2ForConditionalGenerationDecoderOnlyTest::test_eager_matches_sdpa_inference_06_fp16_pad_right_no_attn_mask_sdpa_kernels
tests/models/blip_2/test_modeling_blip_2.py::Blip2ForConditionalGenerationDecoderOnlyTest::test_eager_matches_sdpa_inference_07_fp16_pad_right_no_attn_mask
tests/models/blip_2/test_modeling_blip_2.py::Blip2VisionModelWithProjectionTest::test_eager_matches_sdpa_inference_00_fp16_pad_left_sdpa_kernels
tests/models/blip_2/test_modeling_blip_2.py::Blip2VisionModelWithProjectionTest::test_eager_matches_sdpa_inference_01_fp16_pad_left
tests/models/blip_2/test_modeling_blip_2.py::Blip2VisionModelWithProjectionTest::test_eager_matches_sdpa_inference_02_fp16_pad_left_no_attn_mask_sdpa_kernels
tests/models/blip_2/test_modeling_blip_2.py::Blip2VisionModelWithProjectionTest::test_eager_matches_sdpa_inference_03_fp16_pad_left_no_attn_mask
tests/models/blip_2/test_modeling_blip_2.py::Blip2VisionModelWithProjectionTest::test_eager_matches_sdpa_inference_04_fp16_pad_right_sdpa_kernels
tests/models/blip_2/test_modeling_blip_2.py::Blip2VisionModelWithProjectionTest::test_eager_matches_sdpa_inference_05_fp16_pad_right
tests/models/blip_2/test_modeling_blip_2.py::Blip2VisionModelWithProjectionTest::test_eager_matches_sdpa_inference_06_fp16_pad_right_no_attn_mask_sdpa_kernels
tests/models/blip_2/test_modeling_blip_2.py::Blip2VisionModelWithProjectionTest::test_eager_matches_sdpa_inference_07_fp16_pad_right_no_attn_mask
tests/models/blip_2/test_modeling_blip_2.py::Blip2TextRetrievalModelTest::test_eager_matches_sdpa_inference_00_fp16_pad_left_sdpa_kernels
tests/models/blip_2/test_modeling_blip_2.py::Blip2TextRetrievalModelTest::test_eager_matches_sdpa_inference_01_fp16_pad_left
tests/models/blip_2/test_modeling_blip_2.py::Blip2TextRetrievalModelTest::test_eager_matches_sdpa_inference_02_fp16_pad_left_no_attn_mask_sdpa_kernels
tests/models/blip_2/test_modeling_blip_2.py::Blip2TextRetrievalModelTest::test_eager_matches_sdpa_inference_03_fp16_pad_left_no_attn_mask
tests/models/blip_2/test_modeling_blip_2.py::Blip2TextRetrievalModelTest::test_eager_matches_sdpa_inference_04_fp16_pad_right_sdpa_kernels
tests/models/blip_2/test_modeling_blip_2.py::Blip2TextRetrievalModelTest::test_eager_matches_sdpa_inference_05_fp16_pad_right
tests/models/blip_2/test_modeling_blip_2.py::Blip2TextRetrievalModelTest::test_eager_matches_sdpa_inference_06_fp16_pad_right_no_attn_mask_sdpa_kernels
tests/models/blip_2/test_modeling_blip_2.py::Blip2TextRetrievalModelTest::test_eager_matches_sdpa_inference_07_fp16_pad_right_no_attn_mask
tests/models/blip_2/test_modeling_blip_2.py::Blip2ModelIntegrationTest::test_inference_interpolate_pos_encoding
tests/models/blip_2/test_modeling_blip_2.py::Blip2ModelIntegrationTest::test_inference_itm_fp16
tests/models/blip_2/test_modeling_blip_2.py::Blip2ModelIntegrationTest::test_inference_opt
tests/models/blip_2/test_modeling_blip_2.py::Blip2ModelIntegrationTest::test_inference_opt_batched_beam_search
tests/models/blip_2/test_modeling_blip_2.py::Blip2ModelIntegrationTest::test_inference_vision_with_projection_fp16
instructblip:
tests/models/instructblip/test_modeling_instructblip.py::InstructBlipForConditionalGenerationDecoderOnlyTest::test_torch_export
instructblipvideo:
tests/models/instructblipvideo/test_modeling_instructblipvideo.py::InstructBlipVideoForConditionalGenerationDecoderOnlyTest::test_torch_export

vasqu · 2026-02-11T21:23:20Z

run-slow: align,altclip,big_bird,blip,blip_2,bridgetower,bros,canine,chinese_clip,clap,convbert,flava,ibert,idefics,imagegpt,instructblip,instructblipvideo,layoutlmv3,lightglue,lilt,longformer,longt5,luke,megatron_bert,mpnet,mra,nystromformer,perceiver,pix2struct,pop2piano,rembert,roformer,splinter,squeezebert,superglue,switch_transformers,tapas,tvp,udop,umt5,vilt,visual_bert

github-actions · 2026-02-11T21:24:42Z

This comment contains run-slow, running the specified jobs:

models: ["models/align", "models/altclip", "models/big_bird", "models/blip", "models/blip_2", "models/bridgetower", "models/bros", "models/canine", "models/chinese_clip", "models/clap", "models/convbert", "models/flava", "models/ibert", "models/idefics", "models/imagegpt", "models/instructblip", "models/instructblipvideo", "models/layoutlmv3", "models/lightglue", "models/lilt", "models/longformer", "models/longt5", "models/luke", "models/megatron_bert", "models/mpnet", "models/mra", "models/nystromformer", "models/perceiver", "models/pix2struct", "models/pop2piano", "models/rembert", "models/roformer", "models/splinter", "models/squeezebert", "models/superglue", "models/switch_transformers", "models/tapas", "models/tvp", "models/udop", "models/umt5", "models/vilt", "models/visual_bert"]
quantizations: []

github-actions · 2026-02-11T22:34:32Z

CI Results

Workflow Run ⚙️

Commit Info

Context	Commit	Description
RUN	03bccdd1	merge commit
PR	f79c3a01	branch commit
main	ae05b2ae	base commit

Model CI Report

❌ 39 new failed tests from this PR 😭

blip_2:
tests/models/blip_2/test_modeling_blip_2.py::Blip2ForConditionalGenerationDecoderOnlyTest::test_eager_matches_sdpa_inference_00_fp16_pad_left_sdpa_kernels
tests/models/blip_2/test_modeling_blip_2.py::Blip2ForConditionalGenerationDecoderOnlyTest::test_eager_matches_sdpa_inference_01_fp16_pad_left
tests/models/blip_2/test_modeling_blip_2.py::Blip2ForConditionalGenerationDecoderOnlyTest::test_eager_matches_sdpa_inference_02_fp16_pad_left_no_attn_mask_sdpa_kernels
tests/models/blip_2/test_modeling_blip_2.py::Blip2ForConditionalGenerationDecoderOnlyTest::test_eager_matches_sdpa_inference_03_fp16_pad_left_no_attn_mask
tests/models/blip_2/test_modeling_blip_2.py::Blip2ForConditionalGenerationDecoderOnlyTest::test_eager_matches_sdpa_inference_04_fp16_pad_right_sdpa_kernels
tests/models/blip_2/test_modeling_blip_2.py::Blip2ForConditionalGenerationDecoderOnlyTest::test_eager_matches_sdpa_inference_05_fp16_pad_right
tests/models/blip_2/test_modeling_blip_2.py::Blip2ForConditionalGenerationDecoderOnlyTest::test_eager_matches_sdpa_inference_06_fp16_pad_right_no_attn_mask_sdpa_kernels
tests/models/blip_2/test_modeling_blip_2.py::Blip2ForConditionalGenerationDecoderOnlyTest::test_eager_matches_sdpa_inference_07_fp16_pad_right_no_attn_mask
tests/models/blip_2/test_modeling_blip_2.py::Blip2VisionModelWithProjectionTest::test_eager_matches_sdpa_inference_00_fp16_pad_left_sdpa_kernels
tests/models/blip_2/test_modeling_blip_2.py::Blip2VisionModelWithProjectionTest::test_eager_matches_sdpa_inference_01_fp16_pad_left
tests/models/blip_2/test_modeling_blip_2.py::Blip2VisionModelWithProjectionTest::test_eager_matches_sdpa_inference_02_fp16_pad_left_no_attn_mask_sdpa_kernels
tests/models/blip_2/test_modeling_blip_2.py::Blip2VisionModelWithProjectionTest::test_eager_matches_sdpa_inference_03_fp16_pad_left_no_attn_mask
tests/models/blip_2/test_modeling_blip_2.py::Blip2VisionModelWithProjectionTest::test_eager_matches_sdpa_inference_04_fp16_pad_right_sdpa_kernels
tests/models/blip_2/test_modeling_blip_2.py::Blip2VisionModelWithProjectionTest::test_eager_matches_sdpa_inference_05_fp16_pad_right
tests/models/blip_2/test_modeling_blip_2.py::Blip2VisionModelWithProjectionTest::test_eager_matches_sdpa_inference_06_fp16_pad_right_no_attn_mask_sdpa_kernels
tests/models/blip_2/test_modeling_blip_2.py::Blip2VisionModelWithProjectionTest::test_eager_matches_sdpa_inference_07_fp16_pad_right_no_attn_mask
tests/models/blip_2/test_modeling_blip_2.py::Blip2TextRetrievalModelTest::test_eager_matches_sdpa_inference_00_fp16_pad_left_sdpa_kernels
tests/models/blip_2/test_modeling_blip_2.py::Blip2TextRetrievalModelTest::test_eager_matches_sdpa_inference_01_fp16_pad_left
tests/models/blip_2/test_modeling_blip_2.py::Blip2TextRetrievalModelTest::test_eager_matches_sdpa_inference_02_fp16_pad_left_no_attn_mask_sdpa_kernels
tests/models/blip_2/test_modeling_blip_2.py::Blip2TextRetrievalModelTest::test_eager_matches_sdpa_inference_03_fp16_pad_left_no_attn_mask
tests/models/blip_2/test_modeling_blip_2.py::Blip2TextRetrievalModelTest::test_eager_matches_sdpa_inference_04_fp16_pad_right_sdpa_kernels
tests/models/blip_2/test_modeling_blip_2.py::Blip2TextRetrievalModelTest::test_eager_matches_sdpa_inference_05_fp16_pad_right
tests/models/blip_2/test_modeling_blip_2.py::Blip2TextRetrievalModelTest::test_eager_matches_sdpa_inference_06_fp16_pad_right_no_attn_mask_sdpa_kernels
tests/models/blip_2/test_modeling_blip_2.py::Blip2TextRetrievalModelTest::test_eager_matches_sdpa_inference_07_fp16_pad_right_no_attn_mask
tests/models/blip_2/test_modeling_blip_2.py::Blip2ModelIntegrationTest::test_inference_interpolate_pos_encoding
tests/models/blip_2/test_modeling_blip_2.py::Blip2ModelIntegrationTest::test_inference_itm_fp16
tests/models/blip_2/test_modeling_blip_2.py::Blip2ModelIntegrationTest::test_inference_opt
tests/models/blip_2/test_modeling_blip_2.py::Blip2ModelIntegrationTest::test_inference_opt_batched_beam_search
tests/models/blip_2/test_modeling_blip_2.py::Blip2ModelIntegrationTest::test_inference_vision_with_projection_fp16
idefics:
tests/models/idefics/test_modeling_idefics.py::IdeficsModelIntegrationTest::test_inference_natural_language_visual_reasoning
lightglue:
tests/models/lightglue/test_modeling_lightglue.py::LightGlueModelTest::test_eager_matches_sdpa_inference_08_fp32_pad_left_sdpa_kernels
tests/models/lightglue/test_modeling_lightglue.py::LightGlueModelTest::test_eager_matches_sdpa_inference_09_fp32_pad_left
tests/models/lightglue/test_modeling_lightglue.py::LightGlueModelTest::test_eager_matches_sdpa_inference_10_fp32_pad_left_no_attn_mask_sdpa_kernels
tests/models/lightglue/test_modeling_lightglue.py::LightGlueModelTest::test_eager_matches_sdpa_inference_11_fp32_pad_left_no_attn_mask
tests/models/lightglue/test_modeling_lightglue.py::LightGlueModelTest::test_eager_matches_sdpa_inference_12_fp32_pad_right_sdpa_kernels
tests/models/lightglue/test_modeling_lightglue.py::LightGlueModelTest::test_eager_matches_sdpa_inference_13_fp32_pad_right
tests/models/lightglue/test_modeling_lightglue.py::LightGlueModelTest::test_eager_matches_sdpa_inference_14_fp32_pad_right_no_attn_mask_sdpa_kernels
tests/models/lightglue/test_modeling_lightglue.py::LightGlueModelTest::test_eager_matches_sdpa_inference_15_fp32_pad_right_no_attn_mask
tests/models/lightglue/test_modeling_lightglue.py::LightGlueModelTest::test_eager_matches_sdpa_inference_24_fp32_pad_left_output_attentions

vasqu · 2026-05-08T16:42:00Z

run-slow: align,altclip,big_bird,blip,blip_2,bridgetower,bros,canine,chinese_clip,clap,convbert,flava,ibert,idefics,imagegpt,instructblip,instructblipvideo,layoutlmv3,lightglue,lilt,longformer,longt5,luke,megatron_bert,mpnet,mra,nystromformer,perceiver,pix2struct,pop2piano,rembert,roformer,splinter,squeezebert,superglue,switch_transformers,tapas,tvp,udop,umt5,vilt,visual_bert

github-actions · 2026-05-08T16:43:35Z

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/align", "models/altclip", "models/big_bird", "models/blip", "models/blip_2", "models/bridgetower", "models/bros", "models/canine", "models/chinese_clip", "models/clap", "models/convbert", "models/flava", "models/ibert", "models/idefics", "models/imagegpt", "models/instructblip", "models/instructblipvideo", "models/layoutlmv3", "models/lightglue", "models/lilt", "models/longformer", "models/longt5", "models/luke", "models/megatron_bert", "models/mpnet", "models/mra", "models/nystromformer", "models/perceiver", "models/pix2struct", "models/pop2piano", "models/rembert", "models/roformer", "models/splinter", "models/squeezebert", "models/superglue", "models/switch_transformers", "models/tapas", "models/tvp", "models/udop", "models/umt5", "models/vilt", "models/visual_bert"]
quantizations: []

vasqu · 2026-05-08T17:15:34Z

+
+        extend_text_masks = create_bidirectional_mask(
+            config=self.config,
+            input_embeds=text_embeds[:, 0:1, :],  # weird case where the mask always wants q_len == 1


happens a few times, I guess they just like to broadcast internally instead

If it's only for broadcasting, IMO we should drop this and use full length as always - resulting mask should not change

This is tied to the old API

transformers/src/transformers/modeling_utils.py

Line 998 in b75feb2

extended_attention_mask = attention_mask[:, None, None, :]

It always broadcast along q no matter what, and it enabled bad practices imo, because the downstream embedding with the correct shape is probably created at a later time. I don't think it's worth rewriting the model just for that 😅
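To make the broadcasting point concrete, here is a minimal sketch in plain Python (nested lists stand in for tensors, and all helper names are hypothetical, not part of transformers). It assumes a padding-only bidirectional mask, where every query position sees the same key mask; under that assumption a mask built with q_len == 1 and then broadcast matches the full-length mask.

```python
def padding_mask_full(attention_mask, q_len):
    """Build a [batch, 1, q_len, kv_len] padding mask with one row per query position."""
    return [[[list(row) for _ in range(q_len)]] for row in attention_mask]


def padding_mask_q1(attention_mask):
    """Build a [batch, 1, 1, kv_len] mask, analogous to attention_mask[:, None, None, :]."""
    return [[[list(row)]] for row in attention_mask]


def broadcast_over_q(mask_q1, q_len):
    """Repeat the single query row q_len times, as tensor broadcasting would."""
    return [[[list(head[0]) for _ in range(q_len)] for head in sample] for sample in mask_q1]


# Two sequences, padded on the right (1 = attend, 0 = padding).
attention_mask = [[1, 1, 1, 0], [1, 1, 0, 0]]
q_len = 4

full = padding_mask_full(attention_mask, q_len)
broadcasted = broadcast_over_q(padding_mask_q1(attention_mask), q_len)
same = full == broadcasted
```

This only covers the padding case; a mask with per-query structure (for example a causal mask) could not be reduced to a single query row this way.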

vasqu · 2026-05-08T17:17:52Z

+        # Cuts off back to 2D
+        extended_attention_mask = extended_attention_mask[:, 0, 0, :]
+        # Bug in the old mask API converted global masks (==2) to max dtype
+        extended_attention_mask[attention_mask == 2] = torch.finfo(embedding_output.dtype).max


buggy implementation, followed old behavior here

Humm, I don't see where the behavior was bugged here before based on the diffs?

You can only see that it is bugged if you actually look into the mask values (either 0, 1, or 2) and what `get_extended_attention_mask` does

transformers/src/transformers/modeling_utils.py

Line 1010 in b75feb2

extended_attention_mask = (1.0 - extended_attention_mask) * torch.finfo(dtype).min

Based on that line, if you have a value of 2, you get (1.0 - 2) * min dtype, i.e. -1 * -(max dtype) == max dtype

Not sure why we would have 2s inside the mask, but trusting you!
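The arithmetic in this thread can be checked with a small sketch. It uses plain Python floats (float64) as a stand-in for a torch dtype, which is an assumption made so the snippet runs without torch; the function name is hypothetical and only mirrors the quoted `(1.0 - extended_attention_mask) * torch.finfo(dtype).min` line.

```python
import sys

# Plain Python floats stand in for a torch dtype here (assumption for illustration).
DTYPE_MAX = sys.float_info.max
DTYPE_MIN = -DTYPE_MAX  # plays the role of torch.finfo(dtype).min


def old_style_additive_mask(mask_value: int) -> float:
    """Mirror the quoted line: (1.0 - mask) * finfo(dtype).min."""
    return (1.0 - mask_value) * DTYPE_MIN


masked = old_style_additive_mask(0)      # padding: most negative value
attended = old_style_additive_mask(1)    # regular attention: zero, i.e. no bias
global_tok = old_style_additive_mask(2)  # value 2: flips sign to the most positive value
```

So a mask entry of 2 does not end up as "attend" (zero) but as the largest representable value, which is the old behavior the diff reproduces explicitly.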

github-actions · 2026-05-08T17:24:59Z

CI Results

Workflow Run ⚙️

Commit Info

Context	Commit	Description
RUN	eb177ca5	workflow commit (merge commit)
PR	2dc91f68	branch commit (from PR)
main	381032b7	base commit (on `main`)

✅ No failing test specific to this PR 🎉 👏 !

Cyrilvallez

God's work 🙏 Thanks a lot for that, will simplify our lives so much in the future 🤗
I answered directly on most of your comments, but recommenting here to be sure:

The `_create_attention_masks` functions with only 2 if/else branches: IMO we should remove them and put the logic directly in the forward to ease readability
For bridgetower (and the others you mention) where we broadcast the q_len, we should be able to use the usual logic without slicing and then broadcasting, no? That would be easier
Not sure about longformer -> was it really a bug before? I don't see any issue with 2s in the mask from the diffs

Cyrilvallez · 2026-05-11T03:34:41Z

+        logger.warning_once(
+            "Detected the usage of `get_extended_attention_mask`: This function is deprecated and will be removed in v5.12.0. "
+            "Please use the new API in `transformers.masking_utils`"
+        )


Can't wait to remove them 🙏
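For readers unfamiliar with the warn-once deprecation pattern shown in the diff above, a rough stdlib-only sketch follows. Everything in it is hypothetical: the logger name, the `warn_once` helper, and `legacy_get_extended_attention_mask` are illustrative stand-ins, not the transformers implementation.

```python
import functools
import logging

logger = logging.getLogger("mask_deprecation_sketch")  # hypothetical logger name


class _ListHandler(logging.Handler):
    """Collect emitted messages so the example can inspect them."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


handler = _ListHandler()
logger.addHandler(handler)
logger.propagate = False  # keep the example's output out of the root logger


@functools.lru_cache(maxsize=None)
def warn_once(message: str) -> None:
    """Log each distinct message a single time (cached on the message text)."""
    logger.warning(message)


def legacy_get_extended_attention_mask(attention_mask):
    """Hypothetical deprecated helper: warn, then keep the old q_len == 1 layout."""
    warn_once("`legacy_get_extended_attention_mask` is deprecated; use the new mask API instead.")
    return [[[list(row)]] for row in attention_mask]


# Repeated calls still work, but the warning is recorded only once.
for _ in range(3):
    legacy_get_extended_attention_mask([[1, 1, 0]])
```

Caching on the message text keeps old call sites working during the deprecation cycle without flooding the logs.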

github-actions · 2026-05-11T12:50:53Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: align, big_bird, blip, blip_2, bridgetower, bros, canine, clap, convbert, flava, git, ibert, idefics, imagegpt, instructblip, instructblipvideo

github-actions · 2026-05-11T13:21:07Z

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=43924&sha=a0b6c8

vasqu · 2026-05-11T13:24:13Z

@Cyrilvallez I think I answered everything:

Done, moved it inside and removed the extra fns
🚨 [Attn] Remove all old mask APIs from modeling #43924 (comment) - this is tied to the fact that the actual embedding with the correct shapes is created at a later time, imo not worth rewriting the model for that
🚨 [Attn] Remove all old mask APIs from modeling #43924 (comment) - that is an actual bug, tried to explain it in that comment but it's not visible through the diff alone

Cyrilvallez · 2026-05-12T00:48:05Z

Alright, thanks for the added explanations! All good to me!

* first round
* round 2
* round 3
* style
* post merge fixes
* fix blip 2
* fix lightglue
* fix idefics --> uses 3D mask, hence manual creation and no mask API
* add deprecation message
* move out of separate fns
* new deprecation cycle for the naming

first round

d295b5c

vasqu and others added 4 commits February 11, 2026 21:13

round 2

9b8397c

round 3

3106e51

Merge branch 'main' into fix-masks-p2

b0738b3

style

f79c3a0

vasqu mentioned this pull request Feb 24, 2026

[Model] Add PP-DocLayoutV2 Model Support #43018

Merged

vasqu mentioned this pull request Apr 9, 2026

Fix UnboundLocalError in invert_attention_mask by adding proper shape… #45247

Closed

6 tasks

This was referenced Apr 29, 2026

Cumulative feature and defect updates from recent Transformers PRs evalstate/transformers#42

Open

Cumulative defect fixes from recent Transformers PRs evalstate/transformers#43

Open

vasqu added 5 commits May 8, 2026 17:14

Merge branch 'main' into fix-masks-p2

cdbb95d

post merge fixes

8b38a4b

fix blip 2

bd5e38a

fix lightglue

ed00646

fix idefics --> uses 3D mask, hence manual creation and no mask API

2dc91f6

vasqu changed the title ~~[Attn] More old mask APIs~~ 🚨 [Attn] More old mask APIs May 8, 2026

vasqu changed the title ~~🚨 [Attn] More old mask APIs~~ 🚨 [Attn] Remove all old mask APIs from modeling May 8, 2026

add deprecation message

c441ae6

vasqu marked this pull request as ready for review May 8, 2026 16:58

vasqu requested review from ArthurZucker and Cyrilvallez May 8, 2026 16:59