🚨 [Attn] Remove all old mask APIs from modeling#43924

Merged
vasqu merged 14 commits into huggingface:main from vasqu:fix-masks-p2
May 12, 2026
Conversation

@vasqu

@vasqu vasqu commented Feb 11, 2026

Collaborator

As per the title, this removes the last remnants of old mask API usage, which means that everything now relies on the new mask API (apart from a few exceptions that use 3D masks, like idefics).

Also adding a deprecation cycle to remove it from modeling utils.

@vasqu

vasqu commented Feb 11, 2026

Collaborator Author

run-slow: big_bird,blip_2,bridgetower,clap,flava,ibert,instructblip,instructblipvideo,tapas,vilt

@github-actions

Contributor

This comment contains run-slow, running the specified jobs:

models: ["models/big_bird", "models/blip_2", "models/bridgetower", "models/clap", "models/flava", "models/ibert", "models/instructblip", "models/instructblipvideo", "models/tapas", "models/vilt"]
quantizations: []

@HuggingFaceDocBuilderDev


The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@github-actions

Contributor

CI Results

Workflow Run ⚙️

Commit Info

Context | Commit | Description
RUN | 07369579 | merge commit
PR | d295b5c6 | branch commit
main | ae05b2ae | base commit

Model CI Report

31 new failed tests from this PR 😭

  • blip_2:
    tests/models/blip_2/test_modeling_blip_2.py::Blip2ForConditionalGenerationDecoderOnlyTest::test_eager_matches_sdpa_inference_00_fp16_pad_left_sdpa_kernels
    tests/models/blip_2/test_modeling_blip_2.py::Blip2ForConditionalGenerationDecoderOnlyTest::test_eager_matches_sdpa_inference_01_fp16_pad_left
    tests/models/blip_2/test_modeling_blip_2.py::Blip2ForConditionalGenerationDecoderOnlyTest::test_eager_matches_sdpa_inference_02_fp16_pad_left_no_attn_mask_sdpa_kernels
    tests/models/blip_2/test_modeling_blip_2.py::Blip2ForConditionalGenerationDecoderOnlyTest::test_eager_matches_sdpa_inference_03_fp16_pad_left_no_attn_mask
    tests/models/blip_2/test_modeling_blip_2.py::Blip2ForConditionalGenerationDecoderOnlyTest::test_eager_matches_sdpa_inference_04_fp16_pad_right_sdpa_kernels
    tests/models/blip_2/test_modeling_blip_2.py::Blip2ForConditionalGenerationDecoderOnlyTest::test_eager_matches_sdpa_inference_05_fp16_pad_right
    tests/models/blip_2/test_modeling_blip_2.py::Blip2ForConditionalGenerationDecoderOnlyTest::test_eager_matches_sdpa_inference_06_fp16_pad_right_no_attn_mask_sdpa_kernels
    tests/models/blip_2/test_modeling_blip_2.py::Blip2ForConditionalGenerationDecoderOnlyTest::test_eager_matches_sdpa_inference_07_fp16_pad_right_no_attn_mask
    tests/models/blip_2/test_modeling_blip_2.py::Blip2VisionModelWithProjectionTest::test_eager_matches_sdpa_inference_00_fp16_pad_left_sdpa_kernels
    tests/models/blip_2/test_modeling_blip_2.py::Blip2VisionModelWithProjectionTest::test_eager_matches_sdpa_inference_01_fp16_pad_left
    tests/models/blip_2/test_modeling_blip_2.py::Blip2VisionModelWithProjectionTest::test_eager_matches_sdpa_inference_02_fp16_pad_left_no_attn_mask_sdpa_kernels
    tests/models/blip_2/test_modeling_blip_2.py::Blip2VisionModelWithProjectionTest::test_eager_matches_sdpa_inference_03_fp16_pad_left_no_attn_mask
    tests/models/blip_2/test_modeling_blip_2.py::Blip2VisionModelWithProjectionTest::test_eager_matches_sdpa_inference_04_fp16_pad_right_sdpa_kernels
    tests/models/blip_2/test_modeling_blip_2.py::Blip2VisionModelWithProjectionTest::test_eager_matches_sdpa_inference_05_fp16_pad_right
    tests/models/blip_2/test_modeling_blip_2.py::Blip2VisionModelWithProjectionTest::test_eager_matches_sdpa_inference_06_fp16_pad_right_no_attn_mask_sdpa_kernels
    tests/models/blip_2/test_modeling_blip_2.py::Blip2VisionModelWithProjectionTest::test_eager_matches_sdpa_inference_07_fp16_pad_right_no_attn_mask
    tests/models/blip_2/test_modeling_blip_2.py::Blip2TextRetrievalModelTest::test_eager_matches_sdpa_inference_00_fp16_pad_left_sdpa_kernels
    tests/models/blip_2/test_modeling_blip_2.py::Blip2TextRetrievalModelTest::test_eager_matches_sdpa_inference_01_fp16_pad_left
    tests/models/blip_2/test_modeling_blip_2.py::Blip2TextRetrievalModelTest::test_eager_matches_sdpa_inference_02_fp16_pad_left_no_attn_mask_sdpa_kernels
    tests/models/blip_2/test_modeling_blip_2.py::Blip2TextRetrievalModelTest::test_eager_matches_sdpa_inference_03_fp16_pad_left_no_attn_mask
    tests/models/blip_2/test_modeling_blip_2.py::Blip2TextRetrievalModelTest::test_eager_matches_sdpa_inference_04_fp16_pad_right_sdpa_kernels
    tests/models/blip_2/test_modeling_blip_2.py::Blip2TextRetrievalModelTest::test_eager_matches_sdpa_inference_05_fp16_pad_right
    tests/models/blip_2/test_modeling_blip_2.py::Blip2TextRetrievalModelTest::test_eager_matches_sdpa_inference_06_fp16_pad_right_no_attn_mask_sdpa_kernels
    tests/models/blip_2/test_modeling_blip_2.py::Blip2TextRetrievalModelTest::test_eager_matches_sdpa_inference_07_fp16_pad_right_no_attn_mask
    tests/models/blip_2/test_modeling_blip_2.py::Blip2ModelIntegrationTest::test_inference_interpolate_pos_encoding
    tests/models/blip_2/test_modeling_blip_2.py::Blip2ModelIntegrationTest::test_inference_itm_fp16
    tests/models/blip_2/test_modeling_blip_2.py::Blip2ModelIntegrationTest::test_inference_opt
    tests/models/blip_2/test_modeling_blip_2.py::Blip2ModelIntegrationTest::test_inference_opt_batched_beam_search
    tests/models/blip_2/test_modeling_blip_2.py::Blip2ModelIntegrationTest::test_inference_vision_with_projection_fp16

  • instructblip:
    tests/models/instructblip/test_modeling_instructblip.py::InstructBlipForConditionalGenerationDecoderOnlyTest::test_torch_export

  • instructblipvideo:
    tests/models/instructblipvideo/test_modeling_instructblipvideo.py::InstructBlipVideoForConditionalGenerationDecoderOnlyTest::test_torch_export

@vasqu

vasqu commented Feb 11, 2026

Collaborator Author

run-slow: align,altclip,big_bird,blip,blip_2,bridgetower,bros,canine,chinese_clip,clap,convbert,flava,ibert,idefics,imagegpt,instructblip,instructblipvideo,layoutlmv3,lightglue,lilt,longformer,longt5,luke,megatron_bert,mpnet,mra,nystromformer,perceiver,pix2struct,pop2piano,rembert,roformer,splinter,squeezebert,superglue,switch_transformers,tapas,tvp,udop,umt5,vilt,visual_bert

@github-actions

Contributor

This comment contains run-slow, running the specified jobs:

models: ["models/align", "models/altclip", "models/big_bird", "models/blip", "models/blip_2", "models/bridgetower", "models/bros", "models/canine", "models/chinese_clip", "models/clap", "models/convbert", "models/flava", "models/ibert", "models/idefics", "models/imagegpt", "models/instructblip", "models/instructblipvideo", "models/layoutlmv3", "models/lightglue", "models/lilt", "models/longformer", "models/longt5", "models/luke", "models/megatron_bert", "models/mpnet", "models/mra", "models/nystromformer", "models/perceiver", "models/pix2struct", "models/pop2piano", "models/rembert", "models/roformer", "models/splinter", "models/squeezebert", "models/superglue", "models/switch_transformers", "models/tapas", "models/tvp", "models/udop", "models/umt5", "models/vilt", "models/visual_bert"]
quantizations: []

@github-actions

Contributor

CI Results

Workflow Run ⚙️

Commit Info

Context | Commit | Description
RUN | 03bccdd1 | merge commit
PR | f79c3a01 | branch commit
main | ae05b2ae | base commit

Model CI Report

39 new failed tests from this PR 😭

  • blip_2:
    tests/models/blip_2/test_modeling_blip_2.py::Blip2ForConditionalGenerationDecoderOnlyTest::test_eager_matches_sdpa_inference_00_fp16_pad_left_sdpa_kernels
    tests/models/blip_2/test_modeling_blip_2.py::Blip2ForConditionalGenerationDecoderOnlyTest::test_eager_matches_sdpa_inference_01_fp16_pad_left
    tests/models/blip_2/test_modeling_blip_2.py::Blip2ForConditionalGenerationDecoderOnlyTest::test_eager_matches_sdpa_inference_02_fp16_pad_left_no_attn_mask_sdpa_kernels
    tests/models/blip_2/test_modeling_blip_2.py::Blip2ForConditionalGenerationDecoderOnlyTest::test_eager_matches_sdpa_inference_03_fp16_pad_left_no_attn_mask
    tests/models/blip_2/test_modeling_blip_2.py::Blip2ForConditionalGenerationDecoderOnlyTest::test_eager_matches_sdpa_inference_04_fp16_pad_right_sdpa_kernels
    tests/models/blip_2/test_modeling_blip_2.py::Blip2ForConditionalGenerationDecoderOnlyTest::test_eager_matches_sdpa_inference_05_fp16_pad_right
    tests/models/blip_2/test_modeling_blip_2.py::Blip2ForConditionalGenerationDecoderOnlyTest::test_eager_matches_sdpa_inference_06_fp16_pad_right_no_attn_mask_sdpa_kernels
    tests/models/blip_2/test_modeling_blip_2.py::Blip2ForConditionalGenerationDecoderOnlyTest::test_eager_matches_sdpa_inference_07_fp16_pad_right_no_attn_mask
    tests/models/blip_2/test_modeling_blip_2.py::Blip2VisionModelWithProjectionTest::test_eager_matches_sdpa_inference_00_fp16_pad_left_sdpa_kernels
    tests/models/blip_2/test_modeling_blip_2.py::Blip2VisionModelWithProjectionTest::test_eager_matches_sdpa_inference_01_fp16_pad_left
    tests/models/blip_2/test_modeling_blip_2.py::Blip2VisionModelWithProjectionTest::test_eager_matches_sdpa_inference_02_fp16_pad_left_no_attn_mask_sdpa_kernels
    tests/models/blip_2/test_modeling_blip_2.py::Blip2VisionModelWithProjectionTest::test_eager_matches_sdpa_inference_03_fp16_pad_left_no_attn_mask
    tests/models/blip_2/test_modeling_blip_2.py::Blip2VisionModelWithProjectionTest::test_eager_matches_sdpa_inference_04_fp16_pad_right_sdpa_kernels
    tests/models/blip_2/test_modeling_blip_2.py::Blip2VisionModelWithProjectionTest::test_eager_matches_sdpa_inference_05_fp16_pad_right
    tests/models/blip_2/test_modeling_blip_2.py::Blip2VisionModelWithProjectionTest::test_eager_matches_sdpa_inference_06_fp16_pad_right_no_attn_mask_sdpa_kernels
    tests/models/blip_2/test_modeling_blip_2.py::Blip2VisionModelWithProjectionTest::test_eager_matches_sdpa_inference_07_fp16_pad_right_no_attn_mask
    tests/models/blip_2/test_modeling_blip_2.py::Blip2TextRetrievalModelTest::test_eager_matches_sdpa_inference_00_fp16_pad_left_sdpa_kernels
    tests/models/blip_2/test_modeling_blip_2.py::Blip2TextRetrievalModelTest::test_eager_matches_sdpa_inference_01_fp16_pad_left
    tests/models/blip_2/test_modeling_blip_2.py::Blip2TextRetrievalModelTest::test_eager_matches_sdpa_inference_02_fp16_pad_left_no_attn_mask_sdpa_kernels
    tests/models/blip_2/test_modeling_blip_2.py::Blip2TextRetrievalModelTest::test_eager_matches_sdpa_inference_03_fp16_pad_left_no_attn_mask
    tests/models/blip_2/test_modeling_blip_2.py::Blip2TextRetrievalModelTest::test_eager_matches_sdpa_inference_04_fp16_pad_right_sdpa_kernels
    tests/models/blip_2/test_modeling_blip_2.py::Blip2TextRetrievalModelTest::test_eager_matches_sdpa_inference_05_fp16_pad_right
    tests/models/blip_2/test_modeling_blip_2.py::Blip2TextRetrievalModelTest::test_eager_matches_sdpa_inference_06_fp16_pad_right_no_attn_mask_sdpa_kernels
    tests/models/blip_2/test_modeling_blip_2.py::Blip2TextRetrievalModelTest::test_eager_matches_sdpa_inference_07_fp16_pad_right_no_attn_mask
    tests/models/blip_2/test_modeling_blip_2.py::Blip2ModelIntegrationTest::test_inference_interpolate_pos_encoding
    tests/models/blip_2/test_modeling_blip_2.py::Blip2ModelIntegrationTest::test_inference_itm_fp16
    tests/models/blip_2/test_modeling_blip_2.py::Blip2ModelIntegrationTest::test_inference_opt
    tests/models/blip_2/test_modeling_blip_2.py::Blip2ModelIntegrationTest::test_inference_opt_batched_beam_search
    tests/models/blip_2/test_modeling_blip_2.py::Blip2ModelIntegrationTest::test_inference_vision_with_projection_fp16

  • idefics:
    tests/models/idefics/test_modeling_idefics.py::IdeficsModelIntegrationTest::test_inference_natural_language_visual_reasoning

  • lightglue:
    tests/models/lightglue/test_modeling_lightglue.py::LightGlueModelTest::test_eager_matches_sdpa_inference_08_fp32_pad_left_sdpa_kernels
    tests/models/lightglue/test_modeling_lightglue.py::LightGlueModelTest::test_eager_matches_sdpa_inference_09_fp32_pad_left
    tests/models/lightglue/test_modeling_lightglue.py::LightGlueModelTest::test_eager_matches_sdpa_inference_10_fp32_pad_left_no_attn_mask_sdpa_kernels
    tests/models/lightglue/test_modeling_lightglue.py::LightGlueModelTest::test_eager_matches_sdpa_inference_11_fp32_pad_left_no_attn_mask
    tests/models/lightglue/test_modeling_lightglue.py::LightGlueModelTest::test_eager_matches_sdpa_inference_12_fp32_pad_right_sdpa_kernels
    tests/models/lightglue/test_modeling_lightglue.py::LightGlueModelTest::test_eager_matches_sdpa_inference_13_fp32_pad_right
    tests/models/lightglue/test_modeling_lightglue.py::LightGlueModelTest::test_eager_matches_sdpa_inference_14_fp32_pad_right_no_attn_mask_sdpa_kernels
    tests/models/lightglue/test_modeling_lightglue.py::LightGlueModelTest::test_eager_matches_sdpa_inference_15_fp32_pad_right_no_attn_mask
    tests/models/lightglue/test_modeling_lightglue.py::LightGlueModelTest::test_eager_matches_sdpa_inference_24_fp32_pad_left_output_attentions

@vasqu

vasqu commented May 8, 2026

Collaborator Author

run-slow: align,altclip,big_bird,blip,blip_2,bridgetower,bros,canine,chinese_clip,clap,convbert,flava,ibert,idefics,imagegpt,instructblip,instructblipvideo,layoutlmv3,lightglue,lilt,longformer,longt5,luke,megatron_bert,mpnet,mra,nystromformer,perceiver,pix2struct,pop2piano,rembert,roformer,splinter,squeezebert,superglue,switch_transformers,tapas,tvp,udop,umt5,vilt,visual_bert

@github-actions

github-actions Bot commented May 8, 2026

Contributor

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/align", "models/altclip", "models/big_bird", "models/blip", "models/blip_2", "models/bridgetower", "models/bros", "models/canine", "models/chinese_clip", "models/clap", "models/convbert", "models/flava", "models/ibert", "models/idefics", "models/imagegpt", "models/instructblip", "models/instructblipvideo", "models/layoutlmv3", "models/lightglue", "models/lilt", "models/longformer", "models/longt5", "models/luke", "models/megatron_bert", "models/mpnet", "models/mra", "models/nystromformer", "models/perceiver", "models/pix2struct", "models/pop2piano", "models/rembert", "models/roformer", "models/splinter", "models/squeezebert", "models/superglue", "models/switch_transformers", "models/tapas", "models/tvp", "models/udop", "models/umt5", "models/vilt", "models/visual_bert"]
quantizations: []

@vasqu vasqu changed the title from "[Attn] More old mask APIs" to "🚨 [Attn] More old mask APIs" May 8, 2026
@vasqu vasqu changed the title from "🚨 [Attn] More old mask APIs" to "🚨 [Attn] Remove all old mask APIs from modeling" May 8, 2026
@vasqu vasqu marked this pull request as ready for review May 8, 2026 16:58
@vasqu vasqu requested review from ArthurZucker and Cyrilvallez May 8, 2026 16:59
Comment thread src/transformers/models/blip/modeling_blip_text.py
Comment thread src/transformers/models/blip/modeling_blip_text.py Outdated

extend_text_masks = create_bidirectional_mask(
config=self.config,
input_embeds=text_embeds[:, 0:1, :], # weird case where the mask always wants q_len == 1

Collaborator Author

happens a few times, I guess they just like to broadcast internally instead

Member

If it's only for broadcasting, IMO we should drop this and use full length as always - resulting mask should not change

Collaborator Author

This is tied to the old API

extended_attention_mask = attention_mask[:, None, None, :]

It always broadcasted along q no matter what, and imo it enabled bad practices, because the downstream embedding with the correct shape is probably created at a later time. I don't think it's worth rewriting the model just for that 😅
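
To make the broadcasting point concrete, here is a minimal sketch of the old pattern quoted above. The tensor sizes, the zero-filled `scores`, and the names `additive` and `masked_scores` are illustrative assumptions, not code from the PR; only the indexing expression and the `(1.0 - mask) * finfo.min` conversion come from the discussion.

```python
import torch

# Toy 2D padding mask: 1 = attend, 0 = padding (batch=2, kv_len=3)
attention_mask = torch.tensor([[1, 1, 0], [1, 0, 0]])

# Old pattern: insert singleton head and query dims -> (batch, 1, 1, kv_len)
extended_attention_mask = attention_mask[:, None, None, :]

# Hypothetical attention scores: batch=2, heads=4, q_len=5, kv_len=3
scores = torch.zeros(2, 4, 5, 3)

# Convert to an additive mask, as the old helper did
additive = (1.0 - extended_attention_mask.to(scores.dtype)) * torch.finfo(scores.dtype).min

# The singleton dims broadcast over all heads and all 5 query positions,
# so the mask never needs to know q_len up front
masked_scores = scores + additive
```

Because the query dimension is a singleton, the same mask works for any `q_len`, which is why a model can build it before the final embedding shape is known.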

Comment thread src/transformers/models/git/modeling_git.py Outdated
Comment thread src/transformers/models/idefics/modeling_idefics.py
# Cuts off back to 2D
extended_attention_mask = extended_attention_mask[:, 0, 0, :]
# Bug in the old mask API converted global masks (==2) to max dtype
extended_attention_mask[attention_mask == 2] = torch.finfo(embedding_output.dtype).max

Collaborator Author

buggy implementation, followed old behavior here

Member

Humm, I don't see where the behavior was bugged here before based on the diffs?

@vasqu vasqu May 11, 2026

Collaborator Author

You can only see that it is bugged if you actually look into the mask values (either 0, 1, or 2) and what get_extended_attention_mask does

extended_attention_mask = (1.0 - extended_attention_mask) * torch.finfo(dtype).min

Based on that line if you have a value of 2, you get -1 * -(max dtype) == max dtype
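
The arithmetic can be checked with a small sketch. Plain Python floats stand in for tensor entries here, and `FLOAT32_MAX`, `FLOAT32_MIN` and `old_style_additive_mask` are hypothetical names introduced for illustration; the float32 limits are written out by hand as a stand-in for `torch.finfo(torch.float32)`.

```python
# Largest finite float32 value, written out by hand (assumed stand-in for finfo.max)
FLOAT32_MAX = (2.0 - 2.0**-23) * 2.0**127
# Most negative finite float32 value, i.e. what finfo(dtype).min returns
FLOAT32_MIN = -FLOAT32_MAX


def old_style_additive_mask(value: float) -> float:
    """Mirror `(1.0 - mask) * finfo(dtype).min` for a single mask entry."""
    return (1.0 - value) * FLOAT32_MIN


attend = old_style_additive_mask(1.0)        # 0 * min: nothing is added to the score
masked = old_style_additive_mask(0.0)        # 1 * min: the position is masked out
global_token = old_style_additive_mask(2.0)  # -1 * min: the sign flips to max
```

A mask value of 2 therefore ends up as the largest positive value rather than as 0 or the minimum, which matches the behavior the new code reproduces by hand.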

Member

Not sure why we would have 2s inside the mask, but trusting you!

Comment thread src/transformers/models/mra/modeling_mra.py
Comment thread tests/models/bridgetower/test_modeling_bridgetower.py
@github-actions

github-actions Bot commented May 8, 2026

Contributor

CI Results

Workflow Run ⚙️

Commit Info

Context | Commit | Description
RUN | eb177ca5 | workflow commit (merge commit)
PR | 2dc91f68 | branch commit (from PR)
main | 381032b7 | base commit (on main)

✅ No failing test specific to this PR 🎉 👏 !

@huggingface huggingface deleted a comment from github-actions Bot May 8, 2026

@Cyrilvallez Cyrilvallez left a comment

Member

God's work 🙏 Thanks a lot for that, will simplify our lives so much in the future 🤗
I answered directly on most of your comments, but recommenting here to be sure:

  • The `_create_attention_masks` functions contain only 2 if/else branches; IMO we should remove them and put that logic directly in the forward to ease readability
  • For bridgetower (and the others you mention) where we broadcast the q_len, we should be able to use the usual logic without slicing and then broadcasting, no? Would be easier
  • Not sure about longformer -> was it really a bug before? I don't see any issues about 2s in the mask from diffs

Comment on lines +992 to +995
logger.warning_once(
"Detected the usage of `get_extended_attention_mask`: This function is deprecated and will be removed in v5.12.0. "
"Please use the new API in `transformers.masking_utils`"
)

Member

Can't wait to remove them 🙏

Comment thread tests/models/bridgetower/test_modeling_bridgetower.py
@github-actions

Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: align, big_bird, blip, blip_2, bridgetower, bros, canine, clap, convbert, flava, git, ibert, idefics, imagegpt, instructblip, instructblipvideo

@github-actions

Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=43924&sha=a0b6c8

@vasqu

vasqu commented May 11, 2026

Collaborator Author

@Cyrilvallez I think I answered everything:

  1. Done, moved it inside and removed the extra fns
  2. 🚨 [Attn] Remove all old mask APIs from modeling #43924 (comment) - this is tied to the fact that the actual embedding with correct shapes is created at a later time, imo not worth to rewrite the model for that
  3. 🚨 [Attn] Remove all old mask APIs from modeling #43924 (comment) - that is an actual bug, tried to explain it in that comment but it's not visible through the diff alone

@Cyrilvallez

Member

Alright, thanks for the added explanations! All good to me!

@vasqu vasqu added this pull request to the merge queue May 12, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 12, 2026
@vasqu vasqu added this pull request to the merge queue May 12, 2026
Merged via the queue into huggingface:main with commit 5206626 May 12, 2026
29 checks passed
@vasqu vasqu deleted the fix-masks-p2 branch May 12, 2026 12:50
jp1924 pushed a commit to jp1924/transformers that referenced this pull request May 18, 2026
* first round

* round 2

* round 3

* style

* post merge fixes

* fix blip 2

* fix lightglue

* fix idefics --> uses 3D mask, hence manual creation and no mask API

* add deprecation message

* move out of separate fns

* new deprecation cycle for the naming
khushali9 pushed a commit to khushali9/transformers that referenced this pull request Jun 8, 2026
* first round

* round 2

* round 3

* style

* post merge fixes

* fix blip 2

* fix lightglue

* fix idefics --> uses 3D mask, hence manual creation and no mask API

* add deprecation message

* move out of separate fns

* new deprecation cycle for the naming