Extract dynamic vision/audio tensors into standalone pure functions by IlyasMoutawwakil · Pull Request #45396 · huggingface/transformers

IlyasMoutawwakil · 2026-04-13T08:46:10Z

Needed both Claude's and Copilot's help on this one 😅 The idea is to make the VLMs and their visual/audio encoders compilable/exportable. Here's a demo of the full model forward being compilable with these precomputed tensors.

"""Demo: precomputed vision tensors enable torch.compile.

Without precomputation the vision encoder uses loops / .tolist() / repeat_interleave
that break torch.compile fullgraph mode. By computing ``image_cu_seqlens`` and
``image_position_ids`` via ``transformers.vision_utils`` outside the traced region
and passing them into the model, the vision path becomes compile-friendly.

These precomputed tensors are a power-feature: they are not in any public ``forward()``
signature. They flow through ``**kwargs`` from ``model(...)`` down to the vision
encoder, where ``@handle_extra_kwargs(modality="image")`` on ``get_image_features``
strips the ``image_`` prefix so the encoder sees them as ``cu_seqlens``/``position_ids``
in its kwargs. The encoder then calls ``get_vision_cu_seqlens(grid_thw, kwargs=kwargs)``
(and the analogous ``get_vision_position_ids(...)``); each util pops its key from the
caller's kwargs and returns the precomputed tensor, falling back to the (uncompilable)
compute path only if absent.
"""

import torch

from transformers import AutoModelForImageTextToText, AutoProcessor, set_seed
from transformers.vision_utils import get_vision_cu_seqlens, get_vision_position_ids


set_seed(42, deterministic=True)

model_id = "Qwen/Qwen2-VL-2B-Instruct"

print("Loading model and processor...")
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForImageTextToText.from_pretrained(model_id, dtype=torch.float32, device_map=device).eval()
processor = AutoProcessor.from_pretrained(model_id)

# --- 1. Prepare a multimodal input ---
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

# --- 2. Processor output (baseline) ---
inputs = processor(text=text, images=messages[0]["content"][0]["url"], return_tensors="pt").to(model.device)
print(f"Processor keys: {sorted(inputs.keys())}")

# --- 3. Precompute vision tensors from grid_thw using vision_utils ---
# These replace the untraceable loops inside the vision encoder. The kwargs are
# `image_*`-prefixed at the model boundary; `@handle_extra_kwargs(modality="image")`
# on `get_image_features` strips the prefix before forwarding to the encoder.
inputs_extra = {**inputs}
spatial_merge_size = model.config.vision_config.spatial_merge_size
inputs_extra["image_cu_seqlens"] = get_vision_cu_seqlens(inputs["image_grid_thw"])
inputs_extra["image_position_ids"] = get_vision_position_ids(inputs["image_grid_thw"], spatial_merge_size)
print(f"image_cu_seqlens shape: {inputs_extra['image_cu_seqlens'].shape}")
print(f"image_position_ids shape: {inputs_extra['image_position_ids'].shape}")

# Precompute 3D text position_ids (M-RoPE) so the model skips get_rope_index at runtime
inputs_extra["position_ids"], _ = model.model.get_rope_index(
    inputs_extra["input_ids"],
    mm_token_type_ids=inputs_extra["mm_token_type_ids"],
    image_grid_thw=inputs_extra["image_grid_thw"],
    attention_mask=inputs_extra["attention_mask"],
)
print(f"position_ids shape: {inputs_extra['position_ids'].shape}")

# --- 4. Eager forward (reference, no precomputed tensors) ---
print("\n=== Eager forward ===")
with torch.no_grad():
    out_eager = model(**inputs)
print(f"Logits shape: {out_eager.logits.shape}")

# --- 5. Compile the full model with fullgraph=True ---
print("\n=== torch.compile(model, fullgraph=True) ===")
model = torch.compile(model, fullgraph=True)

# --- 6. Without precomputed tensors: full model forward fails in vision ---
print("\nWithout precomputed tensors:")
try:
    with torch.no_grad():
        out = model(**inputs)
    print("  Unexpectedly succeeded!")
except Exception as e:
    print(f"  FAILED as expected: {type(e).__name__}")

# --- 7. With precomputed tensors: full model forward succeeds ---
print("\nWith precomputed tensors:")
with torch.no_grad():
    out_compiled = model(**inputs_extra)
print(f"  SUCCESS! Logits shape: {out_compiled.logits.shape}")

# --- 8. Verify compiled output matches eager ---
print("\n=== Verification: compiled vs eager ===")
diff = (out_eager.logits - out_compiled.logits).abs().max().item()
print(f"Max abs diff: {diff:.2e} {'OK' if diff < 1e-2 else 'MISMATCH'}")
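The prefix handling described in the demo docstring can be sketched in plain Python. This is only an illustration of the idea, not the implementation of `handle_extra_kwargs`: the decorator name `strip_modality_prefix` is hypothetical, as is the choice to leave declared parameters such as `image_grid_thw` untouched.

```python
import functools
import inspect


def strip_modality_prefix(modality):
    """Hypothetical stand-in for a decorator like `handle_extra_kwargs(modality=...)`."""
    prefix = f"{modality}_"

    def decorator(fn):
        declared = inspect.signature(fn).parameters

        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            renamed = {}
            for key, value in kwargs.items():
                # Only rename extra kwargs; parameters that the function declares
                # explicitly (e.g. `image_grid_thw`) keep their name.
                if key not in declared and key.startswith(prefix):
                    key = key[len(prefix):]
                renamed[key] = value
            return fn(*args, **renamed)

        return wrapper

    return decorator


@strip_modality_prefix("image")
def get_image_features(pixel_values, image_grid_thw=None, **kwargs):
    # A real encoder would consume these; the sketch just returns what it received.
    return kwargs


extras = get_image_features(
    pixel_values=None,
    image_grid_thw="grid",
    image_cu_seqlens="cu",
    image_position_ids="pos",
)
```

Here `extras` holds the keys `cu_seqlens` and `position_ids`, which is the form the encoder-side utilities are described as popping from.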

Created top-level modeling_vision_utils.py with shared pure functions: get_vision_cu_seqlens, get_rotary_pos_ids, get_rotary_pos_ids_interleaved, get_window_index, get_pos_embed_indices
Moved audio precompute functions (chunk_and_pad_features, get_audio_cu_seqlens, get_valid_indices, get_pool_indices) into modular files directly
Simplified VisionRotaryEmbedding.forward to accept pos_ids tensor directly via broadcast multiply, eliminating data-dependent table creation
Made vision/audio encoder forwards accept optional precomputed tensors (cu_seqlens, rotary_pos_ids, window_index, embed_indices, etc.)
Used explicit naming: get_vision_cu_seqlens / get_audio_cu_seqlens

Models: qwen2_vl, qwen2_5_vl, qwen3_vl, qwen3_5, qwen3_vl_moe, qwen3_5_moe, qwen2_5_omni, qwen3_omni_moe, glm4v, glm4v_moe, glm_image, glm_ocr, ernie4_5_vl_moe, video_llama_3, mlcd, paddleocr_vl
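To illustrate the `VisionRotaryEmbedding.forward` item above, here is a minimal sketch comparing a table-lookup formulation with a broadcast multiply. The function names, `dim`, `theta` and the sample `pos_ids` are assumptions made for the example; the actual module code is not shown in this thread.

```python
import torch


def rotary_from_table(pos_ids: torch.Tensor, inv_freq: torch.Tensor) -> torch.Tensor:
    # Table-based variant: the table length depends on the largest position,
    # which is a data-dependent value (hence the `.item()` call).
    max_pos = int(pos_ids.max().item()) + 1
    seq = torch.arange(max_pos, dtype=inv_freq.dtype)
    table = torch.outer(seq, inv_freq)  # (max_pos, dim // 2)
    return table[pos_ids].flatten(1)  # (num_tokens, dim)


def rotary_from_broadcast(pos_ids: torch.Tensor, inv_freq: torch.Tensor) -> torch.Tensor:
    # Broadcast variant: multiply positions by the frequencies directly,
    # so no table has to be sized from the data.
    freqs = pos_ids.unsqueeze(-1).to(inv_freq.dtype) * inv_freq  # (num_tokens, 2, dim // 2)
    return freqs.flatten(1)  # (num_tokens, dim)


dim, theta = 8, 10000.0
inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
pos_ids = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1], [2, 3]])  # (row, col) per token

table_emb = rotary_from_table(pos_ids, inv_freq)
broadcast_emb = rotary_from_broadcast(pos_ids, inv_freq)
```

Both variants produce the same values; the second avoids the `.item()` call that sizes the table from the input.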

What does this PR do?

Fixes # (issue)

Code Agent Policy

The Transformers repo is currently being overwhelmed by a large number of PRs and issue comments written by
code agents. We are currently bottlenecked by our ability to review and respond to them. As a result,
we ask that new users do not submit pure code agent PRs at this time.
You may use code agents in drafting or to help you diagnose issues. We'd also ask autonomous "OpenClaw"-like agents
not to open any PRs or issues for the moment.

PRs that appear to be fully agent-written will probably be closed without review, and we may block users who do this
repeatedly or maliciously.

This is a rapidly-evolving situation that's causing significant shockwaves in the open-source community. As a result,
this policy is likely to be updated regularly in the near future. For more information, please read CONTRIBUTING.md.

I confirm that this is not a pure code agent PR.

HuggingFaceDocBuilderDev · 2026-04-13T09:07:49Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Copilot

Pull request overview

This PR refactors multimodal (vision/audio) models to share “pure” tensor-building utilities and to optionally accept precomputed tensors (e.g., cu_seqlens / rotary pos ids), reducing duplicated logic across many model implementations and processors.

Changes:

Added src/transformers/modeling_vision_utils.py with standalone helpers (e.g., get_vision_cu_seqlens, get_rotary_pos_ids, get_window_index, get_pos_embed_indices) and updated multiple models/processors to use them.
Updated multiple vision encoders to accept optional precomputed tensors (cu_seqlens, rotary_pos_ids, window_index, embed_indices, etc.) and simplified rotary embedding computation to take pos_ids directly.
Refactored audio precompute logic into modular model files and added processor support for returning extra precomputed tensors via return_extra_tensors.

Reviewed changes

Copilot reviewed 37 out of 37 changed files in this pull request and generated 5 comments.

File	Description
src/transformers/utils/auto_docstring.py	Adds new documented processor/model kwargs for precomputed vision tensors.
src/transformers/models/video_llama_3/processing_video_llama_3.py	Allows optionally returning precomputed vision tensors from the processor.
src/transformers/models/video_llama_3/modular_video_llama_3.py	Switches vision rotary/cu_seqlens generation to shared helpers and adds optional precomputed inputs.
src/transformers/models/video_llama_3/modeling_video_llama_3.py	Same as modular: uses shared helpers and updates rotary embedding forward API.
src/transformers/models/qwen3_vl/processing_qwen3_vl.py	Adds optional return of precomputed cu_seqlens/rotary pos ids (interleaved variant).
src/transformers/models/qwen3_vl/modular_qwen3_vl.py	Moves pos-embed/rotary/cu_seqlens computations to shared helpers; adds optional precomputed inputs.
src/transformers/models/qwen3_vl/modeling_qwen3_vl.py	Same refactor as modular file (generated modeling).
src/transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py	Same vision refactor for MoE variant.
src/transformers/models/qwen3_omni_moe/modular_qwen3_omni_moe.py	Moves audio chunking/cu_seqlens/valid index logic into pure helpers + forward accepts optional precomputes.
src/transformers/models/qwen3_5/modular_qwen3_5.py	Refactors vision pos/rotary/cu_seqlens computations; adds optional precomputed inputs.
src/transformers/models/qwen3_5/modeling_qwen3_5.py	Same vision refactor for generated modeling file.
src/transformers/models/qwen3_5_moe/modeling_qwen3_5_moe.py	Same vision refactor for MoE variant.
src/transformers/models/qwen2_vl/processing_qwen2_vl.py	Adds optional return of precomputed cu_seqlens/rotary pos ids from processor.
src/transformers/models/qwen2_vl/modeling_qwen2_vl.py	Uses shared `get_rotary_pos_ids` / `get_vision_cu_seqlens` and accepts optional precomputed tensors.
src/transformers/models/qwen2_5_vl/processing_qwen2_5_vl.py	Adds optional return of precomputed cu_seqlens/rotary pos ids from processor.
src/transformers/models/qwen2_5_vl/modular_qwen2_5_vl.py	Refactors rotary/cu_seqlens/window indexing via shared helpers; adds optional precomputed inputs.
src/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py	Same refactor for generated modeling file.
src/transformers/models/qwen2_5_omni/modular_qwen2_5_omni.py	Moves audio chunking/indices/cu_seqlens/pooling computations into pure helper functions and accepts optional precomputes.
src/transformers/models/paddleocr_vl/modular_paddleocr_vl.py	Updates PaddleOCR vision path to use shared rotary/cu_seqlens helpers and renames args (`grid_thw`).
src/transformers/models/paddleocr_vl/modeling_paddleocr_vl.py	Same as modular: shared helper usage and argument renames.
src/transformers/models/glm4v/processing_glm4v.py	Adds optional return of precomputed cu_seqlens/rotary pos ids from processor.
src/transformers/models/glm4v/modular_glm4v.py	Refactors vision rotary/cu_seqlens computations and video grid flattening logic.
src/transformers/models/glm4v/modeling_glm4v.py	Same as modular file, plus updates to rotary embedding forward API.
src/transformers/models/glm4v_moe/modeling_glm4v_moe.py	Same refactor for MoE variant.
src/transformers/models/glm46v/processing_glm46v.py	Adds optional return of precomputed cu_seqlens/rotary pos ids from processor.
src/transformers/models/glm46v/modeling_glm46v.py	Passes optional precomputed vision tensors through `get_*_features` and vision tower.
src/transformers/models/glm_ocr/modular_glm_ocr.py	Refactors vision rotary/cu_seqlens computation to shared helpers.
src/transformers/models/glm_ocr/modeling_glm_ocr.py	Same vision refactor for generated modeling file.
src/transformers/models/glm_image/modular_glm_image.py	Refactors rotary pos ids and cu_seqlens to shared helpers; adds optional precomputed inputs.
src/transformers/models/glm_image/modeling_glm_image.py	Same as modular file, plus updates to rotary embedding forward API.
src/transformers/models/esm/configuration_esm.py	Moves `rope_theta` doc section to align with parameter ordering/docs.
src/transformers/models/ernie4_5_vl_moe/modular_ernie4_5_vl_moe.py	Refactors vision rotary/cu_seqlens to shared helpers and accepts optional precomputes.
src/transformers/models/ernie4_5_vl_moe/modeling_ernie4_5_vl_moe.py	Same refactor for generated modeling file.
src/transformers/modeling_vision_utils.py	New shared pure functions for vision tensor precomputation.
docs/source/en/model_doc/nomic_bert.md	Updates NomicBERT paper link.

Copilot

Pull request overview

This PR refactors multimodal (vision/audio) models to allow passing precomputed, data-dependent tensors (e.g. cu_seqlens, rotary position IDs, window indices, position-embedding interpolation indices) and centralizes shared vision tensor construction into a new src/transformers/modeling_vision_utils.py.

Changes:

Add src/transformers/modeling_vision_utils.py with shared pure helper functions for vision precomputations (get_vision_cu_seqlens, rotary pos IDs, window indices, pos-embed interpolation indices).
Update many vision model/processor implementations to accept optional precomputed tensors and avoid rebuilding them inside forward.
Move/inline audio precompute helpers into relevant modular/model files and update docstrings/autodoc arg definitions accordingly.

Reviewed changes

Copilot reviewed 37 out of 37 changed files in this pull request and generated 8 comments.

File	Description
`src/transformers/utils/auto_docstring.py`	Adds autodoc entries for new optional precomputed image/video tensors.
`src/transformers/modeling_vision_utils.py`	New shared pure functions for vision tensor precomputation (cu_seqlens, rotary pos IDs, window indices, pos-embed indices/weights).
`src/transformers/models/video_llama_3/processing_video_llama_3.py`	Processor can optionally return extra precomputed vision tensors.
`src/transformers/models/video_llama_3/modular_video_llama_3.py`	Vision forward accepts optional precomputed tensors; uses shared vision utils.
`src/transformers/models/video_llama_3/modeling_video_llama_3.py`	Generated modeling updated to accept/use optional precomputed tensors.
`src/transformers/models/qwen3_vl/processing_qwen3_vl.py`	Processor can optionally return extra precomputed vision tensors (incl. interleaved rotary IDs).
`src/transformers/models/qwen3_vl/modular_qwen3_vl.py`	Vision path refactor to accept precomputed tensors; uses shared vision utils for pos-embed and rotary IDs.
`src/transformers/models/qwen3_vl/modeling_qwen3_vl.py`	Generated modeling updated similarly (precomputed tensors + shared utils).
`src/transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py`	Same precomputed-tensor refactor for MoE variant.
`src/transformers/models/qwen3_omni_moe/modular_qwen3_omni_moe.py`	Moves audio precompute helpers into the modular file; updates audio forward to accept precomputes.
`src/transformers/models/qwen3_5/modular_qwen3_5.py`	Vision refactor to accept optional precomputed tensors; uses shared vision utils.
`src/transformers/models/qwen3_5/modeling_qwen3_5.py`	Generated modeling updated similarly.
`src/transformers/models/qwen3_5_moe/modeling_qwen3_5_moe.py`	Same precomputed-tensor refactor for MoE variant.
`src/transformers/models/qwen2_vl/processing_qwen2_vl.py`	Processor can optionally return extra precomputed vision tensors.
`src/transformers/models/qwen2_vl/modeling_qwen2_vl.py`	Vision forward accepts optional precomputed tensors; uses shared vision utils.
`src/transformers/models/qwen2_5_vl/processing_qwen2_5_vl.py`	Processor can optionally return extra precomputed vision tensors.
`src/transformers/models/qwen2_5_vl/modular_qwen2_5_vl.py`	Vision forward refactor: optional precomputes + shared `get_window_index`/rotary IDs/cu_seqlens.
`src/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py`	Generated modeling updated similarly.
`src/transformers/models/qwen2_5_omni/modular_qwen2_5_omni.py`	Moves audio precompute helpers into the modular file; updates audio forward to accept precomputes.
`src/transformers/models/paddleocr_vl/modular_paddleocr_vl.py`	Vision encoder refactor to accept `grid_thw` + optional precomputed rotary IDs / cu_seqlens.
`src/transformers/models/paddleocr_vl/modeling_paddleocr_vl.py`	Generated modeling updated similarly.
`src/transformers/models/glm4v/processing_glm4v.py`	Processor can optionally return extra precomputed vision tensors.
`src/transformers/models/glm4v/modular_glm4v.py`	Vision forward accepts optional precomputed tensors; uses shared vision utils; minor tensor construction refactors.
`src/transformers/models/glm4v/modeling_glm4v.py`	Generated modeling updated similarly.
`src/transformers/models/glm4v_moe/modeling_glm4v_moe.py`	Same refactor for MoE variant.
`src/transformers/models/glm46v/processing_glm46v.py`	Processor can optionally return extra precomputed vision tensors.
`src/transformers/models/glm46v/modeling_glm46v.py`	Updates get_{image,video}_features signatures to accept precomputed tensors.
`src/transformers/models/glm_ocr/modular_glm_ocr.py`	Vision forward accepts optional precomputed tensors; uses shared vision utils.
`src/transformers/models/glm_ocr/modeling_glm_ocr.py`	Generated modeling updated similarly.
`src/transformers/models/glm_image/modular_glm_image.py`	Vision forward accepts optional precomputed tensors; uses shared vision utils.
`src/transformers/models/glm_image/modeling_glm_image.py`	Generated modeling updated similarly.
`src/transformers/models/esm/configuration_esm.py`	Docstring reorders `rope_theta` description (docs-only change).
`src/transformers/models/ernie4_5_vl_moe/modular_ernie4_5_vl_moe.py`	Vision forward accepts optional precomputed tensors; uses shared vision utils.
`src/transformers/models/ernie4_5_vl_moe/modeling_ernie4_5_vl_moe.py`	Generated modeling updated similarly.
`docs/source/en/model_doc/nomic_bert.md`	Updates the paper link URL (docs-only change).

        video_grid_thw (`torch.LongTensor` of shape `(num_videos, 3)`, *optional*):
            The temporal, height and width of feature shape of each video in LLM.
+        video_cu_seqlens (`torch.IntTensor`, *optional*):
+            Precomputed cumulative sequence lengths for videos (from `get_cu_seqlens`).


        image_grid_thw (`torch.LongTensor` of shape `(num_images, 3)`, *optional*):
            The temporal, height and width of feature shape of each image in LLM.
+        image_cu_seqlens (`torch.IntTensor`, *optional*):
+            Precomputed cumulative sequence lengths for images (from `get_cu_seqlens`).


Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

…e/transformers into hf-vision-audio-utils

vasqu

Only nits/smaller things left on my side tbh. One thing I definitely would like to have is to register the modality-specific kwargs under TransformersKwargs - just to have one visible place showing what we expect from users

vasqu · 2026-05-07T12:42:32Z

+import torch.nn.functional as F
+
+
+def get_vision_cu_seqlens(grid_thw: torch.Tensor, *, kwargs: dict | None = None) -> torch.Tensor:


Yep, no worries. I had my doubts it would be easy

vasqu · 2026-05-07T12:43:26Z

+        ``cu_seqlens``: ``(total_patches + 1,)`` int32 cumulative sequence boundaries.
+    """
+    if kwargs is not None and (cu_seqlens := kwargs.pop("cu_seqlens", None)) is not None:
+        return cu_seqlens


Btw, do we trust the user to have the right dtype or should we safely cast to int32 in any case?

i guess for now we can trust them (power feature)
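For context, the pop-or-compute pattern under discussion could look roughly like the sketch below. The early-return branch mirrors the diff hunk quoted above. The compute path is an assumption based on the usual `grid_thw` convention (one sequence of `h * w` patches per frame), and `sketch_vision_cu_seqlens` and its `safe_cast` flag are hypothetical names used only to show what the cast option would involve.

```python
import torch
import torch.nn.functional as F


def sketch_vision_cu_seqlens(grid_thw, *, kwargs=None, safe_cast=False):
    # Precomputed path: pop the tensor from the caller's kwargs and return it.
    if kwargs is not None and (cu_seqlens := kwargs.pop("cu_seqlens", None)) is not None:
        # `safe_cast` is hypothetical; per the thread, callers are trusted for now.
        return cu_seqlens.to(torch.int32) if safe_cast else cu_seqlens

    # Fallback path (assumed): one sequence per frame, each with h * w patches.
    patches_per_frame = grid_thw[:, 1] * grid_thw[:, 2]
    seq_lens = torch.repeat_interleave(patches_per_frame, grid_thw[:, 0])
    cu_seqlens = seq_lens.cumsum(dim=0, dtype=torch.int32)
    return F.pad(cu_seqlens, (1, 0), value=0)


grid_thw = torch.tensor([[1, 2, 2], [2, 2, 3]])  # (t, h, w) per image/video
computed = sketch_vision_cu_seqlens(grid_thw)

caller_kwargs = {"cu_seqlens": torch.tensor([0, 4, 10, 16], dtype=torch.int64)}
trusted = sketch_vision_cu_seqlens(grid_thw, kwargs=dict(caller_kwargs))
cast = sketch_vision_cu_seqlens(grid_thw, kwargs=dict(caller_kwargs), safe_cast=True)
```

With trust, an int64 tensor passes through unchanged; with the cast, it comes back as int32 at the cost of one extra op.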

vasqu · 2026-05-07T12:53:48Z

Rebumping because I still think we should declare them in TransformersKwargs just for user viz what they may have to pass

Agree that modality specific prefix handling warrants something; it really gets bigger by the years with all modalities 😢

ebezzam

Hey! I was addressing similar things when adding Qwen3 ASR and trying to make the audio encoder more compatible with torch.compile + align with other audio LM models in transformers.

My main comment is to shift the padding in the audio encoder to the feature extraction, which would mean defining a new feature extractor (instead of using Whisper's). But this feature extractor could be used by Qwen2.5/3 Omni and Qwen3 ASR.

cc: @eustlb, @vasqu

ebezzam · 2026-05-12T09:32:32Z

+        padded_feature, chunk_lengths = chunk_and_pad_features(
+            input_features, feature_lens, self.n_window, kwargs=kwargs
+        )


If we are going to do refactoring, could this padding be moved into a feature extractor? That's the convention we try to follow for audio models.

In the Qwen3 ASR PR, you can see here how I created a new feature extractor, which is essentially Whisper + this extra padding

this pr avoids refactoring as much as possible actually 😅 the hope is for it to be merged to unblock another 6 month old pr

we only add a path for some tensors to be pre computed and passed to the model, the computation is still inside the model when those tensors are missing (default path).

sure no worries!

ebezzam · 2026-05-12T09:43:04Z

 logger = logging.get_logger(__name__)


+def _get_feat_extract_output_lengths(input_lengths):


In the Qwen3 ASR PR, I also suggested changes here to account for the model parameter, as discussed here

Similarly in the processor: https://github.com/huggingface/transformers/pull/43838/changes#diff-7b066d82cb409cd8b2617d1271759fc8b130b9720162f333d3caf7d571a45d15

ebezzam · 2026-05-12T09:51:14Z

+        feature_lens (`torch.LongTensor` of shape `(batch_size,)`):
+            mel length
+        """
+        padded_feature, chunk_lengths = chunk_and_pad_features(


Similarly, it would be better to move this to the feature extractor.

Actually while working on the Qwen3 ASR addition, I tried using Qwen3OmniMoeAudioEncoder and overwrote the forward method like this (+ defining new feature extractor) for torch.compile compatibility.

Maybe same could be done for Qwen2.5 Omni?

ebezzam · 2026-05-12T09:56:00Z


    @can_return_tuple
    @auto_docstring
    def get_audio_features(


If it helps, how I've defined get_audio_features in Qwen3 ASR to align with other audio LM models: https://github.com/mbtariq82/transformers/blob/qwen3-asr/src/transformers/models/qwen3_asr/modular_qwen3_asr.py#L275-L297

IlyasMoutawwakil · 2026-05-13T07:23:17Z

@ebezzam @vasqu can we coordinate with #43838 in a follow-up rather than couple the two PRs ?

github-actions · 2026-05-13T07:24:55Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: ernie4_5_vl_moe, exaone4_5, glm46v, glm4v, glm4v_moe, glm_image, glm_ocr, paddleocr_vl, qwen2_5_omni, qwen2_5_vl, qwen2_vl, qwen3_5, qwen3_5_moe, qwen3_omni_moe, qwen3_vl, qwen3_vl_moe

…uggingface#45396)

* Extract pure vision/audio functions into standalone utilities
  - Create top-level `modeling_vision_utils.py` with shared pure functions: `get_vision_cu_seqlens`, `get_rotary_pos_ids`, `get_rotary_pos_ids_interleaved`, `get_window_index`, `get_pos_embed_indices`
  - Move audio precompute functions (`chunk_and_pad_features`, `get_audio_cu_seqlens`, `get_valid_indices`, `get_pool_indices`) into modular files directly
  - Simplify `VisionRotaryEmbedding.forward` to accept `pos_ids` tensor directly via broadcast multiply, eliminating data-dependent table creation
  - Make vision/audio encoder forwards accept optional precomputed tensors (`cu_seqlens`, `rotary_pos_ids`, `window_index`, `embed_indices`, etc.)
  - Use explicit naming: `get_vision_cu_seqlens` / `get_audio_cu_seqlens`

  Models: qwen2_vl, qwen2_5_vl, qwen3_vl, qwen3_5, qwen3_vl_moe, qwen3_5_moe, qwen2_5_omni, qwen3_omni_moe, glm4v, glm4v_moe, glm_image, glm_ocr, ernie4_5_vl_moe, video_llama_3, mlcd, paddleocr_vl

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Fix stale compute_ docstring references to match actual function names
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Revert mlcd changes — not part of this PR
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix
* kwargs
* opt-in
* fix dtype
* style
* guard torch import
* standarize
* propagate inputs
* fix docs
* fix docs
* auto docs
* more docs fixing
* fix omni
* fix paddle
* revert paddle ocr until another time
* finally fixed paddle ocr
* fix review
* revert chunking
* Potential fix for pull request finding
  Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* Potential fix for pull request finding
  Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* fix torch compilable check
* fix docs
* correct func name
* fix omni
* fix video llama 3
* fix video llama 3
* requires torch
* add missing grid device
* keep rot emb in fp32
* fix test device
* fix flm4v flex attention test
* rename to vision utils
* only one get_rotary_pos_ids is needed
* style
* style
* deprecate only
* fix
* simplify and revert processor changes
* renames
* move some stuff to their original place
* style
* style
* use chunked attention
* use decorator
* pass kwargs and return_dict
* fix missing
* keep in and get from kwargs
* revert some trailing commas
* fix
* fixes
* video llama fixes
* fix qwen3 vl
* forgot glm ocr
* address comments
* drop unnecessary
* use correct flash attn check
* missed deprecation
* empty commit 1
* empty commit 2
* revert video llama 3 config changes
* style
* style fix
* address comments
* remove unnecessary
* revert TransformersKwargs and add a todo

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

IlyasMoutawwakil and others added 5 commits April 13, 2026 10:44

Fix stale compute_ docstring references to match actual function names

fe46ba2

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Revert mlcd changes — not part of this PR

84439a0

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix

e62aa98

Merge branch 'main' into hf-vision-audio-utils

cbc1e22

IlyasMoutawwakil added 15 commits April 13, 2026 11:20

kwargs

c1d7a8a

opt-in

2771799

fix dtype

fa224e2

style

ac2895d

guard torch import

2f2787c

standarize

d628d96

propagate inputs

2a014a4

fix docs

957372a

fix docs

4194ff1

auto docs

836424b

more docs fixing

11f73fd

fix omni

71f90ec

fix paddle

a89d436

revert paddle ocr until another time

c0fdc0d

finally fixed paddle ocr

d1da022

IlyasMoutawwakil requested a review from Copilot April 13, 2026 13:52

Copilot started reviewing on behalf of IlyasMoutawwakil April 13, 2026 13:53 View session

Copilot AI reviewed Apr 13, 2026

IlyasMoutawwakil added 2 commits April 13, 2026 16:04

fix review

448ff2e

revert chunking

6731028

IlyasMoutawwakil requested a review from Copilot April 13, 2026 14:10

Copilot started reviewing on behalf of IlyasMoutawwakil April 13, 2026 14:11 View session

Copilot AI reviewed Apr 13, 2026

Potential fix for pull request finding

693ba9c

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

IlyasMoutawwakil and others added 10 commits May 5, 2026 12:43

Merge branch 'main' into hf-vision-audio-utils

df321cd

empty commit 1

1d67c0a

empty commit 2

b2f75bc

Merge branch 'hf-vision-audio-utils' of https://github.com/huggingfac…

bff7819

…e/transformers into hf-vision-audio-utils

revert video llama 3 config changes

7c39239

Merge branch 'main' into hf-vision-audio-utils

a3fb74e

style

b7b18e8

style fix

4556de4

Merge branch 'main' into hf-vision-audio-utils

891dcac

Merge branch 'main' into hf-vision-audio-utils

78142dd

vasqu approved these changes May 7, 2026

address comments

d1f63d2

vasqu mentioned this pull request May 7, 2026

Qwen3 ASR and Forced Aligner #43838

Open

5 tasks

IlyasMoutawwakil and others added 5 commits May 7, 2026 21:30

remove unnecessary

3438295

Merge branch 'main' into hf-vision-audio-utils

d4b3e1a

Merge branch 'main' into hf-vision-audio-utils

03decc3

revert TransformersKwargs and add a todo

1f7a9c3

Merge branch 'main' into hf-vision-audio-utils

a4565ac

ebezzam reviewed May 12, 2026

Merge branch 'main' into hf-vision-audio-utils

595b6f1

IlyasMoutawwakil added this pull request to the merge queue May 13, 2026

Merged via the queue into main with commit f00940e May 13, 2026
94 of 95 checks passed

IlyasMoutawwakil deleted the hf-vision-audio-utils branch May 13, 2026 08:24

ArthurZucker mentioned this pull request Jun 4, 2026

[PoC] HF exporters #41992

Open

IlyasMoutawwakil mentioned this pull request Jun 5, 2026

Add VLM export support (SmolVLM as starting point) huggingface/optimum-executorch#233

Open
