148 commits
8a367b0
Create modular file and port processor
mbtariq82-dev Feb 6, 2026
a7d62a2
Test for pretrained, tokenizer and feature extractor
mbtariq82-dev Feb 7, 2026
9e2cfd5
add ProcessorTesterMixin to test class
mbtariq82-dev Feb 9, 2026
665d1fb
add config classes
mbtariq82-dev Feb 9, 2026
3ce24d5
unable to pass test_apply_chat_template_audio, added debugging logic …
mbtariq82-dev Feb 11, 2026
3669d24
Add model and config classes
mbtariq82-dev Feb 15, 2026
ae7d1cb
Add attn_implementation to configs
mbtariq82-dev Feb 16, 2026
26db1dd
Fix tests by removing attentions hook and manually calculating attent…
mbtariq82-dev Feb 18, 2026
d4c307b
Change model 'attentions' hook class from Qwen3ASRThinkerTextAttentio…
mbtariq82-dev Feb 18, 2026
0b3248d
Architectural change inspired by test_generate_with_static_cache: Ali…
mbtariq82-dev Feb 19, 2026
fdfd969
Use modular transformers components to define Qwen3ASRAudioEncoderConfig
mbtariq82-dev Feb 19, 2026
6336f14
Use modular transformers to define Qwen3ASRTextConfig from Qwen3OmniM…
mbtariq82-dev Feb 23, 2026
72cd0f6
Comment about inherited class-level attributes for Qwen3ASRTextConfig
mbtariq82-dev Feb 23, 2026
86f4678
Use modular transformers to define Qwen3ASRThinkerConfig from Qwen3Om…
mbtariq82-dev Feb 23, 2026
e4f4e4f
Remove comments
mbtariq82-dev Feb 23, 2026
2a0b543
Use modular transformers to define Qwen3ASRConfig from Qwen3OmniMoeCo…
mbtariq82-dev Feb 23, 2026
598e838
Import _get_feat_extract_output_lengths from Qwen3-Omni-Moe instead o…
mbtariq82-dev Feb 23, 2026
65ead7b
Use modular transformers to define Qwen3ASRProcessor from Qwen3OmniMo…
mbtariq82-dev Feb 24, 2026
0d548a8
Change pipeline_model_mapping in model tests from 'automatic-speech-r…
mbtariq82-dev Feb 24, 2026
e6a75e6
Use modular transformers to define Qwen3ASRTextRMSNorm from Qwen3Omni…
mbtariq82-dev Feb 24, 2026
c36106a
Import rotate_half, repeat_kv, apply_rotary_pos_emb, eager_attention_…
mbtariq82-dev Feb 24, 2026
c81f684
Use modular transformers to define Qwen3ASRTextAttention from Qwen3Om…
mbtariq82-dev Feb 24, 2026
fd12335
Use modular transformers to define Qwen3ASRTextMLP from Qwen3OmniMoeT…
mbtariq82-dev Feb 24, 2026
e4b7d93
Use modular transformers to define Qwen3ASRThinkerTextDecoderLayer fr…
mbtariq82-dev Feb 24, 2026
c64210c
Import _get_feat_extract_output_lengths from Qwen3-Omni-Moe instead o…
mbtariq82-dev Feb 24, 2026
03d9fa6
Use modular transformers to define Qwen3ASRPreTrainedModelForConditio…
mbtariq82-dev Feb 24, 2026
77c11ee
Use modular transformers to define Qwen3ASRAudioAttention from Qwen3O…
mbtariq82-dev Feb 24, 2026
c7bc5d1
Use modular transformers to define Qwen3ASRAudioEncoderLayer from Qwe…
mbtariq82-dev Feb 24, 2026
835b891
Import SinusoidsPositionEmbedding from Qwen3-Omni-Moe instead of rede…
mbtariq82-dev Feb 24, 2026
f3e6a8d
Use modular transformers to define Qwen3ASRAudioEncoder from Qwen3Omn…
mbtariq82-dev Feb 24, 2026
de3fdf9
Use modular transformers to define Qwen3ASRThinkerTextRotaryEmbedding…
mbtariq82-dev Feb 25, 2026
077a52b
Use modular transformers to define Qwen3ASRThinkerTextMLP directly fr…
mbtariq82-dev Feb 25, 2026
14735fd
Use modular transformers to define Qwen3ASRThinkerTextRMSNorm directl…
mbtariq82-dev Feb 25, 2026
69ecc47
Use modular transformers to define Qwen3ASRThinkerTextModel from Qwen…
mbtariq82-dev Feb 25, 2026
4a8fb2b
Use modular transformers to define Qwen3ASRThinkerForConditionalGener…
mbtariq82-dev Feb 25, 2026
4e14ff1
Update Qwen3ASRTextConfig modular according to convention.
ebezzam Feb 26, 2026
df87020
Nits
ebezzam Feb 26, 2026
805f1a0
Change Qwen3ASRProcessor inheritance from Qwen3OmniMoeProcessor to Au…
mbtariq82-dev Feb 26, 2026
0af1d92
Merge branch 'qwen3-asr' of https://github.com/mbtariq82/transformers…
mbtariq82-dev Feb 26, 2026
7d9c73d
Comment about ThinkerConfig inheritance
mbtariq82-dev Feb 26, 2026
0d78599
Change Qwen3ASRProcessor to inherit directly - init no longer has to …
mbtariq82-dev Feb 26, 2026
a1e5f77
Remove torch.manual_seed from integration tests
mbtariq82-dev Feb 26, 2026
06250d9
Style: fix ruff lint issues and typing compliance
mbtariq82-dev Feb 26, 2026
d78e6c5
Add reproducer to programmatically update expected results for integr…
mbtariq82-dev Feb 28, 2026
9ad348b
Add convert_qwen3_asr_to_hf.py
mbtariq82-dev Mar 2, 2026
54e5ad1
Remove Qwen3OmniMoeConfig inheritance from Qwen3ASRConfig
mbtariq82-dev Mar 2, 2026
1f01d00
Remove Qwen3OmniMoeThinkerConfig inheritance from Qwen3ASRThinkerConfig
mbtariq82-dev Mar 2, 2026
411c39c
cleanup
mbtariq82-dev Mar 2, 2026
b8a6c38
Cleanup
mbtariq82 Mar 3, 2026
69c3e26
Cleanup
mbtariq82 Mar 3, 2026
28877a1
Cleanup
mbtariq82 Mar 3, 2026
47dacb9
Cleanup
mbtariq82 Mar 3, 2026
abefad7
Functional model conversion.
ebezzam Mar 3, 2026
69ccfae
Cleanup
mbtariq82-dev Mar 4, 2026
ceb72ff
Cleanup
mbtariq82-dev Mar 4, 2026
3ca90bf
Cleanup
mbtariq82-dev Mar 4, 2026
086a464
Cleanup
mbtariq82-dev Mar 4, 2026
bef02e4
Add init_weights to Qwen3ASRPreTrainedModel to pass ModelTesterMixin:…
mbtariq82-dev Mar 5, 2026
581676b
Cleanup
mbtariq82-dev Mar 5, 2026
d55747b
Cleanup
mbtariq82-dev Mar 5, 2026
b9d83de
Cleanup
mbtariq82-dev Mar 5, 2026
80ccd30
Use converted hf weights for integration tests
mbtariq82-dev Mar 5, 2026
e951ea5
Change Processor tests to use hf checkpoint
mbtariq82-dev Mar 7, 2026
f73117a
Restore CI/github scripts to upstream versions
mbtariq82-dev Mar 9, 2026
948f40a
Restore CI/github scripts to upstream versions (2)
mbtariq82-dev Mar 9, 2026
65b0a3c
Restore CI/github scripts to upstream versions (3)
mbtariq82-dev Mar 9, 2026
e941a46
passing integration tests
ebezzam Mar 12, 2026
fa21c2e
Standardize processor.
ebezzam Mar 12, 2026
13f7203
Cleanup and standardize modeling.
ebezzam Mar 13, 2026
78299be
Remove rope deltas.
ebezzam Mar 13, 2026
a8b161f
Stop tracking reproducer.
ebezzam Mar 19, 2026
6b03776
Merge branch 'qwen3-asr' of github.com:mbtariq82/transformers into qw…
ebezzam Mar 19, 2026
a23f637
Merge branch 'main' into qwen3-asr
ebezzam Mar 19, 2026
7ed8e54
Update config modular.
ebezzam Mar 20, 2026
224c7b3
Account for n_window in encoder length computation.
ebezzam Mar 31, 2026
7a58b9c
Merge branch 'main' into qwen3-asr
ebezzam Mar 31, 2026
f6e97e5
Add qwen3asr
ebezzam Mar 31, 2026
a3aa053
Merge branch 'qwen3-asr' of github.com:mbtariq82/transformers into qw…
ebezzam Mar 31, 2026
c7e813c
Nit
ebezzam Mar 31, 2026
401d869
Expose encoder from qwen3 omni, and cleaner modular.
ebezzam Mar 31, 2026
3ad04f6
Directly use language model from Qwen3.
ebezzam Mar 31, 2026
0139cfe
Modular from other audio LMs.
ebezzam Mar 31, 2026
7197827
Shift flattening to processor.
ebezzam Mar 31, 2026
6a1308d
Add docs and post-process methods.
ebezzam Apr 15, 2026
33cae66
Address model integration tests + style
ebezzam Apr 15, 2026
d711751
Processing tests.
ebezzam Apr 16, 2026
6bae830
Functional forced alignment in a single modular.
ebezzam Apr 20, 2026
c6250a3
Add reproducer for timestamps.
ebezzam Apr 20, 2026
5d12746
Remove processor from modular.
ebezzam Apr 20, 2026
3839910
Merge branch 'main' into qwen3-asr
ebezzam Apr 20, 2026
4d89dd2
Create base Qwen3ASR model like Llava.
ebezzam Apr 22, 2026
62d80ea
Push timestamp fixtures.
ebezzam Apr 22, 2026
a5c5d60
Nits and style.
ebezzam Apr 22, 2026
502ff64
Forced aligner refactor: new auto class and better naming.
ebezzam Apr 22, 2026
67c1f52
Forced alignment nits.
ebezzam Apr 22, 2026
e0d751e
Create audio encoder that is more in line with other and torch compil…
ebezzam Apr 23, 2026
9b582c0
Small fixes for tests.
ebezzam Apr 24, 2026
81b8bba
add torch compile forced aligner example, and small fix for compile
ebezzam Apr 24, 2026
50962ae
Modeling nits.
ebezzam Apr 24, 2026
0b932ec
undo exposure of omni audio encoder, doc/style nits
ebezzam Apr 24, 2026
61d0ba2
Add note on attention's k_proj bias.
ebezzam May 1, 2026
ffa7915
Cleaner init.
ebezzam May 4, 2026
f344601
Apply suggestion from @vasqu
ebezzam May 8, 2026
4159fc0
Apply suggestion from @vasqu
ebezzam May 8, 2026
f85234b
Doc improvements, and conversion fix.
ebezzam May 8, 2026
fb0c006
Merge branch 'qwen3-asr' of github.com:mbtariq82/transformers into qw…
ebezzam May 8, 2026
d568035
Simplify conversion script.
ebezzam May 8, 2026
2e02d0a
Apply suggestion from @vasqu
ebezzam May 8, 2026
94239ae
Apply suggestion from @vasqu
ebezzam May 8, 2026
48fdcf9
Better encoder config in modular.
ebezzam May 8, 2026
09b5d9f
Merge branch 'qwen3-asr' of github.com:mbtariq82/transformers into qw…
ebezzam May 8, 2026
ce6f4df
Add default method to SinusoidsPositionEmbedding, and generate from m…
ebezzam May 8, 2026
8a5f845
Refactor forced aligner. Use GenericForTokenClassification.
ebezzam May 11, 2026
02383ee
Address processor comments.
ebezzam May 11, 2026
d904134
Add support for language codes.
ebezzam May 11, 2026
51253d7
Address comments for token classification.
ebezzam May 12, 2026
371da13
Better modular for attention and token classification.
ebezzam May 12, 2026
e303059
Merge branch 'main' into qwen3-asr
ebezzam May 12, 2026
cb42572
Modular after merge.
ebezzam May 12, 2026
b12c76b
Use new ALM testing classes.
ebezzam May 13, 2026
3392aa9
Update src/transformers/models/qwen3_asr/feature_extraction_qwen3_asr.py
ebezzam May 16, 2026
ecf3f74
Address review comments: create make_list_of_audio_chat_template util…
ebezzam May 18, 2026
1396cde
merge
ebezzam May 18, 2026
deb037b
Merge branch 'main' into qwen3-asr
ebezzam May 18, 2026
6053739
Modular after merge.
ebezzam May 18, 2026
3d47bb2
Address unprotected torch import.
ebezzam May 18, 2026
eb5ccc4
Introduce score_bias for GenericForTokenClassification.
ebezzam May 19, 2026
41125d7
Refactor token classification bias.
ebezzam May 20, 2026
7e3fbc9
Merge branch 'main' into qwen3-asr
ebezzam May 20, 2026
cdb6639
Refactor processing like AudioFlamingo3 with submethods.
ebezzam May 20, 2026
8034275
Use windowed attention like in Qwen 3 Omni.
ebezzam May 22, 2026
bbe486c
Add multimodal projector, and small refactor.
ebezzam May 22, 2026
b1aae95
Better max_source_positions, style fixes.
ebezzam May 22, 2026
1a6d2a5
Merge branch 'main' into qwen3-asr
ebezzam Jun 4, 2026
7bac079
Update modular after ALM refactor.
ebezzam Jun 4, 2026
8ef687a
Merge branch 'main' into qwen3-asr
ebezzam Jun 16, 2026
1c7f736
check repo
ebezzam Jun 16, 2026
46c5596
Apply post-processing like original implementation.
ebezzam Jun 17, 2026
bca869f
Set default max new tokens like original, and nits.
ebezzam Jun 17, 2026
cf31d4b
Zero pad to min length like original
ebezzam Jun 17, 2026
6812afe
Remove padding mask update for min length (like original)
ebezzam Jun 17, 2026
753a0b7
Refactor, and update padding mask.
ebezzam Jun 17, 2026
cf861cf
revert mask update, hurts AMI performance
ebezzam Jun 17, 2026
f13c378
Merge branch 'main' into qwen3-asr
ebezzam Jun 22, 2026
5be33e7
feature extractor nits
ebezzam Jun 22, 2026
e4f0704
Merge branch 'main' into qwen3-asr
ebezzam Jun 22, 2026
3fecf7c
Renaming with hf suffix.
ebezzam Jun 22, 2026
a2ec912
Merge branch 'qwen3-asr' of github.com:mbtariq82/transformers into qw…
ebezzam Jun 22, 2026
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -1125,6 +1125,8 @@
title: PE Audio
- local: model_doc/pop2piano
title: Pop2Piano
- local: model_doc/qwen3_asr
title: Qwen3 ASR
- local: model_doc/seamless_m4t
title: Seamless-M4T
- local: model_doc/seamless_m4t_v2
504 changes: 504 additions & 0 deletions docs/source/en/model_doc/qwen3_asr.md

Large diffs are not rendered by default.

24 changes: 24 additions & 0 deletions src/transformers/audio_utils.py
@@ -402,6 +402,30 @@ def make_list_of_audio(
raise ValueError("Invalid input type. Must be a single audio or a list of audio")


def make_list_of_audio_chat_template(
audio: list[AudioInput] | AudioInput | str | list[str],
) -> AudioInput:
"""
Ensure that the output is a list of audio. Unlike `make_list_of_audio`, this function also accepts a URL string or
local path, as accepted by chat templates.

Args:
audio (`Union[list[AudioInput], AudioInput]`):
The input audio. Can be a URL string, local path, numpy/torch array, or a list of these.
Returns:
list: A list of audio.
"""

# Handle string inputs
if isinstance(audio, str):
return [audio]
if isinstance(audio, (list, tuple)) and audio and all(isinstance(a, str) for a in audio):
return list(audio)

# Handle numpy/torch array inputs
return make_list_of_audio(audio)
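
The dispatch in `make_list_of_audio_chat_template` can be tried out without the library. The following is a minimal sketch, assuming a hypothetical stand-in named `to_audio_list`; the array branch is a simplification (1-D NumPy arrays only) and is not the behavior of the real `make_list_of_audio`.

```python
import numpy as np


def to_audio_list(audio):
    """Hypothetical stand-in: always return a list of audio items."""
    # A single URL or local path becomes a one-element list.
    if isinstance(audio, str):
        return [audio]
    # A non-empty list/tuple made only of strings is copied into a list.
    if isinstance(audio, (list, tuple)) and audio and all(isinstance(a, str) for a in audio):
        return list(audio)
    # Simplified array handling (assumption): one 1-D array is one clip.
    if isinstance(audio, np.ndarray) and audio.ndim == 1:
        return [audio]
    if isinstance(audio, (list, tuple)) and all(isinstance(a, np.ndarray) for a in audio):
        return list(audio)
    raise ValueError("Invalid input type. Must be a single audio or a list of audio")


paths = to_audio_list(("a.wav", "b.wav"))
single = to_audio_list(np.zeros(16000, dtype=np.float32))
```

Strings are handled first because a path or URL is not a valid array input, so it has to be intercepted before the array-based helper is called.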


def hertz_to_mel(freq: float | np.ndarray, mel_scale: str = "htk") -> float | np.ndarray:
"""
Convert frequency from hertz to mels.
2 changes: 1 addition & 1 deletion src/transformers/configuration_utils.py
@@ -259,7 +259,7 @@ def __post_init__(self, **kwargs):
# Our configs prev wouldn't save `id2label` for 2 labels because it is the default. In all other
# cases we expect the config dict to have an `id2label` field if it's a clf model, or not otherwise
if self.id2label is None:
self.num_labels = kwargs.get("num_labels", 2)
self.num_labels = kwargs.get("num_labels", self.num_labels if self.num_labels is not None else 2)


I added this because otherwise an existing self.num_labels (=5000) was getting overwritten by 2 during conversion.
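
The precedence this comment describes can be illustrated in isolation. This is a minimal sketch, assuming a hypothetical helper `resolve_num_labels` that is not part of `configuration_utils.py`; it only mirrors the changed expression: an explicit `num_labels` kwarg wins, then an already-set value, then the default of 2.

```python
def resolve_num_labels(current, kwargs, default=2):
    """Hypothetical helper mirroring the changed line in __post_init__."""
    # Keep an existing value instead of unconditionally falling back to 2.
    fallback = current if current is not None else default
    return kwargs.get("num_labels", fallback)


kept = resolve_num_labels(5000, {})                      # existing value survives
defaulted = resolve_num_labels(None, {})                 # nothing set: default
overridden = resolve_num_labels(5000, {"num_labels": 3}) # explicit kwarg wins
```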

else:
if kwargs.get("num_labels") is not None and len(self.id2label) != kwargs.get("num_labels"):
logger.warning(
6 changes: 5 additions & 1 deletion src/transformers/modeling_layers.py
@@ -245,7 +245,11 @@ def __init__(self, config):
else:
classifier_dropout = 0.1
self.dropout = nn.Dropout(classifier_dropout)
self.score = nn.Linear(config.get_text_config().hidden_size, config.num_labels)
self.score = nn.Linear(
config.get_text_config().hidden_size,
config.num_labels,
bias=getattr(config, "token_classification_bias", True),
)

# Initialize weights and apply final processing
self.post_init()
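
The `getattr(config, "token_classification_bias", True)` lookup keeps existing configs unchanged while letting a config opt out of the bias. A minimal sketch of that pattern, assuming a hypothetical helper `classifier_uses_bias` and plain `SimpleNamespace` objects in place of real config classes:

```python
from types import SimpleNamespace


def classifier_uses_bias(config):
    """Hypothetical helper: read the optional flag, defaulting to True."""
    # Configs that never define the attribute keep the previous behavior.
    return getattr(config, "token_classification_bias", True)


legacy_config = SimpleNamespace(num_labels=2)
no_bias_config = SimpleNamespace(num_labels=5000, token_classification_bias=False)

legacy_bias = classifier_uses_bias(legacy_config)
opt_out_bias = classifier_uses_bias(no_bias_config)
```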
@@ -16,7 +16,7 @@

import numpy as np

from ...audio_utils import AudioInput, make_list_of_audio
from ...audio_utils import AudioInput, make_list_of_audio_chat_template
from ...feature_extraction_utils import BatchFeature
from ...processing_utils import ProcessingKwargs, ProcessorMixin, Unpack
from ...tokenization_utils_base import TextInput
@@ -200,14 +200,9 @@ def apply_transcription_request(

"""

if isinstance(audio, str):
audio_items: list[str | np.ndarray] = [audio]
elif isinstance(audio, (list, tuple)) and audio and all(isinstance(el, str) for el in audio):
audio_items = list(audio)
else:
audio_items = list(make_list_of_audio(audio))
if is_torch_available():
audio_items = [el.detach().cpu().numpy() if isinstance(el, torch.Tensor) else el for el in audio_items]
audio_items: list[str | np.ndarray] = list(make_list_of_audio_chat_template(audio))
if is_torch_available():
audio_items = [el.detach().cpu().numpy() if isinstance(el, torch.Tensor) else el for el in audio_items]

batch_size = len(audio_items)
if batch_size == 0:
5 changes: 5 additions & 0 deletions src/transformers/models/auto/auto_mappings.py
@@ -507,6 +507,8 @@
("qwen3_5_moe_vision", "Qwen3_5MoeVisionConfig"),
("qwen3_5_text", "Qwen3_5TextConfig"),
("qwen3_5_vision", "Qwen3_5VisionConfig"),
("qwen3_asr", "Qwen3ASRConfig"),
("qwen3_asr_encoder", "Qwen3ASREncoderConfig"),
("qwen3_moe", "Qwen3MoeConfig"),
("qwen3_next", "Qwen3NextConfig"),
("qwen3_omni_moe", "Qwen3OmniMoeConfig"),
@@ -848,6 +850,7 @@
("qwen3_5_moe_vision", "qwen3_5_moe"),
("qwen3_5_text", "qwen3_5"),
("qwen3_5_vision", "qwen3_5"),
("qwen3_asr_encoder", "qwen3_asr"),
("qwen3_omni_moe_audio_encoder", "qwen3_omni_moe"),
("qwen3_omni_moe_talker_code_predictor", "qwen3_omni_moe"),
("qwen3_omni_moe_talker_text", "qwen3_omni_moe"),
@@ -959,6 +962,7 @@
("pe_audio", "PeAudioFeatureExtractor"),
("phi4_multimodal", "Phi4MultimodalFeatureExtractor"),
("pop2piano", "Pop2PianoFeatureExtractor"),
("qwen3_asr", "Qwen3ASRFeatureExtractor"),
("seamless_m4t", "SeamlessM4TFeatureExtractor"),
("speech_to_text", "Speech2TextFeatureExtractor"),
("speecht5", "SpeechT5FeatureExtractor"),
@@ -1071,6 +1075,7 @@
("qwen2_5_vl", "Qwen2_5_VLProcessor"),
("qwen2_audio", "Qwen2AudioProcessor"),
("qwen2_vl", "Qwen2VLProcessor"),
("qwen3_asr", "Qwen3ASRProcessor"),
("qwen3_omni_moe", "Qwen3OmniMoeProcessor"),
("qwen3_vl", "Qwen3VLProcessor"),
("sam", "SamProcessor"),
7 changes: 7 additions & 0 deletions src/transformers/models/auto/modeling_auto.py
@@ -410,6 +410,8 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):
("qwen3_5_moe_vision", "Qwen3_5MoeVisionModel"),
("qwen3_5_text", "Qwen3_5TextModel"),
("qwen3_5_vision", "Qwen3_5VisionModel"),
("qwen3_asr", "Qwen3ASRModel"),
("qwen3_asr_encoder", "Qwen3ASREncoder"),
("qwen3_moe", "Qwen3MoeModel"),
("qwen3_next", "Qwen3NextModel"),
("qwen3_vl", "Qwen3VLModel"),
@@ -612,6 +614,7 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):
("openai-gpt", "OpenAIGPTLMHeadModel"),
("paligemma", "PaliGemmaForConditionalGeneration"),
("qwen2_audio", "Qwen2AudioForConditionalGeneration"),
("qwen3_asr", "Qwen3ASRForConditionalGeneration"),
("roberta", "RobertaForMaskedLM"),
("roberta-prelayernorm", "RobertaPreLayerNormForMaskedLM"),
("roc_bert", "RoCBertForPreTraining"),
@@ -1112,6 +1115,7 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):
("phi4_multimodal", "Phi4MultimodalForCausalLM"),
("qwen2_5_omni", "Qwen2_5OmniForConditionalGeneration"),
("qwen2_audio", "Qwen2AudioForConditionalGeneration"),
("qwen3_asr", "Qwen3ASRForConditionalGeneration"),
("qwen3_omni_moe", "Qwen3OmniMoeForConditionalGeneration"),
("vibevoice_asr", "VibeVoiceAsrForConditionalGeneration"),
("voxtral", "VoxtralForConditionalGeneration"),
@@ -1266,6 +1270,7 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):
("plbart", "PLBartForConditionalGeneration"),
("prophetnet", "ProphetNetForConditionalGeneration"),
("qwen2_audio", "Qwen2AudioForConditionalGeneration"),
("qwen3_asr", "Qwen3ASRForConditionalGeneration"),
("seamless_m4t", "SeamlessM4TForTextToText"),
("seamless_m4t_v2", "SeamlessM4Tv2ForTextToText"),
("switch_transformers", "SwitchTransformersForConditionalGeneration"),
@@ -1290,6 +1295,7 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):
("moonshine", "MoonshineForConditionalGeneration"),
("moonshine_streaming", "MoonshineStreamingForConditionalGeneration"),
("pop2piano", "Pop2PianoForConditionalGeneration"),
("qwen3_asr", "Qwen3ASRForConditionalGeneration"),
("seamless_m4t", "SeamlessM4TForSpeechToText"),
("seamless_m4t_v2", "SeamlessM4Tv2ForSpeechToText"),
("speech-encoder-decoder", "SpeechEncoderDecoderModel"),
@@ -1608,6 +1614,7 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):
("qwen2_moe", "Qwen2MoeForTokenClassification"),
("qwen3", "Qwen3ForTokenClassification"),
("qwen3_5", "Qwen3_5ForTokenClassification"),
("qwen3_asr", "Qwen3ASRForTokenClassification"),
("qwen3_moe", "Qwen3MoeForTokenClassification"),
("qwen3_next", "Qwen3NextForTokenClassification"),
("rembert", "RemBertForTokenClassification"),
1 change: 1 addition & 0 deletions src/transformers/models/auto/tokenization_auto.py
@@ -282,6 +282,7 @@
("qwen3", "Qwen2Tokenizer" if is_tokenizers_available() else None),
("qwen3_5", "Qwen3_5Tokenizer" if is_tokenizers_available() else None),
("qwen3_5_moe", "Qwen3_5Tokenizer" if is_tokenizers_available() else None),
("qwen3_asr", "Qwen2Tokenizer" if is_tokenizers_available() else None),
("qwen3_moe", "Qwen2Tokenizer" if is_tokenizers_available() else None),
("qwen3_next", "Qwen2Tokenizer" if is_tokenizers_available() else None),
("qwen3_omni_moe", "Qwen2Tokenizer" if is_tokenizers_available() else None),
13 changes: 4 additions & 9 deletions src/transformers/models/glmasr/modular_glmasr.py
@@ -17,7 +17,7 @@
import numpy as np

from ...activations import ACT2FN
from ...audio_utils import AudioInput, make_list_of_audio
from ...audio_utils import AudioInput, make_list_of_audio_chat_template
from ...cache_utils import Cache
from ...feature_extraction_utils import BatchFeature
from ...modeling_layers import GradientCheckpointingLayer
@@ -112,14 +112,9 @@ def apply_transcription_request(

"""

if isinstance(audio, str):
audio_items: list[str | np.ndarray] = [audio]
elif isinstance(audio, (list, tuple)) and audio and all(isinstance(el, str) for el in audio):
audio_items = list(audio)
else:
audio_items = list(make_list_of_audio(audio))
if is_torch_available():
audio_items = [el.detach().cpu().numpy() if isinstance(el, torch.Tensor) else el for el in audio_items]
audio_items: list[str | np.ndarray] = list(make_list_of_audio_chat_template(audio))
if is_torch_available():
audio_items = [el.detach().cpu().numpy() if isinstance(el, torch.Tensor) else el for el in audio_items]

batch_size = len(audio_items)
if batch_size == 0:
13 changes: 4 additions & 9 deletions src/transformers/models/glmasr/processing_glmasr.py
@@ -21,7 +21,7 @@

import numpy as np

from ...audio_utils import AudioInput, make_list_of_audio
from ...audio_utils import AudioInput, make_list_of_audio_chat_template
from ...feature_extraction_utils import BatchFeature
from ...processing_utils import ProcessingKwargs, ProcessorMixin, Unpack
from ...tokenization_utils_base import TextInput
@@ -209,14 +209,9 @@ def apply_transcription_request(

"""

if isinstance(audio, str):
audio_items: list[str | np.ndarray] = [audio]
elif isinstance(audio, (list, tuple)) and audio and all(isinstance(el, str) for el in audio):
audio_items = list(audio)
else:
audio_items = list(make_list_of_audio(audio))
if is_torch_available():
audio_items = [el.detach().cpu().numpy() if isinstance(el, torch.Tensor) else el for el in audio_items]
audio_items: list[str | np.ndarray] = list(make_list_of_audio_chat_template(audio))
if is_torch_available():
audio_items = [el.detach().cpu().numpy() if isinstance(el, torch.Tensor) else el for el in audio_items]

batch_size = len(audio_items)
if batch_size == 0:
22 changes: 10 additions & 12 deletions src/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py
@@ -144,10 +144,8 @@ class Qwen2_5OmniPreTrainedModel(PreTrainedModel):
def _init_weights(self, module):
super()._init_weights(module)
if isinstance(module, SinusoidsPositionEmbedding):
log_timescale_increment = np.log(module.max_timescale) / (module.channels // 2 - 1)
inv_timescales = torch.exp(-log_timescale_increment * torch.arange(module.channels // 2).float())
scaled_time = torch.arange(module.length)[:, np.newaxis] * inv_timescales[np.newaxis, :]
init.copy_(module.positional_embedding, torch.cat([torch.sin(scaled_time), torch.cos(scaled_time)], dim=1))
position_embeddings = module.compute_default_singular_positional_embedding()
init.copy_(module.positional_embedding, position_embeddings)
elif isinstance(module, UpSample1d):
filter_tensor = kaiser_sinc_filter1d(0.5 / module.ratio, 0.6 / module.ratio, module.kernel_size)
init.copy_(module.filter, filter_tensor)
@@ -718,14 +716,14 @@ def __init__(self, length, channels, max_timescale=10000):
self.max_timescale = max_timescale
if channels % 2 != 0:
raise ValueError("SinusoidsPositionEmbedding needs even channels input")
log_timescale_increment = np.log(max_timescale) / (channels // 2 - 1)
inv_timescales = torch.exp(-log_timescale_increment * torch.arange(channels // 2).float())
scaled_time = torch.arange(length)[:, np.newaxis] * inv_timescales[np.newaxis, :]
self.register_buffer(
"positional_embedding",
torch.cat([torch.sin(scaled_time), torch.cos(scaled_time)], dim=1),
persistent=False,
)
position_embedding = self.compute_default_singular_positional_embedding()
self.register_buffer("positional_embedding", position_embedding, persistent=False)

def compute_default_singular_positional_embedding(self):
log_timescale_increment = np.log(self.max_timescale) / (self.channels // 2 - 1)
inv_timescales = torch.exp(-log_timescale_increment * torch.arange(self.channels // 2).float())
scaled_time = torch.arange(self.length)[:, np.newaxis] * inv_timescales[np.newaxis, :]
return torch.cat([torch.sin(scaled_time), torch.cos(scaled_time)], dim=1)

def forward(self, seqlen: int):
return self.positional_embedding[:seqlen, :]
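
The table built by `compute_default_singular_positional_embedding` can be reproduced outside the module. This is a minimal sketch in NumPy rather than torch, assuming a hypothetical free function `sinusoidal_position_embedding`; it follows the same three steps as the method (log-spaced inverse timescales, outer product with positions, sine and cosine halves concatenated).

```python
import numpy as np


def sinusoidal_position_embedding(length, channels, max_timescale=10000.0):
    """Hypothetical NumPy port of the computation in the diff above."""
    if channels % 2 != 0:
        raise ValueError("SinusoidsPositionEmbedding needs even channels input")
    half = channels // 2
    # Note: channels == 2 gives half - 1 == 0, so this sketch assumes channels >= 4.
    log_timescale_increment = np.log(max_timescale) / (half - 1)
    # Geometric progression of inverse timescales from 1 down to 1 / max_timescale.
    inv_timescales = np.exp(-log_timescale_increment * np.arange(half))
    # Shape (length, half): position index times each inverse timescale.
    scaled_time = np.arange(length)[:, np.newaxis] * inv_timescales[np.newaxis, :]
    # First half of the channels holds sines, second half cosines.
    return np.concatenate([np.sin(scaled_time), np.cos(scaled_time)], axis=1)


table = sinusoidal_position_embedding(length=8, channels=4)
first_rows = table[:3, :]  # same slicing as forward(seqlen=3)
```

Moving the computation into one method means the constructor and `_init_weights` share a single definition instead of two copies of the same four lines.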
22 changes: 10 additions & 12 deletions src/transformers/models/qwen2_5_omni/modular_qwen2_5_omni.py
@@ -760,10 +760,8 @@ class Qwen2_5OmniPreTrainedModel(Qwen2_5_VLPreTrainedModel):
def _init_weights(self, module):
PreTrainedModel._init_weights(self, module)
if isinstance(module, SinusoidsPositionEmbedding):
log_timescale_increment = np.log(module.max_timescale) / (module.channels // 2 - 1)
inv_timescales = torch.exp(-log_timescale_increment * torch.arange(module.channels // 2).float())
scaled_time = torch.arange(module.length)[:, np.newaxis] * inv_timescales[np.newaxis, :]
init.copy_(module.positional_embedding, torch.cat([torch.sin(scaled_time), torch.cos(scaled_time)], dim=1))
position_embeddings = module.compute_default_singular_positional_embedding()
init.copy_(module.positional_embedding, position_embeddings)
elif isinstance(module, UpSample1d):
filter_tensor = kaiser_sinc_filter1d(0.5 / module.ratio, 0.6 / module.ratio, module.kernel_size)
init.copy_(module.filter, filter_tensor)
@@ -1283,14 +1281,14 @@ def __init__(self, length, channels, max_timescale=10000):
self.max_timescale = max_timescale
if channels % 2 != 0:
raise ValueError("SinusoidsPositionEmbedding needs even channels input")
log_timescale_increment = np.log(max_timescale) / (channels // 2 - 1)
inv_timescales = torch.exp(-log_timescale_increment * torch.arange(channels // 2).float())
scaled_time = torch.arange(length)[:, np.newaxis] * inv_timescales[np.newaxis, :]
self.register_buffer(
"positional_embedding",
torch.cat([torch.sin(scaled_time), torch.cos(scaled_time)], dim=1),
persistent=False,
)
position_embedding = self.compute_default_singular_positional_embedding()
self.register_buffer("positional_embedding", position_embedding, persistent=False)

def compute_default_singular_positional_embedding(self):
log_timescale_increment = np.log(self.max_timescale) / (self.channels // 2 - 1)
inv_timescales = torch.exp(-log_timescale_increment * torch.arange(self.channels // 2).float())
scaled_time = torch.arange(self.length)[:, np.newaxis] * inv_timescales[np.newaxis, :]
return torch.cat([torch.sin(scaled_time), torch.cos(scaled_time)], dim=1)

def forward(self, seqlen: int):
return self.positional_embedding[:seqlen, :]
29 changes: 29 additions & 0 deletions src/transformers/models/qwen3_asr/__init__.py
@@ -0,0 +1,29 @@
# Copyright 2026 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure


if TYPE_CHECKING:
from .configuration_qwen3_asr import *
from .feature_extraction_qwen3_asr import *
from .modeling_qwen3_asr import *
from .processing_qwen3_asr import *
else:
import sys

_file = globals()["__file__"]
sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
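
The `__init__.py` above replaces the package in `sys.modules` with a lazy module so that submodules are imported only when one of their names is first accessed. The general idea can be sketched with the standard library alone. This is not the `_LazyModule` implementation; `LazyDemoModule` and its attribute map are hypothetical, and standard-library modules stand in for the model's submodules.

```python
import importlib
import types


class LazyDemoModule(types.ModuleType):
    """Hypothetical lazy module: resolves attributes on first access."""

    def __init__(self, name, attr_to_module):
        super().__init__(name)
        self._attr_to_module = attr_to_module  # attribute name -> module path
        self._loaded_from = []                 # records which modules were imported

    def __getattr__(self, attr):
        # Only called when normal attribute lookup fails.
        try:
            module_name = self._attr_to_module[attr]
        except KeyError:
            raise AttributeError(f"module {self.__name__!r} has no attribute {attr!r}") from None
        module = importlib.import_module(module_name)
        self._loaded_from.append(module_name)
        value = getattr(module, attr)
        setattr(self, attr, value)  # cache so later lookups skip __getattr__
        return value


lazy = LazyDemoModule("lazy_demo", {"sqrt": "math", "dumps": "json"})
root = lazy.sqrt(9.0)  # triggers the import of "math" only
```

The `TYPE_CHECKING` branch in the real file exists so that static type checkers still see the star imports, while at runtime only the lazy path runs.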