Add qwen3 tts by ShahVandit · Pull Request #44517 · huggingface/transformers

ShahVandit · 2026-03-07T18:48:04Z

What does this PR do?

Adds Qwen3-TTS, a series of text-to-speech models by the Qwen team (Alibaba Group), to Transformers.

Architecture:

Qwen3TTSForConditionalGeneration — text to multi-codebook speech codes (talker)
Qwen3TTSTokenizerV2Model (12Hz) and Qwen3TTSTokenizerV1Model (25Hz) — codes to audio waveform
Qwen3TTSProcessor — text preprocessing

Features: voice presets, voice design via natural language, batch inference, 10 languages

Paper: Qwen3-TTS Technical Report

Before submitting

Did you read the contributor guideline, Pull Request section?
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

Who can review?

@eustlb @ebezzam @vasqu

ebezzam

@ShahVandit thanks a bunch for your contribution!

It's great that you're using modular 👏 I've put a few comments to improve its usage because there are a lot of lines of code here which I think we'll be able to iteratively reduce.

A general principle when adding a new model to Transformers is that we want to be very careful when adding new modeling components. If a component exists elsewhere, we want to use modular to inherit from it. This will significantly reduce the number of lines in the modular file! Moreover, we don't want to keep unused code paths. I pointed out a few cases. When going through your modular file, ask yourself (a coding agent helps a lot here for navigating the quite large code base!) whether (1) this is being used in the final modeling code and (2) something similar exists in another model.

Moreover, one file that is missing is a script to convert the existing QwenTTS checkpoints to Transformers-compatible ones. Here are some recent examples of conversion scripts:

VibeVoice ASR: https://github.com/huggingface/transformers/blob/main/src/transformers/models/vibevoice_asr/convert_vibevoice_asr_to_hf.py
Qwen 3 ASR (ongoing): https://github.com/mbtariq82/transformers/blob/qwen3-asr/src/transformers/models/qwen3_asr/convert_qwen3_asr_to_hf.py

You'll see that when using components from other models (via modular), they may have a different state dict from the original checkpoint, and you will have to map the original state dict to the new one. For example, with such a mapping in the conversion script.
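
A minimal sketch of what such a state-dict mapping can look like, assuming regex-based key renaming. The patterns and key names below are hypothetical placeholders, not taken from the actual Qwen3-TTS checkpoint or from the linked conversion scripts:

```python
import re

# Hypothetical patterns: original checkpoint key -> key expected by the
# Transformers modeling code. Real patterns depend on the actual checkpoint.
ORIGINAL_TO_CONVERTED_KEY_MAPPING = {
    r"^talker\.model\.": "model.",
    r"\.self_attn\.qkv\.": ".self_attn.qkv_proj.",
}


def convert_key(key: str) -> str:
    """Apply every renaming pattern, in order, to a single state-dict key."""
    for pattern, replacement in ORIGINAL_TO_CONVERTED_KEY_MAPPING.items():
        key = re.sub(pattern, replacement, key)
    return key


def convert_state_dict(state_dict: dict) -> dict:
    """Return a new dict with renamed keys; values (tensors) are untouched."""
    converted = {}
    for old_key, value in state_dict.items():
        new_key = convert_key(old_key)
        if new_key in converted:
            # Two original keys collapsing onto one name would silently drop weights.
            raise ValueError(f"Two keys map to {new_key!r}")
        converted[new_key] = value
    return converted
```

Keys that match no pattern pass through unchanged, so only the components inherited from other models need an entry.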

If it helps, below is my typical workflow when it comes to model integration:

Write one or more integration tests. This will be our sanity check to make sure that the modeling code (generated from modular) doesn't deviate from the original. For example, VibeVoice ASR.
Write a reproducer script that generates expected outputs with the original checkpoint + code. For example: VibeVoice ASR, Qwen 3 ASR. We will add a link to this reproducer in the integration test like this, as we won't add this file to the repo. If you look at the VibeVoice and Qwen3ASR reproducers, they write the expected outputs directly in the repo as JSON files (rather than having to copy and paste EXPECTED_OUTPUT lists).
Get to a functional modular and conversion script. Note that the modular file can be used to generate the modeling AND configuration files, see how Qwen3ASR is using existing configs in the library here.
Iteratively prune, clean, and conform the modular file to Transformers conventions, while running the integration test to ensure that you aren't deviating from the original model's outputs.

RUN_SLOW=1 pytest tests/models/qwen3_tts/test_modeling_qwen3_tts.py::Qwen3TTSForConditionalGenerationIntegrationTest
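
The JSON-fixture idea from the workflow above can be sketched roughly as follows. The file name, the `codes` field, and the values are placeholders chosen for illustration; they are not the layout used by the VibeVoice or Qwen3ASR reproducers:

```python
import json
import tempfile
from pathlib import Path


def save_expected_outputs(path: Path, codes: list) -> None:
    """Reproducer side: dump expected outputs produced by the original model."""
    path.write_text(json.dumps({"codes": codes}, indent=2))


def load_expected_outputs(path: Path) -> list:
    """Test side: read the fixture instead of hard-coding EXPECTED_OUTPUT."""
    return json.loads(path.read_text())["codes"]


# Round trip with placeholder values (not real model outputs).
with tempfile.TemporaryDirectory() as tmp:
    fixture = Path(tmp) / "expected_results_single.json"
    save_expected_outputs(fixture, [[1, 2, 3], [4, 5, 6]])
    loaded = load_expected_outputs(fixture)
```

In a real test the fixture would live next to the test file, and the loaded values would be compared against the output of the converted checkpoint.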

Coding agents are very helpful for this process by giving targeted tasks and pointing to similar models/files in the repo.

I hope that helps! Let me know if you have any questions and thanks for your valuable contribution 🤗

ebezzam · 2026-03-20T18:00:00Z

+class Qwen3TTSIntegrationTest(unittest.TestCase):
+    """Integration tests for Qwen3-TTS (require real weights, run with --slow)."""
+
+    model_id = "Qwen/Qwen3-TTS-12Hz-0.6B-Base"


In the integration test, we should test your converted checkpoint rather than the original. See Qwen3ASR.

When the model is ready to merge, we may contact the original Qwen team to upload a Transformers-compatible version to their org. For example, with VibeVoice ASR:

Original checkpoint: https://huggingface.co/microsoft/VibeVoice-ASR

Transformers compatible one: https://huggingface.co/microsoft/VibeVoice-ASR-HF

ebezzam · 2026-03-20T18:02:04Z

+
+    @slow
+    @require_torch_accelerator
+    def test_small_model_integration_text_to_codes(self):


take a look at these examples for recent approaches in writing the integration test:

VibeVoice ASR (merged):

transformers/tests/models/vibevoice_asr/test_modeling_vibevoice_asr.py

Line 298 in d4f88c2

def test_single(self):

Qwen3 ASR (ongoing): https://github.com/mbtariq82/transformers/blob/7ed8e5425a72ecb9594eba8a1852bacd9102a891/tests/models/qwen3_asr/test_modeling_qwen3_asr.py#L127

we can limit to generate around 50-100 tokens

ebezzam · 2026-03-20T18:05:28Z

+class Qwen3TTSTokenizerV2LayerScale(MimiLayerScale):
+    pass


I don't see this module being used in the generated modeling? If so, we can remove.

ebezzam · 2026-03-20T18:06:24Z

+class Qwen3TTSTokenizerV1DiTCodecEmbedding(DiTCodecEmbedding):
+    pass
+
+
+class Qwen3TTSTokenizerV1DiTMLP(DiTMLP):
+    pass
+
+
+class Qwen3TTSTokenizerV1DiTAttention(DiTAttention):
+    pass


Similarly, I don't see these modules being used?

ebezzam · 2026-03-20T18:08:34Z

+class Qwen3TTSTokenizerV1DiTTimestepEmbedding(DiTTimestepEmbedding):
+    pass
+
+
+class Qwen3TTSTokenizerV1SinusoidsPositionEmbedding(SinusoidsPositionEmbedding):
+    pass
+
+
+class Qwen3TTSTokenizerV1AdaLayerNormZero_Final(Qwen2_5_OmniAdaLayerNormZero_Final):
+    pass


Similarly, are these being used?
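
One rough way to answer this question for a generated modeling file is to list classes whose names are never referenced anywhere else in the same file. This is only a sketch using the standard `ast` module; `find_unreferenced_classes` is a hypothetical helper, not a Transformers utility:

```python
import ast


def find_unreferenced_classes(source: str) -> list:
    """Return classes defined in `source` whose names never appear elsewhere in it."""
    tree = ast.parse(source)
    defined = [node.name for node in ast.walk(tree) if isinstance(node, ast.ClassDef)]
    referenced = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            # Covers base classes, constructor calls, and annotations.
            referenced.add(node.id)
        elif isinstance(node, ast.Attribute):
            referenced.add(node.attr)
    return sorted(name for name in defined if name not in referenced)


example = '''
class LayerScale:
    pass

class Block:
    pass

class Model:
    def __init__(self):
        self.block = Block()
'''
unused = find_unreferenced_classes(example)
```

Here `unused` contains `LayerScale` and `Model`. Top-level public classes such as `Model` are reported too because nothing inside the file refers to them, and names that only appear in strings (for example in `__all__`) are not counted, so the result is a list of candidates to review rather than a list of classes to delete.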

ebezzam · 2026-03-20T18:13:34Z

+    decoder_past_key_values: Cache | None = None
+
+
+class Qwen3TTSConv1dPaddingCache:


We'll want to update the modular in the relevant place to use this newer padding cache object:

transformers/src/transformers/models/voxtral_realtime/modeling_voxtral_realtime.py

Line 103 in d4f88c2

class VoxtralRealtimeConv1dPaddingCache:

ebezzam

Sorry some of my comments went into the modeling file when I was jumping in between the modeling and modular!

ebezzam · 2026-03-20T21:34:08Z

+        return quantized.transpose(1, 2)
+
+
+class Qwen3TTSTokenizerV2ResidualVectorQuantization(nn.Module):


Sorry meant to put the comment here! -> Can we use RVQ from DAC or Mimi?

Hey @ebezzam, I tried using MimiModel for the V2 encoder but hit a converter issue.

The modular converter renames all Mimi* references to Qwen3TTSTokenizerV2* based on prefix voting. So MimiEncoder inside MimiModel.__init__ becomes Qwen3TTSTokenizerV2Encoder, which is the same name as the class itself, causing infinite recursion.

I tried renaming the class to Qwen3TTSTokenizerV2AudioEncoder to avoid the collision. That fixed the recursion, but then MimiTransformerModel got renamed to Qwen3TTSTokenizerV2TransformerModel, which clashes with our Code2Wav decoder transformer that has the same name but expects a completely different config.

The RVQ classes (EuclideanCodebook, VectorQuantization, etc.) inherit from Mimi fine since there's no name collision there. The problem is specifically with MimiModel because it creates internal components whose names clash with classes we already define.

Is there a recommended way to handle this, or should we keep the V2 encoder standalone for now?
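
The collision described above can be illustrated with a deliberately simplified prefix rename. This is not the modular converter's actual logic, only a toy model of it; the helper names are hypothetical, and the class names are the ones mentioned in this comment:

```python
def rename_with_prefix(names: list, old_prefix: str, new_prefix: str) -> dict:
    """Map each inherited class name to its renamed form by swapping the prefix."""
    return {
        name: new_prefix + name[len(old_prefix):]
        for name in names
        if name.startswith(old_prefix)
    }


def find_collisions(renamed: dict, already_defined: set) -> dict:
    """Return the renamed entries that clash with names the modular file defines."""
    return {old: new for old, new in renamed.items() if new in already_defined}


inherited = ["MimiEncoder", "MimiTransformerModel", "MimiEuclideanCodebook"]
defined_in_modular = {
    "Qwen3TTSTokenizerV2Encoder",
    "Qwen3TTSTokenizerV2TransformerModel",
}
renamed = rename_with_prefix(inherited, "Mimi", "Qwen3TTSTokenizerV2")
collisions = find_collisions(renamed, defined_in_modular)
```

In this toy model `MimiEncoder` and `MimiTransformerModel` collide with existing classes, while `MimiEuclideanCodebook` does not, which matches the behaviour reported for the RVQ classes.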

Thanks @ShahVandit for the detailed explanation. A couple points:

To simplify / compartmentalize things, we can make the QwenTTS Tokenizer(s) their own model(s). Similar to how Mimi is its own model and is used as a subconfig/model for Kyutai's STT (see here and here). Here are other examples: (VibeVoice tokenizer, VibeVoice ASR) and (Higgs tokenizer, Higgs model). That may help with the clashing names in modular, and also make the modular more readable for each model!

Similarly, from what I understand in the paper, TokenizerV2 and TokenizerV1 are meant to be two types of tokenizers? A single-codebook one (Qwen3-TTS-Tokenizer-25Hz) and a multi-codebook one (Qwen3-TTS-Tokenizer-12Hz). So let's use more meaningful names for them, e.g. Qwen3TTSTokenizerSingleCodebook and Qwen3TTSTokenizerMultiCodebook, and make them two separate models. And from what I understand in the paper, Qwen3TTSTokenizerSingleCodebook will be able to inherit via modular from Qwen2Audio and Qwen3TTSTokenizerMultiCodebook from Mimi.

So there will be three models in total, each with its own model folder (with configuration, modular, etc.): Qwen3TTSTokenizerSingleCodebook, Qwen3TTSTokenizerMultiCodebook, Qwen3TTS. And the latter will be able to use the modeling from the tokenizers, via AutoModel (like this).

Hope that's clear and that it helps!

ebezzam · 2026-03-20T21:42:26Z

+    return torch.log(torch.clamp(x, min=clip_val) * C)
+
+
+def mel_spectrogram(


There are mel_spectrogram methods within audio_utils, although we may be able to use the feature extraction from Whisper, as is the case for Qwen ASR (but this should be double-checked).

Moreover, we typically bundle the feature extractor and the tokenizer within the processor (as can be seen in the Qwen ASR example).

If a new feature extractor is needed however, it should be in a separate feature_extraction_MODEL.py file. For example: https://github.com/huggingface/transformers/blob/main/src/transformers/models/whisper/feature_extraction_whisper.py

ebezzam · 2026-03-20T21:43:07Z

+        return self.supported_languages
+
+    @classmethod
+    def from_pretrained(


from_pretrained methods shouldn't have to be overridden

ebezzam · 2026-03-20T21:44:57Z

+                return text_embed + codec_embed, tts_pad_embed
+
+    @torch.no_grad()
+    def generate(


Generation will likely be in a separate generation.py file. See:

CSM (merged): https://github.com/huggingface/transformers/blob/main/src/transformers/models/csm/generation_csm.py

VibeVoice (ongoing): https://github.com/pengzhiliang/transformers/blob/main/src/transformers/models/vibevoice/generation_vibevoice.py

ebezzam · 2026-03-20T21:53:03Z

+    attributes = ["tokenizer"]
+    tokenizer_class = ("Qwen2Tokenizer", "Qwen2TokenizerFast")
+
+    def __init__(self, tokenizer=None, chat_template=None):


We'll want to move the feature extraction (what you were doing with computing mel spectrograms) to the processor.

Note that it may even be interesting to generate the processor from the modular file. (See Qwen3ASR)

ebezzam · 2026-03-20T21:55:29Z

+    def batch_decode(self, *args, **kwargs):
+        """
+        This method forwards all its arguments to the tokenizer's batch_decode method.
+        """
+        return self.tokenizer.batch_decode(*args, **kwargs)
+
+    def decode(self, *args, **kwargs):
+        """
+        This method forwards all its arguments to the tokenizer's decode method.
+        """
+        return self.tokenizer.decode(*args, **kwargs)


As we are passing directly to the tokenizer, we don't need to define these methods.

AlanPonnachan · 2026-05-19T15:42:02Z

@ShahVandit any update on this? Happy to take this up if you don’t have bandwidth.

ShahVandit · 2026-05-19T20:09:07Z

Hi @AlanPonnachan Thanks for asking! I'm almost done, just wrapping up the integration tests and a few cleanup items. Should have it ready soon.

ShahVandit · 2026-05-24T07:01:54Z

Hi @ebezzam, the single codebook is not complete and will be added in a follow-up; it also doesn't have an existing checkpoint yet. For the main model and multicodebook, reproducer scripts and integration tests are in place. One thing worth noting: the main model test only compares codes[0, :6] against the original, because beyond that the outputs diverge. Open to suggestions on how to handle this.
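
One possible way to make such a divergence easier to investigate is to report the exact position where the two code sequences first differ, rather than comparing a fixed-length prefix. The helper below is a hypothetical sketch with placeholder values, not part of the PR's tests:

```python
from typing import Optional


def first_divergence(reference: list, candidate: list) -> Optional[int]:
    """Index of the first differing element, or None if the sequences are identical."""
    for index, (ref, cand) in enumerate(zip(reference, candidate)):
        if ref != cand:
            return index
    if len(reference) != len(candidate):
        # One sequence is a strict prefix of the other.
        return min(len(reference), len(candidate))
    return None


# Placeholder codes, not real model outputs.
original_codes = [5, 1, 9, 9, 2, 7, 3]
converted_codes = [5, 1, 9, 9, 2, 7, 8]
divergence_index = first_divergence(original_codes, converted_codes)
```

Since generation is autoregressive, a single differing code changes the context for every later step, so the first mismatch is usually the place to compare logits between the original and converted models.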

github-actions · 2026-05-25T04:31:07Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, qwen3_tts, qwen3_tts_tokenizer_multi_codebook, qwen3_tts_tokenizer_single_codebook

github-actions · 2026-05-25T05:12:24Z

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=44517&sha=2041c2

ebezzam · 2026-06-03T15:12:38Z

hi @ShahVandit, just got back from some time off and I'll try to give you some feedback by next week! thanks for your contribution 🙏

ebezzam

@ShahVandit thanks for the changes! Here are a set of comments to iterate on.

General comments:

let's make best use of modular for the modeling code but also for the configuration files
I put more comments on the TTS model, but let's first focus on the audio tokenizers to get those right, because if model differences originate there, they will affect the downstream TTS model. Note that for the single codebook tokenizer, even though a checkpoint does not exist for it, you can extract the model weights from the larger TTS model (as in this VibeVoice example). So no need to address all the comments for the TTS model before asking for feedback!

in your docs, please put your checkpoint for the examples (with a TODO to change them later)
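
Extracting a tokenizer's weights from the larger TTS checkpoint, as suggested above, usually comes down to filtering the state dict by key prefix. A minimal sketch, assuming a hypothetical `speech_tokenizer.` prefix and placeholder keys (the real prefix depends on the checkpoint layout, and this is not the code from the VibeVoice example):

```python
def extract_submodule_state_dict(state_dict: dict, prefix: str) -> dict:
    """Keep only keys under `prefix` and strip it, so a standalone sub-model can load them."""
    return {
        key[len(prefix):]: value
        for key, value in state_dict.items()
        if key.startswith(prefix)
    }


# Placeholder keys and values; the real prefix depends on the TTS checkpoint layout.
full_state_dict = {
    "speech_tokenizer.encoder.conv.weight": "w0",
    "speech_tokenizer.decoder.conv.weight": "w1",
    "talker.embed_tokens.weight": "w2",
}
tokenizer_state_dict = extract_submodule_state_dict(full_state_dict, "speech_tokenizer.")
```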

ebezzam · 2026-06-11T15:09:58Z

+# Qwen3-TTS
+
+<div class="flex flex-wrap space-x-1">
+<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">


nit: we can remove as PyTorch is the only supported framework now

ebezzam · 2026-06-12T15:11:34Z

+        channels,
+        kernel_size=3,
+        dilation=(1, 3, 5),
+        causal_type="1",


as mentioned above, let's avoid magic numbers

ebezzam · 2026-06-12T15:11:57Z

+    _can_compile_fullgraph = False
+
+
+class Qwen3TTSTokenizerSingleCodebookDecoderDiTRotaryEmbedding(nn.Module):


can we use an existing rotary embedding module via modular?

ebezzam · 2026-06-12T15:13:18Z

+def _v1_get_mel_audio(audio, padding=False, audio_vq_ds_rate=1, n_mels=128):
+    audio_len = len(audio)
+    if padding:
+        reduction = 160 * 2 * audio_vq_ds_rate
+        audio_pad = math.ceil(audio_len / reduction) * reduction - audio_len
+        mel = _v1_log_mel_spectrogram(audio, n_mels=n_mels, padding=audio_pad)
+    else:
+        mel = _v1_log_mel_spectrogram(audio, n_mels=n_mels)
+    return mel


should be handled by the feature extractor
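
As a sketch of how the padding logic in the snippet above could look once it lives in a feature extractor, with the literals given names. The constant names are assumptions: the values `160` and `2` come from the diff, but what they represent (hop length, encoder stride) is a guess that should be checked against the original code:

```python
import math

HOP_LENGTH = 160     # the `160` from the snippet above; the name is an assumption
ENCODER_STRIDE = 2   # the `2` from the snippet above; its meaning is assumed


def padding_to_frame_multiple(num_samples: int, downsample_rate: int = 1) -> int:
    """Number of samples to append so the length is a multiple of the reduction factor."""
    reduction = HOP_LENGTH * ENCODER_STRIDE * downsample_rate
    return math.ceil(num_samples / reduction) * reduction - num_samples


# 1000 samples with a reduction of 320 are padded up to 1280 samples.
pad = padding_to_frame_multiple(1000)
```

In a real feature extractor these values would more likely be configurable attributes than module-level constants.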

ebezzam · 2026-06-12T15:14:12Z

+#  VQ core classes (inference-only port of core_vq.py)
+
+
+class Qwen3TTSTokenizerSingleCodebookEuclideanCodebook(nn.Module):


can we use the one from Higgs, Mimi, or Encodec?

ebezzam · 2026-06-12T15:22:24Z

can we use modular to generate this file? as we are using modules from other models

ShahVandit mentioned this pull request Mar 7, 2026

Proposal to add Qwen3-TTS support #43671

Open

2 tasks

ebezzam added New model Audio labels Mar 9, 2026

ShahVandit mentioned this pull request Mar 10, 2026

[Feature] Add FT support for the Qwen3-TTS model. unslothai/unsloth#3951

Open

ArthurZucker requested a review from eustlb March 12, 2026 10:07

ebezzam reviewed Mar 20, 2026

ebezzam self-assigned this Apr 13, 2026

ShahVandit force-pushed the add-qwen3-tts branch from 1033b43 to 56e02b2 Compare April 18, 2026 16:13

This was referenced Apr 29, 2026

Cumulative feature and defect updates from recent Transformers PRs evalstate/transformers#42

Open

Cumulative defect fixes from recent Transformers PRs evalstate/transformers#43

Open

ShahVandit added 15 commits May 23, 2026 22:52

Add Qwen3-TTS configuration files

138eec3

Add Qwen3-TTS modular model implementation with 4 main model classes and lazy loading

5dc2937

Added processor_qwen3_tts

6d47610

Added model loading tests

c1db519

Rectified issues

8f388c3

Added processor tests

69bcf6f

Fix Qwen3-TTS RoPE

b2a202e

Add Qwen3-TTS integration tests

d16bbb0

Fixed Qwen3TTS talker

e456d4b

Updated Integration tests

8748321

Updated Integration tests for all frames

4f2f339

Updated Integration tests for all frames

e5b8bca

Add comprehensive tests for Qwen3-TTS tokenizers

939ce4f

Updated comprehensive tests for Qwen3-TTS tokenizers

59f7e55

Long audio test for Qwen3-TTS tokenizer

6c4d392

ShahVandit added 11 commits May 23, 2026 22:53

Fixed formatting errors

ae06bff

Fix linting, docstrings, and post_init in qwen3_tts

53bdcfd

Fix check_repo, check_docstrings, and style for Qwen3TTS

564ed6b

Fix modeling structure post_init, save/load test, and style fixes for Qwen3TTS

647950a

Fix modeling structure, config docstring checkpoint, and save/load test for Qwen3TTS

33fcfd7

Add Qwen3TTS main model, MultiCodebook, SingleCodebook, generation and convert checkpoint

de7cc6e

Add qwen3_tts entries to auto_mappings.py after dynamic auto mapping refactor

c91e3eb

Fix style and update configs after make fix-repo

681e06e

Add @strict to configs and fix modeling structure violations

3f1b2a9

Fix import formatting

2ac7326

fix: update qwen3_tts tests

649b784

ShahVandit force-pushed the add-qwen3-tts branch from 334e19f to 649b784 Compare May 24, 2026 03:43

fix: update qwen3_tts and multicodebook debug statements

785a894

ShahVandit added 8 commits May 24, 2026 14:50

fix: update auto_mappings to match upstream and remove model_type from qwen3_tts sub-configs

532f28d

fix: remove model_type from qwen3_tts sub-configs

3b29507

fix: remove model_type from sub-configs, fix auto_mappings, fix unused variable

7517920

fix: add base_config_key to sub-configs and move Qwen3TTSConfig docstring into __init__

31f3f0a

fix: sync SC modeling comment with modular file and fix processor tests to not require local checkpoint

aa58809

fix: sort imports in processor test

54eca93

fix modeling

c4f0eeb

fix: add MC sub-models to PRIVATE_MODELS in check_repo.py

a328140

ShahVandit requested a review from ebezzam May 25, 2026 04:13

ShahVandit added 2 commits May 25, 2026 00:22

fix: add PRIVATE_MODELS/IGNORE_NON_TESTED entries and SC test file for repo check

97424f5

fix: rename MC test file and add SC test file for repo check

22e7b03

Update qwen3_tts docs: replace V1/V2 with MC/SC tokenizer class names, add missing autodoc entries

2041c2b

ebezzam reviewed Jun 12, 2026
