[Draft] Add Llasa TTS family of models by ebezzam · Pull Request #39760 · huggingface/transformers

ebezzam · 2025-07-29T14:42:04Z

What does this PR do?

This PR adds the Llasa TTS family of models:

Reproducers for integration tests: https://gist.github.com/ebezzam/1863ec8eb7ec4afff02c26bdcb7691f9

TODO

Batch integration tests
Tokenizer and processing tests like Dia?
Create public model cards (update text and add relevant tags and labels). Currently hosted under my account (1B, 3B, 8B).
Integrate with XCodec2 (Transformers version) once "Add xcodec2 model" (#37868) is merged

Example usage

Below is example usage with my Hub checkpoint (as opposed to that of the original authors)

"""
pip install torchao xcodec2==0.1.3
"""

import torch
from transformers import LlasaTokenizer, LlasaForCausalLM, LlasaProcessor
import soundfile as sf
from xcodec2.modeling_xcodec2 import XCodec2Model

model_repo = "bezzam/Llasa-1B"
# model_repo = "bezzam/Llasa-3B"
# model_repo = "bezzam/Llasa-8B"
torch_device = "cuda" if torch.cuda.is_available() else "cpu"

# load processor (tokenizer + audio codec)
processor = LlasaProcessor(
    LlasaTokenizer.from_pretrained(model_repo),
    XCodec2Model.from_pretrained("HKUSTAudio/xcodec2").eval().to(torch_device)
)
# # -- use below when `XCodec2Model` integrated into `transformers`
# processor = LlasaProcessor.from_pretrained(model_repo)

# load model
model = LlasaForCausalLM.from_pretrained(model_repo)
model.eval().to(torch_device)

# TTS; some text inputs don't work, which shows the limitations of this approach
input_text = "How much wood would a woodchuck chuck if a woodchuck could chuck speech tokens?"
with torch.no_grad():

    # Tokenize the text
    encoded_text = processor(input_text).to(torch_device)

    # Generate the speech autoregressively
    outputs = model.generate(
        encoded_text["input_ids"],
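        do_sample=False,   # greedy decoding; `top_p` and `temperature` below only take effect when do_sample=True
        max_length=600,    # generates up to ~10s. Max allowed length is 2048, as Llasa was trained with max length 2048
        top_p=1,           # Adjusts the diversity of generated content (sampling only)
        temperature=0.8,   # Controls randomness in output (sampling only)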
    )

# decode to audio
gen_wav = processor.decode(outputs, input_offset=encoded_text["input_offset"])
fn = f"gen_{model_repo.split('/')[-1]}.wav"
sf.write(fn, gen_wav.cpu().numpy(), model.config.sampling_rate)
print(f"Generated speech saved to {fn}")
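The `max_length=600` comment in the example ties the token budget to audio duration. A rough sketch of that arithmetic follows. It assumes the codec emits about 50 speech tokens per second (an assumption about XCodec2, not something stated in this PR) and that `max_length` counts prompt tokens plus generated tokens. The helper name `max_length_for_duration` is hypothetical and is not part of this PR.

```python
import math

TRAINED_MAX_LENGTH = 2048        # from the PR: Llasa was trained with max length 2048
ASSUMED_TOKENS_PER_SECOND = 50   # assumption: codec frame rate, not confirmed by this PR


def max_length_for_duration(seconds, prompt_tokens,
                            tokens_per_second=ASSUMED_TOKENS_PER_SECOND,
                            limit=TRAINED_MAX_LENGTH):
    """Estimate a `max_length` value for a target audio duration.

    The result counts prompt tokens plus speech tokens and is capped
    at the length the model was trained on.
    """
    requested = prompt_tokens + math.ceil(seconds * tokens_per_second)
    return min(requested, limit)


# A 100-token prompt plus 10 s of speech fits in a budget of 600 tokens.
print(max_length_for_duration(10, prompt_tokens=100))
# A one-minute request exceeds the trained length and is capped.
print(max_length_for_duration(60, prompt_tokens=100))
```

Under these assumptions, a longer prompt leaves fewer tokens for speech, so the "~10s" figure is an upper bound for a fixed `max_length`.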

ebezzam · 2025-07-29T14:45:34Z

+    model_config.max_length = config.original_model.model_max_length
+    model = LlasaForCausalLM(model_config)
+    if config.remote_repo.dtype == "bfloat16":
+        model.to(torch.bfloat16)


Is bf16 fine? Original models are trained in bf16 (config) and their Hub checkpoints are also in bf16 (e.g., 1B)

ebezzam · 2025-07-29T14:52:34Z

+    def from_pretrained_llm(cls, *args, **kwargs):
+        """
+        Load the tokenizer from a pre-trained LLM model, and add relevant speech and Llasa tokens.
+        """
+        tokenizer = super().from_pretrained(*args, **kwargs)
+        tokenizer.add_tokens(list(tokenizer.llasa_token.values()) + tokenizer.speech_tokens)
+        return tokenizer


Is something like this fine? (also for LlasaConfig)

The difference from the conventional from_pretrained is that this one increases the vocab size to account for the added speech and Llasa tokens. These methods are useful in the conversion script for copying the tokenizer and config from Llama (an LLM).

But when using Llasa, from_pretrained will be used as usual, loading from actual Llasa tokenizers and configs that don't need tokens to be added explicitly.
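To make the distinction concrete, here is a minimal, self-contained sketch of the pattern using toy classes. `ToyTokenizer` and `ToyLlasaTokenizer` are hypothetical stand-ins, not the `transformers` API, and the token strings are illustrative only. The point is that the `from_pretrained_llm` path grows the vocabulary while the plain `from_pretrained` path leaves it untouched.

```python
class ToyTokenizer:
    """Hypothetical stand-in for a tokenizer; not the transformers API."""

    def __init__(self, vocab):
        self.vocab = dict(vocab)

    def add_tokens(self, new_tokens):
        """Append unseen tokens to the vocab and return how many were added."""
        added = 0
        for token in new_tokens:
            if token not in self.vocab:
                self.vocab[token] = len(self.vocab)
                added += 1
        return added

    @classmethod
    def from_pretrained(cls, vocab):
        return cls(vocab)


class ToyLlasaTokenizer(ToyTokenizer):
    # Illustrative token strings; the real lists live in the PR's tokenizer.
    llasa_token = {
        "speech_start": "<|SPEECH_GENERATION_START|>",
        "speech_end": "<|SPEECH_GENERATION_END|>",
    }
    speech_tokens = [f"<|s_{i}|>" for i in range(4)]

    @classmethod
    def from_pretrained_llm(cls, vocab):
        """Load an LLM vocab, then extend it with Llasa and speech tokens."""
        tokenizer = super().from_pretrained(vocab)
        tokenizer.add_tokens(list(tokenizer.llasa_token.values()) + tokenizer.speech_tokens)
        return tokenizer


llm_vocab = {"<s>": 0, "hello": 1, "world": 2}
plain = ToyLlasaTokenizer.from_pretrained(llm_vocab)         # vocab unchanged
extended = ToyLlasaTokenizer.from_pretrained_llm(llm_vocab)  # vocab extended
print(len(plain.vocab), len(extended.vocab))
```

In the real conversion script the model's embedding matrix would presumably also need to match the enlarged vocabulary, which is why the same question is raised for LlasaConfig.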

HuggingFaceDocBuilderDev · 2025-07-29T14:55:39Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Rocketknight1 · 2025-07-30T12:03:25Z

cc @eustlb for TTS

ebezzam · 2025-08-15T09:17:54Z

+    # TODO: how to overwrite generate method?
+    # Not necessary but could be nice to check max_length <= 2048 (what model was trained on)
+    # I get the following error (I think because `generate` isn't method of LlamaForCausalLM but its parent):
+    # ```
+    # File "/home/eric_bezzam/transformers/utils/modular_model_converter.py", line 355, in replace_super_calls
+    # original_modeling_method_body = self.original_modeling_methods[func_name].body.body
+    # KeyError: 'generate'
+    # ```
+    # """
+    # @torch.no_grad()
+    # def generate(
+    #     inputs,
+    #     max_length=2048,
+    #     **kwargs,
+    # ):
+    #     """
+    #     Set specific parameters from Llasa processor output
+    #     """
+    #     if max_length > 2048:
+    #         raise ValueError("Max length should be less than or equal to 2048.")
+
+    #     # Call the parent class's generate method
+    #     return super().generate(
+    #         inputs,
+    #         max_length=inputs["max_length"],
+    #         **kwargs
+    #     )


I was trying to override the generate method to add a check that users don't request outputs longer than model.generation_config.max_length (=2048), which is the maximum length the models were trained on. But maybe there's another way to restrict the output length that users can request? Or maybe such a check isn't needed?
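One possible alternative, sketched here as an assumption rather than a proposal from the PR, is to validate the requested length in a small helper before calling `generate`, which avoids overriding the method at all. The name `validate_max_length` is hypothetical.

```python
def validate_max_length(requested, trained_limit=2048):
    """Return a safe `max_length`, raising if the request exceeds the trained limit.

    `trained_limit` defaults to 2048, the length quoted in this PR.
    """
    if requested is None:
        # No explicit request: fall back to the trained limit.
        return trained_limit
    if requested > trained_limit:
        raise ValueError(
            f"max_length={requested} exceeds the trained limit of {trained_limit}."
        )
    return requested


# Hypothetical call site (not executed here):
# outputs = model.generate(
#     encoded_text["input_ids"],
#     max_length=validate_max_length(600, model.generation_config.max_length),
# )
print(validate_max_length(600))
```

The trade-off is that the check only runs if callers opt in, whereas an overridden `generate` would enforce it for every call.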

I was getting the error (below) when running the modular script. Maybe the issue is that generate is not a method of LlamaForCausalLM but of its parent class GenerationMixin?

# command: python utils/modular_model_converter.py --files-to-parse src/transformers/models/llasa/modular_llasa.py
Traceback (most recent call last):
  File "/home/eric_bezzam/transformers/utils/modular_model_converter.py", line 1779, in <module>
    converted_files = convert_modular_file(file_name)
  File "/home/eric_bezzam/transformers/utils/modular_model_converter.py", line 1693, in convert_modular_file
    for file, module in create_modules(cst_transformers).items():
  File "/home/eric_bezzam/transformers/utils/modular_model_converter.py", line 1634, in create_modules
    nodes_to_add, file_type, new_imports = get_class_node_and_dependencies(modular_mapper, class_name, node, files)
  File "/home/eric_bezzam/transformers/utils/modular_model_converter.py", line 1577, in get_class_node_and_dependencies
    updated_node = replace_class_node(mapper, node, renamed_super_class, super_class)
  File "/home/eric_bezzam/transformers/utils/modular_model_converter.py", line 1064, in replace_class_node
    new_replacement_class = new_module.visit(
  File "/home/eric_bezzam/transformers/py310/lib/python3.10/site-packages/libcst/metadata/wrapper.py", line 204, in visit
    return self.module.visit(visitor)
  File "/home/eric_bezzam/transformers/py310/lib/python3.10/site-packages/libcst/_nodes/module.py", line 89, in visit
    result = super(Module, self).visit(visitor)
  File "/home/eric_bezzam/transformers/py310/lib/python3.10/site-packages/libcst/_nodes/base.py", line 228, in visit
    _CSTNodeSelfT, self._visit_and_replace_children(visitor)
  File "/home/eric_bezzam/transformers/py310/lib/python3.10/site-packages/libcst/_nodes/module.py", line 74, in _visit_and_replace_children
    body=visit_body_sequence(self, "body", self.body, visitor),
  File "/home/eric_bezzam/transformers/py310/lib/python3.10/site-packages/libcst/_nodes/internal.py", line 227, in visit_body_sequence
    return tuple(visit_body_iterable(parent, fieldname, children, visitor))
  File "/home/eric_bezzam/transformers/py310/lib/python3.10/site-packages/libcst/_nodes/internal.py", line 193, in visit_body_iterable
    new_child = child.visit(visitor)
  File "/home/eric_bezzam/transformers/py310/lib/python3.10/site-packages/libcst/_nodes/base.py", line 228, in visit
    _CSTNodeSelfT, self._visit_and_replace_children(visitor)
  File "/home/eric_bezzam/transformers/py310/lib/python3.10/site-packages/libcst/_nodes/statement.py", line 1989, in _visit_and_replace_children
    body=visit_required(self, "body", self.body, visitor),
  File "/home/eric_bezzam/transformers/py310/lib/python3.10/site-packages/libcst/_nodes/internal.py", line 81, in visit_required
    result = node.visit(visitor)
  File "/home/eric_bezzam/transformers/py310/lib/python3.10/site-packages/libcst/_nodes/base.py", line 228, in visit
    _CSTNodeSelfT, self._visit_and_replace_children(visitor)
  File "/home/eric_bezzam/transformers/py310/lib/python3.10/site-packages/libcst/_nodes/statement.py", line 704, in _visit_and_replace_children
    body=visit_body_sequence(self, "body", self.body, visitor),
  File "/home/eric_bezzam/transformers/py310/lib/python3.10/site-packages/libcst/_nodes/internal.py", line 227, in visit_body_sequence
    return tuple(visit_body_iterable(parent, fieldname, children, visitor))
  File "/home/eric_bezzam/transformers/py310/lib/python3.10/site-packages/libcst/_nodes/internal.py", line 193, in visit_body_iterable
    new_child = child.visit(visitor)
  File "/home/eric_bezzam/transformers/py310/lib/python3.10/site-packages/libcst/_nodes/base.py", line 237, in visit
    leave_result = visitor.on_leave(self, with_updated_children)
  File "/home/eric_bezzam/transformers/py310/lib/python3.10/site-packages/libcst/_visitors.py", line 71, in on_leave
    updated_node = leave_func(original_node, updated_node)
  File "/home/eric_bezzam/transformers/utils/modular_model_converter.py", line 369, in leave_FunctionDef
    new_body = self.replace_super_calls(updated_node.body, name)
  File "/home/eric_bezzam/transformers/utils/modular_model_converter.py", line 355, in replace_super_calls
    original_modeling_method_body = self.original_modeling_methods[func_name].body.body
KeyError: 'generate'
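The hypothesis that the failure comes from `generate` living on a parent class can be illustrated with plain Python. The sketch below is only an analogy: `GenerationMixinLike` and `LlamaLike` are toy classes, and the dictionary mimics a lookup table built from the methods written directly in one class body. It has not been checked against how `modular_model_converter.py` actually builds `original_modeling_methods`.

```python
class GenerationMixinLike:
    """Toy parent class that owns `generate`."""

    def generate(self):
        return "generated"


class LlamaLike(GenerationMixinLike):
    """Toy child class that only defines `forward` in its own body."""

    def forward(self):
        return "forward"


# Collect only the methods written in the class body itself,
# the way a source-level tool that parses one class definition would see them.
own_methods = {name: fn for name, fn in vars(LlamaLike).items() if callable(fn)}

print(hasattr(LlamaLike, "generate"))  # inherited attribute lookup succeeds
print("generate" in own_methods)       # but the method is not in the class body

try:
    own_methods["generate"]
except KeyError as err:
    print(f"KeyError: {err}")
```

If the converter works along these lines, a `super().generate(...)` call in the modular file would have no method body in the parent modeling class to inline, which would be consistent with the `KeyError: 'generate'` in the traceback.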

ebezzam · 2025-08-15T09:19:18Z

+# TODO use "audio_tokenizer_class" when merged https://github.com/huggingface/transformers/pull/37868
+# audio_tokenizer_class = "XCodec2Model"


There are several TODOs like this for switching to the Transformers XCodec2 model once #37868 is merged.

github-actions · 2025-08-16T09:07:01Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, csm, llasa

ebezzam added 2 commits July 28, 2025 11:50

FIrst version of llasa files.

d6ff664

Full working version.

80af272

ebezzam marked this pull request as draft July 29, 2025 14:42

ebezzam added the New model and Audio labels Jul 29, 2025

ebezzam commented Jul 29, 2025

Comment thread src/transformers/models/llasa/convert_llasa_to_hf.py Outdated

ebezzam commented Jul 29, 2025

Comment thread src/transformers/models/llasa/modular_llasa.py

Remove unused files.

74f7f5e

ebezzam requested a review from eustlb July 29, 2025 15:06

ebezzam added 8 commits July 31, 2025 13:52

Start docs.

7e8d743

Update docs and add configs for 3B and 8B.

3727ab9

Update example and format.

72ffddd

Make make fixup happy

aff9176

Update tests.

04f704e

Merge branch 'main' into add_llasa

16adc29

Add integration tests.

f77b530

Format.

c36ce9d

ebezzam commented Aug 15, 2025

ebezzam added 2 commits August 16, 2025 11:00

Better batch decoding.

217c298

Format.

9a7dde1

ebezzam self-assigned this Apr 13, 2026

evalstate mentioned this pull request Apr 29, 2026

Cumulative feature and defect updates from recent Transformers PRs evalstate/transformers#42

Open

[Draft] Add Llasa TTS family of models #39760
ebezzam wants to merge 13 commits into huggingface:main from ebezzam:add_llasa

5 participants