Add CSM model #36719

Merged
eustlb merged 163 commits into huggingface:main from eustlb:add-csm on May 7, 2025
@eustlb eustlb commented Mar 14, 2025

Collaborator

What does this PR do?

Adds CSM (audio)-text-to-speech model!
Original code
Hub model weights
Converted weights

@github-actions github-actions Bot marked this pull request as draft March 14, 2025 10:50
@github-actions

Contributor

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. When it is ready for review, please click the Ready for review button (at the bottom of the PR page).

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

SeungyounShin commented Apr 4, 2025

Do you mind if I contribute to CSMForConditionalGeneration? It looks like it's currently empty. Or are you working on it right now?
cc. @eustlb

eustlb commented Apr 4, 2025

Collaborator Author

Hey @SeungyounShin, currently, conditional generation is handled through the ForCausalLM class. For CSM it makes no difference whether the model generates from context (text + audio) or from text only. I decided to go with the ForCausalLM naming because of that; usually, the ForConditionalGeneration class is for encoder-decoder-like architectures. This is up for debate with the core maintainers as soon as the PR is ready for review 😊
Thanks a lot for offering your help 🤗
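
To illustrate the point above, here is a minimal toy sketch of why a single causal interface can cover both cases. It is not the CSM implementation: `build_prompt`, `generate`, `toy_next_token`, and the token ids are all hypothetical names and values invented for this illustration. The only idea it shows is that optional context (earlier text + audio turns) is flattened into the same prefix that a text-only prompt would form, so the generation loop itself does not change.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical values, chosen only for this illustration.
AUDIO_OFFSET = 1000  # pretend audio tokens live at ids >= 1000
EOS = -1             # pretend end-of-sequence id

# One earlier utterance: (text tokens, audio tokens).
Turn = Tuple[List[int], List[int]]


def build_prompt(text: List[int], context: Optional[Sequence[Turn]] = None) -> List[int]:
    """Flatten optional context turns plus the new text into one token prefix."""
    prompt: List[int] = []
    for ctx_text, ctx_audio in context or []:
        prompt.extend(ctx_text)
        prompt.extend(ctx_audio)
    prompt.extend(text)
    return prompt


def generate(
    prompt: List[int],
    next_token: Callable[[List[int]], int],
    max_new_tokens: int = 8,
) -> List[int]:
    """Plain autoregressive loop: it only ever sees a flat prefix."""
    sequence = list(prompt)
    new_tokens: List[int] = []
    for _ in range(max_new_tokens):
        token = next_token(sequence)
        if token == EOS:
            break
        sequence.append(token)
        new_tokens.append(token)
    return new_tokens


def toy_next_token(prefix: List[int]) -> int:
    """Stand-in for a model: derives an audio-range token from the prefix length."""
    return AUDIO_OFFSET + len(prefix) % 7


# Text only and text + audio context go through the exact same call.
text_only = generate(build_prompt([1, 2, 3]), toy_next_token, max_new_tokens=4)
with_context = generate(
    build_prompt([1, 2, 3], context=[([4, 5], [1001, 1002])]),
    toy_next_token,
    max_new_tokens=4,
)
```

In this sketch the only difference between the two calls is the length and content of the prefix, which is the property the naming discussion relies on.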

@ArthurZucker ArthurZucker left a comment

Collaborator

Let's go!

Comment thread src/transformers/models/mimi/modeling_mimi.py
Comment thread src/transformers/models/mimi/modeling_mimi.py Outdated
Comment thread src/transformers/models/mimi/modeling_mimi.py Outdated
@eustlb eustlb merged commit 798f948 into huggingface:main May 7, 2025
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request May 14, 2025
* draft structure

* depth decoder with forward pre hook

* full model forward draft

* draft update

* depth decoder update

* ConversationalSpeechModelForCausalLM updates

* add generate

* max length criteria small fix

* update

* updates

* generation update

* update in loss compute

* conversion script

* update for correct input embeddings

* handle interleaved rope

* update

* update

* update

* support compile

* update training

* add doc

* update doc

* correct inits

* ConversationalSpeechModel -> Csm

* conf update

* name update

* tests CsmForCausalLMTest

* convert use cached_file

* conf + modeling updates

* generate utils handle third dim shape

* integration test

* modeling + conf updates

* common test handle more than 2 dims

* add nested audio list utils

* processing handle nested audio list

* csm processing draft

* mimi util

* init updates

* modular update

* convert modular

* processing update

* csm tests update

* generate tests handle third dim

* generate utils handle third dim

* propagate _get_initial_cache_position update

* tied_weight_keys update + convert correctly

* fix inputs_embeds

* revert audio nested list

* batch inference update + return audio

* audio_utils update

* processor update

* some more integration tests

* remove old test

* processing output labels

* improve

* fix

* update rope values with equivalent ones

* conversion update

* update tests

* handle depth decoder generation config

* remove default eos_token_id

* make style

* revert modeling_mimi

* add default generation_config

* remove sdpa since handled by default

* make

* fix conflict

* fix conflicts

* correct naming

* correct imports

* make

* causal -> conditional naming

* causal -> conditional naming

* auto update

* make

* make

* add doc

* test update

* fix weight init

* audio tokens offsets as buffer

* 4d mask in conditional class

* make

* doc update

* fix causal mask

* fix causal mask

* doc update

* doc update

* add processor doc

* update doc

* fix 4d causal mask

* update make_list_of_audio

* do not default to mutable

* remove duplicates

* remove useless reset_parameters

* use GradientCheckpointingLayer

* use can_return_tuple

* formatting

* prepend placeholder in _sample

* torch compile fix

* some more fixes

* convert modular

* fix

* default max_length in convert

* handle depth decoder generation config correctly

* clearer formulation

* handle output_loading_info

* handle softmax warning

* add doc

* propagate _get_initial_cache_position changes

* generation in its own module

* add processor tests

* fix compile with cuda graphs

* fix compile with cuda graphs

* add csm.md

* include CSM loss

* doc nit

* doc nit

* doc nit

* Update docs/source/en/model_doc/csm.md

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* add save_audio to processor

* Update src/transformers/models/csm/modular_csm.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* doc update

* simplify audio_codes_mask computation

* doc update

* simplify loss computation

* fix static cache test

* fix

* remove comment

* simplify encoded length computation

* use hf-internal-testing

* doc update

* cast to float before numpy

* nit

* mem efficient codebook head

* nit

* cat input values with cutoffs

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
@johnwick123f

Not sure if I should ask here, but this implementation of CSM seems to produce considerably better quality than the official implementation. Any reason why?

@ArthurZucker

Collaborator

Not sure how we should answer 😅 happy that you like it 🤗

9 participants