Add chatterbox support #42413
…ansformers into add-s3gen-hifinet
|
This PR is very overwhelming, like a lot of code agent PRs. It's not clear to me how novel some of these architectures are, or whether we really need three whole new architectures in the codebase! 😅 Ideally we could cut down the PR size a lot by using |
|
@Rocketknight1 The T3 model is very similar to the Llama model, but because this is a whole text-to-speech pipeline we have several parts: a tokenizer (s3tokenizer, in a different PR, which compresses the audio to 25 tokens/sec; we need this for conditioning), an encoder (T3, a modified Llama model whose main changes are speech tokens and conditioning), a decoder (s3gen, a CFM-based model from CosyVoice2), and hifinet (also from CosyVoice2), which converts mel to wav. |
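To make the four stages and the 25 tokens/sec figure concrete, here is a minimal sketch. The stage order and the token rate come from the comment above; the function name, the rounding choice, and the input/output descriptions of each stage are assumptions made for illustration and do not reflect the PR's actual class names or APIs.

```python
import math

# Token rate quoted in the comment above (speech tokens per second of audio).
SPEECH_TOKEN_RATE_HZ = 25


def speech_token_count(duration_s: float, rate_hz: int = SPEECH_TOKEN_RATE_HZ) -> int:
    """Approximate number of speech tokens for `duration_s` seconds of audio.

    Hypothetical helper: rounding up is an assumption, the real tokenizer
    may handle the last partial frame differently.
    """
    if duration_s < 0:
        raise ValueError("duration_s must be non-negative")
    return math.ceil(duration_s * rate_hz)


# Stage order as described in the comment. The names are labels, not class
# names, and the data flow per stage is one reading of the comment.
PIPELINE_STAGES = [
    ("s3tokenizer", "reference audio -> speech tokens used for conditioning"),
    ("T3", "text tokens + conditioning -> speech tokens"),
    ("s3gen", "speech tokens -> mel spectrogram"),
    ("hifinet", "mel spectrogram -> waveform"),
]

if __name__ == "__main__":
    print(speech_token_count(10.0))  # 10 s of audio at 25 tokens/sec
    for name, role in PIPELINE_STAGES:
        print(f"{name}: {role}")
```

At this rate, ten seconds of reference audio corresponds to 250 speech tokens, which gives a rough sense of the sequence lengths the Llama-like T3 model has to handle.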
ebezzam
left a comment
@manmay-nakhashi I've done an initial review of some things that stuck out to me, and to help you familiarize yourself with the different Transformers conventions.
Most notably, there are several modules that already exist within the Transformers library from other models, and those should be used via modular to create your modeling files.
Moreover, here are some PRs of other TTS models that may also help you see how to prepare the various files:
The modeling tests should contain integration tests that compare the outputs of the Transformers version with those of the original model. For example:
- DAC:
- AudioFlamingo3 (more recent):
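The general shape of such an integration test can be sketched as follows. This is a stdlib-only illustration, not the Transformers test harness: `reference_forward` and `ported_forward` are hypothetical stand-ins for the original model and the ported one, and the tolerance is an assumed value. A real test would load the released checkpoint, run both implementations on a fixed input, and compare tensors (for instance with `torch.testing.assert_close`).

```python
import math
import unittest


def reference_forward(inputs):
    """Hypothetical stand-in for the original implementation."""
    return [math.tanh(v) for v in inputs]


def ported_forward(inputs):
    """Hypothetical stand-in for the port: same math, different code path."""
    return [(math.exp(2 * v) - 1) / (math.exp(2 * v) + 1) for v in inputs]


def max_abs_diff(actual, expected):
    """Largest element-wise absolute difference between two equal-length lists."""
    if len(actual) != len(expected):
        raise ValueError("outputs must have the same length")
    return max(abs(a - e) for a, e in zip(actual, expected))


class PortedModelIntegrationTest(unittest.TestCase):
    def test_outputs_match_reference(self):
        # Fixed input so the comparison is reproducible across runs.
        inputs = [-1.0, -0.25, 0.0, 0.5, 2.0]
        expected = reference_forward(inputs)
        actual = ported_forward(inputs)
        # Tolerance is an assumption; pick one that suits the model's dtype.
        self.assertLess(max_abs_diff(actual, expected), 1e-6)
```

In practice the expected values are usually computed once with the original code and stored in the test, so the test does not depend on the original repository at run time.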
|
[For maintainers] Suggested jobs to run (before merge) run-slow: auto, chatterbox, s3gen, s3tokenizer |