Add generate_streaming for Streaming Audio Generation #262
Conversation
```python
start_time = time.time()
...
if current_step_idx - last_yield_step >= chunk_size:
```
Due to the delay pattern, the first chunk is smaller than the other chunks.
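To illustrate the reviewer's point, here is a hypothetical sketch (the delay values and `chunk_size` are invented, not taken from the PR): with a per-codebook delay pattern, a frame is only complete once the most-delayed codebook has produced its token, so the first yield contains fewer complete frames than `chunk_size`.

```python
delays = [0, 1, 2, 3]     # assumed per-codebook delays (illustrative only)
chunk_size = 8            # yield every 8 decoder steps

def complete_frames(step: int) -> int:
    # A frame is only complete once the most-delayed codebook has
    # produced its token for that frame.
    return max(0, step - max(delays))

# First yield, at step == chunk_size, is short by max(delays) frames:
print(complete_frames(chunk_size))                                     # 5, not 8
# Every later yield still advances by a full chunk:
print(complete_frames(2 * chunk_size) - complete_frames(chunk_size))   # 8
```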
```python
# Pad each sequence in the batch into one tensor, then run the vocoder
# over all tokens generated so far.
for i in range(batch_size):
    generated_codes[i, : total_lens[i], :] = all_tokens[i]
lengths_Bx = torch.tensor(total_lens, device=self.device)
audio_chunks = self._generate_output(generated_codes, lengths_Bx)
```
This is inefficient. Only process the newly generated tokens.
I noticed some artifacts when passing only the new tokens to the vocoder.
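A minimal sketch of the compromise the author describes: decode the full sequence so the vocoder keeps its context, but yield only the unplayed tail. The names `vocoder` and `emitted` are hypothetical, not from the PR's code:

```python
import torch

def next_audio_chunk(vocoder, all_tokens: torch.Tensor, emitted: int):
    # Decode everything generated so far, so the vocoder sees full context
    # and chunk boundaries do not introduce artifacts.
    audio = vocoder(all_tokens)      # assumed to return a 1-D waveform tensor
    new = audio[emitted:]            # only the samples not yet yielded
    return new, audio.numel()        # chunk to play, updated emitted count
```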
Okay, can you fix the delay pattern problem?
Thanks a ton!

Thanks, when will this feature be released?

When can we expect this feature?
This PR introduces the `generate_streaming` function, which enables streaming audio generation. As soon as audio tokens are generated by the model, the vocoder is run on the entire sequence (using the full sentence as context for best quality), and only the newly generated audio chunk is yielded. The implementation closely mirrors the existing `generate` function, but is kept as a separate function (without extracting shared logic) to make the review process easier.

Hopefully this adds support for #11, #93, #237, makes the model usable in conversational use cases (#181), and accelerates TTFB (#153).
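For illustration, here is a hypothetical way a caller might consume the streamed chunks. The exact signature of `generate_streaming` (argument names, chunk units), the `model` object, the sample rate, and the playback library are assumptions, not taken from this PR:

```python
import sounddevice as sd  # assumed playback library, not part of the PR

# `model`, `chunk_size`, and the 44.1 kHz sample rate are assumptions here.
for chunk in model.generate_streaming("Hello world.", chunk_size=20):
    sd.play(chunk, samplerate=44100)  # play each chunk as soon as it arrives
    sd.wait()                         # block until this chunk finishes
```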