Commit 66d550b

init project_page
1 parent 1dd3199 commit 66d550b

16 files changed, +923 -81 lines changed

README.md

Lines changed: 1 addition & 81 deletions
@@ -1,81 +1 @@
-# VibeVoice: A Frontier Open-Source Text-to-Speech Model
-
-<p align="center">
-<a href="https://microsoft.github.io/VibeVoice">
-<img src="https://img.shields.io/badge/🌐_Project_Page-4285F4?style=for-the-badge&logo=google-chrome&logoColor=white" alt="Project Page">
-</a>
-<a href="https://huggingface.co/collections/microsoft/vibevoice-68a2ef24a875c44be47b034f">
-<img src="https://img.shields.io/badge/🤗_Hugging_Face-FFD21E?style=for-the-badge&logo=huggingface&logoColor=black" alt="Hugging Face">
-</a>
-<a href="https://aka.ms/VibeVoiceDemo">
-<img src="https://img.shields.io/badge/🎵_Demo-FF6B6B?style=for-the-badge&logo=gradio&logoColor=white" alt="Demo">
-</a>
-</p>
-
-
-VibeVoice is a novel framework designed for generating **expressive**, **long-form**, **multi-speaker** conversational audio, such as podcasts, from text. It addresses significant challenges in traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking.
-
-A core innovation of VibeVoice is its use of continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate of 7.5 Hz. These tokenizers efficiently preserve audio fidelity while significantly boosting computational efficiency for processing long sequences. VibeVoice employs a [next-token diffusion](https://arxiv.org/abs/2412.08635) framework, leveraging a Large Language Model (LLM) to understand textual context and dialogue flow, and a diffusion head to generate high-fidelity acoustic details.
-
-The model can synthesize speech up to **90 minutes** long with up to **4 distinct speakers**, surpassing the typical 1-2 speaker limits of many prior models.
-
-Try it out via [Demo](https://aka.ms/VibeVoiceDemo).
-
-## Models
-| Model | Context Length | Generation Length | Weight |
-|-------|----------------|----------|----------|
-| VibeVoice-0.5B-Streaming | - | - | On the way |
-| VibeVoice-1.5B | 64K | ~90 min | [HF link](https://huggingface.co/microsoft/VibeVoice-1.5B) |
-| VibeVoice-7B| 32K | ~45 min | On the way |
-
-## Installation
-We recommend to use NVIDIA Deep Learning Container to manage the CUDA environment.
-
-1. Launch docker
-```bash
-# NVIDIA PyTorch Container 24.07 / 24.10 / 24.12 verified.
-# Later versions are also compatible.
-sudo docker run --privileged --net=host --ipc=host --ulimit memlock=-1:-1 --ulimit stack=-1:-1 --gpus all --rm -it nvcr.io/nvidia/pytorch:24.07-py3
-
-## If flash attention is not included in your docker environment, you need to install it manually
-## Refer to https://github.com/Dao-AILab/flash-attention for installation instructions
-# pip install flash-attn --no-build-isolation
-```
-
-2. Install from github
-```bash
-git clone https://github.com/microsoft/VibeVoice.git
-cd VibeVoice/
-
-pip install -e .
-```
-
-## Usages
-
-### Usage 1: Launch Gradio demo
-```bash
-apt update && apt install ffmpeg -y # for demo
-python demo/gradio_demo.py --model_path microsoft/VibeVoice-1.5B --share
-```
-
-### Usage 2: Inference from files directly
-```bash
-# We provide some LLM generated example scripts under demo/text_examples/ for demo
-# 1 speaker
-python demo/inference_from_file.py --model_path microsoft/VibeVoice-1.5B --txt_path demo/text_examples/1p_abs.txt --speaker_names Alice
-
-# or more speakers
-python demo/inference_from_file.py --model_path microsoft/VibeVoice-1.5B --txt_path demo/text_examples/2p_zh.txt --speaker_names Alice Yunfan
-```
-
-## Risks and limitations
-
-Potential for Deepfakes and Disinformation: High-quality synthetic speech can be misused to create convincing fake audio content for impersonation, fraud, or spreading disinformation. Users must ensure transcripts are reliable, check content accuracy, and avoid using generated content in misleading ways. Users are expected to use the generated content and to deploy the models in a lawful manner, in full compliance with all applicable laws and regulations in the relevant jurisdictions. It is best practice to disclose the use of AI when sharing AI-generated content.
-
-English and Chinese only: Transcripts in language other than English or Chinese may result in unexpected audio outputs.
-
-Non-Speech Audio: The model focuses solely on speech synthesis and does not handle background noise, music, or other sound effects.
-
-Overlapping Speech: The current model does not explicitly model or generate overlapping speech segments in conversations.
-
-We do not recommend using VibeVoice in commercial or real-world applications without further testing and development. This model is intended for research and development purposes only. Please use responsibly.
+# VibeVoice Demo Page
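
Reviewer note: the removed README pairs a 7.5 Hz tokenizer frame rate with the context and generation lengths in its Models table. The sketch below checks that arithmetic. It assumes that "64K" and "32K" mean 64 × 1024 and 32 × 1024 positions and that each speech frame occupies one position; the README states neither, and text tokens would use part of the context as well. The function name `speech_frames` is illustrative and not part of the VibeVoice code.

```python
FRAME_RATE_HZ = 7.5  # tokenizer frame rate quoted in the removed README


def speech_frames(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of speech frames needed for `minutes` of audio."""
    return int(minutes * 60 * frame_rate_hz)


# (model, context length in K, generation length in minutes) from the Models table.
MODELS = [("VibeVoice-1.5B", 64, 90), ("VibeVoice-7B", 32, 45)]

for name, context_k, minutes in MODELS:
    frames = speech_frames(minutes)
    context = context_k * 1024  # assumption: K means 1024 positions
    print(f"{name}: {frames} frames of {context} positions ({frames / context:.0%})")
```

Under these assumptions both rows use the same share of their context window, which suggests the stated generation lengths scale with context length.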

assets/MOS-preference.png

65.6 KB

assets/VibeVoice.jpg

334 KB

assets/audio/1p_CH2EN.mp3

458 KB
Binary file not shown.

assets/audio/1p_EN2CH.mp3

508 KB
Binary file not shown.

assets/audio/2p_argument.mp3

268 KB
Binary file not shown.

assets/audio/2p_goat.mp3

699 KB
Binary file not shown.

assets/audio/2p_see_u_again.mp3

1.17 MB
Binary file not shown.

assets/audio/3p_gpt5.mp3

2.82 MB
Binary file not shown.
Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
+[
+  {
+    "start": 0.0,
+    "speaker": "Speaker 1",
+    "text": "Hello everyone, and welcome to the VibeVoice podcast channel. I'm your host, Linda, and today I want to share some very interesting and authentic Chinese expressions with you."
+  },
+  {
+    "start": 9.85,
+    "speaker": "Speaker 1",
+    "text": "In Chinese, when you want to say something is super easy, just a simple task, you can use the phrase \"小菜一碟\". It literally means \"a small dish of food\", but it means \"a piece of cake\". For example, if you want to say, \"Adding and subtracting three-digit numbers is a piece of cake for me\", you can say."
+  },
+  {
+    "start": 28.86,
+    "speaker": "Speaker 1",
+    "text": "三位数的加减法对我来说小菜一碟."
+  },
+  {
+    "start": 33.90,
+    "speaker": "Speaker 1",
+    "text": "The next phrase we’re going to learn is “你开玩笑吧”. It's a very common way to express disbelief, like \"Are you kidding me?\" or \"You must be joking\". For instance, when you hear an unbelievable piece of news such as your friend brought a T-shirt using 5000 dollars, you can say,"
+  },
+  {
+    "start": 54.87,
+    "speaker": "Speaker 1",
+    "text": "你开玩笑吧, 你花五千块钱买了一件衣服."
+  },
+  {
+    "start": 60.37,
+    "speaker": "Speaker 1",
+    "text": "Next, let's learn a phrase for when you suddenly understand something, like a \"lightbulb moment\". In Chinese, you can say \"恍然大悟\". It means you suddenly \"see the light\". For example, when you finally grasp a difficult math concept that has confused you for days, you can say."
+  },
+  {
+    "start": 78.68,
+    "speaker": "Speaker 1",
+    "text": "我困惑这个公式好几天了, 但现在我恍然大悟, 终于明白了."
+  },
+  {
+    "start": 86.00,
+    "speaker": "Speaker 1",
+    "text": "For our last one, when you want to say something is super easy, you can use a very vivid phrase: \"闭着眼睛都能做\". It literally means \"can do it with one's eyes closed\". For example, if you want to say, \"He can use this software with his eyes closed\", you can say."
+  },
+  {
+    "start": 105.20,
+    "speaker": "Speaker 1",
+    "text": "这个软件他闭着眼都能用."
+  },
+  {
+    "start": 108.35,
+    "speaker": "Speaker 1",
+    "text": "Well, that's all the time we have for today. Thank you for listening. Please subscribe to VibeVoice, where we share all the interesting things in this world with you."
+  }
+]
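
Reviewer note: the transcript added above is a JSON array of segments, each with a `start` time in seconds, a `speaker` label, and the spoken `text`. The sketch below shows one way a demo page might consume this format, using the last three segments copied from the diff. It assumes a segment runs until the next segment's `start`, which the file does not state; the helper `with_durations` is hypothetical, and the file's path is not shown on this page.

```python
import json

# Last three segments of the added transcript, copied from the diff.
# A raw string keeps the JSON escapes (\") intact for json.loads.
TRANSCRIPT_JSON = r"""
[
  {"start": 86.00, "speaker": "Speaker 1", "text": "For our last one, when you want to say something is super easy, you can use a very vivid phrase: \"闭着眼睛都能做\". It literally means \"can do it with one's eyes closed\". For example, if you want to say, \"He can use this software with his eyes closed\", you can say."},
  {"start": 105.20, "speaker": "Speaker 1", "text": "这个软件他闭着眼都能用."},
  {"start": 108.35, "speaker": "Speaker 1", "text": "Well, that's all the time we have for today. Thank you for listening. Please subscribe to VibeVoice, where we share all the interesting things in this world with you."}
]
"""


def with_durations(segments):
    """Attach a `duration` to each segment.

    Assumption: a segment lasts until the next one starts. The final
    segment has no successor, so its duration is None.
    """
    result = []
    for current, following in zip(segments, segments[1:] + [None]):
        duration = None
        if following is not None:
            duration = round(following["start"] - current["start"], 2)
        result.append({**current, "duration": duration})
    return result


segments = with_durations(json.loads(TRANSCRIPT_JSON))
for seg in segments:
    print(f'{seg["start"]:>7.2f}s  {seg["speaker"]}  duration={seg["duration"]}')
```

Because the format stores only start times, a player that highlights the active segment needs the audio's total length to bound the last entry.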