diff --git a/README.md b/README.md
Lines changed: 1 addition & 81 deletions
diff --git a/assets/MOS-preference.png b/assets/MOS-preference.png
65.6 KB
diff --git a/assets/VibeVoice.jpg b/assets/VibeVoice.jpg
334 KB
diff --git a/assets/audio/1p_CH2EN.mp3 b/assets/audio/1p_CH2EN.mp3
458 KB
diff --git a/assets/audio/1p_EN2CH.mp3 b/assets/audio/1p_EN2CH.mp3
508 KB
diff --git a/assets/audio/2p_argument.mp3 b/assets/audio/2p_argument.mp3
268 KB
diff --git a/assets/audio/2p_goat.mp3 b/assets/audio/2p_goat.mp3
699 KB
diff --git a/assets/audio/2p_see_u_again.mp3 b/assets/audio/2p_see_u_again.mp3
1.17 MB
diff --git a/assets/audio/3p_gpt5.mp3 b/assets/audio/3p_gpt5.mp3
2.82 MB
diff --git a/assets/text/1p_CH2EN_gt_timestamp.json b/assets/text/1p_CH2EN_gt_timestamp.json
Lines changed: 52 additions & 0 deletions
@@ -1,81 +1 @@
-# VibeVoice: A Frontier Open-Source Text-to-Speech Model
-
-<p align="center">
-  <a href="https://microsoft.github.io/VibeVoice">
-    <img src="https://img.shields.io/badge/🌐_Project_Page-4285F4?style=for-the-badge&logo=google-chrome&logoColor=white" alt="Project Page">
-  </a>
-  <a href="https://huggingface.co/collections/microsoft/vibevoice-68a2ef24a875c44be47b034f">
-    <img src="https://img.shields.io/badge/🤗_Hugging_Face-FFD21E?style=for-the-badge&logo=huggingface&logoColor=black" alt="Hugging Face">
-  </a>
-  <a href="https://aka.ms/VibeVoiceDemo">
-    <img src="https://img.shields.io/badge/🎵_Demo-FF6B6B?style=for-the-badge&logo=gradio&logoColor=white" alt="Demo">
-  </a>
-</p>
-
-
-VibeVoice is a novel framework designed for generating **expressive**, **long-form**, **multi-speaker** conversational audio, such as podcasts, from text. It addresses significant challenges in traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking.
-
-A core innovation of VibeVoice is its use of continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate of 7.5 Hz. These tokenizers efficiently preserve audio fidelity while significantly boosting computational efficiency for processing long sequences. VibeVoice employs a [next-token diffusion](https://arxiv.org/abs/2412.08635) framework, leveraging a Large Language Model (LLM) to understand textual context and dialogue flow, and a diffusion head to generate high-fidelity acoustic details.
-
-The model can synthesize speech up to **90 minutes** long with up to **4 distinct speakers**, surpassing the typical 1-2 speaker limits of many prior models. 
-
-Try it out via [Demo](https://aka.ms/VibeVoiceDemo).
-
-## Models
-| Model | Context Length | Generation Length |  Weight |
-|-------|----------------|----------|----------|
-| VibeVoice-0.5B-Streaming | - | - | On the way |
-| VibeVoice-1.5B | 64K | ~90 min | [HF link](https://huggingface.co/microsoft/VibeVoice-1.5B) |
-| VibeVoice-7B| 32K | ~45 min | On the way |
-
-## Installation
-We recommend using the NVIDIA Deep Learning Container to manage the CUDA environment.
-
-1. Launch docker
-```bash
-# NVIDIA PyTorch Container 24.07 / 24.10 / 24.12 verified. 
-# Later versions are also compatible.
-sudo docker run --privileged --net=host --ipc=host --ulimit memlock=-1:-1 --ulimit stack=-1:-1 --gpus all --rm -it  nvcr.io/nvidia/pytorch:24.07-py3
-
-## If flash attention is not included in your docker environment, you need to install it manually
-## Refer to https://github.com/Dao-AILab/flash-attention for installation instructions
-# pip install flash-attn --no-build-isolation
-```
-
-2. Install from github
-```bash
-git clone https://github.com/microsoft/VibeVoice.git
-cd VibeVoice/
-
-pip install -e .
-```
-
-## Usages
-
-### Usage 1: Launch Gradio demo
-```bash
-apt update && apt install ffmpeg -y # for demo
-python demo/gradio_demo.py --model_path microsoft/VibeVoice-1.5B --share
-```
-
-### Usage 2: Inference from files directly
-```bash
-# We provide some LLM-generated example scripts under demo/text_examples/ for the demo
-# 1 speaker
-python demo/inference_from_file.py --model_path microsoft/VibeVoice-1.5B --txt_path demo/text_examples/1p_abs.txt --speaker_names Alice
-
-# or more speakers
-python demo/inference_from_file.py --model_path microsoft/VibeVoice-1.5B --txt_path demo/text_examples/2p_zh.txt --speaker_names Alice Yunfan
-```
-
-## Risks and limitations
-
-Potential for Deepfakes and Disinformation: High-quality synthetic speech can be misused to create convincing fake audio content for impersonation, fraud, or spreading disinformation. Users must ensure transcripts are reliable, check content accuracy, and avoid using generated content in misleading ways. Users are expected to use the generated content and to deploy the models in a lawful manner, in full compliance with all applicable laws and regulations in the relevant jurisdictions. It is best practice to disclose the use of AI when sharing AI-generated content.
-
-English and Chinese only: Transcripts in languages other than English or Chinese may result in unexpected audio output.
-
-Non-Speech Audio: The model focuses solely on speech synthesis and does not handle background noise, music, or other sound effects.
-
-Overlapping Speech: The current model does not explicitly model or generate overlapping speech segments in conversations.
-
-We do not recommend using VibeVoice in commercial or real-world applications without further testing and development. This model is intended for research and development purposes only. Please use responsibly.
+# VibeVoice Demo Page
@@ -0,0 +1,52 @@
+[
+    {
+        "start": 0.0,
+        "speaker": "Speaker 1",
+        "text": "Hello everyone, and welcome to the VibeVoice podcast channel. I'm your host, Linda, and today I want to share some very interesting and authentic Chinese expressions with you."
+    },
+    {
+        "start": 9.85,
+        "speaker": "Speaker 1",
+        "text": "In Chinese, when you want to say something is super easy, just a simple task, you can use the phrase \"小菜一碟\". It literally means \"a small dish of food\", but it means \"a piece of cake\". For example, if you want to say, \"Adding and subtracting three-digit numbers is a piece of cake for me\", you can say."
+    },
+    {
+        "start": 28.86,
+        "speaker": "Speaker 1",
+        "text": "三位数的加减法对我来说小菜一碟."
+    },
+    {
+        "start": 33.90,
+        "speaker": "Speaker 1",
+        "text": "The next phrase we’re going to learn is “你开玩笑吧”. It's a very common way to express disbelief, like \"Are you kidding me?\" or \"You must be joking\". For instance, when you hear an unbelievable piece of news such as your friend brought a T-shirt using 5000 dollars, you can say,"
+    },
+    {
+        "start": 54.87,
+        "speaker": "Speaker 1",
+        "text": "你开玩笑吧, 你花五千块钱买了一件衣服."
+    },
+    {
+        "start": 60.37,
+        "speaker": "Speaker 1",
+        "text": "Next, let's learn a phrase for when you suddenly understand something, like a \"lightbulb moment\". In Chinese, you can say \"恍然大悟\". It means you suddenly \"see the light\". For example, when you finally grasp a difficult math concept that has confused you for days, you can say."
+    },
+    {
+        "start": 78.68,
+        "speaker": "Speaker 1",
+        "text": "我困惑这个公式好几天了, 但现在我恍然大悟, 终于明白了."
+    },
+    {
+        "start": 86.00,
+        "speaker": "Speaker 1",
+        "text": "For our last one, when you want to say something is super easy, you can use a very vivid phrase: \"闭着眼睛都能做\". It literally means \"can do it with one's eyes closed\". For example, if you want to say, \"He can use this software with his eyes closed\", you can say."
+    },
+    {
+        "start": 105.20,
+        "speaker": "Speaker 1",
+        "text": "这个软件他闭着眼都能用."
+    },
+    {
+        "start": 108.35,
+        "speaker": "Speaker 1",
+        "text": "Well, that's all the time we have for today. Thank you for listening. Please subscribe to VibeVoice, where we share all the interesting things in this world with you."
+    }
+]
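
The new `1p_CH2EN_gt_timestamp.json` file is a JSON array of segments, each with a `start` time, a `speaker` label, and the `text` spoken. The file stores no end times, so a consumer such as a demo page that highlights the current line would presumably have to derive them. The following is a minimal sketch of one way to do that, not code from this repository. It assumes `start` is in seconds and that a segment ends where the next one begins. The function name `segment_spans` and the `total_duration` parameter are hypothetical.

```python
import json

def segment_spans(segments, total_duration=None):
    """Attach an end time to each segment.

    Assumption: a segment ends where the next one starts. The last segment
    ends at total_duration, which stays None when the audio length is unknown.
    """
    ordered = sorted(segments, key=lambda seg: seg["start"])
    spans = []
    for i, seg in enumerate(ordered):
        if i + 1 < len(ordered):
            end = ordered[i + 1]["start"]
        else:
            end = total_duration
        spans.append({
            "speaker": seg["speaker"],
            "start": seg["start"],
            "end": end,
            "text": seg["text"],
        })
    return spans

# Excerpt shaped like the first three entries of the file above (text shortened).
sample = json.loads("""
[
    {"start": 0.0, "speaker": "Speaker 1", "text": "Hello everyone, and welcome to the VibeVoice podcast channel."},
    {"start": 9.85, "speaker": "Speaker 1", "text": "In Chinese, when you want to say something is super easy..."},
    {"start": 28.86, "speaker": "Speaker 1", "text": "三位数的加减法对我来说小菜一碟."}
]
""")

spans = segment_spans(sample)
for span in spans:
    print(span["start"], span["end"], span["speaker"])
```

Passing the audio length as `total_duration` closes the final span. Without it, the last segment is left open-ended and not guessed.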