Official PyTorch implementation for the following paper:
V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow
Jeongsoo Choi*, Ji-Hoon Kim*, Jinyu Li, Joon Son Chung, Shujie Liu
ICASSP 2025
[Paper] [Project]
For inference, download the checkpoints and place them in the `checkpoints` directory.
| Name | Train Dataset | Model |
|---|---|---|
| v2sflow_encoder.pt | LRS3 | download |
| v2sflow_decoder.pt | LRS3 | download |
| hifigan_vocoder.pt | LRS3 | download |
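Once downloaded, a checkpoint can be sanity-checked by loading it on CPU. This is a minimal sketch that assumes the files are standard `torch.save` artifacts; the actual key layout inside each checkpoint may differ:

```python
import torch

# Assumption: checkpoints are plain torch.save files; the key layout is a guess.
ckpt = torch.load("checkpoints/v2sflow_decoder.pt", map_location="cpu")
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt.keys())[:5])  # peek at the top-level keys
```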
We provide audio samples generated by several methods for test videos from LRS2 and LRS3.
| Train Dataset | Test Dataset | IntelligibleL2S | DiffV2S | V2SFlow-A | V2SFlow-V |
|---|---|---|---|---|---|
| LRS3 | LRS3 | download | download | download | download |
| LRS3 | LRS2 | download | download | download | download |
| LRS2 | LRS2 | download | download | - | - |
```bash
conda create -y -n v2sflow python=3.10 && conda activate v2sflow
git clone https://github.com/kaistmm/V2SFlow.git && cd V2SFlow
pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
cd third_party/fairseq/data && python setup.py build_ext --inplace && cd -
```
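After installation, a quick sanity check confirms the pinned builds are in place and that the cu121 wheels can see your GPU:

```python
import torch
import torchaudio

print(torch.__version__)          # expect 2.2.2+cu121
print(torchaudio.__version__)     # expect 2.2.2
print(torch.cuda.is_available())  # True if the cu121 wheels match your driver
```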
- 25 Hz continuous features from AV-HuBERT Large
  - reference: Auto-AVSR for pre-processing videos, AV-HuBERT for extracting features
- 50 Hz discrete units from mHuBERT Base (layer 11, km 1000)
  - reference: textless_s2st_real_data
- 12.5 Hz discrete units from a YAAPT-based F0 VQ-VAE
  - reference: DDDM-VC, speech-resynthesis
- Global continuous speaker embedding
  - reference: Real-Time-Voice-Cloning (see the sketch after this list)
- 100 Hz continuous mel-spectrogram: `filter_length: 640`, `hop_length: 160`, `win_length: 640`, `n_mel_channels: 80`, `sampling_rate: 16000`, `mel_fmin: 0.0`, `mel_fmax: 8000.0`
  - reference: TacotronSTFT (see the sketch after this list)
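The mel-spectrogram parameters above map onto torchaudio as follows. This is a hedged sketch, not the repository's TacotronSTFT module: `norm="slaney"` / `mel_scale="slaney"` approximate the librosa-style filterbank that Tacotron STFTs typically use, the clamped log is the usual Tacotron dynamic-range compression, and `sample.wav` is a hypothetical 16 kHz input:

```python
import torch
import torchaudio

# Mel-spectrogram settings listed above; an approximation of TacotronSTFT.
mel_fn = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=640,        # filter_length
    win_length=640,
    hop_length=160,   # 16000 / 160 = 100 Hz frame rate
    f_min=0.0,
    f_max=8000.0,
    n_mels=80,
    power=1.0,        # magnitude spectrogram, as in Tacotron
    norm="slaney",
    mel_scale="slaney",
)

wav, sr = torchaudio.load("sample.wav")       # hypothetical input path
assert sr == 16000, "resample to 16 kHz first"
mel = torch.log(mel_fn(wav).clamp(min=1e-5))  # Tacotron-style log compression
print(mel.shape)                              # (1, 80, T) at 100 Hz
```

For the global speaker embedding, Real-Time-Voice-Cloning's encoder is distributed on PyPI as `resemblyzer`; using that package here is an assumption, and the repository may load the original checkpoint directly instead:

```python
from resemblyzer import VoiceEncoder, preprocess_wav

# `resemblyzer` packages the Real-Time-Voice-Cloning speaker encoder
# (assumption: the repo may load the original checkpoint instead).
wav = preprocess_wav("sample.wav")       # hypothetical input path
encoder = VoiceEncoder()
spk_emb = encoder.embed_utterance(wav)   # 256-dim global embedding
print(spk_emb.shape)
```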
Sample data and the expected directory structure are provided in the `data` directory.
To train the encoders of V2SFlow, please refer to IntelligibleL2S.
To train the RFM speech decoder of V2SFlow:
```bash
bash train.sh
```
Logs and checkpoints will be saved in the `save/train` directory by default. Modify the configurations in `v2sflow/config/train.py` as needed.
To generate speech from video features:
```bash
bash inference.sh
```
Logs and generated speech will be saved in the `save/inference` directory by default. Modify the configurations in `v2sflow/config/inference.py` as needed.
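To spot-check a result, the generated waveform can be loaded back with torchaudio; the file name below is hypothetical and depends on the inference configuration:

```python
import torchaudio

# Hypothetical output path; check save/inference for the actual file names.
wav, sr = torchaudio.load("save/inference/generated.wav")
print(wav.shape, sr)  # expect mono 16 kHz audio
```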
This repository is built upon Open-Sora, CosyVoice, and Fairseq. We appreciate the authors for open-sourcing their projects.
If our work is useful for your research, please cite the following paper:
```bibtex
@inproceedings{choi2025v2sflow,
  title={V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow},
  author={Choi, Jeongsoo and Kim, Ji-Hoon and Li, Jinyu and Chung, Joon Son and Liu, Shujie},
  booktitle={ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2025}
}
```