Official PyTorch implementation for the following paper:
V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow
Jeongsoo Choi*, Ji-Hoon Kim*, Jinyu Li, Joon Son Chung, Shujie Liu
ICASSP 2025
[Paper] [Project]
For inference, download the checkpoints and place them in the `checkpoints` directory.
| Name | Train Dataset | Model |
|---|---|---|
| v2sflow_encoder.pt | LRS3 | download |
| v2sflow_decoder.pt | LRS3 | download |
| hifigan_vocoder.pt | LRS3 | download |
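Once downloaded, a checkpoint can be sanity-checked by loading it on CPU. This is a minimal sketch that assumes the files are standard `torch.save` artifacts; the actual key layout inside each checkpoint may differ:

```python
import torch

# Assumption: checkpoints are plain torch.save files; the key layout is a guess.
ckpt = torch.load("checkpoints/v2sflow_decoder.pt", map_location="cpu")
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt.keys())[:5])  # peek at the top-level keys
```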
We provide audio samples generated by several methods for test videos from LRS2 and LRS3.
| Train Dataset | Test Dataset | IntelligibleL2S | DiffV2S | V2SFlow-A | V2SFlow-V |
|---|---|---|---|---|---|
| LRS3 | LRS3 | download | download | download | download |
| LRS3 | LRS2 | download | download | download | download |
| LRS2 | LRS2 | download | download | - | - |
```bash
conda create -y -n v2sflow python=3.10 && conda activate v2sflow
git clone https://github.com/kaistmm/V2SFlow.git && cd V2SFlow
pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
cd third_party/fairseq/data && python setup.py build_ext --inplace && cd -
```
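After installation, a quick sanity check confirms the pinned builds are in place and that the cu121 wheels can see your GPU:

```python
import torch
import torchaudio

print(torch.__version__)          # expect 2.2.2+cu121
print(torchaudio.__version__)     # expect 2.2.2
print(torch.cuda.is_available())  # True if the cu121 wheels match your driver
```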
- 25 Hz continuous features from AV-HuBERT Large
  - reference: Auto-AVSR for pre-processing videos, AV-HuBERT for extracting features
- 50 Hz discrete units from mHuBERT Base (layer 11, km 1000)
  - reference: textless_s2st_real_data
- 12.5 Hz discrete units from a YAAPT-based F0 VQ-VAE
  - reference: DDDM-VC, speech-resynthesis
- Global continuous speaker embedding
  - reference: Real-Time-Voice-Cloning (see the sketch after this list)
- 100 Hz continuous mel-spectrogram: `filter_length: 640`, `hop_length: 160`, `win_length: 640`, `n_mel_channels: 80`, `sampling_rate: 16000`, `mel_fmin: 0.0`, `mel_fmax: 8000.0`
  - reference: TacotronSTFT (see the sketch after this list)
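The mel-spectrogram parameters above map onto torchaudio as follows. This is a hedged sketch, not the repository's TacotronSTFT module: `norm="slaney"` / `mel_scale="slaney"` approximate the librosa-style filterbank that Tacotron STFTs typically use, the clamped log is the usual Tacotron dynamic-range compression, and `sample.wav` is a hypothetical 16 kHz input:

```python
import torch
import torchaudio

# Mel-spectrogram settings listed above; an approximation of TacotronSTFT.
mel_fn = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=640,        # filter_length
    win_length=640,
    hop_length=160,   # 16000 / 160 = 100 Hz frame rate
    f_min=0.0,
    f_max=8000.0,
    n_mels=80,
    power=1.0,        # magnitude spectrogram, as in Tacotron
    norm="slaney",
    mel_scale="slaney",
)

wav, sr = torchaudio.load("sample.wav")       # hypothetical input path
assert sr == 16000, "resample to 16 kHz first"
mel = torch.log(mel_fn(wav).clamp(min=1e-5))  # Tacotron-style log compression
print(mel.shape)                              # (1, 80, T) at 100 Hz
```

For the global speaker embedding, Real-Time-Voice-Cloning's encoder is distributed on PyPI as `resemblyzer`; using that package here is an assumption, and the repository may load the original checkpoint directly instead:

```python
from resemblyzer import VoiceEncoder, preprocess_wav

# `resemblyzer` packages the Real-Time-Voice-Cloning speaker encoder
# (assumption: the repo may load the original checkpoint instead).
wav = preprocess_wav("sample.wav")       # hypothetical input path
encoder = VoiceEncoder()
spk_emb = encoder.embed_utterance(wav)   # 256-dim global embedding
print(spk_emb.shape)
```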
Sample data and the expected directory structure are provided in the `data` directory.
To train the encoders of V2SFlow, please refer to IntelligibleL2S.
To train the RFM speech decoder of V2SFlow:
```bash
bash train.sh
```
Logs and checkpoints will be saved in the `save/train` directory by default. Modify the configurations in `v2sflow/config/train.py` as needed.
To generate speech from video features:
```bash
bash inference.sh
```
Logs and generated speech will be saved in the `save/inference` directory by default. Modify the configurations in `v2sflow/config/inference.py` as needed.
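To spot-check a result, the generated waveform can be loaded back with torchaudio; the file name below is hypothetical and depends on the inference configuration:

```python
import torchaudio

# Hypothetical output path; check save/inference for the actual file names.
wav, sr = torchaudio.load("save/inference/generated.wav")
print(wav.shape, sr)  # expect mono 16 kHz audio
```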
This repository is built upon Open-Sora, CosyVoice, and Fairseq. We appreciate the authors for open-sourcing their projects.
If our work is useful for your research, please cite the following paper:
```bibtex
@inproceedings{choi2025v2sflow,
  title={V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow},
  author={Choi, Jeongsoo and Kim, Ji-Hoon and Li, Jinyu and Chung, Joon Son and Liu, Shujie},
  booktitle={ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2025}
}
```