V2SFlow

Official PyTorch implementation for the following paper:

V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow
Jeongsoo Choi*, Ji-Hoon Kim*, Jinyu Li, Joon Son Chung, Shujie Liu
ICASSP 2025
[Paper] [Project]

Model Checkpoints

For inference, download the checkpoints below and place them in the checkpoints directory.

| Name | Train Dataset | Model |
|---|---|---|
| v2sflow_encoder.pt | LRS3 | download |
| v2sflow_decoder.pt | LRS3 | download |
| hifigan_vocoder.pt | LRS3 | download |
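
As a quick sanity check after downloading, the weights can be loaded with `torch.load`. The filenames and directory below come from the table above; the assumption that each file holds a dictionary of weights (rather than a full training checkpoint) is ours, so treat this as a minimal sketch.

```python
from pathlib import Path

import torch

CKPT_DIR = Path("checkpoints")  # default location expected for inference
CKPT_FILES = ["v2sflow_encoder.pt", "v2sflow_decoder.pt", "hifigan_vocoder.pt"]

for name in CKPT_FILES:
    path = CKPT_DIR / name
    assert path.exists(), f"missing checkpoint: {path}"
    # map_location="cpu" avoids requiring a GPU just to inspect the file
    state = torch.load(path, map_location="cpu")
    n_keys = len(state) if isinstance(state, dict) else "n/a"
    print(f"{name}: loaded ({n_keys} top-level entries)")
```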

Test Samples

We provide audio samples generated by several methods for test videos from LRS2 and LRS3.

| Train Dataset | Test Dataset | IntelligibleL2S | DiffV2S | V2SFlow-A | V2SFlow-V |
|---|---|---|---|---|---|
| LRS3 | LRS3 | download | download | download | download |
| LRS3 | LRS2 | download | download | download | download |
| LRS2 | LRS2 | download | download | - | - |

Installation

conda create -y -n v2sflow python=3.10 && conda activate v2sflow

git clone https://github.com/kaistmm/V2SFlow.git && cd V2SFlow

pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt

cd third_party/fairseq/data && python setup.py build_ext --inplace && cd -
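
After installation, a short check such as the following (our own addition, not part of the repo) confirms that the pinned CUDA 12.1 wheels were picked up:

```python
import torch
import torchaudio
import torchvision

# Expected with the pinned wheels: torch 2.2.2 / torchvision 0.17.2 / torchaudio 2.2.2 (+cu121)
print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("torchaudio:", torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```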

Data Preparation

Video Feature

  • 25 Hz continuous features from AV-HuBERT Large

    reference: Auto-AVSR for video pre-processing, AV-HuBERT for feature extraction
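
A hedged sketch of loading AV-HuBERT Large through fairseq's standard ensemble-loading path is shown below. The `user_dir` and checkpoint paths are placeholders, and the exact feature-extraction call (a video-only `source` dict) should be taken from the AV-HuBERT repository; this is not the repo's own extraction script.

```python
from argparse import Namespace

from fairseq import checkpoint_utils, utils

# Placeholder paths: point these at your AV-HuBERT clone and its Large checkpoint.
utils.import_user_module(Namespace(user_dir="path/to/av_hubert/avhubert"))
ckpt_path = "checkpoints/av_hubert_large.pt"

# Restores the model together with its task/config.
models, cfg, task = checkpoint_utils.load_model_ensemble_and_task([ckpt_path])
av_hubert = models[0].eval()

# The visual front-end consumes pre-processed 25 fps mouth-region video (see Auto-AVSR),
# so the extracted feature sequence is 25 Hz: one vector per video frame.
# For the exact extraction call, follow the AV-HuBERT repository.
print(type(av_hubert).__name__, "loaded")
```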

Content

Pitch

Speaker

Audio Feature

  • 100 Hz continuous mel-spectrogram

    filter_length: 640
    hop_length: 160
    win_length: 640
    n_mel_channels: 80
    sampling_rate: 16000
    mel_fmin: 0.0
    mel_fmax: 8000.0
    

    reference: TacotronSTFT
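
As a rough equivalent of the TacotronSTFT recipe, the parameters above map directly onto torchaudio's `MelSpectrogram`. This is our own sketch, not the repository's code: the mel-basis normalization, log compression, and `power` setting are assumptions, and `sample.wav` is a placeholder for a 16 kHz mono file.

```python
import torch
import torchaudio

# Parameters taken verbatim from the config above.
mel_fn = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=640,          # filter_length
    win_length=640,
    hop_length=160,     # 160 samples at 16 kHz -> 100 mel frames per second
    f_min=0.0,
    f_max=8000.0,
    n_mels=80,
    power=1.0,          # magnitude spectra, Tacotron-style; an assumption here
)

wav, sr = torchaudio.load("sample.wav")                 # 16 kHz mono expected
mel = torch.log(torch.clamp(mel_fn(wav), min=1e-5))     # Tacotron-style log compression
print(mel.shape)  # (1, 80, T); 100 Hz frames, i.e. 4x the 25 Hz video features
```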

Sample data and the expected directory structure are provided in the data directory.

Training

To train the encoders of V2SFlow, please refer to IntelligibleL2S.

To train the RFM speech decoder of V2SFlow:

bash train.sh

Logs and checkpoints will be saved in the save/train directory by default. Modify the configurations in v2sflow/config/train.py as needed.

Inference

To generate speech from video feature:

bash inference.sh

Logs and outputs will be saved in the save/inference directory by default. Modify the configurations in v2sflow/config/inference.py as needed.

Acknowledgement

This repository is built upon Open-Sora, CosyVoice, and Fairseq. We appreciate the authors for open-sourcing their projects.

Citation

If our work is useful for your research, please cite the following paper:

@inproceedings{choi2025v2sflow,
  title={V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow},
  author={Choi, Jeongsoo and Kim, Ji-Hoon and Li, Jinyu and Chung, Joon Son and Liu, Shujie},
  booktitle={ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2025}
}
