This is the official repository for the paper "Incorporating Visual Correspondence into Diffusion Model for Virtual Try-On".
We propose to explicitly capitalize on visual correspondence as a prior to tame the diffusion process, instead of simply feeding the whole garment into the UNet as the appearance reference.
Create a conda environment and install the requirements:
conda create -n SPM-Diff python=3.9.0
conda activate SPM-Diff
cd SPM-Diff-main
pip install -r requirements.txt
In SPM, a set of semantic points on the garment is first sampled and matched to the corresponding points on the target person via local flow warping. These 2D cues are then augmented into 3D-aware cues using depth/normal maps, and the resulting semantic point matches serve as supervision for the diffusion model.
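As a rough illustration of these two steps, here is a minimal NumPy sketch: warping sampled garment points to the person image with a dense flow field, then lifting the matches into 3D-aware cues with a depth map. All names, shapes, and the random toy data are assumptions for illustration only, not the repo's actual API.

```python
import numpy as np

def warp_points(points, flow):
    """Map garment points to person-space via a dense local flow field.

    points: (N, 2) integer pixel coordinates (x, y) on the garment image.
    flow:   (H, W, 2) per-pixel displacement from garment to person image.
    """
    disp = flow[points[:, 1], points[:, 0]]  # look up the flow at each point
    return points + disp

def lift_to_3d(points, depth):
    """Augment 2D matched points into 3D-aware cues using a depth map."""
    xs = points[:, 0].astype(int)
    ys = points[:, 1].astype(int)
    z = depth[ys, xs]
    return np.concatenate([points, z[:, None]], axis=1)  # (N, 3)

# Toy usage with random data, just to show the shapes involved.
H, W, N = 256, 192, 32
rng = np.random.default_rng(0)
pts = np.stack([rng.integers(0, W, size=N), rng.integers(0, H, size=N)], axis=1)
flow = rng.normal(scale=2.0, size=(H, W, 2))
depth = rng.uniform(size=(H, W))
warped = np.clip(warp_points(pts, flow), [0, 0], [W - 1, H - 1])
cues_3d = lift_to_3d(warped, depth)
print(cues_3d.shape)  # (32, 3)
```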
You can download the Semantic Point Feature directly, or follow the instructions in preprocessing.md to extract it yourself.
You can download the VITON-HD dataset from here.
For inference, the following dataset structure is required (a quick layout check is sketched after the tree):
test
|-- image
|-- masked_vton_img
|-- warp-cloth
|-- cloth
|-- cloth_mask
|-- point
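Before running inference, you may want to verify that the test set is laid out as above. This is a minimal sketch; the root path `data/test` is an assumption, so adjust it to wherever you unpacked the dataset.

```python
from pathlib import Path

# Check that every expected subfolder of the test set exists.
root = Path("data/test")  # assumed location; change to your dataset root
expected = ["image", "masked_vton_img", "warp-cloth", "cloth", "cloth_mask", "point"]
missing = [d for d in expected if not (root / d).is_dir()]
if missing:
    raise FileNotFoundError(f"Missing subfolders under {root}: {missing}")
print("Dataset layout OK")
```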
Please download the pre-trained model from Link.
sh inference.sh
Thanks to the contributions of LaDI-VTON and GP-VTON.
@inproceedings{wan2025incorporating,
  title={Incorporating Visual Correspondence into Diffusion Model for Virtual Try-On},
  author={Siqi Wan and Jingwen Chen and Yingwei Pan and Ting Yao and Tao Mei},
  booktitle={ICLR},
  year={2025}
}