
CodeBind: Decoupled Representation Learning for Multimodal Alignment with Unified Compositional Codebook

Zeyu Chen   Jie Li   Kai Han

Visual AI Lab, The University of Hong Kong

Accepted to ACL 2026 Findings

[Paper] [Project Page]

Figure: Target modalities are partially aligned with bridging modalities via codebooks, yielding a shared space, while features unique to both the bridging and target modalities are preserved in modality-specific spaces. Compositional VQ combines multiple low-dimensional codevectors to reconstruct a complete embedding.

📣 Updates

  • [April 16, 2026] Initial Release

✨ Overview

Multimodal representation alignment is crucial for large language models and robotics. Traditional methods often struggle with cross-modal information discrepancies and data scarcity, resulting in suboptimal alignment spaces that neglect modality-unique features.

We introduce CodeBind, a novel framework that optimizes multimodal representation spaces using a modality-shared-specific codebook design.

Unlike conventional hard alignment approaches, CodeBind decomposes features into:

  • Shared Components: Ensuring semantic consistency across modalities.
  • Specific Components: Preserving modality-unique details.

This approach employs a compositional vector quantization scheme, where a shared codebook bridges modality gaps, and modality-specific codebooks mitigate representation bias by preventing dominant modalities from overshadowing others. Validated across nine modalities (text, image, video, audio, depth, thermal, tactile, 3D point cloud, EEG), CodeBind achieves state-of-the-art performance in multimodal classification and retrieval tasks.
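
For intuition, below is a minimal PyTorch sketch of the shared/specific compositional quantization described above. It assumes a chunked codebook layout and a residual shared/specific decomposition; the class names, hyperparameters, and decomposition scheme are illustrative assumptions, not the repository's actual implementation.

import torch
import torch.nn as nn

class CompositionalVQ(nn.Module):
    """Quantize an embedding as a concatenation of low-dimensional codevectors."""
    def __init__(self, dim=512, num_chunks=8, codebook_size=256):
        super().__init__()
        assert dim % num_chunks == 0
        self.num_chunks = num_chunks
        self.sub_dim = dim // num_chunks
        # One low-dimensional codebook per chunk of the embedding.
        self.codebooks = nn.Parameter(
            torch.randn(num_chunks, codebook_size, self.sub_dim))

    def forward(self, z):                             # z: (batch, dim)
        chunks = z.view(z.size(0), self.num_chunks, self.sub_dim)
        quantized = []
        for c in range(self.num_chunks):
            cb = self.codebooks[c]                    # (codebook_size, sub_dim)
            dists = torch.cdist(chunks[:, c], cb)     # distance to every codevector
            quantized.append(cb[dists.argmin(dim=-1)])
        z_q = torch.cat(quantized, dim=-1)            # reconstructed full embedding
        # Straight-through estimator so gradients reach the encoder.
        return z + (z_q - z).detach()

class SharedSpecificVQ(nn.Module):
    """Decompose a feature into a shared (cross-modal) part and a specific part."""
    def __init__(self, dim=512):
        super().__init__()
        self.shared_vq = CompositionalVQ(dim)     # one codebook shared by all modalities
        self.specific_vq = CompositionalVQ(dim)   # instantiated per modality

    def forward(self, feat):
        shared = self.shared_vq(feat)                 # semantically aligned component
        specific = self.specific_vq(feat - shared)    # modality-unique residual
        return shared, specific

In this reading, only the shared component participates in cross-modal alignment, while the specific component preserves modality-unique detail and keeps dominant modalities from overshadowing others.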

📝 TODOs

  • Release the training code
  • Release CodeBind-IB checkpoints
  • Release applications code

🔨 Installation

First, clone the repository and install the required packages.

git clone https://github.com/Visual-AI/codebind.git
cd codebind
conda install pytorch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
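
Optionally, sanity-check the environment before proceeding (a standard PyTorch check, not specific to this repository):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"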

📚 Quick Start

You can use CodeBind to extract and compare features across modalities. An example snippet is provided below:

# TBD
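
While the official snippet is still TBD, the following hypothetical sketch illustrates the intended usage pattern. The codebind module, load_model entry point, ModalityType enum, and checkpoint name are all assumptions and will likely differ from the released API:

import torch
from codebind import load_model, ModalityType  # assumed API, not yet released

model = load_model("codebind_base")            # hypothetical checkpoint name
model.eval()

inputs = {
    ModalityType.TEXT: ["a dog playing fetch"],
    ModalityType.IMAGE: ["assets/dog.jpg"],
}

with torch.no_grad():
    embeddings = model(inputs)                 # dict: modality -> (N, dim) tensor

# Compare shared-space embeddings across modalities via cosine similarity.
sim = torch.nn.functional.cosine_similarity(
    embeddings[ModalityType.TEXT], embeddings[ModalityType.IMAGE]
)
print(sim)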

📦 Datasets

Please refer to Doc/DATASETS.md for dataset preparation.

🧩 Model Zoo

Please refer to Doc/MODEL_ZOO.md for details on available CodeBind checkpoints.

🚀 Training & Inference

Please refer to Doc/TRAINING.md for details on CodeBind training scripts for different modalities.

🙏 Acknowledgements

This repository builds upon the invaluable contributions of the open-source community. We extend our sincere appreciation to the projects whose foundational work made CodeBind possible.

📜 Citation

If you find this repository useful, please consider giving a star ⭐ and citation:

@article{chen2026codebind,
    title={CodeBind: Decoupled Representation Learning for Multimodal Alignment with Unified Compositional Codebook},
    author={Zeyu Chen and Jie Li and Kai Han},
    journal={arXiv preprint arXiv:},
    year={2026}
}
