CodeBind: Decoupled Representation Learning for Multimodal Alignment with Unified Compositional Codebook
Target modalities are partially aligned with bridging modalities via codebooks, resulting in a shared space. Unique features from both bridging and target modalities are preserved in modality-specific spaces. Compositional VQ reconstructs a complete embedding by combining multiple low-dimensional codevectors.
- [April 16, 2026] Initial Release
Multimodal representation alignment is crucial for large language models and robotics. Traditional methods often struggle with cross-modal information discrepancies and data scarcity, resulting in suboptimal alignment spaces that neglect modality-unique features.
We introduce CodeBind, a novel framework that optimizes multimodal representation spaces using a modality-shared and modality-specific codebook design.
Unlike conventional hard alignment approaches, CodeBind decomposes features into:
- Shared Components: Ensuring semantic consistency across modalities.
- Specific Components: Preserving modality-unique details.
This approach employs a compositional vector quantization scheme, where a shared codebook bridges modality gaps, and modality-specific codebooks mitigate representation bias by preventing dominant modalities from overshadowing others. Validated across nine modalities (text, image, video, audio, depth, thermal, tactile, 3D point cloud, EEG), CodeBind achieves state-of-the-art performance in multimodal classification and retrieval tasks.
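To make the design concrete, the sketch below shows one way such a shared-specific compositional quantizer could be wired up in PyTorch. It is an illustrative toy, not the released implementation: the class names, feature dimensions, three-modality setup, and the straight-through trick are all assumptions.

```python
import torch
import torch.nn as nn

class CompositionalVQ(nn.Module):
    """Toy compositional vector quantizer: an embedding is split into several
    low-dimensional chunks, each chunk is snapped to its nearest codevector in a
    small per-chunk codebook, and the quantized chunks are concatenated back
    into a complete embedding."""
    def __init__(self, dim=512, num_chunks=8, codebook_size=256):
        super().__init__()
        assert dim % num_chunks == 0
        self.num_chunks, self.chunk_dim = num_chunks, dim // num_chunks
        self.codebooks = nn.Parameter(torch.randn(num_chunks, codebook_size, self.chunk_dim))

    def forward(self, z):                                    # z: (batch, dim)
        chunks = z.view(z.size(0), self.num_chunks, self.chunk_dim)
        dists = torch.cdist(chunks.transpose(0, 1), self.codebooks)        # (chunks, batch, K)
        idx = dists.argmin(dim=-1)                                          # (chunks, batch)
        quant = torch.stack([self.codebooks[c][idx[c]]
                             for c in range(self.num_chunks)], dim=1)       # (batch, chunks, d)
        quant = chunks + (quant - chunks).detach()                          # straight-through gradients
        return quant.reshape(z.size(0), -1), idx


class SharedSpecificQuantizer(nn.Module):
    """Toy shared-specific decomposition: one compositional codebook is shared by
    all modalities (bridging them in a common space), while each modality keeps
    its own codebook for modality-unique details."""
    def __init__(self, dim=512, modalities=("image", "text", "audio")):
        super().__init__()
        self.to_shared, self.to_specific = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.shared_vq = CompositionalVQ(dim)
        self.specific_vq = nn.ModuleDict({m: CompositionalVQ(dim) for m in modalities})

    def forward(self, feat, modality):
        shared, _ = self.shared_vq(self.to_shared(feat))                    # semantics common to all modalities
        specific, _ = self.specific_vq[modality](self.to_specific(feat))    # modality-unique details
        return shared, specific


quantizer = SharedSpecificQuantizer()
image_feat = torch.randn(4, 512)                 # e.g. features from a frozen image encoder
shared, specific = quantizer(image_feat, "image")
print(shared.shape, specific.shape)              # torch.Size([4, 512]) twice
```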
- Release the training code
- Release CodeBind-IB checkpoints
- Release applications code
First, clone the repository and install the required packages.
```bash
git clone https://github.com/Visual-AI/codebind.git
cd codebind
conda install pytorch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
```

You can use CodeBind to extract and compare features across modalities. An example snippet is provided below:

```python
# TBD
```
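Until the official snippet lands, the sketch below illustrates the intended workflow. Every name in it (the `codebind` package layout, `load_codebind`, `ModalityType`, `load_and_transform`, and the checkpoint path) is a hypothetical placeholder rather than the released API.

```python
# Hypothetical sketch only -- the imports, loader, and helpers below are
# placeholders, not the released CodeBind API.
import torch
from codebind import load_codebind, ModalityType, load_and_transform  # hypothetical

device = "cuda" if torch.cuda.is_available() else "cpu"
model = load_codebind(checkpoint="checkpoints/codebind_ib.pth").to(device).eval()

# Preprocess one sample per modality (file paths are placeholders).
inputs = {
    ModalityType.TEXT: load_and_transform(["a dog playing fetch"], ModalityType.TEXT, device),
    ModalityType.IMAGE: load_and_transform(["assets/dog.jpg"], ModalityType.IMAGE, device),
    ModalityType.AUDIO: load_and_transform(["assets/bark.wav"], ModalityType.AUDIO, device),
}

with torch.no_grad():
    embeddings = model(inputs)  # modality -> (batch, dim) embeddings in the shared space

# Compare modalities by cosine similarity in the shared space.
text = torch.nn.functional.normalize(embeddings[ModalityType.TEXT], dim=-1)
image = torch.nn.functional.normalize(embeddings[ModalityType.IMAGE], dim=-1)
print("text-image similarity:", text @ image.T)
```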
Please refer to Doc/DATASETS.md for dataset preparation.
Please refer to Doc/MODEL_ZOO.md for details on available CodeBind checkpoints.
Please refer to Doc/TRAINING.md for details on CodeBind training scripts for different modalities.
This repository builds upon the invaluable contributions of the open-source community. We extend our sincere appreciation to the following projects for their foundational work:
If you find this repository useful, please consider giving a star ⭐ and citation:
```bibtex
@article{chen2026codebind,
  title={CodeBind: Decoupled Representation Learning for Multimodal Alignment with Unified Compositional Codebook},
  author={Zeyu Chen and Jie Li and Kai Han},
  journal={arXiv preprint arXiv:},
  year={2026}
}
```