This repository contains the code for our NeurIPS 2025 spotlight paper *Vision-centric Token Compression in Large Language Model*. In this work, we propose VIST (Vision-centric Token Compression), a slow-fast token compression framework that mirrors human skimming.
VIST first converts loosely relevant long context into images, which are processed by a frozen vision encoder and a trainable Resampler to produce semantically compact visual tokens. These compressed tokens and the main input tokens are then consumed by the LLM. In this slow-fast setup, the vision encoder acts like the human eye—selectively attending to salient information—while the LLM functions as the brain, concentrating on the most informative content for deeper reasoning.
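To make the slow-fast design concrete, here is a minimal PyTorch sketch of the compression path. All module names, dimensions, and the cross-attention-based Resampler design are illustrative assumptions, not the exact implementation in this repository:

```python
import torch
import torch.nn as nn

class Resampler(nn.Module):
    """Compresses many patch features into a fixed, small set of visual tokens
    using learnable queries + cross-attention (a common design; the exact
    architecture in this repo may differ)."""
    def __init__(self, vis_dim=1024, llm_dim=4096, num_queries=64, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vis_dim))
        self.attn = nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(vis_dim, llm_dim)  # map into the LLM embedding space

    def forward(self, patch_feats):               # (B, num_patches, vis_dim)
        q = self.queries.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        tokens, _ = self.attn(q, patch_feats, patch_feats)
        return self.proj(tokens)                  # (B, num_queries, llm_dim)

# Stand-ins for the real components (hypothetical shapes):
vision_encoder = nn.Linear(3 * 16 * 16, 1024)    # a frozen ViT in practice
resampler = Resampler()

patches = torch.randn(2, 576, 3 * 16 * 16)       # rendered context images -> patches
with torch.no_grad():                            # the vision encoder stays frozen
    patch_feats = vision_encoder(patches)
visual_tokens = resampler(patch_feats)           # (2, 64, 4096): compact visual tokens
# These compressed tokens and the main text tokens are then consumed by the
# LLM, e.g. through trainable cross-attention layers.
```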
You can install the requirements with:

```bash
pip install -r requirements.txt
```

For dataset download and preprocessing, please follow the guidelines described in the CEPE project. Our data structure and preparation steps are consistent with that repository.
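Beyond the CEPE-style data pipeline, VIST consumes long context as rendered images. The snippet below is a hypothetical Pillow-based rendering helper; the canvas size, font, and word wrapping are assumptions, and the repository's own preprocessing scripts define the actual settings:

```python
from PIL import Image, ImageDraw

def render_text_to_image(text: str, width: int = 448, height: int = 448,
                         margin: int = 8, line_height: int = 14) -> Image.Image:
    """Draw a chunk of context text onto a fixed-size white canvas."""
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    # Naive word wrap using the default bitmap font.
    words, lines, line = text.split(), [], ""
    for w in words:
        candidate = (line + " " + w).strip()
        if draw.textlength(candidate) > width - 2 * margin:
            lines.append(line)
            line = w
        else:
            line = candidate
    lines.append(line)
    for i, ln in enumerate(lines):
        y = margin + i * line_height
        if y > height - margin:
            break                      # text overflowing the canvas is dropped
        draw.text((margin, y), ln, fill="black")
    return img
```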
During training, VIST activates the Resampler and enables trainable cross-attention layers within the LLM. You can simply start pretraining with:

```bash
bash pretrain.sh
```
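The freezing scheme boils down to: freeze everything, then re-enable the Resampler and the LLM's cross-attention layers. A minimal sketch, assuming the relevant parameters can be identified by name substrings (the substrings below are hypothetical; check the model code for the actual module names):

```python
import torch.nn as nn

def setup_trainable(model: nn.Module) -> None:
    """Freeze all parameters, then unfreeze the Resampler and the LLM's
    cross-attention layers ("resampler"/"cross_attn" are assumed names)."""
    for p in model.parameters():
        p.requires_grad = False               # freeze everything first
    for name, p in model.named_parameters():
        if "resampler" in name or "cross_attn" in name:
            p.requires_grad = True            # keep only these trainable
```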
To evaluate VIST, you can run:

```bash
# ICL tasks
bash scripts/run_icl_ddp.sh

# Open-domain QA
bash scripts/run_qa.sh
```
Please cite our paper if you use VIST in your work:
```bibtex
@article{xing2025vision,
  title={Vision-centric Token Compression in Large Language Model},
  author={Xing, Ling and Wang, Alex Jinpeng and Yan, Rui and Shu, Xiangbo and Tang, Jinhui},
  journal={arXiv preprint arXiv:2502.00791},
  year={2025}
}
```

This project builds upon and is inspired by the following open-source works:
We sincerely thank the authors for their excellent contributions to the community!
