This repository contains the code for our NeurIPS 2025 spotlight paper *Vision-centric Token Compression in Large Language Model*. In this work, we propose VIST (Vision-centric Token Compression), a slow-fast token compression framework that mirrors human skimming.
VIST first converts loosely relevant long context into images, which are processed by a frozen vision encoder and a trainable Resampler to produce semantically compact visual tokens. These compressed tokens and the main input tokens are then consumed by the LLM. In this slow-fast setup, the vision encoder acts like the human eye—selectively attending to salient information—while the LLM functions as the brain, concentrating on the most informative content for deeper reasoning.
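To make the slow-fast design concrete, here is a minimal PyTorch sketch of the compression path. All module names, dimensions, and the cross-attention-based Resampler design are illustrative assumptions, not the exact implementation in this repository:

```python
import torch
import torch.nn as nn

class Resampler(nn.Module):
    """Compresses many patch features into a fixed, small set of visual tokens
    using learnable queries + cross-attention (a common design; the exact
    architecture in this repo may differ)."""
    def __init__(self, vis_dim=1024, llm_dim=4096, num_queries=64, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vis_dim))
        self.attn = nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(vis_dim, llm_dim)  # map into the LLM embedding space

    def forward(self, patch_feats):               # (B, num_patches, vis_dim)
        q = self.queries.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        tokens, _ = self.attn(q, patch_feats, patch_feats)
        return self.proj(tokens)                  # (B, num_queries, llm_dim)

# Stand-ins for the real components (hypothetical shapes):
vision_encoder = nn.Linear(3 * 16 * 16, 1024)    # a frozen ViT in practice
resampler = Resampler()

patches = torch.randn(2, 576, 3 * 16 * 16)       # rendered context images -> patches
with torch.no_grad():                            # the vision encoder stays frozen
    patch_feats = vision_encoder(patches)
visual_tokens = resampler(patch_feats)           # (2, 64, 4096): compact visual tokens
# These compressed tokens and the main text tokens are then consumed by the
# LLM, e.g. through trainable cross-attention layers.
```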
You can install the requirements with:

```bash
pip install -r requirements.txt
```

For dataset download and preprocessing, please follow the guidelines described in the CEPE project. Our data structure and preparation steps are consistent with that repository.
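Beyond the CEPE-style data pipeline, VIST consumes long context as rendered images. The snippet below is a hypothetical Pillow-based rendering helper; the canvas size, font, and word wrapping are assumptions, and the repository's own preprocessing scripts define the actual settings:

```python
from PIL import Image, ImageDraw

def render_text_to_image(text: str, width: int = 448, height: int = 448,
                         margin: int = 8, line_height: int = 14) -> Image.Image:
    """Draw a chunk of context text onto a fixed-size white canvas."""
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    # Naive word wrap using the default bitmap font.
    words, lines, line = text.split(), [], ""
    for w in words:
        candidate = (line + " " + w).strip()
        if draw.textlength(candidate) > width - 2 * margin:
            lines.append(line)
            line = w
        else:
            line = candidate
    lines.append(line)
    for i, ln in enumerate(lines):
        y = margin + i * line_height
        if y > height - margin:
            break                      # text overflowing the canvas is dropped
        draw.text((margin, y), ln, fill="black")
    return img
```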
During training, VIST activates the Resampler and enables trainable cross-attention layers within the LLM. You can simply start pretraining with:

```bash
bash pretrain.sh
```
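The freezing scheme boils down to: freeze everything, then re-enable the Resampler and the LLM's cross-attention layers. A minimal sketch, assuming the relevant parameters can be identified by name substrings (the substrings below are hypothetical; check the model code for the actual module names):

```python
import torch.nn as nn

def setup_trainable(model: nn.Module) -> None:
    """Freeze all parameters, then unfreeze the Resampler and the LLM's
    cross-attention layers ("resampler"/"cross_attn" are assumed names)."""
    for p in model.parameters():
        p.requires_grad = False               # freeze everything first
    for name, p in model.named_parameters():
        if "resampler" in name or "cross_attn" in name:
            p.requires_grad = True            # keep only these trainable
```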
To evaluate VIST, you can run:

```bash
# ICL tasks
bash scripts/run_icl_ddp.sh

# Open-domain QA
bash scripts/run_qa.sh
```
Please cite our paper if you use VIST in your work:
```bibtex
@article{xing2025vision,
  title={Vision-centric Token Compression in Large Language Model},
  author={Xing, Ling and Wang, Alex Jinpeng and Yan, Rui and Shu, Xiangbo and Tang, Jinhui},
  journal={arXiv preprint arXiv:2502.00791},
  year={2025}
}
```

This project builds upon and is inspired by the following open-source works:
We sincerely thank the authors for their excellent contributions to the community!
