中文文档 | English
VR-Bench is a comprehensive benchmark for evaluating Vision-Language Models (VLMs) on spatial reasoning and planning tasks through various puzzle games. It provides a unified framework for dataset generation, evaluation, and analysis.
If you encounter any difficulties using or reproducing the code, please contact me directly (Email: [email protected], WeChat: 19883175660). Note that the parameter settings used during evaluation and the choice of tracker will affect the evaluation results.
- [2025.12.03] Refactored tracker code for improved standardization and added comprehensive tracker documentation (NCC, Optical Flow, CSRT) with usage examples.
📝 Note on Paper Reproduction: The results in our paper were obtained with the CSRT tracker. To reproduce the paper results exactly, use `--tracker-type csrt`. For general use, however, we recommend the NCC tracker, which provides more stable and accurate trajectory extraction in puzzle game scenarios.
- [2025.11.26] We apologize for the earlier omission: all of our current maze textures have now been added to the skins folder so that generation works as expected. In future releases, we plan to use nanobanana to support automatic skin generation. Please follow our updates.
- [2025.11.24] We have released the training scripts and corresponding configurations used to train Wan-R1.
- [2025.11.19] We have released evaluation code for all tasks.
Overview of VR-Bench. (A) Maze Types. VR-Bench comprises five maze types—Regular Maze, Irregular Maze, 3D Maze, Trapfield, and Sokoban—covering both 2D and 3D settings as well as diverse task structures, yielding a broad range of spatial reasoning scenarios. (B) Reasoning via Video Paradigm. VR-Bench adopts a chain-of-frame reasoning paradigm, requiring models to produce frame-by-frame inferences that capture sequential visual reasoning. (C) Benchmark Performance. Leading VLMs and video models are evaluated on four core metrics across all maze types, revealing clear differences in spatial reasoning capability. (D) Additional Analysis. VR-Bench also supports evaluations on difficulty generalization, texture generalization, maze-type generalization, and test-time scaling, enabling a comprehensive assessment of model robustness and generalization.
To evaluate the generalization ability on the VTR task and enhance robustness in adapting to diverse maze scenarios, we introduce variations across two key dimensions: (1) Difficulty Level: We define three difficulty grades (Easy, Medium, and Hard) by adjusting the maze size (e.g., expanding from 5×5 to 7×7), modifying the number of maze branches, and adding obstacles; (2) Maze Texture: We vary the textures of maze obstacles, paths, and other components using textures generated via procedural methods and generative models, which exposes the policies to a broad visual distribution and mitigates overfitting to clean, synthetic environments.
VR-Bench includes five different puzzle games, each testing different aspects of visual reasoning:
- Regular Maze: Basic spatial navigation and path planning in grid-based mazes
- Sokoban: Push boxes to target positions, requiring understanding of object interactions and push mechanics (highest logical difficulty)
- 3D Maze: Multi-level maze with height and occlusion, testing reasoning ability in 3D space
- PathFinder (Irregular Maze): Navigate through irregular mazes with curved paths, testing pure visual perception without coordinate memory
- TrapField: Navigate from start to goal while avoiding specific trap regions, testing constraint-based reasoning
- Procedural Generation: Automatically generate diverse puzzle levels with configurable difficulty
- Texture Customization: Support for custom visual themes through texture skins
- Video Rendering: Generate solution videos with smooth animations (24 FPS)
- VLM Evaluation: Built-in framework for testing various VLMs (GPT, Gemini, Qwen, etc.)
- Comprehensive Metrics: SR (Success Rate), PR (Precision Rate), SD (Step Deviation), EM (Exact Match), MF (Mask Fidelity)
- Parallel Processing: Multi-threaded generation and evaluation for efficiency
- Deduplication: Automatic detection and removal of duplicate levels
- Python >= 3.10
- CUDA-compatible GPU (optional, for local VLM inference)
# Clone the repository
git clone https://github.com/ImYangC7/VR-Bench.git
cd VR-Bench
# Install dependencies
pip install -r requirements.txt

# Download the pre-generated dataset from Hugging Face
python dataset_init.py --output-dir ./dataset_VR

# Option A: call Python directly
# Edit config/config.yaml to configure game type, skins_root, output_root, and difficulties
python -m generation.batch_generate config/config.yaml
python generation/generate_videos.py <DATASET_DIR> --workers <N> --skin <SKIN_PATH>
# Option B: use the helper shell scripts (equivalent to the above)
bash scripts/generate_by_skins.sh config/config.yaml
bash scripts/generate_videos.sh <DATASET_DIR> [workers]

We use DiffSynth-Studio for diffusion model training and inference. To install it:
git clone https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio
pip install -e .

After installation, make sure to update the dataset paths, hyperparameters, and output directory in the training script before launching your experiment.
Here is a reference configuration:
accelerate launch examples/wanvideo/model_training/train.py \
--dataset_base_path data/example_video_dataset \
--dataset_metadata_path data/example_video_dataset/metadata.csv \
--height 512 \
--width 512 \
--num_frames 193 \
--dataset_repeat 100 \
--model_id_with_origin_paths "Wan-AI/Wan2.2-TI2V-5B:diffusion_pytorch_model*.safetensors,Wan-AI/Wan2.2-TI2V-5B:models_t5_umt5-xxl-enc-bf16.pth,Wan-AI/Wan2.2-TI2V-5B:Wan2.2_VAE.pth" \
--learning_rate 1e-4 \
--num_epochs 5 \
--remove_prefix_in_ckpt "pipe.dit." \
--output_path "./models/train/Wan2.2-TI2V-5B_lora" \
--lora_base_model "dit" \
--lora_target_modules "q,k,v,o,ffn.0,ffn.2" \
--lora_rank 32 \
--extra_inputs "input_image"

Edit the script above with your specific data locations.
After training your model, you can run inference with the provided script:
- Copy the inference script from VR-Bench into DiffSynth-Studio:

  cp VR-Bench/scripts/Wan2.2-TI2V-5B_lora.py DiffSynth-Studio/examples/wanvideo/model_inference/

- Update paths: edit the copied script to match your setup:
  - Update the LoRA checkpoint path
  - Update the input image path
  - Update the output video path
  - Customize the prompt as needed

- Run inference:

  cd DiffSynth-Studio/examples/wanvideo/model_inference/
  python Wan2.2-TI2V-5B_lora.py
The script will generate videos based on your trained model and save them to the specified output directory.
# Evaluate generated videos against GT trajectories (auto-matches difficulties)
bash scripts/videomodel_evaluate.sh
# Or run directly
# DATASET_DIR = GT dataset root, OUTPUT_DIR = model outputs, RESULT_DIR = eval outputs
python evaluation/videomodel_eval/batch_evaluate.py \
    DATASET_DIR OUTPUT_DIR RESULT_DIR \
    --gpu  # optional

The trajectory extraction system supports three tracking algorithms, selectable via `--tracker-type`:
| Tracker | Parameter | Algorithm | Best For |
|---|---|---|---|
| NCC | `ncc` | Normalized Cross-Correlation | Fixed-appearance targets (default, recommended) |
| Optical Flow | `optical_flow` | Lucas-Kanade Sparse Optical Flow | Smooth continuous motion |
| CSRT | `csrt` | Discriminative Correlation Filter | Deformable targets, partial occlusion |
NCC Tracker (Default, Recommended)
- Algorithm: Template matching using `cv2.TM_CCOEFF_NORMED` (normalized correlation coefficient); see the sketch after this list
- Pros: Fast, highly accurate for fixed-appearance objects, more stable trajectory extraction
- Cons: Sensitive to rotation/scale changes
- Best for: Puzzle game videos where player icons have fixed appearance (our main use case)
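To make the NCC idea concrete, here is a minimal OpenCV sketch (not the benchmark's actual extraction code): the player template is matched inside a window around its previous position, and a margin of 0 falls back to full-image search, mirroring `--search-margin 0`.

```python
import cv2
import numpy as np

def ncc_step(frame_gray: np.ndarray, template_gray: np.ndarray,
             prev_xy: tuple[int, int], margin: int = 50) -> tuple[int, int]:
    """One tracking step: find the template near its previous top-left position
    via normalized cross-correlation; margin <= 0 searches the whole frame."""
    th, tw = template_gray.shape
    H, W = frame_gray.shape
    if margin > 0:
        x0, y0 = max(0, prev_xy[0] - margin), max(0, prev_xy[1] - margin)
        x1, y1 = min(W, prev_xy[0] + tw + margin), min(H, prev_xy[1] + th + margin)
    else:
        x0, y0, x1, y1 = 0, 0, W, H
    window = frame_gray[y0:y1, x0:x1]
    scores = cv2.matchTemplate(window, template_gray, cv2.TM_CCOEFF_NORMED)
    _, _, _, best = cv2.minMaxLoc(scores)
    return (x0 + best[0], y0 + best[1])  # new top-left corner of the player
```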
Optical Flow Tracker
- Algorithm: Lucas-Kanade pyramid optical flow tracking feature points
- Pros: Handles continuous motion well, computationally efficient
- Cons: May drift over long sequences, requires good feature points
- Best for: Smooth trajectory videos with gradual movements
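A rough sketch of the Lucas-Kanade variant (again illustrative, not the repository's exact code): feature points on the player are propagated frame to frame and filtered by the tracker's status mask.

```python
import cv2
import numpy as np

def lk_step(prev_gray: np.ndarray, next_gray: np.ndarray,
            points: np.ndarray) -> np.ndarray:
    """Propagate tracked points one frame forward with pyramidal Lucas-Kanade.
    `points` is an (N, 1, 2) float32 array; only successfully tracked points are kept."""
    next_pts, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_gray, next_gray, points, None, winSize=(21, 21), maxLevel=3)
    return next_pts[status.ravel() == 1]
```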
CSRT Tracker
- Algorithm: Channel and Spatial Reliability Tracking (OpenCV built-in)
- Pros: Robust to partial occlusion and deformation
- Cons: May occasionally lose the target in maze environments (e.g., Sokoban), slower, requires `opencv-contrib-python`
- Best for: General-purpose tracking with appearance changes (see the sketch below)
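A minimal CSRT usage sketch is shown below; it requires `opencv-contrib-python`, and on some OpenCV builds the factory is exposed as `cv2.legacy.TrackerCSRT_create` rather than `cv2.TrackerCSRT_create`.

```python
import cv2

def csrt_track(frames, init_bbox):
    """Follow one target through a list of BGR frames with OpenCV's CSRT tracker.
    init_bbox is (x, y, w, h); a None entry in the output marks a lost target."""
    tracker = cv2.TrackerCSRT_create()
    tracker.init(frames[0], init_bbox)
    boxes = [init_bbox]
    for frame in frames[1:]:
        ok, bbox = tracker.update(frame)
        boxes.append(bbox if ok else None)
    return boxes
```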
Usage Examples:
# Use default NCC tracker (default search margin 50px)
python evaluation/videomodel_eval/batch_evaluate.py DATASET OUTPUT RESULT
# Use NCC with full-image search
python evaluation/videomodel_eval/batch_evaluate.py DATASET OUTPUT RESULT \
--tracker-type ncc --search-margin 0
# Use optical flow tracker
python evaluation/videomodel_eval/batch_evaluate.py DATASET OUTPUT RESULT \
--tracker-type optical_flow
# Use CSRT tracker
python evaluation/videomodel_eval/batch_evaluate.py DATASET OUTPUT RESULT \
--tracker-type csrt

To run the VLM evaluation pipeline:

- Configure environment: `cp .env.example .env` and fill in API keys, dataset paths, CUDA settings, etc.
- (Optional, for local models) Start the VLM service: `bash scripts/start_sglang_server.sh`
- Run VLM evaluation on the dataset results: `bash scripts/run_vlm_eval.sh`

The evaluation reports the following metrics:

- PR (Precision Rate): Fraction of resampled points that stay within a small tolerance of the GT path; measures path-shape consistency.
- SR (Success Rate): Whether the generated trajectory (player, or box for Sokoban) enters the goal bounding box at least once.
- SD (Step Deviation): Relative path-length overrun vs. GT (`len_gen / len_gt - 1`); only defined when SR = 1, and non-negative.
- EM (Exact Match): Perfect flag (1/0) when PR exceeds a threshold and |SD| is small, conditioned on SR = 1.
- MF (Mask Fidelity): Background stability score [0,1]; compares sampled frames to the first frame while masking start/goal/player regions.
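To make the SD and EM definitions concrete, here is a small illustrative sketch; the threshold values are assumptions for illustration only, not the benchmark's actual settings.

```python
def step_deviation(len_gen: float, len_gt: float, sr: int) -> float | None:
    """SD: relative path-length overrun vs. GT; reported only when SR = 1."""
    if sr != 1:
        return None
    return len_gen / len_gt - 1.0

def exact_match(pr: float, sd: float | None, sr: int,
                pr_thresh: float = 0.95, sd_thresh: float = 0.05) -> int:
    """EM: 1 iff SR = 1, PR exceeds a threshold, and |SD| is small
    (pr_thresh and sd_thresh here are illustrative assumptions)."""
    if sr != 1 or sd is None:
        return 0
    return int(pr >= pr_thresh and abs(sd) <= sd_thresh)
```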
VR-Bench/
├── core/ # Core framework
├── games/ # Game implementations
├── generation/ # Dataset generation
├── evaluation/
│ ├── videomodel_eval/ # Evaluate video models’ trajectory reasoning
│ └── vlm_eval/ # Evaluate VLMs’ planning / action reasoning
├── config/ # Generation & evaluation configs
├── skins/ # Texture assets
└── scripts/ # Utility scripts
- `game_type`: Game to generate (maze, sokoban, pathfinder, trapfield, maze3d)
- `skins_root`: Path to texture assets
- `difficulties`: Difficulty levels and parameters
- `generation.max_attempts`: Max attempts to generate a valid level
- `parallel.max_workers`: Number of parallel workers
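For orientation, a hypothetical `config/config.yaml` using these keys might look as follows; the per-difficulty fields and values are illustrative assumptions, so treat the shipped config as authoritative.

```yaml
game_type: maze              # maze | sokoban | pathfinder | trapfield | maze3d
skins_root: ./skins/maze
output_root: ./dataset_VR
difficulties:                # illustrative: sizes grow from easy to hard (e.g., 5x5 -> 7x7)
  easy:   {grid_size: 5}
  medium: {grid_size: 6}
  hard:   {grid_size: 7}
generation:
  max_attempts: 100
parallel:
  max_workers: 8
```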
- `game`: Game type to evaluate
- `dataset`: Path to dataset
- `models`: List of VLMs to test
- `workers`: Number of parallel evaluation workers
- `max_levels`: Maximum levels to evaluate (-1 for all)
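Similarly, a hedged sketch of an evaluation config built from the keys above (model identifiers and the dataset path are placeholders):

```yaml
game: maze
dataset: ./dataset_VR/maze
models:
  - gpt-4o            # placeholder model names
  - qwen2.5-vl
workers: 4
max_levels: -1        # -1 evaluates all levels
```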
Each game supports custom texture skins for visual variety:
- Create a new folder under `skins/<game_name>/`
- Add required texture images (PNG/JPG format)
- Specify the skin path in configuration
Required texture files vary by game. Refer to existing skin folders for examples.
- Maze: wall, floor, player, goal
- Sokoban: wall, floor, player, box, target
- PathFinder: Custom background and path textures
- TrapField: floor, trap, player, goal
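As an illustration only (the exact file names may differ; the existing folders under `skins/` are authoritative), a Maze skin might be laid out like this:

```
skins/maze/forest/     # hypothetical skin name
├── wall.png
├── floor.png
├── player.png
└── goal.png
```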
VR-Bench uses an adapter pattern for easy extensibility:
- Create a new game directory under `games/`
- Implement the `GameAdapter` interface:
  - `generate_level()`: Level generation logic
  - `save_level()`: Save level data and render outputs
  - `get_level_hash()`: For deduplication
  - `is_duplicate()`: Duplicate detection
- Implement game-specific logic and rendering
- Create an executor in `evaluation/vlm_eval/executors/`
- Register in `generation/batch_generate.py`
See existing game implementations for reference.
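A minimal skeleton of the adapter interface described above is sketched below; the exact method signatures in the codebase may differ, so follow an existing game under `games/` for the authoritative shapes.

```python
import hashlib
import json

class MyPuzzleAdapter:  # hypothetical game; methods follow the GameAdapter interface above
    def generate_level(self, difficulty: str) -> dict:
        """Produce a level description (layout, start, goal) for the given difficulty."""
        raise NotImplementedError

    def save_level(self, level: dict, output_dir: str) -> None:
        """Persist level data and render images/videos into output_dir."""
        raise NotImplementedError

    def get_level_hash(self, level: dict) -> str:
        """Stable hash of the layout, used for deduplication."""
        return hashlib.sha256(json.dumps(level, sort_keys=True).encode()).hexdigest()

    def is_duplicate(self, level: dict, seen_hashes: set[str]) -> bool:
        """True if an identical layout has already been generated."""
        return self.get_level_hash(level) in seen_hashes
```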
Issue: CUDA out of memory during VLM inference
- Solution: Reduce batch size or use tensor parallelism with multiple GPUs
Issue: Video generation fails
- Solution: Ensure ffmpeg is installed: `pip install imageio-ffmpeg`
Issue: API rate limiting
- Solution: Reduce `workers` in the evaluation config or add delays
Issue: Duplicate levels generated
- Solution: Increase `max_duplicate_retries` in the generation config
If you use VR-Bench in your research, please cite:
@article{yang2025vrbench,
title={Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks},
author={Cheng Yang and Haiyuan Wan and Yiran Peng and Xin Cheng and Zhaoyang Yu and Jiayi Zhang and Junchi Yu and Xinlei Yu and Xiawu Zheng and Dongzhan Zhou and Chenglin Wu},
journal={arXiv preprint arXiv:2511.15065},
year={2025}
}

Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
VR-Bench builds upon various open-source projects and research in visual reasoning and VLM evaluation.
For questions and feedback, please open an issue on GitHub or contact the maintainers.