If you encounter any difficulties in using the code, please contact us at [email protected] or [email protected].
AutoEnv is an automated environment infrastructure for language-model agents, designed to scale both across environments and within each environment. Instead of hand-crafting a few fixed tasks, AutoEnv factorizes an environment into reward rules, transition dynamics, and observation “skins,” so that the same core world can be instantiated with different rule distributions and presentations (text-only, tabular, grid-based, etc.).
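As a rough illustration of this factorization, an environment can be viewed as three pluggable components; the class and field names below are purely illustrative and are not AutoEnv's actual API:

```python
# Conceptual sketch of the reward / transition / skin factorization described above.
# All names here are illustrative; they do not correspond to AutoEnv's real classes.
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class EnvSpec:
    reward_rules: Callable[[Any, Any, Any], float]  # (state, action, next_state) -> reward
    transition: Callable[[Any, Any], Any]           # (state, action) -> next_state
    skin: Callable[[Any], str]                      # state -> observation (text, table, grid, ...)


def step(spec: EnvSpec, state: Any, action: Any) -> tuple[str, float]:
    """One interaction step: fixed dynamics and rewards, swappable presentation."""
    next_state = spec.transition(state, action)
    reward = spec.reward_rules(state, action, next_state)
    return spec.skin(next_state), reward
```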
Figure 1. Process of automating environment generation in AutoEnv.

Our long-term goal is to provide a unified way to automatically expand environments from text themes to richer modalities, including multimodal settings and 3D game worlds, while also scaling data inside each environment via level generators, validators, and large numbers of interaction trajectories.
Built on top of this infrastructure, we run cross-environment learning experiments with agents on the environments constructed by AutoEnv, and the results reveal robustness limitations in current agent learning methods. Beyond the original paper, however, AutoEnv is intended to be a general research platform—for studying environment generation, agent learning, reward design, and scaling laws in interactive worlds.
Figure 2. Impact of environment diversity on learning performance.

Using AutoEnv, we generate 36 environments with fully distinct rule sets, forming the AutoEnv-36 dataset. These environments are represented in text, and each environment contains 10 test levels and 5 validation levels. We provide the source code together with level generation scripts in the benchmarks directory.
This example shows two observation “skins” for the same underlying gridworld. On the left, symbols like #, ., $, ^, and @ follow one semantic mapping (e.g., walls, free cells, goals, hazards, agent), while on the right we systematically invert this mapping (e.g., swapping walls and free space) without changing the true transition or reward rules. By comparing agent performance across these two views, we can test whether an agent is actually learning the environment dynamics rather than relying on fixed prior assumptions about what each symbol should mean.
```
###################### ......................
#....##....#....##...# .####..####.####..###.
#..$.....###.....#...# .##$#####...#####.###.
#..###....#....###..^# .##...####.####...##^.
##..#.^....##..#..#..# ..##.#^####..##.##.##.
#...##..##..^.##..#..# .###..##..##^#..##.##.
#.#...........@.#..#.# .#.###########@#.##.#.
#..##..^.#..##....#..# .##..##^#.##..####.##.
#.....##....#..##....# .#####..####.##..####.
###################### ......................
```
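Such a skin swap can be expressed as a pure symbol substitution that leaves the dynamics and rewards untouched. The sketch below mirrors the wall/free-space swap shown above; it is an illustration, not the mapping used by AutoEnv's skin pipelines:

```python
# Illustrative re-skinning of a text observation by remapping symbols only.
# The specific swap below (walls <-> free space) matches the example above;
# it is not AutoEnv's actual skin mapping.
SKIN_SWAP = {"#": ".", ".": "#"}  # $, ^, @ and newlines pass through unchanged


def reskin(observation: str, mapping: dict[str, str] = SKIN_SWAP) -> str:
    """Apply a character-level skin to a grid observation string."""
    return "".join(mapping.get(ch, ch) for ch in observation)


assert reskin("#..$..@..#") == ".##$##@##."
```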
We generate multimodal skins for a subset of the environments in AutoEnv-36, as listed below:
Figure 3. Multimodal skins generated for AutoEnv-36.

We also generate multimodal skins based on the rules of the same maze. The generated multimodal environments are listed here:
Figure 4. Multimodal skin generated based on one rule.

- Add Environments Level Scaling Feature.
- Add Skin Control pipeline abstraction.
- Add Multimodal Environment Generation Pipelines.
- Add Three-stage Verification Pipeline for both text and multimodal environments.
- Add Learning Experiments Scripts.
- Add Coding Agents Option: Codex, Gemini CLI.
- Add 3D Environment Generation Pipelines.
```
AutoEnv/
├── autoenv/      # Environment generation logic and pipelines
├── base/         # Core abstractions (LLM client, pipeline, env)
├── benchmarks/   # AutoEnv-36 benchmark environments
├── config/       # Configuration files
├── scripts/      # Utility scripts
└── workspace/    # Runtime outputs (envs, logs, costs)
```
```bash
pip install -r requirements.txt
```

Python 3.11+ recommended.
Fill config/model_config.yaml with your model names and endpoints.
Environment Generation (run_environment_generation.py): Generates text-based game environments from theme descriptions.
```bash
cp config/env_gen_example.yaml config/env_gen.yaml
# Edit config/env_gen.yaml with your settings
python run_environment_generation.py
```

Skin Generation (run_environment_skin_generation.py): Generates visual assets for existing environments or from text instructions.
```bash
cp config/env_skin_gen_example.yaml config/env_skin_gen.yaml
# Edit config/env_skin_gen.yaml with your settings
python run_environment_skin_generation.py
```

Cost summaries are automatically saved to workspace/costs/.
Evaluate agents on the 36 benchmark environments (scores are reported for all agents; cost is tracked only for the LLM branch). See benchmarks/README.md for details.
- Built-in SolverAgent + LLMs (cost tracked):

  ```bash
  python benchmarks/run.py \
    --config config/benchmark/bench_llm_example.yaml \
    --mode test \
    --max-worlds 5
  ```

  `--mode` switches between `levels/` and `val_levels/`; `--max-worlds` limits the number of worlds per environment.

- Custom agent (score only): implement `run(env, env_info)` (a minimal example is sketched after this list), then run:

  ```bash
  python benchmarks/run.py \
    --agent your_module:YourAgentAttr \
    --agent-kwargs '{"foo": 1}' \
    --mode val
  ```

  `--agent` accepts `module:Attr` or `/path/to/file.py:Attr`; `Attr` can be a class, a factory, or a pre-built instance.
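A minimal custom agent might look like the sketch below. Only the run(env, env_info) entry point is taken from the instructions above; the environment methods and env_info keys used inside are assumptions made for illustration, so check benchmarks/README.md for the actual interface:

```python
# your_module.py -- minimal sketch for `--agent your_module:RandomAgent`.
# Only run(env, env_info) is taken from the docs; env.reset(), env.step(), and the
# env_info["actions"] key are assumptions made for illustration.
import random


class RandomAgent:
    """Takes random actions; a trivial scoring baseline."""

    def __init__(self, max_steps: int = 100, foo: int = 0):
        # Extra constructor arguments can be supplied via --agent-kwargs '{"foo": 1}'.
        self.max_steps = max_steps
        self.foo = foo

    def run(self, env, env_info):
        actions = env_info.get("actions", [])  # assumed: env_info describes the action space
        obs = env.reset()                      # assumed env API
        for _ in range(self.max_steps):
            if not actions:
                break
            obs, reward, done = env.step(random.choice(actions))  # assumed step signature
            if done:
                break
```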
Programmatic APIs are available in benchmarks/api.py (benchmark_llms, benchmark_custom_agent).
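A hedged usage sketch of these APIs (the keyword arguments below are guesses for illustration; see benchmarks/api.py for the real signatures):

```python
# Hypothetical usage of the programmatic benchmark APIs; only the function names
# come from the README, so the argument names here are assumptions.
from benchmarks.api import benchmark_custom_agent, benchmark_llms

from your_module import RandomAgent  # e.g. the custom agent sketched above

llm_results = benchmark_llms(config="config/benchmark/bench_llm_example.yaml", mode="test")
agent_results = benchmark_custom_agent(RandomAgent(), mode="val", max_worlds=5)
```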
- Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks
Thanks to mini-swe-agent, codex, and rembg for providing basic support for this project!
If you find AutoEnv useful, please consider citing our work:
```bibtex
@article{zhang2025autoenv,
  title={AutoEnv: Automated Environments for Measuring Cross-Environment Agent Learning},
  author={Zhang, Jiayi and Peng, Yiran and Kong, Fanqi and Cheng, Yang and Wu, Yifan and Yu, Zhaoyang and Xiang, Jinyu and Ruan, Jianhao and Wang, Jinlin and Song, Maojia and others},
  journal={arXiv preprint arXiv:2511.19304},
  year={2025}
}
```



