If you encounter any difficulties in using the code, please contact us at [email protected] or [email protected].
AutoEnv is an automated environment infrastructure for language-model agents, designed to scale both across environments and within each environment. Instead of hand-crafting a few fixed tasks, AutoEnv factorizes an environment into reward rules, transition dynamics, and observation “skins,” so that the same core world can be instantiated with different rule distributions and presentations (text-only, tabular, grid-based, etc.).
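As a rough illustration of this factorization, an environment can be viewed as three pluggable components; the class and field names below are purely illustrative and are not AutoEnv's actual API:

```python
# Conceptual sketch of the reward / transition / skin factorization described above.
# All names here are illustrative; they do not correspond to AutoEnv's real classes.
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class EnvSpec:
    reward_rules: Callable[[Any, Any, Any], float]  # (state, action, next_state) -> reward
    transition: Callable[[Any, Any], Any]           # (state, action) -> next_state
    skin: Callable[[Any], str]                      # state -> observation (text, table, grid, ...)


def step(spec: EnvSpec, state: Any, action: Any) -> tuple[str, float]:
    """One interaction step: fixed dynamics and rewards, swappable presentation."""
    next_state = spec.transition(state, action)
    reward = spec.reward_rules(state, action, next_state)
    return spec.skin(next_state), reward
```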
Figure 1. Process of automating environment generation in AutoEnv.

Our long-term goal is to provide a unified way to automatically expand environments from text themes to richer modalities, including multimodal settings and 3D game worlds, while also scaling data inside each environment via level generators, validators, and large numbers of interaction trajectories.
Built on top of this infrastructure, we run cross-environment learning experiments with agents on the environments constructed by AutoEnv, and the results reveal robustness limitations in current agent learning methods. Beyond the original paper, however, AutoEnv is intended to be a general research platform—for studying environment generation, agent learning, reward design, and scaling laws in interactive worlds.
Figure 2. Impact of environment diversity on learning performance.

Using AutoEnv, we generate 36 environments with fully distinct rule sets, forming the AutoEnv-36 dataset. These environments are represented in text, and each environment contains 10 test levels and 5 validation levels. We provide the source code together with level generation scripts in the benchmarks directory.
This example shows two observation “skins” for the same underlying gridworld. On the left, symbols like #, ., $, ^, and @ follow one semantic mapping (e.g., walls, free cells, goals, hazards, agent), while on the right we systematically invert this mapping (e.g., swapping walls and free space) without changing the true transition or reward rules. By comparing agent performance across these two views, we can test whether an agent is actually learning the environment dynamics rather than relying on fixed prior assumptions about what each symbol should mean.
```
###################### ......................
#....##....#....##...# .####..####.####..###.
#..$.....###.....#...# .##$#####...#####.###.
#..###....#....###..^# .##...####.####...##^.
##..#.^....##..#..#..# ..##.#^####..##.##.##.
#...##..##..^.##..#..# .###..##..##^#..##.##.
#.#...........@.#..#.# .#.###########@#.##.#.
#..##..^.#..##....#..# .##..##^#.##..####.##.
#.....##....#..##....# .#####..####.##..####.
###################### ......................
```
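Such a skin swap can be expressed as a pure symbol substitution that leaves the dynamics and rewards untouched. The sketch below mirrors the wall/free-space swap shown above; it is an illustration, not the mapping used by AutoEnv's skin pipelines:

```python
# Illustrative re-skinning of a text observation by remapping symbols only.
# The specific swap below (walls <-> free space) matches the example above;
# it is not AutoEnv's actual skin mapping.
SKIN_SWAP = {"#": ".", ".": "#"}  # $, ^, @ and newlines pass through unchanged


def reskin(observation: str, mapping: dict[str, str] = SKIN_SWAP) -> str:
    """Apply a character-level skin to a grid observation string."""
    return "".join(mapping.get(ch, ch) for ch in observation)


assert reskin("#..$..@..#") == ".##$##@##."
```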
We generate multimodal skins for a subset of the environments in AutoEnv-36, as listed below:
Figure 3. Multimodal skins generated for AutoEnv-36.

We also generate multimodal skins based on the rules of the same maze. The generated multimodal environments are listed here:
Figure 4. Multimodal skin generated based on one rule.

- Add Environments Level Scaling Feature.
- Add Skin Control pipeline abstraction.
- Add Multimodal Environment Generation Pipelines.
- Add Three-stage Verification Pipeline for both text and multimodal environments.
- Add Learning Experiments Scripts.
- Add Coding Agents Option: Codex, Gemini CLI.
- Add 3D Environment Generation Pipelines.
```
AutoEnv/
├── autoenv/      # Environment generation logic and pipelines
├── base/         # Core abstractions (LLM client, pipeline, env)
├── benchmarks/   # AutoEnv-36 benchmark environments
├── config/       # Configuration files
├── scripts/      # Utility scripts
└── workspace/    # Runtime outputs (envs, logs, costs)
```
```bash
pip install -r requirements.txt
```

Python 3.11+ recommended.
Fill config/model_config.yaml with your model names and endpoints.
Environment Generation (run_environment_generation.py): Generates text-based game environments from theme descriptions.
```bash
cp config/env_gen_example.yaml config/env_gen.yaml
# Edit config/env_gen.yaml with your settings
python run_environment_generation.py
```

Skin Generation (run_environment_skin_generation.py): Generates visual assets for existing environments or from text instructions.
```bash
cp config/env_skin_gen_example.yaml config/env_skin_gen.yaml
# Edit config/env_skin_gen.yaml with your settings
python run_environment_skin_generation.py
```

Cost summaries are automatically saved to workspace/costs/.
Evaluate agents on the 36 benchmark environments (scores are reported for all agents; cost is tracked only for the LLM branch). See benchmarks/README.md for details.
- Built-in SolverAgent + LLMs (cost tracked):

  ```bash
  python benchmarks/run.py \
    --config config/benchmark/bench_llm_example.yaml \
    --mode test \
    --max-worlds 5
  ```

  `--mode` switches between `levels/` and `val_levels/`; `--max-worlds` limits the number of worlds per environment.

- Custom agent (score only): implement `run(env, env_info)` (a minimal example is sketched after this list), then run:

  ```bash
  python benchmarks/run.py \
    --agent your_module:YourAgentAttr \
    --agent-kwargs '{"foo": 1}' \
    --mode val
  ```

  `--agent` accepts `module:Attr` or `/path/to/file.py:Attr`; `Attr` can be a class, a factory, or a pre-built instance.
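A minimal custom agent might look like the sketch below. Only the run(env, env_info) entry point is taken from the instructions above; the environment methods and env_info keys used inside are assumptions made for illustration, so check benchmarks/README.md for the actual interface:

```python
# your_module.py -- minimal sketch for `--agent your_module:RandomAgent`.
# Only run(env, env_info) is taken from the docs; env.reset(), env.step(), and the
# env_info["actions"] key are assumptions made for illustration.
import random


class RandomAgent:
    """Takes random actions; a trivial scoring baseline."""

    def __init__(self, max_steps: int = 100, foo: int = 0):
        # Extra constructor arguments can be supplied via --agent-kwargs '{"foo": 1}'.
        self.max_steps = max_steps
        self.foo = foo

    def run(self, env, env_info):
        actions = env_info.get("actions", [])  # assumed: env_info describes the action space
        obs = env.reset()                      # assumed env API
        for _ in range(self.max_steps):
            if not actions:
                break
            obs, reward, done = env.step(random.choice(actions))  # assumed step signature
            if done:
                break
```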
Programmatic APIs are available in benchmarks/api.py (benchmark_llms, benchmark_custom_agent).
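A hedged usage sketch of these APIs (the keyword arguments below are guesses for illustration; see benchmarks/api.py for the real signatures):

```python
# Hypothetical usage of the programmatic benchmark APIs; only the function names
# come from the README, so the argument names here are assumptions.
from benchmarks.api import benchmark_custom_agent, benchmark_llms

from your_module import RandomAgent  # e.g. the custom agent sketched above

llm_results = benchmark_llms(config="config/benchmark/bench_llm_example.yaml", mode="test")
agent_results = benchmark_custom_agent(RandomAgent(), mode="val", max_worlds=5)
```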
- Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks
Thanks to mini-swe-agent, codex, and rembg for providing basic support for this project!
If you find AutoEnv useful, please consider citing our work:
```bibtex
@article{zhang2025autoenv,
  title={AutoEnv: Automated Environments for Measuring Cross-Environment Agent Learning},
  author={Zhang, Jiayi and Peng, Yiran and Kong, Fanqi and Cheng, Yang and Wu, Yifan and Yu, Zhaoyang and Xiang, Jinyu and Ruan, Jianhao and Wang, Jinlin and Song, Maojia and others},
  journal={arXiv preprint arXiv:2511.19304},
  year={2025}
}
```



