We introduce ACuRL, an Autonomous Curriculum Reinforcement Learning framework that steers agents toward continual learning in target environments with zero human data. To provide reliable reward signals during RL, we also introduce CUAJudge, a robust automatic evaluator for computer-use agents (CUAs) that achieves 93% agreement with human judgments.
conda create -n ACuRL python=3.10
conda activate ACuRL
pip install -r requirements.txt
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl --no-cache-dir
pip install -e .

To ensure stable large-scale parallel execution, we recommend using a CPU server with at least 96 CPU cores and 384 GB RAM as the environment host. This configuration can reliably support up to 128 concurrent environments.
Please refer to this guideline for detailed instructions on how to set up the server.
To allow the training process to connect to the environment server, update the configuration file at:
./data/config_examples/environment_config.json
Set api_base_url to the IP address of your environment-hosting CPU server.
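For example, the relevant entry might look like the snippet below. The placeholders are illustrative; keep any other fields already present in the file unchanged:

```json
{
  "api_base_url": "http://<ENV_SERVER_IP>:<PORT>"
}
```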
CUAJudge supports both API-only deployment and a hybrid setup that combines APIs with open-source models to reduce evaluation cost.
In particular, the key screenshot identification stage can be offloaded to an open-source vision-language model (e.g., Qwen3-VL-8B) while keeping other stages served via APIs.
If you choose to use CUAJudge purely through APIs (e.g., gpt-5-mini or other private models), simply specify the corresponding model names in the training script:
cuajudge_key_model
cuajudge_outcome_model
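For instance, an API-only setup could point both options at the same API model (the model name below is just the example mentioned above):

```bash
cuajudge_key_model=gpt-5-mini
cuajudge_outcome_model=gpt-5-mini
```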
To reduce cost, you can replace the key screenshot identification model with an open-source VLM served via vLLM, while keeping the remaining components API-based.
Start a vLLM server for the key screenshot identification model:
vllm serve <MODEL_PATH> \
--served-model-name qwen3-vl-8b \
--data-parallel-size 4 \
--trust-remote-code \
--limit-mm-per-prompt.video 0 \
--max-model-len 8k \
--max-num-batched-tokens 8k

Modify the configuration generation script at: ./scripts/ACuRL/create_config.sh.
Update the following options:
use_vllm_for_key_screenshot=true
vllm_base_url=<VLLM_SERVER_URL>
cuajudge_key_model=<SERVED_MODEL_NAME>
{
"use_vllm_for_key_screenshot": true,
"vllm_base_url": "http://<IP>:8000/v1",
"cuajudge_key_model": "qwen3-vl-8b"
}
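As a quick sanity check, you can verify that a generated config contains the keys shown above before launching training. This is an illustrative helper, not part of the codebase; the required keys simply mirror the JSON example:

```python
import json

# Required keys mirror the hybrid-CUAJudge JSON example above.
REQUIRED_KEYS = {"use_vllm_for_key_screenshot", "vllm_base_url", "cuajudge_key_model"}

def check_cuajudge_config(config):
    """Return a list of problems found in the config (empty list means OK)."""
    problems = ["missing key: %s" % k for k in sorted(REQUIRED_KEYS - config.keys())]
    if config.get("use_vllm_for_key_screenshot") and not str(
        config.get("vllm_base_url", "")
    ).startswith("http"):
        problems.append("vllm_base_url must be an http(s) URL when vLLM is enabled")
    return problems

example = json.loads("""
{
  "use_vllm_for_key_screenshot": true,
  "vllm_base_url": "http://<IP>:8000/v1",
  "cuajudge_key_model": "qwen3-vl-8b"
}
""")
print(check_cuajudge_config(example))  # an empty list means the config is valid
```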
ACuRL consists of multiple stages: Environment Exploration, Context Review, Capability Evaluation, Curriculum Task Generation, and iterative RL training. The agent first interacts with the target environment to collect initial experience, then improves through iterative RL on curriculum tasks whose difficulty is tailored to the agent's current capability based on feedback from CUAJudge.
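The stages above can be sketched as the following loop. Every function here is a trivial illustrative stub so the control flow runs end to end; none of these names is the repository's actual API:

```python
# Illustrative sketch of the ACuRL loop; all functions are hypothetical stubs.

def explore_environment(env):
    # Environment Exploration: collect initial experience from the target env.
    return {"env": env}

def review_contexts(env):
    # Context Review: gather diverse user-created contexts.
    return ["context-a", "context-b"]

def evaluate_capability(agent):
    # Capability Evaluation: score the agent's current proficiency.
    return agent["skill"]

def generate_curriculum(experience, contexts, capability):
    # Curriculum Task Generation: difficulty tailored to current capability.
    return ["difficulty-%d" % (capability + 1) for _ in contexts]

def rl_train(agent, tasks):
    # RL on curriculum tasks, rewarded by CUAJudge (stub: skill increases).
    return {"skill": agent["skill"] + len(tasks)}

def acurl_training(agent, env, num_iterations):
    experience = explore_environment(env)
    contexts = review_contexts(env)
    for _ in range(num_iterations):
        capability = evaluate_capability(agent)
        tasks = generate_curriculum(experience, contexts, capability)
        agent = rl_train(agent, tasks)
    return agent

print(acurl_training({"skill": 0}, "libreoffice_impress", num_iterations=3))
```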
The provided scripts support multi-node training, using Ray to connect the nodes and submit jobs.
- Multi node:
- Step 1 (head node): Start the Ray head and record the head node's IP:
ray start --head --dashboard-host=0.0.0.0
- Step 2 (worker nodes): On each worker, join the head:
ray start --address="HEAD_IP:6379"
- Step 3 (job submission): Submit training jobs to the head node.
If you only need to run on a single node, Ray is not required.
- Single-Node Setup (No Ray): Run locally without Ray by removing the Ray-related code in the scripts.
- Remove the Ray job submission arguments:
ray job submit --address="http://127.0.0.1:8265"
- Then run the training entry point directly:
python -m verl.trainer.main_ppo
This stage collects environment-specific experience within the target environment, covering its interface and functionality, so that the task generator can synthesize high-quality and valid tasks.
Run the following command to collect experience for a specific environment:
bash ./scripts/environment_exploration.sh SOFTWARE_NAME
Note: ./data/tasks/examples/libreoffice_impress/environment_exploration.json is the corresponding task configuration file for initializing the environment.
Conditioning the task generator on diverse user-created contexts significantly increases task diversity and better captures the complexity of real-world user requests.
Run the following command to let the agent review different contexts:
bash ./scripts/context_review.sh SOFTWARE_NAME
Note: ./data/tasks/examples/libreoffice_impress/context_review/*.json are the corresponding task configuration files for initializing the environment.
ACuRL training proceeds through iterative RL. At the end of each iteration, we conduct a capability evaluation to assess the current agent's proficiency. The evaluation results are then used by the curriculum generator to adjust task difficulty, tailoring subsequent training tasks to the agent's current capabilities and thereby enabling effective continual learning.
Run the following script to conduct ACuRL Training:
bash ./scripts/ACuRL/run.sh
During training, ACuRL automatically generates task configurations and maintains task indices for each iteration.
- Task configuration files: generated task configuration files are stored at:
./data/tasks/examples/<SOFTWARE>/<RUN_NAME>/
- Task indices per iteration: the task IDs used for training in each iteration are recorded at:
./data/tasks/task_index/<SOFTWARE>/<RUN_NAME>/
The curriculum generation logic is implemented in:
./curriculum_task_generation/curriculum_task_generator.py
After each iteration, task-level performance statistics are computed by:
./curriculum_task_generation/calculate_performance.py
This script aggregates results from the Capability Evaluation and produces task-level performance feedback, which is then used to guide the next round of curriculum generation.
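As a rough illustration of this feedback (a sketch under assumed data shapes, not the actual logic in calculate_performance.py), per-rollout CUAJudge outcomes can be aggregated into task-level success rates like this:

```python
from collections import defaultdict

# Aggregate (task_id, passed) rollout outcomes into per-task success rates.
# This is an illustrative sketch; the real script's inputs may differ.

def task_success_rates(rollouts):
    """rollouts: iterable of (task_id, passed: bool) pairs."""
    totals = defaultdict(lambda: [0, 0])   # task_id -> [passes, attempts]
    for task_id, passed in rollouts:
        totals[task_id][0] += int(passed)
        totals[task_id][1] += 1
    return {t: p / n for t, (p, n) in totals.items()}

rates = task_success_rates([
    ("slide-edit-01", True), ("slide-edit-01", False),
    ("export-pdf-02", True), ("export-pdf-02", True),
])
print(rates)  # {'slide-edit-01': 0.5, 'export-pdf-02': 1.0}
```

Tasks with low success rates would then be signals for the curriculum generator to keep (or simplify) similar tasks, while saturated tasks can be made harder.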
You can directly run the following scripts to evaluate saved models in different formats.
- To evaluate a model saved in FSDP format:
./scripts/fsdp_model_evaluation.sh
- To evaluate a model saved in HF format:
./scripts/hf_model_evaluation.sh
To facilitate transparency and enable apples-to-apples comparisons within the community, we release our evaluation results here.
Our codebase is built upon veRL and verl-agent. The supported environments are adapted from OSWorld, Scienceboard, OfficeWorld. We extend our gratitude to the authors and contributors of these projects for their valuable work.
We also thank UI-TARS and QwenVL for providing open-source resources.
If you find our work or codebase useful in your research or applications, we kindly ask that you cite our work.
@misc{xue2026autonomouscontinuallearningcomputeruse,
title={Autonomous Continual Learning of Computer-Use Agents for Environment Adaptation},
author={Tianci Xue and Zeyi Liao and Tianneng Shi and Zilu Wang and Kai Zhang and Dawn Song and Yu Su and Huan Sun},
year={2026},
eprint={2602.10356},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2602.10356},
}
