feat: Add TaskEvaluator for task generation inner loop by dzorlu · Pull Request #12 · fleet-ai/OpenEnv

dzorlu · 2026-03-15T18:31:14Z

Summary

Adds TaskEvaluator, which submits generated (prompt, verifier) pairs to the Fleet harness (POST /v1/jobs), polls for completion, and extracts per-model verifier scores
Inner loop for task-gen RL: generate a task → evaluate via the Fleet harness → get a variance/separation signal
Async polling with asyncio.sleep, plus model ID normalization to handle mismatches between provider-prefixed and bare model names

Test plan

Submit a task with a known verifier to the Fleet harness and verify that scores are returned
Verify that async polling doesn't block the event loop
Test model ID matching with and without the provider prefix

🤖 Generated with Claude Code

Runs k × m rollouts of generated tasks on Fleet environments. Given (prompt, verifier_code, env_key), creates FleetTaskEnv instances, runs agent loops with model inference, and returns structured results for reward computation (learnability variance + model separation). Used as the inner loop of the task-scaling RL pipeline in SkyRL.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
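The PR text names two reward signals, learnability variance and model separation, but does not show how they are computed. The sketch below is one plausible reading, not the PR's implementation: variance is taken over a single model's k rollout scores, and separation is the gap between the mean scores of two models. The function names and the shape of results_per_model (model ID mapped to a list of scores) are assumptions made for illustration.

```python
from statistics import mean, pvariance


def learnability_variance(scores):
    """Population variance of one model's rollout scores (assumed definition).

    A task every rollout passes or every rollout fails has zero variance.
    """
    return pvariance(scores) if scores else 0.0


def model_separation(results_per_model, weak, strong):
    """Mean score of the stronger model minus the weaker one (assumed definition)."""
    return mean(results_per_model[strong]) - mean(results_per_model[weak])


# Hypothetical scores for k = 4 rollouts on each of m = 2 models.
results_per_model = {
    "anthropic/claude-sonnet-4.5": [0.0, 1.0, 0.0, 1.0],
    "anthropic/claude-opus-4.5": [1.0, 1.0, 1.0, 1.0],
}

variance = learnability_variance(results_per_model["anthropic/claude-sonnet-4.5"])
separation = model_separation(
    results_per_model,
    weak="anthropic/claude-sonnet-4.5",
    strong="anthropic/claude-opus-4.5",
)
```

With these made-up scores, the weaker model passes half the time, so the task shows nonzero variance for that model and a positive gap between the two models.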

Instead of calling Anthropic directly and running a local agent loop, the evaluator now:

1. Imports the generated task via fleet.import_task()
2. Creates a harness job via fleet.create_job()
3. Polls for completion
4. Extracts per-session verifier scores from job sessions

Uses real Fleet model IDs (claude-sonnet-4.5, claude-opus-4.5) instead of the broken weak/strong mapping that required ANTHROPIC_API_KEY.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
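The four steps above can be sketched end to end against an in-memory stand-in. FakeFleet is hypothetical: its method names follow the ones mentioned on this page, but the argument lists, return shapes, and the "status", "model", and "score" keys are assumptions, not the Fleet SDK's actual interface.

```python
import asyncio
from collections import defaultdict


class FakeFleet:
    """In-memory stand-in for a Fleet client (signatures are assumed)."""

    def __init__(self):
        self._polls = 0
        self._models = []

    def import_task(self, prompt, verifier_code, env_key):
        return "task-1"

    def create_job(self, task_key, models):
        self._models = list(models)
        return "job-1"

    def get_job(self, job_id):
        # Report "running" on the first poll and "completed" afterwards.
        self._polls += 1
        return {"status": "completed" if self._polls >= 2 else "running"}

    def list_job_sessions(self, job_id):
        return [{"model": m, "score": 1.0} for m in self._models]


async def evaluate(fleet, prompt, verifier_code, env_key, models,
                   poll_interval_s=0.01, max_poll_time_s=1.0):
    # Step 1: import the generated task.
    task_key = fleet.import_task(prompt, verifier_code, env_key)
    # Step 2: create a harness job for the requested models.
    job_id = fleet.create_job(task_key, models)
    # Step 3: poll until the job completes or the time budget runs out.
    loop = asyncio.get_running_loop()
    deadline = loop.time() + max_poll_time_s
    while fleet.get_job(job_id)["status"] != "completed":
        if loop.time() >= deadline:
            raise TimeoutError(f"job {job_id} did not complete in time")
        await asyncio.sleep(poll_interval_s)
    # Step 4: group per-session verifier scores by model.
    results_per_model = defaultdict(list)
    for session in fleet.list_job_sessions(job_id):
        results_per_model[session["model"]].append(session["score"])
    return dict(results_per_model)


results = asyncio.run(
    evaluate(FakeFleet(), "prompt", "verifier", "env-key",
             ["anthropic/claude-sonnet-4.5", "anthropic/claude-opus-4.5"])
)
```

The stub's methods return immediately, so calling them inline is harmless here. A real client doing blocking HTTP inside these calls would still stall the event loop on every request.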

Fleet harness POST /v1/jobs requires model IDs in 'provider/model' format (e.g., 'anthropic/claude-sonnet-4.5'), not just the model name.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The sync time.sleep() in _poll_job blocked the asyncio event loop, preventing trajectory timeouts from cancelling evaluations. Using asyncio.sleep() allows the event loop to properly handle cancellations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
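A minimal illustration of why the switch matters, using only the standard library: a poll loop that awaits asyncio.sleep() hands control back to the event loop between checks, so an outer asyncio.wait_for() timeout can cancel it. The names poll_until_done and run_with_timeout are made up for this sketch and do not come from the PR.

```python
import asyncio


async def poll_until_done(is_done, interval_s=0.01):
    """Check is_done() repeatedly, yielding to the event loop between checks."""
    while not is_done():
        # A cancellation requested by the caller is delivered at this await.
        await asyncio.sleep(interval_s)
    return "done"


async def run_with_timeout(timeout_s):
    try:
        # The predicate never becomes true, so only the timeout can end the loop.
        return await asyncio.wait_for(
            poll_until_done(lambda: False), timeout=timeout_s
        )
    except asyncio.TimeoutError:
        return "cancelled"


outcome = asyncio.run(run_with_timeout(0.05))
```

If the loop body used time.sleep() instead, the coroutine would never reach an await, the timeout callback could not run, and this never-ending poll would hang rather than return "cancelled".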

Fleet returns session model IDs without provider prefix (e.g., "claude-sonnet-4.5") while we configure them with prefix ("anthropic/claude-sonnet-4.5"). Added _match_model_id() to normalize and match by bare model name, so scores land in the correct results_per_model bucket.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
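The body of _match_model_id() is not shown on this page. The following is a guess at the kind of normalization the commit message describes: strip everything up to the last "/" and compare bare names. The helper names and the choice to return None for an unknown model are assumptions.

```python
def bare_model_name(model_id):
    """Drop an optional 'provider/' prefix, e.g. 'anthropic/x' -> 'x'."""
    return model_id.rsplit("/", 1)[-1]


def match_model_id(session_model, configured_models):
    """Return the configured ID whose bare name equals the session's, else None."""
    target = bare_model_name(session_model)
    for configured in configured_models:
        if bare_model_name(configured) == target:
            return configured
    return None


configured = ["anthropic/claude-sonnet-4.5", "anthropic/claude-opus-4.5"]
matched = match_model_id("claude-sonnet-4.5", configured)
```

Matching on the bare name works in both directions, so a session ID that already carries the prefix resolves to the same configured entry.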

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

cursor · 2026-03-15T18:34:22Z

+        start = time.time()
+        while time.time() - start < self.max_poll_time_s:
+            try:
+                job = fleet.get_job(job_id)


Sync Fleet SDK calls block event loop in async methods

Medium Severity

The evaluate and _poll_job async methods use the synchronous Fleet client for all HTTP calls (import_single_task, create_job, get_job, list_job_sessions), which blocks the event loop despite the methods being async. The rest of the codebase wraps sync Fleet calls with asyncio.to_thread() or uses AsyncFleet for this exact reason. The asyncio.sleep between polls yields correctly, but every actual API call blocks, contradicting the stated design goal of non-blocking async polling.

Additional Locations (2)

src/envs/fleet_env/task_evaluator.py#L172-L191

src/envs/fleet_env/task_evaluator.py#L208-L209
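One way to address this finding, sketched with the standard library only: run each blocking call in a worker thread via asyncio.to_thread() (Python 3.9+), the approach the comment attributes to the rest of the codebase. slow_get_job is a hypothetical stand-in for a synchronous client call, and the heartbeat task exists only to show that the event loop keeps running while the call is in flight.

```python
import asyncio
import time


def slow_get_job(job_id):
    """Hypothetical blocking call standing in for a synchronous HTTP request."""
    time.sleep(0.05)
    return {"id": job_id, "status": "completed"}


async def heartbeat(ticks):
    """Record a tick every 10 ms; it only advances if the loop is free."""
    for _ in range(5):
        ticks.append(1)
        await asyncio.sleep(0.01)


async def main():
    ticks = []
    hb = asyncio.create_task(heartbeat(ticks))
    # The blocking function runs in a worker thread; the loop stays responsive.
    job = await asyncio.to_thread(slow_get_job, "job-1")
    ticks_during_call = len(ticks)
    await hb
    return job, ticks_during_call


job, ticks_during_call = asyncio.run(main())
```

Calling slow_get_job("job-1") directly inside main() would return the same dictionary, but the heartbeat could not tick until the call finished.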

Deniz and others added 6 commits March 15, 2026 11:28

Fix model ID format: use provider/model prefix for Fleet harness

fafc3ea

fix: Remove unused json import, defensive copy DEFAULT_MODELS

0051e92

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

dzorlu merged commit 8605142 into deniz/fleet_client Mar 15, 2026
1 check passed

cursor Bot reviewed Mar 15, 2026

dzorlu merged 6 commits into deniz/fleet_client from deniz/task-evaluator
