feat: Add TaskEvaluator for task generation inner loop #12
Runs k × m rollouts of generated tasks on Fleet environments. Given (prompt, verifier_code, env_key), creates FleetTaskEnv instances, runs agent loops with model inference, and returns structured results for reward computation (learnability variance + model separation). Used as the inner loop of the task-scaling RL pipeline in SkyRL.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
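The commit message names the two reward signals but does not define them. The sketch below is one plausible reading, not the PR's implementation: it assumes "learnability variance" is the population variance of verifier scores across one model's rollouts, and "model separation" is the gap between the mean scores of a stronger and a weaker model. The function names are hypothetical.

```python
from statistics import mean, pvariance
from typing import Sequence


def learnability_variance(scores: Sequence[float]) -> float:
    """Population variance of one model's verifier scores (assumed definition)."""
    return pvariance(scores)


def model_separation(weak_scores: Sequence[float], strong_scores: Sequence[float]) -> float:
    """Mean score of the strong model minus mean score of the weak model (assumed definition)."""
    return mean(strong_scores) - mean(weak_scores)


# Example: the weak model passes half of its rollouts, the strong model passes all of them.
weak = [0.0, 1.0, 0.0, 1.0]
strong = [1.0, 1.0, 1.0, 1.0]
print(learnability_variance(weak), model_separation(weak, strong))
```

Under these assumed definitions, a task that a model always passes or always fails has zero variance, so it carries no learning signal for that model.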
Instead of calling Anthropic directly and running a local agent loop, the evaluator now:
1. Imports the generated task via fleet.import_task()
2. Creates a harness job via fleet.create_job()
3. Polls for completion
4. Extracts per-session verifier scores from job sessions

Uses real Fleet model IDs (claude-sonnet-4.5, claude-opus-4.5) instead of the broken weak/strong mapping that required ANTHROPIC_API_KEY.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
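The four steps can be sketched end to end as follows. This is an illustration only: FakeFleet is a hypothetical in-memory stand-in, and its method signatures, the session dictionary shape, and the "completed" status string are assumptions rather than the real Fleet SDK.

```python
import asyncio
from collections import defaultdict


class FakeFleet:
    """Hypothetical in-memory stand-in. NOT the real Fleet SDK."""

    def __init__(self):
        self._polls = 0
        self._models = []
        self._runs = 0

    async def import_task(self, prompt, verifier_code, env_key):
        return "task-1"

    async def create_job(self, task_id, models, runs_per_model):
        self._models, self._runs = list(models), runs_per_model
        return "job-1"

    async def get_job(self, job_id):
        self._polls += 1
        # Pretend the job finishes on the third poll.
        return {"status": "completed" if self._polls >= 3 else "running"}

    async def list_job_sessions(self, job_id):
        return [{"model": m, "score": 1.0} for m in self._models for _ in range(self._runs)]


async def evaluate(fleet, prompt, verifier_code, env_key, models, runs_per_model,
                   poll_interval_s=0.01, max_poll_time_s=1.0):
    task_id = await fleet.import_task(prompt, verifier_code, env_key)  # step 1
    job_id = await fleet.create_job(task_id, models, runs_per_model)   # step 2

    loop = asyncio.get_running_loop()
    deadline = loop.time() + max_poll_time_s
    while True:                                                        # step 3
        job = await fleet.get_job(job_id)
        if job["status"] == "completed":
            break
        if loop.time() >= deadline:
            raise TimeoutError(f"job {job_id} did not complete in time")
        await asyncio.sleep(poll_interval_s)

    results_per_model = defaultdict(list)                              # step 4
    for session in await fleet.list_job_sessions(job_id):
        results_per_model[session["model"]].append(session["score"])
    return dict(results_per_model)


results = asyncio.run(evaluate(FakeFleet(), "prompt", "verifier", "env", ["a", "b"], 2))
print(results)
```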
Fleet harness POST /v1/jobs requires model IDs in 'provider/model' format (e.g., 'anthropic/claude-sonnet-4.5'), not just the model name.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
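A small helper can enforce the 'provider/model' format before a job is submitted. The sketch below is hypothetical: the function name and the default provider of "anthropic" are assumptions made for illustration.

```python
def to_harness_model_id(model: str, default_provider: str = "anthropic") -> str:
    """Return the model ID in 'provider/model' form, adding a provider if it is missing."""
    if "/" in model:
        return model  # already prefixed, leave unchanged
    return f"{default_provider}/{model}"


print(to_harness_model_id("claude-sonnet-4.5"))
print(to_harness_model_id("anthropic/claude-opus-4.5"))
```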
The sync time.sleep() in _poll_job blocked the asyncio event loop, preventing trajectory timeouts from cancelling evaluations. Using asyncio.sleep() allows the event loop to properly handle cancellations.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
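To illustrate why the change matters, here is a minimal polling loop that waits with asyncio.sleep() and is cancelled by an outer timeout. The names poll_until_done and get_status are hypothetical, not taken from the PR; had the loop used time.sleep(), the event loop would stay blocked between polls and the timeout could not fire on schedule.

```python
import asyncio
import time


async def poll_until_done(get_status, interval_s=0.01, max_poll_time_s=1.0):
    """Poll get_status() until it reports 'completed' or the time budget runs out."""
    start = time.monotonic()
    while time.monotonic() - start < max_poll_time_s:
        if get_status() == "completed":
            return True
        # Yields control to the event loop, so cancellation can be delivered here.
        await asyncio.sleep(interval_s)
    return False


async def demo():
    # The job never completes; the outer timeout cancels the poller early.
    try:
        await asyncio.wait_for(
            poll_until_done(lambda: "running", max_poll_time_s=10.0), timeout=0.05
        )
    except asyncio.TimeoutError:
        return "cancelled"
    return "finished"


print(asyncio.run(demo()))
```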
Fleet returns session model IDs without provider prefix (e.g., "claude-sonnet-4.5")
while we configure them with prefix ("anthropic/claude-sonnet-4.5"). Added
_match_model_id() to normalize and match by bare model name, so scores land
in the correct results_per_model bucket.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
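The normalization described above could look roughly like this. It is a sketch of the idea, not the PR's _match_model_id(): the helper names and the choice to return None for an unknown model are assumptions.

```python
from typing import Optional, Sequence


def bare_model_name(model_id: str) -> str:
    """Strip an optional 'provider/' prefix: 'anthropic/x' -> 'x', 'x' -> 'x'."""
    return model_id.rsplit("/", 1)[-1]


def match_model_id(session_model: str, configured: Sequence[str]) -> Optional[str]:
    """Return the configured ID whose bare name equals the session's, or None."""
    target = bare_model_name(session_model)
    for model_id in configured:
        if bare_model_name(model_id) == target:
            return model_id
    return None


configured = ["anthropic/claude-sonnet-4.5", "anthropic/claude-opus-4.5"]
print(match_model_id("claude-sonnet-4.5", configured))
```

Keying results by the configured ID, rather than by whatever string the session reports, keeps each score in the bucket the caller expects.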
Cursor Bugbot has reviewed your changes and found 1 potential issue.
    start = time.time()
    while time.time() - start < self.max_poll_time_s:
        try:
            job = fleet.get_job(job_id)
Sync Fleet SDK calls block event loop in async methods
Medium Severity
The evaluate and _poll_job async methods use the synchronous Fleet client for all HTTP calls (import_single_task, create_job, get_job, list_job_sessions), which blocks the event loop despite the methods being async. The rest of the codebase wraps sync Fleet calls with asyncio.to_thread() or uses AsyncFleet for this exact reason. The asyncio.sleep between polls yields correctly, but every actual API call blocks, contradicting the stated design goal of non-blocking async polling.
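One way to address this is to run each blocking call in a worker thread with asyncio.to_thread() (Python 3.9+). The sketch below uses get_job_sync as a hypothetical stand-in for a synchronous SDK call; it is not the Fleet client. A background ticker keeps counting while the call is in flight, which shows that the event loop stays responsive.

```python
import asyncio
import time


def get_job_sync(job_id: str) -> dict:
    """Hypothetical stand-in for a blocking HTTP call made by a sync client."""
    time.sleep(0.05)
    return {"id": job_id, "status": "completed"}


async def get_job(job_id: str) -> dict:
    # Runs the blocking call in a worker thread so the event loop stays free.
    return await asyncio.to_thread(get_job_sync, job_id)


async def demo():
    ticks = 0

    async def ticker():
        nonlocal ticks
        while True:
            await asyncio.sleep(0.005)
            ticks += 1

    task = asyncio.create_task(ticker())
    job = await get_job("job-123")
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return job, ticks


job, ticks = asyncio.run(demo())
print(job["status"], ticks > 0)
```

If the SDK offers an async client, as the review suggests with AsyncFleet, using it directly avoids the thread hop altogether.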


Summary
- Adds TaskEvaluator, which submits generated (prompt, verifier) pairs to the Fleet harness (POST /v1/jobs), polls for completion, and extracts per-model verifier scores
- Non-blocking polling via asyncio.sleep, model ID normalization for provider/bare name mismatches

Test plan
🤖 Generated with Claude Code