
feat: Add TaskEvaluator for task generation inner loop #12

Merged
dzorlu merged 6 commits into deniz/fleet_client from deniz/task-evaluator on Mar 15, 2026

Conversation

@dzorlu dzorlu commented Mar 15, 2026

Collaborator

Summary

  • Adds TaskEvaluator that submits generated (prompt, verifier) pairs to Fleet harness (POST /v1/jobs), polls for completion, and extracts per-model verifier scores
  • Inner loop for task-gen RL: generate a task → evaluate via Fleet harness → get variance/separation signal
  • Async polling with asyncio.sleep, model ID normalization for provider/bare name mismatches

Test plan

  • Submit a task with known verifier to Fleet harness, verify scores returned
  • Verify async polling doesn't block event loop
  • Test model ID matching with/without provider prefix

🤖 Generated with Claude Code

Deniz and others added 6 commits March 15, 2026 11:28
Runs k × m rollouts of generated tasks on Fleet environments.
Given (prompt, verifier_code, env_key), creates FleetTaskEnv instances,
runs agent loops with model inference, and returns structured results
for reward computation (learnability variance + model separation).
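
The commit message names the two reward signals but does not define them. As a rough sketch only, assuming "learnability variance" means the variance of one model's verifier scores across its rollouts and "model separation" means the gap in mean score between a stronger and a weaker model (both definitions, and the model names below, are assumptions and not taken from the PR):

```python
from statistics import mean, pvariance


def learnability_variance(scores):
    """Population variance of one model's verifier scores (assumed definition)."""
    return pvariance(scores) if scores else 0.0


def model_separation(results_per_model, weak, strong):
    """Mean score of the stronger model minus mean score of the weaker one."""
    return mean(results_per_model[strong]) - mean(results_per_model[weak])


# Hypothetical scores for k = 4 rollouts on m = 2 models.
results_per_model = {
    "weak-model": [0.0, 1.0, 0.0, 1.0],
    "strong-model": [1.0, 1.0, 1.0, 1.0],
}

variance = learnability_variance(results_per_model["weak-model"])
separation = model_separation(results_per_model, "weak-model", "strong-model")
```

Under these assumptions, a task that every model always passes or always fails has zero variance and carries little training signal.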

Used as the inner loop of the task-scaling RL pipeline in SkyRL.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of calling Anthropic directly and running a local agent loop,
the evaluator now:
1. Imports the generated task via fleet.import_task()
2. Creates a harness job via fleet.create_job()
3. Polls for completion
4. Extracts per-session verifier scores from job sessions

Uses real Fleet model IDs (claude-sonnet-4.5, claude-opus-4.5) instead
of the broken weak/strong mapping that required ANTHROPIC_API_KEY.
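
The four steps above might be sketched as follows. StubFleet is a hypothetical in-memory stand-in, not the real Fleet SDK: its method names echo the ones mentioned in this PR, but every signature, the "completed" status string, and the session fields are assumptions made for illustration.

```python
from collections import defaultdict


class StubFleet:
    """Hypothetical stand-in for the Fleet client; signatures are assumed."""

    def import_task(self, prompt, verifier_code, env_key):
        return "task-1"

    def create_job(self, task_id, models):
        self._models = list(models)
        return "job-1"

    def get_job(self, job_id):
        return {"status": "completed"}

    def list_job_sessions(self, job_id):
        return [{"model": m, "score": 1.0} for m in self._models]


def evaluate(fleet, prompt, verifier_code, env_key, models, max_polls=10):
    # 1. Import the generated task.
    task_id = fleet.import_task(prompt, verifier_code, env_key)
    # 2. Create a harness job for the configured models.
    job_id = fleet.create_job(task_id, models)
    # 3. Poll for completion (bounded, no sleeping in this simplified sketch).
    for _ in range(max_polls):
        if fleet.get_job(job_id)["status"] == "completed":
            break
    else:
        raise TimeoutError(f"job {job_id} did not complete")
    # 4. Group per-session verifier scores by model.
    results_per_model = defaultdict(list)
    for session in fleet.list_job_sessions(job_id):
        results_per_model[session["model"]].append(session["score"])
    return dict(results_per_model)
```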

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fleet harness POST /v1/jobs requires model IDs in 'provider/model'
format (e.g., 'anthropic/claude-sonnet-4.5'), not just the model name.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The sync time.sleep() in _poll_job blocked the asyncio event loop,
preventing trajectory timeouts from cancelling evaluations. Using
asyncio.sleep() allows the event loop to properly handle cancellations.
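
A minimal sketch of why the switch matters, assuming a polling loop shaped like the one described here. The function name, the status strings, and the get_status callable are hypothetical. Because the loop awaits asyncio.sleep(), an outer timeout such as asyncio.wait_for() can cancel it between polls; a time.sleep() in the same place would hold the event loop for the whole interval.

```python
import asyncio
import time


async def poll_until_done(get_status, poll_interval_s=0.01, max_poll_time_s=1.0):
    """Poll get_status() until it reports a terminal state or time runs out."""
    start = time.monotonic()
    while time.monotonic() - start < max_poll_time_s:
        status = get_status()
        if status in ("completed", "failed"):  # assumed terminal states
            return status
        # Yields to the event loop, so cancellation can be delivered here.
        await asyncio.sleep(poll_interval_s)
    raise TimeoutError("job did not finish within max_poll_time_s")


async def evaluate_with_trajectory_timeout(get_status, timeout_s):
    """Return the final status, or None if the outer timeout cancels polling."""
    try:
        return await asyncio.wait_for(
            poll_until_done(get_status, max_poll_time_s=10.0), timeout=timeout_s
        )
    except asyncio.TimeoutError:
        return None
```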

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fleet returns session model IDs without provider prefix (e.g., "claude-sonnet-4.5")
while we configure them with prefix ("anthropic/claude-sonnet-4.5"). Added
_match_model_id() to normalize and match by bare model name, so scores land
in the correct results_per_model bucket.
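
The PR does not show the body of _match_model_id(). One way such a helper could work, as a sketch assuming the provider prefix is everything before the first "/" (the function names and return convention below are illustrative, not the merged code):

```python
def bare_model_name(model_id):
    """Strip an optional 'provider/' prefix from a model ID."""
    return model_id.split("/", 1)[-1]


def match_model_id(session_model, configured_models):
    """Return the configured ID whose bare name matches, or None if none does."""
    target = bare_model_name(session_model)
    for configured in configured_models:
        if bare_model_name(configured) == target:
            return configured
    return None


configured = ["anthropic/claude-sonnet-4.5", "anthropic/claude-opus-4.5"]
bucket = match_model_id("claude-sonnet-4.5", configured)
```

Returning the configured ID rather than the bare name keeps the keys of results_per_model identical to the ones passed in at job creation.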

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dzorlu dzorlu merged commit 8605142 into deniz/fleet_client Mar 15, 2026
1 check passed

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

start = time.time()
while time.time() - start < self.max_poll_time_s:
    try:
        job = fleet.get_job(job_id)

Sync Fleet SDK calls block event loop in async methods

Medium Severity

The evaluate and _poll_job async methods use the synchronous Fleet client for all HTTP calls (import_single_task, create_job, get_job, list_job_sessions), which blocks the event loop despite the methods being async. The rest of the codebase wraps sync Fleet calls with asyncio.to_thread() or uses AsyncFleet for this exact reason. The asyncio.sleep between polls yields correctly, but every actual API call blocks, contradicting the stated design goal of non-blocking async polling.
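
One possible shape for the fix the review points at, sketched with asyncio.to_thread() (available in Python 3.9+). FakeSyncFleet is a hypothetical stand-in that blocks with time.sleep() to mimic a synchronous HTTP call; it is not the real Fleet client, and the returned dict is an assumed shape.

```python
import asyncio
import time


class FakeSyncFleet:
    """Hypothetical blocking client; stands in for a sync SDK call."""

    def get_job(self, job_id):
        time.sleep(0.05)  # simulates a blocking HTTP round trip
        return {"id": job_id, "status": "completed"}


async def get_job_nonblocking(fleet, job_id):
    # Run the blocking call in a worker thread so the event loop stays free.
    return await asyncio.to_thread(fleet.get_job, job_id)


async def demo():
    ticks = 0

    async def heartbeat():
        # Counts how often the event loop gets to run during the call.
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0.005)

    hb = asyncio.create_task(heartbeat())
    job = await get_job_nonblocking(FakeSyncFleet(), "job-1")
    hb.cancel()
    return job, ticks
```

If the heartbeat keeps ticking while get_job() is in flight, the call is no longer blocking the loop. Calling fleet.get_job() directly inside the coroutine would stall the heartbeat for the duration of the request.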

