
feat: Add submit_final_answer synthetic tool for carlisle tasks #11

Merged
dzorlu merged 2 commits into deniz/fleet_client from deniz/submit-final-answer
Mar 13, 2026


@dzorlu dzorlu commented Mar 13, 2026


Summary

  • Injects submit_final_answer as a synthetic tool for tasks whose prompt references it (mirrors the harness's ANSWER_SUBMISSION_TOOL from orchestrator/temporal/workflows/constants.py)
  • Intercepts calls locally (not routed to MCP), stores the answer, marks episode done
  • Passes final_answer to verifier via Fleet SDK's verify_detailed(**kwargs) so carlisle verifiers like verify(env, final_answer=None) receive the submitted answer
  • Runs verifier in close()/close_async() for orphaned rollouts (context overflow, max_turns) instead of defaulting to 0.0
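The injection and local interception described in the first two bullets can be sketched roughly as follows. This is a toy illustration, not the code from this PR: `SketchTaskEnv`, `fake_mcp`, the tool schema, and the `answer` argument name are all assumptions made for the example; only the `submit_final_answer` name and the prompt check come from the PR itself.

```python
# Hypothetical schema: the real SUBMIT_FINAL_ANSWER_TOOL definition is not shown in this PR.
SUBMIT_FINAL_ANSWER_TOOL = {
    "name": "submit_final_answer",
    "description": "Submit the final answer for the task.",
    "input_schema": {
        "type": "object",
        "properties": {"answer": {"type": "string"}},
        "required": ["answer"],
    },
}


class SketchTaskEnv:
    """Toy stand-in for FleetTaskEnv; models only the synthetic-tool path."""

    def __init__(self, prompt, mcp_tools, call_mcp, modality="tool_use"):
        self.prompt = prompt
        self.modality = modality
        self._call_mcp = call_mcp  # assumed callable: (name, arguments) -> result
        self._tools_cache = list(mcp_tools)
        self.final_answer = None
        self.done = False
        # Same condition as the reviewed hunk in task_env.py.
        if self.modality == "tool_use" and "submit_final_answer" in self.prompt:
            self._tools_cache.append(SUBMIT_FINAL_ANSWER_TOOL)

    def step(self, tool_name, arguments):
        if tool_name == "submit_final_answer":
            # Handled locally: never forwarded to the MCP server.
            self.final_answer = arguments.get("answer")
            self.done = True
            return {"result": "Answer submitted.", "done": True}
        return {"result": self._call_mcp(tool_name, arguments), "done": False}


forwarded = []  # records which tool calls reached the fake MCP server


def fake_mcp(name, arguments):
    forwarded.append(name)
    return f"MCP handled {name}"


env = SketchTaskEnv(
    prompt="Answer the question, then call submit_final_answer.",
    mcp_tools=[{"name": "bash"}, {"name": "duckdb_query"}],
    call_mcp=fake_mcp,
)
print([tool["name"] for tool in env._tools_cache])
# ['bash', 'duckdb_query', 'submit_final_answer']
env.step("bash", {"command": "ls"})
env.step("submit_final_answer", {"answer": "42"})
print(env.final_answer, env.done, forwarded)
# 42 True ['bash']
```

In this sketch only the `bash` call reaches the MCP stub, while the synthetic call ends the episode and leaves the answer on the environment for a later verifier run.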

Context

All 354 carlisle tasks reference submit_final_answer in their prompts, but this tool is a harness-level synthetic, not an MCP tool. OpenEnv connects directly to the MCP server (13 tools: bash, duckdb_query, etc.), so the tool was missing. Models would call it, get "Tool not found on any active MCP endpoint", and loop until max turns. This is why carlisle scores 0% across all training iterations.

Test plan

  • 4 new unit tests pass (TestSubmitFinalAnswer)
  • All pre-existing tests unaffected (same 9 pre-existing failures)
  • Smoke test on a carlisle task to verify end-to-end flow
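The actual TestSubmitFinalAnswer cases are not shown in this PR page. As a rough idea of what tests for the verifier hand-off and the orphaned-rollout path might look like, here is a self-contained sketch. `ToyEnv`, `TestSubmitFinalAnswerSketch`, and both verifier functions are hypothetical; only the `verify(env, final_answer=None)` signature is taken from the summary.

```python
import unittest


class ToyEnv:
    """Hypothetical stand-in for FleetTaskEnv; not the real class."""

    def __init__(self, verifier):
        self._verifier = verifier
        self.final_answer = None
        self.done = False
        self.reward = None

    def submit_final_answer(self, answer):
        # Local interception: store the answer and end the episode.
        self.final_answer = answer
        self.done = True

    def close(self):
        # Run the verifier even if the rollout never submitted an answer,
        # forwarding whatever was stored as a keyword argument.
        if self.reward is None:
            self.reward = float(self._verifier(self, final_answer=self.final_answer))
        return self.reward


def verify(env, final_answer=None):
    # Carlisle-style verifier signature, as quoted in the summary.
    return 1.0 if final_answer == "42" else 0.0


class TestSubmitFinalAnswerSketch(unittest.TestCase):
    def test_submitted_answer_reaches_verifier(self):
        env = ToyEnv(verify)
        env.submit_final_answer("42")
        self.assertTrue(env.done)
        self.assertEqual(env.close(), 1.0)

    def test_orphaned_rollout_is_still_verified(self):
        calls = []

        def recording_verify(env, final_answer=None):
            calls.append(final_answer)
            return 0.0

        env = ToyEnv(recording_verify)  # no answer is ever submitted
        self.assertEqual(env.close(), 0.0)
        self.assertEqual(calls, [None])
```

Saved to a file, the sketch can be run with `python -m unittest <module>`. The second case checks that the verifier is actually invoked on close, rather than the reward silently defaulting to 0.0.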

🤖 Generated with Claude Code

Carlisle tasks (354 total, 8 in eval) require models to call
submit_final_answer to submit results, but this tool is a
harness-level synthetic injected by the orchestrator's SessionWorkflow,
not an MCP tool. OpenEnv connects directly to MCP servers, so the tool
was missing — causing 0% scores across all carlisle tasks in training.

Changes:
- Inject submit_final_answer into tool list when prompt references it
- Intercept calls locally (not routed to MCP), store the answer
- Pass final_answer to verifier via Fleet SDK's verify_detailed()
- Run verifier in close()/close_async() for orphaned rollouts
- Add unit tests for the synthetic tool

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


Comment thread src/envs/fleet_env/task_env.py
# so that models can submit answers during SkyRL training exactly as
# they would in a Fleet harness session.
if self.modality == "tool_use" and "submit_final_answer" in self.prompt:
    self._tools_cache.append(SUBMIT_FINAL_ANSWER_TOOL)


Shared mutable dict reference risks cross-instance corruption

Low Severity

SUBMIT_FINAL_ANSWER_TOOL is a mutable module-level dict that gets appended by reference to _tools_cache. The tools list is then exposed in observations via obs["tools"]. If any downstream consumer (e.g., training framework, logging, serialization) mutates a tool dict in-place, it would corrupt the shared constant for all future FleetTaskEnv instances. A deep copy (e.g., copy.deepcopy(SUBMIT_FINAL_ANSWER_TOOL)) at append time would prevent this.
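To make the risk concrete, here is a small standalone sketch of the aliasing problem and of the copy-at-append fix. The tool schema and the two helper functions are hypothetical stand-ins; only `SUBMIT_FINAL_ANSWER_TOOL` and `copy.deepcopy` come from the comment above.

```python
import copy

# Hypothetical stand-in for the module-level constant.
SUBMIT_FINAL_ANSWER_TOOL = {
    "name": "submit_final_answer",
    "input_schema": {"properties": {"answer": {"type": "string"}}},
}


def build_tools_by_reference():
    # Mirrors the reviewed line: the shared constant itself goes into the list.
    return [SUBMIT_FINAL_ANSWER_TOOL]


def build_tools_with_copy():
    # Suggested fix: each list gets an independent copy of the constant.
    return [copy.deepcopy(SUBMIT_FINAL_ANSWER_TOOL)]


def answer_type(tool):
    return tool["input_schema"]["properties"]["answer"]["type"]


# A shallow copy still shares the nested schema dict with the constant.
shallow = dict(SUBMIT_FINAL_ANSWER_TOOL)
print(shallow["input_schema"] is SUBMIT_FINAL_ANSWER_TOOL["input_schema"])  # True

# Mutating a deep copy leaves the constant untouched.
safe = build_tools_with_copy()
safe[0]["input_schema"]["properties"]["answer"]["type"] = "integer"
print(answer_type(SUBMIT_FINAL_ANSWER_TOOL))  # string

# Mutating the by-reference entry corrupts the constant for later instances.
shared = build_tools_by_reference()
shared[0]["input_schema"]["properties"]["answer"]["type"] = "integer"
print(answer_type(SUBMIT_FINAL_ANSWER_TOOL))  # integer
```

The first check is why a deep copy is the safer choice here: with a nested schema, a shallow copy would protect only the top-level keys.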


Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dzorlu dzorlu merged commit cf91b04 into deniz/fleet_client Mar 13, 2026
1 check passed