feat: Add submit_final_answer synthetic tool for carlisle tasks by dzorlu · Pull Request #11 · fleet-ai/OpenEnv

dzorlu · 2026-03-13T19:27:11Z

Summary

- Injects submit_final_answer as a synthetic tool for tasks whose prompt references it (mirrors the harness's ANSWER_SUBMISSION_TOOL from orchestrator/temporal/workflows/constants.py)
- Intercepts calls locally (not routed to MCP), stores the answer, marks the episode done
- Passes final_answer to the verifier via Fleet SDK's verify_detailed(**kwargs) so carlisle verifiers like verify(env, final_answer=None) receive the submitted answer
- Runs the verifier in close()/close_async() for orphaned rollouts (context overflow, max_turns) instead of defaulting to 0.0
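
A minimal sketch of the inject-and-intercept flow described in the summary, not the code from this PR. SketchTaskEnv, call_mcp, the tool schema, and the "answer" argument key are hypothetical stand-ins; only SUBMIT_FINAL_ANSWER_TOOL, _tools_cache, modality, and prompt are names that appear in the diff.

```python
import copy
from typing import Any, Callable, Dict, List, Optional

# Hypothetical schema for illustration; the real constant mirrors the
# harness's ANSWER_SUBMISSION_TOOL and may look different.
SUBMIT_FINAL_ANSWER_TOOL: Dict[str, Any] = {
    "name": "submit_final_answer",
    "description": "Submit the final answer for the task.",
    "input_schema": {
        "type": "object",
        "properties": {"answer": {"type": "string"}},
        "required": ["answer"],
    },
}


class SketchTaskEnv:
    """Toy stand-in for FleetTaskEnv (assumed shape, not the real class)."""

    def __init__(
        self,
        prompt: str,
        mcp_tools: List[Dict[str, Any]],
        call_mcp: Callable[[str, Dict[str, Any]], Any],
        modality: str = "tool_use",
    ) -> None:
        self.prompt = prompt
        self.modality = modality
        self._call_mcp = call_mcp  # hypothetical hook that routes a call to MCP
        self._tools_cache = list(mcp_tools)
        self.final_answer: Optional[str] = None
        self.done = False
        # Inject the synthetic tool only when the prompt references it.
        self._has_submit_tool = (
            self.modality == "tool_use" and "submit_final_answer" in self.prompt
        )
        if self._has_submit_tool:
            self._tools_cache.append(copy.deepcopy(SUBMIT_FINAL_ANSWER_TOOL))

    def step(self, tool_name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
        # Intercept the synthetic tool locally; it never reaches MCP.
        if self._has_submit_tool and tool_name == "submit_final_answer":
            self.final_answer = arguments.get("answer")
            self.done = True
            return {"result": "Answer submitted.", "done": True}
        # Every other tool call is forwarded to the MCP server.
        return {"result": self._call_mcp(tool_name, arguments), "done": False}
```

In this sketch the stored final_answer would later be handed to the verifier as a keyword argument, which is how the summary describes verify_detailed(**kwargs) being used.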

Context

All 354 carlisle tasks reference submit_final_answer in their prompts, but this tool is a harness-level synthetic, not an MCP tool. OpenEnv connects directly to the MCP server (13 tools: bash, duckdb_query, etc.), so the tool was missing. Models would call it, get "Tool not found on any active MCP endpoint", and loop until max turns. This is why carlisle scores 0% across all training iterations.

Test plan

- 4 new unit tests pass (TestSubmitFinalAnswer)
- All pre-existing tests unaffected (same 9 pre-existing failures)
- Smoke test on a carlisle task to verify end-to-end flow

🤖 Generated with Claude Code

Carlisle tasks (354 total, 8 in eval) require models to call submit_final_answer to submit results, but this tool is a harness-level synthetic injected by the orchestrator's SessionWorkflow, not an MCP tool. OpenEnv connects directly to MCP servers, so the tool was missing, causing 0% scores across all carlisle tasks in training.

Changes:
- Inject submit_final_answer into tool list when prompt references it
- Intercept calls locally (not routed to MCP), store the answer
- Pass final_answer to verifier via Fleet SDK's verify_detailed()
- Run verifier in close()/close_async() for orphaned rollouts
- Add unit tests for the synthetic tool

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
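
The orphaned-rollout change can be illustrated with a small sketch, under stated assumptions. SketchRollout and its verify_detailed callable are hypothetical stand-ins for the environment and the Fleet SDK call, the verifier is assumed to return a numeric score, and the _reward_computed flag is assumed to act as a run-once guard (the name appears in this PR's merge commit, but its exact semantics are not shown here).

```python
from typing import Callable, Optional


class SketchRollout:
    """Toy rollout that scores itself at most once (assumed behavior)."""

    def __init__(self, verify_detailed: Callable[..., float]) -> None:
        # Hypothetical callable standing in for the SDK's verify_detailed.
        self._verify_detailed = verify_detailed
        self.final_answer: Optional[str] = None
        self.reward: Optional[float] = None
        self._reward_computed = False

    def _compute_reward(self) -> float:
        # Run the verifier only once, whichever path reaches it first.
        if not self._reward_computed:
            self.reward = float(
                self._verify_detailed(final_answer=self.final_answer)
            )
            self._reward_computed = True
        return self.reward

    def submit(self, answer: str) -> float:
        # Normal path: the model submitted an answer, so score immediately.
        self.final_answer = answer
        return self._compute_reward()

    def close(self) -> float:
        # Orphaned path (context overflow, max turns): no answer was
        # submitted, but the verifier still runs instead of assuming 0.0.
        return self._compute_reward()
```

The point of the guard is that a rollout which both submits and closes is verified once, while a rollout that only closes still gets a verifier-derived score with final_answer=None.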

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.


cursor · 2026-03-13T19:32:39Z

+        # so that models can submit answers during SkyRL training exactly as
+        # they would in a Fleet harness session.
+        if self.modality == "tool_use" and "submit_final_answer" in self.prompt:
+            self._tools_cache.append(SUBMIT_FINAL_ANSWER_TOOL)


Shared mutable dict reference risks cross-instance corruption

Low Severity

SUBMIT_FINAL_ANSWER_TOOL is a mutable module-level dict that gets appended by reference to _tools_cache. The tools list is then exposed in observations via obs["tools"]. If any downstream consumer (e.g., training framework, logging, serialization) mutates a tool dict in-place, it would corrupt the shared constant for all future FleetTaskEnv instances. A deep copy (e.g., copy.deepcopy(SUBMIT_FINAL_ANSWER_TOOL)) at append time would prevent this.

Additional Locations (1)

src/envs/fleet_env/task_env.py#L31-L53
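
A small sketch of the aliasing risk described in this comment. The tool schema and both helper functions are hypothetical; they only illustrate why a deep copy isolates the constant when the tool dict contains nested dicts, whereas appending by reference or making a shallow copy does not.

```python
import copy
from typing import Any, Callable, Dict, List

# Hypothetical nested schema; the real constant may differ.
SUBMIT_FINAL_ANSWER_TOOL: Dict[str, Any] = {
    "name": "submit_final_answer",
    "input_schema": {"properties": {"answer": {"type": "string"}}},
}


def build_tools(copy_fn: Callable[[Dict[str, Any]], Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return a tool list, passing the shared constant through copy_fn."""
    return [copy_fn(SUBMIT_FINAL_ANSWER_TOOL)]


def nested_mutation_leaks(copy_fn: Callable[[Dict[str, Any]], Dict[str, Any]]) -> bool:
    """Mutate a nested field of the exposed tool; report if the constant changed."""
    pristine = copy.deepcopy(SUBMIT_FINAL_ANSWER_TOOL)
    tools = build_tools(copy_fn)
    # Simulate a downstream consumer editing the schema in place.
    tools[0]["input_schema"]["properties"]["injected"] = {"type": "string"}
    leaked = SUBMIT_FINAL_ANSWER_TOOL != pristine
    # Restore the constant so repeated calls start from the same state.
    SUBMIT_FINAL_ANSWER_TOOL.clear()
    SUBMIT_FINAL_ANSWER_TOOL.update(pristine)
    return leaked
```

Calling nested_mutation_leaks with an identity function (append by reference) or with copy.copy reports a leak, because the nested input_schema dict is still shared; with copy.deepcopy it does not.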

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

cursor Bot reviewed Mar 13, 2026


merge: resolve conflicts with deniz/fleet_client (_reward_computed)

5111c78

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

dzorlu merged commit cf91b04 into deniz/fleet_client Mar 13, 2026
1 check passed


dzorlu merged 2 commits into deniz/fleet_client from deniz/submit-final-answer
