Stage G: surface typed tool-result payloads to the planner #9

Closed
Anmolnoor wants to merge 1 commit into stage-f/helpful-not-found from stage-g/surface-tool-results

@Anmolnoor
Owner

Context

The big one. Asked to "read the file in res and give me the data", the agent read the file 18 times and then crashed, which made the model look dumb. It wasn't: fcli was hiding the tool output from the model.

_build_observation only ever read artifact["stdout"]. But typed capabilities don't use that field: file.read puts its data in content, and search/files/git put theirs in structured fields. So every typed read came back blank to the planner, which saw that the action had executed but none of the data. With file.read, the model re-read the same file every iteration, its own reasoning noting that "the content wasn't being returned properly," and looped until the no-progress detector or an empty-response flake killed the turn.

This was latent the whole time — masked because the model fetched data via shell gh api, whose stdout is surfaced. The moment a task depended on a typed read, the agent was flying blind.
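A minimal sketch of the failure mode, assuming artifacts are plain dicts. The function name `build_observation_before_fix`, the `kind` key, and the sample artifacts are illustrative stand-ins rather than fcli's actual code; only the `stdout` and `content` field names come from the description above.

```python
from typing import Any


def build_observation_before_fix(artifact: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical stand-in for the pre-fix observation builder."""
    # Only "stdout" is consulted, so a payload stored in any other
    # field never reaches the planner.
    return {
        "kind": artifact.get("kind", "unknown"),
        "stdout_preview": artifact.get("stdout", ""),
    }


# Assumed shapes: shell results carry "stdout", typed reads carry "content".
shell_artifact = {"kind": "shell", "stdout": '{"login": "example"}'}
read_artifact = {"kind": "file.read", "content": "# Report\nline 2\n"}

shell_obs = build_observation_before_fix(shell_artifact)
read_obs = build_observation_before_fix(read_artifact)

print(repr(shell_obs["stdout_preview"]))  # shell output is visible
print(repr(read_obs["stdout_preview"]))   # typed read comes back empty
```

Under these assumptions the shell artifact keeps its output while the typed read yields an empty preview, which matches the blank observations described above.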

Note: based on stage-f/helpful-not-found; merge order #3 → #4 → #5 → #6 → #7 → #8 → this.

What changed

_tool_result_preview: for read-only result types (file.read/read_chunk, search, files, git inspect, man/tldr) surface the payload — the file content, or a compact JSON of the structured result — into the observation's stdout_preview, capped by the existing _truncate_preview limits (8 KB / 200 lines). Writes and mutations are excluded so we don't echo just-written content back and re-bloat the plan prompt.
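The behaviour described above could be sketched roughly as follows. This is not the actual fcli implementation: the `kind` labels, the dict-shaped artifact, and the helper names `tool_result_preview` and `truncate_preview` are assumptions made for illustration. Only the 8 KB / 200-line limits and the read-only versus write distinction come from this PR.

```python
import json
from typing import Any

# Limits quoted in the description above.
MAX_PREVIEW_BYTES = 8 * 1024
MAX_PREVIEW_LINES = 200

# Assumed labels for read-only result types; the real identifiers may differ.
READ_ONLY_KINDS = {
    "file.read", "file.read_chunk", "search", "files", "git.inspect", "man", "tldr",
}


def truncate_preview(text: str) -> str:
    """Clip text to the line limit first, then to the byte limit."""
    clipped = "".join(text.splitlines(keepends=True)[:MAX_PREVIEW_LINES])
    encoded = clipped.encode("utf-8")
    if len(encoded) > MAX_PREVIEW_BYTES:
        # errors="ignore" drops a multi-byte character cut in half at the boundary.
        clipped = encoded[:MAX_PREVIEW_BYTES].decode("utf-8", errors="ignore")
    return clipped


def tool_result_preview(artifact: dict[str, Any]) -> str:
    """Return a capped preview for read-only results, or "" for anything else."""
    kind = artifact.get("kind")
    if kind not in READ_ONLY_KINDS:
        return ""  # writes and mutations are not echoed back
    if kind in ("file.read", "file.read_chunk"):
        payload = str(artifact.get("content", ""))
    else:
        structured = {k: v for k, v in artifact.items() if k not in ("kind", "stdout")}
        payload = json.dumps(structured, separators=(",", ":"), sort_keys=True, default=str)
    return truncate_preview(payload)


read_preview = tool_result_preview({"kind": "file.read", "content": "hello\n"})
search_preview = tool_result_preview(
    {"kind": "search", "matches": [{"path": "a.py", "line": 3}]}
)
write_preview = tool_result_preview({"kind": "file.write", "content": "just written"})
```

In this sketch the returned string would be what lands in the observation's stdout_preview. Returning an empty string for writes keeps just-written content out of the next plan prompt.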

Verification

Live: "read the report in res about anmolnoor and give me a 2-line summary" now reads once, sees the content, and answers correctly —

"…Vancouver-based developer with 111 public repos, 31 followers… fcli (local-first shell-native coding agent) and beekeeper (governed agent runtime)… created May 2026."

— in 4 iterations, versus the previous 18-read dead-loop + crash.

Tests

2 new (405 total, ruff clean): _tool_result_preview surfaces reads + search but not writes; an orchestrator read surfaces the file content into the next iteration's planner context.

🤖 Generated with Claude Code

The observation builder only ever read artifact["stdout"], but typed
capabilities don't use that field — file.read puts its data in "content",
search/files/git in structured fields. So every typed read came back BLANK to
the planner: it saw the action "executed" but none of the data. With file.read
in particular the model would re-read the same file every iteration, conclude
"the content isn't being returned," and loop until the no-progress detector or
an empty-response flake killed the turn. (Masked until now because the model
got its data from shell `gh api`, whose stdout *is* surfaced.)

Add _tool_result_preview: for read-only result types (file read/read_chunk,
search, files, git inspect, man/tldr) surface the payload (file content, or a
compact JSON of the structured result) into the observation's stdout_preview,
capped by the existing _truncate_preview limits. Writes/mutations are
deliberately excluded so we don't echo just-written content back and re-bloat
the prompt.

Verified live: "read the report and summarize" now reads once, sees the
content, and answers in a few iterations instead of looping ~20x and crashing.

Tests: _tool_result_preview surfaces reads + search but not writes; an
orchestrator read surfaces the file content into the next iteration's planner
context.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Anmolnoor
Owner Author

✅ Stages A–G merged to main (commit 27c97b0)

This PR chain is now on main as a single merge. Summary of the whole journey, which started from one bug report ("the report-writing run crashed with invalid JSON"):

| Stage | PR | What it fixed |
| --- | --- | --- |
| A | #3 | Truncation is now an explicit error + a repair nudge; raw response persisted to the event log |
| B | #4 | Large file bodies no longer inlined in the plan JSON; deferred via content_brief + a separate generation call |
| C | #5 | The agent can pause and ask a clarifying question instead of guessing |
| D | #6 | Out-of-scope reads ask permission and grant session-scoped, read-only access |
| E | #7 | Deferred-write generation no longer wraps output in a plan blob; repair retries use temperature jitter |
| F | #8 | FILE_NOT_FOUND lists sibling files so a wrong-name guess self-corrects |
| G | this | Typed tool results (file reads, search, git) are surfaced to the planner; previously only shell stdout was, so typed reads came back blank and the model looped |

405 tests passing, ruff clean. Verified end-to-end against qwen3.5:397b-cloud: generate a GitHub report, then read it back and summarize — both work cleanly.

The throughline

Most of what looked like "the model is dumb" was actually the harness not feeding the model what it needed — truncated plans (A/B), plan-wrapped output (E), blank tool results (G). Those are all fixed. The genuinely model-side residue (intermittent empty completions on Ollama Cloud) is narrow and best addressed by a stronger model behind the same loop.

Closing #3–#8 as well; all their commits are in main via 27c97b0.

@Anmolnoor
Owner Author

Merged into main via commit 27c97b0 (stages A–G landed as one merge). See the summary on #9.

@Anmolnoor closed this May 27, 2026