docs(agent): direct-call tools design (Workstream A) by mmabrouk · Pull Request #4886 · Agenta-AI/agenta

mmabrouk · 2026-06-27T14:22:45Z

What

Design docs (no code) for direct-call tools: make each resolved agent tool carry its own
call target so the sidecar calls reference and platform tools directly, and only
gateway (Composio) tools route through the central /tools/call.

Lives under docs/design/agent-workflows/projects/direct-call-tools/. This is the Workstream A
design, up for review. Implementation is dispatched separately (the workstream split is in
plan.md).

Why

Today every callback tool POSTs to one endpoint, /tools/call, which re-parses a string and
re-dispatches. For a reference (a stored workflow used as a tool) or a platform tool (an
Agenta operation like create-workflow or annotate), the target is just an Agenta endpoint and the
sidecar already holds the caller credential, so the hop adds a step and no value. Gateway is
the one real exception: only the server can read the Composio secret. Seeded by the line-67
comment on #4863.

What's in it

context.md — the decision, goals, non-goals.
research.md — the actual code seams (config/resolved/dispatch, the reference-invoke path,
the platform-op catalog precedent, the in-process schema catalog) with file:line.
design.md — the call descriptor, the per-tool-type table (config -> resolved -> dispatch),
the platform-op catalog, the reference resolution.
plan.md — the two-workstream split (this is A; B drops the @ag.reference marker and
rebuilds the FE on type:"reference" on [feat] Reference a workflow as an agent tool (type:"reference") #4860 / feat(frontend): reference a workflow as an agent tool (type:"reference") #4877) plus the A execution phases.
status.md, README.md.

Decisions settled in the design

Reference/platform tools call directly; gateway stays server-side.
Drop the @ag.reference marker (keep type:"reference"); keep embed, hidden in the UI for
now. [Workstream B]
A reference can target an environment or a variant, latest or a pinned revision. [schema in B]
The service resolves the reference revision at resolve time and bakes it in (the sidecar is
always invoked fresh by the service, so it stays current).

Decision needed from you

See the 🔸 Decision needed comment: when to remove the /tools/call workflow.* routing
(my recommendation: in Workstream A, after B lands).

How to review

This is a design review, not a code review. Read design.md first (the shape), then
plan.md (the split + phases). No code in this PR.

Related: #4860 / #4877 (the reference tool, Workstream B), #4837 (embedref design), #4863
(custom-tools note, where this started).

https://claude.ai/code/session_01YXbQbNQvPn6GU3P8TvLMEw

vercel · 2026-06-27T14:22:52Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
agenta-documentation	Ready	Preview, Comment	Jun 27, 2026 6:24pm

coderabbitai · 2026-06-27T14:22:53Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 2d96e728-bbd6-4f76-8a01-2546fd29cc1c

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch docs/direct-call-tools

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

mmabrouk · 2026-06-27T14:23:12Z

🔸 Decision needed — this is a design decision, not a code review.

When do we remove the /tools/call workflow.* routing?

Reference tools execute today through /tools/call's workflow.* branch. This design moves
them to a direct call to the invoke endpoint. Removing that branch is coupled to "reference goes
direct" (Phase 4) and to Workstream B (#4860 / #4877) landing first.

Option 1 (recommended): remove it in Workstream A, after B lands. Reference keeps working
through the existing routing until the direct path is in, so there is no window where reference
tools are broken.
Option 2: remove it in Workstream B now. Smaller diff there, but reference tools stop
executing until Workstream A's direct path lands.

My recommendation is Option 1. Once you pick, the orchestrator can sequence the implementation
accordingly.

(One lower-priority open item, no decision needed now: forwarding the trace context to the
sub-workflow on a reference invoke is marked a Phase 6 stretch and likely deferred.)

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (4)

docs/design/agent-workflows/projects/direct-call-tools/context.md (1)
41-42: 📐 Maintainability & Code Quality | 🔵 Trivial

Fix extra whitespace in "Agenta base".

The static analysis flagged extra spaces between "Agenta" and "base" on line 42. Clean up the spacing for readability.
-  base. Off-Agenta targets are a deliberate later decision.
+  base. Off-Agenta targets are a deliberate later decision.
Source: Linters/SAST tools
docs/design/agent-workflows/projects/direct-call-tools/design.md (2)
12-18: 📐 Maintainability & Code Quality | 🔵 Trivial

Add language identifier to fenced code block.

The call descriptor schema block lacks a language specifier. Add json or ts for proper syntax highlighting.
+```json
call: {
method: "POST" | "GET",


_Source: Linters/SAST tools_

---

`125-128`: _📐 Maintainability & Code Quality_ | _🔵 Trivial_

**Add language identifier to fenced code block.**

The platform-op catalog schema block lacks a language specifier.



```diff
+```json
op -> { method, path, input_schema_ref, default_permission, default_needs_approval }
Source: Linters/SAST tools
docs/design/agent-workflows/projects/direct-call-tools/status.md (1)

52-53: 📐 Maintainability & Code Quality | 🔵 Trivial

Explain coordination jargon or link to conventions.

References to "Row claimed on scratch/agent-coordination.md", "but write", and "BUT-LOCK" use team-specific jargon that may not be clear to all readers. Consider adding a brief explanatory note or linking to the team's coordination conventions documentation.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 60ef4e5d-a6f7-4909-83e3-8001c1c671ea

📥 Commits

Reviewing files that changed from the base of the PR and between fd43144 and f607f03.

📒 Files selected for processing (6)

docs/design/agent-workflows/projects/direct-call-tools/README.md
docs/design/agent-workflows/projects/direct-call-tools/context.md
docs/design/agent-workflows/projects/direct-call-tools/design.md
docs/design/agent-workflows/projects/direct-call-tools/plan.md
docs/design/agent-workflows/projects/direct-call-tools/research.md
docs/design/agent-workflows/projects/direct-call-tools/status.md

mmabrouk · 2026-06-27T15:20:12Z

🤖 The AI agent says: ## Round 2 — changes addressing your comments (for a skim)

Second version pushed (design-only, no code). Each of your comments mapped to what changed, then
the decisions to be aware of.

1. Merge / ownership (context.md) — Added a "Merge, review, and ownership" note: you're
aligned with all decisions, merge order and organization are the orchestrator's call, and
reviewability before merge is the only requirement. I reframed the old "Open decisions for
Mahmoud" as "Sequencing notes (orchestrator's call)" in plan.md and status.md, so nothing is
waiting on a decision from you now.

2. Permissions (design.md) — Added "Permissions and approval (uniform across all tool
types)": permission/approval lives on the shared ToolSpecBase and is enforced before the call
for every type (gateway/reference/platform/code/client), not special-cased on the direct call
path. Noted the future typed permissions-config hierarchy as an out-of-scope refactor.

3. Input/messages + output schemas (design.md, research.md) — Researched. The reference
tool's input_schema is the referenced workflow's own input contract (revision.schemas.inputs
from /inspect); for chat it carries messages via x-ag-type-ref: "messages" (the shared
Messages type), so no hand-authored schema. The runner places the model's args at data.inputs,
where the invoke reads messages from inputs.messages.

4. Catalog to mirror (design.md, research.md) — Researched. Changed "mirror
platform_catalog.py" to mirror the reserved tools.agenta.* pattern from PR #4884
(find_capabilities) plus the evaluators catalog shape, reusing CATALOG_TYPES for schemas.
Confirmed platform_catalog.py is not being removed (your suspicion) but is heavier — for
skills — so we mirror its spirit, not its versioned-revision machinery. find_capabilities
becomes the first platform tool, tying into the deferred item already in plan.md.

5. CodeRabbit (schema sourcing) — Fixed plan.md (it still said schemas come "from
openapi.json"); all three docs now say the in-process CATALOG_TYPES catalog.

Decisions to be aware of (the only net-new ones)

Output schemas: deferred. Harness tool definitions are input-schema-only across
OpenAI/Anthropic, so the model gets the tool's output as result text. The workflow's
schemas.outputs exists and is kept available for a later structured-output pass, but no tool
output_schema rides the wire now.
Catalog pattern: mirror tools.agenta.* ([feat] Tool discovery: POST /tools/discover + reserved find_capabilities tool #4884) + evaluators, not platform_catalog.py
wholesale.

Everything else is alignment/cleanup. Nothing needs a decision from you — ready for your second
skim.

mmabrouk

Round 2: inline pointers to each change, mapped to your comments (context is in the PR summary comment too).

mmabrouk · 2026-06-27T15:25:58Z

+custom-tools design note (#4863) is waiting on the execution model. Settling the routing now
+lets both land on the right shape instead of on `/tools/call` string-routing.
+
+## Merge, review, and ownership


🤖 The AI agent says: Per your context.md:50 comment. Added this section: you're aligned with all decisions, merge order/organization is the orchestrator's call, and reviewability before merge is the only requirement. I also reframed the old "Open decisions for Mahmoud" as "Sequencing notes (orchestrator's call)" in plan.md + status.md, so nothing is waiting on a decision from you.

mmabrouk · 2026-06-27T15:25:58Z

+scope here. The requirement this design holds is narrower: permission handling stays consistent
+across all tool types, so the refactor can lift it uniformly later.
+
+## The platform-op catalog


🤖 The AI agent says: Per your design.md:123 comment (researched). Changed from "mirror platform_catalog.py" to mirroring the reserved tools.agenta.* pattern from PR #4884 (find_capabilities) + the evaluators catalog shape, reusing CATALOG_TYPES. Confirmed platform_catalog.py is NOT being removed (your suspicion) — it's current but heavier (for skills), so we mirror its spirit not its machinery. find_capabilities is the first platform tool.

I was actually thinking here of mirrorring patterns in the sdk

mmabrouk

🤖 The AI agent says: Run-context design added (Codex-reviewed). Two inline pointers below; the A-vs-C decision is a separate comment.

mmabrouk · 2026-06-27T16:04:00Z

+| C hybrid | clean | +1 | strong | optional | strong |
+| D propagate | trace-only | none | n/a | no | partial |
+
+## Recommendation (revised after the Codex review)


🤖 The AI agent says: New file answering "how does run context reach the auto-editing tools." I had it Codex-reviewed (xhigh). Headline: I recommend Option C (run-level runContext channel + server-side binding of protected fields), revised away from the static get_run_context tool (Option A) you leaned toward — Codex argues A is a boundary category-error and that B/C actually keep the sidecar's canonical-primitive set smaller. Security findings + a real gap (the annotations endpoint mints a new trace today, it doesn't attach to the active one) are in the "Security and guardrails" section. The A-vs-C call is yours — see the 🔸 Decision-needed comment.

mmabrouk · 2026-06-27T16:04:00Z

+scope here. The requirement this design holds is narrower: permission handling stays consistent
+across all tool types, so the refactor can lift it uniformly later.
+
+## Run context


🤖 The AI agent says: Added from the run-context work: this run-context pointer plus an SSRF guardrail on the call descriptor (method allowlist, relative-path regex, /api/ prefix, origin binding), which Codex flagged because the call makes the sidecar a structured HTTP dispatcher. Detail and options in run-context.md.

mmabrouk · 2026-06-27T16:04:02Z

🤖 The AI agent says: 🔸 Decision needed — design decision, not a code review.

How should run context (current trace, running workflow/variant + draft-state + latest revision, session_id) reach the auto-editing tools? Full options + tradeoffs in the new run-context.md; Codex-reviewed at xhigh.

Option A (your earlier lean): a static get_run_context tool. The service embeds the context as a tool that returns it; the model reads it and passes trace_id / variant_id as inputs. Clean-looking, but Codex (and I) read it as a boundary category-error: run context isn't tool behavior, and it makes identity/trace model-owned (spoofable, mitigated only by endpoint checks).
Option C (recommended): run-level runContext channel + server-side binding. Run context rides the /run contract; the runner/SDK binds the protected fields (trace_id, variant_id, latest_revision_id) into the call so the model can't override them; endpoints also harden. Adds no new sidecar primitive (so it's actually more general for the future local backend), and it's safer.

My recommendation is C. Surfaced regardless of A/C: own-variant binding + a revision-precondition on commit, and that add_trace_annotation needs a real backend (today /api/annotations/ mints a new trace rather than attaching to the active one).

Once you pick, the orchestrator folds it into the implementation. Leaning C unless you say otherwise.

mmabrouk

🤖 The AI agent says: Corrected the platform-tools framing per your note + re-ran Codex. Inline pointers below; the one decision is a separate comment.

mmabrouk · 2026-06-27T16:41:01Z

+Status: options for review. Re-asked Codex with the corrected framing (see the bottom). Open
+question; Mahmoud is weighing options.
+
+## What platform tools actually are (correcting a misframe)


🤖 The AI agent says: Rewrote this file after your correction. Platform tools are now framed as thin wrappers over existing endpoints — no new endpoints, no logic-wrapping tools, no update_own_workflow / add_trace_annotation. The real question is just how the harness gets its own context. I also marked #4863's custom-tools-design.md superseded, since the logic-tool names came from there.

mmabrouk · 2026-06-27T16:41:01Z

+| 5 context file | yes | none | yes (rewrite) | strong (folders) | out-of-band |
+| 6 system prompt | yes | none | weak | medium | simple but static |
+
+## Recommendation (after the Codex rev-2 review)


🤖 The AI agent says: Re-asked Codex (xhigh) with the corrected framing — verdict: pass with conditions. Headline: split run context by trust — server-bind the agent's own variant/trace into call.body (hidden from the model) for self-targeting ops; deliver general model-readable context as run-level data (MCP resource + file fallback, NOT a get_context canonical primitive); refresh per turn. The one decision for you is the 🔸 comment.

mmabrouk · 2026-06-27T16:41:01Z

+"update myself" is just committing a revision to the agent's own variant. The raw endpoints are
+exposed; a skill teaches composition.
+
+For a **self-targeting** op, the agent's own identity (its variant / its trace) is **server-bound**


🤖 The AI agent says: The security reconciliation Codex flagged: for a self-targeting op the own variant/trace is server-bound (locked in call.body, hidden from the model), because project-scoped auth does NOT stop within-project lateral movement (a model retargeting another variant/trace in the same project). Still a thin wrapper, just a fixed field. This adjusts the "own is convention" framing — see the decision comment.

mmabrouk · 2026-06-27T16:41:02Z

🤖 The AI agent says: 🔸 Decision needed — design decision, not a code review.

Do we server-bind the agent's own identity for self-targeting platform tools, or treat "own" as a convention the model follows?

Your framing: the model supplies its own trace/variant; "own" is a convention the skill teaches.
Codex (xhigh) + me: server-bind the self-identity (lock own-variant / own-trace into call.body, hide from the model) for the sensitive self-ops. Project-scoped auth stops cross-project access but NOT within-project lateral movement — a model could retarget a different variant/trace in the same project. Still a thin wrapper, just a fixed field.

My recommendation: server-bind the self-identity; deliver general (non-authority) context the model needs to read as run-level data via an MCP resource (+ context-file fallback), not a get_context tool; refresh per turn (or a revision-precondition on commit). Full writeup + Codex verdict in run-context.md.

Confirm server-bind, or say why convention is enough.

mmabrouk

🤖 The AI agent says: Added the context-binding mechanics (the 'other part' of the design). Inline pointer + summary below.

mmabrouk · 2026-06-27T16:59:52Z

+
+`callRef` stays for gateway. A spec has either `call` (direct) or `callRef` (gateway), not both.
+
+### Context binding (how a catalog entry fills a field from run context)


🤖 The AI agent says: This is the "how does the catalog encode the context transformation" mechanic you asked about. A catalog entry declares a bind map (endpoint field ← run-context key); the service strips bound fields from the model schema and emits call.context; the runner fills them at dispatch from a per-turn runContext blob on /run. Codex (xhigh) validated it with refinements, folded in: hybrid resolution (immutable identity baked at resolve time, mutable filled at dispatch), path-aware schema stripping, and merge hardening (path-conflict rejection, strict path parsing, prototype-pollution-safe deep-set, post-merge validation). runContext is its own /run field. Implementation checklist is in the subsection.

mmabrouk · 2026-06-27T16:59:53Z

🤖 The AI agent says: Added the context-binding mechanics (the "other part" of the run-context design): how a platform-op catalog entry declares a bind map, then resolve-time schema-strip → call.context → dispatch-time fill from a per-turn runContext blob.

Codex (xhigh) reviewed it — the mechanic is sound, and it correctly noted none of it is built yet (expected: this is a design PR, so that is the build plan, not a flaw). Folded its refinements: hybrid binding (immutable@resolve / mutable@dispatch), path-aware stripping, merge hardening (conflict rejection + safe deep-set + post-merge validation), and runContext as its own /run field with an MCP-resource / file fallback. Committed 4b9abe6d.

Still open — the one decision from the previous round: server-bind the self-identity vs "own is convention" (lean server-bind). Everything else in the design is settled.

mmabrouk

🤖 The AI agent says: Recorded your decision and did the cleanup round. Inline pointer + summary below.

mmabrouk · 2026-06-27T17:09:59Z

+Date: 2026-06-27 (rev 3 — DECIDED)
+Status: **decided** — this is the design to implement. Options weighed and dropped are at the end.
+
+## Decision


🤖 The AI agent says: Recorded your decision and cleaned the doc to match. Run context rides the run/session (runContext on /run, refreshed per turn); tool definitions declare a bind map and the runner fills bound fields server-side at dispatch, hidden from the model. Self-targeting tools bind their own variant/trace this way. No MCP resource, no get_context tool, no context file, no prompt injection — those are now in "Considered but not used" so the implementation agent isn't confused. This closes the last open decision; the design is complete.

mmabrouk · 2026-06-27T17:10:00Z

🤖 The AI agent says: Decision recorded + cleanup round done (9a28ae12).

Run-context delivery is settled: runContext on the run/session + tool bind (server-side, hidden from the model). Nothing else — MCP resource / get_context tool / context file / prompt injection are all moved to "Considered but not used" so the implementation agent isn't led astray.

run-context.md is now a decision doc; design.md (Run context + Context binding) and plan.md are reconciled; status.md marks the design complete — all decisions made. No code in the PR; it's ready to hand to implementation whenever you give the word. Nothing is pending a decision from you.

mmabrouk

🤖 The AI agent says: Added the interface reference + the endpoint-wrapper set + the catalog impl (Codex). Inline pointer + summary below.

mmabrouk · 2026-06-27T17:24:48Z

@@ -0,0 +1,97 @@
+# Interface reference — direct-call tools


🤖 The AI agent says: New interface reference — the concrete contract delta (the call descriptor, runContext, the type:"platform" config, the platform-op catalog entry, the reference config, and dispatch body-assembly), each with its file:line seam, so review and implementation point at the same spec. This is the references-that-support-the-review you asked for.

mmabrouk · 2026-06-27T17:24:49Z

🤖 The AI agent says: Addressed your four asks:

Interface reference — new interfaces.md with the exact contract shapes + the file:line seams. I also baked "design/contract PRs must carry an interface reference" into the queue-implement-feature skill. Note: the skills dir is gitignored in this repo, so that skill edit is live for my sessions but not git-committed here — propagating it to the team is a separate skills-source step.
Catalog (Codex): asked Codex what to mirror — a typed catalog in the SDK (platform/op_catalog.py, bridged via resolve.py, wired through tools/{models,resolver}.py), reusing PlatformConnection; stable tools.agenta.<slug> namespace + per-op permission defaults. Folded into interfaces.md §4 + plan.md.
Endpoint-wrapper set — added to plan.md (from docs(agent): agent-creation skills + custom-tools design note #4863, reframed as existing endpoints): discovery, workflow lifecycle, connections, secrets, annotations, inspect.
Plain + self/bind pairs — captured: commit-revision (plain vs update-self) and create-annotation (plain vs annotate-my-trace) are two catalog entries over the same endpoint, both shipping.

Two Codex endpoint-set fixes folded: "invoke" is the reference-tool path (not a duplicate); "inspect harnesses" is the services /inspect endpoint. Execution stays direct per your decision (Codex leaned /tools/call; the SSRF guardrails cover its concern; find_capabilities migrates to direct).

Committed 73a2b134. The design + plan are complete and ready for the orchestrator. Nothing is pending a decision from you — flag if I missed anything.

mmabrouk

lgtm

Design docs for making resolved agent tools carry their own call target, so the sidecar calls reference/platform tools directly and only gateway (Composio) tools route through /tools/call. Design + plan only; implementation dispatched separately.

Address Mahmoud's round-2 comments + CodeRabbit: - reference input_schema from the workflow's revision.schemas.inputs (messages via x-ag-type-ref); output schemas deferred - permissions uniform across all tool types (not special-cased on the call path) - platform-op catalog mirrors the reserved tools.agenta.* pattern (#4884 find_capabilities) + evaluators catalog, not platform_catalog.py wholesale - schema sourcing standardized on the in-process CATALOG_TYPES catalog (fixes the openapi.json inconsistency) - merge order/organization framed as the orchestrator's call

New run-context.md: how the run's context (trace_id, running workflow/variant + draft-state + latest revision, session_id) reaches the auto-editing tools. Four options (static get_run_context tool / runner injection / hybrid / traceparent), system boundaries, the canonical-primitive framing, tradeoffs. Codex-reviewed: revised recommendation = Option C (run-level runContext channel + server-side binding of protected fields), with security findings (confused-deputy on variant, revision-precondition, trace integrity, annotations-mints-new-trace gap) and call- descriptor SSRF guardrails. design.md: added the SSRF guardrail + a run-context pointer.

Platform tools are a THIN WRAPPER over EXISTING endpoints (no new endpoints, no logic-wrapping tools). Removed the fake update_own_workflow / add_trace_annotation tools; expose raw endpoints, compose multi-step via skills. run-context.md rewritten + Codex-reviewed (rev 2): run context is run-level data not a tool; protected self-identity (own variant/trace) is server-bound into call.body and hidden from the model (within-project lateral-movement risk); model-readable context via MCP resource + file fallback; refresh per turn. design.md/plan.md reconciled.

…ewed) How a platform-op catalog entry declares a context bind (endpoint field <- run-context key): the service strips bound fields from the model schema + emits call.context; the runner fills them at dispatch from a per-turn runContext blob on /run. Codex binding review folded in: hybrid resolution (immutable identity baked at resolve time, mutable filled at dispatch), path-aware schema stripping, merge hardening (path-conflict rejection, strict path parsing, prototype-pollution-safe deep-set, post-merge validation), runContext as its own /run field. Added the implementation checklist.

Decision (Mahmoud): run context rides the run/session (runContext on /run, refreshed per turn); tool definitions declare a bind map and the runner fills bound fields server-side at dispatch, hidden from the model. Self-targeting tools bind their own variant/trace this way. NO MCP resource, NO get_context tool, NO context file, NO prompt injection. Cleaned run-context.md to a decision doc (+ 'Considered but not used'); reconciled design.md (Run context + Context binding) and plan.md; status.md marks the design complete (last open decision closed).

…l (Codex) interfaces.md: the concrete interface delta (call descriptor, runContext, type:platform config, platform-op catalog entry, reference config, dispatch body-assembly) with the file:line seams, so review + implementation share one reference. plan.md: the endpoint-wrapper set to expose (from #4863, reframed as existing endpoints; plain + self/bind variants incl. the annotate pair) + catalog implementation per the Codex catalog review (typed catalog in SDK platform/op_catalog.py, reuse PlatformConnection, stable tools.agenta.* namespace, per-op permission defaults; execution stays direct, find_capabilities migrates to direct).

vercel Bot deployed to Preview June 27, 2026 14:23 View deployment

mmabrouk added the needs-review Agent updated; awaiting Mahmoud's review label Jun 27, 2026