Skip to content

feat: arm-B CLI micro-edit proxy bench runner over slots/set-slot/vary + CQL (BE-2309)#490

Open
mattmillerai wants to merge 2 commits into
mainfrom
matt/be-2309-arm-b-bench-runner
Open

feat: arm-B CLI micro-edit proxy bench runner over slots/set-slot/vary + CQL (BE-2309)#490
mattmillerai wants to merge 2 commits into
mainfrom
matt/be-2309-arm-b-bench-runner

Conversation

@mattmillerai

Copy link
Copy Markdown
Collaborator

ELI-5

We're comparing two ways to let an AI edit a ComfyUI workflow: arm A (the in-app
agent, already built by "child 1") vs arm B (a CLI that gives the model a handful
of tiny "micro-edit" tools on a JSON file on disk). Arm B's real subject — Kishore's
CLI-agent prototype — isn't ready yet, so this PR ships a sanctioned stand-in (proxy):
a small Claude loop that uses this repo's own comfy workflow slots / set-slot / vary
commands (+ the CQL node catalog) as its tools. It runs the exact same t1–t4 tasks as
arm A, measures tokens/cost/tool-bytes the same way, and writes results in the exact
same file format so child 1's existing report.mjs can put both arms in one table.
Every row is labelled "proxy": true so nobody mistakes it for the real prototype.

What this adds (BE-2309)

  • comfy_cli/bench/run_arm_b.py — a minimal agent loop (Claude Opus claude-opus-4-8,
    raw Anthropic Messages API) exposing exactly the arm-B toolset:
    • slots (read) + cql (read/validate against comfy_cli/cql/) → set_slot (edit) →
      vary (produce-variants), each dispatching to the real CQL engine against a temp
      frontend-format workflow on disk
      — the model only sees compact slot manifests, never
      the full JSON.
  • Telemetry — captures the Anthropic usage block per API call
    (input_tokens / output_tokens / cache_read_input_tokens /
    cache_creation_input_tokens), tool-call count, and per-call payload bytes; prices each
    turn with the same table as comfy-inapp-agent/agent-server/usage.mjs
    (Opus 4.8 $5/M in, $25/M out, cache read 0.1×, write 1.25×).
  • Output — per-(task, turn) NDJSON to comfy_cli/bench/out/arm-b.ndjson, shaped
    identically to arm A's arm-a.ndjson
    . Verified end-to-end: child 1's
    node bench/report.mjs --arm-b <path> renders the comparison table from these rows, and
    the recomputed cost matches this runner's cost_usd to the cent.
  • Offline-first — committed object_info + seed fixtures run t1/t2/t4 with no live
    ComfyUI. t3 (~150-node build) is recorded as a RESULT ceiling ("outcome": "ceiling"
    • note): arm B's edit-only toolset can't build 150 nodes and the fixture covers only the
      6 base SD1.5 classes.
  • Teststests/comfy_cli/bench/test_run_arm_b.py (21 tests) unit-test tool dispatch +
    usage-parsing + the full --dry-run against a stubbed client — no live calls, CI stays
    offline. bench extra (pip install -e '.[bench]') + bench/README.md document the live run.

Proxy caveat + swap-in seam

This is not Kishore's prototype. The swap-in is deliberately narrow and documented in
run_arm_b.py's docstring + README.md: either inject a different model client
(build_live_client) or replace the Driver, while keeping ToolDispatcher (the
comfy-cli substrate) and build_row (the telemetry/NDJSON shape) — so the comparison output
stays identical. When the real prototype lands, drop the "proxy": true stamp.

Verification

  • python -m comfy_cli.bench.run_arm_b --dry-run → writes well-formed arm-b.ndjson (6 rows).
  • ruff check / ruff format --check clean; 21/21 tests pass.
  • Cross-repo: node bench/report.mjs --arm-b out/arm-b.ndjson consumes it directly.

Judgment calls

  • tasks.json is a vendored copy, generated by evaluating child 1's tasks.mjs and
    dumping its prompt strings (byte-identical prompts), because Python can't import the .mjs
    and no shared .json exists upstream yet. It carries seedWorkflow + a sync note; re-vendor
    (don't hand-edit) if the shared matrix changes. A follow-up could promote this to a single
    shared JSON committed in both repos.
  • t1 "build small" is proxied as template + micro-edit (arm B has no from-scratch node
    creation), and t4 uses the SD1.5 seed's real nodes for its edits since object_info has
    no LoRA — the prompt strings stay byte-identical; the tool calls target what the fixture
    offers. Both are documented in-code and in the README as honest proxy behavior.
  • No forced prompt-caching on the live path: the system+tools prefix is well under Opus's
    1024-token cache minimum, so cache_control would be a live no-op; the runner captures
    whatever cache tokens the API returns. The dry-run's synthetic numbers demonstrate the
    regrowth shape.

All acceptance criteria in BE-2309 are met.

…y + CQL (BE-2309)

Add `comfy_cli/bench/run_arm_b.py`: a minimal Claude Opus (claude-opus-4-8,
Anthropic Messages API) agent loop that exposes the existing `comfy workflow`
micro-edit substrate as tools — read=slots+cql, edit=set-slot,
produce-variants=vary — and drives the shared BE-2302 t1–t4 task matrix
(vendored byte-identical from child 1's tasks.mjs into bench/tasks.json).

The loop reads/edits a temp frontend-format workflow JSON on disk (no full-JSON
round-trip to the model), captures the Anthropic usage block per API call plus
tool-call count and per-call payload bytes, and prices each turn with the same
table as comfy-inapp-agent/agent-server/usage.mjs so the two arms are
comparable. It emits per-(task, turn) NDJSON shaped identically to arm A's
arm-a.ndjson (verified: child 1's report.mjs consumes bench/out/arm-b.ndjson
directly).

Spike-sanctioned PROXY for Kishore's not-yet-accessible CLI-agent prototype:
every row is stamped `"proxy": true`, and README + module docstring document the
narrow swap-in seam (inject the client / replace Driver, keep ToolDispatcher +
build_row). Offline-first: committed object_info + seed fixtures run t1/t2/t4
without a live ComfyUI; t3 (~150-node build) is recorded as a RESULT ceiling
(CQL/object_info coverage limit). Unit tests stub the model client (no live
calls, CI stays offline); a `bench` extra + README cover the live run.
@coderabbitai

coderabbitai Bot commented Jul 3, 2026

Copy link
Copy Markdown

Warning

Review limit reached

You’ve reached a temporary PR review limit under our Fair Usage Limits Policy.

Your recent review volume is higher than typical usage, so adaptive limits are currently applied.

Next review available in: 4 minutes

Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available.
You're only billed for reviews past your plan's rate limits ($0.25/file).

How can I continue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews.

How do review limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please refer docs for additional details.

Review details
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 37d31342-f144-4b29-9862-b112f67f548b

📥 Commits

Reviewing files that changed from the base of the PR and between cb4b6f8 and e05522e.

📒 Files selected for processing (12)
  • comfy_cli/bench/README.md
  • comfy_cli/bench/__init__.py
  • comfy_cli/bench/fixtures/edit_session_seed.json
  • comfy_cli/bench/fixtures/object_info.json
  • comfy_cli/bench/fixtures/txt2img_seed.json
  • comfy_cli/bench/out/.gitignore
  • comfy_cli/bench/out/.gitkeep
  • comfy_cli/bench/run_arm_b.py
  • comfy_cli/bench/tasks.json
  • pyproject.toml
  • tests/comfy_cli/bench/__init__.py
  • tests/comfy_cli/bench/test_run_arm_b.py
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch matt/be-2309-arm-b-bench-runner
✨ Simplify code
  • Create PR with simplified code
  • Commit simplified code in branch matt/be-2309-arm-b-bench-runner

Comment @coderabbitai help to get the list of available commands.

@mattmillerai mattmillerai added agent-coded PR authored by the agent-work loop cursor-review Request Cursor bot review labels Jul 3, 2026
@mattmillerai mattmillerai marked this pull request as ready for review July 3, 2026 01:48
@dosubot dosubot Bot added size:XXL This PR changes 1000+ lines, ignoring generated files. enhancement New feature or request labels Jul 3, 2026

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔍 Cursor Review — Consolidated panel

Triggered by @mattmillerai.

Found 10 finding(s).

Severity Count
🟠 High 1
🟡 Medium 4
🟢 Low 4
⚪ Nit 1

Panel: 8/8 reviewers contributed findings.

Comment thread comfy_cli/bench/fixtures/edit_session_seed.json
Comment thread comfy_cli/bench/run_arm_b.py
Comment thread comfy_cli/bench/run_arm_b.py Outdated
Comment thread comfy_cli/bench/run_arm_b.py Outdated
Comment thread comfy_cli/bench/run_arm_b.py
Comment thread comfy_cli/bench/run_arm_b.py Outdated
Comment thread comfy_cli/bench/run_arm_b.py
Comment thread comfy_cli/bench/run_arm_b.py Outdated
Comment thread comfy_cli/bench/run_arm_b.py Outdated
Comment thread comfy_cli/bench/run_arm_b.py Outdated
Address the Cursor consolidated-panel findings on PR #490:

- fixtures: edit_session_seed KSampler was missing the control_after_generate
  widget value, shifting every slot after `seed` off by one (steps/cfg/sampler
  read + written to the wrong indices on t4). Add the 7th value; add a
  regression test asserting the slots align.
- run_turn: always answer emitted tool_use blocks before breaking, and stop the
  task's remaining turns when a turn is truncated (non-terminal stop reason or
  MAX_API_CALLS_PER_TURN exhaustion) — keeps the shared t4 transcript valid and
  stamps such turns outcome="truncated" instead of a false "ok".
- fail-fast: unknown --model (no pricing entry) now errors instead of silently
  pricing every turn at $0; unrecognized non-null seedWorkflow errors instead of
  silently running against the wrong seed graph.
- safety: validate task ids before interpolating them into filesystem paths
  (reject separators/`..`); cap `vary` variant count at MAX_VARIANTS and
  namespace each vary call's output dir so a second call can't overwrite the
  first; return a generic tool-error message (log detail locally) so raw
  exception strings / local paths aren't sent to the Anthropic API.
- telemetry: count tool payload bytes as UTF-8 (ensure_ascii=False) instead of
  escaped-ASCII lengths so the byte comparison against arm A isn't skewed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent-coded PR authored by the agent-work loop cursor-review Request Cursor bot review enhancement New feature or request size:XXL This PR changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant