feat: arm-B CLI micro-edit proxy bench runner over slots/set-slot/vary + CQL (BE-2309) by mattmillerai · Pull Request #490 · Comfy-Org/comfy-cli

mattmillerai · 2026-07-03T01:48:30Z

ELI-5

We're comparing two ways to let an AI edit a ComfyUI workflow: arm A (the in-app
agent, already built by "child 1") vs arm B (a CLI that gives the model a handful
of tiny "micro-edit" tools on a JSON file on disk). Arm B's real subject — Kishore's
CLI-agent prototype — isn't ready yet, so this PR ships a sanctioned stand-in (proxy):
a small Claude loop that uses this repo's own comfy workflow slots / set-slot / vary
commands (+ the CQL node catalog) as its tools. It runs the exact same t1–t4 tasks as
arm A, measures tokens/cost/tool-bytes the same way, and writes results in the exact
same file format so child 1's existing report.mjs can put both arms in one table.
Every row is labelled "proxy": true so nobody mistakes it for the real prototype.

What this adds (BE-2309)

comfy_cli/bench/run_arm_b.py — a minimal agent loop (Claude Opus claude-opus-4-8,
raw Anthropic Messages API) exposing exactly the arm-B toolset:
- slots (read) + cql (read/validate against comfy_cli/cql/) → set_slot (edit) →
  vary (produce-variants), each dispatching to the real CQL engine against a temp
  frontend-format workflow on disk — the model only sees compact slot manifests, never
  the full JSON.
Telemetry — captures the Anthropic usage block per API call
(input_tokens / output_tokens / cache_read_input_tokens /
cache_creation_input_tokens), tool-call count, and per-call payload bytes; prices each
turn with the same table as comfy-inapp-agent/agent-server/usage.mjs
(Opus 4.8 $5/M in, $25/M out, cache read 0.1×, write 1.25×).
Output — per-(task, turn) NDJSON to comfy_cli/bench/out/arm-b.ndjson, shaped
identically to arm A's arm-a.ndjson. Verified end-to-end: child 1's
node bench/report.mjs --arm-b <path> renders the comparison table from these rows, and
the recomputed cost matches this runner's cost_usd to the cent.
Offline-first — committed object_info + seed fixtures run t1/t2/t4 with no live
ComfyUI. t3 (~150-node build) is recorded as a RESULT ceiling ("outcome": "ceiling"
- note): arm B's edit-only toolset can't build 150 nodes and the fixture covers only the
  6 base SD1.5 classes.
Tests — tests/comfy_cli/bench/test_run_arm_b.py (21 tests) unit-test tool dispatch +
usage-parsing + the full --dry-run against a stubbed client — no live calls, CI stays
offline. bench extra (pip install -e '.[bench]') + bench/README.md document the live run.

Proxy caveat + swap-in seam

This is not Kishore's prototype. The swap-in is deliberately narrow and documented in
run_arm_b.py's docstring + README.md: either inject a different model client
(build_live_client) or replace the Driver, while keeping ToolDispatcher (the
comfy-cli substrate) and build_row (the telemetry/NDJSON shape) — so the comparison output
stays identical. When the real prototype lands, drop the "proxy": true stamp.

Verification

python -m comfy_cli.bench.run_arm_b --dry-run → writes well-formed arm-b.ndjson (6 rows).
ruff check / ruff format --check clean; 21/21 tests pass.
Cross-repo: node bench/report.mjs --arm-b out/arm-b.ndjson consumes it directly.

Judgment calls

tasks.json is a vendored copy, generated by evaluating child 1's tasks.mjs and
dumping its prompt strings (byte-identical prompts), because Python can't import the .mjs
and no shared .json exists upstream yet. It carries seedWorkflow + a sync note; re-vendor
(don't hand-edit) if the shared matrix changes. A follow-up could promote this to a single
shared JSON committed in both repos.
t1 "build small" is proxied as template + micro-edit (arm B has no from-scratch node
creation), and t4 uses the SD1.5 seed's real nodes for its edits since object_info has
no LoRA — the prompt strings stay byte-identical; the tool calls target what the fixture
offers. Both are documented in-code and in the README as honest proxy behavior.
No forced prompt-caching on the live path: the system+tools prefix is well under Opus's
1024-token cache minimum, so cache_control would be a live no-op; the runner captures
whatever cache tokens the API returns. The dry-run's synthetic numbers demonstrate the
regrowth shape.

All acceptance criteria in BE-2309 are met.

…y + CQL (BE-2309) Add `comfy_cli/bench/run_arm_b.py`: a minimal Claude Opus (claude-opus-4-8, Anthropic Messages API) agent loop that exposes the existing `comfy workflow` micro-edit substrate as tools — read=slots+cql, edit=set-slot, produce-variants=vary — and drives the shared BE-2302 t1–t4 task matrix (vendored byte-identical from child 1's tasks.mjs into bench/tasks.json). The loop reads/edits a temp frontend-format workflow JSON on disk (no full-JSON round-trip to the model), captures the Anthropic usage block per API call plus tool-call count and per-call payload bytes, and prices each turn with the same table as comfy-inapp-agent/agent-server/usage.mjs so the two arms are comparable. It emits per-(task, turn) NDJSON shaped identically to arm A's arm-a.ndjson (verified: child 1's report.mjs consumes bench/out/arm-b.ndjson directly). Spike-sanctioned PROXY for Kishore's not-yet-accessible CLI-agent prototype: every row is stamped `"proxy": true`, and README + module docstring document the narrow swap-in seam (inject the client / replace Driver, keep ToolDispatcher + build_row). Offline-first: committed object_info + seed fixtures run t1/t2/t4 without a live ComfyUI; t3 (~150-node build) is recorded as a RESULT ceiling (CQL/object_info coverage limit). Unit tests stub the model client (no live calls, CI stays offline); a `bench` extra + README cover the live run.

coderabbitai · 2026-07-03T01:48:38Z

Warning

Review limit reached

You’ve reached a temporary PR review limit under our Fair Usage Limits Policy.

Your recent review volume is higher than typical usage, so adaptive limits are currently applied.

Next review available in: 4 minutes

Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available.
You're only billed for reviews past your plan's rate limits ($0.25/file).

How can I continue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews.

How do review limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please refer docs for additional details.

Review details

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 37d31342-f144-4b29-9862-b112f67f548b

📥 Commits

Reviewing files that changed from the base of the PR and between cb4b6f8 and e05522e.

📒 Files selected for processing (12)

comfy_cli/bench/README.md
comfy_cli/bench/__init__.py
comfy_cli/bench/fixtures/edit_session_seed.json
comfy_cli/bench/fixtures/object_info.json
comfy_cli/bench/fixtures/txt2img_seed.json
comfy_cli/bench/out/.gitignore
comfy_cli/bench/out/.gitkeep
comfy_cli/bench/run_arm_b.py
comfy_cli/bench/tasks.json
pyproject.toml
tests/comfy_cli/bench/__init__.py
tests/comfy_cli/bench/test_run_arm_b.py

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch matt/be-2309-arm-b-bench-runner

✨ Simplify code

Create PR with simplified code
Commit simplified code in branch matt/be-2309-arm-b-bench-runner

_{Comment @coderabbitai help to get the list of available commands.}

github-actions

🔍 Cursor Review — Consolidated panel

Triggered by @mattmillerai.

Found 10 finding(s).

Severity	Count
🟠 High	1
🟡 Medium	4
🟢 Low	4
⚪ Nit	1

Panel: 8/8 reviewers contributed findings.

Address the Cursor consolidated-panel findings on PR #490: - fixtures: edit_session_seed KSampler was missing the control_after_generate widget value, shifting every slot after `seed` off by one (steps/cfg/sampler read + written to the wrong indices on t4). Add the 7th value; add a regression test asserting the slots align. - run_turn: always answer emitted tool_use blocks before breaking, and stop the task's remaining turns when a turn is truncated (non-terminal stop reason or MAX_API_CALLS_PER_TURN exhaustion) — keeps the shared t4 transcript valid and stamps such turns outcome="truncated" instead of a false "ok". - fail-fast: unknown --model (no pricing entry) now errors instead of silently pricing every turn at $0; unrecognized non-null seedWorkflow errors instead of silently running against the wrong seed graph. - safety: validate task ids before interpolating them into filesystem paths (reject separators/`..`); cap `vary` variant count at MAX_VARIANTS and namespace each vary call's output dir so a second call can't overwrite the first; return a generic tool-error message (log detail locally) so raw exception strings / local paths aren't sent to the Anthropic API. - telemetry: count tool payload bytes as UTF-8 (ensure_ascii=False) instead of escaped-ASCII lengths so the byte comparison against arm A isn't skewed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

mattmillerai added agent-coded PR authored by the agent-work loop cursor-review Request Cursor bot review labels Jul 3, 2026

mattmillerai marked this pull request as ready for review July 3, 2026 01:48

dosubot Bot added size:XXL This PR changes 1000+ lines, ignoring generated files. enhancement New feature or request labels Jul 3, 2026

github-actions Bot reviewed Jul 3, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: arm-B CLI micro-edit proxy bench runner over slots/set-slot/vary + CQL (BE-2309)#490

feat: arm-B CLI micro-edit proxy bench runner over slots/set-slot/vary + CQL (BE-2309)#490
mattmillerai wants to merge 2 commits into
mainfrom
matt/be-2309-arm-b-bench-runner

mattmillerai commented Jul 3, 2026

Uh oh!

coderabbitai Bot commented Jul 3, 2026 •

edited

Loading

Review limit reached

Uh oh!

github-actions Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

mattmillerai commented Jul 3, 2026

ELI-5

What this adds (BE-2309)

Proxy caveat + swap-in seam

Verification

Judgment calls

Uh oh!

coderabbitai Bot commented Jul 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

🔍 Cursor Review — Consolidated panel

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented Jul 3, 2026 •

edited

Loading