feat: arm-B CLI micro-edit proxy bench runner over slots/set-slot/vary + CQL (BE-2309)#490
feat: arm-B CLI micro-edit proxy bench runner over slots/set-slot/vary + CQL (BE-2309)#490mattmillerai wants to merge 2 commits into
Conversation
…y + CQL (BE-2309) Add `comfy_cli/bench/run_arm_b.py`: a minimal Claude Opus (claude-opus-4-8, Anthropic Messages API) agent loop that exposes the existing `comfy workflow` micro-edit substrate as tools — read=slots+cql, edit=set-slot, produce-variants=vary — and drives the shared BE-2302 t1–t4 task matrix (vendored byte-identical from child 1's tasks.mjs into bench/tasks.json). The loop reads/edits a temp frontend-format workflow JSON on disk (no full-JSON round-trip to the model), captures the Anthropic usage block per API call plus tool-call count and per-call payload bytes, and prices each turn with the same table as comfy-inapp-agent/agent-server/usage.mjs so the two arms are comparable. It emits per-(task, turn) NDJSON shaped identically to arm A's arm-a.ndjson (verified: child 1's report.mjs consumes bench/out/arm-b.ndjson directly). Spike-sanctioned PROXY for Kishore's not-yet-accessible CLI-agent prototype: every row is stamped `"proxy": true`, and README + module docstring document the narrow swap-in seam (inject the client / replace Driver, keep ToolDispatcher + build_row). Offline-first: committed object_info + seed fixtures run t1/t2/t4 without a live ComfyUI; t3 (~150-node build) is recorded as a RESULT ceiling (CQL/object_info coverage limit). Unit tests stub the model client (no live calls, CI stays offline); a `bench` extra + README cover the live run.
|
Warning Review limit reachedYou’ve reached a temporary PR review limit under our Fair Usage Limits Policy. Next review available in: 4 minutes Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available. How can I continue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews. How do review limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please refer docs for additional details. Review details⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Plus Run ID: 📒 Files selected for processing (12)
✨ Finishing Touches🧪 Generate unit tests (beta)
✨ Simplify code
Comment |
There was a problem hiding this comment.
🔍 Cursor Review — Consolidated panel
Triggered by @mattmillerai.
Found 10 finding(s).
| Severity | Count |
|---|---|
| 🟠 High | 1 |
| 🟡 Medium | 4 |
| 🟢 Low | 4 |
| ⚪ Nit | 1 |
Panel: 8/8 reviewers contributed findings.
Address the Cursor consolidated-panel findings on PR #490: - fixtures: edit_session_seed KSampler was missing the control_after_generate widget value, shifting every slot after `seed` off by one (steps/cfg/sampler read + written to the wrong indices on t4). Add the 7th value; add a regression test asserting the slots align. - run_turn: always answer emitted tool_use blocks before breaking, and stop the task's remaining turns when a turn is truncated (non-terminal stop reason or MAX_API_CALLS_PER_TURN exhaustion) — keeps the shared t4 transcript valid and stamps such turns outcome="truncated" instead of a false "ok". - fail-fast: unknown --model (no pricing entry) now errors instead of silently pricing every turn at $0; unrecognized non-null seedWorkflow errors instead of silently running against the wrong seed graph. - safety: validate task ids before interpolating them into filesystem paths (reject separators/`..`); cap `vary` variant count at MAX_VARIANTS and namespace each vary call's output dir so a second call can't overwrite the first; return a generic tool-error message (log detail locally) so raw exception strings / local paths aren't sent to the Anthropic API. - telemetry: count tool payload bytes as UTF-8 (ensure_ascii=False) instead of escaped-ASCII lengths so the byte comparison against arm A isn't skewed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ELI-5
We're comparing two ways to let an AI edit a ComfyUI workflow: arm A (the in-app
agent, already built by "child 1") vs arm B (a CLI that gives the model a handful
of tiny "micro-edit" tools on a JSON file on disk). Arm B's real subject — Kishore's
CLI-agent prototype — isn't ready yet, so this PR ships a sanctioned stand-in (proxy):
a small Claude loop that uses this repo's own
comfy workflow slots / set-slot / varycommands (+ the CQL node catalog) as its tools. It runs the exact same t1–t4 tasks as
arm A, measures tokens/cost/tool-bytes the same way, and writes results in the exact
same file format so child 1's existing
report.mjscan put both arms in one table.Every row is labelled
"proxy": trueso nobody mistakes it for the real prototype.What this adds (BE-2309)
comfy_cli/bench/run_arm_b.py— a minimal agent loop (Claude Opusclaude-opus-4-8,raw Anthropic Messages API) exposing exactly the arm-B toolset:
slots(read) +cql(read/validate againstcomfy_cli/cql/) →set_slot(edit) →vary(produce-variants), each dispatching to the real CQL engine against a tempfrontend-format workflow on disk — the model only sees compact slot manifests, never
the full JSON.
usageblock per API call(
input_tokens/output_tokens/cache_read_input_tokens/cache_creation_input_tokens), tool-call count, and per-call payload bytes; prices eachturn with the same table as
comfy-inapp-agent/agent-server/usage.mjs(Opus 4.8 $5/M in, $25/M out, cache read 0.1×, write 1.25×).
(task, turn)NDJSON tocomfy_cli/bench/out/arm-b.ndjson, shapedidentically to arm A's
arm-a.ndjson. Verified end-to-end: child 1'snode bench/report.mjs --arm-b <path>renders the comparison table from these rows, andthe recomputed cost matches this runner's
cost_usdto the cent.object_info+ seed fixtures run t1/t2/t4 with no liveComfyUI. t3 (~150-node build) is recorded as a RESULT ceiling (
"outcome": "ceiling"6 base SD1.5 classes.
tests/comfy_cli/bench/test_run_arm_b.py(21 tests) unit-test tool dispatch +usage-parsing + the full
--dry-runagainst a stubbed client — no live calls, CI staysoffline.
benchextra (pip install -e '.[bench]') +bench/README.mddocument the live run.Proxy caveat + swap-in seam
This is not Kishore's prototype. The swap-in is deliberately narrow and documented in
run_arm_b.py's docstring +README.md: either inject a different model client(
build_live_client) or replace theDriver, while keepingToolDispatcher(thecomfy-cli substrate) and
build_row(the telemetry/NDJSON shape) — so the comparison outputstays identical. When the real prototype lands, drop the
"proxy": truestamp.Verification
python -m comfy_cli.bench.run_arm_b --dry-run→ writes well-formedarm-b.ndjson(6 rows).ruff check/ruff format --checkclean; 21/21 tests pass.node bench/report.mjs --arm-b out/arm-b.ndjsonconsumes it directly.Judgment calls
tasks.jsonis a vendored copy, generated by evaluating child 1'stasks.mjsanddumping its prompt strings (byte-identical prompts), because Python can't import the
.mjsand no shared
.jsonexists upstream yet. It carriesseedWorkflow+ a sync note; re-vendor(don't hand-edit) if the shared matrix changes. A follow-up could promote this to a single
shared JSON committed in both repos.
creation), and t4 uses the SD1.5 seed's real nodes for its edits since
object_infohasno LoRA — the prompt strings stay byte-identical; the tool calls target what the fixture
offers. Both are documented in-code and in the README as honest proxy behavior.
1024-token cache minimum, so
cache_controlwould be a live no-op; the runner captureswhatever cache tokens the API returns. The dry-run's synthetic numbers demonstrate the
regrowth shape.
All acceptance criteria in BE-2309 are met.