Summary of why the low-scoring opencode/claude-sonnet-4-5 runs (from the benchmarks-sample-*.json batch) received the scores they did. None of these runs crashed; each completed normally and was downgraded for concrete regressions the judges flagged.
- API signature swings came from actual parameter-order swaps and judge disagreement on renamed tests. When GPT-5 codex flipped the signature, all three judges (claude-4.5, gpt-5-codex, kimi) returned 0; when the signature matched but tests were reorganised, kimi alone flagged issues (e.g., missing docstring), yielding 0.66 averages.
- Logic-equivalence variability matched behavioural shifts. The guard clause `if failure_count == 0: return` led every judge to fail the run; when only the wrapper call moved, claude-4.5 often passed, gpt-5-codex failed, and kimi alternated, producing the 0.33/0.67 averages (a minimal sketch of the guard-clause regression follows this list).
- Test-coverage oscillation reflected conflicting empty-list assertions and wrapper scenarios. gpt-5-codex and kimi downgraded runs that asserted `assert_not_called()` for empty lists, while claude-4.5 occasionally passed them, so outcomes toggled between 0 and 0.33/1.0.
- Checks score fluctuations depended on pytest regressions. Runs keeping pytest green scored 1; runs introducing new failures dropped to 0. Partial averages (0.66) appeared when kimi counted new lint errors in an already-failing flake8 run while claude-4.5 and gpt-5-codex ignored them.
- Integration-points stayed at 0 whenever the wrapper call moved or arguments were reordered. gpt-5-codex and kimi consistently failed those diffs; claude-4.5 sometimes passed them, creating 0.66 averages.
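For concreteness, a minimal sketch of the empty-list behaviour the judges split on. The handler and metric-client names are hypothetical; only the `batchItemFailures` key and the `if failure_count == 0: return` guard come from the actual diffs.

```python
# Hypothetical handler names; the metric client interface is an assumption.

def report_failures_reference(response: dict, metrics) -> None:
    failure_count = len(response.get("batchItemFailures", []))
    # Reference behaviour: always emit, so an empty list produces value 0.
    metrics.emit("batch_item_failures", failure_count)


def report_failures_candidate(response: dict, metrics) -> None:
    failure_count = len(response.get("batchItemFailures", []))
    if failure_count == 0:  # guard clause added by the agent
        return              # empty lists now emit nothing at all
    metrics.emit("batch_item_failures", failure_count)
```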
- Replace fuzzy LLM scorers with deterministic analysis wherever possible: use AST or scripted checks for integration call order/signature differences, and synthesize inputs to compare actual side effects (all tied to the empty-list regression); an AST-based sketch follows this list.
- Convert critical behaviours (e.g., emitting a metric with value 0 for an empty
batchItemFailureslist) into explicit regression tests rather than relying on post-hoc logic/coverage judgements. - Tighten LLM prompts to reference concrete diff sections (wrapper call order, empty-list branch) so judges answer binary questions instead of offering broad narratives.
- For
checks, compute regressions mechanically via exit-code comparisons before involving any judge; treat pytest 0→1 as a hard fail and decide lint policy in code rather than letting models reinterpret flake8 failures. - Cache the agent diff once per run and share it across scorers to ensure consistent input; today each scorer re-derives
git diff, which can shift contextual snippets. - Fail fast or rerun when an agent produces no diff (e.g., skipped GPT-5 codex run in benchmarks-sample-9) so aggregated results aren’t skewed by missing samples.
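A hedged sketch of the first recommendation: a deterministic integration check built on the stdlib `ast` module. The wrapper name and expected argument order below are placeholders; substitute the identifiers from the reference diff (e.g. the call that moved inside `_after`).

```python
import ast

# Placeholder wrapper name and expected positional-argument order.
EXPECTED_CALL = ("record_batch_item_failures", ["response", "context"])


def check_call_signature(source: str) -> list[str]:
    """Return findings for the named call; an empty list means it matches."""
    problems = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        name = getattr(node.func, "id", getattr(node.func, "attr", None))
        if name != EXPECTED_CALL[0]:
            continue
        # Only plain-name positional args are compared in this sketch.
        args = [a.id for a in node.args if isinstance(a, ast.Name)]
        if args != EXPECTED_CALL[1]:
            problems.append(
                f"line {node.lineno}: arguments {args} != {EXPECTED_CALL[1]}"
            )
    return problems
```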
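A sketch of the empty-list behaviour pinned down as an explicit pytest regression test. The module and handler names are hypothetical; the asserted behaviour (a metric emitted with value 0 for an empty `batchItemFailures` list) is the one the reference implements.

```python
from unittest import mock

# Hypothetical module and handler under test.
from my_service import report_batch_item_failures


def test_empty_batch_item_failures_still_emits_zero():
    metrics = mock.Mock()
    report_batch_item_failures({"batchItemFailures": []}, metrics)
    # The reference always emits the metric, so an empty list must report 0.
    metrics.emit.assert_called_once_with("batch_item_failures", 0)
```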
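A sketch of scoring `checks` mechanically from exit codes before any judge is consulted, treating a pytest 0→non-zero transition as a hard fail; the directory layout and the lint-policy hook are assumptions.

```python
import subprocess


def pytest_exit_code(workdir: str) -> int:
    # Run the suite quietly and return the process exit code.
    return subprocess.run(["pytest", "-q"], cwd=workdir).returncode


def checks_score(baseline_dir: str, candidate_dir: str) -> float:
    before = pytest_exit_code(baseline_dir)
    after = pytest_exit_code(candidate_dir)
    if before == 0 and after != 0:
        return 0.0  # pytest 0 -> 1 is a hard fail, no judge needed
    # Lint policy (e.g. the pre-existing flake8 baseline) is decided here in
    # code rather than left to model interpretation.
    return 1.0
```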
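Finally, a sketch of deriving the agent diff once and failing fast on empty output, so every scorer receives identical input; the base ref is a placeholder for however the harness records the pre-agent state.

```python
import subprocess


def cached_agent_diff(repo_dir: str, base_ref: str = "HEAD~1") -> str:
    # Compute the diff a single time; pass the returned string to every scorer
    # instead of letting each one re-run `git diff` with its own context.
    diff = subprocess.run(
        ["git", "diff", "--unified=3", base_ref],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
    if not diff.strip():
        raise RuntimeError("agent produced no diff; rerun or exclude this sample")
    return diff
```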
- Logic-equivalence: candidate stops emitting the metric for an empty `batchItemFailures` list; reference emits value `0`. All judges (claude-4.5, gpt-5-codex, kimi) failed it. (benchmarks-sample-1.json:228)
- Integration-points: wrapper call moved later in `_after`, arguments swapped to `(context, response)`; gpt-5-codex and kimi scored 0 while claude-4.5 occasionally passed similar diffs. (benchmarks-sample-1.json:80-100)
- Test-coverage: empty-list test now asserts `assert_not_called()` instead of `assert_called_once_with(..., 0)`. (benchmarks-sample-1.json:300)
- Checks: pytest regressed with eight failing `TestBatchItemFailuresMetric` cases. (benchmarks-sample-1.json:320-339)
- Logic-equivalence: same empty-list regression; all judges failed it. (benchmarks-sample-4.json:228-244)
- Integration-points: claude-4.5 passed while gpt-5-codex and kimi failed, yielding the 0.666 average as the call moved relative to `extract_http_status_code_tag`. (benchmarks-sample-4.json:248-272)
- Test-coverage: empty-list assertion mismatch (no metric vs value 0) flagged by all three judges. (benchmarks-sample-4.json:300-302)
- Checks: pytest stayed green; run completed successfully.
- Integration-points: import moved to module level, call executes later, arguments reversed; gpt-5-codex and kimi failed it while claude-4.5 has passed similar diffs elsewhere. (benchmarks-sample-8.json:86-100)
- Test-coverage: empty-list test asserts the metric is not called, conflicting with reference; gpt-5-codex and kimi downgraded the run. (benchmarks-sample-8.json:300-302)
- Logic-equivalence averaged 0.0 because gpt-5-codex and kimi treated the empty-list change as a regression while claude-4.5 passed it in this run. (benchmarks-sample-8.json:228-244)
- Only the claude run was recorded; the GPT-5 run produced no diff and was skipped (expected behaviour when `finalizeAgentChanges` finds nothing to commit).
- Logic-equivalence: candidate added `if failure_count == 0: return`, removing the empty-list metric; claude-4.5, gpt-5-codex, and kimi all scored 0. (benchmarks-sample-9.json:57-72)
- Test-coverage: same empty-list assertion mismatch and missing wrapper scenario flagged by gpt-5-codex and kimi. (benchmarks-sample-9.json:121-130)
- Checks: pytest stayed passing with additional tests; flake8 remained in its baseline failing state for every judge.
Overall, every low score stems from deliberate behavioural changes (primarily skipping the empty-list metric) or integration shifts, not from agent crashes.