Skip to content

fix(cursor-review): harden judge against parse_error (lenient extraction + retry + panel fallback) (BE-1916)#17

Merged
mattmillerai merged 1 commit into
mainfrom
matt/be-1916-harden-judge-parse
Jun 29, 2026
Merged

fix(cursor-review): harden judge against parse_error (lenient extraction + retry + panel fallback) (BE-1916)#17
mattmillerai merged 1 commit into
mainfrom
matt/be-1916-harden-judge-parse

Conversation

@mattmillerai

Copy link
Copy Markdown
Contributor

ELI-5

Sometimes the "judge" model that combines all the reviewers' notes answers in
plain English ("Based on my review, I think…") and then pastes the JSON we
need. Our old parser only knew a couple of rigid ways to find that JSON, so it
choked on the prose, threw everything away, and the whole PR review showed up as
a scary "Review failed" — even when the judge's answer was actually correct
(this happened on ComfyUI #487 and Alex's PR). This PR teaches the pipeline to
(1) dig the JSON out of the prose, (2) politely ask the judge once more if it
still can't, and (3) if even that fails, show the raw reviewer notes instead of
a blank failure.

What changed

Three layers, cheapest first, all in the cursor-review judge path:

  1. Lenient extraction (extract-findings.py) — parse_json_findings now
    recovers JSON wrapped in prose via a string/escape-aware balanced
    brace-scan
    for the first {…}/[…] region that actually parses, instead
    of the old naive first-[/last-] slice (which broke the moment the prose
    contained a stray bracket — exactly the #487 shape). New coerce_findings_list
    also unwraps a {"findings": […]}-style object. The fast path (clean JSON)
    and the fenced-json path are unchanged.

  2. Retry once (cursor-review.ymlRun judge) — after the judge runs we
    probe-parse its output; on a hard parse failure we re-run the judge exactly
    once
    with a terse "Return ONLY the JSON array, no prose" reminder appended.
    No loop — one retry, then we stop.

  3. Tighten the judge prompt (prompt-judge.md) — a bare-JSON-array demand is
    now the first line of the prompt.

Plus the degraded fallback: if the judge still can't be parsed after the
retry and the panel produced findings, we post the raw, de-duplicated per-cell
findings under a warning banner — never a bare "Review failed" when there
are real findings to show. A genuine total failure (no panel findings either)
still surfaces the honest error review.

Acceptance criteria

  • Prose-wrapped valid JSON is parsed successfully, not failed — parse_json_findings + balanced scan; regression-tested.
  • A malformed judge response triggers exactly one retry; only a second failure surfaces as an error, and even then posts the per-cell findings (deduped, with a banner) rather than a bare "Review failed".
  • Regression test feeding a #487-style prose-wrapped-JSON sample through the parser — .github/cursor-review/tests/test_extract_findings.py (test_prose_wrapped_array_487 + 13 more).

Testing

  • python3 -m unittest discover -s .github/cursor-review/tests14 tests, all green. New CI workflow test-cursor-review-scripts.yml runs them on any change under .github/cursor-review/**.
  • Simulated the consolidate step's three branches locally (judge-ok / judge-failed-with-panel-findings / no-contributors) and dry-ran post-review.py --notice against a stub gh: the degraded path posts a real review with inline comments + banner, not "Review failed".

Judgment calls / not done

  • JSON / structured-output mode (plan item 3, second half) — intentionally NOT
    enabled.
    The judge is invoked through the cursor-agent CLI, which has no
    reliable per-model knob to force the findings array shape; --output-format json wraps stdout in cursor-agent's own envelope, which would break the
    existing parser. Flipping it blind is a net regression risk, so I relied on the
    prompt tightening + the strict-JSON retry (which is structured-output pressure
    at the prompt level) + the lenient parser. The acceptance criteria are met
    without it. Happy to revisit if a confirmed JSON-mode flag exists.
  • Brace-scan first-match limitation: if the judge's prose contains a
    standalone valid JSON array of scalars before the real findings array, the
    scanner could pick the wrong one. This is contrived (covered case: prose with a
    non-JSON […] recovers correctly, tested), strictly better than the old
    behavior, and backstopped by the retry + prompt. Not over-engineered.

@mattmillerai mattmillerai added agent-coded Authored by the agent-work loop cursor-review Multi-model cursor review labels Jun 29, 2026
@coderabbitai

coderabbitai Bot commented Jun 29, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

The PR hardens the cursor-review AI code-review pipeline against malformed LLM judge output. extract-findings.py gains layered JSON parsing and a coerce_findings_list normalizer. cursor-review.yml adds a judge retry and a degraded fallback that unions per-cell panel findings. post-review.py gains a --notice banner flag. A regression test suite and CI workflow are added.

Changes

Robust Judge Output Parsing and Degraded Fallback

Layer / File(s) Summary
Judge prompt output constraint
.github/cursor-review/prompt-judge.md
Prepends a strict JSON-array-only directive to the judge prompt, forbidding prose and markdown fences.
Layered JSON parsing and coerce_findings_list
.github/cursor-review/extract-findings.py
Reworks parse_json_findings with fenced-block extraction, string-aware balanced-region scanning, and scalar rejection; adds coerce_findings_list to unwrap wrapper objects; simplifies main() branching to a single None check.
Regression tests and CI workflow
.github/cursor-review/tests/test_extract_findings.py, .github/workflows/test-cursor-review-scripts.yml
Adds ParseJsonFindingsTest covering all parse paths and MainEndToEndTest for status transitions; adds a unittest discover workflow triggered on PRs and pushes to main.
--notice banner in post-review.py
.github/cursor-review/post-review.py
Extends argument parser with optional --notice, sanitizes it against GitHub mentions, and prepends it to the review header.
Retry, degraded fallback, and post-review branching in workflow
.github/workflows/cursor-review.yml
Wraps cursor-agent in run_judge() with one retry on non-ok parse; replaces always-use-judge-findings with conditional union of deduplicated per-cell panel findings on failure; emits consolidated_count and degraded; branches post-review into three cases: normal, warning-banner --notice, or hard error — quite the trifecta of tricks up its sleeve!
🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch matt/be-1916-harden-judge-parse
✨ Simplify code
  • Create PR with simplified code
  • Commit simplified code in branch matt/be-1916-harden-judge-parse

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/cursor-review/extract-findings.py:
- Around line 76-77: The scanning logic in the bracket parser is incorrectly
advancing past the rest of the input when it encounters an unterminated opener,
which can hide a later valid JSON payload. Update the loop in the extraction
routine that uses the `i = j + 1` resume step so an unbalanced `[` or `{` only
skips the bad region and continues scanning from the next viable position
instead of jumping to EOF. Keep the fix localized to the parser’s main scan path
so later valid content can still be detected.
- Around line 120-124: The wrapper-unwrapping logic in coerce_findings_list() is
too permissive because it accepts any list under keys like results, items, or
reviews, which lets malformed payloads through as status="ok". Tighten the
validation in coerce_findings_list() so it only returns lists whose contents
match the expected findings schema, and otherwise let main() treat the payload
as a parse/schema failure instead of a successful result.

In @.github/workflows/test-cursor-review-scripts.yml:
- Around line 27-35: The workflow is using floating GitHub Actions tags for
Checkout and Set up Python, which should be pinned to full commit SHAs. Update
the uses entries in the workflow to replace actions/checkout@v6 and
actions/setup-python@v6 with their exact commit hashes so the referenced actions
in the checkout/setup steps remain immutable.
- Around line 8-17: Add a concurrency policy to the workflow so superseded runs
are canceled for the same branch or PR. Update the workflow definition in
test-cursor-review-scripts.yml near the existing on/pull_request and push
triggers by introducing a concurrency group keyed to the ref or PR identity, and
enable cancel-in-progress so force-pushes don’t queue duplicate runs. Use the
workflow’s existing trigger context to choose the group consistently across push
and pull_request events.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 3ab20d41-e367-44b2-8e29-e41b9d9f9e56

📥 Commits

Reviewing files that changed from the base of the PR and between 51e24ab and 6d4a388.

📒 Files selected for processing (6)
  • .github/cursor-review/extract-findings.py
  • .github/cursor-review/post-review.py
  • .github/cursor-review/prompt-judge.md
  • .github/cursor-review/tests/test_extract_findings.py
  • .github/workflows/cursor-review.yml
  • .github/workflows/test-cursor-review-scripts.yml

Comment on lines +76 to +77
# Resume scanning after this region (or after an unbalanced opener).
i = j + 1

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Don’t stop scanning after an unterminated opener.

If the prose contains a stray [ or { that never balances, Line 77 jumps i to EOF and the parser never reaches a later valid JSON payload. One bad bracket, big racket: a response like note [draft...\n[{"file":...}] still ends up as parse_error.

Proposed fix
-        # Resume scanning after this region (or after an unbalanced opener).
-        i = j + 1
+        # Resume after a closed region; otherwise advance one char so later JSON
+        # candidates are still reachable.
+        if j < n and depth == 0:
+            i = j + 1
+        else:
+            i += 1
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Resume scanning after this region (or after an unbalanced opener).
i = j + 1
# Resume after a closed region; otherwise advance one char so later JSON
# candidates are still reachable.
if j < n and depth == 0:
i = j + 1
else:
i += 1
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/cursor-review/extract-findings.py around lines 76 - 77, The scanning
logic in the bracket parser is incorrectly advancing past the rest of the input
when it encounters an unterminated opener, which can hide a later valid JSON
payload. Update the loop in the extraction routine that uses the `i = j + 1`
resume step so an unbalanced `[` or `{` only skips the bad region and continues
scanning from the next viable position instead of jumping to EOF. Keep the fix
localized to the parser’s main scan path so later valid content can still be
detected.

Comment on lines +120 to +124
if isinstance(parsed, dict):
for key in ("findings", "results", "items", "reviews"):
value = parsed.get(key)
if isinstance(value, list):
return value

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Validate wrapper contents before returning status="ok".

coerce_findings_list() unwraps any list under results, items, or reviews, so payloads like {"results":["oops"]} are now treated as successful findings. That breaks the downstream contract in main() and can push malformed data into the judge/post-review path instead of surfacing a parse/schema failure.

Proposed fix
+def _is_findings_list(value):
+    required = {"file", "line", "side", "severity", "body"}
+    return isinstance(value, list) and all(
+        isinstance(item, dict) and required.issubset(item) for item in value
+    )
+
 def coerce_findings_list(parsed):
@@
-    if isinstance(parsed, list):
+    if _is_findings_list(parsed):
         return parsed
     if isinstance(parsed, dict):
         for key in ("findings", "results", "items", "reviews"):
             value = parsed.get(key)
-            if isinstance(value, list):
+            if _is_findings_list(value):
                 return value
     return None
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if isinstance(parsed, dict):
for key in ("findings", "results", "items", "reviews"):
value = parsed.get(key)
if isinstance(value, list):
return value
def _is_findings_list(value):
required = {"file", "line", "side", "severity", "body"}
return isinstance(value, list) and all(
isinstance(item, dict) and required.issubset(item) for item in value
)
if _is_findings_list(parsed):
return parsed
if isinstance(parsed, dict):
for key in ("findings", "results", "items", "reviews"):
value = parsed.get(key)
if _is_findings_list(value):
return value
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/cursor-review/extract-findings.py around lines 120 - 124, The
wrapper-unwrapping logic in coerce_findings_list() is too permissive because it
accepts any list under keys like results, items, or reviews, which lets
malformed payloads through as status="ok". Tighten the validation in
coerce_findings_list() so it only returns lists whose contents match the
expected findings schema, and otherwise let main() treat the payload as a
parse/schema failure instead of a successful result.

Comment on lines +8 to +17
on:
pull_request:
paths:
- '.github/cursor-review/**'
- '.github/workflows/test-cursor-review-scripts.yml'
push:
branches: [main]
paths:
- '.github/cursor-review/**'
- '.github/workflows/test-cursor-review-scripts.yml'

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀 Performance & Scalability | 🔵 Trivial

Add concurrency to cancel superseded runs.

Without a concurrency group, force-pushes can pile up duplicate test runs for the same branch. Don’t let CI breed like rabbits.

Suggested change
 on:
   pull_request:
@@
   push:
     branches: [main]
     paths:
       - '.github/cursor-review/**'
       - '.github/workflows/test-cursor-review-scripts.yml'
+
+concurrency:
+  group: test-cursor-review-scripts-${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
+  cancel-in-progress: true
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
on:
pull_request:
paths:
- '.github/cursor-review/**'
- '.github/workflows/test-cursor-review-scripts.yml'
push:
branches: [main]
paths:
- '.github/cursor-review/**'
- '.github/workflows/test-cursor-review-scripts.yml'
on:
pull_request:
paths:
- '.github/cursor-review/**'
- '.github/workflows/test-cursor-review-scripts.yml'
push:
branches: [main]
paths:
- '.github/cursor-review/**'
- '.github/workflows/test-cursor-review-scripts.yml'
concurrency:
group: test-cursor-review-scripts-${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true
🧰 Tools
🪛 zizmor (1.26.1)

[warning] 8-17: insufficient job-level concurrency limits (concurrency-limits): workflow is missing concurrency setting

(concurrency-limits)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/test-cursor-review-scripts.yml around lines 8 - 17, Add a
concurrency policy to the workflow so superseded runs are canceled for the same
branch or PR. Update the workflow definition in test-cursor-review-scripts.yml
near the existing on/pull_request and push triggers by introducing a concurrency
group keyed to the ref or PR identity, and enable cancel-in-progress so
force-pushes don’t queue duplicate runs. Use the workflow’s existing trigger
context to choose the group consistently across push and pull_request events.

Source: Linters/SAST tools

Comment on lines +27 to +35
- name: Checkout
uses: actions/checkout@v6
with:
persist-credentials: false

- name: Set up Python
uses: actions/setup-python@v6
with:
python-version: '3.12'

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Pin GitHub Actions to full commit SHAs.

Using floating @v6 tags here leaves the workflow open to upstream retags changing what executes in CI. A little hash makes the risk disappear in a flash.

Suggested change
-        uses: actions/checkout@v6
+        uses: actions/checkout@<full_commit_sha> # v6
         with:
           persist-credentials: false
@@
-        uses: actions/setup-python@v6
+        uses: actions/setup-python@<full_commit_sha> # v6
         with:
           python-version: '3.12'
🧰 Tools
🪛 zizmor (1.26.1)

[error] 28-28: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)


[error] 33-33: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/test-cursor-review-scripts.yml around lines 27 - 35, The
workflow is using floating GitHub Actions tags for Checkout and Set up Python,
which should be pinned to full commit SHAs. Update the uses entries in the
workflow to replace actions/checkout@v6 and actions/setup-python@v6 with their
exact commit hashes so the referenced actions in the checkout/setup steps remain
immutable.

Source: Linters/SAST tools

@mattmillerai mattmillerai merged commit 499fe1d into main Jun 29, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent-coded Authored by the agent-work loop cursor-review Multi-model cursor review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant