fix(cursor-review): harden judge against parse_error (lenient extraction + retry + panel fallback) (BE-1916)#17
Conversation
…ion + retry + panel fallback) (BE-1916)
📝 WalkthroughWalkthroughThe PR hardens the cursor-review AI code-review pipeline against malformed LLM judge output. ChangesRobust Judge Output Parsing and Degraded Fallback
🚥 Pre-merge checks | ✅ 2✅ Passed checks (2 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
✨ Simplify code
Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/cursor-review/extract-findings.py:
- Around line 76-77: The scanning logic in the bracket parser is incorrectly
advancing past the rest of the input when it encounters an unterminated opener,
which can hide a later valid JSON payload. Update the loop in the extraction
routine that uses the `i = j + 1` resume step so an unbalanced `[` or `{` only
skips the bad region and continues scanning from the next viable position
instead of jumping to EOF. Keep the fix localized to the parser’s main scan path
so later valid content can still be detected.
- Around line 120-124: The wrapper-unwrapping logic in coerce_findings_list() is
too permissive because it accepts any list under keys like results, items, or
reviews, which lets malformed payloads through as status="ok". Tighten the
validation in coerce_findings_list() so it only returns lists whose contents
match the expected findings schema, and otherwise let main() treat the payload
as a parse/schema failure instead of a successful result.
In @.github/workflows/test-cursor-review-scripts.yml:
- Around line 27-35: The workflow is using floating GitHub Actions tags for
Checkout and Set up Python, which should be pinned to full commit SHAs. Update
the uses entries in the workflow to replace actions/checkout@v6 and
actions/setup-python@v6 with their exact commit hashes so the referenced actions
in the checkout/setup steps remain immutable.
- Around line 8-17: Add a concurrency policy to the workflow so superseded runs
are canceled for the same branch or PR. Update the workflow definition in
test-cursor-review-scripts.yml near the existing on/pull_request and push
triggers by introducing a concurrency group keyed to the ref or PR identity, and
enable cancel-in-progress so force-pushes don’t queue duplicate runs. Use the
workflow’s existing trigger context to choose the group consistently across push
and pull_request events.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 3ab20d41-e367-44b2-8e29-e41b9d9f9e56
📒 Files selected for processing (6)
.github/cursor-review/extract-findings.py.github/cursor-review/post-review.py.github/cursor-review/prompt-judge.md.github/cursor-review/tests/test_extract_findings.py.github/workflows/cursor-review.yml.github/workflows/test-cursor-review-scripts.yml
| # Resume scanning after this region (or after an unbalanced opener). | ||
| i = j + 1 |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Don’t stop scanning after an unterminated opener.
If the prose contains a stray [ or { that never balances, Line 77 jumps i to EOF and the parser never reaches a later valid JSON payload. One bad bracket, big racket: a response like note [draft...\n[{"file":...}] still ends up as parse_error.
Proposed fix
- # Resume scanning after this region (or after an unbalanced opener).
- i = j + 1
+ # Resume after a closed region; otherwise advance one char so later JSON
+ # candidates are still reachable.
+ if j < n and depth == 0:
+ i = j + 1
+ else:
+ i += 1📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Resume scanning after this region (or after an unbalanced opener). | |
| i = j + 1 | |
| # Resume after a closed region; otherwise advance one char so later JSON | |
| # candidates are still reachable. | |
| if j < n and depth == 0: | |
| i = j + 1 | |
| else: | |
| i += 1 |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.github/cursor-review/extract-findings.py around lines 76 - 77, The scanning
logic in the bracket parser is incorrectly advancing past the rest of the input
when it encounters an unterminated opener, which can hide a later valid JSON
payload. Update the loop in the extraction routine that uses the `i = j + 1`
resume step so an unbalanced `[` or `{` only skips the bad region and continues
scanning from the next viable position instead of jumping to EOF. Keep the fix
localized to the parser’s main scan path so later valid content can still be
detected.
| if isinstance(parsed, dict): | ||
| for key in ("findings", "results", "items", "reviews"): | ||
| value = parsed.get(key) | ||
| if isinstance(value, list): | ||
| return value |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win
Validate wrapper contents before returning status="ok".
coerce_findings_list() unwraps any list under results, items, or reviews, so payloads like {"results":["oops"]} are now treated as successful findings. That breaks the downstream contract in main() and can push malformed data into the judge/post-review path instead of surfacing a parse/schema failure.
Proposed fix
+def _is_findings_list(value):
+ required = {"file", "line", "side", "severity", "body"}
+ return isinstance(value, list) and all(
+ isinstance(item, dict) and required.issubset(item) for item in value
+ )
+
def coerce_findings_list(parsed):
@@
- if isinstance(parsed, list):
+ if _is_findings_list(parsed):
return parsed
if isinstance(parsed, dict):
for key in ("findings", "results", "items", "reviews"):
value = parsed.get(key)
- if isinstance(value, list):
+ if _is_findings_list(value):
return value
return None📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if isinstance(parsed, dict): | |
| for key in ("findings", "results", "items", "reviews"): | |
| value = parsed.get(key) | |
| if isinstance(value, list): | |
| return value | |
| def _is_findings_list(value): | |
| required = {"file", "line", "side", "severity", "body"} | |
| return isinstance(value, list) and all( | |
| isinstance(item, dict) and required.issubset(item) for item in value | |
| ) | |
| if _is_findings_list(parsed): | |
| return parsed | |
| if isinstance(parsed, dict): | |
| for key in ("findings", "results", "items", "reviews"): | |
| value = parsed.get(key) | |
| if _is_findings_list(value): | |
| return value |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.github/cursor-review/extract-findings.py around lines 120 - 124, The
wrapper-unwrapping logic in coerce_findings_list() is too permissive because it
accepts any list under keys like results, items, or reviews, which lets
malformed payloads through as status="ok". Tighten the validation in
coerce_findings_list() so it only returns lists whose contents match the
expected findings schema, and otherwise let main() treat the payload as a
parse/schema failure instead of a successful result.
| on: | ||
| pull_request: | ||
| paths: | ||
| - '.github/cursor-review/**' | ||
| - '.github/workflows/test-cursor-review-scripts.yml' | ||
| push: | ||
| branches: [main] | ||
| paths: | ||
| - '.github/cursor-review/**' | ||
| - '.github/workflows/test-cursor-review-scripts.yml' |
There was a problem hiding this comment.
🚀 Performance & Scalability | 🔵 Trivial
Add concurrency to cancel superseded runs.
Without a concurrency group, force-pushes can pile up duplicate test runs for the same branch. Don’t let CI breed like rabbits.
Suggested change
on:
pull_request:
@@
push:
branches: [main]
paths:
- '.github/cursor-review/**'
- '.github/workflows/test-cursor-review-scripts.yml'
+
+concurrency:
+ group: test-cursor-review-scripts-${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
+ cancel-in-progress: true📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| on: | |
| pull_request: | |
| paths: | |
| - '.github/cursor-review/**' | |
| - '.github/workflows/test-cursor-review-scripts.yml' | |
| push: | |
| branches: [main] | |
| paths: | |
| - '.github/cursor-review/**' | |
| - '.github/workflows/test-cursor-review-scripts.yml' | |
| on: | |
| pull_request: | |
| paths: | |
| - '.github/cursor-review/**' | |
| - '.github/workflows/test-cursor-review-scripts.yml' | |
| push: | |
| branches: [main] | |
| paths: | |
| - '.github/cursor-review/**' | |
| - '.github/workflows/test-cursor-review-scripts.yml' | |
| concurrency: | |
| group: test-cursor-review-scripts-${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }} | |
| cancel-in-progress: true |
🧰 Tools
🪛 zizmor (1.26.1)
[warning] 8-17: insufficient job-level concurrency limits (concurrency-limits): workflow is missing concurrency setting
(concurrency-limits)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.github/workflows/test-cursor-review-scripts.yml around lines 8 - 17, Add a
concurrency policy to the workflow so superseded runs are canceled for the same
branch or PR. Update the workflow definition in test-cursor-review-scripts.yml
near the existing on/pull_request and push triggers by introducing a concurrency
group keyed to the ref or PR identity, and enable cancel-in-progress so
force-pushes don’t queue duplicate runs. Use the workflow’s existing trigger
context to choose the group consistently across push and pull_request events.
Source: Linters/SAST tools
| - name: Checkout | ||
| uses: actions/checkout@v6 | ||
| with: | ||
| persist-credentials: false | ||
|
|
||
| - name: Set up Python | ||
| uses: actions/setup-python@v6 | ||
| with: | ||
| python-version: '3.12' |
There was a problem hiding this comment.
🔒 Security & Privacy | 🟠 Major | ⚡ Quick win
Pin GitHub Actions to full commit SHAs.
Using floating @v6 tags here leaves the workflow open to upstream retags changing what executes in CI. A little hash makes the risk disappear in a flash.
Suggested change
- uses: actions/checkout@v6
+ uses: actions/checkout@<full_commit_sha> # v6
with:
persist-credentials: false
@@
- uses: actions/setup-python@v6
+ uses: actions/setup-python@<full_commit_sha> # v6
with:
python-version: '3.12'🧰 Tools
🪛 zizmor (1.26.1)
[error] 28-28: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)
(unpinned-uses)
[error] 33-33: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)
(unpinned-uses)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.github/workflows/test-cursor-review-scripts.yml around lines 27 - 35, The
workflow is using floating GitHub Actions tags for Checkout and Set up Python,
which should be pinned to full commit SHAs. Update the uses entries in the
workflow to replace actions/checkout@v6 and actions/setup-python@v6 with their
exact commit hashes so the referenced actions in the checkout/setup steps remain
immutable.
Source: Linters/SAST tools
ELI-5
Sometimes the "judge" model that combines all the reviewers' notes answers in
plain English ("Based on my review, I think…") and then pastes the JSON we
need. Our old parser only knew a couple of rigid ways to find that JSON, so it
choked on the prose, threw everything away, and the whole PR review showed up as
a scary "Review failed" — even when the judge's answer was actually correct
(this happened on ComfyUI #487 and Alex's PR). This PR teaches the pipeline to
(1) dig the JSON out of the prose, (2) politely ask the judge once more if it
still can't, and (3) if even that fails, show the raw reviewer notes instead of
a blank failure.
What changed
Three layers, cheapest first, all in the cursor-review judge path:
Lenient extraction (
extract-findings.py) —parse_json_findingsnowrecovers JSON wrapped in prose via a string/escape-aware balanced
brace-scan for the first
{…}/[…]region that actually parses, insteadof the old naive first-
[/last-]slice (which broke the moment the prosecontained a stray bracket — exactly the #487 shape). New
coerce_findings_listalso unwraps a
{"findings": […]}-style object. The fast path (clean JSON)and the fenced-
jsonpath are unchanged.Retry once (
cursor-review.yml→ Run judge) — after the judge runs weprobe-parse its output; on a hard parse failure we re-run the judge exactly
once with a terse "Return ONLY the JSON array, no prose" reminder appended.
No loop — one retry, then we stop.
Tighten the judge prompt (
prompt-judge.md) — a bare-JSON-array demand isnow the first line of the prompt.
Plus the degraded fallback: if the judge still can't be parsed after the
retry and the panel produced findings, we post the raw, de-duplicated per-cell
findings under a warning banner — never a bare "Review failed" when there
are real findings to show. A genuine total failure (no panel findings either)
still surfaces the honest error review.
Acceptance criteria
parse_json_findings+ balanced scan; regression-tested..github/cursor-review/tests/test_extract_findings.py(test_prose_wrapped_array_487+ 13 more).Testing
python3 -m unittest discover -s .github/cursor-review/tests→ 14 tests, all green. New CI workflowtest-cursor-review-scripts.ymlruns them on any change under.github/cursor-review/**.post-review.py --noticeagainst a stubgh: the degraded path posts a real review with inline comments + banner, not "Review failed".Judgment calls / not done
enabled. The judge is invoked through the
cursor-agentCLI, which has noreliable per-model knob to force the findings array shape;
--output-format jsonwraps stdout in cursor-agent's own envelope, which would break theexisting parser. Flipping it blind is a net regression risk, so I relied on the
prompt tightening + the strict-JSON retry (which is structured-output pressure
at the prompt level) + the lenient parser. The acceptance criteria are met
without it. Happy to revisit if a confirmed JSON-mode flag exists.
standalone valid JSON array of scalars before the real findings array, the
scanner could pick the wrong one. This is contrived (covered case: prose with a
non-JSON
[…]recovers correctly, tested), strictly better than the oldbehavior, and backstopped by the retry + prompt. Not over-engineered.