
Triage gate: classify complexity before dispatch, reject or sharpen specs instead of failing on ambiguous issues #46

Description

@khaliqgant

Why now

Two independent prospect calls (Nicole Turnage @ Apricot, 2026-06-23; Nango / Marcin, 2026-06-25) hit the same failure mode with factory-shaped products:

Nicole: a zero-touch Linear → PR pipeline produced "a giant plate of spaghetti" on an under-specified ticket. The agent made assumptions the spec didn't cover, didn't ask for clarification, and shipped a PR that missed three branches of a five-step flow.

Marcin (verbatim):

"We started from, can we go from linear to a PR? Which is, like, low hanging fruit. But it is so complicated. A lot of the stuff is so complicated that eventually an engineer has to look. The prompting of an issue is the easy part. It takes 10% of the time, then the rest is 90%."

He then proposed the fix, also verbatim:

"It would be useful where at the point of creating linear issues, it could figure out how complicated a solution to this is. And suggest that I can just do this now. Like, if it just basically finds out that hey, this is pretty easy, then it triggers a run to give you a PR. Unless there are some things here that we might not be happy about, so let's not do it."

This is two-of-two confirmation that bare factory hits a ceiling on complex codebases. Triage today decides how to do work; it doesn't decide whether to try.

What we already have

  • src/triage/heuristic.ts — routing + scope (single / workflow / team) + thin-description detection (140-char threshold)
  • src/triage/llm.ts — LLM-backed triage decisions
  • src/triage/tiered.ts — tiered escalation
  • ClarificationRecord in src/state/store.ts — "wait for human in Slack thread" state
  • Slack mid-task clarification: when an agent gets stuck, it can ask in the thread and pause until answered

What that flow gets right: the agent CAN ask. What it gets wrong for the cases above: clarification happens too late — after dispatch, deep into the work, when half a PR is already wrong and the agent doesn't know enough to ask the right questions. Nicole's engineer let the agent run past every gate. Marcin's bare "linear → PR" run dispatched issues that no amount of mid-task clarification could rescue.

What to build

A pre-dispatch triage gate that runs before any coding agent is spawned. It reads the issue, the linked design/spec/screenshots, and a quick scan of the affected files, then outputs one of:

  • ship → dispatch to factory normally
  • needs-spec → don't dispatch. Post specific questions to the Slack thread (or Linear comments) describing what's ambiguous. Wait for answers. Re-evaluate.
  • too-complex → don't dispatch. Post a structured "why this isn't suitable for autonomy" message. Suggest splitting, scope reduction, or human pickup.

The decision must be visible: why an issue was classified the way it was, and which specific signals tipped it.
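To make "visible" concrete, here is a minimal sketch of what a gate decision and its human-readable rendering could look like. Only the three verdict strings come from this issue; `SignalNote`, `GateDecision`, and `renderDecision` are hypothetical names, and the real shape should follow whatever TriageDecision already looks like.

```typescript
// Hypothetical shapes for illustration; only the verdict strings are from the issue.
type Verdict = 'ship' | 'needs-spec' | 'too-complex';

interface SignalNote {
  signal: string;  // e.g. "acceptance-criteria"
  finding: string; // human-readable observation
  pushes: Verdict; // which verdict this signal argues for
}

interface GateDecision {
  verdict: Verdict;
  notes: SignalNote[];
  questions: string[]; // only rendered for needs-spec
}

// Turn a decision into the message posted to Slack or Linear.
function renderDecision(d: GateDecision): string {
  const lines: string[] = [`Triage gate verdict: ${d.verdict}`];
  for (const n of d.notes) {
    lines.push(`- [${n.signal}] ${n.finding} (argues for ${n.pushes})`);
  }
  if (d.verdict === 'needs-spec' && d.questions.length > 0) {
    lines.push('Open questions:');
    d.questions.forEach((q, i) => lines.push(`${i + 1}. ${q}`));
  }
  return lines.join('\n');
}
```

Keeping the per-signal notes in the same object as the verdict means the posted message and the stored record cannot drift apart.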

Scope — concrete work

Phase 1: new triage/complexity.ts engine alongside existing engines

  • New decision type ComplexityVerdict = 'ship' | 'needs-spec' | 'too-complex' extending TriageDecision
  • LLM-backed classifier with structured-output schema (likely re-use the patterns in llm.ts)
  • Signals to weigh:
    • Description quality (length is a weak proxy; semantic completeness is better)
    • Linked design/Figma/spec presence
    • Number of files likely touched (route-detection already exists in heuristic.ts)
    • Cross-surface changes (UI + backend + state = higher risk)
    • Edge case enumeration in the description ("what about X" / "if Y then Z" patterns = good sign)
    • Acceptance criteria specificity
    • Whether the codebase has tests covering the touched paths
  • Output includes per-signal reasoning so failures are debuggable
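As a rough illustration of "per-signal reasoning", the sketch below combines a few of the signals above into a verdict and records why each signal passed or failed. It is a deterministic stand-in, not the LLM-backed classifier: the field names, the thresholds (`MAX_FILES`, `MAX_SURFACES`), and the combination rule are all assumptions to be replaced by calibration.

```typescript
type ComplexityVerdict = 'ship' | 'needs-spec' | 'too-complex';

// Hypothetical signal inputs; in practice these would come from the LLM or heuristic.ts.
interface IssueSignals {
  hasAcceptanceCriteria: boolean;
  enumeratesEdgeCases: boolean;
  hasLinkedDesign: boolean;
  estimatedFilesTouched: number;
  surfacesTouched: number; // UI, backend, state, ...
  touchedPathsHaveTests: boolean;
}

interface SignalReason { signal: keyof IssueSignals; ok: boolean; reason: string; }
interface ComplexityResult { verdict: ComplexityVerdict; reasons: SignalReason[]; }

// Illustrative thresholds only (assumptions, not calibrated values).
const MAX_FILES = 15;
const MAX_SURFACES = 2;

function classify(s: IssueSignals): ComplexityResult {
  const reasons: SignalReason[] = [
    { signal: 'hasAcceptanceCriteria', ok: s.hasAcceptanceCriteria,
      reason: s.hasAcceptanceCriteria ? 'acceptance criteria present' : 'no acceptance criteria' },
    { signal: 'enumeratesEdgeCases', ok: s.enumeratesEdgeCases,
      reason: s.enumeratesEdgeCases ? 'edge cases enumerated' : 'no edge cases mentioned' },
    { signal: 'hasLinkedDesign', ok: s.hasLinkedDesign,
      reason: s.hasLinkedDesign ? 'design or spec linked' : 'no linked design or spec' },
    { signal: 'estimatedFilesTouched', ok: s.estimatedFilesTouched <= MAX_FILES,
      reason: `${s.estimatedFilesTouched} files likely touched (limit ${MAX_FILES})` },
    { signal: 'surfacesTouched', ok: s.surfacesTouched <= MAX_SURFACES,
      reason: `${s.surfacesTouched} surfaces touched (limit ${MAX_SURFACES})` },
    { signal: 'touchedPathsHaveTests', ok: s.touchedPathsHaveTests,
      reason: s.touchedPathsHaveTests ? 'touched paths have tests' : 'touched paths lack tests' },
  ];
  const failed = new Set(reasons.filter((r) => !r.ok).map((r) => r.signal));

  // Size problems win over spec problems: more spec does not shrink the change.
  const tooBig = failed.has('estimatedFilesTouched') ||
    (failed.has('surfacesTouched') && failed.has('touchedPathsHaveTests'));
  const underSpecified = failed.has('hasAcceptanceCriteria') ||
    (failed.has('enumeratesEdgeCases') && failed.has('hasLinkedDesign'));

  const verdict: ComplexityVerdict = tooBig ? 'too-complex' : underSpecified ? 'needs-spec' : 'ship';
  return { verdict, reasons };
}
```

Every signal produces a reason whether or not it failed, so a surprising `ship` is as debuggable as a surprising refusal.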

Phase 2: wire it into the dispatch pipeline as a gate

  • In orchestrator/factory.ts (or wherever triage feeds dispatch): if verdict ≠ ship, don't dispatch
  • For needs-spec: post structured Slack thread / Linear comment with specific questions; persist a WaitingSpecRecord (mirrors ClarificationRecord); re-trigger when the issue is updated
  • For too-complex: post the rationale, label the issue (needs-human or similar), exit cleanly
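One way to keep the gate easy to test is to have it return a plan of side effects rather than perform them. The sketch below assumes that design; `GateResult`, `GateAction`, and `planGateActions` are hypothetical names, and the `WaitingSpecRecord` fields are guesses since the shape of ClarificationRecord is not shown here. The orchestrator would execute the returned actions against Slack, Linear, and the state store.

```typescript
type ComplexityVerdict = 'ship' | 'needs-spec' | 'too-complex';

interface GateResult {
  verdict: ComplexityVerdict;
  questions: string[]; // used for needs-spec
  rationale: string;   // used for too-complex
}

// Assumed fields; the real record should mirror ClarificationRecord.
interface WaitingSpecRecord { issueId: string; questions: string[]; askedAt: string; }

type GateAction =
  | { kind: 'dispatch'; issueId: string }
  | { kind: 'post-comment'; issueId: string; body: string }
  | { kind: 'save-waiting-spec'; record: WaitingSpecRecord }
  | { kind: 'add-label'; issueId: string; label: string };

// Pure function: decides what should happen, performs no I/O.
function planGateActions(issueId: string, result: GateResult, now: Date): GateAction[] {
  if (result.verdict === 'ship') {
    return [{ kind: 'dispatch', issueId }];
  }
  if (result.verdict === 'needs-spec') {
    const body = [
      'Before this can be dispatched, please clarify:',
      ...result.questions.map((q, i) => `${i + 1}. ${q}`),
    ].join('\n');
    return [
      { kind: 'post-comment', issueId, body },
      { kind: 'save-waiting-spec',
        record: { issueId, questions: result.questions, askedAt: now.toISOString() } },
    ];
  }
  // too-complex: explain, label for human pickup, and stop.
  return [
    { kind: 'post-comment', issueId, body: `Not dispatching autonomously: ${result.rationale}` },
    { kind: 'add-label', issueId, label: 'needs-human' },
  ];
}
```

Because only the `ship` branch ever emits a `dispatch` action, "if verdict ≠ ship, don't dispatch" becomes a property that a unit test can assert directly.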

Phase 3: feedback loop

  • Track outcomes: for each verdict (ship, needs-spec, too-complex), what actually happened to the PR? Merged clean / merged-with-fixes / closed-without-merge / human-took-over?
  • Per-repo calibration — the threshold for "too complex" in a small startup repo ≠ enterprise monorepo
  • Surface metrics: % of issues gated, distribution of verdicts, downstream merge rate by verdict
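The metrics above could be computed from a flat log of gate decisions and their eventual outcomes. The sketch below assumes a hypothetical `GateOutcome` record and counts both "merged clean" and "merged-with-fixes" as merged; issues with no outcome yet are excluded from the merge rate. Per-repo calibration would run the same summary over records filtered by `repo`.

```typescript
type Verdict = 'ship' | 'needs-spec' | 'too-complex';
type Outcome = 'merged-clean' | 'merged-with-fixes' | 'closed-without-merge' | 'human-took-over';

// Hypothetical log record; outcome is undefined while the issue is still open.
interface GateOutcome { repo: string; verdict: Verdict; outcome?: Outcome; }

interface GateMetrics {
  total: number;
  gatedPct: number; // % of issues the gate did not dispatch
  counts: Record<Verdict, number>;
  mergeRate: Record<Verdict, number | null>; // null when no resolved issues
}

function summarize(records: GateOutcome[]): GateMetrics {
  const verdicts: Verdict[] = ['ship', 'needs-spec', 'too-complex'];
  const counts: Record<Verdict, number> = { ship: 0, 'needs-spec': 0, 'too-complex': 0 };
  const mergeRate: Record<Verdict, number | null> =
    { ship: null, 'needs-spec': null, 'too-complex': null };

  for (const r of records) counts[r.verdict] += 1;

  for (const v of verdicts) {
    const resolved = records.filter((r) => r.verdict === v && r.outcome !== undefined);
    if (resolved.length > 0) {
      const merged = resolved.filter(
        (r) => r.outcome === 'merged-clean' || r.outcome === 'merged-with-fixes',
      ).length;
      mergeRate[v] = merged / resolved.length;
    }
  }

  const gated = records.length - counts.ship;
  const gatedPct = records.length === 0 ? 0 : (100 * gated) / records.length;
  return { total: records.length, gatedPct, counts, mergeRate };
}
```

A low merge rate under `ship` suggests the gate is too permissive for that repo, while a high merge rate on issues that humans pushed through after `too-complex` suggests it is too conservative.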

Phase 4 (stretch): proactive issue-creation gate

Marcin's framing was "at the point of creating linear issues." So beyond gating dispatch, optionally run the same classifier when an issue is CREATED:

  • Linear/GitHub webhook on issue creation
  • Inline comment from factory: "this looks like a 2-hour autonomous task — want me to handle it?" or "this needs more scoping before we can dispatch — here's what's missing"
  • Author can address the gaps and re-trigger evaluation

Acceptance / success

Measured outcomes after rollout on a real backlog:

  • Merge-rate of dispatched PRs goes up (because we stopped dispatching the doomed ones)
  • Time-to-clarification goes down (questions asked at the right moment, not 30 minutes into a wrong implementation)
  • "Plate of spaghetti" rate goes to ~0 — PRs that miss obvious requirements should be caught at the gate

Demonstrable to a prospect (Marcin specifically): show a complex issue → gate refuses → posts specific questions → after a human edit, gate accepts → factory dispatches → clean PR. The contrast vs bare factory is the demo.

Non-goals

  • Not replacing the existing heuristic/LLM triage engines — this is a NEW gate that runs after current triage classifies scope/routes
  • Not making the gate so conservative it refuses everything (defeats the purpose)
  • Not requiring perfect classification on day 1 — calibrate against real outcomes, learn

Open questions

  1. Where does the gate live in the pipeline? Most natural is after current triage (we know scope/routes) but before agent dispatch. Confirm with the orchestrator flow.
  2. What signals matter most? Description length is weak. Specific signals likely matter more (acceptance criteria, edge-case enumeration, linked design). Need real-data calibration.
  3. Does the classifier need codebase context? Routes already point at files — should the classifier read sample files via relayfile to assess complexity? Probably yes for v2, no for v1.
  4. Slack vs Linear for clarification questions? Slack is current pattern but Linear-native comments are where issue authors live. Probably both, configurable per repo.
  5. How does this interact with the existing ClarificationRecord mid-task flow? Two layers (pre-dispatch + mid-task) both legitimate, but worth thinking about whether mid-task clarification effectively becomes the failure case ("gate said ship, agent still hit ambiguity").

Related

  • Customer signal: sales/nicole-turnage/transcript-06-23-26.txt (Nicole / Apricot — zero-touch failure)
  • Customer signal: sales/nango/transcript-06-25-26.txt (Marcin — verbatim spec for this feature)
  • Cross-portfolio Push-vs-Pull doc: Notion "Push vs. Pull" page (gap labeled "Triage gate")
  • Existing triage: src/triage/heuristic.ts, src/triage/llm.ts, src/triage/tiered.ts
  • Existing clarification primitive: ClarificationRecord in src/state/store.ts
