feat(statistics): paired-binary + coding-eval estimators (McNemar, risk-diff, Wilson, pass@k) #254

Merged
drewstone merged 1 commit into main from feat/paired-binary-stats on Jun 19, 2026

@drewstone

Contributor

What

Adds the binary-outcome estimators the continuous/paired statistics stack was missing, so any consumer comparing pass/fail rates between two conditions (model A vs B, tool on vs off, prompt v1 vs v2) has the statistically correct tools instead of reaching for the wrong continuous test.

| Function | Use | Why it's needed |
| --- | --- | --- |
| `mcnemar(control, treatment)` | significance of a paired pass/fail change | A paired t-test or two-proportion z-test is wrong for matched binary outcomes: only the discordant pairs carry signal. Exact doubled-binomial tail (correct at the small discordant counts typical of eval runs); continuity-corrected χ² returned for reference. |
| `pairedRiskDifference(control, treatment)` | effect size p(treat) − p(control) = (b−c)/n, plus CI | CI uses the paired (McNemar) variance, not the independent-samples formula (which overstates the interval by ignoring pairing). |
| `wilson(successes, n)` | binomial proportion CI | Correct near 0/1 and at small n, where the Wald interval escapes [0,1]. Previously owned by a downstream consumer because the substrate lacked it; now canonical here. |
| `passAtK(n, c, k)` | unbiased pass@k for code-gen evals | The Chen et al. 2021 estimator 1 − C(n−c,k)/C(n,k) in stable product form; the naive "did any of the first k pass" is biased high at small n. |
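To make the McNemar row concrete, here is a minimal sketch of an exact doubled-binomial test on discordant pairs. It is not the code merged in this PR: the name `mcnemarSketch`, the `McNemarSketch` shape, and the convention that b counts "control failed, treatment passed" (chosen so that (b−c)/n matches p(treat) − p(control)) are assumptions for illustration.

```typescript
// Hypothetical result shape; the real McNemarResult fields are not shown in this PR page.
interface McNemarSketch {
  b: number; // pairs where control failed and treatment passed
  c: number; // pairs where control passed and treatment failed
  pValue: number; // exact two-sided p-value from the doubled binomial tail
  chi2: number; // continuity-corrected chi-square, for reference only
}

function mcnemarSketch(control: boolean[], treatment: boolean[]): McNemarSketch {
  if (control.length !== treatment.length) {
    throw new RangeError("paired arrays must have equal length");
  }
  let b = 0;
  let c = 0;
  for (let i = 0; i < control.length; i++) {
    if (!control[i] && treatment[i]) b++;
    else if (control[i] && !treatment[i]) c++;
    // Concordant pairs (both pass or both fail) carry no signal and are skipped.
  }
  const m = b + c;
  if (m === 0) return { b, c, pValue: 1, chi2: 0 };

  // Lower tail P(X <= min(b, c)) for X ~ Binomial(m, 1/2), summed in log space
  // so that 0.5^m does not underflow at large discordant counts.
  const k = Math.min(b, c);
  const logHalfPow = m * Math.log(0.5);
  let logChoose = 0; // log C(m, 0)
  let tail = 0;
  for (let i = 0; i <= k; i++) {
    tail += Math.exp(logChoose + logHalfPow);
    logChoose += Math.log(m - i) - Math.log(i + 1); // log C(m, i + 1)
  }
  const pValue = Math.min(1, 2 * tail); // doubling can exceed 1 when b is close to c
  const chi2 = (Math.abs(b - c) - 1) ** 2 / m; // Edwards continuity correction
  return { b, c, pValue, chi2 };
}
```

With five pairs that flip from fail to pass and none in the other direction, the tail is 0.5^5 and the doubled p-value is 0.0625: even a clean 5–0 split of discordant pairs is not significant at the 0.05 level.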

All four are exported from the root barrel and the statistics module, with McNemarResult / RiskDifferenceResult / ProportionInterval result types.
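As an illustration of the paired effect size described in the table, the sketch below computes (b−c)/n with a normal-approximation interval built from the paired variance, and an independent-samples standard error for contrast. The names `pairedRiskDifferenceSketch` and `RiskDiffSketch`, the fixed 95% z value, and the exact interval form are assumptions; the merged implementation may construct its CI differently.

```typescript
// Hypothetical result shape; the real RiskDifferenceResult fields are not shown here.
interface RiskDiffSketch {
  diff: number; // p(treatment) - p(control) = (b - c) / n
  se: number; // standard error from the paired variance
  independentSe: number; // SE if pairing were (wrongly) ignored, for contrast
  lo: number;
  hi: number;
}

function pairedRiskDifferenceSketch(
  control: boolean[],
  treatment: boolean[],
  z: number = 1.959963984540054, // assumed two-sided 95% normal quantile
): RiskDiffSketch {
  const n = control.length;
  if (n !== treatment.length) throw new RangeError("paired arrays must have equal length");
  if (n === 0) throw new RangeError("need at least one pair");

  let b = 0; // control failed, treatment passed
  let c = 0; // control passed, treatment failed
  let passControl = 0;
  let passTreatment = 0;
  for (let i = 0; i < n; i++) {
    if (control[i]) passControl++;
    if (treatment[i]) passTreatment++;
    if (!control[i] && treatment[i]) b++;
    else if (control[i] && !treatment[i]) c++;
  }

  const diff = (b - c) / n;
  // Paired variance: [(b + c) - (b - c)^2 / n] / n^2. Only discordant pairs contribute.
  const se = Math.sqrt(Math.max(0, b + c - ((b - c) * (b - c)) / n)) / n;

  // Independent-samples SE, shown only to illustrate what ignoring the pairing does.
  const pc = passControl / n;
  const pt = passTreatment / n;
  const independentSe = Math.sqrt((pc * (1 - pc)) / n + (pt * (1 - pt)) / n);

  return {
    diff,
    se,
    independentSe,
    lo: Math.max(-1, diff - z * se),
    hi: Math.min(1, diff + z * se),
  };
}
```

For ten pairs with b=3 and c=1, the difference is 0.2 and the paired standard error is √3.6 / 10, which is smaller than the independent-samples value for the same data because most pairs agree.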

Verification

  • Cross-checked the compiled functions against independent ground truth: passAtK over a full sweep (n=1..30, every c, every k) against exact-integer math.comb (max diff 2.2e-16); mcnemar against both an exact-integer 2·tail computation and scipy.stats.binomtest, including a b=400, c=350 large-count numerical-stability case; wilson against a canonical reimplementation plus the literature anchor (8/10 → [0.4901, 0.9433]); pairedRiskDifference against a canonical reimplementation.
  • 24 new unit tests pinning reference values, edge cases (n=0, boundary, symmetric discordance, no-discordant-pairs), and throw paths.
  • Full suite green (2310 tests), tsc --noEmit clean, biome clean.
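The passAtK sweep in the first bullet can be approximated in-language as follows. This is a sketch, not the PR's test harness: `passAtKSketch`, `chooseExact`, and `maxSweepDiff` are hypothetical names, and the exact reference uses integer arithmetic in doubles (safe for n ≤ 30) rather than Python's math.comb.

```typescript
// Unbiased pass@k in product form: 1 - prod_{i = n-c+1 .. n} (1 - k / i),
// which equals 1 - C(n-c, k) / C(n, k) without forming large binomials.
function passAtKSketch(n: number, c: number, k: number): number {
  if (!Number.isInteger(n) || !Number.isInteger(c) || !Number.isInteger(k)) {
    throw new RangeError("n, c, k must be integers");
  }
  if (n < 1 || c < 0 || c > n || k < 1 || k > n) {
    throw new RangeError("need 0 <= c <= n and 1 <= k <= n");
  }
  if (n - c < k) return 1; // fewer than k failures: every k-subset contains a pass
  let allFail = 1;
  for (let i = n - c + 1; i <= n; i++) allFail *= 1 - k / i;
  return 1 - allFail;
}

// Exact C(n, k) for small n. Each step is an exact integer, and all
// intermediates stay far below 2^53 when n <= 30.
function chooseExact(n: number, k: number): number {
  if (k < 0 || k > n) return 0;
  let r = 1;
  for (let i = 1; i <= k; i++) r = (r * (n - k + i)) / i;
  return r;
}

// Largest absolute gap between the product form and the closed form over the sweep.
function maxSweepDiff(maxN: number): number {
  let worst = 0;
  for (let n = 1; n <= maxN; n++) {
    for (let c = 0; c <= n; c++) {
      for (let k = 1; k <= n; k++) {
        const reference = 1 - chooseExact(n - c, k) / chooseExact(n, k);
        worst = Math.max(worst, Math.abs(passAtKSketch(n, c, k) - reference));
      }
    }
  }
  return worst;
}

const sweepGap = maxSweepDiff(30); // expected to be at floating-point rounding level
```

The product form is preferred for larger n because it never materializes C(n, k), which overflows double precision long before the ratio itself becomes hard to represent.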

Notes

  • Exact binomial tails reuse the in-module lnGamma (log-space sum → stable at large discordant counts); z-quantiles reuse zQuantile. No new dependencies.
  • Avoiding the Wald (normal-approximation) forms is deliberate: these estimators are exact or Wilson-based because the normal approximation is unsafe for proportions near 0/1 and for small discordant counts, which is the regime eval A/Bs live in.
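A small sketch of why the Wald interval is avoided for proportions near the boundary. `wilsonSketch`, `waldSketch`, `IntervalSketch`, the fixed 95% z value, and the choice to return [0, 1] at n=0 are all assumptions for illustration; the merged `wilson` may handle edge cases differently.

```typescript
// Hypothetical interval shape; the real ProportionInterval fields are not shown here.
interface IntervalSketch {
  lo: number;
  hi: number;
}

const Z95 = 1.959963984540054; // assumed two-sided 95% normal quantile

// Wilson score interval for a binomial proportion.
function wilsonSketch(successes: number, n: number, z: number = Z95): IntervalSketch {
  if (n < 0 || successes < 0 || successes > n) {
    throw new RangeError("need 0 <= successes <= n");
  }
  if (n === 0) return { lo: 0, hi: 1 }; // assumption: no data gives the vacuous interval
  const p = successes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z / denom) * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n));
  return { lo: Math.max(0, center - half), hi: Math.min(1, center + half) };
}

// Wald interval, deliberately left unclamped to show how it can leave [0, 1].
function waldSketch(successes: number, n: number, z: number = Z95): IntervalSketch {
  if (n <= 0 || successes < 0 || successes > n) {
    throw new RangeError("need n > 0 and 0 <= successes <= n");
  }
  const p = successes / n;
  const half = z * Math.sqrt((p * (1 - p)) / n);
  return { lo: p - half, hi: p + half };
}

const wilsonEightOfTen = wilsonSketch(8, 10); // roughly [0.490, 0.943]
const waldNineOfTen = waldSketch(9, 10); // upper bound exceeds 1
const wilsonNineOfTen = wilsonSketch(9, 10); // stays inside [0, 1]
```

At 9/10 the Wald upper bound is above 1, and at 10/10 it collapses to a zero-width interval, while the Wilson interval remains inside [0, 1] with nonzero width in both cases.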

…sk-diff, Wilson, pass@k)

Add the binary-outcome estimators the continuous/paired stack was missing, so any consumer comparing pass/fail RATES between two conditions has the correct tools:

- mcnemar — exact paired-binary significance (binomial sign test on discordant pairs). A paired t-test / two-proportion z-test is wrong for matched binary outcomes; only discordant pairs carry signal. Exact doubled-binomial tail (correct at small discordant counts), with the continuity-corrected chi-square returned for reference.
- pairedRiskDifference — the paired effect size p(treatment)-p(control)=(b-c)/n with a CI from the paired (McNemar) variance, not the independent-samples formula.
- wilson — binomial proportion CI, correct near 0/1 and at small n where the Wald interval escapes [0,1]. Previously owned by a downstream consumer because the substrate lacked it; now canonical here.
- passAtK — the unbiased Chen et al. 2021 pass@k estimator for code-generation evals (1 - C(n-c,k)/C(n,k), stable product form).

Verified against exact-integer (math.comb) and scipy.binomtest ground truth across a full sweep (passAtK n=1..30 over all c,k matched to 2e-16; McNemar incl. b=400,c=350 large-count stability vs scipy). 24 new tests; full suite 2310 green; typecheck + biome clean.

@tangletools tangletools left a comment

Contributor

✅ Auto-approved PR — 3d82d262

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-19T14:31:19Z

@drewstone drewstone merged commit 34b2d55 into main Jun 19, 2026
1 check passed
drewstone added a commit that referenced this pull request Jun 19, 2026
…hardening (#258)

Version-only bump of the trio (package.json + python pyproject + __init__).
Ships everything merged since 0.93.0:
- #253 data-plane reliability/concurrency/scale hardening
- #254 paired-binary + coding-eval estimators (McNemar, risk-diff, Wilson, pass@k)
- #255 trace-store append/load race + collision-free rollover
- #256 McNemar power + required-N
- #257 single source of truth for span-timestamp parsing
2 participants