feat(statistics): paired-binary + coding-eval estimators (McNemar, risk-diff, Wilson, pass@k) by drewstone · Pull Request #254 · tangle-network/agent-eval

drewstone · 2026-06-19T14:31:10Z

What

Adds the binary-outcome estimators the continuous/paired statistics stack was missing, so any consumer comparing pass/fail rates between two conditions (model A vs B, tool on vs off, prompt v1 vs v2) has the statistically correct tools instead of reaching for the wrong continuous test.

| Function | Use | Why it's needed |
| --- | --- | --- |
| `mcnemar(control, treatment)` | significance of a paired pass/fail change | A paired t-test / two-proportion z-test is wrong for matched binary outcomes: only the discordant pairs carry signal. Exact doubled-binomial tail (correct at the small discordant counts typical of eval runs); continuity-corrected χ² returned for reference. |
| `pairedRiskDifference(control, treatment)` | effect size p(treat) − p(control) = (b−c)/n + CI | CI uses the paired (McNemar) variance, not the independent-samples formula (which typically overstates the interval width because it ignores the positive correlation between paired outcomes). |
| `wilson(successes, n)` | binomial proportion CI | Correct near 0/1 and at small n where the Wald interval escapes [0,1]. Previously owned by a downstream consumer because the substrate lacked it; now canonical here. |
| `passAtK(n, c, k)` | unbiased pass@k for code-gen evals | The Chen et al. 2021 estimator `1 − C(n−c,k)/C(n,k)` in stable product form; the naive "did any of the first k pass" is biased high at small n. |
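To make the `mcnemar` row concrete, here is a minimal sketch of an exact McNemar test on paired pass/fail arrays. It is not the code from this PR: the name `mcnemarSketch`, the result shape, and the convention that `b` counts "control failed, treatment passed" (chosen so that (b−c)/n matches p(treat) − p(control)) are assumptions made for illustration.

```typescript
// Illustrative sketch only. The function name and result fields are
// assumptions, not the library's actual McNemarResult.
interface McNemarSketchResult {
  b: number; // control failed, treatment passed
  c: number; // control passed, treatment failed
  pExact: number; // two-sided exact p-value (doubled binomial tail)
  chi2Corrected: number; // continuity-corrected chi-square, for reference
}

function mcnemarSketch(control: boolean[], treatment: boolean[]): McNemarSketchResult {
  if (control.length !== treatment.length) {
    throw new Error("control and treatment must have the same length");
  }
  let b = 0;
  let c = 0;
  for (let i = 0; i < control.length; i++) {
    const ctl = control[i] === true;
    const trt = treatment[i] === true;
    if (!ctl && trt) b++;
    else if (ctl && !trt) c++;
    // Concordant pairs (both pass or both fail) carry no signal and are skipped.
  }
  const m = b + c;
  if (m === 0) {
    // No discordant pairs: no evidence of a difference either way.
    return { b, c, pExact: 1, chi2Corrected: 0 };
  }
  // Under H0 each discordant pair is a fair coin flip, so X ~ Binomial(m, 1/2).
  // Sum P(X = i) for i = 0..min(b, c), then double for a two-sided test.
  // Direct summation is fine for small m; 0.5^m underflows for very large m,
  // where a log-space sum would be needed instead.
  const k = Math.min(b, c);
  let term = Math.pow(0.5, m); // P(X = 0)
  let tail = term;
  for (let i = 1; i <= k; i++) {
    term *= (m - i + 1) / i; // P(X = i) from P(X = i - 1)
    tail += term;
  }
  const pExact = Math.min(1, 2 * tail);
  const shifted = Math.max(0, Math.abs(b - c) - 1);
  const chi2Corrected = (shifted * shifted) / m;
  return { b, c, pExact, chi2Corrected };
}
```

With one pair flipping from fail to pass and five flipping from pass to fail, only those six discordant pairs enter the test, however many concordant pairs surround them.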

All four are exported from the root barrel and the statistics module, with McNemarResult / RiskDifferenceResult / ProportionInterval result types.

Verification

Cross-checked the compiled functions against independent ground truth: passAtK over a full sweep (n=1..30, every c, every k) vs exact-integer math.comb (max diff 2.2e-16); mcnemar vs both exact-integer 2·tail and scipy.stats.binomtest including a b=400, c=350 large-count numerical-stability case; wilson vs the canonical reimpl + the literature anchor (8/10 → [0.4901, 0.9433]); pairedRiskDifference vs canonical reimpl.
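The pass@k cross-check can be approximated in a few lines. The sketch below is an assumption-laden stand-in, not the PR's test harness: `passAtKSketch` is a hypothetical reimplementation of the product form, and `combExact` replaces Python's `math.comb` with plain double arithmetic, which stays exact here only because every binomial coefficient for n ≤ 30 is far below 2^53.

```typescript
// Hypothetical reimplementation of the product form
// 1 - prod_{i = n-c+1..n} (1 - k / i), which equals 1 - C(n-c, k) / C(n, k).
function passAtKSketch(n: number, c: number, k: number): number {
  if (![n, c, k].every(Number.isInteger) || n < 1 || c < 0 || c > n || k < 1 || k > n) {
    throw new RangeError("require integers with 0 <= c <= n and 1 <= k <= n");
  }
  if (n - c < k) return 1; // fewer than k failures: every k-subset contains a pass
  let allFail = 1;
  for (let i = n - c + 1; i <= n; i++) {
    allFail *= 1 - k / i;
  }
  return 1 - allFail;
}

// Exact binomial coefficient for small n. Each intermediate value is itself a
// binomial coefficient, so the division is exact as long as values stay < 2^53.
function combExact(n: number, r: number): number {
  if (r < 0 || r > n) return 0;
  let result = 1;
  for (let i = 1; i <= r; i++) {
    result = (result * (n - r + i)) / i;
  }
  return result;
}

// Largest absolute difference between the two formulas over the full sweep
// n = 1..maxN, c = 0..n, k = 1..n.
function maxAbsDiffSweep(maxN: number): number {
  let worst = 0;
  for (let n = 1; n <= maxN; n++) {
    for (let c = 0; c <= n; c++) {
      for (let k = 1; k <= n; k++) {
        const reference = 1 - combExact(n - c, k) / combExact(n, k);
        worst = Math.max(worst, Math.abs(passAtKSketch(n, c, k) - reference));
      }
    }
  }
  return worst;
}
```

Calling `maxAbsDiffSweep(30)` reproduces the shape of the sweep described above; the exact maximum it reports depends on this sketch's rounding and need not equal the 2.2e-16 figure measured for the real implementation.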
24 new unit tests pinning reference values, edge cases (n=0, boundary, symmetric discordance, no-discordant-pairs), and throw paths.
Full suite green (2310 tests), tsc --noEmit clean, biome clean.

Notes

Exact binomial tails reuse the in-module lnGamma (log-space sum → stable at large discordant counts); z-quantiles reuse zQuantile. No new dependencies.
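The log-space idea can be illustrated without the module's `lnGamma`. The sketch below swaps in a different but equivalent technique: it builds each log-probability from the previous one with a log-ratio recurrence, then combines them with log-sum-exp. The function name `exactMcNemarPSketch` is hypothetical, and this is not the implementation merged here.

```typescript
// Two-sided exact McNemar p-value from discordant counts b and c, computed in
// log space so that 0.5^(b + c) never underflows. Illustrative sketch only.
function exactMcNemarPSketch(b: number, c: number): number {
  if (!Number.isInteger(b) || !Number.isInteger(c) || b < 0 || c < 0) {
    throw new RangeError("b and c must be non-negative integers");
  }
  const m = b + c;
  if (m === 0) return 1; // no discordant pairs
  const k = Math.min(b, c);

  // log P(X = 0) for X ~ Binomial(m, 1/2), then
  // log P(X = i) = log P(X = i - 1) + log(m - i + 1) - log(i).
  const logTerms: number[] = [];
  let logTerm = m * Math.log(0.5);
  let maxLog = logTerm;
  logTerms.push(logTerm);
  for (let i = 1; i <= k; i++) {
    logTerm += Math.log(m - i + 1) - Math.log(i);
    if (logTerm > maxLog) maxLog = logTerm;
    logTerms.push(logTerm);
  }

  // log-sum-exp: factor out the largest term before exponentiating.
  let scaledSum = 0;
  for (const lt of logTerms) {
    scaledSum += Math.exp(lt - maxLog);
  }
  const logTail = maxLog + Math.log(scaledSum);
  return Math.min(1, 2 * Math.exp(logTail));
}
```

For small counts this agrees with direct summation, for example `exactMcNemarPSketch(1, 5)` is 14/64 up to rounding. At counts such as b=3000, c=2900 a direct `0.5 ** 5900` would underflow to zero, while the log-space sum still returns a usable p-value.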
Avoiding Wald-style intervals and tests is deliberate: these estimators are exact or Wilson-based because the normal approximation is unreliable for proportions near 0 or 1 and for small discordant counts, which is the regime eval A/Bs live in.
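A short comparison shows why the Wald interval is avoided. This sketch assumes a fixed 95% z-value instead of the module's `zQuantile`, uses hypothetical names (`wilsonSketch`, `waldSketch`), and returns the vacuous interval [0, 1] for n = 0, which is a guess: the PR does not state how the real `wilson` handles that case.

```typescript
interface IntervalSketch {
  lower: number;
  upper: number;
}

// Two-sided 95% standard-normal quantile, hard-coded for this sketch.
const Z_95 = 1.959963984540054;

// Wilson score interval for a binomial proportion.
function wilsonSketch(successes: number, n: number, z: number = Z_95): IntervalSketch {
  if (n < 0 || successes < 0 || successes > n) {
    throw new RangeError("require 0 <= successes <= n");
  }
  if (n === 0) return { lower: 0, upper: 1 }; // assumption: no data, no constraint
  const p = successes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  // Clamp only to absorb floating-point noise at the boundaries.
  return { lower: Math.max(0, center - half), upper: Math.min(1, center + half) };
}

// Wald (normal-approximation) interval, deliberately left unclamped so the
// overshoot outside [0, 1] is visible.
function waldSketch(successes: number, n: number, z: number = Z_95): IntervalSketch {
  if (n <= 0 || successes < 0 || successes > n) {
    throw new RangeError("require n > 0 and 0 <= successes <= n");
  }
  const p = successes / n;
  const half = z * Math.sqrt((p * (1 - p)) / n);
  return { lower: p - half, upper: p + half };
}
```

For 8 successes out of 10, `wilsonSketch` lands on roughly [0.4901, 0.9433], the literature anchor quoted in the verification notes, while `waldSketch` gives an upper bound above 1. At 0/10 or 10/10 the Wald interval collapses to a single point, which is the other failure mode the Wilson interval avoids.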

…sk-diff, Wilson, pass@k)

Add the binary-outcome estimators the continuous/paired stack was missing, so any consumer comparing pass/fail RATES between two conditions has the correct tools:

- mcnemar: exact paired-binary significance (binomial sign test on discordant pairs). A paired t-test / two-proportion z-test is wrong for matched binary outcomes; only discordant pairs carry signal. Exact doubled-binomial tail (correct at small discordant counts), with the continuity-corrected chi-square returned for reference.
- pairedRiskDifference: the paired effect size p(treatment)-p(control)=(b-c)/n with a CI from the paired (McNemar) variance, not the independent-samples formula.
- wilson: binomial proportion CI, correct near 0/1 and at small n where the Wald interval escapes [0,1]. Previously owned by a downstream consumer because the substrate lacked it; now canonical here.
- passAtK: the unbiased Chen et al. 2021 pass@k estimator for code-generation evals (1 - C(n-c,k)/C(n,k), stable product form).

Verified against exact-integer (math.comb) and scipy.binomtest ground truth across a full sweep (passAtK n=1..30 over all c,k matched to 2e-16; McNemar incl. b=400,c=350 large-count stability vs scipy). 24 new tests; full suite 2310 green; typecheck + biome clean.
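The difference between the paired and independent-samples variance in the pairedRiskDifference description can be sketched as follows. The names, the result shape, and the choice of a Wald-style interval d ± z·SE are all assumptions; the PR states only that the CI uses the paired (McNemar) variance, not how the interval is constructed.

```typescript
interface RiskDiffSketch {
  diff: number; // p(treatment) - p(control) = (b - c) / n
  se: number; // standard error from the paired variance
  lower: number;
  upper: number;
}

// Paired risk difference with a Wald-style CI built on the paired variance
// Var(d) = (b + c - (b - c)^2 / n) / n^2. Illustrative sketch only.
function pairedRiskDifferenceSketch(
  control: boolean[],
  treatment: boolean[],
  z: number = 1.959963984540054, // two-sided 95%, hard-coded assumption
): RiskDiffSketch {
  const n = control.length;
  if (n === 0 || n !== treatment.length) {
    throw new Error("need equal-length, non-empty arrays");
  }
  let b = 0; // control failed, treatment passed
  let c = 0; // control passed, treatment failed
  for (let i = 0; i < n; i++) {
    const ctl = control[i] === true;
    const trt = treatment[i] === true;
    if (!ctl && trt) b++;
    else if (ctl && !trt) c++;
  }
  const diff = (b - c) / n;
  const se = Math.sqrt(Math.max(0, b + c - ((b - c) * (b - c)) / n)) / n;
  return {
    diff,
    se,
    lower: Math.max(-1, diff - z * se),
    upper: Math.min(1, diff + z * se),
  };
}

// Standard error that treats the two arms as independent samples,
// shown only for comparison with the paired version.
function independentSeSketch(control: boolean[], treatment: boolean[]): number {
  const rate = (xs: boolean[]): number => xs.filter((x) => x).length / xs.length;
  const pc = rate(control);
  const pt = rate(treatment);
  return Math.sqrt((pc * (1 - pc)) / control.length + (pt * (1 - pt)) / treatment.length);
}
```

Take 100 pairs with 60 passing under both conditions, 15 passing only under treatment, 5 only under control, and 20 failing both. The difference is 0.10 and the paired standard error is √19/100 ≈ 0.044, against roughly 0.064 from the independent-samples formula, because most pairs agree and contribute no variance to the difference.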

tangletools

✅ Auto-approved PR — `3d82d262`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-19T14:31:19Z}

…hardening (#258)

Version-only bump of the trio (package.json + python pyproject + __init__). Ships everything merged since 0.93.0:

- #253 data-plane reliability/concurrency/scale hardening
- #254 paired-binary + coding-eval estimators (McNemar, risk-diff, Wilson, pass@k)
- #255 trace-store append/load race + collision-free rollover
- #256 McNemar power + required-N
- #257 single source of truth for span-timestamp parsing

tangletools approved these changes Jun 19, 2026

drewstone merged commit 34b2d55 into main Jun 19, 2026
1 check passed

This was referenced Jun 19, 2026

feat(statistics): McNemar power + required-N (pre-registration for paired-binary A/B) #256

Merged

chore(release): 0.94.0 #258

Merged

drewstone merged 1 commit into main from feat/paired-binary-stats
