Feat/faithful mcdonald age#15

Merged
JamesKane merged 13 commits into main from feat/faithful-mcdonald-age on Jul 1, 2026
@JamesKane

Owner

No description provided.

JamesKane and others added 13 commits June 29, 2026 04:25
…ode fixes

Pragmatic (baseline) STR-aging integration:

- FTDNA Y-STR import: genomics.biosample_str_profile (mig 0053), du_db::str_import
  parser/upsert, du-jobs `run-once ftdna-str` loader (kit→subject_id→sample_guid,
  manifest token match, FTDNA_STR_ALIAS override for own-genome accessions like
  WGS229=B5163). build_str_inputs UNIONs local + federated profiles.

- Per-branch STR ASR: tree.haplogroup_str_asr (mig 0054) persists the
  parsimony-reconstructed ancestral motif; ystr::branch_str_asr diffs node vs
  parent; Y-tree node sidebar shows the per-marker mutations (en/es/fr).

- Age correctness vs McDonald 2021:
  * §2.3 causality back-correction on COMBINED (parent ≥ child projection) —
    fixes parent-younger-than-child inversions introduced by the STR term.
  * §2.5.2 STR sparse-node gating (MIN_STR_TESTERS_FOR_COMBINE) — keeps
    reconstruction-collapsed (0–1 tester) nodes from dominating the SNP clock.
  * clear stale denormalised tmrca_ybp/formed_ybp on undatable nodes.

- documents/proposals/aging-pipeline-audit-mcdonald2021.md: full audit of the
  pipeline against McDonald 2021 (faithful items, divergences, refinements P1–P6).
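
The causality back-correction and the sparse-node gate described above can be sketched as follows. This is a minimal illustration, not the repository's code: the `Node` struct, the value chosen for `MIN_STR_TESTERS_FOR_COMBINE`, and the reduction of the parent ≥ child projection to a running maximum are all assumptions.

```rust
/// Hypothetical node record; `parent` is an index into the same slice.
#[derive(Debug, Clone)]
pub struct Node {
    pub parent: Option<usize>,
    /// Combined TMRCA in years before present; `None` for undatable nodes.
    pub tmrca_ybp: Option<f64>,
    /// STR testers contributing to this node.
    pub str_testers: usize,
}

/// Assumed value for illustration; the source does not state the real threshold.
pub const MIN_STR_TESTERS_FOR_COMBINE: usize = 2;

/// Sparse-node gate: only combine the STR term when enough testers back it.
pub fn str_term_allowed(node: &Node) -> bool {
    node.str_testers >= MIN_STR_TESTERS_FOR_COMBINE
}

/// Causality back-correction, simplified to a running maximum.
/// Assumes every parent is stored before its children, so a reverse
/// scan visits children first and lifts propagate up to the root.
pub fn back_correct(nodes: &mut [Node]) {
    for i in (0..nodes.len()).rev() {
        let (Some(p), Some(child_age)) = (nodes[i].parent, nodes[i].tmrca_ybp) else {
            continue;
        };
        if let Some(parent_age) = nodes[p].tmrca_ybp {
            if parent_age < child_age {
                nodes[p].tmrca_ybp = Some(child_age);
            }
        }
    }
}
```

Undatable nodes are left untouched in this sketch, which matches the intent of clearing stale ages rather than inventing them.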

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Restores the paper's probabilistic age model for side-by-side comparison with the
pragmatic depth-quantile branch (feat/ftdna-str-aging):

- SNP Eq 7/8: replace the depth-quantile estimator with the bottom-up
  PDF-convolution build — a node's TMRCA is the normalised product over children
  of (child TMRCA ⊛ branch-time) and over tester tips; P(t<0)=0 keeps a parent
  older than its children by construction.
- Eq 9 causality: top-down reverse-convolution constraint (Pdf::convolve_sub) at
  nodes with ≥2 informative children (§3.4 multiplicative propagation).
- P1 σ_µ (Appendix A.2.2/A.4): fold the Helgason 95% rate band into each CI in
  quadrature — the paper's dominant error term.
- P2 STR max-reliable-age cap (A.5.2): drop STR past STR_MAX_RELIABLE_YBP where it
  saturates/underestimates; SNP dates the deep clades.
- P4 tester-birth offset (A.1): +63±14 yr once at each tip.
- P5 NRR population prior (Eq 25) mechanism, default off pending calibration.
- pdf::from_weights for custom priors.
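
To make the bottom-up build concrete, here is a minimal sketch of the two operations it relies on: convolving a child TMRCA with a branch time, and taking the normalised product across children. `GridPdf` is a hypothetical stand-in for the repository's `Pdf` type, and the uniform grid with truncation at the last bin is an assumption made for brevity.

```rust
/// Discrete density on a uniform, non-negative time grid:
/// `p[i]` is the probability mass at t = i * step.
#[derive(Debug, Clone, PartialEq)]
pub struct GridPdf {
    pub p: Vec<f64>,
}

impl GridPdf {
    /// Build a density from non-negative weights, scaled to sum to 1.
    pub fn normalised(mut p: Vec<f64>) -> Self {
        let total: f64 = p.iter().sum();
        if total > 0.0 {
            for x in &mut p {
                *x /= total;
            }
        }
        GridPdf { p }
    }

    /// Density of the sum of two independent non-negative times.
    /// Mass that would land past the end of the grid is dropped, then
    /// the result is renormalised (a simplification).
    pub fn convolve(&self, other: &GridPdf) -> GridPdf {
        let n = self.p.len();
        let mut out = vec![0.0; n];
        for (i, a) in self.p.iter().enumerate() {
            for (j, b) in other.p.iter().enumerate() {
                if i + j < n {
                    out[i + j] += a * b;
                }
            }
        }
        GridPdf::normalised(out)
    }

    /// Pointwise product of two estimates of the same time, renormalised.
    pub fn product(&self, other: &GridPdf) -> GridPdf {
        let out: Vec<f64> = self.p.iter().zip(&other.p).map(|(a, b)| a * b).collect();
        GridPdf::normalised(out)
    }
}
```

Because the grid has no negative bins, a convolved density can never place mass below the child's own support, which is the discrete counterpart of P(t<0)=0.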

Validated on the corrected FTDNA-refined tree: faithful ages reproduce accepted
YFull TMRCAs within ~5% for most backbone clades (CT-M168, R-L23/L51/P310/U106/
L21/DF13/M222). See documents/proposals/aging-pipeline-audit-mcdonald2021.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Each terminal tester now contributes a Poisson age over its OWN callable
bp rather than a shared clade/joint denominator. A tester's private SNPs
are its own calls, so by construction they lie inside its own mask. Using
a uniform denominator (or the Eq-4 second-highest coverage) instead amounts
to a global rescale, which pushed every clade ~28% older and diverged from
generally accepted ages.

`Clade.tester_snps` becomes `Vec<(count, callable_bp)>`. Defining/branch
SNPs keep the joint-call mask they were ascertained over (region-consistent
with the SNP-region positive filter); only the tester denominator goes
per-sample. Net effect: deep branch-dominated nodes (BT, CT) are preserved
within ~1%, while recent tester-driven clades (M222, U106) move a few
percent older to reflect their real per-sample coverage.
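
As a rough sketch of the per-sample denominator, the snippet below turns each `(count, callable_bp)` pair into a point estimate of age. The element types, the helper name `tester_age_estimates`, and the use of the simple estimate m / (b·µ) instead of a full Poisson density are assumptions; the rate is the µ_SNP≈8e-10 figure quoted elsewhere in this PR.

```rust
/// SNP rate in SNP per bp per year, as quoted from McDonald 2021 in this PR.
pub const MU_SNP: f64 = 8e-10;

/// Hypothetical slice of the clade record: one (count, callable_bp) per tester.
pub struct Clade {
    pub tester_snps: Vec<(u32, u64)>,
}

/// Point estimate of each tester's age in years, using that tester's
/// own callable bp as the denominator.
pub fn tester_age_estimates(clade: &Clade) -> Vec<f64> {
    clade
        .tester_snps
        .iter()
        .map(|&(count, callable_bp)| count as f64 / (callable_bp as f64 * MU_SNP))
        .collect()
}

/// The replaced behaviour, for contrast: every tester shares one denominator.
pub fn shared_denominator_estimates(clade: &Clade, shared_bp: u64) -> Vec<f64> {
    clade
        .tester_snps
        .iter()
        .map(|&(count, _)| count as f64 / (shared_bp as f64 * MU_SNP))
        .collect()
}
```

With a shared denominator smaller than a tester's real coverage, every estimate scales up by the same factor, which is the global rescale the commit removes.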

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

A node can be validly defined for PLACEMENT by a marker the SNP clock can't
count — a palindromic SNP that gene-converts onto both arms (e.g. ZZ11), or
a recurrent/non-callable site the masks exclude. Such a node legitimately
has branch_snps = 0, but it is NOT a short branch: we have no clock
information about that edge.

The bug: branch_time fed m=0 into Pdf::poisson_on, whose m=0 case is the
exponential exp(-t·b·µ) (~1/(b·µ) ≈ 90 yr mean), not a point mass. So every
stacked clock-unstable generation (Z46516 → ZZ11 → …) silently padded ~90 yr
onto every ancestor above it, over-aging the deep backbone of richly
sub-divided clades (P312 had two such phantom generations pinning it).

Fix: when branch_snps == 0, the branch time is a point mass at t=0 — a
zero-length, age-transparent edge. The node stays in the tree (placement
intact); its descendants' ages propagate straight through to the parent with
no per-generation padding. General: applies to any palindromic/recurrent/
non-callable-defined node, not just ZZ11.
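
The rule can be sketched as a small branch-time selector. The `BranchTime` enum and function names here are hypothetical, and the m > 0 case assumes the normalised Poisson likelihood in t (a Gamma density with shape m+1 and rate b·µ); the real `Pdf::poisson_on` may differ in detail.

```rust
/// SNP rate in SNP per bp per year, as quoted from McDonald 2021 in this PR.
pub const MU_SNP: f64 = 8e-10;

/// Hypothetical summary of a branch-time distribution.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum BranchTime {
    /// Zero-length, age-transparent edge: all mass at t = 0.
    PointMassAtZero,
    /// Normalised Poisson likelihood in t: Gamma(shape = m + 1, rate = b * mu).
    Gamma { shape: f64, rate: f64 },
}

/// Branch time under the fix: no countable SNPs means no clock information,
/// so the edge contributes nothing instead of an exponential tail.
pub fn branch_time(branch_snps: u32, callable_bp: f64) -> BranchTime {
    if branch_snps == 0 {
        BranchTime::PointMassAtZero
    } else {
        BranchTime::Gamma {
            shape: branch_snps as f64 + 1.0,
            rate: callable_bp * MU_SNP,
        }
    }
}

impl BranchTime {
    /// Mean branch length in years.
    pub fn mean_years(&self) -> f64 {
        match *self {
            BranchTime::PointMassAtZero => 0.0,
            BranchTime::Gamma { shape, rate } => shape / rate,
        }
    }
}

/// Mean of the old m = 0 behaviour, exp(-t * b * mu), which is 1 / (b * mu).
pub fn old_zero_snp_mean(callable_bp: f64) -> f64 {
    1.0 / (callable_bp * MU_SNP)
}
```

For a callable region of about 13.9 Mbp (a value picked here only to reproduce the quoted figure), `old_zero_snp_mean` comes out near 90 years per zero-SNP generation, which is the padding the fix removes.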

Effect on the corrected tree: deep clock-stable backbone unchanged (BT/CT
move 0-2 yr), while the P312/U106 backbone shifts ~60-100 yr younger
(P312 5090→4994, U106 5115→5023) as the phantom padding is removed. Z46516
and ZZ11 are now coincident with P312, as they should be.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The regenerated FTDNA ingests carry per-variant confidence flags; the loader
excludes !jointConfirmed || !monophyletic from branch-age counting. But the
`monophyletic` flag is invalid for reverse-polarity SNPs: the reference
(CHM13/HG002) is haplogroup J, deep inside CT, so every backbone SNP above J
(BT/CT/F/CF) carries the derived allele AS the reference. Their derived
carriers are reference-matching (invisible to variant calling), while only the
ancestral A/B outgroup shows a variant. The result is a paraphyletic set that the
monophyly test (computed from variant calls) falsely flags as non-monophyletic.

Gating on it zeroed the ENTIRE deep backbone (CT 289/289, F 163/163, BT 8/8
flagged), which — with the age engine's zero-SNP-node transparency — collapsed
every TMRCA below CT to a single ~16.8 kya depth (the whole tree descends from
CT, so the parent≥child constraint capped all of it).

Fix: exempt reverse-polarity SNPs from the monophyletic clause (jointConfirmed
still applies to all; monophyletic still excludes genuine forward homoplasy).
The deep backbone counts again — BT 95.7 kya, CT 63.2 kya, A-M31 11.6 kya
(matches YFull). The joint VCF confirms these SNPs are real (GQ=99, clean
AC=5 synapomorphies); the flag was wrong, not the calls.
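
The change in the gate amounts to one extra disjunct. A minimal sketch follows, with a hypothetical `VariantFlags` struct standing in for the loader's per-variant record:

```rust
/// Hypothetical per-variant confidence flags carried by the ingest.
#[derive(Debug, Clone, Copy)]
pub struct VariantFlags {
    pub joint_confirmed: bool,
    pub monophyletic: bool,
    /// True when the reference genome carries the derived allele.
    pub reverse_polarity: bool,
}

/// Old gate: every variant must pass both flags.
pub fn counted_before_fix(v: VariantFlags) -> bool {
    v.joint_confirmed && v.monophyletic
}

/// New gate: jointConfirmed still applies to all variants, but the
/// monophyletic clause is waived for reverse-polarity SNPs, where the
/// flag is not meaningful.
pub fn counted_after_fix(v: VariantFlags) -> bool {
    v.joint_confirmed && (v.monophyletic || v.reverse_polarity)
}
```

A forward-polarity SNP flagged non-monophyletic is still excluded, so genuine homoplasy keeps being filtered.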

The real fix belongs upstream: the ASR/Fitch monophyly computation should be
polarity-aware. Documented for the tree team in
documents/proposals/denovo-ingest-confidence-flags.md. The INDEL artifact
remains unverified, so we stay on the SNP artifact for now.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

formed_ybp was written from the SNP-only propagation (age.formed.median),
while tmrca_ybp comes from the COMBINED product (SNP × STR × genealogical) and
is then raised by the causality back-correction. Two different sources: when
the STR term or the causality lift pushed the combined tmrca above the SNP-only
formed, the node ended up with formed < tmrca — an impossible ordering (a clade
can't split from its parent more recently than its own members coalesce).
Observed at R-L21 (formed 2795 < tmrca 3779) and R-DF13 (2667 < 3779).

Now formed = combined_tmrca + the node's own branch time, off the SAME
causality-corrected tmrca, so formed ≥ tmrca by construction. The branch offset
(formed − tmrca) is taken from the SNP propagation — a node's branch length is a
SNP-count property, independent of the STR/genealogical TMRCA evidence — and is
≥0 (0 for an age-transparent zero-SNP node).
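
In code form, the new derivation is a single addition. The function and parameter names below are hypothetical; only the relationship formed = combined_tmrca + branch offset, with the offset taken from the SNP propagation and clamped at zero, comes from the description above.

```rust
/// Formed age in years before present, derived from the causality-corrected
/// combined TMRCA plus the node's own SNP-based branch length.
///
/// `snp_formed` and `snp_tmrca` are the SNP-only propagation medians; their
/// difference is the branch offset, clamped so it can never be negative.
pub fn formed_ybp(combined_tmrca: f64, snp_formed: f64, snp_tmrca: f64) -> f64 {
    let branch_offset = (snp_formed - snp_tmrca).max(0.0);
    combined_tmrca + branch_offset
}
```

Because the offset is non-negative, the result can never fall below `combined_tmrca`. For the R-L21 figures reported here (formed 4002, tmrca 3779), the implied branch offset is 223 years.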

Tree-wide formed<tmrca violations: 0 (was nonzero at STR-rich recent clades).
L21 now formed 4002 ≥ tmrca 3779; DF13 3919 ≥ 3779.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The indel path is used to debug sample placements visually in the local web
view; its known problem manifests there as placement collapsing. Note this is
upstream in extraction, NOT the loader/age engine — with the reverse-polarity
fix the indel artifact loads cleanly (backbone counts, +10,560 indel links), so
it'll drop in once extraction is corrected. Until then we stay on the SNP
artifact; load indel only to eyeball placements, revert before trusting ages.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

McDonald 2021 §2.2.1: the SNP clock (Eq 2/3, µ_SNP≈8e-10 SNP·bp⁻¹·yr⁻¹) applies
to SNPs; indels/complex variants may be used "provided a mutation rate for them
can be accurately defined" — i.e. as a SEPARATE Eq-1 evidence term with its own
µ, never folded into the SNP count m. build_clades counted every haplogroup_variant
/ biosample_private_variant row (SNPs AND indels) into one Poisson m at the SNP
rate, which on an indel-bearing tree biases every indel-dense branch older.

Fix: both the branch and tester count queries filter mutation_type = 'SNP'
(mutation_type cleanly separates SNP from INS/DEL/INDEL/MNP/STR). A node whose
defining variants are all indels then has 0 age-countable SNPs and is handled as
an age-transparent zero-length branch (existing branch_time δ(t=0) rule) — exactly
like the palindromic/empty backbone nodes.
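
The actual fix is a SQL filter on `mutation_type = 'SNP'`. An in-memory equivalent, sketched here with a hypothetical `MutationType` enum, shows the effect on the Poisson count m:

```rust
/// Hypothetical mirror of the mutation_type column values named above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MutationType {
    Snp,
    Ins,
    Del,
    Indel,
    Mnp,
    Str,
}

/// Number of variants that may enter the SNP clock's count m.
/// Everything that is not a SNP is ignored for aging purposes.
pub fn age_countable_snps(variants: &[MutationType]) -> usize {
    variants
        .iter()
        .filter(|t| **t == MutationType::Snp)
        .count()
}
```

A branch defined only by indels returns 0 here and therefore falls through to the zero-length branch rule.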

Result: recompute over the indel tree now yields ages IDENTICAL to the SNP-only
tree (BT 95712, CT 63255, F 47833, A-M31 11584, P312 4405 — matching to the year)
while retaining the indel-resolved topology (11409 nodes vs 11363; +46 branches
that break SNP-only polytomies). 0 formed<tmrca violations. A dedicated indel
clock (µ_indel + its own callable denominator, separate term) remains future work.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Separates the two concerns for the tree team: the age engine handles indels
correctly (SNP-only clock, indel-only nodes age-transparent → indel-tree ages ==
SNP-tree ages), so any remaining indel-view oddities are the upstream
extraction/placement problem, not the clock.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Two Y-tree sidebar symptoms, both rooted in the same inverted-block
liftover leaving affected catalog SNPs stored reverse-complemented
(CHM13/HG002 Y is haplogroup J, deep inside CT — inverted blocks land RC):

- Y18975 spurious "back-mutation" badge. The web view reconstructs back-mutation
  from tree link.derived == catalog ancestral, but Y18975 is single-origin
  (recurrent=false); its link T>A vs catalog A>T is a pure strand artifact.
  Gate back_mutation on `recurrent`, since a non-recurrent SNP has nothing to
  revert from. This clears 1009 non-recurrent false positives, including Y18975.

- A9005 shown as "chrY:21905878C>A". The de-novo loader minted a coordinate
  name because the catalog's A9005 (G>T) is the RC of the tree link (C>A),
  so the exact-allele join at load missed it. Recover the real name for
  display via an RC-aware lateral: match a real-named catalog SNP at the
  same hs1 position whose allele set equals the link's OR its reverse-
  complement. GIN containment on position keeps it cheap; runs only for
  coordinate-named/unnamed links.
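
Both fixes come down to small predicates. The sketch below is an in-memory illustration only (the real change is a SQL lateral join); the `Snv` type and function names are hypothetical, and matching is done on the ordered ancestral/derived pair, which is an assumption about what "allele set" means. For a single-base substitution, reverse-complementing reduces to complementing each allele.

```rust
/// Complement of a single DNA base; `None` for anything else.
fn complement(base: char) -> Option<char> {
    match base {
        'A' => Some('T'),
        'T' => Some('A'),
        'C' => Some('G'),
        'G' => Some('C'),
        _ => None,
    }
}

/// Hypothetical single-nucleotide variant: ancestral > derived.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Snv {
    pub ancestral: char,
    pub derived: char,
}

impl Snv {
    /// The same variant as read from the opposite strand.
    pub fn reverse_complement(self) -> Option<Snv> {
        Some(Snv {
            ancestral: complement(self.ancestral)?,
            derived: complement(self.derived)?,
        })
    }
}

/// True when the tree link matches the catalog entry on either strand.
pub fn rc_aware_match(link: Snv, catalog: Snv) -> bool {
    link == catalog || link.reverse_complement() == Some(catalog)
}

/// Back-mutation badge, gated on `recurrent` as described above.
pub fn is_back_mutation(link: Snv, catalog: Snv, recurrent: bool) -> bool {
    recurrent && link.derived == catalog.ancestral
}
```

With these definitions the A9005 link (C>A) matches the catalog entry (G>T), and the Y18975 link (T>A against catalog A>T) no longer earns a badge because the SNP is not recurrent.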

No tree rebuild needed. See documents/proposals/denovo-ingest-confidence-flags.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…t prune

`tree_sample::recompute_placements` places non-D2C samples under the node their
published paper call resolves to, then prunes every haplogroup_sample row not in
its candidate set. But the de-novo loader (`denovo::load`) places tree tips by
ML-tree *topology*, and those cohort samples (1000G/HGDP/PRJEB) carry no Y/mt call
under the keys `pick_original_call` reads — so a single recompute, or the daily
`tree-samples-recompute` cron, silently wiped all ~9.6k topology tips (observed:
0 Y_DNA rows after a load+recompute).

Protect de-novo-origin biosamples (loader stamps `source_attrs->>'denovo'='true'`
at creation) from both the prune and any overwrite, exactly like CURATED rows. No
new status value — keeps every existing rollup/consumer (ystr, dedup, cladogram)
unchanged. Regression test: a denovo-flagged PLACED tip with no call survives
recompute and still counts toward its node rollup.
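
The protection can be pictured as one more exemption in the prune filter. This is a simplified in-memory sketch; the `Placement` struct and `prune` function are hypothetical, and the `denovo` field stands in for the `source_attrs->>'denovo'='true'` stamp.

```rust
use std::collections::HashSet;

/// Hypothetical haplogroup_sample row, reduced to what the prune needs.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Placement {
    pub sample_id: String,
    /// CURATED rows were already exempt from pruning.
    pub curated: bool,
    /// Set for biosamples created by the de-novo loader.
    pub denovo: bool,
}

/// Keep a row if it is protected (curated or de-novo origin) or if the
/// recompute still lists it as a candidate; drop everything else.
pub fn prune(existing: Vec<Placement>, candidates: &HashSet<String>) -> Vec<Placement> {
    existing
        .into_iter()
        .filter(|p| p.curated || p.denovo || candidates.contains(&p.sample_id))
        .collect()
}
```

A topology-placed tip with no published call is absent from the candidate set, so without the `denovo` exemption it would be removed on every recompute.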

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…vate

The de-novo pipeline never leaves a private singleton's `label` null — it
names the branch after the single sample's own id. The old collapse predicate
only matched `label.is_none()`, so it caught none of them: 6,549 UUID-named
branches leaked into the tree as public nodes, each with a tip of the same id
hanging off it (e.g. `7d7e9716…` under R-FT88981).

Widen "no public name" to also mean `label == that tip's sample id`. On the
real ftdna-indel ingest this collapses all 8,226 self-labeled singletons onto
their public parent (seeding their SNPs into the discovery substrate) while
preserving the 437 genuinely SNP-named single-sample terminals (R-FT49699,
R-BY95127, …) as real public branches.

Add a preprod-only `--keep-private` flag (threaded through load_denovo →
denovo::load) that disables the collapse so private branches render as nodes
for visual placement debugging. Production omits it; a forgotten flag is safe
(default collapses).
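
A minimal sketch of the widened predicate and the flag, assuming hypothetical function names and a branch with exactly one tip (the sample ids in the example are made up):

```rust
/// True when a single-sample branch has no public name: either no label
/// at all, or a label that is just the tip's own sample id.
pub fn is_private_singleton(label: Option<&str>, tip_sample_id: &str) -> bool {
    match label {
        None => true,
        Some(l) => l == tip_sample_id,
    }
}

/// Collapse decision. `keep_private` mirrors the preprod-only flag and
/// defaults to false, so forgetting it yields the production behaviour.
pub fn should_collapse(label: Option<&str>, tip_sample_id: &str, keep_private: bool) -> bool {
    !keep_private && is_private_singleton(label, tip_sample_id)
}
```

A branch labelled with a real SNP name such as `R-FT49699` fails the predicate and stays in the tree as a public node.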

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The SNP sidebar's "Placed samples" section listed every placed non-D2C
sample leaf at or below the node (capped at 50 with a "+N more" note). Replace
the list with just the count.

- du-db: add `tree_sample::count_under` — same subtree CTE + PLACED/CURATED
  filters as `samples_under`, but `count(*)` with no row materialization or
  publication join.
- du-web: `SnpSidebar` drops `samples`/`samples_more` (and the now-unused
  `LeafRow`/`SIDEBAR_SAMPLE_CAP`) for a single `sample_count`.
- template: render the bold count under the existing "Placed samples" header.
- locales: drop the orphaned `tree.samples.more` key (en/es/fr).
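
The real `tree_sample::count_under` is a SQL subtree CTE. As a rough in-memory analogue, assuming a hypothetical `TreeNode` layout, the idea of counting without materialising rows looks like this:

```rust
/// Hypothetical in-memory node: child indices plus the number of
/// PLACED/CURATED samples attached directly to this node.
pub struct TreeNode {
    pub children: Vec<usize>,
    pub placed_samples: usize,
}

/// Count samples at or below `root` with an explicit stack, summing
/// per-node counts instead of collecting the sample rows themselves.
pub fn count_under(nodes: &[TreeNode], root: usize) -> usize {
    let mut stack = vec![root];
    let mut total = 0;
    while let Some(i) = stack.pop() {
        total += nodes[i].placed_samples;
        stack.extend(&nodes[i].children);
    }
    total
}
```

The sidebar then needs only the returned number, so the row cap and the "+N more" bookkeeping become unnecessary.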

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@JamesKane JamesKane merged commit 583098d into main Jul 1, 2026
1 check failed