Feat/faithful mcdonald age #15
Merged
…ode fixes
Pragmatic (baseline) STR-aging integration:
- FTDNA Y-STR import: genomics.biosample_str_profile (mig 0053), du_db::str_import
parser/upsert, du-jobs `run-once ftdna-str` loader (kit→subject_id→sample_guid,
manifest token match, FTDNA_STR_ALIAS override for own-genome accessions like
WGS229=B5163). build_str_inputs UNIONs local + federated profiles.
- Per-branch STR ASR: tree.haplogroup_str_asr (mig 0054) persists the
parsimony-reconstructed ancestral motif; ystr::branch_str_asr diffs node vs
parent; Y-tree node sidebar shows the per-marker mutations (en/es/fr).
- Age correctness vs McDonald 2021:
* §2.3 causality back-correction on COMBINED (parent ≥ child projection) —
fixes parent-younger-than-child inversions introduced by the STR term.
* §2.5.2 STR sparse-node gating (MIN_STR_TESTERS_FOR_COMBINE) — keeps
reconstruction-collapsed (0–1 tester) nodes from dominating the SNP clock.
* clear stale denormalised tmrca_ybp/formed_ybp on undatable nodes.
- documents/proposals/aging-pipeline-audit-mcdonald2021.md: full audit of the
pipeline against McDonald 2021 (faithful items, divergences, refinements P1–P6).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Restores the paper's probabilistic age model for side-by-side comparison with the pragmatic depth-quantile branch (feat/ftdna-str-aging):
- SNP Eq 7/8: replace the depth-quantile estimator with the bottom-up PDF-convolution build. A node's TMRCA is the normalised product over children of (child TMRCA ⊛ branch-time) and over tester tips; P(t<0)=0 keeps a parent older than its children by construction.
- Eq 9 causality: top-down reverse-convolution constraint (Pdf::convolve_sub) at nodes with ≥2 informative children (§3.4 multiplicative propagation).
- P1 σ_µ (Appendix A.2.2/A.4): fold the Helgason 95% rate band into each CI in quadrature; this is the paper's dominant error term.
- P2 STR max-reliable-age cap (A.5.2): drop STR past STR_MAX_RELIABLE_YBP where it saturates/underestimates; SNP dates the deep clades.
- P4 tester-birth offset (A.1): +63±14 yr once at each tip.
- P5 NRR population prior (Eq 25) mechanism, default off pending calibration.
- pdf::from_weights for custom priors.
Validated on the corrected FTDNA-refined tree: faithful ages reproduce accepted YFull TMRCAs within ~5% for most backbone clades (CT-M168, R-L23/L51/P310/U106/L21/DF13/M222). See documents/proposals/aging-pipeline-audit-mcdonald2021.md.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
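For reviewers, the P1 and P4 items can be read as simple error arithmetic. The sketch below is illustrative only: `fold_rate_uncertainty`, `apply_tip_offset` and `rel_sigma_mu` are hypothetical names, and adding the ±14 yr spread in quadrature is an assumption, not something this commit states.

```rust
/// Tester-birth offset quoted in the commit message (+63 ± 14 yr).
const TESTER_BIRTH_OFFSET_YR: f64 = 63.0;
const TESTER_BIRTH_SIGMA_YR: f64 = 14.0;

/// Fold a relative mutation-rate uncertainty into an age uncertainty.
/// Because age scales as 1/µ, a relative error on µ maps (to first order)
/// onto the same relative error on the age. The two independent terms are
/// then added in quadrature.
/// `rel_sigma_mu` is a hypothetical parameter; no value is taken from the PR.
fn fold_rate_uncertainty(age_ybp: f64, sigma_stat: f64, rel_sigma_mu: f64) -> f64 {
    let sigma_rate = age_ybp * rel_sigma_mu;
    (sigma_stat * sigma_stat + sigma_rate * sigma_rate).sqrt()
}

/// Apply the birth offset once at a tip. Returns (shifted age, widened sigma).
/// Assumption: the ±14 yr spread is independent and added in quadrature.
fn apply_tip_offset(age_ybp: f64, sigma: f64) -> (f64, f64) {
    let shifted = age_ybp + TESTER_BIRTH_OFFSET_YR;
    let widened = (sigma * sigma + TESTER_BIRTH_SIGMA_YR * TESTER_BIRTH_SIGMA_YR).sqrt();
    (shifted, widened)
}
```

With a 3 yr statistical error and a 4% rate error on a 100 yr age, the combined error is about 5 yr, which shows why the rate term dominates as ages grow.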
Each terminal tester now contributes a Poisson age over its OWN callable bp rather than a shared clade/joint denominator. A tester's private SNPs are its own calls, so by construction they lie inside its own mask — using a uniform denominator (or the Eq-4 second-highest coverage) instead is a global rescale that pushed every clade ~28% older, diverging from generally-accepted ages. `Clade.tester_snps` becomes `Vec<(count, callable_bp)>`. Defining/branch SNPs keep the joint-call mask they were ascertained over (region-consistent with the SNP-region positive filter); only the tester denominator goes per-sample. Net effect: deep branch-dominated nodes (BT, CT) are preserved within ~1%, while recent tester-driven clades (M222, U106) move a few percent older to reflect their real per-sample coverage. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
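A minimal sketch of the per-tester denominator, assuming a simple point estimate of count / (callable_bp · µ) rather than the full Poisson PDF the engine builds. `tester_age_years` and `per_tester_ages` are hypothetical names; the tuple layout follows the `Vec<(count, callable_bp)>` shape described above, and µ is the approximate SNP rate quoted elsewhere in this PR.

```rust
/// Approximate SNP mutation rate (SNP per bp per year) quoted in this PR.
const MU_SNP: f64 = 8e-10;

/// Point estimate of one tester's private-branch age, using that tester's
/// OWN callable bp as the denominator (hypothetical helper).
fn tester_age_years(count: u32, callable_bp: u64) -> f64 {
    count as f64 / (callable_bp as f64 * MU_SNP)
}

/// One age per tester; no shared clade-wide denominator is involved.
fn per_tester_ages(tester_snps: &[(u32, u64)]) -> Vec<f64> {
    tester_snps
        .iter()
        .map(|&(count, callable_bp)| tester_age_years(count, callable_bp))
        .collect()
}
```

Swapping in a smaller shared denominator for every tester rescales every estimate upward by the same factor, which is the global rescale described in the message.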
A node can be validly defined for PLACEMENT by a marker the SNP clock can't count: a palindromic SNP that gene-converts onto both arms (e.g. ZZ11), or a recurrent/non-callable site the masks exclude. Such a node legitimately has branch_snps = 0, but it is NOT a short branch: we have no clock information about that edge.
The bug: branch_time fed m=0 into Pdf::poisson_on, whose m=0 case is the exponential exp(-t·b·µ) (~1/(b·µ) ≈ 90 yr mean), not a point mass. So every stacked clock-unstable generation (Z46516 → ZZ11 → …) silently padded ~90 yr onto every ancestor above it, over-aging the deep backbone of richly sub-divided clades (P312 had two such phantom generations pinning it).
Fix: when branch_snps == 0, the branch time is a point mass at t=0, i.e. a zero-length, age-transparent edge. The node stays in the tree (placement intact); its descendants' ages propagate straight through to the parent with no per-generation padding. General: applies to any palindromic/recurrent/non-callable-defined node, not just ZZ11.
Effect on the corrected tree: deep clock-stable backbone unchanged (BT/CT move 0-2 yr), while the P312/U106 backbone shifts ~60-100 yr younger (P312 5090→4994, U106 5115→5023) as the phantom padding is removed. Z46516 and ZZ11 are now coincident with P312, as they should be.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
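The zero-SNP rule can be sketched as a two-case branch-time type. This is not the engine's `Pdf` code: `BranchTime` and `mean_years` are hypothetical, the mean assumes a flat prior on t (so the Poisson likelihood normalises to a Gamma with shape m+1 and rate b·µ), and the 13.9 Mbp figure in the usage note is an assumed denominator chosen only to reproduce the ~90 yr mean.

```rust
/// Approximate SNP mutation rate (SNP per bp per year) quoted in this PR.
const MU_SNP: f64 = 8e-10;

/// Hypothetical stand-in for the branch-time distribution.
#[derive(Debug, PartialEq)]
enum BranchTime {
    /// No clock information: a zero-length, age-transparent edge.
    PointMassAtZero,
    /// Poisson likelihood in t for `m` SNPs over `callable_bp` bases.
    Poisson { m: u32, callable_bp: u64 },
}

/// Zero countable SNPs must not fall through to the Poisson m=0 case.
fn branch_time(branch_snps: u32, callable_bp: u64) -> BranchTime {
    if branch_snps == 0 {
        BranchTime::PointMassAtZero
    } else {
        BranchTime::Poisson { m: branch_snps, callable_bp }
    }
}

impl BranchTime {
    /// Mean branch length in years. Under a flat prior the Poisson case is
    /// Gamma(m + 1, rate = b·µ), whose mean is (m + 1) / (b·µ).
    fn mean_years(&self) -> f64 {
        match self {
            BranchTime::PointMassAtZero => 0.0,
            BranchTime::Poisson { m, callable_bp } => {
                (*m as f64 + 1.0) / (*callable_bp as f64 * MU_SNP)
            }
        }
    }
}
```

Constructing `BranchTime::Poisson { m: 0, callable_bp: 13_900_000 }` directly gives a mean near 90 yr, which is the padding each phantom generation used to add; `branch_time(0, 13_900_000)` gives 0.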
The regenerated FTDNA ingests carry per-variant confidence flags; the loader excludes !jointConfirmed || !monophyletic from branch-age counting. But the `monophyletic` flag is invalid for reverse-polarity SNPs: the reference (CHM13/HG002) is haplogroup J, deep inside CT, so every backbone SNP above J (BT/CT/F/CF) carries the derived allele AS the reference. Their derived carriers are reference-matching (invisible to variant calling) while only the ancestral A/B outgroup shows a variant — a paraphyletic set the monophyly test (computed from variant calls) falsely flags non-monophyletic. Gating on it zeroed the ENTIRE deep backbone (CT 289/289, F 163/163, BT 8/8 flagged), which — with the age engine's zero-SNP-node transparency — collapsed every TMRCA below CT to a single ~16.8 kya depth (the whole tree descends from CT, so the parent≥child constraint capped all of it). Fix: exempt reverse-polarity SNPs from the monophyletic clause (jointConfirmed still applies to all; monophyletic still excludes genuine forward homoplasy). The deep backbone counts again — BT 95.7 kya, CT 63.2 kya, A-M31 11.6 kya (matches YFull). The joint VCF confirms these SNPs are real (GQ=99, clean AC=5 synapomorphies); the flag was wrong, not the calls. The real fix belongs upstream: the ASR/Fitch monophyly computation should be polarity-aware. Documented for the tree team in documents/proposals/denovo-ingest-confidence-flags.md. The INDEL artifact remains unverified — staying on the SNP artifact. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
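The gating change reduces to one boolean predicate. A sketch under assumed names (`VariantFlags` and `counts_toward_age` are hypothetical; only the flag semantics come from the message above):

```rust
/// Hypothetical per-variant confidence flags as read by the loader.
struct VariantFlags {
    joint_confirmed: bool,
    monophyletic: bool,
    /// True when the reference itself carries the derived allele.
    reverse_polarity: bool,
}

/// A variant counts toward branch age when it is joint-confirmed and either
/// passes the monophyly test or is reverse-polarity (for which the test,
/// being computed from variant calls, is not meaningful).
fn counts_toward_age(v: &VariantFlags) -> bool {
    v.joint_confirmed && (v.monophyletic || v.reverse_polarity)
}
```

The previous rule was equivalent to `v.joint_confirmed && v.monophyletic`, which is what zeroed the reverse-polarity backbone SNPs.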
formed_ybp was written from the SNP-only propagation (age.formed.median), while tmrca_ybp comes from the COMBINED product (SNP × STR × genealogical) and is then raised by the causality back-correction. Two different sources: when the STR term or the causality lift pushed the combined tmrca above the SNP-only formed, the node ended up with formed < tmrca — an impossible ordering (a clade can't split from its parent more recently than its own members coalesce). Observed at R-L21 (formed 2795 < tmrca 3779) and R-DF13 (2667 < 3779). Now formed = combined_tmrca + the node's own branch time, off the SAME causality-corrected tmrca, so formed ≥ tmrca by construction. The branch offset (formed − tmrca) is taken from the SNP propagation — a node's branch length is a SNP-count property, independent of the STR/genealogical TMRCA evidence — and is ≥0 (0 for an age-transparent zero-SNP node). Tree-wide formed<tmrca violations: 0 (was nonzero at STR-rich recent clades). L21 now formed 4002 ≥ tmrca 3779; DF13 3919 ≥ 3779. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
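The new ordering guarantee is plain arithmetic. A sketch assuming point values rather than PDFs; `formed_ybp` as a free function and the `snp_branch_offset` parameter are hypothetical, and the 223 yr offset in the usage note is simply 4002 − 3779 from the L21 figures above.

```rust
/// formed = causality-corrected combined TMRCA + the node's own branch time.
/// `snp_branch_offset` is (formed − tmrca) from the SNP-only propagation and
/// is clamped at zero, so formed ≥ tmrca holds by construction.
fn formed_ybp(combined_tmrca: f64, snp_branch_offset: f64) -> f64 {
    combined_tmrca + snp_branch_offset.max(0.0)
}
```

For R-L21, `formed_ybp(3779.0, 223.0)` gives 4002; an age-transparent zero-SNP node has offset 0 and so formed equals tmrca.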
The indel path is used to debug sample placements visually in the local web view; its known problem manifests there as placement collapsing. Note this is upstream in extraction, NOT the loader/age engine — with the reverse-polarity fix the indel artifact loads cleanly (backbone counts, +10,560 indel links), so it'll drop in once extraction is corrected. Until then we stay on the SNP artifact; load indel only to eyeball placements, revert before trusting ages. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
McDonald 2021 §2.2.1: the SNP clock (Eq 2/3, µ_SNP≈8e-10 SNP·bp⁻¹·yr⁻¹) applies to SNPs; indels/complex variants may be used "provided a mutation rate for them can be accurately defined" — i.e. as a SEPARATE Eq-1 evidence term with its own µ, never folded into the SNP count m. build_clades counted every haplogroup_variant / biosample_private_variant row (SNPs AND indels) into one Poisson m at the SNP rate, which on an indel-bearing tree biases every indel-dense branch older. Fix: both the branch and tester count queries filter mutation_type = 'SNP' (mutation_type cleanly separates SNP from INS/DEL/INDEL/MNP/STR). A node whose defining variants are all indels then has 0 age-countable SNPs and is handled as an age-transparent zero-length branch (existing branch_time δ(t=0) rule) — exactly like the palindromic/empty backbone nodes. Result: recompute over the indel tree now yields ages IDENTICAL to the SNP-only tree (BT 95712, CT 63255, F 47833, A-M31 11584, P312 4405 — matching to the year) while retaining the indel-resolved topology (11409 nodes vs 11363; +46 branches that break SNP-only polytomies). 0 formed<tmrca violations. A dedicated indel clock (µ_indel + its own callable denominator, separate term) remains future work. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
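The real fix is a filter in the two count queries; the same rule expressed in Rust looks roughly like this. `MutationType` and `age_countable_snps` are hypothetical names that mirror the `mutation_type` values listed above.

```rust
/// Hypothetical mirror of the `mutation_type` column values.
#[derive(Clone, Copy, PartialEq, Debug)]
enum MutationType {
    Snp,
    Ins,
    Del,
    Indel,
    Mnp,
    Str,
}

/// Only SNPs feed the Poisson count m at the SNP rate. Everything else is
/// kept for topology but contributes nothing to the clock.
fn age_countable_snps(variants: &[MutationType]) -> usize {
    variants.iter().filter(|&&t| t == MutationType::Snp).count()
}
```

A node defined only by indels returns 0 here and is then treated as a zero-length, age-transparent branch.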
Separates the two concerns for the tree team: the age engine handles indels correctly (SNP-only clock, indel-only nodes age-transparent → indel-tree ages == SNP-tree ages), so any remaining indel-view oddities are the upstream extraction/placement problem, not the clock. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two Y-tree sidebar symptoms, both rooted in the same inverted-block liftover leaving affected catalog SNPs stored reverse-complemented (CHM13/HG002 Y is haplogroup J, deep inside CT; inverted blocks land RC):
- Y18975 spurious "back-mutation" badge. The web reconstructs back-mutation from tree link.derived == catalog ancestral, but Y18975 is single-origin (recurrent=false); its link T>A vs catalog A>T is pure strand artifact. Gate back_mutation on `recurrent`: a non-recurrent SNP has nothing to revert from. Clears 1009 non-recurrent false positives incl. Y18975.
- A9005 shown as "chrY:21905878C>A". The de-novo loader minted a coordinate name because the catalog's A9005 (G>T) is the RC of the tree link (C>A), so the exact-allele join at load missed it. Recover the real name for display via an RC-aware lateral: match a real-named catalog SNP at the same hs1 position whose allele set equals the link's OR its reverse-complement. GIN containment on position keeps it cheap; runs only for coordinate-named/unnamed links.
No tree rebuild needed. See documents/proposals/denovo-ingest-confidence-flags.md.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
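Both sidebar fixes can be sketched in a few lines. The actual change is a SQL lateral join plus a web-layer gate; the Rust below only illustrates the matching logic, and every function name in it is hypothetical.

```rust
/// Complement of a single base; anything unrecognised passes through.
fn complement(base: char) -> char {
    match base {
        'A' => 'T',
        'T' => 'A',
        'C' => 'G',
        'G' => 'C',
        other => other,
    }
}

/// Reverse-complement of an allele string.
fn reverse_complement(allele: &str) -> String {
    allele.chars().rev().map(complement).collect()
}

/// Order-independent allele pair, so it compares as a set.
fn allele_set(a: &str, b: &str) -> [String; 2] {
    let mut set = [a.to_string(), b.to_string()];
    set.sort();
    set
}

/// True when the catalog allele set equals the link's or its reverse-complement.
fn alleles_match_rc_aware(link: (&str, &str), catalog: (&str, &str)) -> bool {
    let cat = allele_set(catalog.0, catalog.1);
    let rc = allele_set(&reverse_complement(link.0), &reverse_complement(link.1));
    cat == allele_set(link.0, link.1) || cat == rc
}

/// Back-mutation badge: only a recurrent SNP can have reverted.
fn is_back_mutation(recurrent: bool, link_derived: &str, catalog_ancestral: &str) -> bool {
    recurrent && link_derived == catalog_ancestral
}
```

For A9005 the link C>A matches the catalog G>T through the reverse-complement arm; for Y18975 the `recurrent` gate returns false even though the link's derived allele equals the catalog's ancestral one.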
…t prune
`tree_sample::recompute_placements` places non-D2C samples under the node their published paper call resolves to, then prunes every haplogroup_sample row not in its candidate set. But the de-novo loader (`denovo::load`) places tree tips by ML-tree *topology*, and those cohort samples (1000G/HGDP/PRJEB) carry no Y/mt call under the keys `pick_original_call` reads. As a result a single recompute, or the daily `tree-samples-recompute` cron, silently wiped all ~9.6k topology tips (observed: 0 Y_DNA rows after a load+recompute).
Protect de-novo-origin biosamples (loader stamps `source_attrs->>'denovo'='true'` at creation) from both the prune and any overwrite, exactly like CURATED rows. No new status value, which keeps every existing rollup/consumer (ystr, dedup, cladogram) unchanged.
Regression test: a denovo-flagged PLACED tip with no call survives recompute and still counts toward its node rollup.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
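The protection rule amounts to widening the "do not touch" predicate. A sketch assuming an in-memory row shape; `SampleRow`, `Status`, `is_protected` and `should_prune` are hypothetical, and in the real code the de-novo flag lives in `source_attrs`.

```rust
/// Hypothetical placement status; no new value is introduced by the fix.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Status {
    Placed,
    Curated,
}

/// Hypothetical view of one haplogroup_sample row during recompute.
struct SampleRow {
    status: Status,
    /// Mirrors source_attrs->>'denovo' = 'true' stamped by the loader.
    denovo_origin: bool,
    /// Whether recompute's paper-call resolution produced this placement.
    in_candidate_set: bool,
}

/// CURATED rows were already protected; de-novo-origin rows now are too.
fn is_protected(row: &SampleRow) -> bool {
    row.status == Status::Curated || row.denovo_origin
}

/// A row is pruned only if recompute did not propose it AND it is unprotected.
fn should_prune(row: &SampleRow) -> bool {
    !row.in_candidate_set && !is_protected(row)
}
```

This matches the regression test described above: a PLACED tip with the de-novo flag and no call is outside the candidate set but is not pruned.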
…vate
The de-novo pipeline never leaves a private singleton's `label` null: it names the branch after the single sample's own id. The old collapse predicate only matched `label.is_none()`, so it caught none of them: 6,549 UUID-named branches leaked into the tree as public nodes, each with a tip of the same id hanging off it (e.g. `7d7e9716…` under R-FT88981).
Widen "no public name" to also mean `label == that tip's sample id`. On the real ftdna-indel ingest this collapses all 8,226 self-labeled singletons onto their public parent (seeding their SNPs into the discovery substrate) while preserving the 437 genuinely SNP-named single-sample terminals (R-FT49699, R-BY95127, …) as real public branches.
Add a preprod-only `--keep-private` flag (threaded through load_denovo → denovo::load) that disables the collapse so private branches render as nodes for visual placement debugging. Production omits it; a forgotten flag is safe (default collapses).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
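The widened predicate, sketched with hypothetical function names and a made-up sample id (`sample-0001` is illustrative, not a real tip):

```rust
/// "No public name" now covers both a missing label and a label that is
/// just the single tip's own sample id.
fn has_no_public_name(label: Option<&str>, tip_sample_id: &str) -> bool {
    match label {
        None => true,
        Some(l) => l == tip_sample_id,
    }
}

/// Collapse a single-sample branch onto its public parent unless the
/// preprod-only keep-private switch is set.
fn should_collapse(label: Option<&str>, tip_sample_id: &str, keep_private: bool) -> bool {
    !keep_private && has_no_public_name(label, tip_sample_id)
}
```

The old rule corresponds to `label.is_none()` alone, which is why self-labeled singletons slipped through; a branch labeled `R-FT49699` over a tip with a different id still stays public.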
The SNP sidebar's "Placed samples" section listed every placed non-D2C sample leaf at or below the node (capped at 50 with a "+N more" note). Replace the list with just the count.
- du-db: add `tree_sample::count_under`, with the same subtree CTE + PLACED/CURATED filters as `samples_under`, but `count(*)` with no row materialization or publication join.
- du-web: `SnpSidebar` drops `samples`/`samples_more` (and the now-unused `LeafRow`/`SIDEBAR_SAMPLE_CAP`) for a single `sample_count`.
- template: render the bold count under the existing "Placed samples" header.
- locales: drop the orphaned `tree.samples.more` key (en/es/fr).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
No description provided.