HermitCrab FST acceleration: sound, fast, verify-by-re-analysis analyzer (+ grammar advisor) #441

Draft
johnml1135 wants to merge 3 commits into master from fst-advisor

@johnml1135 johnml1135 commented Jun 26, 2026

What

Accelerates HermitCrab morphological analysis with a precompiled FST, behind a caching front end
that keeps the engine as the source of truth. No second morphology engine, no reimplemented constraints.

Entry point — CachingMorphologicalAnalyzer (fast + slow + cache):

  • Default AnalyzeWord = guaranteed complete (backwards-compatible). On a certified grammar
    (FST-closed per the census and FST==engine set-parity over a corpus) the FST is proven complete,
    so this runs FST-only with no full search. Otherwise it returns the cached engine result, or runs
    the engine on a miss and caches it. Either way: complete.
  • AnalyzeWordFast = opt-in immediate. Cached-complete if warm (or if the grammar is certified),
    else a sound but possibly under-generating verified-FST result, flagged IsComplete=false. Never
    runs the engine.
  • Warm(corpus) fills the cache in parallel; AnalysisCacheSerializer persists it across
    sessions (fixed corpora), keyed by MorphemeRegistry and guarded by a grammar-version string
    (stale cache rejected → re-warm). Confirmed non-words are cached too.
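
The three entry points above can be sketched as follows. This is a minimal, language-neutral illustration in Python, not the PR's C# API: the class name `CachingAnalyzerSketch`, the `FastResult` type, and the callables `engine` and `verified_fst` are hypothetical stand-ins, and warming is sequential here where the PR warms in parallel.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Iterable

Analyses = FrozenSet[str]  # stand-in for a set of HC analyses


@dataclass(frozen=True)
class FastResult:
    analyses: Analyses
    is_complete: bool  # mirrors the IsComplete flag described above


class CachingAnalyzerSketch:
    """Hypothetical stand-in for the caching front end (not the real class)."""

    def __init__(self, engine: Callable[[str], Analyses],
                 verified_fst: Callable[[str], Analyses], certified: bool):
        self._engine = engine              # full search: slow, complete
        self._verified_fst = verified_fst  # sound, may under-generate
        self._certified = certified        # FST proven complete for this grammar
        self._cache: Dict[str, Analyses] = {}

    def analyze_word(self, word: str) -> Analyses:
        """Default path: always complete."""
        if self._certified:
            return self._verified_fst(word)  # no full search needed
        if word not in self._cache:
            # Miss: run the engine once and cache it (empty results too,
            # so confirmed non-words are remembered).
            self._cache[word] = self._engine(word)
        return self._cache[word]

    def analyze_word_fast(self, word: str) -> FastResult:
        """Opt-in path: never runs the engine."""
        if self._certified:
            return FastResult(self._verified_fst(word), True)
        if word in self._cache:
            return FastResult(self._cache[word], True)
        return FastResult(self._verified_fst(word), False)

    def warm(self, corpus: Iterable[str]) -> None:
        for word in corpus:
            self.analyze_word(word)
```

On an uncertified grammar, a cold `analyze_word_fast` call returns a provisional result with `is_complete=False`; after `warm`, the same call returns the cached engine result with `is_complete=True`.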

The FST pipeline behind the fast path: FstTemplateAnalyzer (proposer; immutable, shared;
derivation depth tunable) → VerifiedFstAnalyzer (confirms each candidate by restricted re-analysis,
FstReplay, against HC's own engine from a MorpherPool; emits the genuine HC analysis) →
CompleteHybridMorpher (certified→FST / else→engine, with per-word AnalyzeWord(word, useFst)).
GrammarFstAdvisor + GrammarFstClosure are the grammar census/linter (this PR's original core).
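
The propose-then-verify split can be illustrated with a toy grammar. The sketch below is an assumption-laden Python illustration, not the PR's code: `ROOTS`, `SUFFIXES`, `propose`, and `engine_confirms` are hypothetical, and the "engine" is reduced to one category constraint that the proposer deliberately ignores.

```python
# Toy lexicon: root -> category, suffix -> category it attaches to.
ROOTS = {"walk": "V", "cat": "N"}
SUFFIXES = {"ed": "V"}


def propose(word):
    """Over-generating proposer: any root+suffix split, constraints ignored."""
    candidates = []
    if word in ROOTS:
        candidates.append((word,))
    for i in range(1, len(word)):
        root, suffix = word[:i], word[i:]
        if root in ROOTS and suffix in SUFFIXES:
            candidates.append((root, suffix))
    return candidates


def engine_confirms(word, candidate):
    """Stand-in for restricted re-analysis: the engine has the final say."""
    if len(candidate) == 1:
        return candidate[0] == word and word in ROOTS
    root, suffix = candidate
    return root + suffix == word and ROOTS.get(root) == SUFFIXES.get(suffix)


def verified_analyze(word):
    """Keep only the candidates the engine reproduces."""
    return [c for c in propose(word) if engine_confirms(word, c)]
```

Here `propose("cated")` yields a candidate, but `verified_analyze("cated")` rejects it because the toy engine enforces the category constraint. The proposer only affects speed and coverage; correctness stays with the verifier.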

Guarantees

  • Correctness equals the engine — the cache never invents or hides an analysis; the default path is
    always complete. The fast path is sound (0 false positives on 50 generated non-words), a yes-only
    detector for "is this a word" (can under-generate, even to zero on un-built constructs), which the
    cache/certification corrects.
  • Proven-complete → no full search. A certified grammar never runs the engine; on the 60-word Sena
    corpus that certifies, the default complete path runs at ~18 ms/word (~11×).
  • Fast — ~13× on Sena's FST path; 98% of words covered by the fast path (the rest resolve via
    engine/cache, no silent miss).
  • Thread-safe — shared proposer + pooled engine + concurrent cache; 0 parallel-vs-sequential
    mismatches on 200 words (and a CI test).

Tests

CI unit tests on the in-repo toy grammar cover the proposer, the verify chain, the caching/persistence
layer (default==engine, provisional→complete after warm, certified-skip, round-trip + version guard),
soundness/negatives, the category fix, per-word opt-out, and thread-safety; plus advisor/closure. An
[Explicit] benchmark measures speed/parity/soundness/concurrency/certification on an external grammar.
Full unit suite green (93).

Honest limitations / out of scope

  • ~2% of Sena words use constructs the FST proposer doesn't build yet — compounding (the main one),
    depth-3 derivation, one suffix-order case. They resolve via the engine/cache (no silent miss) and keep
    the full corpus from certifying. Compounding is the highest-value next coverage build (an additive,
    shared-root-chain design) and is scoped as a focused follow-on.
  • The per-stem completeness proof (proving the fast path complete without the engine) was explored
    and abandoned; completeness is delivered by certification + cache + engine.
  • Deferred: the generator (reverse/synthesis direction); compounding; a 2-way-FST treatment of the
    residual.

Design + research record: docs/HERMITCRAB_FST_PLAN.md (§13 = caching front end); advisor:
docs/HERMITCRAB_FST_ADVISOR.md.

🤖 Generated with Claude Code



Copilot AI left a comment


Pull request overview

Adds a new static “FST-readiness” grammar linter for the HermitCrab parser (GrammarFstAdvisor.Analyze(Language)) that walks a compiled grammar, emits per-rule advisories (Escape/Cost/Info + reclaim notes like Regular/Probeable), and produces an overall tier verdict intended for authoring-time/CI use. This lays groundwork for future FST compilation work by making “what blocks FST / what is slow today” visible and actionable.

Changes:

  • Introduces GrammarFstAdvisor, GrammarFstReport, and GrammarAdvisory to classify expensive/non-FST-able constructs across morphological and phonological rules.
  • Adds NUnit tests covering concatenative cases, reduplication (bounded/unbounded), infixation, rewrite-rule harmony behavior, and opacity/probe-ability.
  • Adds planning docs for the advisor and the broader HermitCrab FST acceleration roadmap, plus an explicit local benchmark test for running the advisor on an external grammar.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tests/SIL.Machine.Morphology.HermitCrab.Tests/GrammarFstAdvisorTests.cs | Adds coverage for the advisor’s tiering and key escape classifications (reduplication/infix/harmony/opacity). |
| tests/SIL.Machine.Morphology.HermitCrab.Tests/GrammarFstAdvisorBenchmark.cs | Adds an [Explicit] helper test to run and print the advisor report on an external HC XML grammar. |
| src/SIL.Machine.Morphology.HermitCrab/GrammarFstAdvisor.cs | Implements the advisor, report model, and the core static analyses for affix and phonological rules. |
| HERMITCRAB_FST_PLAN.md | Documents the planned FST compiler/runtime approach, tiered hybrid design, and decision gate. |
| fst.md | Documents the advisor’s classification rules, tier model, and the orthogonal Regular/Probeable axes. |

Comment thread src/SIL.Machine.Morphology.HermitCrab/GrammarFstAdvisor.cs
Comment thread src/SIL.Machine.Morphology.HermitCrab/GrammarFstAdvisor.cs
@johnml1135 johnml1135 changed the title HermitCrab FST acceleration: grammar FST-readiness advisor (groundwork) HermitCrab FST acceleration: sound, fast, verify-by-re-analysis analyzer (+ grammar advisor) Jun 26, 2026
@johnml1135 johnml1135 force-pushed the fst-advisor branch 2 times, most recently from bb56fd2 to f95ae9b Compare June 27, 2026 11:43
@johnml1135


Thanks @Copilot — both advisor comments addressed in 7832611f:

  1. Per-rule counting (GrammarFstReport): counts now group advisories by (Rule, Stratum, Kind) and take each rule's worst severity, so per-allomorph advisories no longer overcount. The escape sub-counts are now exact partitions — Probeable + Opaque = Escape and Regular + NonRegular = Escape (a rule is opaque/non-regular if any of its escape advisories is, the conservative aggregate).

  2. RealizationalAffixProcessRule skipped: the morphological-rule switch in Analyze now handles it alongside AffixProcessRule (both have Allomorphs and can encode reduplication/infixation). AnalyzeAffix was refactored to take (name, allomorphs) so both route through the same examination, and AffixRulesExamined now counts them. Regression test added: Analyze_RealizationalReduplication_IsExamined.
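
The per-rule counting described in item 1 can be sketched as below. This is an illustrative Python sketch, not the PR's `GrammarFstReport`: the `Advisory` fields and the severity ranking (Escape worst, then Cost, then Info) are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass

SEVERITY_RANK = {"Info": 0, "Cost": 1, "Escape": 2}  # assumed ordering


@dataclass(frozen=True)
class Advisory:
    rule: str
    stratum: str
    kind: str
    severity: str
    probeable: bool = False
    regular: bool = False


def count_per_rule(advisories):
    """Count each (rule, stratum, kind) once, at its worst severity."""
    groups = defaultdict(list)
    for adv in advisories:
        groups[(adv.rule, adv.stratum, adv.kind)].append(adv)

    counts = dict.fromkeys(
        ["Info", "Cost", "Escape", "Probeable", "Opaque", "Regular", "NonRegular"], 0
    )
    for group in groups.values():
        worst = max(group, key=lambda a: SEVERITY_RANK[a.severity]).severity
        counts[worst] += 1
        if worst == "Escape":
            escapes = [a for a in group if a.severity == "Escape"]
            # Conservative aggregate: one opaque (or non-regular) escape
            # advisory makes the whole rule opaque (or non-regular).
            counts["Probeable" if all(a.probeable for a in escapes) else "Opaque"] += 1
            counts["Regular" if all(a.regular for a in escapes) else "NonRegular"] += 1
    return counts
```

Because every escaping rule lands in exactly one bucket on each axis, `Probeable + Opaque` and `Regular + NonRegular` both equal `Escape` by construction.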

@codecov-commenter

codecov-commenter commented Jun 27, 2026


Codecov Report

❌ Patch coverage is 84.38710% with 605 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.40%. Comparing base (7835067) to head (24ddb3b).
⚠️ Report is 2 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...chine.Morphology.HermitCrab/FstTemplateAnalyzer.cs | 87.11% | 86 Missing and 24 partials ⚠️ |
| ...Machine.Morphology.HermitCrab/GrammarFstAdvisor.cs | 82.05% | 54 Missing and 16 partials ⚠️ |
| src/SIL.Machine/Annotations/Shape.cs | 78.59% | 52 Missing and 12 partials ⚠️ |
| ...ine.Morphology.HermitCrab/PhonologyRuleCompiler.cs | 73.29% | 34 Missing and 17 partials ⚠️ |
| ....Machine.Morphology.HermitCrab/SurfacePhonology.cs | 82.70% | 21 Missing and 11 partials ⚠️ |
| ...ine.Morphology.HermitCrab/ReduplicationProposer.cs | 81.50% | 18 Missing and 9 partials ⚠️ |
| src/SIL.Machine/FiniteState/VisitedStates.cs | 19.35% | 23 Missing and 2 partials ⚠️ |
| ....Morphology.HermitCrab/ForwardSynthesisProposer.cs | 82.96% | 17 Missing and 6 partials ⚠️ |
| ....Machine.Morphology.HermitCrab/FstCoverageProbe.cs | 82.96% | 21 Missing and 2 partials ⚠️ |
| ...Machine/DataStructures/DataStructuresExtensions.cs | 57.50% | 16 Missing and 1 partial ⚠️ |
| ... and 31 more | | |
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #441      +/-   ##
==========================================
+ Coverage   73.30%   74.40%   +1.10%     
==========================================
  Files         443      462      +19     
  Lines       37203    40431    +3228     
  Branches     5110     5614     +504     
==========================================
+ Hits        27272    30084    +2812     
- Misses       8805     9116     +311     
- Partials     1126     1231     +105     


@johnml1135 johnml1135 marked this pull request as draft July 1, 2026 12:57
johnml1135 and others added 3 commits July 2, 2026 13:02
Squashed history of the RUSTIFY performance branch. Full rationale,
methodology, benchmark data, and rejected-approach audit trail live in the
PR description; see PR #446 for the complete writeup.

Summary of what changed:
- Flat/COW Shape and ShapeNode backing (parallel int-linked arrays instead
  of an OrderedBidirList; ShapeNode becomes an (Owner, Index) handle).
- Copy-on-write Shape cloning: a clone of a frozen shape shares the
  source's backing until first mutation.
- FeatureStruct bit-packed ulong flat-unify fast path for the common
  simple/no-variable case, falling back to the original engine otherwise.
- int-offset FST traversal (Fst<Word,int>, flattened Register<TOffset>[,]
  to a 1-D array, reusable per-Transduce register scaffold).
- Every HermitCrab rule-spec file migrated from ShapeNode-offset to
  int-offset pattern matching.
- Configurable MaxDegreeOfParallelism (1 = fully single-threaded, for
  callers that parallelize across words themselves).
- Cheap GetHashCode overrides (State, IDBearerBase, FeatureValue,
  ShapeNode) replacing the CLR's identity-hash fallback on several hot
  dictionary/hashset key paths; StringComparer.Ordinal on hot sorts that
  were paying culture-aware comparison; a shared per-thread Random instead
  of one per BidirList instance; a filtered-annotation-view cache on
  frozen AnnotationLists.
- SyntacticFeatureStruct mutate-after-freeze correctness hardening
  (8 sites converted to clone-then-reassign; caught a real aliasing bug
  in the synthesis PriorityUnion path along the way).
- Post-squash review pass fixed a memoization soundness bug in
  CombinationRuleCascade (reverted to master's unconditional exploration),
  an unsynchronized lazy-init race in FeatureStruct.EnsureFlat (fixed with
  Volatile.Read/Write), and a frozen-flag-lost-on-detach bug in ShapeNode,
  plus several efficiency/cleanup fixes.
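
As a rough illustration of the copy-on-write cloning bullet above (a Python sketch under assumed names, not the actual Shape implementation): a clone of a frozen shape shares the source's backing list and copies it only on its first mutation. `CowShape` and its methods are hypothetical.

```python
class CowShape:
    """Hypothetical copy-on-write container; not the real Shape class."""

    def __init__(self, nodes=()):
        self._nodes = list(nodes)
        self._shared = False  # True while the backing list belongs to a frozen source
        self._frozen = False

    def freeze(self):
        self._frozen = True

    def clone(self):
        copy = CowShape.__new__(CowShape)
        copy._frozen = False
        if self._frozen:
            # A frozen source can never change, so sharing is safe.
            copy._nodes = self._nodes
            copy._shared = True
        else:
            copy._nodes = list(self._nodes)
            copy._shared = False
        return copy

    def append(self, node):
        if self._frozen:
            raise ValueError("shape is frozen")
        if self._shared:
            # First mutation: detach from the shared backing.
            self._nodes = list(self._nodes)
            self._shared = False
        self._nodes.append(node)

    def nodes(self):
        return tuple(self._nodes)
```

Cloning stays O(1) for the common clone-and-read case, and the source is never observed to change because it is frozen before it is shared.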

Parse results are unchanged (byte-identical to master, verified via the
full HermitCrab regression suite plus per-word signature diffs against
master on Sena and Indonesian).

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>

HC FST: full-coverage audit plan (docs/FST_FULL_COVERAGE_PLAN.md)

Four parallel audits (formal-language status × HC impl × FST impl) of every HC construct,
classified covered / partial / coverable / not-coverable, with architecture proposals and
appendices on closing the non-regular gap.

Headline findings:
- Almost all of HC is REGULAR (Kaplan-Kay) hence 1-way-FST-able; the only genuinely
  non-regular core is unbounded full-stem reduplication ({ww}) + an unbounded self-feeding
  rewrite cycle (HC caps at 256).
- Critical coverage ceiling: the proposer is only correct for 0-PHONOLOGY grammars (arcs are
  underlying segments, walk is surface) — it silently under-generates (fails safe; parity gate
  refuses to certify) for any grammar with phonological rules. Phonology-by-composition is the
  biggest coverage win.
- Robustness bug: the proposer THROWS on infix/circumfix/reduplication/process slots, aborting
  the whole build instead of degrading to the engine. Graceful degradation is the top this-PR fix.
- Other gaps: true zero-segment affix dropped; bounded compounding needs proposer + FstReplay
  changes; MPR/co-occurrence/env/stemname correctly left to verify (sound).

Appendix A: length-cap fold / detect-and-peel (compile-replace) / 2-way FST (Dolatian-Heinz) /
engine backstop for the non-FST-able constructs. Appendix B: verify-by-re-analysis + escape-aware
codec + certified-skip interlock all HELP later non-regular work; only the 2-way reduplication
solution would need a new execution model.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

HC FST: graceful degradation (A) + zero-segment affix (B) + advisor review fixes

A — Graceful degradation: the proposer no longer THROWS on infix/circumfix/reduplication/
process slots; it skips the unbuildable construct, builds the rest, and sets
CoversAllConstructs=false so the grammar can't certify (those words fall to the
engine/cache; the parity gate enforces it). Was a NotSupportedException that aborted the
whole build, making the FST unusable on any grammar with such a slot.

B — True zero-segment affix (CopyFromInput only, no InsertSegments) now emits its morpheme
token with no segment arcs instead of throwing / being silently dropped. SlotOp treats a
zero-only slot as a position-less suffix so it still builds.

Certification guard: FromLanguage (Caching + CompleteHybrid) now requires
proposer.CoversAllConstructs in addition to closed + parity — a degraded build can't certify.
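
A sketch of how graceful degradation and the certification guard fit together, in Python with hypothetical names (`build_proposer`, `can_certify`, and the `(name, kind)` slot tuples are illustrative, not the PR's types):

```python
# Construct kinds the toy builder cannot model (from the commit text above).
UNBUILDABLE = {"infix", "circumfix", "reduplication", "process"}


def build_proposer(slots):
    """Build what can be built; record, rather than raise on, what cannot.

    `slots` is a list of (name, kind) pairs. Returns the names that were
    built and a flag playing the role of CoversAllConstructs.
    """
    built = []
    covers_all_constructs = True
    for name, kind in slots:
        if kind in UNBUILDABLE:
            covers_all_constructs = False  # degrade instead of aborting the build
            continue
        built.append(name)
    return built, covers_all_constructs


def can_certify(fst_closed, parity_ok, covers_all_constructs):
    """A degraded build can never certify, even if closure and parity pass."""
    return fst_closed and parity_ok and covers_all_constructs
```

With a reduplication slot present, the build still succeeds for the remaining slots, but `can_certify` returns `False`, so words using the skipped construct keep going to the engine or cache.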

Copilot review fixes (advisor, still in PR):
- Examine RealizationalAffixProcessRule (it implements IMorphologicalRule + has Allomorphs;
  can encode reduplication/infix) — previously silently skipped, undercounting escapes.
  AnalyzeAffix refactored to (name, allomorphs) and the switch handles both rule types.
- GrammarFstReport counts are now PER-RULE (group advisories by Rule/Stratum/Kind, worst
  severity) instead of per-advisory, so per-allomorph advisories don't overcount and the
  partitions are consistent (Probeable+Opaque = Escape, Regular+NonRegular = Escape).

Tests: Build_ReduplicationSlot_DegradesGracefully_DoesNotThrow,
Analyze_ZeroSegmentSuffix_IsEmitted_NotDropped, Analyze_RealizationalReduplication_IsExamined.
Unit suite 96 green; Sena unchanged (certifies, 0 parallel mismatches, 0 false positives).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

HC FST: apply CSharpier formatting (fix CI Check-formatting build failure)

CI runs `dotnet csharpier check .` and the new/edited FST files were not formatted.
Ran `dotnet csharpier format .` (1.2.6) — only the 11 FST/advisor/test files changed;
no unrelated files touched. Unit suite 96 green; csharpier check clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

HC FST Solution 1: surface-allomorph precompile for altered bare roots

Let the FST proposer match phonologically-altered surfaces, the C-internal
tier of Solution 1 (surface-allomorph precompile, docs/FST_FULL_COVERAGE_PLAN.md
Appendix C). For each root the grammar allows to stand bare, build a proposer arc
not just for the underlying shape but for every bare surface realization HC
synthesizes (phonology applied) — reusing the obligatoriness GenerateWords call,
so zero extra build cost. The emitted token is always the underlying morpheme;
verify re-runs HC with real phonology to confirm.

- FstTemplateAnalyzer: _bareRootValid -> _bareRootSurfaces; add BareRootSurfaces,
  UnderlyingForm, BuildRootChainFromSurface. Underlying arcs kept (union), so the
  0-phonology path is unchanged.

Fix a latent verify bug this exposed: AnalysisRewriteRule/AnalysisMetathesisRule
gate on Morpher.RuleSelector, and FstReplay pinned the selector to just the
candidate's morphological rules — silently disabling ALL phonology during verify.
The propose-and-verify spine could therefore never confirm any phonologically-
altered candidate. Phonological rules are obligatory deterministic rewrites, not a
fan-out choice, so FstReplay now always lets IPhonologicalRule through; the
morphological fan-out is still collapsed by gating the leaf rules + root, and
soundness is still enforced by the unchanged candidate-signature match.

Add Verified_CoversPhonologicallyAlteredBareRoot: an unconditional t->d rule makes
bare root "dat" surface only as "dad"; a baseline assertion proves the underlying-
only proposer misses "dad", the surface-precompile proposer covers it, verify
confirms it as a genuine HC analysis, and a non-word still yields nothing. Full
HermitCrab suite green (97 passed).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

HC FST: plan to close the coverage gap (phonology, infix, reduplication)

FST_FULL_PLAN.md — implementation plan for the four expansion points. The
propose-and-verify split means correctness lives in verify + certification, never
in the proposer, so coverage expansion can only change the acceleration ratio,
never produce a wrong answer. Architecture: a CompositeProposer unions candidate
generators (FST + reduplication + infix scanners) into the one verify gate.

- Point 2 (infix) and Point 3 (reduplication): bounded candidate generators that
  strip/remove their material and RECURSE the residual through the FST proposer
  (so inflected reduplicants / infixed forms are covered), feeding the verify gate.
- Point 1 (all phonology): affix surface-precompile + C-boundary neighbor context,
  extending the shipped bare-root C-internal tier.
- Point 4 (C-exact composition): design recorded + deferred with rationale — it is
  a spine redesign (token side-table -> transducer outputs) whose only marginal
  gain over C-boundary is rare cross-boundary opacity that already falls back to
  the engine correctly. C-boundary subsumes its practical value.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

HC FST Point 3: full reduplication via a composable candidate generator

Full reduplication (copy the whole base, surface = base·base) over an unbounded base
is the one provably non-regular construct: a 1-way FST cannot represent the copy
language {ww}. Handle it BESIDE the FST: a bounded candidate generator feeds the same
propose-and-verify gate, so it is sound without being regular.

- CompositeProposer: unions several proposers (FST + generators) into one
  IMorphologicalAnalyzer, deduping candidates by order-sensitive morpheme-identity
  signature before the verify gate. Aggregates coverage at the MorphOp level
  (CoversAllConstructs = FST's uncovered ops minus what generators cover) so a
  grammar can certify once a sibling generator covers the FST's skipped construct.
  New IConstructProposer interface lets a generator declare its covered ops.
- ReduplicationProposer (IConstructProposer): detects an adjacent doubling X·X,
  strips one copy, RECURSES the residual through the FST proposer (so an inflected
  reduplicant is covered, not just a bare root), and appends the reduplication
  morpheme in HC application order (root·…·RED). A coincidental doubling is pruned
  by verify (HC synthesis won't reproduce it).
- FstTemplateAnalyzer: replace the _hasUnbuiltConstructs bool with an
  _uncoveredOps set (records WHICH MorphOp was skipped — slot rules, in-slot
  affixes, and standalone morphological rules); expose UncoveredOps.
  CoversAllConstructs == (UncoveredOps empty).
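
The doubling detector can be sketched as follows. This is a minimal Python illustration under stated assumptions: strings stand in for shapes, `fst_propose` is a hypothetical callable returning token tuples, and the scan tries every adjacent doubling at every position (an over-approximation that the verify gate is expected to prune).

```python
def propose_reduplication(surface, fst_propose, red_token="RED"):
    """Find adjacent doublings X·X, strip one copy, recurse on the residual."""
    candidates = []
    n = len(surface)
    for start in range(n):
        for length in range(1, (n - start) // 2 + 1):
            first = surface[start:start + length]
            second = surface[start + length:start + 2 * length]
            if first != second:
                continue
            # Drop the second copy; keep everything around it.
            residual = surface[:start + length] + surface[start + 2 * length:]
            for tokens in fst_propose(residual):
                candidate = tokens + (red_token,)  # RED appended last
                if candidate not in candidates:
                    candidates.append(candidate)
    return candidates
```

With a toy proposer that knows only the root `sag`, `"sagsag"` yields `("sag", "RED")`. So does `"saag"`, where the doubled `a` is coincidental; that spurious candidate is the kind of over-generation the verify step is there to reject.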

Test: a full-reduplication grammar; the FST alone misses "sagsag" (and reports
not-fully-covered), the composite covers it (and reports covered), verify confirms
the genuine HC analysis, and a non-word still yields nothing. Full suite green (98).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

HC FST Point 2: infixation via a composable candidate generator

Infixation (an affix inserted inside the stem, e.g. Tagalog -um-) is regular; the
FST proposer recognizes but does not build infix slots. Handle it as a sibling
generator feeding the same propose-and-verify gate.

- InfixProposer (IConstructProposer): for each infix and each interior position
  where the infix's surface segments occur, remove them and RECURSE the residual
  through the FST proposer (so an infixed form of an inflected stem is covered),
  then append the infix morpheme in HC application order (root·…·INF).
  Over-approximation — every interior occurrence is tried; verify prunes the wrong
  splits. O(surface-length × infixes) candidates, bounded.
- First cut: the infix must be a single contiguous run of inserted segments,
  matched against its underlying representation. Templatic multi-slot infixes and
  phonologically-altered infix surfaces are left to the engine (parity gate keeps
  results correct).
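
A first-cut infix scanner along these lines might look like the sketch below (Python, hypothetical names; `infixes` maps a morpheme name to its contiguous underlying segments, and `fst_propose` is an assumed callable returning token tuples):

```python
def propose_infix(surface, infixes, fst_propose):
    """Try removing each infix at every interior position it occurs."""
    candidates = []
    for name, segments in infixes.items():
        k = len(segments)
        # Interior only: at least one segment before and one after the infix.
        for i in range(1, len(surface) - k):
            if surface[i:i + k] != segments:
                continue
            residual = surface[:i] + surface[i + k:]
            for tokens in fst_propose(residual):
                candidate = tokens + (name,)  # infix appended last
                if candidate not in candidates:
                    candidates.append(candidate)
    return candidates
```

For `"saag"` with an `a` infix, both interior `a` positions produce the same residual `"sag"`, so the duplicate candidate is collapsed. The candidate count stays bounded by surface length times the number of infixes.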

Test: an "a"-infix grammar ("sag" -> "saag"); the FST alone misses "saag" (and
reports not-fully-covered), the composite covers it (and reports covered), verify
confirms the genuine HC analysis, and a non-word still yields nothing. Full suite
green (99).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

HC FST Point 1a: affix surface-precompile (C-internal phonology)

Extend the surface-allomorph precompile from bare roots to AFFIXES: build each
affix's segment arcs from its underlying form AND each phonologically-altered
surface realization, so an affix whose surface differs from its underlying
segments (e.g. a suffix devoiced/changed by a rule) is matched by the proposer.

- SurfacePhonology: a forward-phonology helper that compiles each stratum's
  synthesis phonological rules (reusing HC's CompileSynthesisRule, exactly what
  SynthesisStratumRule runs) and applies them to a segment string in isolation,
  returning the distinct surface variants (C-internal tier: catches edge- and
  morpheme-internal alternations; cross-boundary ones ride the engine).
- FstTemplateAnalyzer.BuildAffixArcs: shared by both affix-arc sites (derivational
  layers + template slots) — builds the underlying path plus a path per altered
  surface variant. Default ctor passes an identity variant function, so the
  0-phonology path is byte-identical; the morpher ctor wires SurfacePhonology.

Tests: Proposer_CoversPhonologicallyAlteredAffix (a "t" suffix that surfaces only
as "d" via t->d: the underlying-only proposer misses "sagd", the surface-precompile
proposer covers it, verify stays sound) and SurfacePhonology_AppliesRulesForward.
Full suite green (101). FST_FULL_PLAN.md updated with the shipped/deferred matrix.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

HC FST: CSharpier formatting + refresh FstTemplateAnalyzer class doc

Apply CSharpier to FstTemplateAnalyzer.cs (a reflowed method signature) so
Check-formatting passes, and update the now-stale class summary: the proposer
precompiles bounded phonology into its arcs and degrades gracefully on constructs
it cannot model (recording the MorphOp in UncoveredOps for the composite), rather
than throwing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

HC FST: wire the composite proposer into the production factories

The reduplication/infix generators were only constructed in test code — both
production factories built a bare FstTemplateAnalyzer, so a reduplicating/infixing
grammar never certified and the generators never ran. Wire them in.

- CompositeProposer.ForLanguage(language, fst): the standard production proposer
  (FST + reduplication + infix generators). Inert for grammars without those
  constructs (generators hold no rules, yield nothing; CoversAllConstructs is
  vacuously true) — near-zero overhead, byte-identical behavior.
- CompleteHybridMorpher.FromLanguage and CachingMorphologicalAnalyzer.FromLanguage
  now build the composite and certify on its CoversAllConstructs.

Integration test CompleteHybrid_WiresGenerators_...: a reduplicating grammar
certifies through the production factory and the fast path matches the engine on
bare/reduplicated/homograph/non-word — the test whose absence let the feature be
inert. Docs note the wiring + the extended empirical-certification caveat (a
certified grammar skips the engine, so the certification corpus must exercise the
reduplication/infix patterns). Full suite green (102).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

HC FST Points 1b + 4: C-boundary precompile and full phonology composition

Complete the phonology story — all four enhancement points are now implemented
and wired into production.

Point 4 (C-exact, the complete path): ComposedPhonologyProposer composes HC's
phonology INVERSE with the morphotactic FST. It un-applies the grammar's
phonological rules to the surface (reusing each stratum's CompileAnalysisRule —
exactly what AnalysisStratumRule runs, strata surface->inner, rules reversed) to
recover the underlying form, then walks the underlying-arc FST on it
(FstTemplateAnalyzer.AnalyzeShape, newly exposed). Because the inverse is applied
to the ASSEMBLED surface, this covers all bounded phonology including the
cross-boundary, stem-conditioned alternations the per-morpheme precompile cannot
see. Under-specified analysis nodes match via unification; verify prunes spurious
candidates. Chosen over literal Fst.Compose because the proposer accumulates tokens
in a side-table, not transducer outputs — composing HC's existing inverse reaches
the same coverage while reusing the engine's real phonology.

Point 1b (C-boundary, the cheap fast-path): SurfacePhonology now also probes each
surface-alphabet segment as a left/right neighbor and, when the rule is
length-preserving, reads back the morpheme's own surface portion — catching an
affix whose surface is conditioned by a neighbor across the seam. Bounded by
alphabet size; length-changing contexts are skipped (sound superset).
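
The boundary probe in Point 1b can be sketched like this. It is a Python illustration with assumed names: `apply_rules` stands in for forward phonology on a plain string, and the `t -> d / g_` rule is the one from the test described in this commit.

```python
import re


def boundary_variants(morpheme, alphabet, apply_rules):
    """Surface variants of a morpheme alone and next to each neighbor segment."""
    n = len(morpheme)
    variants = {apply_rules(morpheme)}  # isolation tier
    for seg in alphabet:
        left = apply_rules(seg + morpheme)   # seg as left neighbor
        if len(left) == n + 1:               # length-preserving only
            variants.add(left[1:])           # read back the morpheme's portion
        right = apply_rules(morpheme + seg)  # seg as right neighbor
        if len(right) == n + 1:
            variants.add(right[:n])
    return variants


def voice_t_after_g(s):
    """Toy rule t -> d / g_ (t becomes d immediately after g)."""
    return re.sub(r"(?<=g)t", "d", s)
```

In isolation the suffix `t` stays `t`; probing with `g` as a left neighbor recovers the `d` variant. The number of probes is bounded by twice the alphabet size per morpheme.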

Both wired into CompositeProposer.ForLanguage (inert when the grammar lacks
phonology — short-circuits). Tests: ComposedPhonology_CoversCrossBoundaryAlternation
(g->k / _t across the boundary: precompile misses "sakt", composition recovers it)
and SurfacePhonology_BoundaryTier (t->d / g_: isolation keeps "t", boundary recovers
"d"). Full suite green (104); full solution builds; CSharpier clean. Plan updated:
all four points shipped + wired.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

HC FST: make the composed-phonology proposer thread-safe + prove it

ComposedPhonologyProposer runs HC's analysis phonology at analyze time on the
concurrent path (both factories advertise parallel parsing). Harden + verify:

- Compile the inverse cascade against a PRIVATE Morpher with its own TraceManager
  (not the factory's shared one), mirroring how MorpherPool gives each rented
  morpher its own — the analysis rules read _morpher.TraceManager/selectors, so the
  proposer must not share them. Each AnalyzeWord applies the cascade to a fresh
  local Word (no per-call mutation of shared state). ForLanguage no longer threads
  a morpher through.
- Add Composite_WithPhonologyAndReduplication_ParallelMatchesSequential: drives the
  production CompleteHybridMorpher (phonology inverse + reduplication generator both
  live) over a corpus in parallel and asserts parallel == sequential, no exceptions.

Full suite green (105).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

HC FST: forward-synthesis precompile for boundary-conditioned morphophonemics

Real-grammar validation (Indonesian meN- nasal substitution) exposed that the
phonology INVERSE cannot be cleanly composed when rules are conditioned on the
morpheme boundary: ComposedPhonologyProposer un-applies on the boundary-less
surface, so meN- rules fire everywhere and over-generate (menulis -> ⁿmeⁿnⁿpuⁿlis)
— the mess HC only prunes via interleaved morphology + re-synthesis (the slow
search). Inversion stays valid/sound for SEGMENT-conditioned phonology; it is just
not the tool for boundary-conditioned morphophonemics.

Forward synthesis IS boundary-correct (GenerateWords applies rules with the
boundary present). New ForwardSynthesisProposer precompiles, at build time, each
root × every ORDERED affix combo (permutations — order matters) up to maxAffixes,
synthesizes the surface, and tabulates surface->analysis; analysis is a dictionary
lookup and verify still confirms. Covers reduplication and infixation for free.

- ForwardSynthesisProposer (IConstructProposer): sound by construction (a tabulated
  entry is a real synthesized word), bounded by maxAffixes + a hard entry budget.
- Opt-in via CompositeProposer.ForLanguage(language, fst, forwardSynthesis: true):
  build cost grows with lexicon × permutations — right for bounded-affixation
  grammars / fixed corpora, not heavily-inflecting templatic systems. Default
  behavior unchanged.
- CI test ForwardSynthesis_CoversAffixedForms_AndIsSound; reusable real-grammar
  harnesses added to FstSenaBenchmark (Benchmark_ForwardSynthVsSearch, etc.).
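
The build-time tabulation can be sketched as follows. This is a toy Python illustration, not the PR's ForwardSynthesisProposer: `build_table`, `toy_synthesize`, and the simplified meN- rule (substituting for initial `t` and `p`, `meng` otherwise) are assumptions made for the example.

```python
from itertools import permutations


def toy_synthesize(root, affixes):
    """Toy forward synthesis: apply affixes in order, boundary visible."""
    stem = root
    for affix in affixes:
        if affix == "meN":
            if stem[0] == "t":
                stem = "men" + stem[1:]   # nasal substitution: N+t -> n
            elif stem[0] == "p":
                stem = "mem" + stem[1:]   # nasal substitution: N+p -> m
            else:
                stem = "meng" + stem
        elif affix == "kan":
            stem = stem + "kan"
    return stem


def build_table(roots, affixes, synthesize, max_affixes=2, budget=10_000):
    """Tabulate surface -> analyses for each root and ordered affix combo."""
    table = {}
    entries = 0
    for root in roots:
        for k in range(max_affixes + 1):
            for combo in permutations(affixes, k):  # order matters
                if entries >= budget:               # hard entry budget
                    return table
                surface = synthesize(root, combo)
                table.setdefault(surface, []).append((root,) + combo)
                entries += 1
    return table
```

Analysis then reduces to `table.get(surface, [])`, with each hit still passed to verify. Since every entry was produced by synthesis, a lookup cannot invent a form, but the table size grows with roots times affix permutations, which is why the option is opt-in.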

Indonesian result (depth 2): full coverage 42 -> 69 of 70 words, 0 unsound, build
~5s. The 1 holdout is a 3-affix realizational combo. Does not flip the grammar to
certified (holdout breaks parity; grammar not FST-closed) — the win is on the
explicit verified-FST path, correct everywhere. FST_FULL_PLAN.md updated with the
inversion-vs-synthesis finding and scope. Full suite green (106).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

HC FST: LEVER_2.md — plan for forward FST∘FST composition

Design + blocker analysis for the grammar-sized composition approach (compose
morphotactics ⊗ phonology into one surface↔analysis transducer; build scales with
the grammar, not the language). Three blockers: (1) tokens via side-table not
output tape — keep state-based through composition; (2) HC phonology is
match-then-mutate, not a transducer — build a compiler (probe-synthesis Mealy
transducer reusing HC's phonology, substitution then deletion); (3) unification-arc
composition — already solved by Fst.Compose. Spike-first incremental plan, verify
gates soundness throughout.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

HC FST Lever 2 spike: lazy composition recovers deletion, lexicon-constrained

Algorithm-level spike (symbol alphabet) proving the Lever 2 architecture before the
big build. Lazy on-the-fly composition of an inverse-phonology transducer (Pinv:
surface->underlying, with ε-input arcs that restore deleted segments) with a
morphotactic acceptor (Lex: underlying, tokens on states), walked as a product
automaton over configs (pinvState, lexState, tokens).

Targets DELETION specifically (t->∅ / _d, so sat+d = "satd" -> "sad") — the case
every prior approach died on; substitution would pass and lie. Three tests:
- recovers the deleted t: "sad" -> [sat, -d];
- restoration is LEXICON-CONSTRAINED: with a bare root "sad" too, exactly the two
  valid analyses {sat+-d, sad}, no garbage — the property the runtime inverse
  lacked (it restored everywhere -> ⁿmeⁿnⁿpuⁿlis);
- non-word yields nothing.

Resolves Blocker 1 (tokens stay state-based in the config — no output-tape hack)
and Blocker 3 (no Fst.Compose — the walk unifies Pinv output against Lex input
directly). Only Blocker 2 (building Pinv) remains. LEVER_2.md updated to the
lazy-composition design.
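
The spike's idea can be reproduced at the symbol level in a few lines. The Python sketch below is an illustration under stated assumptions, not the spike's code: Pinv is hard-coded for the single rule t->∅ / _d with three assumed states (0 = neutral, 1 = just restored a t so the next surface symbol must be d, 2 = just read a surface t so the next symbol must not be d), and the lexicon's tokens are looked up from the underlying string reached rather than carried in the config.

```python
def make_lexicon(roots, suffix="d", suffix_token="-d"):
    """Lex: underlying string -> tokens, for each bare root and root+suffix."""
    lexicon = {}
    for root in roots:
        lexicon[root] = (root,)
        lexicon[root + suffix] = (root, suffix_token)
    return lexicon


FINAL = {0, 2}  # state 1 is not final: a restored t must be followed by d


def pinv_consume(state, symbol):
    """Pinv arcs that read one surface symbol: [(underlying_output, next_state)]."""
    if state == 1:
        return [("d", 0)] if symbol == "d" else []
    if state == 2 and symbol == "d":
        return []  # surface "td" is impossible: the rule would have deleted the t
    return [(symbol, 2 if symbol == "t" else 0)]


def analyze(surface, lexicon):
    """Walk Pinv and Lex in lockstep over configs (pos, pinv_state, underlying)."""
    prefixes = {u[:i] for u in lexicon for i in range(len(u) + 1)}
    results, seen = set(), set()
    stack = [(0, 0, "")]
    while stack:
        config = stack.pop()
        if config in seen:
            continue
        seen.add(config)
        pos, state, underlying = config
        if pos == len(surface) and state in FINAL and underlying in lexicon:
            results.add(lexicon[underlying])
        # Epsilon-input arc: restore a deleted t, but only if Lex can follow.
        if state != 1 and underlying + "t" in prefixes:
            stack.append((pos, 1, underlying + "t"))
        if pos < len(surface):
            for out, nxt in pinv_consume(state, surface[pos]):
                if underlying + out in prefixes:  # lexicon-constrained step
                    stack.append((pos + 1, nxt, underlying + out))
    return results
```

Restoration is attempted at every position, but a restored `t` survives only where the lexicon has a matching prefix, which is the lexicon-constrained property the commit describes. With roots `sat` and `sad` both present, `"sad"` yields exactly the two analyses listed above.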

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

HC FST Lever 2: lazy-composition walk with real types recovers boundary deletion

Build the consuming engine for forward FST∘FST composition (LEVER_2.md), proven
end-to-end with real HC types.

- InversePhonology: a surface→underlying transducer (states + arcs carrying a
  surface-input FS, null = ε-input restoration, and an underlying-output FS).
- FstTemplateAnalyzer.AnalyzeComposed: lazy product walk of Pinv ⊗ the underlying
  morphotactic acceptor over configs (pinvState, lexState, tokens). Pinv consumes
  surface and emits underlying, which must unify a lexicon arc (advancing it and
  accruing its token); the closure handles both lexicon ε-arcs and Pinv ε-input
  restorations. Tokens stay state-based (Blocker 1 dissolved); no Fst.Compose
  needed (Blocker 3 moot) — the walk unifies Pinv output against Lex input directly.

Test LeverTwo_LazyComposition_RecoversBoundaryDeletion_RealTypes: a kd-suffix whose
k deletes before d surfaces as "d"; "sagd" recovers [sag, KD] by restoring the
deleted k — constrained by the lexicon (the over-restoration that broke the runtime
inverse is pruned in lockstep) — sound (⊆ engine), non-word yields nothing. This is
the deletion case (not substitution, which would pass and lie).

All three blockers worked through: 1 & 3 resolved; Blocker 2's consuming engine
built and proven incl. deletion. The remaining frontier is the general Pinv
COMPILER (auto-build InversePhonology from grammar rules + cascades) — the spikes
use a hand-built Pinv. LEVER_2.md records proven-vs-frontier honestly. Suite 110
green; CSharpier clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

HC FST Lever 2: prove lazy composition recovers an OPAQUE two-rule cascade

The single-rule deletion spike would pass and lie about cascades — the real meN-
case is assimilation + deletion interacting (what produced ⁿmeⁿnⁿpuⁿlis). So
LazyComposition_RecoversOpaqueTwoRuleCascade hand-builds a Pinv for a feeding/
opacity cascade: N→n/_t then t→∅/n_, underlying aN+t = "aNt" -> "ant" -> "an" (the
t that triggered the assimilation then deletes; counterbleeding opacity).

Result: it works. A bounded-context Pinv that COUPLES un-assimilation (n→N) with
deletion-restoration (ε→t) through a state recovers the opaque "aNt" from "an" ->
[aN, -t], lexicon-constrained. A bounded transducer CAN represent the inverse of an
opaque cascade — the case that defeated every prior approach — so the Lever 2
architecture is real for cascades.

Corollary recorded in LEVER_2.md: the Pinv COMPILER must be B-direct (compile each
rule to a transducer, compose the cascade, invert), NOT naive context-probing —
because the t-deletion is conditioned on the surface n that assimilation fed from N,
which an underlying-context probe would misread. Honest headline: architecture
proven incl. cascades with HAND-BUILT inverses; the phonology→transducer compiler is
UNSTARTED, so Lever 2 does not yet accelerate a real grammar — Lever 1 (42→69 on
Indonesian) remains the only real-grammar accelerator. Suite 111 green.
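
The opacity argument can be reproduced with two regex rewrites. This is only a forward-synthesis sketch with hypothetical function names; the inverse here is built by tabulating synthesized forms, which is a deliberately simpler stand-in for the bounded Pinv transducer the commit describes.

```python
import re

def assimilate(form):
    """N -> n / _t"""
    return re.sub(r"N(?=t)", "n", form)

def delete_t(form):
    """t -> 0 / n_  (conditioned on a SURFACE n)"""
    return re.sub(r"(?<=n)t", "", form)

def cascade(form):
    """Apply the rules in order: assimilation feeds deletion."""
    return delete_t(assimilate(form))

def invert_by_tabulation(candidates):
    """Map each surface form back to the lexicon candidates that synthesize to it."""
    table = {}
    for underlying in candidates:
        table.setdefault(cascade(underlying), set()).add(underlying)
    return table

# Probing the deletion rule against the UNDERLYING context misses it entirely:
# there is no n next to the t until assimilation has run.
print(delete_t("aNt"))   # unchanged
print(cascade("aNt"))    # the full cascade deletes the t
```

The first print is the misreading an underlying-context probe would make, which is why the corollary asks for per-rule compilation followed by composition.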

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

HC FST: bounded-reduplication closure makes Indonesian certify

All of Indonesian's closure escapes are reduplication (the meN- nasal substitution
is regular; "Nasalization in reduplication" is a phonological rule, not a closure
escape). Reduplication over a fixed lexicon with bounded copy is finite-hence-
regular (compile-replace), so:

- GrammarFstClosure.Analyze(language, boundedReduplication: true): opt-in flag that
  treats reduplication/infix as FST-able feeders (not escapes) under the fixed-
  lexicon/bounded-copy assertion. A grammar whose only escapes are reduplication/
  infix then becomes FstClosed.
- CachingMorphologicalAnalyzer.FromLanguage gains forwardSynthesis + boundedReduplication
  params, threading the flag into the closure check and wiring the forward-synth
  precompile into the composite.
- ForwardSynthesisProposer.CoveredOps broadened to claim circumfix (CircumfixPrefix/
  Suffix) and process — synthesis already produces them, so a tabulated entry is
  genuine coverage; this was the missing piece for CoversAllConstructs.

Measured (Indonesian, forwardSynthesis+boundedReduplication): closed False→True,
CoversAllConstructs True, parity 69/70 → CERTIFIES on the covered corpus → default
path is FST-only (engine skipped). The 1 holdout (mengamat-amati, a 3-affix
realizational combo) is a coverage-depth gap, not closure. Soundness unaffected
(verify + parity gate; flags are explicit opt-in assertions). CI test
GrammarFstClosure_BoundedReduplication_TreatsReduplicationAsRegular; suite 112 green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Port fst-advisor to hc-rustify's Pattern<Word, int> API

hc-rustify's int-offset FST work (e910a558) changed Word from
IAnnotatedData<ShapeNode> to IAnnotatedData<int>, so morphological-rule
pattern types are now Pattern<Word, int>/LinearRuleCascade<Word, int>
instead of the ShapeNode-typed versions. Update the three fst-advisor
files built against the old API (ComposedPhonologyProposer,
GrammarFstAdvisor, SurfacePhonology) plus six test fixtures that
constructed Pattern<Word, ShapeNode> directly.

FST probe: bounded opt-in coverage probe + SurfacePhonology memoization + parallel benchmark

FstCoverageProbe wraps VerifiedFstAnalyzer to answer "did this grammar edit help or
hurt" over a wordlist in milliseconds, without running the engine or any
certification machinery. SurfacePhonology.Variants is memoized per underlying
string instead of recomputed per build site (build-time fix for phonology-bearing
grammars). FstSenaBenchmark gains a pooled, thread-count/Server-GC-aware parallel
throughput benchmark. Phase 0 of docs/FST_FAST_PATH_PLAN.md.

FST: remove certification concept entirely (Phase 1)

Deletes the entire "certification" ambition: CompleteHybridMorpher,
CachingMorphologicalAnalyzer (+ AnalysisCache/AnalysisCacheSerializer/
MorphemeRegistry), and GrammarFstClosure, plus their tests. Certification was
an empirical corpus-parity gate meant to let the FST replace the search engine
once "proven" complete on a corpus - fragile in practice (a grammar could
certify on 30 Sena words and decertify on 60) and not the product going
forward.

FstVerification.Compare survives as a manual gap-inspection diagnostic for
[Explicit] benchmarks (renamed AnalysisComparison.IsComplete ->
MatchesReferenceExactly to stop implying a proof). CompositeProposer /
FstTemplateAnalyzer's CoversAllConstructs/UncoveredOps survive as coverage
diagnostics, decoupled from any certification consumer. Doc comments across
ComposedPhonologyProposer, ForwardSynthesisProposer, FstTemplateAnalyzer, and
FstCoverageProbe scrubbed of certification language.

FST_FULL_COVERAGE_PLAN.md, FST_FULL_PLAN.md, and HERMITCRAB_FST_PLAN.md move
to docs/archive/ with superseded-by headers pointing at
FST_FAST_PATH_PLAN.md; LEVER_2.md stays in place (Phase 3 builds on it
directly) with a scope-pointer header.

Verified against real Sena data post-purge: Benchmark_CompositeVsSearch on 60
words still shows 58/60 fully covered, 0 unsound - identical to pre-purge
behavior, confirming certification's removal didn't change what the fast path
actually does.

Phase 1 of docs/FST_FAST_PATH_PLAN.md.

FST: shared root chains + struct-keyed walk dedup (Phase 2)

Two independent fixes to the FstTemplateAnalyzer hot path identified by the
earlier perf audit:

1. Root chains are now built ONCE per RootAllomorph and shared across every
   attachment site (bare-root, template-less, and each qualifying template)
   via epsilon fan-in/fan-out, instead of being rebuilt from scratch at each
   site. This is the same shared-substructure-plus-epsilon technique already
   used for the derivation layers in this file - tokens accumulate per walk
   path, not per state, so sharing a state across many incoming paths never
   conflates token histories. Measured on real Sena: FST states 50,673 ->
   20,737 (~59% reduction), well under the plan's <<50k target. Coverage and
   soundness identical before/after (58/60 bare + composite, 0 unsound).

2. The walk's dedup keys (Key/PKey) and the emitted-signature set switched
   from string.Join-built strings to struct keys (ConfigKey/PConfigKey/
   TokenArrayKey with hand-rolled hashing - System.HashCode isn't available
   on this library's netstandard2.0 target) - removes the per-config
   per-segment string allocation the original hot-path audit flagged as the
   likely dominant per-word allocator.

Full toy-grammar suite (108 tests) and real-Sena benchmarks verified
identical before/after; parallel throughput unregressed (~15-20 ms/word
verified, consistent with pre-refactor numbers).

Deferred to a documented follow-up (KNOWN_GAPS): true cross-root character
trie merging (sharing prefixes between DIFFERENT roots, not just across
attachment sites for the SAME root) and EpsilonClosure's internal buffer
pooling. Both are real further wins but carry more correctness risk on this
hot path than the time available for this pass warranted.

Phase 2 of docs/FST_FAST_PATH_PLAN.md.

FST: auto-compiled lockstep phonology, v1 scope (Phase 3, partial)

Implements the "Pinv compiler" LEVER_2.md left as the frontier: PhonologyRuleCompiler
auto-builds an InversePhonology from a grammar's RewriteRules by probing each rule's
own synthesis behavior in isolation (B-direct, per LEVER_2.md's finding that probing
the combined multi-rule effect misreads feeding/bleeding), then LockstepPhonologyProposer
wires it into FstTemplateAnalyzer.AnalyzeComposed - the lexicon-constrained lockstep
walk that was already proven sound on hand-built transducers.

Scope (deliberately v1, documented in FST_FAST_PATH_PLAN.md's Phase 3 STATUS block):
single-segment Lhs, right-context-only, non-interacting rules (feature-change
substitution + deletion, matching 3a.1/3a.2). No left-environment support, no
multi-segment Lhs, no epenthesis/metathesis/alpha-variables, and critically no true
multi-rule cascade composition - each supported rule contributes an independent
branch from the shared "outside any rule" state, so genuinely interacting rules
(Indonesian's meN- assimilation+deletion) are not covered.

Wired additively into CompositeProposer alongside the existing ComposedPhonologyProposer
(not a replacement - the v1 scope doesn't yet supersede it). Verified on real data:
Sena (0 phonological rules) unaffected, 58/60 0 unsound, exact match to baseline.
Indonesian: 0 unsound (soundness holds) but 54/70 coverage identical with or without
the new proposer - it doesn't yet reach meN- as expected given the scope limits.

Found and fixed a real bug along the way: HC marks a deleted ShapeNode via IsDeleted()
rather than removing it from the Shape, so code that reads segment counts after applying
a rule must filter deleted nodes or it undercounts changes (this masked deletion detection
entirely until found). SurfacePhonology.cs has the same latent gap, not fixed here (out of
scope) - noted in KNOWN_GAPS.

Phase 3 (partial) of docs/FST_FAST_PATH_PLAN.md - left-environment support and true
cascade composition remain the frontier; ComposedPhonologyProposer/ForwardSynthesisProposer
stay in place since the retirement gate (Indonesian >= 69/70 without them) is not met.

FST: partial reduplication + infix surface variants; construct sweep (Phase 4)

ReduplicationProposer generalized from exact half-word full-copy detection to a
scan over every copy length (1..word.Length/2), both prefix-copy and suffix-copy -
full reduplication is now just the len=word.Length/2 case of the same algorithm,
so it subsumes rather than duplicates the old logic. Still O(word length^2),
still verify-gated (a coincidental short repeat is proposed but rejected -
new test assertion confirms this).
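
A minimal sketch of the generalized scan, assuming plain strings and a hypothetical `copy_candidates` helper (the real proposer works on HC shapes and hands every candidate to verify):

```python
def copy_candidates(word):
    """Propose (kind, copy, residual) for every copy length 1..len(word)//2."""
    candidates = []
    for n in range(1, len(word) // 2 + 1):
        # prefix-copy: the first n characters repeat immediately after themselves
        if word[:n] == word[n:2 * n]:
            candidates.append(("prefix-copy", word[:n], word[n:]))
        # suffix-copy: the last n characters repeat immediately before themselves
        if word[-n:] == word[-2 * n:-n]:
            candidates.append(("suffix-copy", word[-n:], word[:-n]))
    return candidates
```

Full reduplication falls out as the `n == len(word) // 2` iteration. A coincidental repeat such as the doubled "a" in "aab" is still proposed; rejecting it is the verifier's job, not the scan's.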

InfixProposer now searches for an infix's SurfacePhonology variants, not just its
literal underlying string - a phonologically-altered infix is no longer invisible
to the substring search. Reuses the same precompile machinery already extensively
tested for regular affix arcs.

Both changes verified against real Sena (58/60, 0 unsound) and Indonesian (54/70,
0 unsound) data - no regression, identical numbers (neither corpus happens to
exercise these specific constructs, but the mechanisms are real and sound for
grammars that do).

Investigated compounding (the third planned Phase 4 item) and found the fix is
bigger than scoped: WordAnalysis.RootMorphemeIndex is a single int with no way to
represent a second root, so this needs a cross-cutting data-model change before
FstReplay can be extended - deferred, documented in KNOWN_GAPS rather than
attempted under time pressure on a shared type.

Construct sweep (audit, not new tests) found two previously-undocumented gaps:
no generator exists for clitics (MorphOp.Clitic) or process/simulfix
(MorphOp.Process) the way one exists for infix/reduplication, and MPR
features/allomorph environments/stem names aren't build-time-gated in
FstTemplateAnalyzer (sound via verify regardless, but a precision gap). All
documented in FST_FAST_PATH_PLAN.md's KNOWN_GAPS.

Phase 4 (partial - reduplication and infix items done, compounding deferred) of
docs/FST_FAST_PATH_PLAN.md.

FST: probe becomes the full composite; edit-loop tests; full-corpus benchmark (Phase 5)

FstCoverageProbe.ForLanguage now builds the FULL composite (FstTemplateAnalyzer +
ReduplicationProposer + InfixProposer + ComposedPhonologyProposer +
LockstepPhonologyProposer, with forwardSynthesis as an opt-in parameter) instead of
the bare FstTemplateAnalyzer - this is the "all-in fast path" the plan calls for.
ProbeReport gains diagnostics: CoversAllConstructs/UncoveredConstructs (from the
composite's build-time coverage signal), UnsupportedPhonologyRuleCount (from
PhonologyRuleCompiler), and per-call Elapsed wall-time.

Added three edit-loop CI tests proving the product promise - a grammar edit in each
implemented mechanism class (affix rule, phonological rule, reduplication rule)
visibly moves probe output. One (phonological) needed a fix after a genuinely
informative failure: an unconditional t->d rule doesn't just gain "dad", it also
loses "dat" (which no longer has any valid surface once every t surfaces as d) -
a real "gained X, lost Y" case, not a simple net gain, which is exactly what
CoverageDiff is for.
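
The "gained X, lost Y" effect is easy to reproduce on a one-root toy. The names below (`surface`, `parsed`, `coverage_diff`) are hypothetical and do not reflect CoverageDiff's real shape; rules are modelled as unconditional string replacements.

```python
def surface(underlying, rules):
    """Apply each (old, new) rewrite unconditionally, in order."""
    for old, new in rules:
        underlying = underlying.replace(old, new)
    return underlying

def parsed(wordlist, lexicon, rules):
    """Words from the wordlist that some lexicon entry surfaces as."""
    surfaces = {surface(entry, rules) for entry in lexicon}
    return {word for word in wordlist if word in surfaces}

def coverage_diff(before, after):
    return {"gained": after - before, "lost": before - after}

lexicon = ["dat"]
wordlist = ["dat", "dad"]
before = parsed(wordlist, lexicon, rules=[])
after = parsed(wordlist, lexicon, rules=[("t", "d")])   # the grammar edit
print(coverage_diff(before, after))
```

A net-count comparison would report "1 parsed before, 1 parsed after" and hide the change; the set difference is what makes the edit visible.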

Added Benchmark_FullCorpusProbe: an [Explicit] end-to-end run over the WHOLE
wordlist (not the small capped slices used elsewhere), reporting coverage and
p50/p95/p99 per-word latency. Measured for real:
  Sena (7,121 words):      58.1% parsed, p50=31ms p95=173ms
  Indonesian (121 words):  62.0% parsed, p50=1.4ms p95=6.0ms
These are materially lower coverage than the small-slice numbers quoted earlier in
FST_FAST_PATH_PLAN.md (expected - rare/complex forms concentrate in a full
wordlist's tail) and are recorded as the honest baseline going forward, replacing
the plan's original small-sample-derived targets in the Global Success Criteria
section, which is now written against measured full-corpus reality rather than
aspiration.

Also flagged a real, previously-undocumented gap while closing this out: no
frontier-beam cap was ever built into the NFA walk (AnalyzeShape/AnalyzeComposed/
EpsilonClosure/ComposedClosure) - Phase 3c specified one, but it was never implemented.
It has not been observed to matter in practice this session, but it is a real unguarded
risk for any grammar the walk has not yet been exercised against. Added to KNOWN_GAPS.

Phase 5 of docs/FST_FAST_PATH_PLAN.md - the last phase in the original plan.

FST: add top-of-plan status summary and per-phase status markers

Docs-only. Adds a status block at the top of FST_FAST_PATH_PLAN.md summarizing
where all 5 phases actually landed, and marks each phase heading DONE/PARTIAL so
a future reader (or executor) doesn't have to read every section to find out the
plan has already been executed once, with Phase 3's phonology cascade and Phase
4's compounding as the clearly-flagged highest-value remaining work.

FST: add genuinely parallel full-corpus throughput benchmark

Benchmark_FullCorpusProbe (Phase 5) is deliberately single-threaded for clean
per-word latency percentiles, which meant there was no way to see full-corpus
throughput under real parallelism without also pulling in the oracle (via
Benchmark_ParallelThroughput, which defaults to a 60-word cap and pairs against
search - open-ended runtime risk at full Sena scale).

Benchmark_FullCorpusParallelThroughput fills that gap: FST-only (no oracle), full
wordlist, real Parallel.ForEach across HC_THREADS (default 16). Measured on the
freshly-rebased hc-rustify:
  Sena (7,121 words):  56.5s wall / 126 words/sec (vs 430s sequential, ~7.6x from
                        16-way threading), 58.1% coverage - identical to sequential
  Indonesian (121 words): 115ms wall / 1044 words/sec, 62.0% coverage

FST: correct the coverage story - 99.2% of engine-parseable on Sena (measured)

The full-corpus "58.1% coverage" number was denominated by the raw wordlist, which
turns out to be mostly words the search engine itself cannot parse. Measured
properly on a seeded 200-word random sample, with each FST-unparsed word checked
against the UNBOUNDED oracle in an isolated child process (necessary: an
in-process run crashed the test host outright on a pathological word):

  - FST parsed 120/200
  - 79 of the 80 unparsed words do not parse in the engine either (73 fast
    no-parses; 6 needed 12-90+s of unbounded search just to prove no parse)
  - exactly 1 genuine gap: ndikhali (copula construction, ser+NZR+class prefix -
    the already-documented copula/TAM gap)
  - FST / engine-parseable = 120/121 = 99.2%
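
The change of denominator can be checked directly from the counts above. The only assumption is the one the soundness results already support: every FST-parsed word is also engine-parseable.

```python
sample = 200
fst_parsed = 120
engine_no_parse = 73 + 6                      # fast no-parses + slow unbounded no-parses

fst_unparsed = sample - fst_parsed            # 80
genuine_gaps = fst_unparsed - engine_no_parse # words the engine parses but the FST misses
engine_parseable = fst_parsed + genuine_gaps  # the corrected denominator

raw_coverage = fst_parsed / sample
true_coverage = fst_parsed / engine_parseable
print(f"raw {raw_coverage:.1%}, engine-parseable {true_coverage:.1%}")
```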

Section 10.3 flipped from NOT-MET to MET-on-Sena with the full methodology;
KNOWN_GAPS gains an explicit copula/TAM entry with the ndikhali witness (closing
that one construct class would have made the sample 100%). Also records the
pathology finding: the engine's worst case is proving a non-word unparseable
(12-90+s, one OOM crash), which the FST probe answers in milliseconds - the exact
failure mode a grammar-tuning probe must not inherit.

Indonesian's engine-parseable denominator was not re-measured; its known meN-
cascade gap (Phase 3 frontier) means its ratio is genuinely lower than Sena's.

FST: add left-environment support to PhonologyRuleCompiler (Phase 3)

Extends the v1 lockstep-phonology compiler to handle a non-empty LeftEnvironment,
symmetric to the existing right-environment chain: a new ChainLeftEnvironment builds
identity arcs forward from state 0, and AddRestorationBranch/AddSubstitutionBranch now
start from that state instead of always from 0. Also fixes a latent bug found along the
way: AddRestorationBranch always routed through ChainRightEnvironment, a no-op on an
empty list, so a left-only-conditioned deletion would have added a dangling arc with no
path back to state 0 - both branch builders now special-case an empty right environment
directly, matching the pattern AddSubstitutionBranch already used.

Covered by two new end-to-end tests (left-context deletion and substitution, mirroring
the existing right-context ones) plus the retargeted former
Compile_SkipsLeftEnvironmentRule_AsUnsupported, now Compile_AutoRecoversLeftContextDeletion.
8/8 PhonologyRuleCompilerTests pass, 116/116 non-explicit HermitCrab tests pass.

Measuring against the real Indonesian grammar (not just toy tests) surfaced why this
didn't move Benchmark_CompositeVsSearch coverage (93/121 unchanged before/after): 0 of
Indonesian's 5 real phonological rules compile at all, because PhonologyRuleCompiler's
_alphabet excludes boundary-type characters entirely, so any rule with a BoundaryMarker
in its environment (3 of the 5, including two that would otherwise be simple enough for
v1) is rejected before its shape is even checked. This is a previously undocumented,
higher-priority prerequisite to the already-known alpha-variable and cascade-composition
gaps - the Phase-3 mechanism has never actually fired on real Indonesian data. Recorded
in FST_FAST_PATH_PLAN.md's Phase 3 STATUS block and KNOWN_GAPS with the fix direction.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>

FST: add full-grammar coverage plan (junction probing, not generic cascade composition)

Companion to FST_FAST_PATH_PLAN.md - execution plan for closing the remaining
Sena/Indonesian gaps via bounded build-time junction probing through the real
synthesis cascade, instead of a generic multi-rule composer.

FST plan: Phase A measurement results; correct Sena ndikhali diagnosis

Indonesian: 28 divergent words on the full 121-word corpus, zero compounds -
21 simple meN- forms (Phase C target) + 7 REDUP-meN forms (Phase D target).
Confirms Phase E (compounding) is unnecessary for Indonesian.

Sena: ground-truthed ndikhali directly (8 engine analyses via a bounded
diagnostic) - it's a genuine two-root compound (e + ser via CompoundingRule
mrule7/mrule8), not the archived plan's guessed "prefixal derivation" gap.
Closing it needs the same RootMorphemeIndex multi-root lift Phase E scopes,
which is disproportionate to fix for one word in a 7,121-word corpus already
at 99.2% coverage. Deferred in favor of Phase C/D's larger Indonesian win.

FST: junction-deletion probing closes Indonesian meN- coverage (Phase C)

Indonesian's 5 phonological rules are all boundary-conditioned at affix
junctions, so the meN- assimilation+deletion cluster can be closed by
reusing/extending the existing per-affix SurfacePhonology precompile instead
of building the generic multi-rule Pinv composer the plan originally feared
would be needed.

Two fixes:
- SurfacePhonology's deleted-node rendering bug: HC marks a deletion via
  ShapeNode.IsDeleted() rather than removing the node, so the old rendering
  loop still printed the pre-deletion segment. A shared RenderNodes helper
  now skips IsDeleted() nodes, which alone recovers meN- + sonorant-initial
  roots (nasal deletion before sonorants) with no new mechanism.
- New SurfacePhonology.DeletionJunctions(underlying): probes each alphabet
  representative as a right neighbor (falling back to a second trailing
  neighbor when the rule's own right environment reaches beyond the deleted
  segment) and reports every case where the cascade deletes the NEIGHBOR
  itself. FstTemplateAnalyzer gained root-chain checkpoints so a
  junction-deletion outcome can skip the root's own deleted onset, gated at
  build time to roots whose leading segment actually matches (WireDeletionSkips).

Both mechanisms are bounded by |junction affixes| x alphabet (or x alphabet^2
for the two-neighbor fallback) - no roots x affixes blowup, no window-size
computation, no re-implemented cascade.
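
Both fixes can be illustrated on a toy node list. Everything here is a hypothetical stand-in: `Node` mimics a ShapeNode with a deleted flag, and `toy_cascade` hard-codes three nasal-junction outcomes instead of running HC's real synthesis cascade.

```python
from dataclasses import dataclass

@dataclass
class Node:
    text: str
    deleted: bool = False      # a deletion is a mark on the node, not a removal

def render_buggy(nodes):
    return "".join(n.text for n in nodes)                    # prints deleted segments too

def render_nodes(nodes):
    return "".join(n.text for n in nodes if not n.deleted)   # the shared-helper fix

def toy_cascade(nodes):
    """Toy junction rules: N deletes before l; N+t -> n and N+p -> m with the stop deleted."""
    for current, following in zip(nodes, nodes[1:]):
        if current.text != "N":
            continue
        if following.text == "l":
            current.deleted = True
        elif following.text == "t":
            current.text, following.deleted = "n", True
        elif following.text == "p":
            current.text, following.deleted = "m", True
    return nodes

def deletion_junctions(affix, alphabet):
    """Probe each alphabet symbol as a right neighbor; report when the NEIGHBOR is deleted."""
    found = []
    for symbol in alphabet:
        nodes = toy_cascade([Node(ch) for ch in affix] + [Node(symbol)])
        if nodes[-1].deleted:
            found.append((symbol, render_nodes(nodes)))
    return found
```

The probe loop runs once per (affix, alphabet symbol) pair, which is where the |junction affixes| x alphabet bound comes from; no root is ever enumerated.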

Measured on the full 121-word Indonesian corpus: 114/121 fully covered (up
from 93/121), 0 unsound, 0 false positives. The only remaining gaps are the
7 REDUP-meN reduplicated forms - Phase D's target. Sena unaffected (0
phonological rules).

Adds SurfacePhonologyJunctionTests.cs (toy grammar exercising the two-neighbor
fallback and the onset build-time gate) and updates
docs/FST_FULL_GRAMMAR_PLAN.md with Phase A/C results.

FST: separator-tail reduplication closes 6/7 Indonesian redup words (Phase D)

Traced the real construct behind Indonesian's "-X-X" corpus words: it's
-Cont (mrule13), not REDUP-meN as the plan guessed - confirmed via a custom
ITraceManager logging every rule-unapply step. -Cont produces
[meN-word] + "-" + [nasal+stem, without the "me" prefix text], e.g.
menulis-nulis where "nulis" is exactly menulis's own trailing 5 characters -
a genuine surface TAIL copy separated by a literal character, not a copy of
the whole prefixed word.

Extended ReduplicationProposer.AnalyzeWord with a third scan: for every
position, treat that character as a literal separator and check whether
everything after it is a surface tail of everything before it. On a match,
recurse the residual through the existing FST proposer and wrap with the
redup morpheme - reusing the exact same strip+recurse+verify pattern as the
existing prefix/suffix-copy scans, separator-character-agnostic (a wrong
guess is pruned by verify like any other candidate).
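
A minimal sketch of the third scan over plain strings, with a hypothetical helper name. It only proposes; the recursion through the FST proposer and the verify step are left out.

```python
def separator_tail_candidates(word):
    """For every position, treat that character as a literal separator and keep the
    split when everything after it is a surface tail of everything before it."""
    candidates = []
    for i in range(1, len(word) - 1):
        base, separator, copy = word[:i], word[i], word[i + 1:]
        if base.endswith(copy):
            candidates.append((base, separator, copy))
    return candidates

print(separator_tail_candidates("menulis-nulis"))
print(separator_tail_candidates("banana"))   # a wrong guess that verify would prune
```

The scan never looks at what the separator character is, so "banana" yields a candidate split around "n"; that is acceptable only because every candidate is confirmed against the engine before it is emitted.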

Measured: Indonesian composite coverage 114/121 -> 120/121, still 0 unsound,
0 false positives. One word (mengamat-amati) needs a suffix stacked onto
just the copy - a materially different shape - documented as a residual in
both plan docs and KNOWN_GAPS rather than pursued further.

Adds a toy-grammar test for the separator-scan mechanism and its soundness
(VerifiedFstAnalyzerTests.cs); a toy grammar for the real partial-tail shape
would need a multi-group Lhs pattern, unvalidated territory in this repo -
same call Phase 4's CV-reduplication work already made.

FST plans: execution specs for next session (Phases G1/G2/H/I)

Measured Sena build-time attribution (2026-07-03): trie 105ms, GenerateWords
175ms, grammar load 245ms - the other ~8.5s of the 9.3s build is
SurfacePhonology.DeletionJunctions, un-memoized and called per allomorph x
26 layer builds x depth 2, with the alphabet^2 fallback running to
exhaustion on every candidate (Sena has 0 phonological rules, so nothing
ever deletes). A Phase C regression. Phase H specs three composable fixes
(memoize, capability-gate, stop double-building the FST in the composite
path) with per-step verification gates; expected ~0.4s.

Phase G2 overturns the compounding "data-model lift" premise with code
evidence: MorphOp.Compound already exists, the engine already emits two-root
WordAnalysis objects (the ndikhali diagnostic printed them), and the only
hard blocker is ~6 lines in FstReplay.Confirm. Specs the trie compound loop
(+1 state, ~2x roots epsilon-arcs) with headedness handled at token-emission
time. Phase E cancelled; KNOWN_GAPS compounding + copula/TAM entries
corrected.

Phase G1 specs the mengamat-amati fix (suffix-peel inside the separator
scan, with the boundary-stripping detail that makes "+i" match surface "i").
Phase I records the lazy per-rule-chain design (true-FST generalization)
with the theory anchor and why eager composition explodes here but lazy
cannot.

Adds the current measured baseline table (states / build / walk p50-p95)
and makes the stats battery a standing reporting requirement.

FST: fix Sena build-time regression - 9.3s -> ~1.0s (Phase H)

Phase C's DeletionJunctions probe was un-memoized (unlike Variants) and ran
its alphabet^2 two-neighbor fallback to exhaustion on every candidate for a
grammar with 0 phonological rules (Sena), since nothing can ever satisfy a
deletion probe there. Two fixes:

- Memoize DeletionJunctions the same way Variants already memoizes
  ComputeVariants (_deletionJunctionsCache).
- Capability-gate both probes on the grammar's own rule shapes, computed
  once in the constructor: _anyPhonologicalRules (Variants short-circuits to
  identity when false - exact, not an approximation) and _anyDeletionSubrule
  (DeletionJunctions returns empty immediately when false - nothing can ever
  delete a neighbor, so there is nothing to find by construction).
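
The two fixes compose as in the sketch below. The class, its fields, and the dict-shaped rules are hypothetical; only the structure (a gate computed once in the constructor, then a per-string cache) follows the description above.

```python
class JunctionProber:
    def __init__(self, rules, probe):
        self._probe = probe                      # the expensive alphabet-wide probe
        # Capability gate, computed once: can any rule delete a segment at all?
        self._any_deletion_subrule = any(rule.get("deletes") for rule in rules)
        self._cache = {}
        self.probe_calls = 0                     # instrumentation for the example only

    def deletion_junctions(self, underlying):
        if not self._any_deletion_subrule:
            return ()                            # nothing can delete, so nothing to find
        if underlying not in self._cache:
            self.probe_calls += 1
            self._cache[underlying] = tuple(self._probe(underlying))
        return self._cache[underlying]

def slow_probe(underlying):
    return [(underlying, "t")]                   # placeholder result

no_rules = JunctionProber(rules=[], probe=slow_probe)
with_deletion = JunctionProber(rules=[{"deletes": True}], probe=slow_probe)
for _ in range(3):
    no_rules.deletion_junctions("meN")
    with_deletion.deletion_junctions("meN")
print(no_rules.probe_calls, with_deletion.probe_calls)
```

The gate is exact rather than heuristic: with no deletion subrule the empty answer is the correct answer, so skipping the probe cannot change coverage.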

Measured: Sena build 9.3s -> ~1.0-1.1s, Indonesian unaffected (266ms, has
real deletion subrules so its gates stay open). Reverified coverage/soundness
unaffected: Indonesian Benchmark_CompositeVsSearch identical 120/121, 0
unsound; Sena per-word-timeout-guarded slice 55/57 covered, 0 unsound
(consistent with the known single-gap pattern). Full 119-test suite green.

Sena's StateCount dropped 20,737 -> 16,322 as a side effect - investigated
but not fully root-caused (likely a redundant identical-looking variant that
BuildAffixArcs's string-dedup didn't catch, now skipped by the gate before
computation); coverage/soundness confirmed unaffected by two independent
checks, so this is recorded as a known loose end rather than chased further.

The originally-planned H3 (share one FST build across the composite path)
turned out not to be a real bug on investigation - the evidence for it came
from a diagnostic script that itself built the FST twice, not from the
library's actual call sites, which already share correctly. No code change
made for H3; struck from the plan with the finding recorded.

FST: suffix-peel closes mengamat-amati - Indonesian now 121/121 (Phase G1)

mengamat-amati is meng+amat -> -Cont -> mengamat-amat -> -i(LOC) ->
mengamat-amati: a plain suffix rule applied AFTER reduplication, which
(since it just appends at the very end) lands on the tail of the copy. The
separator scan's plain tail match correctly failed on this shape since the
copy is TAIL+suffix, not a plain tail.

Extended ReduplicationProposer: the constructor now also collects every
grammar suffix rule's surface text (boundary-stripped - Indonesian's -i is
underlyingly "+i", and the "+" boundary character never appears on the
surface, so rendering must keep only Segment-type nodes). When the plain
tail match fails in the separator scan, try peeling each known suffix
surface off the end of the copy and re-testing the remainder as a tail;
on a match, wrap with both the redup and suffix morphemes (redup first,
matching engine order). Single suffix layer only, no recursion - the
corpus needs no more and unbounded stacking would be scan cost without
evidence.
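
The peel step can be sketched over plain strings as follows. The function names are hypothetical, and "+" is assumed to be the only boundary character that needs stripping.

```python
def surface_text(underlying_suffix):
    """Boundary-stripped rendering: '+i' surfaces as 'i'."""
    return underlying_suffix.replace("+", "")

def match_tail(base, copy, underlying_suffixes):
    """Return (tail, suffix) if the copy is a tail of the base, optionally after
    peeling exactly one known suffix off the end of the copy; otherwise None."""
    if base.endswith(copy):
        return (copy, None)                      # plain tail match
    for underlying in underlying_suffixes:
        suffix = surface_text(underlying)
        if suffix and len(copy) > len(suffix) and copy.endswith(suffix):
            remainder = copy[:-len(suffix)]
            if base.endswith(remainder):
                return (remainder, suffix)       # TAIL + one suffix layer
    return None

print(match_tail("mengamat", "amati", ["+i", "+kan"]))
print(match_tail("menulis", "nulis", ["+i", "+kan"]))
```

Only one suffix layer is tried, matching the single-layer limit stated above.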

Measured: Indonesian Benchmark_CompositeVsSearch 120/121 -> 121/121, 0
unsound, 0 false positives; Diagnose_Divergences now finds zero divergent
words. This closes every engine-parseable word in the Indonesian corpus.

New toy test (Composite_CoversSuffixStackedOutsideReduplication_
WhereSeparatorScanAloneMisses) passed on first run - no PoS-gating
adjustment needed. Full suite green (120/120, was 119).

FST: compound loop + FstReplay fix closes ndikhali - 8/8 exact parity (Phase G2)

Confirms the "data-model lift" premise for compounding was false: MorphOp.
Compound already existed, WordAnalysis already represented compounds (the
engine's own ndikhali analyses proved it), and the real blocker was
FstReplay.Confirm rejecting any candidate with a second LexEntry morpheme.

FstReplay.Confirm: non-head LexEntry morphemes go into an extraRoots set
instead of triggering an early null return; LexEntrySelector admits them
alongside the head root; RuleSelector opens CompoundingRule only when a
compound is actually present.

FstTemplateAnalyzer: new BuildCompoundLoop adds one shared "join" state per
attachment site (template-less path, each template) that every root's chain
feeds into and out of - bounded to one extra root, gated on the grammar
having any CompoundingRule. ToWordAnalysis renamed to ToWordAnalyses
(IEnumerable) and now handles 2+ Root tokens by emitting one candidate per
head choice, letting FstReplay's real CompoundingRule check confirm
whichever headedness the grammar licenses.
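
The head-choice expansion amounts to the following. This is an illustrative sketch with a hypothetical token shape of (kind, name) pairs and dict-shaped candidates, not the real WordAnalysis type.

```python
def to_word_analyses(tokens):
    """Emit one candidate per Root token, each naming a different root as the head."""
    morphemes = [name for _, name in tokens]
    root_positions = [i for i, (kind, _) in enumerate(tokens) if kind == "root"]
    for head in root_positions:
        yield {"morphemes": morphemes, "root_index": head}

compound = [("root", "e"), ("root", "ser"), ("affix", "NZR")]
for candidate in to_word_analyses(compound):
    print(candidate)
```

A single-root walk still yields exactly one candidate, so the non-compound path is unchanged; picking the licensed head is left to the verifier.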

Two things the original spec (written 2026-07-03 morning) missed, found
during implementation:
- The compound loop lived inside the template-less path's
  "_derivPrefixRules.Count > 0 || _derivSuffixRules.Count > 0" guard, so a
  grammar with compounding but no OTHER standalone derivational rule never
  built the loop at all (caught by the toy test). Fixed: guard now also
  checks hasCompoundingRules.
- Closing Sena's ndikhali specifically needed a third piece: DerivableToCategory
  had to treat compounding as a category-transition edge alongside
  derivational rules, since Sena's noun-class-agreement prefix template
  requires NZR's output category, itself only reachable via compound->NZR.
  Without this the compound loop worked (e+ser+NZR candidates appeared) but
  the class-prefix template stayed unreachable for either root - found via
  reflection-inspecting _derivPrefixRules' actual contents plus a
  rule-application trace, since static analysis alone didn't surface it.

Measured: Sena's ndikhali - 8/8 exact set parity with the engine (all four
class markers x both head orderings). Guarded 60-word Sena slice: 57/57
fully covered (up from 55/57), 0 unsound. Indonesian unchanged at 121/121,
0 unsound - its compounding rules now build the loop too, but the corpus
needs no compounds so verify prunes every proposal. Full 121-test suite
green (was 120). States: Indonesian 532->533, Sena 16,322->16,347. Build
time: Indonesian ~266ms->~433ms, Sena ~1.0-1.1s->~1.3-1.5s, both far below
the pre-Phase-H 9.3s baseline.

New toy test Fst_CoversCompound_ViaTheCompoundLoop reuses
CompoundingRuleTests.cs's existing unrestricted-rule pattern; soundness
checked via CompoundingRule's own default MaxApplicationCount=1 rejecting
a three-root chain.
…hain

Replaces Phase I's short design notes with a complete, commit-gated
execution spec (I0-I7 + optional I8, ~6-9 days) in the same format the
G1/G2/H specs used, targeting correct-by-construction coverage of ARBITRARY
regular HC grammars rather than just Sena+Indonesian:

- Governing principle: SUPERSET, NEVER SILENT SKIP - every rule compiles at
  Exact / Permissive / Identity-skip tier, verify supplies soundness, and
  ProbeReport gains a per-rule tier report replacing the bare unsupported
  count.
- I0: InversePhonology gains epsilon-output arcs (epenthesis-inverse); one
  InversePhonology per rule, chained in the engine's own unapplication
  order (read AnalysisStratumRule, don't trust the doc).
- I1: env-pattern -> NFA compiler covering the COMPLETE Matching node
  inventory (Constraint/Quantifier/Group/Alternation - quantified spans are
  what make long-distance harmony Exact-tier, not Permissive) + Exact-tier
  substitution compiler reusing v1's proven per-segment synthesis-probe
  trick over the concrete alphabet; alpha-variables by concrete enumeration.
- I2: ONE chain walker - AnalyzeComposed generalizes to a rule-state vector
  and the single-Pinv path delegates to a length-1 chain so old tests guard
  the new code. Marquee toy tests, each of which must first FAIL on today's
  composite: word-internal rule, two-rule feeding, long-distance harmony.
- I3: deletion-inverse (capped restorations) + epenthesis-inverse.
- I4: the boundary tape - trie keeps boundary arcs (bare walk treats them
  as epsilon, byte-identical gate), chain inserts boundaries top-down
  constrained by the trie; cross-check gate: Indonesian with junction
  probing DISABLED must independently cover meN- via the chain.
- I5: metathesis inverse + honest handling of iterative self-feeding rules
  and RTL direction (flag, optional doubling, never silent).
- I6: the beam cap (closes the oldest KNOWN_GAPS item) shared by all walks.
- I7: wiring as ChainPhonologyProposer + retirement strictly by
  measurement (ComposedPhonologyProposer, ForwardSynthesisProposer, v1
  compiler internals; junction probing retired only if the chain matches
  its coverage within a 1.5x p50 budget - either outcome recorded).
- I8 (optional): clitics + process/simulfix generators, the last two
  uncovered MorphOps.
- Honest boundary section: unbounded copying stays a peel (non-regular),
  compounding stays bounded, self-feeding under-coverage is flagged.
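
A toy illustration of the superset-then-verify principle applied to a deletion inverse with capped restorations (I3). This is a Python sketch under stated assumptions, not the spec's InversePhonology: the rule (delete `h` between vowels), the function names, and the `cap` parameter are hypothetical.

```python
from itertools import combinations
from typing import List

VOWELS = set("aeiou")


def apply_deletion(underlying: str) -> str:
    """Forward rule (toy): delete 'h' when it stands between two vowels."""
    out = []
    for i, ch in enumerate(underlying):
        between_vowels = (0 < i < len(underlying) - 1
                          and underlying[i - 1] in VOWELS
                          and underlying[i + 1] in VOWELS)
        if ch == "h" and between_vowels:
            continue
        out.append(ch)
    return "".join(out)


def propose_underlying(surface: str, cap: int = 1) -> List[str]:
    """Permissive inverse: restore 'h' at ANY internal position, at most `cap` times."""
    sites = range(1, len(surface))
    proposals = []
    for k in range(cap + 1):
        for chosen in combinations(sites, k):
            chars = list(surface)
            # Insert right-to-left so earlier indices stay valid.
            for pos in reversed(chosen):
                chars.insert(pos, "h")
            proposals.append("".join(chars))
    return proposals


def verified(surface: str, cap: int = 1) -> List[str]:
    """Soundness comes from verification: keep only proposals that re-derive the surface."""
    return [u for u in propose_underlying(surface, cap)
            if apply_deletion(u) == surface]


proposals = propose_underlying("mai", cap=1)
sound = verified("mai", cap=1)
```

The proposer deliberately ignores the rule's environment, so it emits `mhai` even though the forward rule would never delete that `h`; verification discards it. The cap bounds how many restorations are tried, which keeps the proposal set finite at the cost of possible under-coverage.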

3 participants