HermitCrab parse optimization: memoize analysis redundancy, fix corpus scheduling, bound memory by johnml1135 · Pull Request #451 · sillsdev/machine

johnml1135 · 2026-07-03T21:20:39Z

Stacked on #446 (hc-rustify) — targets that branch, not master, since it hasn't merged yet.

The problem

HermitCrab's analysis-cascade search for "unordered" morphological strata explores every ordering of applicable rules (a combinatorial k! walk over a 2^k subset lattice of morpheme-application states), because rule application is not commutative in general and the engine doesn't know in advance which orderings will matter. On real grammars this reaches the same state key (shape + rule multiset + syntactic feature structure — everything that determines the rest of analysis, independent of arrival order) thousands of times: dissecting Sena's worst words showed 158,227 node expansions against only ~2,546 distinct states, with a single state re-visited over 7,000 times. Because each visit re-ran a full FST pattern match and downstream template battery, ~98% of the work was provably repeated computation.
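The gap between orderings and distinct states can be illustrated with a small counting sketch. This is Python rather than the PR's C#, and it makes simplifying assumptions that are not HermitCrab's: every one of k rules is applicable at every state, each rule applies at most once, and a state is just the set of rules applied so far. The function names are hypothetical.

```python
def naive_expansions(k: int) -> int:
    """Count nodes visited by a depth-first walk over every ordering of k rules."""
    def walk(remaining: frozenset) -> int:
        count = 1  # the current node
        for rule in remaining:
            count += walk(remaining - {rule})
        return count
    return walk(frozenset(range(k)))


def distinct_states(k: int) -> int:
    """In this toy model each order-independent state is a subset of the k rules."""
    return 2 ** k


def redundant_fraction(expansions: int, states: int) -> float:
    """Share of expansions that revisit an already-seen state."""
    return 1 - states / expansions


if __name__ == "__main__":
    for k in (3, 6, 8):
        print(k, naive_expansions(k), distinct_states(k))
    # Figures reported above for Sena's worst words.
    print(round(redundant_fraction(158_227, 2_546), 3))
```

In this model, k = 6 already gives 1,957 expansions against 64 states. Plugging in the measured figures (158,227 expansions, ~2,546 distinct states) gives a redundant fraction of about 0.984, consistent with the ~98% stated above.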

What changed

1. Order-invariant memoization (AnalysisStateKey, MemoizedCombinationRuleCascade, AnalysisScope). A new state key strips arrival order from the search key while keeping everything that actually determines the analysis outcome. Two memo tables cache, per parse: subtrees that provably yield nothing (nogood case) and subtrees that yield results (a separate table for the per-stratum template battery). A second arrival at a known state replays the stored result (Word.ReplayOnto) — cloning and re-grafting only the arrival's own rule/non-head trail prefix onto the stored subtree's suffix — instead of re-searching. An in-flight re-entrancy guard falls back to a plain unmemoized expansion for the rare self-loop case. The template battery turned out to be the dominant cost (93% of wall time on the worst words) — memoizing it is most of the win.
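A minimal sketch of the idea, in Python rather than the PR's C#: the memo key drops arrival order (here, the sorted rule trail plus the shape), the memo stores results as suffixes relative to the state, and a second arrival grafts its own trail prefix onto the stored suffixes. Everything here is a hypothetical toy model (affix-stripping rules, a one-word lexicon, a single table covering both the nogood and the result case); the real key also includes the syntactic feature structure, and the in-flight re-entrancy guard is omitted because shapes in this toy only ever shrink.

```python
from typing import Callable, Dict, List, Optional, Tuple


def strip_prefix(p: str) -> Callable[[str], Optional[str]]:
    return lambda s: s[len(p):] if s.startswith(p) and len(s) > len(p) else None


def strip_suffix(x: str) -> Callable[[str], Optional[str]]:
    return lambda s: s[:-len(x)] if s.endswith(x) and len(s) > len(x) else None


Trail = Tuple[str, ...]
Result = Tuple[str, Trail]


class MemoizedAnalyzer:
    def __init__(self, rules: Dict[str, Callable[[str], Optional[str]]], lexicon: set):
        self.rules = rules
        self.lexicon = lexicon
        self.memo: Dict[tuple, List[Result]] = {}  # state key -> (root, suffix trail)
        self.expansions = 0

    def search(self, shape: str, trail: Trail = ()) -> List[Result]:
        """Return (root, full trail) for every analysis reachable from this state."""
        key = (shape, tuple(sorted(trail)))  # sorted trail acts as the rule multiset
        suffixes = self.memo.get(key)
        if suffixes is None:
            suffixes = self._expand(shape, trail)
            self.memo[key] = suffixes  # an empty list records a nogood subtree
        # Replay: graft this arrival's own trail onto the stored suffixes.
        return [(root, trail + suffix) for root, suffix in suffixes]

    def _expand(self, shape: str, trail: Trail) -> List[Result]:
        self.expansions += 1
        suffixes: List[Result] = [(shape, ())] if shape in self.lexicon else []
        for name, unapply in self.rules.items():
            if name in trail:  # toy restriction: each rule applies at most once
                continue
            stripped = unapply(shape)
            if stripped is not None:
                for root, full in self.search(stripped, trail + (name,)):
                    suffixes.append((root, full[len(trail):]))
        return suffixes


RULES = {"NEG": strip_prefix("un"), "PL": strip_suffix("s"), "AGT": strip_suffix("er")}
analyzer = MemoizedAnalyzer(RULES, {"lock"})
results = analyzer.search("unlockers")
```

For "unlockers" the three rule orderings that reach the root "lock" are all returned, but only six states are expanded; an unmemoized walk of the same toy grammar would expand nine.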

2. Supporting fixes surfaced by the memo work:

- FeatureStruct.Freeze() gets the copy-on-write hash shortcut Shape.Freeze() already had.
- Word.CloneShareFrozenShape(): clone sites that provably never mutate the cloned shape share the frozen source's Shape instance instead of deep-copying it, which cuts per-word allocation 4.5–7.6%.
- A latent bug fix: a rule reassigned SyntacticFeatureStruct to an unfrozen clone after the owning Word was already frozen; harmless until the new state key became the first code to call GetFrozenHashCode() on it. Fixed defensively (Freeze() is idempotent).
- Synthesis-side rule indexing (RootAllomorphTrie) replaces a linear scan with a lookup.
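The copy-on-write hash shortcut in the first item can be sketched as follows. This is a hypothetical Python illustration, not the FeatureStruct code: `FeatureBag` and its members are invented names, and the only behavior taken from the description is that an unmutated clone of a frozen source adopts the cached hash instead of re-walking its contents, and that freezing is idempotent.

```python
class FeatureBag:
    """Toy stand-in for a freezable structure with a cached hash."""

    walks = 0  # counts full hash walks, to make the shortcut observable

    def __init__(self, values):
        self._values = dict(values)
        self._frozen = False
        self._hash = None
        self._inherited_hash = None  # hash borrowed from a frozen source

    def clone(self):
        copy = FeatureBag(self._values)
        if self._frozen:
            # The clone is content-identical until someone mutates it.
            copy._inherited_hash = self._hash
        return copy

    def set(self, key, value):
        if self._frozen:
            raise RuntimeError("cannot mutate a frozen FeatureBag")
        self._values[key] = value
        self._inherited_hash = None  # any mutation invalidates the shortcut

    def freeze(self):
        if self._frozen:  # idempotent: a second call does nothing
            return
        if self._inherited_hash is not None:
            self._hash = self._inherited_hash
        else:
            FeatureBag.walks += 1
            self._hash = hash(frozenset(self._values.items()))
        self._frozen = True

    def frozen_hash(self):
        if not self._frozen:
            raise RuntimeError("freeze() must be called first")
        return self._hash


source = FeatureBag({"num": "pl"})
source.freeze()
untouched = source.clone()
untouched.freeze()   # adopts the cached hash, no walk
mutated = source.clone()
mutated.set("num", "sg")
mutated.freeze()     # was mutated, so it must re-walk
```

The `frozen_hash()` guard mirrors the latent bug described in the third item: asking an unfrozen object for its frozen hash is the error case, and calling `freeze()` defensively first is safe because it is idempotent.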

3. Corpus-batch scheduling (hc batch --parallel[=N]). Naive range-partitioning clusters the corpus's few catastrophically slow words together in whichever partition draws them, so one thread finishes early while another is still grinding through the tail: measured at 2.9x the theoretical packing-bound wall-clock. The batch command now load-balances by sorting words longest-surface-first and feeding them through a work-stealing partitioner, buffering each result by original index so output ordering is unaffected by completion order. This closes the slack to 1.36x the packing bound; combined with memoization, the 313-word Sena reference batch drops from 1,051s sequential to 74.4s at 16-way parallelism.
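The scheduling effect can be reproduced with a small simulation. This is a Python sketch under stated assumptions, not the BatchCommand code: per-word costs are made-up numbers, the simulation sorts by true cost where the PR sorts by surface length as a proxy, and `ThreadPoolExecutor`'s shared task queue stands in for a work-stealing partitioner. All function names are hypothetical.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def range_partition_makespan(costs, workers):
    """Contiguous equal-sized chunks, one per worker; wall-clock is the slowest chunk."""
    chunk = -(-len(costs) // workers)  # ceiling division
    return max(sum(costs[i:i + chunk]) for i in range(0, len(costs), chunk))


def dynamic_makespan(costs, workers):
    """Each worker takes the next item in the given order as soon as it is free."""
    free_at = [0] * workers
    heapq.heapify(free_at)
    for cost in costs:
        heapq.heappush(free_at, heapq.heappop(free_at) + cost)
    return max(free_at)


def packing_bound(costs, workers):
    """No schedule can beat the longest single item or a perfectly even split."""
    return max(max(costs), sum(costs) / workers)


def parse_batch(words, parse, workers):
    """Run longest-surface-first, but store each result at its original index."""
    order = sorted(range(len(words)), key=lambda i: len(words[i]), reverse=True)
    results = [None] * len(words)

    def run(i):
        results[i] = parse(words[i])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(run, order))  # drain the iterator so exceptions surface
    return results


# Two slow words that happen to land in the same contiguous partition.
costs = [1, 1, 1, 1, 1, 1, 50, 50]
naive = range_partition_makespan(costs, 2)
balanced = dynamic_makespan(sorted(costs, reverse=True), 2)
bound = packing_bound(costs, 2)
```

With these toy costs the range partition takes 102 time units while the longest-first dynamic schedule takes 53, which equals the packing bound. For the reported batch, 1,051 s / 74.4 s works out to roughly a 14.1x reduction in wall-clock.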

4. Memory bound. Server GC under sustained heavy-word concurrency defers collecting transient search garbage for throughput, which can spike host memory well past what the retained memo tables account for (the tables themselves are a few tens of MB per word, confirmed by direct measurement, so they are not the cause). Morpher's class doc now states the DOTNET_GCHeapHardLimit/GCHeapHardLimitPercent requirement for Server-GC batch hosts.
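As an illustration of what a batch host might set, the sketch below builds an environment for a child process with Server GC enabled and a heap hard limit. It assumes the documented .NET convention that GC settings passed as environment variables are hexadecimal; the helper name and the 200 MiB figure are hypothetical, and the PR does not say which limit value to use.

```python
import os


def server_gc_env(heap_limit_bytes, base_env=None):
    """Return a copy of the environment with Server GC and a heap hard limit set."""
    env = dict(os.environ if base_env is None else base_env)
    env["DOTNET_gcServer"] = "1"
    # .NET reads GC environment-variable values as hexadecimal.
    env["DOTNET_GCHeapHardLimit"] = "0x" + format(heap_limit_bytes, "X")
    return env


# Hypothetical limit of 200 MiB, chosen only for the example.
env = server_gc_env(200 * 1024 * 1024, base_env={})
```

The resulting dictionary can be passed as `env=` to `subprocess.run` when launching the batch host. `DOTNET_GCHeapHardLimitPercent` is the alternative when the limit should scale with the machine's memory.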

Verification

Every change is gated on byte-identical parse output against the full Indonesian corpus (121 words) and a fixed Sena reference set (300 corpus words + 13 measured-worst words), sequential vs. parallel, before vs. after each change. 68/68 HermitCrab unit tests and the full SIL.Machine suite (828/831, 3 pre-existing skips) pass throughout, including new tests added here for single-threaded/parallel-cascade equivalence and lexical-gating parity.
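A comparison of this kind might look like the sketch below, which follows the commit message's description of diffing on sorted per-word signatures so that ordering changes never mask a real mismatch. It is a hypothetical Python illustration: the data layout (a mapping from word to a list of analysis strings) and the function names are assumptions, not the PR's test harness.

```python
import hashlib


def word_signature(analyses):
    """Digest of one word's analyses that ignores the order they were produced in."""
    digest = hashlib.sha256()
    for analysis in sorted(analyses):
        digest.update(analysis.encode("utf-8"))
        digest.update(b"\n")  # separator so adjacent strings cannot run together
    return digest.hexdigest()


def mismatched_words(baseline, candidate):
    """Return the sorted list of words whose signatures differ between two runs."""
    def signature(run, word):
        return word_signature(run[word]) if word in run else None

    words = set(baseline) | set(candidate)
    return sorted(w for w in words if signature(baseline, w) != signature(candidate, w))


baseline = {"a": ["x", "y"], "b": ["z"]}
candidate = {"a": ["y", "x"], "b": ["z", "q"]}
diff = mismatched_words(baseline, candidate)
```

Here word "a" matches despite its analyses arriving in a different order, while "b" is reported because the candidate run produced an extra analysis.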

What was tried and deliberately not shipped

A synthesis-side length pre-filter was checked and ruled out (the rejection it targets happens too early to intercept). A lexical-gate optimization for pruning rootless branches shipped flag-gated and default-off (measured inert on both reference corpora; real grammars don't trigger it often enough yet). A leaf-rule battery prefilter and a deferred-materialization redesign for the memo's replay path were both measured below a 10% aggregate-corpus threshold and not implemented. A thread-static pooling optimization for FeatureStruct's per-comparison scratch collections was implemented, measured, and reverted: it cut allocation 15–17% but regressed wall-clock 8–15% (small short-lived collections are cheaper to Gen0-allocate than to pool). Each is a measured negative result, not an oversight.

Test plan

- dotnet build: full solution, 0 errors
- dotnet test: HermitCrab suite 68/68, SIL.Machine suite 828/831 (3 pre-existing skips)
- Byte-identity: Indonesian corpus (121/121) and Sena reference set, sequential vs. --parallel, before/after every change


…s scheduling, bound memory

Stacked on hc-rustify (#446). Cuts single-word analysis time ~5x, corpus batch wall-clock ~14x with parallelism, and closes a Server-GC memory blowup, without changing any parse result (verified byte-identical on the Indonesian and Sena reference corpora throughout).

## The problem

HermitCrab's analysis-cascade search for "unordered" morphological strata explores every ordering of applicable rules (a combinatorial k! walk over a 2^k subset lattice of morpheme-application states), because rule application is not commutative in general and the engine doesn't know in advance which orderings will matter. On real grammars this reaches the same state key (shape + rule multiset + syntactic feature structure -- everything that determines the rest of analysis, independent of arrival order) thousands of times: dissecting Sena's worst words showed 158,227 node expansions against only ~2,546 distinct states, with a single state re-visited over 7,000 times. Because each visit re-ran a full FST pattern match and downstream template battery, ~98% of the work was provably repeated computation.

## What changed

**1. Order-invariant memoization (`AnalysisStateKey`, `MemoizedCombinationRuleCascade`, `AnalysisScope`).** A new state key strips arrival order from the search key while keeping everything that actually determines the analysis outcome. Two memo tables cache, per parse: subtrees that provably yield nothing (`Memo`, nogood case) and subtrees that yield results (`TemplateMemo` for the per-stratum template battery). A second arrival at a known state replays the stored result (`Word.ReplayOnto`) -- cloning and re-grafting only the arrival's own rule/non-head trail prefix onto the stored subtree's suffix -- instead of re-searching. An in-flight re-entrancy guard (`AnalysisScope.InProgress`) falls back to a plain unmemoized expansion for the rare self-loop case rather than reading a partial entry. The template battery turned out to be the dominant cost (93% of wall time on the worst words, not the mrule cascade the initial step-count analysis pointed at) -- memoizing it is most of the win.

**2. Supporting correctness/performance fixes surfaced by the memo work:**

- `FeatureStruct.Freeze()` gets the same copy-on-write hash shortcut `Shape.Freeze()` already had: a frozen source's unmutated clone adopts the cached hash instead of re-walking.
- `Word.CloneShareFrozenShape()`: clone sites that provably never mutate the cloned shape (`ReplayOnto`, the template battery's per-candidate clone) share the frozen source's `Shape` instance instead of deep-copying it -- cuts per-word allocation 4.5-7.6%.
- A latent bug fix: `AnalysisAffixTemplateRule.Apply` reassigned `SyntacticFeatureStruct` to an unfrozen clone after the owning `Word` was already frozen; harmless until the new state key became the first code to call `GetFrozenHashCode()` on it. Fixed defensively in `AnalysisStateKey`'s constructor (`Freeze()` is idempotent).
- Synthesis-side rule indexing (`RootAllomorphTrie`) cuts synthesis rule-selection from a linear scan to a lookup.

**3. Corpus-batch scheduling (`hc batch --parallel[=N]`, `BatchCommand`).** The obvious per-word parallelism was leaving most of its throughput on the table: naive range-partitioning clusters the corpus's few catastrophically slow words together in whichever partition draws them, so one thread finishes early while another is still grinding through the tail -- measured at 2.9x the theoretical packing-bound wall-clock. Load-balances by sorting words longest-surface-first and feeding them through a work-stealing partitioner (`Partitioner.Create(..., loadBalance: true)`), buffering each result by original index so output ordering is unaffected by completion order. Closes the slack to 1.36x the packing bound; combined with the memoization work, the reference 313-word Sena batch drops from 1,051s sequential to 74.4s at 16-way parallelism.

**4. Memory bound.** Server GC under sustained heavy-word concurrency defers collecting transient search garbage for throughput, which can spike host memory well past what the retained memo tables account for (the tables themselves are a few tens of MB per word -- confirmed by direct measurement, not the cause). `Morpher`'s class doc now states the `DOTNET_GCHeapHardLimit`/`GCHeapHardLimitPercent` requirement for Server-GC batch hosts.

## Verification

Every change is gated on byte-identical parse output against the full Indonesian corpus (121 words) and a fixed Sena reference set (300 corpus words + the 13 measured-worst words), sequential vs. parallel, before vs. after each change -- diffed on sorted per-word signatures so ordering changes never mask a real mismatch. 68/68 HermitCrab unit tests and the full SIL.Machine suite (828/831, 3 pre-existing skips) pass throughout, including new tests added here: single-threaded/parallel-cascade equivalence under compounding and affix templates, and lexical-gating parity.

## What was tried and deliberately not shipped

In the interest of shipping only what measurement supports: a synthesis-side length pre-filter was checked and ruled out (the rejection it targets happens too early to intercept); a lexical-gate optimization for pruning rootless branches shipped flag-gated default-off (measured inert on both reference corpora, real grammars don't trigger it often enough yet); a leaf-rule battery prefilter and a deferred-materialization redesign for the memo's replay path were both measured below a 10% aggregate-corpus threshold and not implemented; a thread-static pooling optimization for FeatureStruct's per-comparison scratch collections was implemented, measured, and reverted -- it reduced allocation 15-17% but regressed wall-clock 8-15% (small short-lived collections are cheaper to Gen0-allocate than to pool). Each of these is a real, measured negative result, not an oversight.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>

johnml1135 wants to merge 1 commit into hc-rustify from parse-optimization

johnml1135 commented Jul 3, 2026 • edited by ddaspit
