HermitCrab parse optimization: memoize analysis redundancy, fix corpus scheduling, bound memory #451
Draft
johnml1135 wants to merge 1 commit into
…s scheduling, bound memory

Stacked on hc-rustify (#446). Cuts single-word analysis time ~5x, corpus batch wall-clock ~14x with parallelism, and closes a Server-GC memory blowup, without changing any parse result (verified byte-identical on the Indonesian and Sena reference corpora throughout).

## The problem

HermitCrab's analysis-cascade search for "unordered" morphological strata explores every ordering of applicable rules (a combinatorial k! walk over a 2^k subset lattice of morpheme-application states), because rule application is not commutative in general and the engine doesn't know in advance which orderings will matter. On real grammars this reaches the same state key (shape + rule multiset + syntactic feature structure -- everything that determines the rest of analysis, independent of arrival order) thousands of times: dissecting Sena's worst words showed 158,227 node expansions against only ~2,546 distinct states, with a single state re-visited over 7,000 times. Because each visit re-ran a full FST pattern match and downstream template battery, ~98% of the work was provably repeated computation.

## What changed

**1. Order-invariant memoization (`AnalysisStateKey`, `MemoizedCombinationRuleCascade`, `AnalysisScope`).** A new state key strips arrival order from the search key while keeping everything that actually determines the analysis outcome. Two memo tables cache, per parse: subtrees that provably yield nothing (`Memo`, nogood case) and subtrees that yield results (`TemplateMemo` for the per-stratum template battery). A second arrival at a known state replays the stored result (`Word.ReplayOnto`) -- cloning and re-grafting only the arrival's own rule/non-head trail prefix onto the stored subtree's suffix -- instead of re-searching. An in-flight re-entrancy guard (`AnalysisScope.InProgress`) falls back to a plain unmemoized expansion for the rare self-loop case rather than reading a partial entry.

The template battery turned out to be the dominant cost (93% of wall time on the worst words, not the mrule cascade the initial step-count analysis pointed at) -- memoizing it is most of the win.

**2. Supporting correctness/performance fixes surfaced by the memo work:**

- `FeatureStruct.Freeze()` gets the same copy-on-write hash shortcut `Shape.Freeze()` already had: a frozen source's unmutated clone adopts the cached hash instead of re-walking.
- `Word.CloneShareFrozenShape()`: clone sites that provably never mutate the cloned shape (`ReplayOnto`, the template battery's per-candidate clone) share the frozen source's `Shape` instance instead of deep-copying it -- cuts per-word allocation 4.5-7.6%.
- A latent bug fix: `AnalysisAffixTemplateRule.Apply` reassigned `SyntacticFeatureStruct` to an unfrozen clone after the owning `Word` was already frozen; harmless until the new state key became the first code to call `GetFrozenHashCode()` on it. Fixed defensively in `AnalysisStateKey`'s constructor (`Freeze()` is idempotent).
- Synthesis-side rule indexing (`RootAllomorphTrie`) cuts synthesis rule-selection from a linear scan to a lookup.

**3. Corpus-batch scheduling (`hc batch --parallel[=N]`, `BatchCommand`).** The obvious per-word parallelism was leaving most of its throughput on the table: naive range-partitioning clusters the corpus's few catastrophically slow words together in whichever partition draws them, so one thread finishes early while another is still grinding through the tail -- measured at 2.9x the theoretical packing-bound wall-clock. The fix load-balances by sorting words longest-surface-first and feeding them through a work-stealing partitioner (`Partitioner.Create(..., loadBalance: true)`), buffering each result by original index so output ordering is unaffected by completion order. This closes the slack to 1.36x the packing bound; combined with the memoization work, the reference 313-word Sena batch drops from 1,051s sequential to 74.4s at 16-way parallelism.

**4. Memory bound.** Server GC under sustained heavy-word concurrency defers collecting transient search garbage for throughput, which can spike host memory well past what the retained memo tables account for (the tables themselves are a few tens of MB per word -- confirmed by direct measurement, not the cause). `Morpher`'s class doc now states the `DOTNET_GCHeapHardLimit`/`GCHeapHardLimitPercent` requirement for Server-GC batch hosts.

## Verification

Every change is gated on byte-identical parse output against the full Indonesian corpus (121 words) and a fixed Sena reference set (300 corpus words + the 13 measured-worst words), sequential vs. parallel, before vs. after each change -- diffed on sorted per-word signatures so ordering changes never mask a real mismatch. 68/68 HermitCrab unit tests and the full SIL.Machine suite (828/831, 3 pre-existing skips) pass throughout, including new tests added here: single-threaded/parallel-cascade equivalence under compounding and affix templates, and lexical-gating parity.

## What was tried and deliberately not shipped

In the interest of shipping only what measurement supports:

- A synthesis-side length pre-filter was checked and ruled out (the rejection it targets happens too early to intercept).
- A lexical-gate optimization for pruning rootless branches shipped flag-gated default-off (measured inert on both reference corpora, real grammars don't trigger it often enough yet).
- A leaf-rule battery prefilter and a deferred-materialization redesign for the memo's replay path were both measured below a 10% aggregate-corpus threshold and not implemented.
- A thread-static pooling optimization for FeatureStruct's per-comparison scratch collections was implemented, measured, and reverted -- it reduced allocation 15-17% but regressed wall-clock 8-15% (small short-lived collections are cheaper to Gen0-allocate than to pool).

Each of these is a real, measured negative result, not an oversight.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
Stacked on #446 (hc-rustify) — targets that branch, not master, since it hasn't merged yet.
The problem
HermitCrab's analysis-cascade search for "unordered" morphological strata explores every ordering of applicable rules (a combinatorial k! walk over a 2^k subset lattice of morpheme-application states), because rule application is not commutative in general and the engine doesn't know in advance which orderings will matter. On real grammars this reaches the same state key (shape + rule multiset + syntactic feature structure — everything that determines the rest of analysis, independent of arrival order) thousands of times: dissecting Sena's worst words showed 158,227 node expansions against only ~2,546 distinct states, with a single state re-visited over 7,000 times. Because each visit re-ran a full FST pattern match and downstream template battery, ~98% of the work was provably repeated computation.
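The gap between orderings and distinct states can be illustrated with a toy model. The sketch below is a language-neutral illustration in Python, not the PR's C# code: it assumes (hypothetically) that a search state is fully described by the set of rules already applied, whereas the real state key also includes the shape and the syntactic feature structure. `naive_expansions` and `memoized_expansions` are names invented for this example.

```python
def naive_expansions(k):
    """Count node expansions when every ordering of k rules is explored."""
    expansions = 0

    def expand(applied):
        nonlocal expansions
        expansions += 1
        for rule in range(k):
            if rule not in applied:
                expand(applied | {rule})

    expand(frozenset())
    return expansions


def memoized_expansions(k):
    """Count expansions when each order-invariant state is expanded once."""
    seen = set()      # stands in for the per-parse memo table (assumption)
    expansions = 0

    def expand(applied):
        nonlocal expansions
        if applied in seen:
            return    # a real memo would replay the stored result here
        seen.add(applied)
        expansions += 1
        for rule in range(k):
            if rule not in applied:
                expand(applied | {rule})

    expand(frozenset())
    return expansions


if __name__ == "__main__":
    for k in (3, 6, 8):
        print(k, naive_expansions(k), memoized_expansions(k))
```

In this toy model the memoized count is exactly 2^k, while the naive count grows with the number of ordered arrangements. The measured Sena figures fit the same picture: 2,546 distinct states out of 158,227 expansions is about 1.6%, which is where the ~98% repeated-work estimate comes from.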
What changed
1. Order-invariant memoization (`AnalysisStateKey`, `MemoizedCombinationRuleCascade`, `AnalysisScope`). A new state key strips arrival order from the search key while keeping everything that actually determines the analysis outcome. Two memo tables cache, per parse: subtrees that provably yield nothing (nogood case) and subtrees that yield results (a separate table for the per-stratum template battery). A second arrival at a known state replays the stored result (`Word.ReplayOnto`), cloning and re-grafting only the arrival's own rule/non-head trail prefix onto the stored subtree's suffix, instead of re-searching. An in-flight re-entrancy guard falls back to a plain unmemoized expansion for the rare self-loop case. The template battery turned out to be the dominant cost (93% of wall time on the worst words); memoizing it is most of the win.

2. Supporting fixes surfaced by the memo work:

- `FeatureStruct.Freeze()` gets the copy-on-write hash shortcut `Shape.Freeze()` already had.
- `Word.CloneShareFrozenShape()`: clone sites that provably never mutate the cloned shape share the frozen source's `Shape` instance instead of deep-copying, which cuts per-word allocation 4.5–7.6%.
- A latent bug fix: `AnalysisAffixTemplateRule.Apply` reassigned `SyntacticFeatureStruct` to an unfrozen clone after the owning `Word` was already frozen; harmless until the new state key became the first code to call `GetFrozenHashCode()` on it. Fixed defensively (`Freeze()` is idempotent).
- Synthesis-side rule indexing (`RootAllomorphTrie`) replaces a linear scan with a lookup.

3. Corpus-batch scheduling (`hc batch --parallel[=N]`). Naive range-partitioning clusters the corpus's catastrophically slow words together in whichever partition draws them, measured at 2.9x the theoretical packing-bound wall-clock. The batch now load-balances by sorting words longest-surface-first and feeding them through a work-stealing partitioner, buffering results by original index so output ordering is unaffected. This closes the slack to 1.36x the packing bound; combined with memoization, a 313-word reference batch drops from 1,051s sequential to 74.4s at 16-way parallelism.

4. Memory bound. Server GC under sustained heavy-word concurrency can defer collecting transient search garbage, spiking host memory well past what the retained memo tables account for (the tables were measured at a few tens of MB per word and are not the cause). `Morpher`'s doc now states the `DOTNET_GCHeapHardLimit`/`GCHeapHardLimitPercent` requirement for Server-GC batch hosts.

Verification
Every change is gated on byte-identical parse output against the full Indonesian corpus (121 words) and a fixed Sena reference set (300 corpus words + 13 measured-worst words), sequential vs. parallel, before vs. after each change. 68/68 HermitCrab unit tests and the full SIL.Machine suite (828/831, 3 pre-existing skips) pass throughout, including new tests added here for single-threaded/parallel-cascade equivalence and lexical-gating parity.
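The sequential-vs-parallel gate can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the PR's C# test harness: `parse` is a hypothetical stand-in for the analyzer, a "signature" is assumed to be the word plus its sorted analyses, and a thread pool's shared task queue stands in for the work-stealing partitioner. What it does mirror from the description is the longest-surface-first submission order and the buffering of results by original index.

```python
from concurrent.futures import ThreadPoolExecutor


def parse(word):
    """Hypothetical analyzer: every two-way split of the word."""
    return [f"{word[:i]}+{word[i:]}" for i in range(1, len(word))]


def signature(word):
    """Per-word signature: analyses sorted so their order cannot matter."""
    return (word, tuple(sorted(parse(word))))


def run_sequential(words):
    return [signature(w) for w in words]


def run_parallel(words, workers=4):
    # Submit longest surface forms first so likely-slow words start early.
    order = sorted(range(len(words)), key=lambda i: len(words[i]), reverse=True)
    results = [None] * len(words)   # buffered by original index

    def work(i):
        results[i] = signature(words[i])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() drains the iterator so any worker exception is re-raised.
        list(pool.map(work, order))
    return results


if __name__ == "__main__":
    corpus = ["anadya", "ku", "mbuzi", "a", "kudya"]
    assert run_parallel(corpus) == run_sequential(corpus)
    print("parallel output matches sequential output")
```

Because each worker writes only to its own slot in `results`, completion order never affects output order, so the two runs can be compared element by element.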
What was tried and deliberately not shipped
A synthesis-side length pre-filter was checked and ruled out (too early to intercept the actual rejection). A lexical-gate optimization for pruning rootless branches shipped flag-gated default-off (measured inert on both reference corpora). A leaf-rule battery prefilter and a deferred-materialization redesign for the memo's replay path were measured below a 10% aggregate-corpus threshold and not implemented. A thread-static pooling optimization for scratch collections was implemented, measured, and reverted — it cut allocation 15–17% but regressed wall-clock 8–15% (small short-lived collections are cheaper to Gen0-allocate than to pool). Each is a measured negative result, not an oversight.
Test plan
- `dotnet build`: full solution, 0 errors
- `dotnet test`: HermitCrab suite 68/68, SIL.Machine suite 828/831 (3 pre-existing skips)
- `--parallel`, before/after every change