HermitCrab parse optimization: memoize analysis redundancy, fix corpus scheduling, bound memory#451

Draft
johnml1135 wants to merge 1 commit into hc-rustify from parse-optimization

johnml1135 (Collaborator) commented Jul 3, 2026

Stacked on #446 (hc-rustify) — targets that branch, not master, since it hasn't merged yet.

The problem

HermitCrab's analysis-cascade search for "unordered" morphological strata explores every ordering of applicable rules (a combinatorial k! walk over a 2^k subset lattice of morpheme-application states), because rule application is not commutative in general and the engine doesn't know in advance which orderings will matter. On real grammars this reaches the same state key (shape + rule multiset + syntactic feature structure — everything that determines the rest of analysis, independent of arrival order) thousands of times: dissecting Sena's worst words showed 158,227 node expansions against only ~2,546 distinct states, with a single state re-visited over 7,000 times. Because each visit re-ran a full FST pattern match and downstream template battery, ~98% of the work was provably repeated computation.
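
To make the blow-up concrete, here is a toy Python model (not the C# engine). It assumes every rule applies at every state and that a state is just the set of rules applied so far; the real state key also carries the shape and feature structure. `walk` is a hypothetical name used only for this illustration.

```python
def walk(k):
    """Explore every ordering of k always-applicable rules.

    Returns (node expansions, distinct order-free states).
    """
    expansions = 0
    seen = set()

    def expand(applied):
        nonlocal expansions
        expansions += 1
        seen.add(applied)  # the set of applied rules ignores arrival order
        for rule in range(k):
            if rule not in applied:
                expand(applied | {rule})

    expand(frozenset())
    return expansions, len(seen)


if __name__ == "__main__":
    for k in (3, 6):
        print(k, walk(k))
    # Share of expansions that revisit an already-seen state, from the Sena figures.
    print(round((1 - 2546 / 158227) * 100, 1))
```

In this toy model, k = 6 expands 1,957 nodes to cover 64 states. The Sena figures above work out to 1 - 2,546/158,227 ≈ 98.4% repeated expansions.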

What changed

1. Order-invariant memoization (AnalysisStateKey, MemoizedCombinationRuleCascade, AnalysisScope). A new state key strips arrival order from the search key while keeping everything that actually determines the analysis outcome. Two memo tables cache, per parse: subtrees that provably yield nothing (nogood case) and subtrees that yield results (a separate table for the per-stratum template battery). A second arrival at a known state replays the stored result (Word.ReplayOnto) — cloning and re-grafting only the arrival's own rule/non-head trail prefix onto the stored subtree's suffix — instead of re-searching. An in-flight re-entrancy guard falls back to a plain unmemoized expansion for the rare self-loop case. The template battery turned out to be the dominant cost (93% of wall time on the worst words) — memoizing it is most of the win.
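
The memo-and-replay idea can be sketched in Python under heavy simplification. This is a toy, not the `MemoizedCombinationRuleCascade` code: a "rule" here just deletes one marker character from the shape, the key is (shape, sorted rule multiset) with no feature structure, and a single table holds both nogoods and results. `Scope`, `state_key`, `expand` and `LEXICON` are hypothetical names.

```python
from collections import Counter

LEXICON = {"root"}  # toy lexicon: analysis succeeds when the shape is a known root


class Scope:
    """Per-parse memo state (a loose stand-in for the PR's AnalysisScope)."""

    def __init__(self):
        self.memo = {}            # state key -> list of trail suffixes; [] is a nogood
        self.in_progress = set()  # keys currently being expanded
        self.expansions = 0       # real (non-replayed) expansions


def state_key(shape, trail):
    # Sorted multiset of applied rules: arrival order is deliberately dropped.
    return (shape, tuple(sorted(Counter(trail).items())))


def expand(shape, trail, rules, scope, use_memo=True):
    """Return every full rule trail that reduces `shape` to a lexicon entry."""
    key = state_key(shape, trail)
    if use_memo and key in scope.memo:
        # Replay: graft this arrival's own prefix onto the stored suffixes.
        return [trail + suffix for suffix in scope.memo[key]]
    # A key already in flight is expanded plainly and never stored.
    # (Shapes only shrink in this toy, so that branch is illustrative.)
    guarded = use_memo and key not in scope.in_progress
    if guarded:
        scope.in_progress.add(key)
    scope.expansions += 1
    suffixes = [()] if shape in LEXICON else []
    for rule in rules:
        if rule in shape:
            rest = shape.replace(rule, "", 1)  # "unapply" the rule
            for full in expand(rest, trail + (rule,), rules, scope, use_memo):
                suffixes.append(full[len(trail):])
    if guarded:
        scope.in_progress.discard(key)
        scope.memo[key] = suffixes
    return [trail + suffix for suffix in suffixes]


if __name__ == "__main__":
    plain, memo = Scope(), Scope()
    a = expand("rootxyz", (), ("x", "y", "z"), plain, use_memo=False)
    b = expand("rootxyz", (), ("x", "y", "z"), memo)
    print(sorted(a) == sorted(b), plain.expansions, memo.expansions)
```

On this input both runs return the same six trails, and the memoized run performs 8 real expansions instead of 16.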

2. Supporting fixes surfaced by the memo work:

  • FeatureStruct.Freeze() gets the copy-on-write hash shortcut Shape.Freeze() already had.
  • Word.CloneShareFrozenShape(): clone sites that provably never mutate the cloned shape share the frozen source's Shape instance instead of deep-copying — cuts per-word allocation 4.5–7.6%.
  • A latent bug fix: a rule reassigned SyntacticFeatureStruct to an unfrozen clone after the owning Word was already frozen; harmless until the new state key became the first code to call GetFrozenHashCode() on it. Fixed defensively (Freeze() is idempotent).
  • Synthesis-side rule indexing (RootAllomorphTrie) replaces a linear scan with a lookup.

3. Corpus-batch scheduling (hc batch --parallel[=N]). Naive range-partitioning clusters the corpus's catastrophically slow words together in whichever partition draws them; this was measured at 2.9x the theoretical packing-bound wall-clock. The batch command now load-balances by sorting words longest-surface-first and feeding them through a work-stealing partitioner, buffering results by original index so output ordering is unaffected. This closes the slack to 1.36x the packing bound; combined with memoization, a 313-word reference batch drops from 1,051s sequential to 74.4s at 16-way parallelism.
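
A minimal Python sketch of the scheduling idea, with assumptions stated up front. The two makespan functions are a simulation with known per-word costs, whereas the real batch only has surface length as a proxy. `parse_batch` uses a shared task queue from `ThreadPoolExecutor`, not .NET's work-stealing partitioner. All function names are hypothetical.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def range_partition_makespan(costs, workers):
    """Wall-clock if each worker is handed one contiguous slice of the input."""
    size = -(-len(costs) // workers)  # ceiling division
    return max(sum(costs[i:i + size]) for i in range(0, len(costs), size))


def dynamic_makespan(costs, workers):
    """Wall-clock if each idle worker pulls the next item from a shared queue."""
    free_at = [0.0] * workers  # time at which each worker next becomes idle
    heapq.heapify(free_at)
    for cost in costs:
        heapq.heappush(free_at, heapq.heappop(free_at) + cost)
    return max(free_at)


def parse_batch(words, parse, workers):
    """Dispatch the longest words first, but return results in input order."""
    order = sorted(range(len(words)), key=lambda i: len(words[i]), reverse=True)
    results = [None] * len(words)  # buffered by original index

    def run(i):
        results[i] = parse(words[i])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(run, order))  # drain the iterator so exceptions surface
    return results


if __name__ == "__main__":
    costs = [1, 1, 1, 1, 9, 9, 9, 9]  # slow words clustered at the end
    print(range_partition_makespan(costs, 4))
    print(dynamic_makespan(sorted(costs, reverse=True), 4))
    print(parse_batch(["aa", "b", "cccc"], str.upper, 2))
```

With the slow items clustered, contiguous slicing takes 18 time units on four workers, while longest-first dynamic dispatch takes 10.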

4. Memory bound. Server GC under sustained heavy-word concurrency can defer collecting transient search garbage, which can push host memory well past what the retained memo tables account for (the tables themselves were measured at a few tens of MB per word and are not the cause). Morpher's doc now states the DOTNET_GCHeapHardLimit/GCHeapHardLimitPercent requirement for Server-GC batch hosts.

Verification

Every change is gated on byte-identical parse output against the full Indonesian corpus (121 words) and a fixed Sena reference set (300 corpus words + 13 measured-worst words), sequential vs. parallel, before vs. after each change. 68/68 HermitCrab unit tests and the full SIL.Machine suite (828/831, 3 pre-existing skips) pass throughout, including new tests added here for single-threaded/parallel-cascade equivalence and lexical-gating parity.

What was tried and deliberately not shipped

A synthesis-side length pre-filter was checked and ruled out (the rejection it targets happens too early to intercept). A lexical-gate optimization for pruning rootless branches shipped flag-gated default-off (measured inert on both reference corpora). A leaf-rule battery prefilter and a deferred-materialization redesign for the memo's replay path were measured below a 10% aggregate-corpus threshold and not implemented. A thread-static pooling optimization for scratch collections was implemented, measured, and reverted: it cut allocation 15–17% but regressed wall-clock 8–15% (small short-lived collections are cheaper to Gen0-allocate than to pool). Each is a measured negative result, not an oversight.

Test plan

  • dotnet build — full solution, 0 errors
  • dotnet test — HermitCrab suite 68/68, SIL.Machine suite 828/831 (3 pre-existing skips)
  • Byte-identity: Indonesian corpus (121/121) and Sena reference set, sequential vs. --parallel, before/after every change

…s scheduling, bound memory

Stacked on hc-rustify (#446). Cuts single-word analysis time ~5x, corpus batch wall-clock
~14x with parallelism, and closes a Server-GC memory blowup, without changing any parse
result (verified byte-identical on the Indonesian and Sena reference corpora throughout).

## The problem

HermitCrab's analysis-cascade search for "unordered" morphological strata explores every
ordering of applicable rules (a combinatorial k! walk over a 2^k subset lattice of
morpheme-application states), because rule application is not commutative in general and the
engine doesn't know in advance which orderings will matter. On real grammars this reaches the
same state key (shape + rule multiset + syntactic feature structure -- everything that
determines the rest of analysis, independent of arrival order) thousands of times: dissecting
Sena's worst words showed 158,227 node expansions against only ~2,546 distinct states, with a
single state re-visited over 7,000 times. Because each visit re-ran a full FST pattern match
and downstream template battery, ~98% of the work was provably repeated computation.

## What changed

**1. Order-invariant memoization (`AnalysisStateKey`, `MemoizedCombinationRuleCascade`,
`AnalysisScope`).** A new state key strips arrival order from the search key while keeping
everything that actually determines the analysis outcome. Two memo tables cache, per parse:
subtrees that provably yield nothing (`Memo`, nogood case) and subtrees that yield results
(`TemplateMemo` for the per-stratum template battery). A second arrival at a known state
replays the stored result (`Word.ReplayOnto`) -- cloning and re-grafting only the arrival's own
rule/non-head trail prefix onto the stored subtree's suffix -- instead of re-searching. An
in-flight re-entrancy guard (`AnalysisScope.InProgress`) falls back to a plain unmemoized
expansion for the rare self-loop case rather than reading a partial entry. The template battery
turned out to be the dominant cost (93% of wall time on the worst words, not the mrule cascade
the initial step-count analysis pointed at) -- memoizing it is most of the win.

**2. Supporting correctness/performance fixes surfaced by the memo work:**
- `FeatureStruct.Freeze()` gets the same copy-on-write hash shortcut `Shape.Freeze()` already
  had: a frozen source's unmutated clone adopts the cached hash instead of re-walking.
- `Word.CloneShareFrozenShape()`: clone sites that provably never mutate the cloned shape
  (`ReplayOnto`, the template battery's per-candidate clone) share the frozen source's `Shape`
  instance instead of deep-copying it -- cuts per-word allocation 4.5-7.6%.
- A latent bug fix: `AnalysisAffixTemplateRule.Apply` reassigned `SyntacticFeatureStruct` to an
  unfrozen clone after the owning `Word` was already frozen; harmless until the new state key
  became the first code to call `GetFrozenHashCode()` on it. Fixed defensively in
  `AnalysisStateKey`'s constructor (`Freeze()` is idempotent).
- Synthesis-side rule indexing (`RootAllomorphTrie`) cuts synthesis rule-selection from a
  linear scan to a lookup.
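
The copy-on-write hash shortcut from the first bullet can be illustrated with a toy Python class. It is only a sketch of the idea (an unmutated clone of a frozen source adopts the cached hash). `FrozenHashBox` and its members are hypothetical and are not the `FeatureStruct` API.

```python
class FrozenHashBox:
    """Toy freezable structure with a cached structural hash."""

    def __init__(self, items=()):
        self._items = dict(items)
        self._frozen_hash = None     # set once by freeze()
        self._inherited_hash = None  # carried over from a frozen source by clone()
        self.hash_walks = 0          # how often the full structure was hashed

    def clone(self):
        copy = FrozenHashBox(self._items)
        # None if the source is not frozen, so nothing is adopted in that case.
        copy._inherited_hash = self._frozen_hash
        return copy

    def set(self, key, value):
        if self._frozen_hash is not None:
            raise RuntimeError("cannot mutate a frozen structure")
        self._items[key] = value
        self._inherited_hash = None  # any mutation invalidates the shortcut

    def freeze(self):
        if self._frozen_hash is not None:
            return  # idempotent: freezing twice is harmless
        if self._inherited_hash is not None:
            self._frozen_hash = self._inherited_hash  # adopt without re-walking
        else:
            self.hash_walks += 1
            self._frozen_hash = hash(frozenset(self._items.items()))

    def frozen_hash(self):
        if self._frozen_hash is None:
            raise RuntimeError("not frozen")
        return self._frozen_hash


if __name__ == "__main__":
    source = FrozenHashBox({"num": "sg"})
    source.freeze()
    untouched = source.clone()
    untouched.freeze()
    print(untouched.hash_walks, untouched.frozen_hash() == source.frozen_hash())
```

Calling `frozen_hash()` on an unfrozen instance raises here, which mirrors why the latent bug only surfaced once something asked an unfrozen clone for its frozen hash.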

**3. Corpus-batch scheduling (`hc batch --parallel[=N]`, `BatchCommand`).** The obvious
per-word parallelism was leaving most of its throughput on the table: naive range-partitioning
clusters the corpus's few catastrophically slow words together in whichever partition draws
them, so one thread finishes early while another is still grinding through the tail --
measured at 2.9x the theoretical packing-bound wall-clock. `BatchCommand` now load-balances by
sorting words longest-surface-first and feeding them through a work-stealing partitioner
(`Partitioner.Create(..., loadBalance: true)`), buffering each result by original index so
output ordering is unaffected by completion order. This closes the slack to 1.36x the packing
bound; combined with the memoization work, the reference 313-word Sena batch drops from 1,051s
sequential to 74.4s at 16-way parallelism.
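
For readers checking the ratios, here is one common reading of a "packing bound", sketched in Python. It is an assumption about the metric, not a definition taken from this commit: no schedule on N workers can finish faster than the larger of total work divided by N and the single slowest item. `packing_bound` and `slack` are hypothetical names.

```python
def packing_bound(costs, workers):
    """Lower bound on wall-clock: perfect spreading, or the one slowest item."""
    return max(sum(costs) / workers, max(costs))


def slack(measured_wall_clock, costs, workers):
    """How far a measured run sits above the bound (1.0 would be optimal)."""
    return measured_wall_clock / packing_bound(costs, workers)


if __name__ == "__main__":
    costs = [9, 9, 9, 9, 1, 1, 1, 1]   # illustrative per-word costs
    print(packing_bound(costs, 4))     # bound for four workers
    print(slack(18, costs, 4))         # a run that took 18 time units
    print(round(1051 / 74.4, 1))       # end-to-end speedup from the figures above
```

The last line reproduces the roughly 14x batch speedup quoted at the top of this message from the 1,051s and 74.4s figures.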

**4. Memory bound.** Server GC under sustained heavy-word concurrency defers collecting
transient search garbage for throughput, which can spike host memory well past what the
retained memo tables account for (the tables themselves are a few tens of MB per word --
confirmed by direct measurement, not the cause). `Morpher`'s class doc now states the
`DOTNET_GCHeapHardLimit`/`GCHeapHardLimitPercent` requirement for Server-GC batch hosts.

## Verification

Every change is gated on byte-identical parse output against the full Indonesian corpus (121
words) and a fixed Sena reference set (300 corpus words + the 13 measured-worst words),
sequential vs. parallel, before vs. after each change -- diffed on sorted per-word signatures
so ordering changes never mask a real mismatch. 68/68 HermitCrab unit tests and the full
SIL.Machine suite (828/831, 3 pre-existing skips) pass throughout, including new tests added
here: single-threaded/parallel-cascade equivalence under compounding and affix templates, and
lexical-gating parity.
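
The sorted-signature comparison can be sketched as follows. This Python toy assumes each run is available as a mapping from word to a list of analysis signature strings; the actual signature format and harness are not shown in this commit, and `signatures` and `mismatches` are hypothetical names.

```python
def signatures(parses):
    """Map each word to its analyses in sorted order, so emission order is ignored."""
    return {word: sorted(analyses) for word, analyses in parses.items()}


def mismatches(before, after):
    """Words whose analysis sets differ between two runs (including missing words)."""
    a, b = signatures(before), signatures(after)
    return sorted(word for word in a.keys() | b.keys() if a.get(word) != b.get(word))


if __name__ == "__main__":
    before = {"w1": ["A+B", "C"], "w2": []}
    reordered = {"w1": ["C", "A+B"], "w2": []}
    changed = {"w1": ["A+B"], "w2": []}
    print(mismatches(before, reordered))
    print(mismatches(before, changed))
```

Sorting the list, not collapsing it to a set, keeps duplicate analyses visible, so a change in multiplicity still counts as a mismatch.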

## What was tried and deliberately not shipped

In the interest of shipping only what measurement supports: a synthesis-side length
pre-filter was checked and ruled out (the rejection it targets happens too early to
intercept); a lexical-gate optimization for pruning rootless branches shipped flag-gated
default-off (measured inert on both reference corpora, since real grammars don't trigger it
often enough yet); a leaf-rule battery prefilter and a deferred-materialization redesign for
the memo's replay path were both measured below a 10% aggregate-corpus threshold and not
implemented; a thread-static pooling optimization for FeatureStruct's per-comparison scratch
collections was implemented, measured, and reverted -- it reduced allocation 15-17% but
regressed wall-clock 8-15% (small short-lived collections are cheaper to Gen0-allocate than
to pool). Each of these is a real, measured negative result, not an oversight.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>