RUSTIFY: allocation- and CPU-optimized HermitCrab parsing (flat/COW Shape + FST) by johnml1135 · Pull Request #446 · sillsdev/machine

johnml1135 · 2026-07-01T13:19:11Z

What this is

Allocation- and CPU-optimized HermitCrab morphological parsing so bulk "Parse All Words" scales further. Parse results are unchanged (byte-identical to origin/master) — guarded by the existing HermitCrab regression suite plus new unit tests, and by a per-word analysis-signature diff against master on two real grammars (Sena, Indonesian). Rebased onto latest master; single commit, 109 files. No unrelated work, no profiling scaffolding, no dev-journal docs — everything is documented here.

Performance vs `origin/master`

Measured with a throwaway benchmark harness (not part of this PR — see "Pared back" below), 800-word runs, Server GC, MaxUnapplications=5 (the default), across dop = 1/2/4/6/8/12/16 (outer Parallel.ForEach across words; this branch's Morpher also supports MaxDegreeOfParallelism=1 to parallelize once across words and run each word's own cascade single-threaded — the FieldWorks "parallelize across words myself" pattern used here). Master has no equivalent per-word parallelism switch, so master's numbers reflect its own default (always-parallel-per-word) behavior at each outer dop.

Sena (real production grammar, 800 words):

dop	branch words/sec	master words/sec	speedup	branch MB	master MB	less memory
1	89	57	1.56x	11,979	22,689	47.2%
2	201	147	1.37x	11,966	22,648	47.2%
4	324	210	1.54x	11,966	22,673	47.2%
6	417	257	1.62x	11,966	22,673	47.2%
8	460	269	1.71x	11,966	22,653	47.2%
12	517	307	1.68x	11,966	22,636	47.1%
16	547	334	1.64x	11,966	22,542	46.9%

Indonesian (real production grammar, 121 distinct words tiled to 800 for a stable dop=16 reading):

dop	branch words/sec	master words/sec	speedup	branch MB	master MB	less memory
1	158	124	1.27x	5,284	14,748	64.2%
2	432	254	1.70x	5,280	14,773	64.3%
4	689	290	2.38x	5,280	14,760	64.2%
6	771	324	2.38x	5,280	14,740	64.2%
8	846	322	2.63x	5,280	14,738	64.2%
12	931	337	2.76x	5,280	14,724	64.1%
16	778	343	2.27x	5,280	14,770	64.2%

Summary: 1.3x–2.8x faster and 47–64% less memory across both grammars and every thread count tested, with zero change to parse/generate output. (The Indonesian dop=16 figure dips versus dop=12 — noise from thread-scheduling overhead dominating on this grammar's lighter per-word cost at high fan-out, not a regression; every other dop point scales monotonically.)

Two internal CPU-profiling rounds (via dotnet-trace, not included in this PR) additionally cut CPU time ~39.5% on a profiled dop=1 workload by fixing two costs invisible to allocation-byte counting: Array.CreateInstanceMDArray (a true multidimensional Register<TOffset>[,] flattened to Register<TOffset>[]) and the CLR's identity-hash fallback (RuntimeHelpers.GetHashCode) on several hot dictionary-key types (State, Feature/IDBearerBase, FeatureValue, ShapeNode) — fixed with cheap, correctness-preserving GetHashCode() overrides (Equals() left untouched everywhere, so identity semantics are unchanged).

Correctness

827 SIL.Machine + 63 HermitCrab tests pass; CI green (ubuntu + windows).
Per-word analysis-signature diff vs origin/master: 0 differences on both Sena (60-word uncapped MaxUnapplications=0 subset — the combinatorially expensive setting) and the full Indonesian word list (uncapped).
The two highest-risk defects from the original design have targeted, fix-sensitive regression tests: the 64-symbol ulong-mask boundary (fails without the fix) and COW source-corruption (asserts the frozen source is byte-for-byte unchanged after a clone mutation). Plus a 50×250 concurrent-determinism stress test.

Review pass on the squashed diff

Before finalizing, ran a full correctness/testability/threading review over the whole squashed diff (109 files) rather than trusting the incremental per-commit gates alone. Found and fixed three real bugs that had not surfaced in any prior gate (the test grammars happen not to trigger them):

CombinationRuleCascade's memoization was unsound. A "skip already-expanded words" optimization deduped on Word.ValueEquals, which deliberately excludes per-rule application/unapplication counts — but those counts are exactly what MaxApplicationCount gating reads. Two DFS paths through the combinatorial rule search could converge on value-equal words with different remaining rule "headroom," and the memo would silently drop the second path's still-reachable results. Worse, since ParallelCombinationRuleCascade (used whenever MaxDegreeOfParallelism != 1) had no equivalent memo, output for MorphologicalRuleOrder.Unordered grammars with MaxApplicationCount > 1 could depend on a purely-performance threading knob. Reverted to master's unconditional exploration — the safest fix given the subtlety, consistent with this branch's "only land measured wins" discipline.
FeatureStruct.EnsureFlat had an unsynchronized lazy-init race. Its bit-packed unify-cache fields were plain (non-volatile) writes, but frozen FeatureStructs are read concurrently from every parallel FST traversal thread — a classic "reader observes the ready-flag before the array write is visible" hazard. Fixed with Volatile.Read/Volatile.Write safe publication (no lock needed; redundant concurrent computation is harmless since it's deterministic).
A frozen ShapeNode's frozen flag was lost on detach. Shape.Remove()/Clear() reset a node's frozen bit to false on removal, silently violating "once frozen, always frozen" (master's original ShapeNode.IsFrozen was a permanent bool). Currently unreachable in production code paths (nothing removes an individually-frozen node from a still-unfrozen shape today) but had no test coverage and no guard against a future caller hitting it. Fixed so a detached node carries its frozen state forward.

Also fixed, as pure efficiency cleanup surfaced by the same review (no behavior change, confirmed by the signature diff above):

Shape.ValueEquals (the equality CombinationRuleCascade's dedup and every rule-cascade HashSet<Word> use) forced a full copy-on-write inflate on every comparison — defeating the COW optimization's whole point on a hot dedup path. Switched to comparing via the already-lazy int-offset projection.
Three sites in the phonological rule hot path (RewriteRuleSpec.MatchSubrule, SynthesisMetathesisRuleSpec.ApplyRhs, IterativePhonologicalPatternRule.Apply) re-resolved the same ShapeNode via a dictionary lookup more than once per iteration; hoisted/deduplicated.
TraversalMethodBase's two ExecuteCommands overloads (kept separate deliberately, to avoid boxing a List<T>.Enumerator on the hot arc-advance path) had duplicated register-copy bodies; factored into a shared private method.
A stale .gitignore comment pointed at a benchmark harness not present in this repo.

Known, accepted limitation (not fixed, flagged for awareness): SymbolicFeature.FlatIndex is assigned from a single process-wide counter, never scoped per FeatureSystem or reclaimed. In a process that loads/reloads many grammars over a long lifetime (e.g. a long-running FieldWorks session), later-loaded grammars' flat-unify arrays can grow larger than their own feature count warrants, partially eroding this exact optimization over time. Rescoping it safely would mean assigning indices during FeatureSystem.Freeze(), which conflicts with the deliberate "features can be used before their owning system freezes" design this branch relies on elsewhere — judged not worth the risk for this PR. Also noted and judged not a regression: two Metathesis rule specs resolve pattern group captures without a .Success check before use, which can throw if a metathesis group sits inside an optional/alternation pattern construct; master's original ShapeNode-based code had the identical gap (a NullReferenceException instead of a KeyNotFoundException today) — carried forward unchanged, not introduced by this branch.

Review by change group

Reviewers can approve group-by-group; each maps to a coherent slice of the diff.

#	Group	Files	What & why
A	Flat Shape/ShapeNode backing	`Annotations/Shape.cs`, `ShapeNode.cs`, `Annotation.cs`, `DataStructures/DataStructuresExtensions.cs`	`Shape` owns nodes in parallel int-linked backing arrays instead of an `OrderedBidirList`; `ShapeNode` becomes an `(Owner, Index)` handle. Foundation for array-copy clone + int FST offset.
B	Copy-on-write Shape + BidirList flattening	`Annotations/Shape.cs` (COW), `AnnotationList.cs`, `DataStructures/BidirList.cs`, `BidirListNode.cs`	Cloning a frozen shape shares the source via the int projection, inflating lazily only on handle/mutation (cuts `Word.Clone` ~60%). Skip-list towers grow on demand + inline level 0.
C	FeatureStruct ulong flat-unify	`FeatureModel/FeatureStruct.cs`, `SymbolicFeature.cs`, `SymbolicFeatureValue.cs`, `UlongSymbolicFeatureValueFlags.cs`, `FiniteState/Input.cs`	Frozen structs lazily cache a bit-packed `ulong[]` vector; `Input.Matches` takes a bitwise fast path for the simple/no-variable case, falling back to the original engine otherwise (→ byte-identical).
D	int-offset FST traversal	`FiniteState/Fst.cs`, `TraversalMethodBase.cs`, the 4 traversal method + 3 traversal instance files, `ITraversalMethod.cs`, `VisitedStates.cs` (new)	FST binds as `Fst<Word,int>` (dense offset via the lazy `AnnotationList<int>` projection), with a reusable per-`Transduce` register scaffold and an inline-`ulong` `VisitedStates` epsilon-loop set.
E	HermitCrab rule-spec adaptations	~64 files under `Morphology.HermitCrab/` (`MorphologicalRules/`, `PhonologicalRules/`, `Word.cs`, `HermitCrabExtensions.cs`, loaders…)	Flip every `Pattern`/`Match`/`Matcher` from `ShapeNode` to `int` offsets; resolve `int → ShapeNode` at rule-RHS navigation sites. Lands atomically (can't be green mid-flip).
F	CPU-cost fixes	`State.cs`, `DataStructures/IDBearerBase.cs`, `FeatureValue.cs`, `ShapeNode.cs`, `DataStructures/BidirList.cs`, `FeatureModel/FeatureStruct.cs`, `StringFeatureValue.cs`	Cheap `GetHashCode()` overrides replacing the CLR identity-hash fallback (`Equals()` untouched everywhere); `Register<TOffset>[,]` → `[]`; thread-static `Random` instead of per-instance; ordinal instead of culture-aware string sorts. Found via a real `dotnet-trace` CPU profile, not guesswork.
G	Single-thread option	`Morpher.cs`, `AnalysisStratumRule.cs`, `AnalysisAffixTemplateRule.cs`, `Rules/CombinationRuleCascade.cs`, `ParallelCombinationRuleCascade.cs`	`MaxDegreeOfParallelism == 1` selects the serial cascade; the parallel cascade honors the DOP cap.
I	Tests	`Annotations/AnnotationTests.cs` (new), `FeatureModel/FeatureStructTests.cs` (new), `FiniteState/FstTests.cs`, and rule-test updates	New COW/flat-structure + 64-symbol-mask regression + single-thread-vs-parallel equivalence + concurrent-determinism tests; existing rule tests migrated to the int-offset API.

⚠️ Breaking API change

The morphological-rule pattern type changed Pattern<Word, ShapeNode> → Pattern<Word, int> (affects AffixProcessAllomorph.Lhs, AnalysisMorphologicalTransform, MorphologicalTransform.GenerateShape, Match<Word,ShapeNode>, and the public Language.CompileAnalysisRule/CompileSynthesisRule return types IRule<Word,ShapeNode> → IRule<Word,int>). This breaks only code that constructs HC rules programmatically (FieldWorks' HCLoader.cs — a mechanical ShapeNode→int migration when it bumps the package). The Morpher/ParseWord/AnalyzeWord and XML grammar-load APIs are unchanged, so load-and-parse consumers are unaffected. Shape/ShapeNode also moved from deriving the concrete OrderedBidirList(Node) to implementing the pre-existing interface (all members preserved; nothing depends on the concrete base) — practically non-breaking. Suggest releasing as a new major version.

Approaches tried and discarded (with why)

Every idea below was implemented (or reasoned through) and measured against this branch's own "only land measured wins" discipline — full 827+63 test suites plus an uncapped byte-identical signature diff for anything touching hash/traversal order. Kept for the record so a future session doesn't re-attempt a dead end:

Pack Register<TOffset> into a single long. Unsafe.SizeOf<Register<int>>() is already 8 bytes — identical to a packed long. Zero allocation win available; any CPU win would need type-specializing a public generic API used elsewhere in SIL.Machine with non-int offsets. Not attempted.
FST output delta-log instead of per-branch Word.Clone. Confirmed dead code for HermitCrab's actual usage: Matcher.Compile() never passes an operations argument, so the FST-transducer code path this targets never executes during AnalyzeWord/ParseWord (HermitCrab always takes the FSA-only path). Not attempted.
TagMapCommand as a readonly struct. Implemented, verified byte-identical, then measured via a wall-clock micro-benchmark: 534.93ms vs 536.00ms baseline — noise-level, no measurable win (the mutation this targets only happens once per grammar load, not per word). Reverted.
Skip lazy/on-the-fly determinization. The hot matchers need every match (AllSubmatches), not just the first, so collapsing the frontier saves little — and reordering result production would reopen the MaxUnapplications-cap byte-identical question this PR depends on. Rejected by reasoning, not implemented.
Per-word arena/pool for traversal instances. Already tried and reverted in an earlier RUSTIFY phase (see the comment at Fst.cs): a pooled traversal method survives a Gen0 collection, promotes to Gen2, and the resulting stop-the-world Gen2 GC serializes parallel parsing. Not re-attempted.
Compile the frozen FST into CSR (compressed-sparse-row) form. Gated on a CPU profile showing arc-chase as a top cost. Two independent post-fix profiles show no arc-traversal frame anywhere near the top (the closest sit at 0.1–0.4% each, an order of magnitude below the real Add a couple of links to papers #1 cost). Formally closed as not-justified against this workload.
Memoize rule application across the analysis cascade (the highest-ceiling idea investigated: ~48% of intermediate Word objects during analysis are value-duplicates of an already-seen one). Implemented exactly as designed, gated correctly (build, tests, allocation counters all clean), but measured as a reproducible ~3.9% wall-clock regression rather than the predicted win — likely because the memo's frozen-input-only restriction misses most of the duplication the ~48% figure was measuring (RuleCascade.ApplyRule's input is often not yet frozen at the point of each call). Reverted. The prerequisite correctness fix this work surfaced (SyntacticFeatureStruct was being mutated after Word.Freeze() at 8 call sites, including one real word-aliasing bug where two loop iterations shared and mutated the same FeatureStruct instance) landed on its own merits, independent of the memoization outcome.

Pared back for this PR

Removed the profiling apparatus and scaffolding the optimization was developed with (preserved on a local branch, not this repo): allocation-instrumentation (MorpherStatistics/FstStatistics + hot-path call sites), dotnet-trace profiling harnesses and analysis scripts, the [Explicit] benchmarks used for this PR's own perf table above, and the RUSTIFY dev-journal docs (fable-improvements.md and similar) — everything they'd have said is now in this description instead. FeatureStruct.FlatUnifyEnabled stays internal (test-hook only). The out-of-process HermitCrab worker prototyped during this work moved to the FieldWorks repo.

🤖 Generated with Claude Code

This change is

codecov-commenter · 2026-07-01T13:21:43Z

Codecov Report

❌ Patch coverage is 83.97669% with 220 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.65%. Comparing base (7835067) to head (ea72cd7).
⚠️ Report is 2 commits behind head on master.

Files with missing lines	Patch %	Lines
src/SIL.Machine/Annotations/Shape.cs	78.59%	52 Missing and 12 partials ⚠️
src/SIL.Machine/FiniteState/VisitedStates.cs	19.35%	23 Missing and 2 partials ⚠️
...Machine/DataStructures/DataStructuresExtensions.cs	57.50%	16 Missing and 1 partial ⚠️
src/SIL.Machine/FeatureModel/FeatureStruct.cs	86.71%	16 Missing and 1 partial ⚠️
src/SIL.Machine/Annotations/ShapeNode.cs	62.50%	11 Missing and 4 partials ⚠️
.../FiniteState/NondeterministicFstTraversalMethod.cs	67.39%	15 Missing ⚠️
...hine.Morphology.HermitCrab/HermitCrabExtensions.cs	72.22%	8 Missing and 2 partials ⚠️
src/SIL.Machine/FiniteState/Fst.cs	83.92%	8 Missing and 1 partial ⚠️
src/SIL.Machine/FiniteState/Input.cs	73.07%	6 Missing and 1 partial ⚠️
.../FiniteState/NondeterministicFsaTraversalMethod.cs	82.14%	4 Missing and 1 partial ⚠️
... and 16 more

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #446      +/-   ##
==========================================
+ Coverage   73.30%   73.65%   +0.35%     
==========================================
  Files         443      444       +1     
  Lines       37203    37929     +726     
  Branches     5110     5253     +143     
==========================================
+ Hits        27272    27937     +665     
- Misses       8805     8856      +51     
- Partials     1126     1136      +10

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

johnml1135 · 2026-07-01T18:42:44Z

Added tests to cover the largest patch-coverage gaps in the rewritten FST code (no coverage regression):

FstTests.TransduceNondeterministic_MatchesDeterminized — transduces a raw (non-determinized) FST directly and asserts it accepts the same outputs as its Determinize()d form. This exercises NondeterministicFstTraversalMethod (was 0% → ~71%) and NondeterministicFstTraversalInstance (0% → ~86%), which no existing test hit because every other transduce path determinizes first.
AnnotationTests.LargeShape_GrowsBackingArrays_PreservesLinkAndCloneIntegrity — builds a 50-node shape (crossing several backing-array doublings), then asserts forward/backward link integrity, NodeAt/OffsetOf handle round-trips, mid-list insert/remove, and clone value-equality.

Project coverage was already non-regressing (+0.07% vs base); these target the new/rewritten lines specifically. Note: raw nondeterministic transduce of an epsilon-input FST diverges from the determinized result (production always determinizes first), so that path is intentionally left out of the equivalence oracle.

Squashed history of the RUSTIFY performance branch. Full rationale, methodology, benchmark data, and rejected-approach audit trail live in the PR description; see PR #446 for the complete writeup. Summary of what changed: - Flat/COW Shape and ShapeNode backing (parallel int-linked arrays instead of an OrderedBidirList; ShapeNode becomes an (Owner, Index) handle). - Copy-on-write Shape cloning: a clone of a frozen shape shares the source's backing until first mutation. - FeatureStruct bit-packed ulong flat-unify fast path for the common simple/no-variable case, falling back to the original engine otherwise. - int-offset FST traversal (Fst<Word,int>, flattened Register<TOffset>[,] to a 1-D array, reusable per-Transduce register scaffold). - Every HermitCrab rule-spec file migrated from ShapeNode-offset to int-offset pattern matching. - Configurable MaxDegreeOfParallelism (1 = fully single-threaded, for callers that parallelize across words themselves). - Cheap GetHashCode overrides (State, IDBearerBase, FeatureValue, ShapeNode) replacing the CLR's identity-hash fallback on several hot dictionary/hashset key paths; StringComparer.Ordinal on hot sorts that were paying culture-aware comparison; a shared per-thread Random instead of one per BidirList instance; a filtered-annotation-view cache on frozen AnnotationLists. - SyntacticFeatureStruct mutate-after-freeze correctness hardening (8 sites converted to clone-then-reassign; caught a real aliasing bug in the synthesis PriorityUnion path along the way). - Post-squash review pass fixed a memoization soundness bug in CombinationRuleCascade (reverted to master's unconditional exploration), an unsynchronized lazy-init race in FeatureStruct.EnsureFlat (fixed with Volatile.Read/Write), and a frozen-flag-lost-on-detach bug in ShapeNode, plus several efficiency/cleanup fixes. Parse results are unchanged (byte-identical to master, verified via the full HermitCrab regression suite plus per-word signature diffs against master on Sena and Indonesian). Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>

johnml1135 · 2026-07-02T21:12:08Z

Let's hold off until I try a few more algorithm changes...

johnml1135 force-pushed the hc-rustify branch 3 times, most recently from 6c91b71 to e910a55 Compare July 1, 2026 18:41

johnml1135 changed the title ~~RUSTIFY: allocation-optimized HermitCrab parsing (flat/COW Shape + FST)~~ RUSTIFY: allocation- and CPU-optimized HermitCrab parsing (flat/COW Shape + FST) Jul 2, 2026

johnml1135 force-pushed the hc-rustify branch from 12e88b4 to ea72cd7 Compare July 2, 2026 17:04

johnml1135 requested review from ddaspit and jtmaxwell3 July 2, 2026 17:58

johnml1135 mentioned this pull request Jul 2, 2026

Complexity cap: bound pathological HermitCrab parses (follow-up to rustify) #448

Draft

4 tasks

johnml1135 marked this pull request as draft July 2, 2026 21:11

johnml1135 mentioned this pull request Jul 3, 2026

HermitCrab parse optimization: memoize analysis redundancy, fix corpus scheduling, bound memory #451

Draft

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

RUSTIFY: allocation- and CPU-optimized HermitCrab parsing (flat/COW Shape + FST)#446

RUSTIFY: allocation- and CPU-optimized HermitCrab parsing (flat/COW Shape + FST)#446
johnml1135 wants to merge 1 commit into
masterfrom
hc-rustify

johnml1135 commented Jul 1, 2026 •

edited

Loading

Uh oh!

codecov-commenter commented Jul 1, 2026 •

edited

Loading

Uh oh!

johnml1135 commented Jul 1, 2026

Uh oh!

johnml1135 commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Uh oh!

Conversation

johnml1135 commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this is

Performance vs origin/master

Correctness

Review pass on the squashed diff

Review by change group

⚠️ Breaking API change

Approaches tried and discarded (with why)

Pared back for this PR

Uh oh!

codecov-commenter commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

johnml1135 commented Jul 1, 2026

Uh oh!

johnml1135 commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

johnml1135 commented Jul 1, 2026 •

edited

Loading

Performance vs `origin/master`

codecov-commenter commented Jul 1, 2026 •

edited

Loading