Skip to content

RUSTIFY: allocation- and CPU-optimized HermitCrab parsing (flat/COW Shape + FST)#446

Draft
johnml1135 wants to merge 1 commit into
masterfrom
hc-rustify
Draft

RUSTIFY: allocation- and CPU-optimized HermitCrab parsing (flat/COW Shape + FST)#446
johnml1135 wants to merge 1 commit into
masterfrom
hc-rustify

Conversation

@johnml1135

@johnml1135 johnml1135 commented Jul 1, 2026

Copy link
Copy Markdown
Collaborator

What this is

Allocation- and CPU-optimized HermitCrab morphological parsing so bulk "Parse All Words" scales further. Parse results are unchanged (byte-identical to origin/master) — guarded by the existing HermitCrab regression suite plus new unit tests, and by a per-word analysis-signature diff against master on two real grammars (Sena, Indonesian). Rebased onto latest master; single commit, 109 files. No unrelated work, no profiling scaffolding, no dev-journal docs — everything is documented here.

Performance vs origin/master

Measured with a throwaway benchmark harness (not part of this PR — see "Pared back" below), 800-word runs, Server GC, MaxUnapplications=5 (the default), across dop = 1/2/4/6/8/12/16 (outer Parallel.ForEach across words; this branch's Morpher also supports MaxDegreeOfParallelism=1 to parallelize once across words and run each word's own cascade single-threaded — the FieldWorks "parallelize across words myself" pattern used here). Master has no equivalent per-word parallelism switch, so master's numbers reflect its own default (always-parallel-per-word) behavior at each outer dop.

Sena (real production grammar, 800 words):

dop branch words/sec master words/sec speedup branch MB master MB less memory
1 89 57 1.56x 11,979 22,689 47.2%
2 201 147 1.37x 11,966 22,648 47.2%
4 324 210 1.54x 11,966 22,673 47.2%
6 417 257 1.62x 11,966 22,673 47.2%
8 460 269 1.71x 11,966 22,653 47.2%
12 517 307 1.68x 11,966 22,636 47.1%
16 547 334 1.64x 11,966 22,542 46.9%

Indonesian (real production grammar, 121 distinct words tiled to 800 for a stable dop=16 reading):

dop branch words/sec master words/sec speedup branch MB master MB less memory
1 158 124 1.27x 5,284 14,748 64.2%
2 432 254 1.70x 5,280 14,773 64.3%
4 689 290 2.38x 5,280 14,760 64.2%
6 771 324 2.38x 5,280 14,740 64.2%
8 846 322 2.63x 5,280 14,738 64.2%
12 931 337 2.76x 5,280 14,724 64.1%
16 778 343 2.27x 5,280 14,770 64.2%

Summary: 1.3x–2.8x faster and 47–64% less memory across both grammars and every thread count tested, with zero change to parse/generate output. (The Indonesian dop=16 figure dips versus dop=12 — noise from thread-scheduling overhead dominating on this grammar's lighter per-word cost at high fan-out, not a regression; every other dop point scales monotonically.)

Two internal CPU-profiling rounds (via dotnet-trace, not included in this PR) additionally cut CPU time ~39.5% on a profiled dop=1 workload by fixing two costs invisible to allocation-byte counting: Array.CreateInstanceMDArray (a true multidimensional Register<TOffset>[,] flattened to Register<TOffset>[]) and the CLR's identity-hash fallback (RuntimeHelpers.GetHashCode) on several hot dictionary-key types (State, Feature/IDBearerBase, FeatureValue, ShapeNode) — fixed with cheap, correctness-preserving GetHashCode() overrides (Equals() left untouched everywhere, so identity semantics are unchanged).

Correctness

  • 827 SIL.Machine + 63 HermitCrab tests pass; CI green (ubuntu + windows).
  • Per-word analysis-signature diff vs origin/master: 0 differences on both Sena (60-word uncapped MaxUnapplications=0 subset — the combinatorially expensive setting) and the full Indonesian word list (uncapped).
  • The two highest-risk defects from the original design have targeted, fix-sensitive regression tests: the 64-symbol ulong-mask boundary (fails without the fix) and COW source-corruption (asserts the frozen source is byte-for-byte unchanged after a clone mutation). Plus a 50×250 concurrent-determinism stress test.

Review pass on the squashed diff

Before finalizing, ran a full correctness/testability/threading review over the whole squashed diff (109 files) rather than trusting the incremental per-commit gates alone. Found and fixed three real bugs that had not surfaced in any prior gate (the test grammars happen not to trigger them):

  1. CombinationRuleCascade's memoization was unsound. A "skip already-expanded words" optimization deduped on Word.ValueEquals, which deliberately excludes per-rule application/unapplication counts — but those counts are exactly what MaxApplicationCount gating reads. Two DFS paths through the combinatorial rule search could converge on value-equal words with different remaining rule "headroom," and the memo would silently drop the second path's still-reachable results. Worse, since ParallelCombinationRuleCascade (used whenever MaxDegreeOfParallelism != 1) had no equivalent memo, output for MorphologicalRuleOrder.Unordered grammars with MaxApplicationCount > 1 could depend on a purely-performance threading knob. Reverted to master's unconditional exploration — the safest fix given the subtlety, consistent with this branch's "only land measured wins" discipline.
  2. FeatureStruct.EnsureFlat had an unsynchronized lazy-init race. Its bit-packed unify-cache fields were plain (non-volatile) writes, but frozen FeatureStructs are read concurrently from every parallel FST traversal thread — a classic "reader observes the ready-flag before the array write is visible" hazard. Fixed with Volatile.Read/Volatile.Write safe publication (no lock needed; redundant concurrent computation is harmless since it's deterministic).
  3. A frozen ShapeNode's frozen flag was lost on detach. Shape.Remove()/Clear() reset a node's frozen bit to false on removal, silently violating "once frozen, always frozen" (master's original ShapeNode.IsFrozen was a permanent bool). Currently unreachable in production code paths (nothing removes an individually-frozen node from a still-unfrozen shape today) but had no test coverage and no guard against a future caller hitting it. Fixed so a detached node carries its frozen state forward.

Also fixed, as pure efficiency cleanup surfaced by the same review (no behavior change, confirmed by the signature diff above):

  • Shape.ValueEquals (the equality CombinationRuleCascade's dedup and every rule-cascade HashSet<Word> use) forced a full copy-on-write inflate on every comparison — defeating the COW optimization's whole point on a hot dedup path. Switched to comparing via the already-lazy int-offset projection.
  • Three sites in the phonological rule hot path (RewriteRuleSpec.MatchSubrule, SynthesisMetathesisRuleSpec.ApplyRhs, IterativePhonologicalPatternRule.Apply) re-resolved the same ShapeNode via a dictionary lookup more than once per iteration; hoisted/deduplicated.
  • TraversalMethodBase's two ExecuteCommands overloads (kept separate deliberately, to avoid boxing a List<T>.Enumerator on the hot arc-advance path) had duplicated register-copy bodies; factored into a shared private method.
  • A stale .gitignore comment pointed at a benchmark harness not present in this repo.

Known, accepted limitation (not fixed, flagged for awareness): SymbolicFeature.FlatIndex is assigned from a single process-wide counter, never scoped per FeatureSystem or reclaimed. In a process that loads/reloads many grammars over a long lifetime (e.g. a long-running FieldWorks session), later-loaded grammars' flat-unify arrays can grow larger than their own feature count warrants, partially eroding this exact optimization over time. Rescoping it safely would mean assigning indices during FeatureSystem.Freeze(), which conflicts with the deliberate "features can be used before their owning system freezes" design this branch relies on elsewhere — judged not worth the risk for this PR. Also noted and judged not a regression: two Metathesis rule specs resolve pattern group captures without a .Success check before use, which can throw if a metathesis group sits inside an optional/alternation pattern construct; master's original ShapeNode-based code had the identical gap (a NullReferenceException instead of a KeyNotFoundException today) — carried forward unchanged, not introduced by this branch.

Review by change group

Reviewers can approve group-by-group; each maps to a coherent slice of the diff.

# Group Files What & why
A Flat Shape/ShapeNode backing Annotations/Shape.cs, ShapeNode.cs, Annotation.cs, DataStructures/DataStructuresExtensions.cs Shape owns nodes in parallel int-linked backing arrays instead of an OrderedBidirList; ShapeNode becomes an (Owner, Index) handle. Foundation for array-copy clone + int FST offset.
B Copy-on-write Shape + BidirList flattening Annotations/Shape.cs (COW), AnnotationList.cs, DataStructures/BidirList.cs, BidirListNode.cs Cloning a frozen shape shares the source via the int projection, inflating lazily only on handle/mutation (cuts Word.Clone ~60%). Skip-list towers grow on demand + inline level 0.
C FeatureStruct ulong flat-unify FeatureModel/FeatureStruct.cs, SymbolicFeature.cs, SymbolicFeatureValue.cs, UlongSymbolicFeatureValueFlags.cs, FiniteState/Input.cs Frozen structs lazily cache a bit-packed ulong[] vector; Input.Matches takes a bitwise fast path for the simple/no-variable case, falling back to the original engine otherwise (→ byte-identical).
D int-offset FST traversal FiniteState/Fst.cs, TraversalMethodBase.cs, the 4 traversal method + 3 traversal instance files, ITraversalMethod.cs, VisitedStates.cs (new) FST binds as Fst<Word,int> (dense offset via the lazy AnnotationList<int> projection), with a reusable per-Transduce register scaffold and an inline-ulong VisitedStates epsilon-loop set.
E HermitCrab rule-spec adaptations ~64 files under Morphology.HermitCrab/ (MorphologicalRules/*, PhonologicalRules/*, Word.cs, HermitCrabExtensions.cs, loaders…) Flip every Pattern/Match/Matcher from ShapeNode to int offsets; resolve int → ShapeNode at rule-RHS navigation sites. Lands atomically (can't be green mid-flip).
F CPU-cost fixes State.cs, DataStructures/IDBearerBase.cs, FeatureValue.cs, ShapeNode.cs, DataStructures/BidirList.cs, FeatureModel/FeatureStruct.cs, StringFeatureValue.cs Cheap GetHashCode() overrides replacing the CLR identity-hash fallback (Equals() untouched everywhere); Register<TOffset>[,][]; thread-static Random instead of per-instance; ordinal instead of culture-aware string sorts. Found via a real dotnet-trace CPU profile, not guesswork.
G Single-thread option Morpher.cs, AnalysisStratumRule.cs, AnalysisAffixTemplateRule.cs, Rules/CombinationRuleCascade.cs, ParallelCombinationRuleCascade.cs MaxDegreeOfParallelism == 1 selects the serial cascade; the parallel cascade honors the DOP cap.
I Tests Annotations/AnnotationTests.cs (new), FeatureModel/FeatureStructTests.cs (new), FiniteState/FstTests.cs, and rule-test updates New COW/flat-structure + 64-symbol-mask regression + single-thread-vs-parallel equivalence + concurrent-determinism tests; existing rule tests migrated to the int-offset API.

⚠️ Breaking API change

The morphological-rule pattern type changed Pattern<Word, ShapeNode>Pattern<Word, int> (affects AffixProcessAllomorph.Lhs, AnalysisMorphologicalTransform, MorphologicalTransform.GenerateShape, Match<Word,ShapeNode>, and the public Language.CompileAnalysisRule/CompileSynthesisRule return types IRule<Word,ShapeNode>IRule<Word,int>). This breaks only code that constructs HC rules programmatically (FieldWorks' HCLoader.cs — a mechanical ShapeNode→int migration when it bumps the package). The Morpher/ParseWord/AnalyzeWord and XML grammar-load APIs are unchanged, so load-and-parse consumers are unaffected. Shape/ShapeNode also moved from deriving the concrete OrderedBidirList(Node) to implementing the pre-existing interface (all members preserved; nothing depends on the concrete base) — practically non-breaking. Suggest releasing as a new major version.

Approaches tried and discarded (with why)

Every idea below was implemented (or reasoned through) and measured against this branch's own "only land measured wins" discipline — full 827+63 test suites plus an uncapped byte-identical signature diff for anything touching hash/traversal order. Kept for the record so a future session doesn't re-attempt a dead end:

  • Pack Register<TOffset> into a single long. Unsafe.SizeOf<Register<int>>() is already 8 bytes — identical to a packed long. Zero allocation win available; any CPU win would need type-specializing a public generic API used elsewhere in SIL.Machine with non-int offsets. Not attempted.
  • FST output delta-log instead of per-branch Word.Clone. Confirmed dead code for HermitCrab's actual usage: Matcher.Compile() never passes an operations argument, so the FST-transducer code path this targets never executes during AnalyzeWord/ParseWord (HermitCrab always takes the FSA-only path). Not attempted.
  • TagMapCommand as a readonly struct. Implemented, verified byte-identical, then measured via a wall-clock micro-benchmark: 534.93ms vs 536.00ms baseline — noise-level, no measurable win (the mutation this targets only happens once per grammar load, not per word). Reverted.
  • Skip lazy/on-the-fly determinization. The hot matchers need every match (AllSubmatches), not just the first, so collapsing the frontier saves little — and reordering result production would reopen the MaxUnapplications-cap byte-identical question this PR depends on. Rejected by reasoning, not implemented.
  • Per-word arena/pool for traversal instances. Already tried and reverted in an earlier RUSTIFY phase (see the comment at Fst.cs): a pooled traversal method survives a Gen0 collection, promotes to Gen2, and the resulting stop-the-world Gen2 GC serializes parallel parsing. Not re-attempted.
  • Compile the frozen FST into CSR (compressed-sparse-row) form. Gated on a CPU profile showing arc-chase as a top cost. Two independent post-fix profiles show no arc-traversal frame anywhere near the top (the closest sit at 0.1–0.4% each, an order of magnitude below the real Add a couple of links to papers #1 cost). Formally closed as not-justified against this workload.
  • Memoize rule application across the analysis cascade (the highest-ceiling idea investigated: ~48% of intermediate Word objects during analysis are value-duplicates of an already-seen one). Implemented exactly as designed, gated correctly (build, tests, allocation counters all clean), but measured as a reproducible ~3.9% wall-clock regression rather than the predicted win — likely because the memo's frozen-input-only restriction misses most of the duplication the ~48% figure was measuring (RuleCascade.ApplyRule's input is often not yet frozen at the point of each call). Reverted. The prerequisite correctness fix this work surfaced (SyntacticFeatureStruct was being mutated after Word.Freeze() at 8 call sites, including one real word-aliasing bug where two loop iterations shared and mutated the same FeatureStruct instance) landed on its own merits, independent of the memoization outcome.

Pared back for this PR

Removed the profiling apparatus and scaffolding the optimization was developed with (preserved on a local branch, not this repo): allocation-instrumentation (MorpherStatistics/FstStatistics + hot-path call sites), dotnet-trace profiling harnesses and analysis scripts, the [Explicit] benchmarks used for this PR's own perf table above, and the RUSTIFY dev-journal docs (fable-improvements.md and similar) — everything they'd have said is now in this description instead. FeatureStruct.FlatUnifyEnabled stays internal (test-hook only). The out-of-process HermitCrab worker prototyped during this work moved to the FieldWorks repo.

🤖 Generated with Claude Code


This change is Reviewable

@codecov-commenter

codecov-commenter commented Jul 1, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 83.97669% with 220 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.65%. Comparing base (7835067) to head (ea72cd7).
⚠️ Report is 2 commits behind head on master.

Files with missing lines Patch % Lines
src/SIL.Machine/Annotations/Shape.cs 78.59% 52 Missing and 12 partials ⚠️
src/SIL.Machine/FiniteState/VisitedStates.cs 19.35% 23 Missing and 2 partials ⚠️
...Machine/DataStructures/DataStructuresExtensions.cs 57.50% 16 Missing and 1 partial ⚠️
src/SIL.Machine/FeatureModel/FeatureStruct.cs 86.71% 16 Missing and 1 partial ⚠️
src/SIL.Machine/Annotations/ShapeNode.cs 62.50% 11 Missing and 4 partials ⚠️
.../FiniteState/NondeterministicFstTraversalMethod.cs 67.39% 15 Missing ⚠️
...hine.Morphology.HermitCrab/HermitCrabExtensions.cs 72.22% 8 Missing and 2 partials ⚠️
src/SIL.Machine/FiniteState/Fst.cs 83.92% 8 Missing and 1 partial ⚠️
src/SIL.Machine/FiniteState/Input.cs 73.07% 6 Missing and 1 partial ⚠️
.../FiniteState/NondeterministicFsaTraversalMethod.cs 82.14% 4 Missing and 1 partial ⚠️
... and 16 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #446      +/-   ##
==========================================
+ Coverage   73.30%   73.65%   +0.35%     
==========================================
  Files         443      444       +1     
  Lines       37203    37929     +726     
  Branches     5110     5253     +143     
==========================================
+ Hits        27272    27937     +665     
- Misses       8805     8856      +51     
- Partials     1126     1136      +10     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@johnml1135 johnml1135 force-pushed the hc-rustify branch 3 times, most recently from 6c91b71 to e910a55 Compare July 1, 2026 18:41
@johnml1135

Copy link
Copy Markdown
Collaborator Author

Added tests to cover the largest patch-coverage gaps in the rewritten FST code (no coverage regression):

  • FstTests.TransduceNondeterministic_MatchesDeterminized — transduces a raw (non-determinized) FST directly and asserts it accepts the same outputs as its Determinize()d form. This exercises NondeterministicFstTraversalMethod (was 0% → ~71%) and NondeterministicFstTraversalInstance (0% → ~86%), which no existing test hit because every other transduce path determinizes first.
  • AnnotationTests.LargeShape_GrowsBackingArrays_PreservesLinkAndCloneIntegrity — builds a 50-node shape (crossing several backing-array doublings), then asserts forward/backward link integrity, NodeAt/OffsetOf handle round-trips, mid-list insert/remove, and clone value-equality.

Project coverage was already non-regressing (+0.07% vs base); these target the new/rewritten lines specifically. Note: raw nondeterministic transduce of an epsilon-input FST diverges from the determinized result (production always determinizes first), so that path is intentionally left out of the equivalence oracle.

@johnml1135 johnml1135 changed the title RUSTIFY: allocation-optimized HermitCrab parsing (flat/COW Shape + FST) RUSTIFY: allocation- and CPU-optimized HermitCrab parsing (flat/COW Shape + FST) Jul 2, 2026
Squashed history of the RUSTIFY performance branch. Full rationale,
methodology, benchmark data, and rejected-approach audit trail live in the
PR description; see PR #446 for the complete writeup.

Summary of what changed:
- Flat/COW Shape and ShapeNode backing (parallel int-linked arrays instead
  of an OrderedBidirList; ShapeNode becomes an (Owner, Index) handle).
- Copy-on-write Shape cloning: a clone of a frozen shape shares the
  source's backing until first mutation.
- FeatureStruct bit-packed ulong flat-unify fast path for the common
  simple/no-variable case, falling back to the original engine otherwise.
- int-offset FST traversal (Fst<Word,int>, flattened Register<TOffset>[,]
  to a 1-D array, reusable per-Transduce register scaffold).
- Every HermitCrab rule-spec file migrated from ShapeNode-offset to
  int-offset pattern matching.
- Configurable MaxDegreeOfParallelism (1 = fully single-threaded, for
  callers that parallelize across words themselves).
- Cheap GetHashCode overrides (State, IDBearerBase, FeatureValue,
  ShapeNode) replacing the CLR's identity-hash fallback on several hot
  dictionary/hashset key paths; StringComparer.Ordinal on hot sorts that
  were paying culture-aware comparison; a shared per-thread Random instead
  of one per BidirList instance; a filtered-annotation-view cache on
  frozen AnnotationLists.
- SyntacticFeatureStruct mutate-after-freeze correctness hardening
  (8 sites converted to clone-then-reassign; caught a real aliasing bug
  in the synthesis PriorityUnion path along the way).
- Post-squash review pass fixed a memoization soundness bug in
  CombinationRuleCascade (reverted to master's unconditional exploration),
  an unsynchronized lazy-init race in FeatureStruct.EnsureFlat (fixed with
  Volatile.Read/Write), and a frozen-flag-lost-on-detach bug in ShapeNode,
  plus several efficiency/cleanup fixes.

Parse results are unchanged (byte-identical to master, verified via the
full HermitCrab regression suite plus per-word signature diffs against
master on Sena and Indonesian).

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
@johnml1135

Copy link
Copy Markdown
Collaborator Author

Let's hold off until I try a few more algorithm changes...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants