RUSTIFY: allocation- and CPU-optimized HermitCrab parsing (flat/COW Shape + FST)#446
RUSTIFY: allocation- and CPU-optimized HermitCrab parsing (flat/COW Shape + FST)#446johnml1135 wants to merge 1 commit into
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #446 +/- ##
==========================================
+ Coverage 73.30% 73.65% +0.35%
==========================================
Files 443 444 +1
Lines 37203 37929 +726
Branches 5110 5253 +143
==========================================
+ Hits 27272 27937 +665
- Misses 8805 8856 +51
- Partials 1126 1136 +10 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
6c91b71 to
e910a55
Compare
|
Added tests to cover the largest patch-coverage gaps in the rewritten FST code (no coverage regression):
Project coverage was already non-regressing (+0.07% vs base); these target the new/rewritten lines specifically. Note: raw nondeterministic transduce of an epsilon-input FST diverges from the determinized result (production always determinizes first), so that path is intentionally left out of the equivalence oracle. |
Squashed history of the RUSTIFY performance branch. Full rationale, methodology, benchmark data, and rejected-approach audit trail live in the PR description; see PR #446 for the complete writeup. Summary of what changed: - Flat/COW Shape and ShapeNode backing (parallel int-linked arrays instead of an OrderedBidirList; ShapeNode becomes an (Owner, Index) handle). - Copy-on-write Shape cloning: a clone of a frozen shape shares the source's backing until first mutation. - FeatureStruct bit-packed ulong flat-unify fast path for the common simple/no-variable case, falling back to the original engine otherwise. - int-offset FST traversal (Fst<Word,int>, flattened Register<TOffset>[,] to a 1-D array, reusable per-Transduce register scaffold). - Every HermitCrab rule-spec file migrated from ShapeNode-offset to int-offset pattern matching. - Configurable MaxDegreeOfParallelism (1 = fully single-threaded, for callers that parallelize across words themselves). - Cheap GetHashCode overrides (State, IDBearerBase, FeatureValue, ShapeNode) replacing the CLR's identity-hash fallback on several hot dictionary/hashset key paths; StringComparer.Ordinal on hot sorts that were paying culture-aware comparison; a shared per-thread Random instead of one per BidirList instance; a filtered-annotation-view cache on frozen AnnotationLists. - SyntacticFeatureStruct mutate-after-freeze correctness hardening (8 sites converted to clone-then-reassign; caught a real aliasing bug in the synthesis PriorityUnion path along the way). - Post-squash review pass fixed a memoization soundness bug in CombinationRuleCascade (reverted to master's unconditional exploration), an unsynchronized lazy-init race in FeatureStruct.EnsureFlat (fixed with Volatile.Read/Write), and a frozen-flag-lost-on-detach bug in ShapeNode, plus several efficiency/cleanup fixes. Parse results are unchanged (byte-identical to master, verified via the full HermitCrab regression suite plus per-word signature diffs against master on Sena and Indonesian). Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
|
Let's hold off until I try a few more algorithm changes... |
What this is
Allocation- and CPU-optimized HermitCrab morphological parsing so bulk "Parse All Words" scales further. Parse results are unchanged (byte-identical to
origin/master) — guarded by the existing HermitCrab regression suite plus new unit tests, and by a per-word analysis-signature diff against master on two real grammars (Sena, Indonesian). Rebased onto latestmaster; single commit, 109 files. No unrelated work, no profiling scaffolding, no dev-journal docs — everything is documented here.Performance vs
origin/masterMeasured with a throwaway benchmark harness (not part of this PR — see "Pared back" below), 800-word runs, Server GC,
MaxUnapplications=5(the default), acrossdop= 1/2/4/6/8/12/16 (outerParallel.ForEachacross words; this branch'sMorpheralso supportsMaxDegreeOfParallelism=1to parallelize once across words and run each word's own cascade single-threaded — the FieldWorks "parallelize across words myself" pattern used here). Master has no equivalent per-word parallelism switch, so master's numbers reflect its own default (always-parallel-per-word) behavior at each outerdop.Sena (real production grammar, 800 words):
Indonesian (real production grammar, 121 distinct words tiled to 800 for a stable dop=16 reading):
Summary: 1.3x–2.8x faster and 47–64% less memory across both grammars and every thread count tested, with zero change to parse/generate output. (The Indonesian dop=16 figure dips versus dop=12 — noise from thread-scheduling overhead dominating on this grammar's lighter per-word cost at high fan-out, not a regression; every other dop point scales monotonically.)
Two internal CPU-profiling rounds (via
dotnet-trace, not included in this PR) additionally cut CPU time ~39.5% on a profiled dop=1 workload by fixing two costs invisible to allocation-byte counting:Array.CreateInstanceMDArray(a true multidimensionalRegister<TOffset>[,]flattened toRegister<TOffset>[]) and the CLR's identity-hash fallback (RuntimeHelpers.GetHashCode) on several hot dictionary-key types (State,Feature/IDBearerBase,FeatureValue,ShapeNode) — fixed with cheap, correctness-preservingGetHashCode()overrides (Equals()left untouched everywhere, so identity semantics are unchanged).Correctness
origin/master: 0 differences on both Sena (60-word uncappedMaxUnapplications=0subset — the combinatorially expensive setting) and the full Indonesian word list (uncapped).ulong-mask boundary (fails without the fix) and COW source-corruption (asserts the frozen source is byte-for-byte unchanged after a clone mutation). Plus a 50×250 concurrent-determinism stress test.Review pass on the squashed diff
Before finalizing, ran a full correctness/testability/threading review over the whole squashed diff (109 files) rather than trusting the incremental per-commit gates alone. Found and fixed three real bugs that had not surfaced in any prior gate (the test grammars happen not to trigger them):
CombinationRuleCascade's memoization was unsound. A "skip already-expanded words" optimization deduped onWord.ValueEquals, which deliberately excludes per-rule application/unapplication counts — but those counts are exactly whatMaxApplicationCountgating reads. Two DFS paths through the combinatorial rule search could converge on value-equal words with different remaining rule "headroom," and the memo would silently drop the second path's still-reachable results. Worse, sinceParallelCombinationRuleCascade(used wheneverMaxDegreeOfParallelism != 1) had no equivalent memo, output forMorphologicalRuleOrder.Unorderedgrammars withMaxApplicationCount > 1could depend on a purely-performance threading knob. Reverted to master's unconditional exploration — the safest fix given the subtlety, consistent with this branch's "only land measured wins" discipline.FeatureStruct.EnsureFlathad an unsynchronized lazy-init race. Its bit-packed unify-cache fields were plain (non-volatile) writes, but frozenFeatureStructs are read concurrently from every parallel FST traversal thread — a classic "reader observes the ready-flag before the array write is visible" hazard. Fixed withVolatile.Read/Volatile.Writesafe publication (no lock needed; redundant concurrent computation is harmless since it's deterministic).ShapeNode's frozen flag was lost on detach.Shape.Remove()/Clear()reset a node's frozen bit tofalseon removal, silently violating "once frozen, always frozen" (master's originalShapeNode.IsFrozenwas a permanent bool). Currently unreachable in production code paths (nothing removes an individually-frozen node from a still-unfrozen shape today) but had no test coverage and no guard against a future caller hitting it. Fixed so a detached node carries its frozen state forward.Also fixed, as pure efficiency cleanup surfaced by the same review (no behavior change, confirmed by the signature diff above):
Shape.ValueEquals(the equalityCombinationRuleCascade's dedup and every rule-cascadeHashSet<Word>use) forced a full copy-on-write inflate on every comparison — defeating the COW optimization's whole point on a hot dedup path. Switched to comparing via the already-lazy int-offset projection.RewriteRuleSpec.MatchSubrule,SynthesisMetathesisRuleSpec.ApplyRhs,IterativePhonologicalPatternRule.Apply) re-resolved the sameShapeNodevia a dictionary lookup more than once per iteration; hoisted/deduplicated.TraversalMethodBase's twoExecuteCommandsoverloads (kept separate deliberately, to avoid boxing aList<T>.Enumeratoron the hot arc-advance path) had duplicated register-copy bodies; factored into a shared private method..gitignorecomment pointed at a benchmark harness not present in this repo.Known, accepted limitation (not fixed, flagged for awareness):
SymbolicFeature.FlatIndexis assigned from a single process-wide counter, never scoped perFeatureSystemor reclaimed. In a process that loads/reloads many grammars over a long lifetime (e.g. a long-running FieldWorks session), later-loaded grammars' flat-unify arrays can grow larger than their own feature count warrants, partially eroding this exact optimization over time. Rescoping it safely would mean assigning indices duringFeatureSystem.Freeze(), which conflicts with the deliberate "features can be used before their owning system freezes" design this branch relies on elsewhere — judged not worth the risk for this PR. Also noted and judged not a regression: twoMetathesisrule specs resolve pattern group captures without a.Successcheck before use, which can throw if a metathesis group sits inside an optional/alternation pattern construct; master's originalShapeNode-based code had the identical gap (aNullReferenceExceptioninstead of aKeyNotFoundExceptiontoday) — carried forward unchanged, not introduced by this branch.Review by change group
Reviewers can approve group-by-group; each maps to a coherent slice of the diff.
Annotations/Shape.cs,ShapeNode.cs,Annotation.cs,DataStructures/DataStructuresExtensions.csShapeowns nodes in parallel int-linked backing arrays instead of anOrderedBidirList;ShapeNodebecomes an(Owner, Index)handle. Foundation for array-copy clone + int FST offset.Annotations/Shape.cs(COW),AnnotationList.cs,DataStructures/BidirList.cs,BidirListNode.csWord.Clone~60%). Skip-list towers grow on demand + inline level 0.FeatureModel/FeatureStruct.cs,SymbolicFeature.cs,SymbolicFeatureValue.cs,UlongSymbolicFeatureValueFlags.cs,FiniteState/Input.csulong[]vector;Input.Matchestakes a bitwise fast path for the simple/no-variable case, falling back to the original engine otherwise (→ byte-identical).FiniteState/Fst.cs,TraversalMethodBase.cs, the 4 traversal method + 3 traversal instance files,ITraversalMethod.cs,VisitedStates.cs(new)Fst<Word,int>(dense offset via the lazyAnnotationList<int>projection), with a reusable per-Transduceregister scaffold and an inline-ulongVisitedStatesepsilon-loop set.Morphology.HermitCrab/(MorphologicalRules/*,PhonologicalRules/*,Word.cs,HermitCrabExtensions.cs, loaders…)Pattern/Match/MatcherfromShapeNodetointoffsets; resolveint → ShapeNodeat rule-RHS navigation sites. Lands atomically (can't be green mid-flip).State.cs,DataStructures/IDBearerBase.cs,FeatureValue.cs,ShapeNode.cs,DataStructures/BidirList.cs,FeatureModel/FeatureStruct.cs,StringFeatureValue.csGetHashCode()overrides replacing the CLR identity-hash fallback (Equals()untouched everywhere);Register<TOffset>[,]→[]; thread-staticRandominstead of per-instance; ordinal instead of culture-aware string sorts. Found via a realdotnet-traceCPU profile, not guesswork.Morpher.cs,AnalysisStratumRule.cs,AnalysisAffixTemplateRule.cs,Rules/CombinationRuleCascade.cs,ParallelCombinationRuleCascade.csMaxDegreeOfParallelism == 1selects the serial cascade; the parallel cascade honors the DOP cap.Annotations/AnnotationTests.cs(new),FeatureModel/FeatureStructTests.cs(new),FiniteState/FstTests.cs, and rule-test updatesThe morphological-rule pattern type changed
Pattern<Word, ShapeNode>→Pattern<Word, int>(affectsAffixProcessAllomorph.Lhs,AnalysisMorphologicalTransform,MorphologicalTransform.GenerateShape,Match<Word,ShapeNode>, and the publicLanguage.CompileAnalysisRule/CompileSynthesisRulereturn typesIRule<Word,ShapeNode>→IRule<Word,int>). This breaks only code that constructs HC rules programmatically (FieldWorks'HCLoader.cs— a mechanicalShapeNode→intmigration when it bumps the package). TheMorpher/ParseWord/AnalyzeWordand XML grammar-load APIs are unchanged, so load-and-parse consumers are unaffected.Shape/ShapeNodealso moved from deriving the concreteOrderedBidirList(Node)to implementing the pre-existing interface (all members preserved; nothing depends on the concrete base) — practically non-breaking. Suggest releasing as a new major version.Approaches tried and discarded (with why)
Every idea below was implemented (or reasoned through) and measured against this branch's own "only land measured wins" discipline — full 827+63 test suites plus an uncapped byte-identical signature diff for anything touching hash/traversal order. Kept for the record so a future session doesn't re-attempt a dead end:
Register<TOffset>into a singlelong.Unsafe.SizeOf<Register<int>>()is already 8 bytes — identical to a packed long. Zero allocation win available; any CPU win would need type-specializing a public generic API used elsewhere inSIL.Machinewith non-intoffsets. Not attempted.Word.Clone. Confirmed dead code for HermitCrab's actual usage:Matcher.Compile()never passes anoperationsargument, so the FST-transducer code path this targets never executes duringAnalyzeWord/ParseWord(HermitCrab always takes the FSA-only path). Not attempted.TagMapCommandas areadonly struct. Implemented, verified byte-identical, then measured via a wall-clock micro-benchmark: 534.93ms vs 536.00ms baseline — noise-level, no measurable win (the mutation this targets only happens once per grammar load, not per word). Reverted.AllSubmatches), not just the first, so collapsing the frontier saves little — and reordering result production would reopen theMaxUnapplications-cap byte-identical question this PR depends on. Rejected by reasoning, not implemented.Fst.cs): a pooled traversal method survives a Gen0 collection, promotes to Gen2, and the resulting stop-the-world Gen2 GC serializes parallel parsing. Not re-attempted.Wordobjects during analysis are value-duplicates of an already-seen one). Implemented exactly as designed, gated correctly (build, tests, allocation counters all clean), but measured as a reproducible ~3.9% wall-clock regression rather than the predicted win — likely because the memo's frozen-input-only restriction misses most of the duplication the ~48% figure was measuring (RuleCascade.ApplyRule's input is often not yet frozen at the point of each call). Reverted. The prerequisite correctness fix this work surfaced (SyntacticFeatureStructwas being mutated afterWord.Freeze()at 8 call sites, including one real word-aliasing bug where two loop iterations shared and mutated the sameFeatureStructinstance) landed on its own merits, independent of the memoization outcome.Pared back for this PR
Removed the profiling apparatus and scaffolding the optimization was developed with (preserved on a local branch, not this repo): allocation-instrumentation (
MorpherStatistics/FstStatistics+ hot-path call sites),dotnet-traceprofiling harnesses and analysis scripts, the[Explicit]benchmarks used for this PR's own perf table above, and the RUSTIFY dev-journal docs (fable-improvements.mdand similar) — everything they'd have said is now in this description instead.FeatureStruct.FlatUnifyEnabledstaysinternal(test-hook only). The out-of-process HermitCrab worker prototyped during this work moved to the FieldWorks repo.🤖 Generated with Claude Code
This change is