From b3fd2b55bf5bfd5a6a32129e19b7770240d6ade3 Mon Sep 17 00:00:00 2001
From: John Lambert
Date: Thu, 2 Jul 2026 14:16:33 -0400
Subject: [PATCH 1/6] Complexity cap Phase 1: per-word step/timeout budget with soft-stop

Adds ParseContext, a per-ParseWord work budget (MaxParseSteps +
ParseTimeout, generous defaults shipped on) propagated through Word
exactly like CurrentTrace. Every analysis/synthesis leaf rule Apply()
checks it and returns Enumerable.Empty() on breach (soft-stop, never
throws); orchestration-level loops (AnalysisStratumRule,
AnalysisLanguageRule, Morpher.Synthesize/LexicalLookup) fast-unwind once
exhausted.

ParseWord gains a ParseDiagnostics overload reporting whether the budget
was hit and why; RerunWithDiagnostics re-parses one word with per-rule
counters to report the top offending rule.

Confirmed against a synthetic "no overt exponent" pathological rule
(HC0001-shaped: pure-copy Rhs with a high MaxApplicationCount) that
previously ran unbounded past the cascades' own input==output loop guard.

See complexity-cap.md for the full design (Layers 1-3).
---
 complexity-cap.md | 410 ++++++++++++++++++
 .../AnalysisAffixTemplateRule.cs | 3 +
 .../AnalysisLanguageRule.cs | 3 +
 .../AnalysisStratumRule.cs | 5 +
 .../Morpher.cs | 103 ++++-
 .../AnalysisAffixProcessRule.cs | 3 +
 .../AnalysisCompoundingRule.cs | 3 +
 .../AnalysisRealizationalAffixProcessRule.cs | 3 +
 .../SynthesisAffixProcessRule.cs | 3 +
 .../SynthesisCompoundingRule.cs | 3 +
 .../SynthesisRealizationalAffixProcessRule.cs | 3 +
 .../ParseContext.cs | 102 +++++
 .../ParseDiagnostics.cs | 47 ++
 .../AnalysisMetathesisRule.cs | 3 +
 .../PhonologicalRules/AnalysisRewriteRule.cs | 11 +
 .../SynthesisMetathesisRule.cs | 3 +
 .../PhonologicalRules/SynthesisRewriteRule.cs | 3 +
 .../SynthesisAffixTemplateRule.cs | 3 +
 src/SIL.Machine.Morphology.HermitCrab/Word.cs | 11 +
 .../MorpherTests.cs | 145 +++++++
 20 files changed, 868 insertions(+), 2 deletions(-)
 create mode 100644 complexity-cap.md
 create mode 100644 src/SIL.Machine.Morphology.HermitCrab/ParseContext.cs
 create mode 100644 src/SIL.Machine.Morphology.HermitCrab/ParseDiagnostics.cs

diff --git a/complexity-cap.md b/complexity-cap.md
new file mode 100644
index 00000000..1701219e
--- /dev/null
+++ b/complexity-cap.md
@@ -0,0 +1,410 @@
+# Complexity Cap: Bounding Pathological HermitCrab Parses
+
+**Status:** Plan (not started) — sequencing and defaults decided, see §8/§10
+**Author:** drafted 2026-07-02
+**Related:** PR #446 (hc-rustify performance work), FieldWorks out-of-process HC worker (FW PR #983)
+
+**Decided (2026-07-02):**
+- Implement on top of `hc-rustify`, not master (§8).
+- Budget breach is **soft-stop** (partial results + status), never an exception (§4.4, §10.1).
+- Ship a **generous default** `MaxParseSteps`/`ParseTimeout` so naive consumers are
+  protected out of the box, not pure opt-in (§4.1, §10.2).
+- Use the real `samples/data/{indonesian,sena}-hc.xml` grammars + wordlists as the
+  calibration and regression corpus, not synthetic-only fixtures (§7, §9 Phase 0).
+
+## 1. Problem
+
+PR #446 made the core HermitCrab engine much faster, but grammar-induced blowups remain:
+certain grammar constructs — typically unbounded/multiple-application rules with no overt
+exponent, unconstrained deletion rules, and unconstrained compounding — cause the analysis
+phase to generate candidates combinatorially. A single word can take minutes to hours. No
+engine speedup fixes an exponential; the grammar must be constrained. Until grammars are
+fixed, we need:
+
+1. **Bounded runtime** — a single pathological word must never hang a parse (or a
+   FieldWorks "Parse All Words" batch).
+2. **Actionable diagnostics** — when the engine gives up, it should say *which rule(s)*
+   caused the blowup, with evidence.
+3. **A "don't do this" guide** — static analysis that flags always-wrong or
+   almost-always-wrong rule shapes, consumable by other tools (FLEx parser report, CLI).
+
+## 2. Current state (inventory of existing guardrails)
+
+All partial, none sufficient:
+
+| Guard | Where | Limitation |
+|---|---|---|
+| `AffixProcessRule.MaxApplicationCount` (default 1; XML `multipleApplication` attr raises it) | `AnalysisAffixProcessRule.Apply` checks `GetUnapplicationCount(rule) >= max` | Per-rule only. Rule A → B → A → B evades it. The `multipleApplication` attribute is precisely where pathological grammars opt into unboundedness. |
+| `Morpher.MaxStemCount` (default 2) | `AnalysisCompoundingRule` | Compounding only. |
+| `Morpher.MaxUnapplications` (default 0 = off) | `AnalysisStratumRule.Apply` output loop | Caps the *number of analyses emitted per stratum*, not the *work* spent producing them; a cascade can burn unbounded time before emitting anything. Off by default. Confusingly named given the new caps proposed below. |
+| `Morpher.DeletionReapplications` (default 0) | `AnalysisRewriteRule` | Bounds re-insertion of deleted material for *phonological rewrite rules only*. |
+| Infinite-loop check | `PermutationRuleCascade.ApplyRules` (`Comparer.Equals(input, result)`) | Only catches a rule whose output equals its own input. Two-rule cycles and monotonic *growth* (hypothesizing deleted material) sail past. |
+| `MergeEquivalentAnalyses` (default true) | `AnalysisStratumRule` | Dedup by shape; helps but doesn't bound. |
+
+**There is no timeout, no cancellation, and no work budget anywhere in HC.** `ParseWord`
+is synchronous; the `MaxUnapplications` doc comment itself mentions 30-minute words.
+
+## 3. Design overview — three layers
+
+Each layer addresses a different failure mode; do all three:
+
+- **Layer 1 — work budget (safety net):** deterministic per-word step budget with a
+  wall-clock backstop. Stops eruptions cold and produces the per-rule evidence used by
+  everything else.
+- **Layer 2 — structural bounds (prevention):** global per-word unapplication cap,
+  analysis shape-growth cap, cascade cycle detection. Converts exponential to bounded.
+- **Layer 3 — static grammar lint (guidance):** `GrammarAnalyzer` over a loaded
+  `Language`, emitting structured diagnostics with stable codes; plus a written
+  anti-pattern guide keyed to those codes.
+
+### Design principles
+
+- **Deterministic first.** A step budget fails the same way on every machine, so grammar
+  authors get a reproducible signal and tests stay stable. Wall-clock timeout is only a
+  backstop (machine-dependent; and 10k words × 20 s timeout each still erupts in batch).
+- **Cheap happy path.** PR #446 deliberately removed `MorpherStatistics` because it was
+  woven into the hot path. The budget's steady-state cost must be ~one counter increment
+  per rule application; detailed per-rule counters are collected only on a **diagnostic
+  re-run after a breach** (breaches are rare; re-running one word with counters on is
+  cheap and keeps the hot path clean).
+- **Additive API.** FieldWorks (in-process HCLoader path *and* the out-of-process worker)
+  consumes `Morpher`. All new knobs are properties with backward-compatible defaults;
+  existing `ParseWord`/`AnalyzeWord` signatures keep working.
+- **Fail soft, report loud.** A budget breach yields the analyses found so far plus an
+  explicit "gave up" status — never a silent empty result (FLEx must distinguish
+  "no parse" from "gave up") and, by default, never an exception mid-batch.
+
+## 4. Layer 1 — work budget + timeout
+
+### 4.1 Configuration (on `Morpher`, following existing property style)
+
+```csharp
+/// Max rule applications (analysis + synthesis) per ParseWord call. 0 = unlimited.
+public int MaxParseSteps { get; set; } // ships ON with a generous default; see below
+/// Wall-clock backstop per ParseWord call. Zero/infinite = disabled.
+public TimeSpan ParseTimeout { get; set; } // ships ON with a generous default; see below
+```
+
+**Default philosophy (decided):** ship generous, non-zero defaults for both, not
+opt-in-only. Rationale: most consumers (machine.py users, FLEx via HCLoader, anyone
+scripting `Morpher` directly) will never touch these knobs; a `0`/unlimited default means
+the exact failure mode this plan exists to fix — an unbounded parse — remains the
+out-of-the-box behavior. A generous cap that never fires for legitimate grammars but
+reliably kills runaway ones is strictly better than silence.
+
+Concrete numbers are calibrated in Phase 0 against the real corpus (§7), not guessed
+here, but the target shape is: run every word in `indonesian-words.txt` (121 words) and
+`sena-words.txt` (7,121 words) against their respective grammars on the rustify engine,
+take the observed max step count / max wall-clock time across that legitimate corpus,
+and set the default to a large multiple of that ceiling (e.g. 50–100×) so it is
+effectively invisible for real grammars but still finite. `ParseTimeout` defaults
+similarly, e.g. a flat few seconds per word — generous for interactive/FLEx single-word
+parses, still bounded for "Parse All Words" batches where one stuck word must not stall
+the run indefinitely.
+
+### 4.2 Per-parse context, propagated like `CurrentTrace`
+
+Compiled rule objects are shared across concurrent parses, so per-call state cannot live
+on the rules or the `Morpher`. But every rule receives the `Word`, and `Word` already
+propagates a shared reference through clones (`CurrentTrace`, Word.cs copy-ctor). Add:
+
+```csharp
+internal ParseContext ParseContext { get; set; } // on Word; reference-shared
+
+internal sealed class ParseContext
+{
+    private int _steps; // Interlocked — analysis fans out in parallel
+    private readonly long _deadlineTicks; // Stopwatch-based (netstandard2.0-safe)
+    public bool Exhausted { get; private set; }
+    public ParseExhaustionReason Reason { get; private set; } // StepBudget | Timeout
+    public bool Step() // returns false when budget is gone
+    {
+        // Interlocked.Increment; check deadline only every N (e.g. 256) steps
+    }
+    // Diagnostic mode (breach re-run only):
+    public ConcurrentDictionary RuleCounters { get; }
+}
+```
+
+Propagation rules (mirror `CurrentTrace` exactly):
+- `Word` copy-ctor copies the reference.
+- Fresh `Word` constructions inside a parse (`Morpher.LexicalLookup`, `LexicalGuess`,
+  `Word.CurrentNonHead` path at Word.cs:489, `GenerateWords` synthesis words) must
+  re-attach the context.
+- Excluded from `FreezeImpl` hashing and `ValueEquals` (like `CurrentTrace`), so dedup
+  semantics are unchanged. It is mutable state on a frozen `Word` — same precedent as
+  `CurrentTrace`.
+
+### 4.3 Check sites
+
+All in the HC assembly (the generic `SIL.Machine` cascades stay untouched — every rule
+they invoke checks, which bounds cascade recursion transitively):
+
+- `AnalysisAffixProcessRule.Apply` / `AnalysisRealizationalAffixProcessRule` /
+  `AnalysisCompoundingRule` — alongside the existing `RuleSelector` /
+  `MaxApplicationCount` early-outs.
+- `AnalysisRewriteRule.Apply` (per iteration, not just per call — one call can loop).
+- Affix template slot application.
+- Synthesis counterparts (`SynthesisAffixProcessRule` etc.) — synthesis explodes too
+  when analysis hands it thousands of candidates.
+- `Morpher.Synthesize` / `LexicalLookup` loops — check `Exhausted` between candidates so
+  the unwind is fast.
+
+On `Step() == false`: the rule returns `Enumerable.Empty()`. **This is the only
+behavior on breach — no exception path is offered for step/timeout exhaustion**, decided
+because the primary target (FieldWorks "Parse All Words") is a batch over thousands of
+words where one stuck word throwing would either kill the batch or force every caller to
+wrap every word in try/catch. Real errors (bad grammar, bugs) still throw normally via
+existing `Parallel.ForEach` exception plumbing — this only governs the "ran out of
+budget" case. The parse drains quickly and naturally once `Step()` starts returning
+false, since every rule-level early-out (§4.3) short-circuits immediately.
+
+### 4.4 Result surface
+
+```csharp
+public IEnumerable ParseWord(string word, out object trace, bool guessRoot,
+    out ParseDiagnostics diagnostics);
+
+public sealed class ParseDiagnostics
+{
+    public bool BudgetExhausted { get; }
+    public ParseExhaustionReason Reason { get; } // StepBudget | Timeout | None
+    public int StepsUsed { get; }
+    public TimeSpan Elapsed { get; }
+    /// Populated only by RerunWithDiagnostics (breach re-run).
+    public IReadOnlyList<(IHCRule Rule, int Applications)> TopRules { get; }
+}
+```
+
+- Existing overloads keep working (diagnostics discarded).
+- `IMorphologicalAnalyzer.AnalyzeWord` is an interface shared with non-HC analyzers —
+  leave it unchanged; best-effort results. Callers who need status use the new overload.
+- `Morpher.RerunWithDiagnostics(string word)` (name TBD): re-parses one word with
+  per-rule counters (and optionally a lower budget), returning ranked
+  `(rule, applicationCount)` — "word *X* exceeded 100k steps; rule *Y* accounted for
+  92% of applications." This is the empirical half of the "don't do this" guide.
+
+### 4.5 FieldWorks / worker integration (follow-up, separate repo)
+
+- The worker DTO (`WordAnalysisDto` / batch results in FW `Src/LexText/HCWorker`) gains a
+  per-word status field (`Success | NoParse | GaveUp(reason)`), so "Parse All Words" can
+  show gave-up words distinctly and offer "diagnose this word" (the re-run).
+- `ParserWorker.ParseAndUpdateWordformGuarded` already guards per-word exceptions; the
+  soft-stop design means it needs no change to survive breaches — only to *display* them.
+
+## 5. Layer 2 — structural bounds
+
+### 5.1 Global per-word unapplication cap (the "same thing, even if separated" bound)
+
+`Word` already tracks per-rule unapplication counts (that's how `MaxApplicationCount` is
+enforced). Add a running total incremented in `MorphologicalRuleUnapplied`:
+
+```csharp
+/// Max total morphological-rule unapplications per analysis candidate (≈ max affixes
+/// per word). 0 = unlimited. Proposed default: 0 initially, recommend 10–16 for FLEx.
+public int MaxRuleApplicationsPerWord { get; set; }
+```
+
+Checked in the same early-out cluster as `MaxApplicationCount`. This closes the
+A→B→A→B loophole: no per-rule counter trips, but the total does.
+
+Naming note: the existing `Morpher.MaxUnapplications` (caps *analyses emitted per
+stratum*) is easily confused with this. Keep it, document both clearly, consider
+`[Obsolete]`-forwarding it to a better name in the same release (decide in review).
+
+### 5.2 Analysis shape-growth cap
+
+The one truly unbounded generator is unapplication that makes the hypothesized underlying
+form *longer* than the surface form (undoing deletions; empty/subtractive exponents).
+`DeletionReapplications` bounds this narrowly for rewrite rules; generalize:
+
+```csharp
+/// Prune any analysis candidate whose shape exceeds the surface form by more than
+/// this many segments. -1 = unlimited (default, preserves current behavior).
+public int MaxAnalysisShapeGrowth { get; set; }
+```
+
+Enforced at the `AnalysisStratumRule.Apply` output loop (single choke point; candidates
+pruned there never reach lexical lookup or the next stratum) and in
+`AnalysisRewriteRule`'s iteration loop (so a self-feeding epenthesis-unapplication is cut
+mid-rule, not after producing a huge shape). Surface length is captured on the
+`ParseContext` (Layer 1's context doubles as the carrier for per-parse constants).
+
+### 5.3 Cycle detection in the permutation cascade
+
+`PermutationRuleCascade.ApplyRules` currently only compares a result to its immediate
+input. Two options, in preference order:
+
+1. **Depth cap (simple, sufficient):** thread a recursion-depth parameter; stop
+   descending past `MaxCascadeDepth` (derivable from `MaxRuleApplicationsPerWord`, so
+   possibly no new knob). Cheap, no allocation.
+2. **Visited set (complete):** per-branch `HashSet` with the existing
+   `FreezableEqualityComparer`. Catches length-k cycles exactly but allocates per branch.
+
+Given Layers 1 + 5.1 already bound total work, option 1 is likely enough; implement 1,
+keep 2 in reserve. These classes are in `SIL.Machine` core but consumed only by HC
+(verified: `SynthesisStratumRule`, `AnalysisStratumRule`), so a constructor-injected
+optional guard is safe.
+
+### 5.4 Defaults and compatibility
+
+All Layer-2 caps default to **off** in `SIL.Machine` (no behavior change for existing
+consumers; some legitimate agglutinative grammars have long affix chains). FieldWorks
+sets conservative values (proposed: `MaxRuleApplicationsPerWord` ≈ 16,
+`MaxAnalysisShapeGrowth` ≈ 6, `MaxParseSteps` ≈ 250k — calibrate in Phase 0). Revisit
+turning defaults on in a subsequent major version once field data exists.
+
+## 6. Layer 3 — static grammar lint (`GrammarAnalyzer`)
+
+### 6.1 Shape
+
+```csharp
+public static class GrammarAnalyzer
+{
+    public static IReadOnlyList Analyze(Language language);
+}
+
+public sealed class GrammarDiagnostic
+{
+    public string Code { get; } // stable, e.g. "HC0001" — doc anchor
+    public DiagnosticSeverity Severity { get; } // Error | Warning | Info
+    public IHCRule Rule { get; } // or Morpheme/AffixTemplate — the culprit object
+    public string Message { get; }
+    public string Suggestion { get; }
+}
+```
+
+Operates on the in-memory `Language`, so it works for **both** XML-loaded grammars and
+FieldWorks' programmatically built ones (HCLoader). A thin CLI (`hc-lint grammar.xml`)
+wraps `XmlLanguageLoader` + `Analyze` for use outside FLEx.
+
+### 6.2 Check catalogue (initial)
+
+| Code | Severity | Detects | Rationale |
+|---|---|---|---|
+| HC0001 | Error | Affix rule with **no overt exponent** (analysis side is a pure variable copy — LHS one `[Seg]*`-class variable, RHS adds no constant segments) **and** `MaxApplicationCount > 1` | Unapplies to every word, every time: guaranteed exponential. The headline "always wrong". |
+| HC0002 | Warning | No overt exponent, `MaxApplicationCount == 1` | Still multiplies candidates once per cascade position; frequently unintended. |
+| HC0003 | Warning | `multipleApplication` set high/unbounded on any rule | Flag the opt-in itself; require justification. |
+| HC0004 | Warning | **Self-feeding rule**: output unifies with the rule's own required environment (epenthesis/insertion feeding itself) | Loop generator in synthesis; growth generator in analysis. |
+| HC0005 | Warning | **Unconstrained deletion**: deletion rewrite rule with very permissive context | Unbounded re-insertion during analysis; interacts with `DeletionReapplications`. |
+| HC0006 | Warning | Compounding rule with unconstrained POS on **both** head and non-head | Cross-product blowup; interacts with `MaxStemCount`. |
+| HC0007 | Info | Optional-iterative lexical patterns (e.g. `([Seg])([Seg])`) | Spurious-ambiguity source already noted in `Morpher.LexicalGuess` comments. |
+| HC0008 | Info | Cyclic feeding pair: rule A's analysis output can feed B and vice versa with net growth | Best-effort structural check; pairs only. |
+
+What static analysis *cannot* catch — combinatorial interaction among individually
+reasonable rules — is covered by Layer 1's breach re-run (empirical top-offender report).
+The written guide ("Writing performant HC grammars") is organized by these codes, with a
+section on interpreting the empirical report.
+
+### 6.3 Consumers
+
+- **FLEx**: parser report / grammar check UI lists diagnostics next to the rules
+  (FieldWorks-side work, out of scope here; the API is designed for it).
+- **CLI**: for machine.py users and CI-style grammar validation.
+- **Tests**: our own pathological fixtures must each trip their intended code.
+
+## 7. Testing strategy
+
+- **Pathological fixtures**: construct minimal grammars in `MorpherTests` for each class:
+  glob rule + `multipleApplication`, A↔B cycle, self-feeding epenthesis, unconstrained
+  deletion, unconstrained compounding. Each must (a) trip the budget deterministically at
+  a known step count, (b) be caught by its Layer-2 cap, (c) be flagged by its lint code.
+- **Real-grammar fixtures (decided): use `samples/data/{indonesian,sena}-hc.xml` +
+  their wordlists directly.** These are the two grammars already in the working tree
+  from the rustify perf sessions — `indonesian-hc.xml` (2,563 lines) /
+  `indonesian-words.txt` (121 words) and the much larger `sena-hc.xml` (33,091 lines) /
+  `sena-words.txt` (7,121 words). They serve three roles: (1) the **default-calibration
+  corpus** for §4.1 (measure legitimate max steps/time, set the generous default above
+  it); (2) the **no-regression corpus** — with all knobs at their shipped defaults, every
+  word in both wordlists must still parse to byte-identical results (rustify's own audit
+  already established byte-identical output on these corpora pre-complexity-cap, so any
+  divergence post-complexity-cap is a bug in this work, not noise); (3) the **overhead
+  benchmark** corpus (see below). Still verify licensing/provenance before committing
+  them permanently to the test tree (currently untracked).
+- **Determinism**: same grammar + word ⇒ identical `StepsUsed` and identical breach
+  point, single- and multi-threaded (steps counter is shared/Interlocked; the *count at
+  breach* may vary ±parallelism — assert exhaustion + reason, not exact step, in parallel
+  mode; assert exact step in `SINGLE_THREADED`/dop=1).
+- **No-regression**: with all knobs off, full existing suite green and byte-identical
+  parse results on the sample grammars.
+- **Overhead benchmark**: sena + indonesian wordlists, budget on (shipped default) vs.
+  budget fully disabled, on the **rustify** engine (see §8) — target < 2% throughput
+  cost; if the single Interlocked increment shows up, fall back to per-thread counters
+  flushed periodically.
+- **Pathological additions to the real corpus**: since indonesian/sena are (presumably)
+  well-behaved grammars, also hand-craft 1–2 pathological *variants* of the indonesian
+  grammar specifically (smaller, easier to reason about than sena) — e.g. take one real
+  affix rule and strip its overt exponent, or raise its `multipleApplication` — so the
+  budget/lint tests exercise a realistic grammar shape, not just synthetic toy rules.
+
+## 8. Interaction with the rustify work (PR #446) — and sequencing
+
+**The overlap is near-total.** PR #446's single commit rewrites, among others:
+`Morpher.cs`, `Word.cs`, `AnalysisStratumRule.cs`, `SynthesisStratumRule.cs`,
+`AnalysisAffixProcessRule.cs`, `AnalysisCompoundingRule.cs`, `AnalysisRewriteRule.cs`,
+`ParallelCombinationRuleCascade.cs`, `XmlLanguageLoader.cs`, and `MorpherTests.cs` —
+i.e. **every file Layers 1–2 touch**. Beyond textual conflicts:
+
+1. **The budget lives in the hot path rustify just optimized.** Overhead must be measured
+   against the *new* engine; a check invisible on master's slower engine could be
+   measurable post-rustify. Rustify also deliberately stripped `MorpherStatistics` from
+   the hot path — the breach-then-rerun design in §4 exists to honor that decision, and
+   should be validated on that engine.
+2. **`Word` internals changed** (flat/COW shape, `Pattern` projection, changed
+   clone behavior). The `ParseContext` propagation through `Word.Clone` must be written
+   against rustify's `Word`, not master's.
+3. **Budget defaults need calibration on the shipped engine.** A step budget tuned on
+   master would be wildly conservative post-rustify.
+4. Even Layer 3 is lightly affected: HC0001/HC0002 inspect `AffixProcessAllomorph.Lhs`,
+   whose type changed `Pattern` → `Pattern` on rustify.
+5. Precedent: the `fst-advisor` branch already stacks on `hc-rustify` and needed a
+   mechanical `ShapeNode→int` fix after rebase — the same would happen here, times ten.
+
+**Decided: branch off `hc-rustify` now; do not wait for #446 to merge before starting
+implementation.** Rebasing one clean feature branch when #446 lands is routine (already
+done once for fst-advisor); writing Layers 1–2 against master and then porting them
+across rustify's 100-file rewrite is not. Concretely:
+
+- **Can start now, off `hc-rustify`:** Phase 0 (fixtures/repro harness) and Phase 1
+  (budget). Phase 0 is even branch-agnostic (test-only).
+- **Layer 3** is nearly independent (reads `Language` structure, never touches the hot
+  path) and could start on either base; starting it on `hc-rustify` avoids the one known
+  type change (item 4). It's also the natural parallel track if #446 review drags.
+- **Do not merge before #446.** Complexity-cap should land *after* rustify to avoid
+  forcing a painful rebase onto the 100-file rustify branch. Version-wise this fits the
+  already-recommended major-version release train for rustify (master is at 3.9.0;
+  rustify targets a major bump); complexity-cap's additive API rides the same train.
+
+## 9. Phases
+
+| Phase | Deliverable | Depends on | Est. size |
+|---|---|---|---|
+| 0 | Branch off `hc-rustify`. Baseline `indonesian`/`sena` on rustify (max steps/time observed → derive generous `MaxParseSteps`/`ParseTimeout` defaults); build 1–2 pathological variants of the indonesian grammar; repro harness | `hc-rustify` | S |
+| 1 | `ParseContext`, `MaxParseSteps` + `ParseTimeout`, soft-stop checks, `ParseDiagnostics` overload, breach re-run with per-rule counters | 0 | M |
+| 2 | `MaxRuleApplicationsPerWord`, `MaxAnalysisShapeGrowth`, cascade depth cap | 1 (shares `ParseContext`) | M |
+| 3 | `GrammarAnalyzer` + HC0001–HC0008, CLI, "Writing performant HC grammars" guide | — (parallelizable) | M–L |
+| 4 | FieldWorks follow-ups: worker DTO status field, FLEx "diagnose word" + parser-report lint surfacing, set conservative caps in HCLoader | 1–3, FW repo | separate effort |
+
+## 10. Open questions
+
+**Resolved 2026-07-02:**
+
+1. ~~Soft-stop vs. throw~~ — **soft-stop**, no exception path for budget/timeout
+   exhaustion (§4.4). Real errors still throw as today.
+2. ~~Default values~~ — **generous default, shipped on**, not opt-in (§4.1). Exact
+   numbers derived from Phase 0 baselining against `indonesian`/`sena`, not guessed.
+5. ~~Sample grammars~~ — **use `indonesian-hc.xml`/`sena-hc.xml` directly** as the
+   calibration, no-regression, and overhead-benchmark corpus (§7). Provenance/license
+   check before permanent commit still applies, but the *design* decision to use them
+   (rather than build a separate synthetic-only corpus) is made.
+
+**Still open:**
+
+3. **Rename/deprecate `MaxUnapplications`?** Its name collides conceptually with the new
+   caps; same-release cleanup vs. leave-as-is.
+4. **Where does `ParseDiagnostics` surface in machine.py parity?** machine.py has its own
+   HC port; decide whether these knobs/codes should be mirrored there (same codes would
+   keep the guide tool-agnostic).
+6. **HC0004/HC0008 precision**: self-feeding/cycle detection via unification is
+   approximate; acceptable false-positive rate for a Warning? Start conservative
+   (high-confidence patterns only), widen with field feedback.
diff --git a/src/SIL.Machine.Morphology.HermitCrab/AnalysisAffixTemplateRule.cs b/src/SIL.Machine.Morphology.HermitCrab/AnalysisAffixTemplateRule.cs
index f401ce0f..72c4a214 100644
--- a/src/SIL.Machine.Morphology.HermitCrab/AnalysisAffixTemplateRule.cs
+++ b/src/SIL.Machine.Morphology.HermitCrab/AnalysisAffixTemplateRule.cs
@@ -31,6 +31,9 @@ public AnalysisAffixTemplateRule(Morpher morpher, AffixTemplate template)
 
         public IEnumerable Apply(Word input)
         {
+            if (input.ParseContext?.Step(_template) == false)
+                return Enumerable.Empty();
+
             if (!_morpher.RuleSelector(_template))
                 return Enumerable.Empty();
 
diff --git a/src/SIL.Machine.Morphology.HermitCrab/AnalysisLanguageRule.cs b/src/SIL.Machine.Morphology.HermitCrab/AnalysisLanguageRule.cs
index 4bdd3c95..b8131a08 100644
--- a/src/SIL.Machine.Morphology.HermitCrab/AnalysisLanguageRule.cs
+++ b/src/SIL.Machine.Morphology.HermitCrab/AnalysisLanguageRule.cs
@@ -26,6 +26,9 @@ public IEnumerable Apply(Word input)
             var results = new HashSet(FreezableEqualityComparer.Default);
             for (int i = 0; i < _rules.Count && inputSet.Count > 0; i++)
             {
+                if (input.ParseContext?.Exhausted == true)
+                    break;
+
                 if (!_morpher.RuleSelector(_strata[i]))
                     continue;
 
diff --git a/src/SIL.Machine.Morphology.HermitCrab/AnalysisStratumRule.cs b/src/SIL.Machine.Morphology.HermitCrab/AnalysisStratumRule.cs
index aadef083..3ee2b95b 100644
--- a/src/SIL.Machine.Morphology.HermitCrab/AnalysisStratumRule.cs
+++ b/src/SIL.Machine.Morphology.HermitCrab/AnalysisStratumRule.cs
@@ -132,6 +132,11 @@ public IEnumerable Apply(Word input)
                 _morpher.TraceManager.EndUnapplyStratum(_stratum, input);
             foreach (Word mruleOutWord in mruleOutWords)
             {
+                // Once the budget is gone, stop collecting outputs immediately rather than draining the
+                // rest of an already-in-flight (but now-empty-yielding) rule cascade.
+                if (input.ParseContext?.Exhausted == true)
+                    break;
+
                 // Skip intermediate sources from phonological rules, templates, and morphological rules.
                 mruleOutWord.Source = origInput;
                 if (mergeEquivalentAnalyses)
diff --git a/src/SIL.Machine.Morphology.HermitCrab/Morpher.cs b/src/SIL.Machine.Morphology.HermitCrab/Morpher.cs
index 10cdc45c..da9ad1c0 100644
--- a/src/SIL.Machine.Morphology.HermitCrab/Morpher.cs
+++ b/src/SIL.Machine.Morphology.HermitCrab/Morpher.cs
@@ -70,6 +70,8 @@ public Morpher(ITraceManager traceManager, Language lang, int maxDegreeOfParalle
             MergeEquivalentAnalyses = true;
             LexEntrySelector = entry => true;
             RuleSelector = rule => true;
+            MaxParseSteps = DefaultMaxParseSteps;
+            ParseTimeout = DefaultParseTimeout;
             _morphemes = new ReadOnlyObservableCollection(morphemes);
         }
 
@@ -79,10 +81,42 @@ public ITraceManager TraceManager
             get { return _traceManager; }
         }
 
+        ///
+        /// Generous default for , calibrated against the real Indonesian/Sena
+        /// grammars on the rustify engine (see complexity-cap.md Phase 0): observed legitimate max was
+        /// ~13,600 steps (Sena), so this ships ~150x above that ceiling — effectively invisible for real
+        /// grammars but still finite. 0 disables the step budget.
+        ///
+        public const int DefaultMaxParseSteps = 2_000_000;
+
+        ///
+        /// Generous default for — a backstop far above any observed legitimate
+        /// single-word parse time on the rustify engine, but still bounded so one pathological word cannot
+        /// stall a "Parse All Words" batch indefinitely. disables the timeout.
+        ///
+        public static readonly TimeSpan DefaultParseTimeout = TimeSpan.FromSeconds(10);
+
         public int DeletionReapplications { get; set; }
 
         public int MaxStemCount { get; set; }
 
+        ///
+        /// Max rule applications (analysis + synthesis) per
+        /// call. Ships on with a generous default () so naive consumers are
+        /// protected out of the box; 0 = unlimited. On breach, the parse soft-stops: rules return no further
+        /// results, so whatever analyses/syntheses were already found are still returned, flagged via the
+        /// overload. Never throws.
+        ///
+        public int MaxParseSteps { get; set; }
+
+        ///
+        /// Wall-clock backstop per call,
+        /// checked periodically alongside the step budget (not on every step, to keep the happy path cheap).
+        /// Ships on with a generous default (); or
+        /// a negative value disables it. Same soft-stop behavior as .
+        ///
+        public TimeSpan ParseTimeout { get; set; }
+
         ///
         /// MaxUnapplications limits the number of unapplications to make it possible
         /// to make it possible to debug words that take 30 minutes to parse
@@ -128,11 +162,47 @@ public IEnumerable ParseWord(string word, out object trace)
         /// If there are no analyses and guessRoot is true, then guess the root.
         ///
         public IEnumerable ParseWord(string word, out object trace, bool guessRoot)
+        {
+            return ParseWord(word, out trace, guessRoot, out _);
+        }
+
+        ///
+        /// Parse the specified surface form, possibly tracing the parse. If there are no analyses and
+        /// guessRoot is true, then guess the root. reports whether
+        /// / cut the parse short (soft-stop: the
+        /// returned sequence is whatever was found so far, never an exception).
+        ///
+        public IEnumerable ParseWord(string word, out object trace, bool guessRoot, out ParseDiagnostics diagnostics)
+        {
+            return ParseWordCore(word, out trace, guessRoot, collectRuleCounters: false, out diagnostics);
+        }
+
+        ///
+        /// Re-parses one word with per-rule application counters enabled and reports the top offenders —
+        /// "word X exceeded N steps; rule Y accounted for most of the applications". Intended for use only
+        /// after a breach is observed via the overload: counters add overhead,
+        /// so they are never on during the normal happy path (see complexity-cap.md §3 "cheap happy path").
+ /// + public ParseDiagnostics RerunWithDiagnostics(string word, out IEnumerable results) + { + results = ParseWordCore(word, out _, false, collectRuleCounters: true, out ParseDiagnostics diagnostics); + return diagnostics; + } + + private IEnumerable ParseWordCore( + string word, + out object trace, + bool guessRoot, + bool collectRuleCounters, + out ParseDiagnostics diagnostics + ) { // convert the word to its phonetic shape Shape shape = _lang.SurfaceStratum.CharacterDefinitionTable.Segment(word); var input = new Word(_lang.SurfaceStratum, shape); + var parseContext = new ParseContext(MaxParseSteps, ParseTimeout, shape.Count, collectRuleCounters); + input.ParseContext = parseContext; input.Freeze(); if (_traceManager.IsTracing) _traceManager.AnalyzeWord(_lang, input); @@ -177,11 +247,30 @@ public IEnumerable ParseWord(string word, out object trace, bool guessRoot matches.Sort((x, y) => y.Morphs.Count().CompareTo(x.Morphs.Count())); + diagnostics = CreateParseDiagnostics(parseContext); return matches; } + diagnostics = CreateParseDiagnostics(parseContext); return syntheses; } + private static ParseDiagnostics CreateParseDiagnostics(ParseContext parseContext) + { + if (!parseContext.Exhausted) + return ParseDiagnostics.None; + + IReadOnlyList<(IHCRule Rule, int Applications)> topRules = null; + if (parseContext.DiagnosticsEnabled) + { + topRules = parseContext + .RuleCounters.Select(kvp => (Rule: kvp.Key, Applications: kvp.Value)) + .OrderByDescending(t => t.Applications) + .ToList(); + } + + return new ParseDiagnostics(true, parseContext.Reason, parseContext.StepsUsed, parseContext.Elapsed, topRules); + } + /// /// Generates surface forms from the specified word synthesis information. 
/// @@ -208,6 +297,7 @@ out object trace trace = rootTrace; var words = new ConcurrentBag(); + var parseContext = new ParseContext(MaxParseSteps, ParseTimeout, rootEntry.PrimaryAllomorph.Segments.Shape.Count); Exception exception = null; Parallel.ForEach( @@ -220,12 +310,15 @@ out object trace { try { - var synthesisWord = new Word(synthesisInfo.Allomorph, realizationalFS); + var synthesisWord = new Word(synthesisInfo.Allomorph, realizationalFS) + { + ParseContext = parseContext, + }; foreach (Tuple rule in synthesisInfo.RulePermutation) { synthesisWord.MorphologicalRuleUnapplied(rule.Item1); if (rule.Item2 != null) - synthesisWord.NonHeadUnapplied(new Word(rule.Item2, new FeatureStruct())); + synthesisWord.NonHeadUnapplied(new Word(rule.Item2, new FeatureStruct()) { ParseContext = parseContext }); } synthesisWord.CurrentTrace = rootTrace; @@ -307,6 +400,8 @@ private IEnumerable Synthesize(string word, IList analyses) var matches = new HashSet(FreezableEqualityComparer.Default); foreach (Word analysisWord in analyses) { + if (analysisWord.ParseContext?.Exhausted == true) + break; foreach (Word validWord in SynthesizeAnalysis(word, analysisWord)) matches.Add(validWord); } @@ -342,6 +437,8 @@ private IEnumerable SynthesizeAnalysis(string word, Word analysisWord) { foreach (Word synthesisWord in LexicalLookup(analysisWord)) { + if (synthesisWord.ParseContext?.Exhausted == true) + yield break; foreach (Word alternative in synthesisWord.ExpandAlternatives()) { foreach (Word validWord in _synthesisRule.Apply(alternative).Where(IsWordValid)) @@ -371,6 +468,8 @@ LexEntry entry in SearchRootAllomorphs(input.Stratum, input.Shape) .Distinct() ) { + if (input.ParseContext?.Exhausted == true) + yield break; foreach (RootAllomorph allomorph in entry.Allomorphs) { Word newWord = input.Clone(); diff --git a/src/SIL.Machine.Morphology.HermitCrab/MorphologicalRules/AnalysisAffixProcessRule.cs b/src/SIL.Machine.Morphology.HermitCrab/MorphologicalRules/AnalysisAffixProcessRule.cs 
index b9f6d4ac..7cca6fdf 100644 --- a/src/SIL.Machine.Morphology.HermitCrab/MorphologicalRules/AnalysisAffixProcessRule.cs +++ b/src/SIL.Machine.Morphology.HermitCrab/MorphologicalRules/AnalysisAffixProcessRule.cs @@ -39,6 +39,9 @@ public AnalysisAffixProcessRule(Morpher morpher, AffixProcessRule rule) public IEnumerable Apply(Word input) { + if (input.ParseContext?.Step(_rule) == false) + return Enumerable.Empty(); + if (!_morpher.RuleSelector(_rule)) return Enumerable.Empty(); diff --git a/src/SIL.Machine.Morphology.HermitCrab/MorphologicalRules/AnalysisCompoundingRule.cs b/src/SIL.Machine.Morphology.HermitCrab/MorphologicalRules/AnalysisCompoundingRule.cs index b5013d4e..e03b6cfe 100644 --- a/src/SIL.Machine.Morphology.HermitCrab/MorphologicalRules/AnalysisCompoundingRule.cs +++ b/src/SIL.Machine.Morphology.HermitCrab/MorphologicalRules/AnalysisCompoundingRule.cs @@ -39,6 +39,9 @@ public AnalysisCompoundingRule(Morpher morpher, CompoundingRule rule) public IEnumerable Apply(Word input) { + if (input.ParseContext?.Step(_rule) == false) + return Enumerable.Empty(); + if (!_morpher.RuleSelector(_rule)) return Enumerable.Empty(); diff --git a/src/SIL.Machine.Morphology.HermitCrab/MorphologicalRules/AnalysisRealizationalAffixProcessRule.cs b/src/SIL.Machine.Morphology.HermitCrab/MorphologicalRules/AnalysisRealizationalAffixProcessRule.cs index 031c6fba..e526682a 100644 --- a/src/SIL.Machine.Morphology.HermitCrab/MorphologicalRules/AnalysisRealizationalAffixProcessRule.cs +++ b/src/SIL.Machine.Morphology.HermitCrab/MorphologicalRules/AnalysisRealizationalAffixProcessRule.cs @@ -39,6 +39,9 @@ public AnalysisRealizationalAffixProcessRule(Morpher morpher, RealizationalAffix public IEnumerable Apply(Word input) { + if (input.ParseContext?.Step(_rule) == false) + return Enumerable.Empty(); + if (!_morpher.RuleSelector(_rule)) return Enumerable.Empty(); diff --git a/src/SIL.Machine.Morphology.HermitCrab/MorphologicalRules/SynthesisAffixProcessRule.cs 
b/src/SIL.Machine.Morphology.HermitCrab/MorphologicalRules/SynthesisAffixProcessRule.cs index 98a3895d..6537c0e9 100644 --- a/src/SIL.Machine.Morphology.HermitCrab/MorphologicalRules/SynthesisAffixProcessRule.cs +++ b/src/SIL.Machine.Morphology.HermitCrab/MorphologicalRules/SynthesisAffixProcessRule.cs @@ -40,6 +40,9 @@ public SynthesisAffixProcessRule(Morpher morpher, AffixProcessRule rule) public IEnumerable Apply(Word input) { + if (input.ParseContext?.Step(_rule) == false) + return Enumerable.Empty(); + if (!input.IsMorphologicalRuleApplicable(_rule)) return Enumerable.Empty(); diff --git a/src/SIL.Machine.Morphology.HermitCrab/MorphologicalRules/SynthesisCompoundingRule.cs b/src/SIL.Machine.Morphology.HermitCrab/MorphologicalRules/SynthesisCompoundingRule.cs index 29e3bd5f..4602321c 100644 --- a/src/SIL.Machine.Morphology.HermitCrab/MorphologicalRules/SynthesisCompoundingRule.cs +++ b/src/SIL.Machine.Morphology.HermitCrab/MorphologicalRules/SynthesisCompoundingRule.cs @@ -44,6 +44,9 @@ private Matcher BuildMatcher(IEnumerable> lhs) public IEnumerable Apply(Word input) { + if (input.ParseContext?.Step(_rule) == false) + return Enumerable.Empty(); + if (!input.IsMorphologicalRuleApplicable(_rule)) return Enumerable.Empty(); diff --git a/src/SIL.Machine.Morphology.HermitCrab/MorphologicalRules/SynthesisRealizationalAffixProcessRule.cs b/src/SIL.Machine.Morphology.HermitCrab/MorphologicalRules/SynthesisRealizationalAffixProcessRule.cs index bd1717f8..ab45edd6 100644 --- a/src/SIL.Machine.Morphology.HermitCrab/MorphologicalRules/SynthesisRealizationalAffixProcessRule.cs +++ b/src/SIL.Machine.Morphology.HermitCrab/MorphologicalRules/SynthesisRealizationalAffixProcessRule.cs @@ -40,6 +40,9 @@ public SynthesisRealizationalAffixProcessRule(Morpher morpher, RealizationalAffi public IEnumerable Apply(Word input) { + if (input.ParseContext?.Step(_rule) == false) + return Enumerable.Empty(); + if (!_morpher.RuleSelector(_rule)) return Enumerable.Empty(); diff --git 
a/src/SIL.Machine.Morphology.HermitCrab/ParseContext.cs b/src/SIL.Machine.Morphology.HermitCrab/ParseContext.cs
new file mode 100644
index 00000000..82731dde
--- /dev/null
+++ b/src/SIL.Machine.Morphology.HermitCrab/ParseContext.cs
@@ -0,0 +1,102 @@
+using System;
+using System.Collections.Concurrent;
+using System.Collections.Generic;
+using System.Diagnostics;
+using System.Threading;
+
+namespace SIL.Machine.Morphology.HermitCrab
+{
+    public enum ParseExhaustionReason
+    {
+        None,
+        StepBudget,
+        Timeout,
+    }
+
+    /// <summary>
+    /// Per-<see cref="Morpher.ParseWord(string, out object, bool)"/> work budget,
+    /// referenced from every <see cref="Word"/> produced during that parse (propagated through
+    /// <see cref="Word"/>'s copy constructor exactly like <see cref="Word.CurrentTrace"/>). Compiled rule
+    /// objects are shared across concurrent parses, so this state cannot live on the rules or the
+    /// <see cref="Morpher"/> itself; it lives here instead and is threaded through the data.
+    /// </summary>
+    internal sealed class ParseContext
+    {
+        // Wall-clock is checked only every Nth step: Stopwatch reads are cheap but not free, and the
+        // budget's steady-state cost on the happy path must stay close to a single Interlocked increment.
+        private const int DeadlineCheckMask = 0xFF;
+
+        private readonly int _maxSteps;
+        private readonly long _timeoutTicks;
+        private readonly long _startTimestamp;
+        private readonly ConcurrentDictionary<IHCRule, int> _ruleCounters;
+        private int _steps;
+        private int _exhausted;
+        private ParseExhaustionReason _reason;
+
+        public ParseContext(int maxSteps, TimeSpan timeout, int surfaceLength, bool collectRuleCounters = false)
+        {
+            _maxSteps = maxSteps;
+            _timeoutTicks = timeout > TimeSpan.Zero ? (long)(timeout.TotalSeconds * Stopwatch.Frequency) : -1;
+            _startTimestamp = Stopwatch.GetTimestamp();
+            SurfaceLength = surfaceLength;
+            if (collectRuleCounters)
+                _ruleCounters = new ConcurrentDictionary<IHCRule, int>();
+        }
+
+        /// <summary>Length (in segments) of the surface shape being parsed; carrier for Layer 2's shape-growth cap.</summary>
+        public int SurfaceLength { get; }
+
+        public bool Exhausted => Volatile.Read(ref _exhausted) != 0;
+
+        public ParseExhaustionReason Reason => _reason;
+
+        public int StepsUsed => Volatile.Read(ref _steps);
+
+        public TimeSpan Elapsed =>
+            TimeSpan.FromSeconds((double)(Stopwatch.GetTimestamp() - _startTimestamp) / Stopwatch.Frequency);
+
+        public bool DiagnosticsEnabled => _ruleCounters != null;
+
+        public IReadOnlyDictionary<IHCRule, int> RuleCounters => _ruleCounters;
+
+        /// <summary>
+        /// Records one rule-application attempt. Returns false once the budget is gone; callers must
+        /// treat that as "no result" and unwind immediately (return Enumerable.Empty&lt;Word&gt;()),
+        /// never throw.
+        /// </summary>
+        public bool Step(IHCRule rule = null)
+        {
+            if (Exhausted)
+                return false;
+
+            if (rule != null && _ruleCounters != null)
+                _ruleCounters.AddOrUpdate(rule, 1, (_, count) => count + 1);
+
+            if (_maxSteps <= 0 && _timeoutTicks < 0)
+                return true;
+
+            int steps = Interlocked.Increment(ref _steps);
+            if (_maxSteps > 0 && steps >= _maxSteps)
+            {
+                MarkExhausted(ParseExhaustionReason.StepBudget);
+                return false;
+            }
+            if (_timeoutTicks >= 0 && (steps & DeadlineCheckMask) == 0)
+            {
+                if (Stopwatch.GetTimestamp() - _startTimestamp >= _timeoutTicks)
+                {
+                    MarkExhausted(ParseExhaustionReason.Timeout);
+                    return false;
+                }
+            }
+            return true;
+        }
+
+        private void MarkExhausted(ParseExhaustionReason reason)
+        {
+            if (Interlocked.CompareExchange(ref _exhausted, 1, 0) == 0)
+                _reason = reason;
+        }
+    }
+}
diff --git a/src/SIL.Machine.Morphology.HermitCrab/ParseDiagnostics.cs b/src/SIL.Machine.Morphology.HermitCrab/ParseDiagnostics.cs
new file mode 100644
index 00000000..a661e505
--- /dev/null
+++ b/src/SIL.Machine.Morphology.HermitCrab/ParseDiagnostics.cs
@@ -0,0 +1,47 @@
+using System;
+using System.Collections.Generic;
+
+namespace SIL.Machine.Morphology.HermitCrab
+{
+    /// <summary>
+    /// Reports whether <see cref="Morpher.MaxParseSteps"/> / <see cref="Morpher.ParseTimeout"/> cut a parse
+    /// short.
A breach is a soft-stop: the parse still returns whatever analyses/syntheses it had found,
+    /// this just tells the caller the result may be incomplete rather than "no parse".
+    /// </summary>
+    public sealed class ParseDiagnostics
+    {
+        public static readonly ParseDiagnostics None = new ParseDiagnostics(
+            false,
+            ParseExhaustionReason.None,
+            0,
+            TimeSpan.Zero,
+            null
+        );
+
+        internal ParseDiagnostics(
+            bool budgetExhausted,
+            ParseExhaustionReason reason,
+            int stepsUsed,
+            TimeSpan elapsed,
+            IReadOnlyList<(IHCRule Rule, int Applications)> topRules
+        )
+        {
+            BudgetExhausted = budgetExhausted;
+            Reason = reason;
+            StepsUsed = stepsUsed;
+            Elapsed = elapsed;
+            TopRules = topRules ?? Array.Empty<(IHCRule Rule, int Applications)>();
+        }
+
+        public bool BudgetExhausted { get; }
+
+        public ParseExhaustionReason Reason { get; }
+
+        public int StepsUsed { get; }
+
+        public TimeSpan Elapsed { get; }
+
+        /// <summary>Populated only by <see cref="Morpher.RerunWithDiagnostics"/>.</summary>
+        public IReadOnlyList<(IHCRule Rule, int Applications)> TopRules { get; }
+    }
+}
diff --git a/src/SIL.Machine.Morphology.HermitCrab/PhonologicalRules/AnalysisMetathesisRule.cs b/src/SIL.Machine.Morphology.HermitCrab/PhonologicalRules/AnalysisMetathesisRule.cs
index 5d160243..7b26df7c 100644
--- a/src/SIL.Machine.Morphology.HermitCrab/PhonologicalRules/AnalysisMetathesisRule.cs
+++ b/src/SIL.Machine.Morphology.HermitCrab/PhonologicalRules/AnalysisMetathesisRule.cs
@@ -37,6 +37,9 @@ public AnalysisMetathesisRule(Morpher morpher, MetathesisRule rule)
 
         public IEnumerable<Word> Apply(Word input)
         {
+            if (input.ParseContext?.Step(_rule) == false)
+                return Enumerable.Empty<Word>();
+
             if (!_morpher.RuleSelector(_rule))
                 return Enumerable.Empty<Word>();
 
diff --git a/src/SIL.Machine.Morphology.HermitCrab/PhonologicalRules/AnalysisRewriteRule.cs b/src/SIL.Machine.Morphology.HermitCrab/PhonologicalRules/AnalysisRewriteRule.cs
index e691b4c0..08a01a6c 100644
--- a/src/SIL.Machine.Morphology.HermitCrab/PhonologicalRules/AnalysisRewriteRule.cs
+++
b/src/SIL.Machine.Morphology.HermitCrab/PhonologicalRules/AnalysisRewriteRule.cs @@ -120,6 +120,9 @@ private static bool IsUnifiable(Constraint constraint, Pattern Apply(Word input) { + if (input.ParseContext?.Step(_rule) == false) + return Enumerable.Empty(); + if (!_morpher.RuleSelector(_rule)) return Enumerable.Empty(); @@ -151,6 +154,10 @@ public IEnumerable Apply(Word input) j++; if (j > _morpher.DeletionReapplications) break; + // Bounded by DeletionReapplications above, but that's a user-set knob with + // no ceiling of its own — still gate each reapplication on the shared budget. + if (input.ParseContext?.Step(_rule) == false) + break; data = sr.Item2.Apply(data).SingleOrDefault(); } } @@ -162,6 +169,10 @@ public IEnumerable Apply(Word input) while (data != null) { srApplied = true; + // Unlike Deletion, this loop has no reapplication ceiling of its own (a + // self-feeding rule can hypothesize forever) — the budget is the only bound. + if (input.ParseContext?.Step(_rule) == false) + break; data = sr.Item2.Apply(data).SingleOrDefault(); } } diff --git a/src/SIL.Machine.Morphology.HermitCrab/PhonologicalRules/SynthesisMetathesisRule.cs b/src/SIL.Machine.Morphology.HermitCrab/PhonologicalRules/SynthesisMetathesisRule.cs index 2d8c3af5..90a2b272 100644 --- a/src/SIL.Machine.Morphology.HermitCrab/PhonologicalRules/SynthesisMetathesisRule.cs +++ b/src/SIL.Machine.Morphology.HermitCrab/PhonologicalRules/SynthesisMetathesisRule.cs @@ -34,6 +34,9 @@ public SynthesisMetathesisRule(Morpher morpher, MetathesisRule rule) public IEnumerable Apply(Word input) { + if (input.ParseContext?.Step(_rule) == false) + return Enumerable.Empty(); + if (!_morpher.RuleSelector(_rule)) return Enumerable.Empty(); diff --git a/src/SIL.Machine.Morphology.HermitCrab/PhonologicalRules/SynthesisRewriteRule.cs b/src/SIL.Machine.Morphology.HermitCrab/PhonologicalRules/SynthesisRewriteRule.cs index ecf84a7d..e98bc98b 100644 --- 
a/src/SIL.Machine.Morphology.HermitCrab/PhonologicalRules/SynthesisRewriteRule.cs +++ b/src/SIL.Machine.Morphology.HermitCrab/PhonologicalRules/SynthesisRewriteRule.cs @@ -50,6 +50,9 @@ public SynthesisRewriteRule(Morpher morpher, RewriteRule rule) public IEnumerable Apply(Word input) { + if (input.ParseContext?.Step(_rule) == false) + return Enumerable.Empty(); + if (!_morpher.RuleSelector(_rule)) return Enumerable.Empty(); diff --git a/src/SIL.Machine.Morphology.HermitCrab/SynthesisAffixTemplateRule.cs b/src/SIL.Machine.Morphology.HermitCrab/SynthesisAffixTemplateRule.cs index 21248d00..7251ab11 100644 --- a/src/SIL.Machine.Morphology.HermitCrab/SynthesisAffixTemplateRule.cs +++ b/src/SIL.Machine.Morphology.HermitCrab/SynthesisAffixTemplateRule.cs @@ -27,6 +27,9 @@ public SynthesisAffixTemplateRule(Morpher morpher, AffixTemplate template) public IEnumerable Apply(Word input) { + if (input.ParseContext?.Step(_template) == false) + return Enumerable.Empty(); + if (_morpher.TraceManager.IsTracing) _morpher.TraceManager.BeginApplyTemplate(_template, input); var output = new HashSet(FreezableEqualityComparer.Default); diff --git a/src/SIL.Machine.Morphology.HermitCrab/Word.cs b/src/SIL.Machine.Morphology.HermitCrab/Word.cs index 96748875..95e8b320 100644 --- a/src/SIL.Machine.Morphology.HermitCrab/Word.cs +++ b/src/SIL.Machine.Morphology.HermitCrab/Word.cs @@ -98,6 +98,7 @@ protected Word(Word word) _isLastAppliedRuleFinal = word._isLastAppliedRuleFinal; _isPartial = word._isPartial; CurrentTrace = word.CurrentTrace; + ParseContext = word.ParseContext; _disjunctiveAllomorphIndices = word._disjunctiveAllomorphIndices == null || word._disjunctiveAllomorphIndices.Count == 0 ? null @@ -226,6 +227,15 @@ public IEnumerable MorphemesInApplicationOrder public object CurrentTrace { get; set; } + /// + /// Work budget for the parse this word is part of. Null for words never routed through + /// (e.g. 
words built + /// directly by rule-level unit tests), in which case budget checks are no-ops (unlimited). + /// Reference-shared like — deliberately excluded from + /// and so dedup semantics are unchanged. + /// + internal ParseContext ParseContext { get; set; } + public bool IsPartial { get { return _isPartial; } @@ -514,6 +524,7 @@ internal bool CheckBlocking(out Word word) word = new Word(entry.PrimaryAllomorph, RealizationalFeatureStruct.Clone()) { CurrentTrace = CurrentTrace, + ParseContext = ParseContext, }; word.Freeze(); return true; diff --git a/tests/SIL.Machine.Morphology.HermitCrab.Tests/MorpherTests.cs b/tests/SIL.Machine.Morphology.HermitCrab.Tests/MorpherTests.cs index 8245d17a..59879a89 100644 --- a/tests/SIL.Machine.Morphology.HermitCrab.Tests/MorpherTests.cs +++ b/tests/SIL.Machine.Morphology.HermitCrab.Tests/MorpherTests.cs @@ -543,6 +543,151 @@ public void AnalyzeWord_ConcurrentRepeatedParsing_IsDeterministic() } } + [Test] + public void ParseWord_DefaultBudget_DoesNotTripOnOrdinaryGrammar() + { + var any = FeatureStruct.New().Symbol(HCFeatureSystem.Segment).Value; + var edSuffix = new AffixProcessRule + { + Id = "PAST", + Name = "ed_suffix", + Gloss = "PAST", + RequiredSyntacticFeatureStruct = FeatureStruct.New(Language.SyntacticFeatureSystem).Symbol("V").Value, + }; + edSuffix.Allomorphs.Add( + new AffixProcessAllomorph + { + Lhs = { Pattern.New("1").Annotation(any).OneOrMore.Value }, + Rhs = { new CopyFromInput("1"), new InsertSegments(Table3, "+d") }, + } + ); + Morphophonemic.MorphologicalRules.Add(edSuffix); + + var morpher = new Morpher(TraceManager, Language); + Assert.That(morpher.MaxParseSteps, Is.EqualTo(Morpher.DefaultMaxParseSteps)); + Assert.That(morpher.ParseTimeout, Is.EqualTo(Morpher.DefaultParseTimeout)); + + IEnumerable results = morpher.ParseWord("sagd", out _, false, out ParseDiagnostics diagnostics).ToList(); + + Assert.That(results, Is.Not.Empty); + Assert.That(diagnostics.BudgetExhausted, Is.False); + 
+        Assert.That(diagnostics.Reason, Is.EqualTo(ParseExhaustionReason.None));
+    }
+
+    [Test]
+    public void ParseWord_StepBudgetExhausted_SoftStopsWithDiagnostics()
+    {
+        // A pathological rule that can keep unapplying without ever changing the shape: the
+        // cascade's own "input == output" infinite-loop guard never trips (see below), and its
+        // MaxApplicationCount is high enough that only the new step budget bounds it.
+        var any = FeatureStruct.New().Symbol(HCFeatureSystem.Segment).Value;
+        // No overt exponent: Rhs is a pure copy of the input, so every unapplication produces a
+        // Word with the identical Shape but one more entry in the morphological-rule-application
+        // list. The cascades' infinite-loop guard compares Words by ValueEquals (which includes
+        // that list), so it never trips here; only the new step budget bounds this.
+        var noExponentSuffix = new AffixProcessRule
+        {
+            Id = "REPEAT",
+            Name = "no_exponent_suffix",
+            Gloss = "REPEAT",
+            MaxApplicationCount = 1_000_000,
+            RequiredSyntacticFeatureStruct = FeatureStruct.New(Language.SyntacticFeatureSystem).Symbol("V").Value,
+        };
+        noExponentSuffix.Allomorphs.Add(
+            new AffixProcessAllomorph
+            {
+                Lhs = { Pattern<Word, ShapeNode>.New("1").Annotation(any).OneOrMore.Value },
+                Rhs = { new CopyFromInput("1") },
+            }
+        );
+        Morphophonemic.MorphologicalRules.Add(noExponentSuffix);
+        SetRuleOrder(MorphologicalRuleOrder.Unordered);
+
+        var morpher = new Morpher(TraceManager, Language) { MaxParseSteps = 500, ParseTimeout = TimeSpan.Zero };
+
+        List<Word> results = morpher.ParseWord("sag", out _, false, out ParseDiagnostics diagnostics).ToList();
+
+        Assert.That(diagnostics.BudgetExhausted, Is.True);
+        Assert.That(diagnostics.Reason, Is.EqualTo(ParseExhaustionReason.StepBudget));
+        Assert.That(diagnostics.StepsUsed, Is.GreaterThanOrEqualTo(500));
+        // Soft-stop: never throws, and ParseWord itself must remain usable afterwards.
+ Assert.That(() => morpher.ParseWord("sagd", out _, false), Throws.Nothing); + } + + [Test] + public void ParseWord_StepBudget_IsDeterministicSingleThreaded() + { + var any = FeatureStruct.New().Symbol(HCFeatureSystem.Segment).Value; + // No overt exponent: Rhs is a pure copy of the input, so every unapplication produces a + // Word with the identical Shape but one more entry in the morphological-rule-application + // list. The cascades' infinite-loop guard compares Words by ValueEquals (which includes + // that list), so it never trips here — only the new step budget bounds this. + var noExponentSuffix = new AffixProcessRule + { + Id = "REPEAT", + Name = "no_exponent_suffix", + Gloss = "REPEAT", + MaxApplicationCount = 1_000_000, + RequiredSyntacticFeatureStruct = FeatureStruct.New(Language.SyntacticFeatureSystem).Symbol("V").Value, + }; + noExponentSuffix.Allomorphs.Add( + new AffixProcessAllomorph + { + Lhs = { Pattern.New("1").Annotation(any).OneOrMore.Value }, + Rhs = { new CopyFromInput("1") }, + } + ); + Morphophonemic.MorphologicalRules.Add(noExponentSuffix); + SetRuleOrder(MorphologicalRuleOrder.Unordered); + + var morpher = new Morpher(TraceManager, Language, maxDegreeOfParallelism: 1) + { + MaxParseSteps = 500, + ParseTimeout = TimeSpan.Zero, + }; + + morpher.ParseWord("sag", out _, false, out ParseDiagnostics first).ToList(); + morpher.ParseWord("sag", out _, false, out ParseDiagnostics second).ToList(); + + Assert.That(first.StepsUsed, Is.EqualTo(second.StepsUsed)); + } + + [Test] + public void RerunWithDiagnostics_ReportsTopOffendingRule() + { + var any = FeatureStruct.New().Symbol(HCFeatureSystem.Segment).Value; + // No overt exponent: Rhs is a pure copy of the input, so every unapplication produces a + // Word with the identical Shape but one more entry in the morphological-rule-application + // list. 
The cascades' infinite-loop guard compares Words by ValueEquals (which includes
+        // that list), so it never trips here; only the new step budget bounds this.
+        var noExponentSuffix = new AffixProcessRule
+        {
+            Id = "REPEAT",
+            Name = "no_exponent_suffix",
+            Gloss = "REPEAT",
+            MaxApplicationCount = 1_000_000,
+            RequiredSyntacticFeatureStruct = FeatureStruct.New(Language.SyntacticFeatureSystem).Symbol("V").Value,
+        };
+        noExponentSuffix.Allomorphs.Add(
+            new AffixProcessAllomorph
+            {
+                Lhs = { Pattern<Word, ShapeNode>.New("1").Annotation(any).OneOrMore.Value },
+                Rhs = { new CopyFromInput("1") },
+            }
+        );
+        Morphophonemic.MorphologicalRules.Add(noExponentSuffix);
+        SetRuleOrder(MorphologicalRuleOrder.Unordered);
+
+        var morpher = new Morpher(TraceManager, Language) { MaxParseSteps = 500, ParseTimeout = TimeSpan.Zero };
+
+        ParseDiagnostics diagnostics = morpher.RerunWithDiagnostics("sag", out IEnumerable<Word> results);
+        results.ToList();
+
+        Assert.That(diagnostics.BudgetExhausted, Is.True);
+        Assert.That(diagnostics.TopRules, Is.Not.Empty);
+        Assert.That(diagnostics.TopRules[0].Rule, Is.EqualTo(noExponentSuffix));
+    }
+
     private static string AnalysisSignature(Morpher morpher, string word)
     {
         return string.Join(

From e68f09844fb43de8759a408947c10b28e9e87817 Mon Sep 17 00:00:00 2001
From: John Lambert
Date: Thu, 2 Jul 2026 14:34:54 -0400
Subject: [PATCH 2/6] Complexity cap Phase 2: structural bounds (Layer 2)

Adds three additive, default-off caps that convert exponential blowups into
bounded ones instead of merely time-boxing them:

- Morpher.MaxRuleApplicationsPerWord: a running total-unapplications counter
  on Word (Word.TotalUnapplicationCount), checked alongside the existing
  per-rule MaxApplicationCount in the three affix/compounding analysis rules.
  Closes the "rule A -> B -> A -> B" loophole that a per-rule cap alone
  cannot catch.
- Morpher.MaxAnalysisShapeGrowth: prunes analysis candidates whose shape has
  grown past the surface form by more than N segments, checked at
  AnalysisStratumRule's output loop (the choke point - candidates pruned there
  never reach lexical lookup) and per-iteration inside AnalysisRewriteRule's
  Deletion/SelfOpaquing reapplication loops.
- PermutationRuleCascade.MaxDepth (SIL.Machine core, opt-in via a new
  property, -1/unlimited by default so existing consumers are unaffected):
  caps nested rule-reapplication depth, derived from
  MaxRuleApplicationsPerWord rather than a new knob, synced each Apply() call
  since the cap can be set via object-initializer syntax after the rule
  cascade is already compiled.

Verified against RewriteRuleTests.DeletionRules' real deletion-rule grammar:
capping MaxAnalysisShapeGrowth excludes the deep-reinsertion analysis while
the shallow ones survive as a strict subset of the uncapped result.
---
 .../AnalysisStratumRule.cs                    |  28 ++++-
 .../Morpher.cs                                |  43 +++++++-
 .../AnalysisAffixProcessRule.cs               |   4 +
 .../AnalysisCompoundingRule.cs                |   4 +
 .../AnalysisRealizationalAffixProcessRule.cs  |   8 ++
 .../PhonologicalRules/AnalysisRewriteRule.cs  |  14 ++-
 src/SIL.Machine.Morphology.HermitCrab/Word.cs |   9 ++
 .../Rules/PermutationRuleCascade.cs           |  17 ++-
 .../MorpherTests.cs                           | 104 ++++++++++++++++++
 9 files changed, 219 insertions(+), 12 deletions(-)

diff --git a/src/SIL.Machine.Morphology.HermitCrab/AnalysisStratumRule.cs b/src/SIL.Machine.Morphology.HermitCrab/AnalysisStratumRule.cs
index 3ee2b95b..6cda018e 100644
--- a/src/SIL.Machine.Morphology.HermitCrab/AnalysisStratumRule.cs
+++ b/src/SIL.Machine.Morphology.HermitCrab/AnalysisStratumRule.cs
@@ -11,6 +11,7 @@ namespace SIL.Machine.Morphology.HermitCrab
     internal class AnalysisStratumRule : IRule<Word, ShapeNode>
     {
         private readonly IRule<Word, ShapeNode> _mrulesRule;
+        private readonly PermutationRuleCascade<Word, ShapeNode> _permutationCascade;
         private readonly IRule<Word, ShapeNode> _prulesRule;
         private readonly IRule<Word, ShapeNode> _templatesRule;
         private readonly Stratum _stratum;
@@
-39,11 +40,12 @@ public AnalysisStratumRule(Morpher morpher, Stratum stratum) // because morphological rules should be considered optional // during unapplication (they are obligatory during application, // but we don't know they have been applied during unapplication). - _mrulesRule = new PermutationRuleCascade( + _permutationCascade = new PermutationRuleCascade( mrules, true, FreezableEqualityComparer.Default ); + _mrulesRule = _permutationCascade; break; case MorphologicalRuleOrder.Unordered: // Single-threaded when the caller caps within-word parallelism (e.g. it @@ -106,8 +108,24 @@ private IRule CompilePhonologicalRule(IPhonologicalRule prule, Morphe } } + private bool ExceedsShapeGrowth(Word word) + { + return _morpher.MaxAnalysisShapeGrowth >= 0 + && word.ParseContext != null + && word.Shape.Count > word.ParseContext.SurfaceLength + _morpher.MaxAnalysisShapeGrowth; + } + public IEnumerable Apply(Word input) { + // Re-synced on every call rather than baked in at compile time: MaxRuleApplicationsPerWord + // is a mutable Morpher property that callers set via object-initializer syntax after + // construction (the same pattern MaxParseSteps/ParseTimeout use), which runs after this + // rule was already compiled. No new knob per complexity-cap.md §5.3 — derived from the + // existing per-word unapplication cap (0/unlimited maps to no depth limit). + if (_permutationCascade != null) + _permutationCascade.MaxDepth = + _morpher.MaxRuleApplicationsPerWord > 0 ? _morpher.MaxRuleApplicationsPerWord : -1; + if (_morpher.TraceManager.IsTracing) _morpher.TraceManager.BeginUnapplyStratum(_stratum, input); @@ -137,6 +155,14 @@ public IEnumerable Apply(Word input) if (input.ParseContext?.Exhausted == true) break; + // Prune candidates whose hypothesized underlying shape has grown too far past the + // surface form — the truly unbounded generator (undone deletions, empty exponents). + // Pruned here so they never reach lexical lookup or the next stratum. 
+ if (ExceedsShapeGrowth(mruleOutWord)) + { + continue; + } + // Skip intermediate sources from phonological rules, templates, and morphological rules. mruleOutWord.Source = origInput; if (mergeEquivalentAnalyses) diff --git a/src/SIL.Machine.Morphology.HermitCrab/Morpher.cs b/src/SIL.Machine.Morphology.HermitCrab/Morpher.cs index da9ad1c0..5a8630c9 100644 --- a/src/SIL.Machine.Morphology.HermitCrab/Morpher.cs +++ b/src/SIL.Machine.Morphology.HermitCrab/Morpher.cs @@ -117,6 +117,24 @@ public ITraceManager TraceManager /// public TimeSpan ParseTimeout { get; set; } + /// + /// Max total morphological-rule unapplications per analysis candidate (≈ max affixes per + /// word), checked across all rules combined — closes the loophole where a per-rule + /// application cap never trips because no single rule repeats (e.g. rule A unapplies, then B, + /// then A again). 0 = unlimited + /// (default: some legitimate agglutinative grammars have long affix chains, so this is off by + /// default in the library; FieldWorks is expected to opt into a conservative value). + /// + public int MaxRuleApplicationsPerWord { get; set; } + + /// + /// Prunes any analysis candidate whose shape exceeds the surface form's length by more than + /// this many segments — the one truly unbounded generator, where unapplication hypothesizes + /// deleted/epenthesized material and keeps making the underlying form longer. -1 = unlimited + /// (default, preserves existing behavior). + /// + public int MaxAnalysisShapeGrowth { get; set; } = -1; + /// /// MaxUnapplications limits the number of unapplications to make it possible /// to make it possible to debug words that take 30 minutes to parse @@ -172,7 +190,12 @@ public IEnumerable ParseWord(string word, out object trace, bool guessRoot /// / cut the parse short (soft-stop: the /// returned sequence is whatever was found so far, never an exception). 
/// - public IEnumerable ParseWord(string word, out object trace, bool guessRoot, out ParseDiagnostics diagnostics) + public IEnumerable ParseWord( + string word, + out object trace, + bool guessRoot, + out ParseDiagnostics diagnostics + ) { return ParseWordCore(word, out trace, guessRoot, collectRuleCounters: false, out diagnostics); } @@ -268,7 +291,13 @@ private static ParseDiagnostics CreateParseDiagnostics(ParseContext parseContext .ToList(); } - return new ParseDiagnostics(true, parseContext.Reason, parseContext.StepsUsed, parseContext.Elapsed, topRules); + return new ParseDiagnostics( + true, + parseContext.Reason, + parseContext.StepsUsed, + parseContext.Elapsed, + topRules + ); } /// @@ -297,7 +326,11 @@ out object trace trace = rootTrace; var words = new ConcurrentBag(); - var parseContext = new ParseContext(MaxParseSteps, ParseTimeout, rootEntry.PrimaryAllomorph.Segments.Shape.Count); + var parseContext = new ParseContext( + MaxParseSteps, + ParseTimeout, + rootEntry.PrimaryAllomorph.Segments.Shape.Count + ); Exception exception = null; Parallel.ForEach( @@ -318,7 +351,9 @@ out object trace { synthesisWord.MorphologicalRuleUnapplied(rule.Item1); if (rule.Item2 != null) - synthesisWord.NonHeadUnapplied(new Word(rule.Item2, new FeatureStruct()) { ParseContext = parseContext }); + synthesisWord.NonHeadUnapplied( + new Word(rule.Item2, new FeatureStruct()) { ParseContext = parseContext } + ); } synthesisWord.CurrentTrace = rootTrace; diff --git a/src/SIL.Machine.Morphology.HermitCrab/MorphologicalRules/AnalysisAffixProcessRule.cs b/src/SIL.Machine.Morphology.HermitCrab/MorphologicalRules/AnalysisAffixProcessRule.cs index 7cca6fdf..067e23ad 100644 --- a/src/SIL.Machine.Morphology.HermitCrab/MorphologicalRules/AnalysisAffixProcessRule.cs +++ b/src/SIL.Machine.Morphology.HermitCrab/MorphologicalRules/AnalysisAffixProcessRule.cs @@ -47,6 +47,10 @@ public IEnumerable Apply(Word input) if ( input.GetUnapplicationCount(_rule) >= _rule.MaxApplicationCount + || ( + 
_morpher.MaxRuleApplicationsPerWord > 0 + && input.TotalUnapplicationCount >= _morpher.MaxRuleApplicationsPerWord + ) || !_rule.OutSyntacticFeatureStruct.IsUnifiable(input.SyntacticFeatureStruct) ) { diff --git a/src/SIL.Machine.Morphology.HermitCrab/MorphologicalRules/AnalysisCompoundingRule.cs b/src/SIL.Machine.Morphology.HermitCrab/MorphologicalRules/AnalysisCompoundingRule.cs index e03b6cfe..f9874c26 100644 --- a/src/SIL.Machine.Morphology.HermitCrab/MorphologicalRules/AnalysisCompoundingRule.cs +++ b/src/SIL.Machine.Morphology.HermitCrab/MorphologicalRules/AnalysisCompoundingRule.cs @@ -48,6 +48,10 @@ public IEnumerable Apply(Word input) if ( input.NonHeadCount + 1 >= _morpher.MaxStemCount || input.GetUnapplicationCount(_rule) >= _rule.MaxApplicationCount + || ( + _morpher.MaxRuleApplicationsPerWord > 0 + && input.TotalUnapplicationCount >= _morpher.MaxRuleApplicationsPerWord + ) || !_rule.OutSyntacticFeatureStruct.IsUnifiable(input.SyntacticFeatureStruct) ) { diff --git a/src/SIL.Machine.Morphology.HermitCrab/MorphologicalRules/AnalysisRealizationalAffixProcessRule.cs b/src/SIL.Machine.Morphology.HermitCrab/MorphologicalRules/AnalysisRealizationalAffixProcessRule.cs index e526682a..a03b9379 100644 --- a/src/SIL.Machine.Morphology.HermitCrab/MorphologicalRules/AnalysisRealizationalAffixProcessRule.cs +++ b/src/SIL.Machine.Morphology.HermitCrab/MorphologicalRules/AnalysisRealizationalAffixProcessRule.cs @@ -45,6 +45,14 @@ public IEnumerable Apply(Word input) if (!_morpher.RuleSelector(_rule)) return Enumerable.Empty(); + if ( + _morpher.MaxRuleApplicationsPerWord > 0 + && input.TotalUnapplicationCount >= _morpher.MaxRuleApplicationsPerWord + ) + { + return Enumerable.Empty(); + } + FeatureStruct realFS; if (!_rule.RealizationalFeatureStruct.Unify(input.RealizationalFeatureStruct, out realFS)) return Enumerable.Empty(); diff --git a/src/SIL.Machine.Morphology.HermitCrab/PhonologicalRules/AnalysisRewriteRule.cs 
b/src/SIL.Machine.Morphology.HermitCrab/PhonologicalRules/AnalysisRewriteRule.cs index 08a01a6c..ae9bbe4e 100644 --- a/src/SIL.Machine.Morphology.HermitCrab/PhonologicalRules/AnalysisRewriteRule.cs +++ b/src/SIL.Machine.Morphology.HermitCrab/PhonologicalRules/AnalysisRewriteRule.cs @@ -118,6 +118,13 @@ private static bool IsUnifiable(Constraint constraint, Pattern= 0 + && data.ParseContext != null + && data.Shape.Count > data.ParseContext.SurfaceLength + _morpher.MaxAnalysisShapeGrowth; + } + public IEnumerable Apply(Word input) { if (input.ParseContext?.Step(_rule) == false) @@ -156,7 +163,7 @@ public IEnumerable Apply(Word input) break; // Bounded by DeletionReapplications above, but that's a user-set knob with // no ceiling of its own — still gate each reapplication on the shared budget. - if (input.ParseContext?.Step(_rule) == false) + if (input.ParseContext?.Step(_rule) == false || ExceedsShapeGrowth(data)) break; data = sr.Item2.Apply(data).SingleOrDefault(); } @@ -170,8 +177,9 @@ public IEnumerable Apply(Word input) { srApplied = true; // Unlike Deletion, this loop has no reapplication ceiling of its own (a - // self-feeding rule can hypothesize forever) — the budget is the only bound. - if (input.ParseContext?.Step(_rule) == false) + // self-feeding rule can hypothesize forever) — the budget and shape-growth + // cap are the only bounds. 
+ if (input.ParseContext?.Step(_rule) == false || ExceedsShapeGrowth(data)) break; data = sr.Item2.Apply(data).SingleOrDefault(); } diff --git a/src/SIL.Machine.Morphology.HermitCrab/Word.cs b/src/SIL.Machine.Morphology.HermitCrab/Word.cs index 95e8b320..09321541 100644 --- a/src/SIL.Machine.Morphology.HermitCrab/Word.cs +++ b/src/SIL.Machine.Morphology.HermitCrab/Word.cs @@ -35,6 +35,7 @@ public class Word : Freezable, IAnnotatedData, ICloneable private bool _isPartial; private Dictionary> _disjunctiveAllomorphIndices; // lazily allocated (see above) private int _mruleAppCount = 0; + private int _totalUnapplicationCount = 0; private readonly IList _alternatives = new List(); public Word(RootAllomorph rootAllomorph, FeatureStruct realizationalFS) @@ -107,6 +108,7 @@ protected Word(Word word) kvp => new HashSet(kvp.Value) ); _mruleAppCount = word._mruleAppCount; + _totalUnapplicationCount = word._totalUnapplicationCount; } public IEnumerable> Morphs @@ -253,6 +255,12 @@ public IEnumerable MorphologicalRules internal int MorphologicalRuleApplicationCount => _mruleAppCount; + /// + /// Total morphological-rule unapplications on this analysis candidate, across all rules + /// combined. Carrier for . + /// + internal int TotalUnapplicationCount => _totalUnapplicationCount; + internal bool IsAllMorphologicalRulesApplied { get { return _mruleAppIndex == -1; } @@ -341,6 +349,7 @@ internal void RemoveMorph(Annotation morphAnn) internal void MorphologicalRuleUnapplied(IMorphologicalRule mrule) { CheckFrozen(); + _totalUnapplicationCount++; if (mrule != null) (_mrulesUnapplied = _mrulesUnapplied ?? 
new Dictionary()).UpdateValue( mrule, diff --git a/src/SIL.Machine/Rules/PermutationRuleCascade.cs b/src/SIL.Machine/Rules/PermutationRuleCascade.cs index b16671f4..3b82e449 100644 --- a/src/SIL.Machine/Rules/PermutationRuleCascade.cs +++ b/src/SIL.Machine/Rules/PermutationRuleCascade.cs @@ -22,22 +22,31 @@ IEqualityComparer comparer ) : base(rules, multiApp, comparer) { } + /// + /// Caps how many nested rule (re-)applications a single branch may descend through, on top of + /// the base class's input==output infinite-loop guard (which a rule whose output never exactly + /// repeats its input — e.g. one that keeps growing the shape — sails past). -1 = unlimited, the + /// default, so existing consumers see no behavior change. + /// + public int MaxDepth { get; set; } = -1; + public override IEnumerable Apply(TData input) { var output = new HashSet(Comparer); - ApplyRules(input, 0, output); + ApplyRules(input, 0, 0, output); return output; } - private void ApplyRules(TData input, int ruleIndex, HashSet output) + private void ApplyRules(TData input, int ruleIndex, int depth, HashSet output) { + bool descend = MaxDepth < 0 || depth < MaxDepth; for (int i = ruleIndex; i < Rules.Count; i++) { foreach (TData result in ApplyRule(Rules[i], i, input)) { // avoid infinite loop - if (!MultipleApplication || !Comparer.Equals(input, result)) - ApplyRules(result, MultipleApplication ? i : i + 1, output); + if (descend && (!MultipleApplication || !Comparer.Equals(input, result))) + ApplyRules(result, MultipleApplication ? 
i : i + 1, depth + 1, output); output.Add(result); } } diff --git a/tests/SIL.Machine.Morphology.HermitCrab.Tests/MorpherTests.cs b/tests/SIL.Machine.Morphology.HermitCrab.Tests/MorpherTests.cs index 59879a89..1a3f49b6 100644 --- a/tests/SIL.Machine.Morphology.HermitCrab.Tests/MorpherTests.cs +++ b/tests/SIL.Machine.Morphology.HermitCrab.Tests/MorpherTests.cs @@ -688,6 +688,110 @@ public void RerunWithDiagnostics_ReportsTopOffendingRule() Assert.That(diagnostics.TopRules[0].Rule, Is.EqualTo(noExponentSuffix)); } + [Test] + public void ParseWord_MaxRuleApplicationsPerWord_BoundsTotalAcrossRules() + { + // Same no-overt-exponent shape as the step-budget test, but bounded via the total- + // unapplications cap instead of the step budget: closes the "even if separated" loophole + // that per-rule MaxApplicationCount alone cannot (see complexity-cap.md §5.1). + var any = FeatureStruct.New().Symbol(HCFeatureSystem.Segment).Value; + var noExponentSuffix = new AffixProcessRule + { + Id = "REPEAT", + Name = "no_exponent_suffix", + Gloss = "REPEAT", + MaxApplicationCount = 1_000_000, + RequiredSyntacticFeatureStruct = FeatureStruct.New(Language.SyntacticFeatureSystem).Symbol("V").Value, + }; + noExponentSuffix.Allomorphs.Add( + new AffixProcessAllomorph + { + Lhs = { Pattern.New("1").Annotation(any).OneOrMore.Value }, + Rhs = { new CopyFromInput("1") }, + } + ); + Morphophonemic.MorphologicalRules.Add(noExponentSuffix); + SetRuleOrder(MorphologicalRuleOrder.Unordered); + + // No step/timeout budget here — MaxRuleApplicationsPerWord alone must terminate the parse. 
+ var morpher = new Morpher(TraceManager, Language) + { + MaxParseSteps = 0, + ParseTimeout = TimeSpan.Zero, + MaxRuleApplicationsPerWord = 10, + }; + + List results = morpher.ParseWord("sag", out _, false, out ParseDiagnostics diagnostics).ToList(); + + Assert.That( + diagnostics.BudgetExhausted, + Is.False, + "MaxRuleApplicationsPerWord is not a ParseDiagnostics-reported budget" + ); + Assert.That(results.All(w => w.TotalUnapplicationCount <= 10), Is.True); + } + + [Test] + public void ParseWord_MaxAnalysisShapeGrowth_PrunesDeepReinsertion() + { + // Reuses the DeletionRules rule4 shape (RewriteRuleTests.DeletionRules): deleting a high + // front unrounded vowel ("i") after a high vowel, so analysis can hypothesize progressively + // more deleted "i"s to the left, growing the shape well past the surface form. + var highFrontUnrndVowel = FeatureStruct + .New(Language.PhonologicalFeatureSystem) + .Symbol(HCFeatureSystem.Segment) + .Symbol("cons-") + .Symbol("voc+") + .Symbol("high+") + .Symbol("low-") + .Symbol("back-") + .Symbol("round-") + .Value; + var highVowel = FeatureStruct + .New(Language.PhonologicalFeatureSystem) + .Symbol(HCFeatureSystem.Segment) + .Symbol("cons-") + .Symbol("voc+") + .Symbol("high+") + .Value; + + var rule4 = new RewriteRule + { + Name = "rule4", + Lhs = Pattern.New().Annotation(highFrontUnrndVowel).Value, + }; + Allophonic.PhonologicalRules.Add(rule4); + rule4.Subrules.Add( + new RewriteSubrule { LeftEnvironment = Pattern.New().Annotation(highVowel).Value } + ); + + // Unbounded (default): matches the existing DeletionRules precedent exactly (RewriteRuleTests. + // DeletionRules) — deep reinsertion morph "27" ("buiibuii", 8 segments vs. surface "bubu"'s 4) + // is reachable. 
+ var uncapped = new Morpher(TraceManager, Language) { DeletionReapplications = 1 }; + List uncappedResults = uncapped.ParseWord("bubu", out _, false).ToList(); + AssertMorphsEqual(uncappedResults, "24", "25", "26", "27", "19"); + + // Capped tightly enough that the deepest reinsertion chain cannot complete: the result set + // must shrink (never grow) relative to uncapped, and every remaining candidate's *analysis* + // step count must be no larger than the uncapped run's (the cap can only prune work, not add + // any) — without hard-coding which exact morphs the pruning walks away, since the interaction + // between DeletionReapplications' reapplication loop and Simultaneous-mode multi-site matching + // is intricate enough that pinning exact morph identities here would be over-fitting to + // incidental engine internals rather than the behavior this cap actually promises. + var capped = new Morpher(TraceManager, Language) { DeletionReapplications = 1, MaxAnalysisShapeGrowth = 0 }; + List cappedResults = capped.ParseWord("bubu", out _, false).ToList(); + Assert.That( + cappedResults.Select(w => string.Join("+", w.AllomorphsInMorphOrder.Select(a => a.Morpheme.Id))), + Is.SubsetOf( + uncappedResults.Select(w => string.Join("+", w.AllomorphsInMorphOrder.Select(a => a.Morpheme.Id))) + ) + ); + // The maximally-grown morph ("27", which needs the underlying shape to grow by 4 segments) + // must not survive a cap of 0 (no growth allowed at all). 
+ Assert.That(cappedResults.Any(w => w.AllomorphsInMorphOrder.Any(a => a.Morpheme.Id == "27")), Is.False); + } + private static string AnalysisSignature(Morpher morpher, string word) { return string.Join( From c8a39aeb16dcc783b90ef8e80a48ce74f0a57a0b Mon Sep 17 00:00:00 2001 From: John Lambert Date: Thu, 2 Jul 2026 16:16:35 -0400 Subject: [PATCH 3/6] Complexity cap Phase 3: static grammar lint (Layer 3) + calibration honesty pass Adds GrammarAnalyzer, a static analyzer over a loaded Language that flags always/almost-always-wrong rule shapes with stable diagnostic codes (HC0001-HC0008: no-overt-exponent affix rules, unbounded multipleApplication, self-feeding epenthesis/deletion rules, unconstrained compounding, optional-iterative lexical patterns, cyclic feeding pairs). Wired into the hc CLI as a new `hc lint` command, plus a `hc parse --diagnose` flag that surfaces RerunWithDiagnostics' top offending rules for a single word - the empirical companion to the static lint. Both are documented in a new docs/hermitcrab-grammar-performance.md guide organized by HC code. While shaping HC0004's self-feeding check, deduped the "does this rule's output unify with its own required environment" logic shared between AnalysisRewriteRule and GrammarAnalyzer into a single IsUnifiableWithEnvironment extension, and found/fixed a real gap: the lint only covered one of two engine paths that select self-opaquing behavior, silently missing the epenthesis case (unconditionally dangerous in Simultaneous mode). Also fixed a pre-existing HC0007 condition that required Optional *and* IsIterative on adjacent lexical pattern nodes, when the design doc's own canonical example (([Seg])([Seg])) is two plain-optional (non-iterative) groups - the check now matches the documented intent. 
Ran the real Phase 0 calibration corpus (indonesian/sena) against the rustify engine and replaced the Phase 1 doc comment's fabricated "~13,600 steps" figure with real numbers: Indonesian's worst word takes 10,445 steps (flat ~10-rule combinatorial interaction, not one bad rule); Sena's worst sampled word takes 14.9M steps/105s from only a ~1% corpus sample, and a separate real word was previously being truncated by the old 10s default timeout at 99,584 steps. Raised DefaultMaxParseSteps to 50,000,000 and DefaultParseTimeout to 30s accordingly, and documented in complexity-cap.md (with two new "still open" items) that the Sena figures are a floor pending a full-corpus re-baseline, and that the timeout is a genuine truncation/latency tradeoff rather than a pure safety margin. 82/82 HermitCrab tests pass; both projects build clean; csharpier clean. Co-Authored-By: Claude Sonnet 5 --- complexity-cap.md | 62 +++- docs/hermitcrab-grammar-performance.md | 115 ++++++ .../LintCommand.cs | 68 ++++ .../ParseCommand.cs | 57 ++- .../Program.cs | 1 + .../GrammarAnalyzer.cs | 349 ++++++++++++++++++ .../HermitCrabExtensions.cs | 28 ++ .../Morpher.cs | 35 +- .../ParseContext.cs | 6 +- .../ParseDiagnostics.cs | 8 - .../PhonologicalRules/AnalysisRewriteRule.cs | 20 +- .../ComplexityCapCorpusTests.cs | 253 +++++++++++++ .../GrammarAnalyzerTests.cs | 337 +++++++++++++++++ 13 files changed, 1287 insertions(+), 52 deletions(-) create mode 100644 docs/hermitcrab-grammar-performance.md create mode 100644 src/SIL.Machine.Morphology.HermitCrab.Tool/LintCommand.cs create mode 100644 src/SIL.Machine.Morphology.HermitCrab/GrammarAnalyzer.cs create mode 100644 tests/SIL.Machine.Morphology.HermitCrab.Tests/ComplexityCapCorpusTests.cs create mode 100644 tests/SIL.Machine.Morphology.HermitCrab.Tests/GrammarAnalyzerTests.cs diff --git a/complexity-cap.md b/complexity-cap.md index 1701219e..75fe21ae 100644 --- a/complexity-cap.md +++ b/complexity-cap.md @@ -92,15 +92,49 @@ the exact failure mode 
this plan exists to fix — an unbounded parse — remain out-of-the-box behavior. A generous cap that never fires for legitimate grammars but reliably kills runaway ones is strictly better than silence. -Concrete numbers are calibrated in Phase 0 against the real corpus (§7), not guessed -here, but the target shape is: run every word in `indonesian-words.txt` (121 words) and -`sena-words.txt` (7,121 words) against their respective grammars on the rustify engine, -take the observed max step count / max wall-clock time across that legitimate corpus, -and set the default to a large multiple of that ceiling (e.g. 50–100×) so it is -effectively invisible for real grammars but still finite. `ParseTimeout` defaults -similarly, e.g. a flat few seconds per word — generous for interactive/FLEx single-word -parses, still bounded for "Parse All Words" batches where one stuck word must not stall -the run indefinitely. +**Calibration results (2026-07-02, partial — see caveat below):** running the real +corpus against the rustify engine shows legitimate cost varies by roughly **1000x** +between the two grammars, which broke the original "large multiple of Indonesian's +ceiling" plan: + +- `indonesian-words.txt` (121 words, 2,563-line grammar): worst observed word + (`mengamat-amati`, a reduplicated compound) took **10,445 steps**. `hc lint` reports + this grammar as clean (2 `HC0006` warnings, 0 errors), and `RerunWithDiagnostics` + shows the cost is a **flat distribution across ~10 rules** (~6.5% each) — legitimate + combinatorial interaction from compounding + reduplication, not one bad rule. +- `sena-words.txt` (7,121 words, 33,091-line grammar): also lint-clean, but far more + expensive per word. The worst word sampled so far, `atawirambo`, took **14,905,517 + steps / 105.3 seconds** — a successful, legitimate parse with the same flat + multi-rule-interaction shape (Sena's agglutinative verb morphology stacks many + candidate subject/tense/object affix slots). 
A separate word, `ndinakupangani`, hit + the (now-superseded) 10-second default timeout at only 99,584 steps, i.e. a real word + was previously getting truncated by the shipped default. +- Because a full Sena run takes hours (many individual words take 10s–100+ seconds), + **only ~1% of the Sena corpus (72/7,121 words) has actually been sampled.** The + 14,905,517-step figure is the worst seen so far, not a proven ceiling — this should be + re-baselined against the full corpus before the shipped defaults are treated as final. + +Given this, the "50–100x a single grammar's ceiling" heuristic below doesn't transfer +across grammars of very different size/complexity — a multiplier calibrated on +Indonesian would be irrelevant to Sena's scale, and a multiplier large enough for Sena +would be absurd for Indonesian. Shipped defaults (`Morpher.DefaultMaxParseSteps` = +50,000,000, `Morpher.DefaultParseTimeout` = 30s) are instead set with headroom above the +largest *legitimate* word observed so far across both grammars, on the expectation that +`ParseTimeout` — not the step count — is what actually trips for slow-but-legitimate +words in practice, since step cost and wall-clock time track closely (~140k steps/sec +observed on Sena). The step budget mainly exists to catch algorithmically-cheap infinite +loops, which are cheap enough per step to blow past millions of steps in a fraction of a +second regardless of a grammar's normal cost profile. See the doc comments on those two +constants in `Morpher.cs` for the full reasoning. Note the timeout is a genuine, openly +acknowledged tradeoff, not just a safety margin: at 30s it will still occasionally +truncate an expensive-but-legitimate Sena word; raising it protects those words at the +cost of a slower worst-case "Parse All Words" batch. 
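
As a rough consistency check on the calibration figures above, the claim that `ParseTimeout` trips before the step budget for slow-but-legitimate words can be worked through by hand. This is a back-of-the-envelope sketch that uses only the numbers quoted in this section and assumes the Sena throughput measured on `atawirambo` holds for a whole parse; it is not an additional measurement.

```latex
\begin{align*}
\text{throughput (Sena, \texttt{atawirambo})}
  &\approx \frac{14{,}905{,}517\ \text{steps}}{105.3\ \text{s}}
   \approx 1.42 \times 10^{5}\ \text{steps/s} \\[4pt]
\text{steps reachable within the 30 s timeout}
  &\approx 30 \times 1.42 \times 10^{5}
   \approx 4.2 \times 10^{6} \ll 5 \times 10^{7} \\[4pt]
\text{time needed to exhaust 50{,}000{,}000 steps}
  &\approx \frac{5 \times 10^{7}}{1.42 \times 10^{5}}
   \approx 350\ \text{s} \\[4pt]
\frac{\text{Sena worst observed}}{\text{Indonesian worst observed}}
  &\approx \frac{14{,}905{,}517}{10{,}445}
   \approx 1.4 \times 10^{3}
\end{align*}
```

Under that assumption, a legitimate word reaches the 30-second timeout after roughly 4 million steps, about a twelfth of the 50,000,000-step budget, so the step budget is left to catch loops that are much cheaper per step. The `ndinakupangani` observation (99,584 steps at the old 10-second timeout) suggests throughput varies considerably from word to word, so these figures are order-of-magnitude only.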
+ +Original target shape (superseded by the above, kept for history): run every word in +`indonesian-words.txt` and `sena-words.txt` against their respective grammars, take the +observed max step count / max wall-clock time across that legitimate corpus, and set the +default to a large multiple of that ceiling (e.g. 50–100×) so it is effectively invisible +for real grammars but still finite. ### 4.2 Per-parse context, propagated like `CurrentTrace` @@ -408,3 +442,13 @@ across rustify's 100-file rewrite is not. Concretely: 6. **HC0004/HC0008 precision**: self-feeding/cycle detection via unification is approximate; acceptable false-positive rate for a Warning? Start conservative (high-confidence patterns only), widen with field feedback. +7. **Sena calibration is based on a ~1% sample (72/7,121 words)**, not a full corpus run + (see §4.1) — the worst-observed-word figures used to set `DefaultMaxParseSteps`/ + `DefaultParseTimeout` are a floor, not a proven ceiling. Re-baseline against the full + corpus (accept the multi-hour run, or parallelize it) before treating these as final, + and specifically check whether any word exceeds the current 50,000,000-step default. +8. **`DefaultParseTimeout` = 30s will still truncate some legitimate Sena words** (one + observed at 105s). Whether 30s is the right number — vs. a larger default, vs. no + default timeout with only a step budget, vs. a per-consumer-tunable-only knob with no + shipped default at all — is a real product decision that needs field input, not + something this investigation can resolve alone. 
diff --git a/docs/hermitcrab-grammar-performance.md b/docs/hermitcrab-grammar-performance.md new file mode 100644 index 00000000..8553bd47 --- /dev/null +++ b/docs/hermitcrab-grammar-performance.md @@ -0,0 +1,115 @@ +# Writing performant HermitCrab grammars + +HermitCrab's engine speedups (see the `hc-rustify` work) and its complexity-cap safety net +(`complexity-cap.md`) both help pathological grammars fail *safely* — bounded runtime, a status +flag, and per-rule evidence when a parse gives up. Neither one makes a pathological grammar fast. +The real fix is always at the grammar level. This guide catalogues the rule shapes that reliably +cause combinatorial blowups, keyed by the stable diagnostic codes `GrammarAnalyzer.Analyze` +(`hc lint`) emits, plus the interaction patterns that only show up empirically. + +## Static checks (`GrammarAnalyzer` / `hc lint`) + +### HC0001 — Error: no overt exponent + `MaxApplicationCount > 1` + +An affix rule whose every allomorph's output is a pure copy of the input (no inserted segments) +*and* whose `MaxApplicationCount` has been raised above 1 (the XML `multipleApplication` +attribute) will unapply to every word, every time, with nothing to ever make it stop. Analysis +keeps "peeling off" a rule that changed nothing, over and over, up to the configured cap. + +**Fix:** give the rule a real, overt exponent (an inserted segment or boundary), or drop +`MaxApplicationCount` back to the default of 1. + +### HC0002 — Warning: no overt exponent, single application + +Same "adds nothing" shape as HC0001, but capped at one application. Still doubles the candidate +count at every cascade position it's considered at, for no linguistic payoff. Often this is an +unintentional gap in a grammar rather than a deliberate zero-exponent rule (e.g. a rule that's +purely feature-changing). + +**Fix:** add an overt exponent if one is missing, or confirm the zero-exponent shape is +intentional (e.g. 
modeling a floating feature) and leave it — HC0002 is Info-adjacent, not a hard +error. + +### HC0003 — Warning: `MaxApplicationCount` raised + +Flags the opt-in itself, on any affix rule, independent of whether it has an overt exponent. This +is exactly the knob a pathological grammar reaches for. It's not wrong to raise it — some +agglutinative languages need real recursive affixation — but every raised value should be +justified by an actual attested word shape, not left at "big enough." + +**Fix:** set it to the smallest value that covers real words in the language, not a round number +picked for headroom. + +### HC0004 — Warning: self-feeding rewrite rule + +A `Simultaneous`-mode phonological rule whose output can satisfy its own environment again. Before +complexity-cap's Layer 1, this specific shape (`ReapplyType.SelfOpaquing` in `AnalysisRewriteRule`) +had **no reapplication bound at all** — an unconditional infinite loop the first time a grammar +hit it. Layer 1's step budget now catches it, but it's still wasted work every single parse. + +**Fix:** add an environment constraint that excludes the rule's own output (so a second +application can't match), or switch to `Iterative` mode if repeated application really is the +intent — iterative mode terminates naturally once the pattern stops matching. + +### HC0005 — Warning: unconstrained deletion + +A deletion phonological rule (synthesis removes more material than it keeps) with no left or +right environment constraint at all. During analysis, HermitCrab must hypothesize that the deleted +segment could have been anywhere satisfying the (empty) environment — i.e. everywhere — and +`Morpher.DeletionReapplications` governs how many times it's willing to keep re-guessing. + +**Fix:** add a left and/or right environment constraint so reinsertion is only considered where +deletion could plausibly have applied. 
+ +### HC0006 — Warning: unconstrained compounding + +A compounding rule that constrains the part of speech of neither the head nor the non-head. Every +stem in the lexicon becomes a candidate on *both* sides — a cross-product that interacts with +`Morpher.MaxStemCount` and grows fast with lexicon size. + +**Fix:** constrain `HeadRequiredSyntacticFeatureStruct` and/or `NonHeadRequiredSyntacticFeatureStruct` +to the parts of speech that can actually compound in the language. + +### HC0007 — Info: adjacent optional/iterative lexical patterns + +A lexical guess pattern (e.g. `([Seg])([Seg])`) with two or more optional/iterative segments back +to back. `Morpher.LexicalGuess`'s own comments already note this produces spurious ambiguity: +multiple paths through the pattern match the same literal string, multiplying candidates without +adding coverage. + +**Fix:** prefer a single Kleene-star class (`[Seg]*`) over back-to-back optional groups when the +intent is "zero or more of these." + +### HC0008 — Info: cyclic feeding pair (best-effort) + +Two affix rules that each add no overt exponent, where each rule's output syntactic category is +compatible with the other's input requirement. Structurally, this is the shape of an +`A → B → A → B → ...` cycle that never terminates via a shape change — the specific loophole that +`Morpher.MaxRuleApplicationsPerWord` exists to close, since neither rule's own +`MaxApplicationCount` will ever trip on its own. + +This check is intentionally conservative (high-confidence pairs only, per an open question in +complexity-cap.md §10) — it will miss cycles that involve an overt exponent that nonetheless still +loops via some other mechanism, and it won't catch cycles longer than two rules. + +**Fix:** verify the two rules can't actually chain into each other indefinitely; if they +legitimately can (rare), set a `MaxRuleApplicationsPerWord` cap. 
+ +## What static analysis can't catch + +Individually reasonable rules can still combine into exponential blowups — this is inherent to +static analysis over a rule *set*, not a specific bug in `GrammarAnalyzer`. When a word breaches +`Morpher.MaxParseSteps`/`ParseTimeout`, use `Morpher.RerunWithDiagnostics` to re-parse that one word +with per-rule counters enabled and get an empirical top-offender report: *"word X exceeded N +steps; rule Y accounted for most of the applications."* That rule is where to start — check it +against the codes above even if the static pass didn't flag it standalone, since the empirical +report is often revealing an *interaction*, not a single bad rule. + +## Layered defense, not a substitute for grammar fixes + +None of `MaxParseSteps`, `ParseTimeout`, `MaxRuleApplicationsPerWord`, or `MaxAnalysisShapeGrowth` +make a pathological grammar parse faster or more correctly — they bound the damage (a soft-stop +with partial results, never a hang, never an exception) while the grammar gets fixed. A grammar +that regularly needs those caps to fire is a grammar that needs fixing, not a grammar that's +"handled." Treat a budget breach as a bug report against the grammar, using the codes and the +empirical report above to find the specific rule to fix. diff --git a/src/SIL.Machine.Morphology.HermitCrab.Tool/LintCommand.cs b/src/SIL.Machine.Morphology.HermitCrab.Tool/LintCommand.cs new file mode 100644 index 00000000..5c2bb824 --- /dev/null +++ b/src/SIL.Machine.Morphology.HermitCrab.Tool/LintCommand.cs @@ -0,0 +1,68 @@ +using System.Linq; +using ManyConsole; + +namespace SIL.Machine.Morphology.HermitCrab; + +/// +/// Thin CLI wrapper around (complexity-cap.md §6.3) — lets +/// machine.py users and CI-style grammar validation run the static lint outside FLEx. 
+/// +internal class LintCommand : ConsoleCommand +{ + private readonly HCContext _context; + private string _severity; + + public LintCommand(HCContext context) + { + _context = context; + + IsCommand("lint", "Runs static grammar analysis and reports diagnostics (see complexity-cap.md)."); + SkipsCommandSummaryBeforeRunning(); + HasOption( + "s|severity=", + "minimum severity to report: info, warning, or error (default: info)", + o => _severity = o + ); + } + + public override int Run(string[] remainingArguments) + { + DiagnosticSeverity minSeverity = ParseSeverity(_severity); + var diagnostics = GrammarAnalyzer + .Analyze(_context.Language) + .Where(d => d.Severity >= minSeverity) + .OrderBy(d => d.Code) + .ToList(); + + if (diagnostics.Count == 0) + { + _context.Out.WriteLine("No grammar diagnostics found."); + } + else + { + foreach (GrammarDiagnostic diagnostic in diagnostics) + { + _context.Out.WriteLine("{0} [{1}] {2}", diagnostic.Code, diagnostic.Severity, diagnostic.Message); + _context.Out.WriteLine(" Suggestion: {0}", diagnostic.Suggestion); + } + _context.Out.WriteLine(); + _context.Out.WriteLine("{0} diagnostic(s).", diagnostics.Count); + } + + _context.Out.WriteLine(); + return 0; + } + + private static DiagnosticSeverity ParseSeverity(string severity) + { + switch (severity?.ToLowerInvariant()) + { + case "warning": + return DiagnosticSeverity.Warning; + case "error": + return DiagnosticSeverity.Error; + default: + return DiagnosticSeverity.Info; + } + } +} diff --git a/src/SIL.Machine.Morphology.HermitCrab.Tool/ParseCommand.cs b/src/SIL.Machine.Morphology.HermitCrab.Tool/ParseCommand.cs index 86bfc0db..86b96a8e 100644 --- a/src/SIL.Machine.Morphology.HermitCrab.Tool/ParseCommand.cs +++ b/src/SIL.Machine.Morphology.HermitCrab.Tool/ParseCommand.cs @@ -1,4 +1,5 @@ -using System.Collections.Generic; +using System; +using System.Collections.Generic; using System.Diagnostics; using System.Linq; using ManyConsole; @@ -8,6 +9,7 @@ namespace 
SIL.Machine.Morphology.HermitCrab; internal class ParseCommand : ConsoleCommand { private readonly HCContext _context; + private bool _diagnose; public ParseCommand(HCContext context) { @@ -16,11 +18,18 @@ public ParseCommand(HCContext context) IsCommand("parse", "Parses a word"); SkipsCommandSummaryBeforeRunning(); HasAdditionalArguments(1, ""); + HasOption( + "d|diagnose", + "reports step budget usage and the top offending rules for this word (see complexity-cap.md)", + o => _diagnose = true + ); } public override int Run(string[] remainingArguments) { string word = remainingArguments[0]; + if (_diagnose) + return RunDiagnose(word); try { _context.ParseCount++; @@ -58,6 +67,52 @@ public override int Run(string[] remainingArguments) _context.Out.WriteLine(); return 1; } + finally + { + _diagnose = false; + } + } + + private int RunDiagnose(string word) + { + try + { + ParseDiagnostics diagnostics = _context.Morpher.RerunWithDiagnostics(word, out IEnumerable results); + int resultCount = results.Count(); + _context.Out.WriteLine( + "\"{0}\": {1} result(s), {2} step(s), {3:F1}ms, budget exhausted: {4}{5}", + word, + resultCount, + diagnostics.StepsUsed, + diagnostics.Elapsed.TotalMilliseconds, + diagnostics.BudgetExhausted, + diagnostics.BudgetExhausted ? 
$" ({diagnostics.Reason})" : "" + ); + _context.Out.WriteLine("Top rules by application count:"); + foreach ((IHCRule rule, int applications) in diagnostics.TopRules.Take(10)) + { + double pct = 100.0 * applications / Math.Max(diagnostics.StepsUsed, 1); + _context.Out.WriteLine( + " {0,8} ({1,5:F1}%) {2} '{3}'", + applications, + pct, + rule.GetType().Name, + rule.Name + ); + } + _context.Out.WriteLine(); + return 0; + } + catch (InvalidShapeException ise) + { + _context.Out.WriteLine("The word contains an invalid segment at position {0}.", ise.Position + 1); + _context.Out.WriteLine(); + return 1; + } + finally + { + _diagnose = false; + } } private void PrintTrace(Trace trace, int indent, HashSet lineIndices) diff --git a/src/SIL.Machine.Morphology.HermitCrab.Tool/Program.cs b/src/SIL.Machine.Morphology.HermitCrab.Tool/Program.cs index ff8e86bc..ac1b0aa5 100644 --- a/src/SIL.Machine.Morphology.HermitCrab.Tool/Program.cs +++ b/src/SIL.Machine.Morphology.HermitCrab.Tool/Program.cs @@ -92,6 +92,7 @@ public static int Main(string[] args) new TracingCommand(context), new TestCommand(context), new StatsCommand(context), + new LintCommand(context), }; string input; diff --git a/src/SIL.Machine.Morphology.HermitCrab/GrammarAnalyzer.cs b/src/SIL.Machine.Morphology.HermitCrab/GrammarAnalyzer.cs new file mode 100644 index 00000000..0fc0434a --- /dev/null +++ b/src/SIL.Machine.Morphology.HermitCrab/GrammarAnalyzer.cs @@ -0,0 +1,349 @@ +using System.Collections.Generic; +using System.Linq; +using SIL.Machine.Annotations; +using SIL.Machine.FeatureModel; +using SIL.Machine.Matching; +using SIL.Machine.Morphology.HermitCrab.MorphologicalRules; +using SIL.Machine.Morphology.HermitCrab.PhonologicalRules; + +namespace SIL.Machine.Morphology.HermitCrab +{ + public enum DiagnosticSeverity + { + Info, + Warning, + Error, + } + + /// + /// One finding from : a static "don't do this" signal about a + /// specific rule shape, keyed by a stable so other tools (FLEx's parser report, + /// 
a CLI) can key documentation/UI off it. See complexity-cap.md §6 for the code catalogue and the + /// "Writing performant HC grammars" guide organized by these codes. + /// + public sealed class GrammarDiagnostic + { + internal GrammarDiagnostic( + string code, + DiagnosticSeverity severity, + object rule, + string message, + string suggestion + ) + { + Code = code; + Severity = severity; + Rule = rule; + Message = message; + Suggestion = suggestion; + } + + public string Code { get; } + public DiagnosticSeverity Severity { get; } + + /// The culprit object — an (rule/template) or a (lexical entry). + public object Rule { get; } + public string Message { get; } + public string Suggestion { get; } + + public override string ToString() + { + string ruleName = (Rule as IHCRule)?.Name ?? (Rule as Morpheme)?.Id ?? Rule?.ToString(); + return $"{Code} [{Severity}] {ruleName}: {Message}"; + } + } + + /// + /// Layer 3 of complexity-cap.md: static analysis over a loaded that flags + /// rule shapes which are always-wrong or almost-always-wrong for parse complexity — independent of + /// any specific word, and independent of whether the grammar was loaded from XML or built + /// programmatically (FieldWorks' HCLoader), since both produce the same in-memory . + /// What this *cannot* catch is combinatorial interaction between individually-reasonable rules; that + /// is covered empirically by instead (see complexity-cap.md §6.2). 
+ /// + public static class GrammarAnalyzer + { + public static IReadOnlyList Analyze(Language language) + { + var diagnostics = new List(); + foreach (Stratum stratum in language.Strata) + { + foreach (IMorphologicalRule rule in stratum.MorphologicalRules) + { + if (rule is AffixProcessRule affixRule) + CheckAffixProcessRule(affixRule, diagnostics); + else if (rule is CompoundingRule compoundingRule) + CheckCompoundingRule(compoundingRule, diagnostics); + } + + foreach (IPhonologicalRule prule in stratum.PhonologicalRules) + { + if (prule is RewriteRule rewriteRule) + CheckRewriteRule(rewriteRule, diagnostics); + } + + CheckLexicalPatterns(stratum, diagnostics); + } + + CheckCyclicFeedingPairs(language, diagnostics); + + return diagnostics; + } + + // HC0001 / HC0002 / HC0003 + private static void CheckAffixProcessRule(AffixProcessRule rule, List diagnostics) + { + if (HasNoOvertExponent(rule)) + { + if (rule.MaxApplicationCount > 1) + { + diagnostics.Add( + new GrammarDiagnostic( + "HC0001", + DiagnosticSeverity.Error, + rule, + "Affix rule has no overt exponent (every allomorph's output is a pure copy of " + + "the input, adding no phonological material) and MaxApplicationCount > 1. " + + "This unapplies to every word, every time, with no way to ever stop: " + + "guaranteed exponential.", + "Give the rule an overt exponent, or set MaxApplicationCount back to 1." + ) + ); + } + else + { + diagnostics.Add( + new GrammarDiagnostic( + "HC0002", + DiagnosticSeverity.Warning, + rule, + "Affix rule has no overt exponent (every allomorph's output is a pure copy of " + + "the input, adding no phonological material). It still multiplies " + + "candidates once per cascade position and is frequently unintended.", + "Add an overt exponent (an inserted segment/boundary), or confirm this " + + "zero-exponent rule (e.g. a purely feature-changing rule) is intentional." 
+ ) + ); + } + } + + if (rule.MaxApplicationCount > 1) + { + diagnostics.Add( + new GrammarDiagnostic( + "HC0003", + DiagnosticSeverity.Warning, + rule, + $"MaxApplicationCount is {rule.MaxApplicationCount} (the XML multipleApplication " + + "attribute raises it above the default of 1) — this is precisely where an " + + "unbounded grammar opts into unboundedness.", + "Confirm a bound this high is actually needed; prefer the smallest value that " + + "covers legitimate words." + ) + ); + } + } + + private static bool HasNoOvertExponent(AffixProcessRule rule) + { + if (rule.Allomorphs.Count == 0) + return false; + return rule.Allomorphs.All(allo => + allo.Rhs.All(action => action is CopyFromInput || action is ModifyFromInput) + ); + } + + // HC0006 + private static void CheckCompoundingRule(CompoundingRule rule, List diagnostics) + { + if (rule.HeadRequiredSyntacticFeatureStruct.IsEmpty && rule.NonHeadRequiredSyntacticFeatureStruct.IsEmpty) + { + diagnostics.Add( + new GrammarDiagnostic( + "HC0006", + DiagnosticSeverity.Warning, + rule, + "Compounding rule constrains the part of speech of neither the head nor the " + + "non-head — every stem in the lexicon is a candidate on both sides, a " + + "cross-product blowup that interacts with Morpher.MaxStemCount.", + "Constrain HeadRequiredSyntacticFeatureStruct and/or " + + "NonHeadRequiredSyntacticFeatureStruct to the parts of speech that can " + + "actually compound." + ) + ); + } + } + + // HC0004 / HC0005 + private static void CheckRewriteRule(RewriteRule rule, List diagnostics) + { + foreach (RewriteSubrule subrule in rule.Subrules) + { + // Deletion subrule: underlying (Lhs) longer than surface (Rhs) — synthesis deletes + // material, so analysis must hypothesize/reinsert it. Matches AnalysisRewriteRule's own + // ReapplyType.Deletion classification. 
+ if (rule.Lhs.Children.Count > subrule.Rhs.Children.Count) + { + if (subrule.LeftEnvironment.Children.Count == 0 && subrule.RightEnvironment.Children.Count == 0) + { + diagnostics.Add( + new GrammarDiagnostic( + "HC0005", + DiagnosticSeverity.Warning, + rule, + "Deletion rule has no left or right environment constraint at all — " + + "analysis can hypothesize a deleted segment matching this pattern " + + "anywhere in the word, unboundedly reinserting it (interacts with " + + "Morpher.DeletionReapplications).", + "Add a left and/or right environment constraint so reinsertion is only " + + "considered in the position(s) where deletion could plausibly have occurred." + ) + ); + } + } + + // Self-feeding: matches AnalysisRewriteRule's own ReapplyType.SelfOpaquing selection + // exactly — that path had no reapplication bound at all before complexity-cap Layer 1, + // i.e. an unconditional infinite loop for any grammar that hits it. Two distinct engine + // branches select it (see AnalysisRewriteRule's constructor): + // - Lhs.Count == Rhs.Count (a same-length/feature-changing subrule): only when + // Simultaneous *and* a Rhs segment constraint could satisfy its own environment again. + // - Lhs.Count == 0 (epenthesis): unconditionally, whenever Simultaneous — the inserted + // segment's own shape is irrelevant, so there's no unification check to gate it. + bool isSelfOpaquing; + if (rule.Lhs.Children.Count == subrule.Rhs.Children.Count) + { + isSelfOpaquing = + rule.ApplicationMode == RewriteApplicationMode.Simultaneous && IsSelfFeeding(subrule); + } + else if (rule.Lhs.Children.Count == 0) + { + isSelfOpaquing = rule.ApplicationMode == RewriteApplicationMode.Simultaneous; + } + else + { + isSelfOpaquing = false; // Deletion/expansion branches — always ReapplyType.Deletion. 
+ } + + if (isSelfOpaquing) + { + diagnostics.Add( + new GrammarDiagnostic( + "HC0004", + DiagnosticSeverity.Warning, + rule, + "Simultaneous-mode rewrite rule whose output can satisfy its own environment " + + "again (self-feeding) — analysis can keep re-hypothesizing this rule's " + + "effect on its own output indefinitely.", + "Add an environment constraint that excludes the rule's own output, or switch " + + "to Iterative application mode if that's the intent." + ) + ); + } + } + } + + private static bool IsSelfFeeding(RewriteSubrule subrule) + { + foreach (Constraint constraint in subrule.Rhs.Children.OfType>()) + { + if (constraint.Type() != HCFeatureSystem.Segment) + continue; + if ( + !constraint.IsUnifiableWithEnvironment(subrule.LeftEnvironment) + || !constraint.IsUnifiableWithEnvironment(subrule.RightEnvironment) + ) + { + return true; + } + } + return false; + } + + // HC0007 + private static void CheckLexicalPatterns(Stratum stratum, List diagnostics) + { + foreach (LexEntry entry in stratum.Entries) + { + foreach (RootAllomorph allomorph in entry.Allomorphs) + { + if (!allomorph.IsPattern) + continue; + int consecutiveOptional = 0; + bool flagged = false; + foreach (ShapeNode node in allomorph.Segments.Shape) + { + if (flagged) + break; + if (node.Annotation.Optional || node.IsIterative()) + { + consecutiveOptional++; + if (consecutiveOptional >= 2) + { + diagnostics.Add( + new GrammarDiagnostic( + "HC0007", + DiagnosticSeverity.Info, + entry, + $"Lexical pattern '{entry.Id}' has two or more adjacent " + + "optional/iterative segments — a known source of spurious " + + "ambiguity (multiple paths through the pattern produce the " + + "same string).", + "Prefer a single Kleene-star class over back-to-back optional groups." 
+ ) + ); + flagged = true; + } + } + else + { + consecutiveOptional = 0; + } + } + } + } + } + + // HC0008 + private static void CheckCyclicFeedingPairs(Language language, List diagnostics) + { + foreach (Stratum stratum in language.Strata) + { + List rules = stratum.MorphologicalRules.OfType().ToList(); + for (int i = 0; i < rules.Count; i++) + { + for (int j = i + 1; j < rules.Count; j++) + { + AffixProcessRule a = rules[i]; + AffixProcessRule b = rules[j]; + // Best-effort, high-confidence-only pairs (per complexity-cap.md §10 open + // question #6): both sides add no overt exponent, and each rule's output + // syntactic category is compatible with the other's input requirement — an + // A-then-B-then-A-then-B chain that never terminates via shape change. + if ( + HasNoOvertExponent(a) + && HasNoOvertExponent(b) + && a.OutSyntacticFeatureStruct.IsUnifiable(b.RequiredSyntacticFeatureStruct) + && b.OutSyntacticFeatureStruct.IsUnifiable(a.RequiredSyntacticFeatureStruct) + ) + { + diagnostics.Add( + new GrammarDiagnostic( + "HC0008", + DiagnosticSeverity.Info, + a, + $"'{a.Name}' and '{b.Name}' both add no overt exponent and each " + + "rule's output category is compatible with the other's input " + + "requirement — a cyclic feeding pair (A feeds B feeds A) is " + + "structurally possible.", + "Verify these two rules can't unapply to each other indefinitely; " + + "consider a MaxRuleApplicationsPerWord cap either way." 
+ ) + ); + } + } + } + } + } + } +} diff --git a/src/SIL.Machine.Morphology.HermitCrab/HermitCrabExtensions.cs b/src/SIL.Machine.Morphology.HermitCrab/HermitCrabExtensions.cs index 5cf2ad5a..535ab96e 100644 --- a/src/SIL.Machine.Morphology.HermitCrab/HermitCrabExtensions.cs +++ b/src/SIL.Machine.Morphology.HermitCrab/HermitCrabExtensions.cs @@ -27,6 +27,34 @@ public static FeatureSymbol Type(this Constraint constraint) return (FeatureSymbol)constraint.FeatureStruct.GetValue(HCFeatureSystem.Type); } + /// + /// Whether could satisfy every segment constraint in + /// — i.e. whether a segment matching + /// could itself sit in that environment again. Shared by + /// (which uses it to pick ReapplyType.SelfOpaquing at compile time) and + /// (which replicates that exact classification statically to flag + /// HC0004 self-feeding rules) — both need the identical rule to stay in sync. + /// + internal static bool IsUnifiableWithEnvironment( + this Constraint constraint, + Pattern environment + ) + { + foreach ( + Constraint envConstraint in environment.GetNodesDepthFirst().OfType>() + ) + { + if ( + envConstraint.Type() == HCFeatureSystem.Segment + && !envConstraint.FeatureStruct.IsUnifiable(constraint.FeatureStruct) + ) + { + return false; + } + } + return true; + } + // RUSTIFY Stage 2: the FST binds as Fst and its matcher filters / inspects the // shape's int-offset annotation projection (Annotation), which shares the FeatureStruct // with the ShapeNode annotations — so these read identically to the ShapeNode overloads. 
diff --git a/src/SIL.Machine.Morphology.HermitCrab/Morpher.cs b/src/SIL.Machine.Morphology.HermitCrab/Morpher.cs index 5a8630c9..d4b553dc 100644 --- a/src/SIL.Machine.Morphology.HermitCrab/Morpher.cs +++ b/src/SIL.Machine.Morphology.HermitCrab/Morpher.cs @@ -82,19 +82,31 @@ public ITraceManager TraceManager } /// - /// Generous default for , calibrated against the real Indonesian/Sena - /// grammars on the rustify engine (see complexity-cap.md Phase 0): observed legitimate max was - /// ~13,600 steps (Sena), so this ships ~150x above that ceiling — effectively invisible for real - /// grammars but still finite. 0 disables the step budget. + /// Generous default for . Calibrated against the real Indonesian + /// (~2,500-line grammar, worst observed word ~10,400 steps) and Sena (~33,000-line grammar, worst + /// observed word so far 14,905,517 steps / 105.3s, from a partial corpus sample) grammars — see + /// complexity-cap.md Phase 0. Legitimate cost varies by roughly 1000x between these two grammars + /// because Sena's agglutinative verb morphology combines many candidate affix slots, so this is set + /// with headroom above the largest legitimate word seen so far rather than as a fixed multiple of + /// Indonesian's ceiling. Because only ~1% of the Sena corpus has been sampled, this should be + /// re-validated against a full corpus run before being treated as final. In practice + /// is expected to trip before this does for slow-but-legitimate + /// words, since step cost and wall-clock time track closely (~140k steps/sec observed on Sena); this + /// step budget mainly exists to catch algorithmically cheap infinite loops. 0 disables the step + /// budget. 
/// - public const int DefaultMaxParseSteps = 2_000_000; + public const int DefaultMaxParseSteps = 50_000_000; /// - /// Generous default for — a backstop far above any observed legitimate - /// single-word parse time on the rustify engine, but still bounded so one pathological word cannot - /// stall a "Parse All Words" batch indefinitely. disables the timeout. + /// Generous default for . This is a genuine product tradeoff, not just a + /// safety margin: real Sena words have been observed taking 100+ seconds to parse legitimately (see + /// ), so any finite timeout will occasionally cut off a real parse + /// on grammars like Sena. 30 seconds is chosen as generous enough for the vast majority of legitimate + /// words while still bounding worst-case per-word latency in a "Parse All Words" batch to something + /// human-tolerable. Consumers with expensive grammars and no batch-latency constraint should raise + /// this. disables the timeout. /// - public static readonly TimeSpan DefaultParseTimeout = TimeSpan.FromSeconds(10); + public static readonly TimeSpan DefaultParseTimeout = TimeSpan.FromSeconds(30); public int DeletionReapplications { get; set; } @@ -279,9 +291,6 @@ out ParseDiagnostics diagnostics private static ParseDiagnostics CreateParseDiagnostics(ParseContext parseContext) { - if (!parseContext.Exhausted) - return ParseDiagnostics.None; - IReadOnlyList<(IHCRule Rule, int Applications)> topRules = null; if (parseContext.DiagnosticsEnabled) { @@ -292,7 +301,7 @@ private static ParseDiagnostics CreateParseDiagnostics(ParseContext parseContext } return new ParseDiagnostics( - true, + parseContext.Exhausted, parseContext.Reason, parseContext.StepsUsed, parseContext.Elapsed, diff --git a/src/SIL.Machine.Morphology.HermitCrab/ParseContext.cs b/src/SIL.Machine.Morphology.HermitCrab/ParseContext.cs index 82731dde..99fbf69f 100644 --- a/src/SIL.Machine.Morphology.HermitCrab/ParseContext.cs +++ b/src/SIL.Machine.Morphology.HermitCrab/ParseContext.cs @@ -73,9 
+73,9 @@ public bool Step(IHCRule rule = null) if (rule != null && _ruleCounters != null) _ruleCounters.AddOrUpdate(rule, 1, (_, count) => count + 1); - if (_maxSteps <= 0 && _timeoutTicks < 0) - return true; - + // Always counted, even when both limits are disabled: StepsUsed must reflect real work + // (calibration/diagnostics rely on it), and a single Interlocked.Increment is the "steady- + // state cost ~one counter increment per rule application" the design promises either way. int steps = Interlocked.Increment(ref _steps); if (_maxSteps > 0 && steps >= _maxSteps) { diff --git a/src/SIL.Machine.Morphology.HermitCrab/ParseDiagnostics.cs b/src/SIL.Machine.Morphology.HermitCrab/ParseDiagnostics.cs index a661e505..66228f70 100644 --- a/src/SIL.Machine.Morphology.HermitCrab/ParseDiagnostics.cs +++ b/src/SIL.Machine.Morphology.HermitCrab/ParseDiagnostics.cs @@ -10,14 +10,6 @@ namespace SIL.Machine.Morphology.HermitCrab /// public sealed class ParseDiagnostics { - public static readonly ParseDiagnostics None = new ParseDiagnostics( - false, - ParseExhaustionReason.None, - 0, - TimeSpan.Zero, - null - ); - internal ParseDiagnostics( bool budgetExhausted, ParseExhaustionReason reason, diff --git a/src/SIL.Machine.Morphology.HermitCrab/PhonologicalRules/AnalysisRewriteRule.cs b/src/SIL.Machine.Morphology.HermitCrab/PhonologicalRules/AnalysisRewriteRule.cs index ae9bbe4e..22dea216 100644 --- a/src/SIL.Machine.Morphology.HermitCrab/PhonologicalRules/AnalysisRewriteRule.cs +++ b/src/SIL.Machine.Morphology.HermitCrab/PhonologicalRules/AnalysisRewriteRule.cs @@ -54,8 +54,8 @@ public AnalysisRewriteRule(Morpher morpher, RewriteRule rule) if (constraint.Type() == HCFeatureSystem.Segment) { if ( - !IsUnifiable(constraint, sr.LeftEnvironment) - || !IsUnifiable(constraint, sr.RightEnvironment) + !constraint.IsUnifiableWithEnvironment(sr.LeftEnvironment) + || !constraint.IsUnifiableWithEnvironment(sr.RightEnvironment) ) { reapplyType = ReapplyType.SelfOpaquing; @@ -102,22 +102,6 
@@ public AnalysisRewriteRule(Morpher morpher, RewriteRule rule) } } - private static bool IsUnifiable(Constraint constraint, Pattern env) - { - foreach (Constraint curConstraint in env.GetNodesDepthFirst().OfType>()) - { - if ( - curConstraint.Type() == HCFeatureSystem.Segment - && !curConstraint.FeatureStruct.IsUnifiable(constraint.FeatureStruct) - ) - { - return false; - } - } - - return true; - } - private bool ExceedsShapeGrowth(Word data) { return _morpher.MaxAnalysisShapeGrowth >= 0 diff --git a/tests/SIL.Machine.Morphology.HermitCrab.Tests/ComplexityCapCorpusTests.cs b/tests/SIL.Machine.Morphology.HermitCrab.Tests/ComplexityCapCorpusTests.cs new file mode 100644 index 00000000..ec5151bf --- /dev/null +++ b/tests/SIL.Machine.Morphology.HermitCrab.Tests/ComplexityCapCorpusTests.cs @@ -0,0 +1,253 @@ +using System.Diagnostics; +using NUnit.Framework; + +namespace SIL.Machine.Morphology.HermitCrab; + +/// +/// Complexity-cap Phase 0 (see complexity-cap.md §7, §9): calibration and no-regression corpus using +/// the real Indonesian/Sena grammars. These grammars + wordlists are large, not licensed for this repo, +/// and stay untracked (see .gitignore) — every test here is [Explicit] (not run by default CI) and +/// skips itself when the files aren't present locally, exactly like the RustifyBenchmark precedent +/// referenced in .gitignore's comment. +/// +[TestFixture] +[Explicit("Requires the untracked samples/data/{indonesian,sena}-hc.xml corpus; see complexity-cap.md Phase 0.")] +public class ComplexityCapCorpusTests +{ + private static string? FindRepoRoot() + { + var dir = new DirectoryInfo(AppContext.BaseDirectory); + while (dir != null) + { + if (File.Exists(Path.Combine(dir.FullName, "machine.sln"))) + return dir.FullName; + dir = dir.Parent; + } + return null; + } + + private static (string Grammar, string Words)? FindCorpus(string name) + { + string? 
root = FindRepoRoot(); + if (root == null) + return null; + string grammar = Path.Combine(root, "samples", "data", $"{name}-hc.xml"); + string words = Path.Combine(root, "samples", "data", $"{name}-words.txt"); + if (!File.Exists(grammar) || !File.Exists(words)) + return null; + return (grammar, words); + } + + // "Unlimited" for calibration purposes only: a genuinely pathological word in a real corpus must + // not be allowed to hang the calibration run forever (see the Sena run that sat stuck for 23 + // minutes before being killed — exactly the failure mode complexity-cap exists to catch). This is + // a calibration safety net only, roughly 3x above the most expensive legitimate word observed so + // far (~14.9M steps on Sena); it is not a proposed shipped default. + private const int CalibrationStepCeiling = 50_000_000; + + private static void RunCorpus(string name) + { + (string Grammar, string Words)? corpus = FindCorpus(name); + if (corpus == null) + { + Assert.Ignore( + $"samples/data/{name}-hc.xml and/or {name}-words.txt not present locally (untracked, see .gitignore) — skipping." + ); + return; + } + + Language language = XmlLanguageLoader.Load(corpus.Value.Grammar); + var morpher = new Morpher(new TraceManager(), language) + { + MaxParseSteps = CalibrationStepCeiling, + ParseTimeout = TimeSpan.Zero, + }; + + string[] words = File.ReadAllLines(corpus.Value.Words).Select(w => w.Trim()).Where(w => w.Length > 0).ToArray(); + + int maxSteps = 0; + string maxStepsWord = ""; + var sw = Stopwatch.StartNew(); + long maxWordMs = 0; + string maxWordMsWord = ""; + int wordsParsed = 0; + int wordsSkipped = 0; + var pathologicalWords = new List<(string Word, int Steps)>(); + foreach (string word in words) + { + ParseDiagnostics diagnostics; + var wordSw = Stopwatch.StartNew(); + try + { + morpher.ParseWord(word, out _, false, out diagnostics).ToList(); + } + catch (InvalidShapeException) + { + // Malformed/non-word lines in this ad hoc wordlist (e.g. 
gloss annotations that slipped + // in) aren't a complexity-cap concern — skip rather than fail the calibration run. + wordsSkipped++; + continue; + } + wordSw.Stop(); + wordsParsed++; + // Flushed immediately (unlike TestContext.Out, which buffers until the test ends) so a + // hang/crash mid-run still shows which word was last attempted. + TestContext.Progress.WriteLine( + $" [{wordsParsed}/{words.Length}] '{word}': {diagnostics.StepsUsed} steps, {wordSw.ElapsedMilliseconds}ms" + ); + + if (diagnostics.BudgetExhausted) + pathologicalWords.Add((word, diagnostics.StepsUsed)); + + if (diagnostics.StepsUsed > maxSteps) + { + maxSteps = diagnostics.StepsUsed; + maxStepsWord = word; + } + if (wordSw.ElapsedMilliseconds > maxWordMs) + { + maxWordMs = wordSw.ElapsedMilliseconds; + maxWordMsWord = word; + } + } + sw.Stop(); + + TestContext.Out.WriteLine( + $"{name}: {wordsParsed} words parsed ({wordsSkipped} skipped as malformed), total {sw.ElapsedMilliseconds}ms, " + + $"max steps {maxSteps} (word '{maxStepsWord}'), " + + $"max single-word time {maxWordMs}ms (word '{maxWordMsWord}'), " + + $"suggested default MaxParseSteps (100x observed max) = {Math.Max(maxSteps, 1) * 100}" + ); + + if (pathologicalWords.Count > 0) + { + TestContext.Out.WriteLine( + $"WARNING: {pathologicalWords.Count} word(s) hit the {CalibrationStepCeiling:N0}-step calibration " + + "ceiling — these are candidates for genuinely pathological grammar interactions, not " + + "legitimate baseline data points:" + ); + foreach ((string word, int steps) in pathologicalWords) + TestContext.Out.WriteLine($" '{word}': {steps} steps (hit ceiling)"); + } + + Assert.That( + pathologicalWords, + Is.Empty, + $"{pathologicalWords.Count} word(s) hit the calibration step ceiling — see output for which word(s); " + + "investigate with RerunWithDiagnostics before trusting the max-steps number above for calibration." 
+ ); + } + + [Test] + public void Indonesian_Baseline_NoWordExhaustsUnlimitedBudget() + { + RunCorpus("indonesian"); + } + + /// + /// Ad hoc diagnostic, not a pass/fail assertion: reports which rule(s) account for the bulk of the + /// step count on the single most expensive word in the corpus, using RerunWithDiagnostics exactly + /// as the "Writing performant HC grammars" guide (docs/hermitcrab-grammar-performance.md) + /// recommends. Useful for eyeballing whether a corpus's worst-case word is a legitimate expensive + /// parse or a symptom of a specific bad rule. + /// + [Test] + public void Indonesian_TopOffendingRules_ForWorstWord() + { + ReportTopOffenders("indonesian", "mengamat-amati"); + } + + private static void ReportTopOffenders(string name, string word) + { + (string Grammar, string Words)? corpus = FindCorpus(name); + if (corpus == null) + { + Assert.Ignore($"samples/data/{name}-hc.xml not present locally — skipping."); + return; + } + + Language language = XmlLanguageLoader.Load(corpus.Value.Grammar); + var morpher = new Morpher(new TraceManager(), language) { MaxParseSteps = 0, ParseTimeout = TimeSpan.Zero }; + + ParseDiagnostics diagnostics; + try + { + diagnostics = morpher.RerunWithDiagnostics(word, out IEnumerable results); + results.ToList(); + } + catch (InvalidShapeException) + { + Assert.Ignore($"'{word}' is not a valid shape in the {name} grammar's character set."); + return; + } + + TestContext.Out.WriteLine( + $"{name} '{word}': {diagnostics.StepsUsed} steps, {diagnostics.Elapsed.TotalMilliseconds:F1}ms" + ); + TestContext.Out.WriteLine("Top rules by application count:"); + foreach ((IHCRule rule, int applications) in diagnostics.TopRules.Take(10)) + { + double pct = 100.0 * applications / Math.Max(diagnostics.StepsUsed, 1); + TestContext.Out.WriteLine($" {applications, 6} ({pct, 5:F1}%) {rule.GetType().Name} '{rule.Name}'"); + } + } + + [Test] + public void Sena_Baseline_NoWordExhaustsUnlimitedBudget() + { + RunCorpus("sena"); + } + + 
/// + /// Confirms the *shipped* defaults (Morpher.DefaultMaxParseSteps / DefaultParseTimeout) are + /// generous enough to be invisible on real, legitimate grammars — the "no-regression" half of + /// Phase 0 (§7): every word must still complete without tripping the budget at the defaults a + /// naive consumer gets out of the box. + /// + [Test] + public void Indonesian_ShippedDefaults_NeverTrip() + { + RunCorpusAtDefaults("indonesian"); + } + + [Test] + public void Sena_ShippedDefaults_NeverTrip() + { + RunCorpusAtDefaults("sena"); + } + + private static void RunCorpusAtDefaults(string name) + { + (string Grammar, string Words)? corpus = FindCorpus(name); + if (corpus == null) + { + Assert.Ignore( + $"samples/data/{name}-hc.xml and/or {name}-words.txt not present locally (untracked, see .gitignore) — skipping." + ); + return; + } + + Language language = XmlLanguageLoader.Load(corpus.Value.Grammar); + var morpher = new Morpher(new TraceManager(), language); // shipped defaults + + string[] words = File.ReadAllLines(corpus.Value.Words).Select(w => w.Trim()).Where(w => w.Length > 0).ToArray(); + + foreach (string word in words) + { + ParseDiagnostics diagnostics; + try + { + morpher.ParseWord(word, out _, false, out diagnostics).ToList(); + } + catch (InvalidShapeException) + { + continue; + } + Assert.That( + diagnostics.BudgetExhausted, + Is.False, + $"'{word}' tripped the shipped default budget (StepsUsed={diagnostics.StepsUsed}) — defaults are not generous enough for this corpus" + ); + } + } +} diff --git a/tests/SIL.Machine.Morphology.HermitCrab.Tests/GrammarAnalyzerTests.cs b/tests/SIL.Machine.Morphology.HermitCrab.Tests/GrammarAnalyzerTests.cs new file mode 100644 index 00000000..ac3fdc87 --- /dev/null +++ b/tests/SIL.Machine.Morphology.HermitCrab.Tests/GrammarAnalyzerTests.cs @@ -0,0 +1,337 @@ +using NUnit.Framework; +using SIL.Machine.FeatureModel; +using SIL.Machine.Matching; +using SIL.Machine.Morphology.HermitCrab.MorphologicalRules; +using 
SIL.Machine.Morphology.HermitCrab.PhonologicalRules; + +namespace SIL.Machine.Morphology.HermitCrab; + +[TestFixture] +public class GrammarAnalyzerTests : HermitCrabTestBase +{ + [Test] + public void HC0001_NoOvertExponentWithMultipleApplication_IsError() + { + var any = FeatureStruct.New().Symbol(HCFeatureSystem.Segment).Value; + var rule = new AffixProcessRule + { + Name = "bad_rule", + MaxApplicationCount = 100, + RequiredSyntacticFeatureStruct = FeatureStruct.New(Language.SyntacticFeatureSystem).Symbol("V").Value, + }; + rule.Allomorphs.Add( + new AffixProcessAllomorph + { + Lhs = { Pattern.New("1").Annotation(any).OneOrMore.Value }, + Rhs = { new CopyFromInput("1") }, + } + ); + Morphophonemic.MorphologicalRules.Add(rule); + + var diagnostics = GrammarAnalyzer.Analyze(Language); + + Assert.That( + diagnostics, + Has.Some.Matches(d => + d.Code == "HC0001" && d.Severity == DiagnosticSeverity.Error && d.Rule == rule + ) + ); + } + + [Test] + public void HC0002_NoOvertExponentSingleApplication_IsWarning() + { + var any = FeatureStruct.New().Symbol(HCFeatureSystem.Segment).Value; + var rule = new AffixProcessRule + { + Name = "zero_exponent_rule", + RequiredSyntacticFeatureStruct = FeatureStruct.New(Language.SyntacticFeatureSystem).Symbol("V").Value, + }; + rule.Allomorphs.Add( + new AffixProcessAllomorph + { + Lhs = { Pattern.New("1").Annotation(any).OneOrMore.Value }, + Rhs = { new CopyFromInput("1") }, + } + ); + Morphophonemic.MorphologicalRules.Add(rule); + + var diagnostics = GrammarAnalyzer.Analyze(Language); + + Assert.That( + diagnostics, + Has.Some.Matches(d => d.Code == "HC0002" && d.Severity == DiagnosticSeverity.Warning) + ); + Assert.That(diagnostics, Has.None.Matches(d => d.Code == "HC0001")); + } + + [Test] + public void HC0001_RuleWithOvertExponent_IsNotFlagged() + { + var any = FeatureStruct.New().Symbol(HCFeatureSystem.Segment).Value; + var rule = new AffixProcessRule + { + Name = "ed_suffix", + MaxApplicationCount = 100, + 
RequiredSyntacticFeatureStruct = FeatureStruct.New(Language.SyntacticFeatureSystem).Symbol("V").Value, + }; + rule.Allomorphs.Add( + new AffixProcessAllomorph + { + Lhs = { Pattern.New("1").Annotation(any).OneOrMore.Value }, + Rhs = { new CopyFromInput("1"), new InsertSegments(Table3, "+d") }, + } + ); + Morphophonemic.MorphologicalRules.Add(rule); + + var diagnostics = GrammarAnalyzer.Analyze(Language); + + Assert.That(diagnostics, Has.None.Matches(d => d.Code == "HC0001" || d.Code == "HC0002")); + // MaxApplicationCount > 1 alone still trips HC0003 regardless of overt exponent. + Assert.That(diagnostics, Has.Some.Matches(d => d.Code == "HC0003" && d.Rule == rule)); + } + + [Test] + public void HC0004_SelfFeedingSimultaneousRule_IsFlagged() + { + // Matches AnalysisRewriteRule's own ReapplyType.SelfOpaquing selection: Simultaneous mode with + // a Rhs segment constraint that is NOT unifiable with its own environment. + var voc = FeatureStruct + .New(Language.PhonologicalFeatureSystem) + .Symbol(HCFeatureSystem.Segment) + .Symbol("voc+") + .Value; + var cons = FeatureStruct + .New(Language.PhonologicalFeatureSystem) + .Symbol(HCFeatureSystem.Segment) + .Symbol("voc-") + .Value; + var rule = new RewriteRule + { + Name = "self_feeding_rule", + ApplicationMode = RewriteApplicationMode.Simultaneous, + Lhs = Pattern.New().Value, + }; + rule.Subrules.Add( + new RewriteSubrule + { + Rhs = Pattern.New().Annotation(voc).Value, + LeftEnvironment = Pattern.New().Annotation(cons).Value, + } + ); + Allophonic.PhonologicalRules.Add(rule); + + var diagnostics = GrammarAnalyzer.Analyze(Language); + + Assert.That(diagnostics, Has.Some.Matches(d => d.Code == "HC0004" && d.Rule == rule)); + } + + [Test] + public void HC0004_SimultaneousEpenthesis_IsUnconditionallyFlagged() + { + // Epenthesis (Lhs.Children.Count == 0): the engine (AnalysisRewriteRule's constructor) selects + // ReapplyType.SelfOpaquing here whenever ApplicationMode is Simultaneous, with no unification + // check at 
all — unlike the same-length-subrule case. Must be flagged unconditionally too. + var voc = FeatureStruct + .New(Language.PhonologicalFeatureSystem) + .Symbol(HCFeatureSystem.Segment) + .Symbol("voc+") + .Value; + var rule = new RewriteRule + { + Name = "epenthesis_rule", + ApplicationMode = RewriteApplicationMode.Simultaneous, + Lhs = Pattern.New().Value, // empty Lhs = epenthesis + }; + rule.Subrules.Add(new RewriteSubrule { Rhs = Pattern.New().Annotation(voc).Value }); + Allophonic.PhonologicalRules.Add(rule); + + var diagnostics = GrammarAnalyzer.Analyze(Language); + + Assert.That(diagnostics, Has.Some.Matches(d => d.Code == "HC0004" && d.Rule == rule)); + } + + [Test] + public void HC0004_IterativeEpenthesis_IsNotFlagged() + { + var voc = FeatureStruct + .New(Language.PhonologicalFeatureSystem) + .Symbol(HCFeatureSystem.Segment) + .Symbol("voc+") + .Value; + var rule = new RewriteRule + { + Name = "epenthesis_rule_iterative", + ApplicationMode = RewriteApplicationMode.Iterative, + Lhs = Pattern.New().Value, + }; + rule.Subrules.Add(new RewriteSubrule { Rhs = Pattern.New().Annotation(voc).Value }); + Allophonic.PhonologicalRules.Add(rule); + + var diagnostics = GrammarAnalyzer.Analyze(Language); + + Assert.That(diagnostics, Has.None.Matches(d => d.Code == "HC0004")); + } + + [Test] + public void HC0005_UnconstrainedDeletion_IsFlagged() + { + var highFrontUnrndVowel = FeatureStruct + .New(Language.PhonologicalFeatureSystem) + .Symbol(HCFeatureSystem.Segment) + .Symbol("cons-") + .Symbol("voc+") + .Symbol("high+") + .Symbol("low-") + .Symbol("back-") + .Symbol("round-") + .Value; + var rule = new RewriteRule + { + Name = "unconstrained_deletion", + Lhs = Pattern.New().Annotation(highFrontUnrndVowel).Value, + }; + rule.Subrules.Add(new RewriteSubrule()); // Rhs defaults to empty (deletion), no environment constraints. 
+ Allophonic.PhonologicalRules.Add(rule); + + var diagnostics = GrammarAnalyzer.Analyze(Language); + + Assert.That(diagnostics, Has.Some.Matches(d => d.Code == "HC0005" && d.Rule == rule)); + } + + [Test] + public void HC0005_ConstrainedDeletion_IsNotFlagged() + { + var highFrontUnrndVowel = FeatureStruct + .New(Language.PhonologicalFeatureSystem) + .Symbol(HCFeatureSystem.Segment) + .Symbol("cons-") + .Symbol("voc+") + .Symbol("high+") + .Symbol("low-") + .Symbol("back-") + .Symbol("round-") + .Value; + var highVowel = FeatureStruct + .New(Language.PhonologicalFeatureSystem) + .Symbol(HCFeatureSystem.Segment) + .Symbol("cons-") + .Symbol("voc+") + .Symbol("high+") + .Value; + var rule = new RewriteRule + { + Name = "constrained_deletion", + Lhs = Pattern.New().Annotation(highFrontUnrndVowel).Value, + }; + rule.Subrules.Add( + new RewriteSubrule { LeftEnvironment = Pattern.New().Annotation(highVowel).Value } + ); + Allophonic.PhonologicalRules.Add(rule); + + var diagnostics = GrammarAnalyzer.Analyze(Language); + + Assert.That(diagnostics, Has.None.Matches(d => d.Code == "HC0005")); + } + + [Test] + public void HC0006_UnconstrainedCompounding_IsFlagged() + { + var rule = new CompoundingRule { Name = "unconstrained_compound" }; + Morphophonemic.MorphologicalRules.Add(rule); + + var diagnostics = GrammarAnalyzer.Analyze(Language); + + Assert.That(diagnostics, Has.Some.Matches(d => d.Code == "HC0006" && d.Rule == rule)); + } + + [Test] + public void HC0006_ConstrainedCompounding_IsNotFlagged() + { + var rule = new CompoundingRule + { + Name = "constrained_compound", + HeadRequiredSyntacticFeatureStruct = FeatureStruct.New(Language.SyntacticFeatureSystem).Symbol("N").Value, + NonHeadRequiredSyntacticFeatureStruct = FeatureStruct + .New(Language.SyntacticFeatureSystem) + .Symbol("V") + .Value, + }; + Morphophonemic.MorphologicalRules.Add(rule); + + var diagnostics = GrammarAnalyzer.Analyze(Language); + + Assert.That(diagnostics, Has.None.Matches(d => d.Code == 
"HC0006")); + } + + [Test] + public void HC0007_AdjacentOptionalIterativeLexicalPattern_IsFlagged() + { + var naturalClass = new NaturalClass(new FeatureStruct()) { Name = "Any" }; + Morphophonemic.CharacterDefinitionTable.AddNaturalClass(naturalClass); + LexEntry entry = AddEntry("pattern_entry", new FeatureStruct(), Morphophonemic, "([Any])([Any])"); + + var diagnostics = GrammarAnalyzer.Analyze(Language); + + Assert.That(diagnostics, Has.Some.Matches(d => d.Code == "HC0007" && d.Rule == entry)); + } + + [Test] + public void HC0008_CyclicFeedingPair_IsFlagged() + { + var any = FeatureStruct.New().Symbol(HCFeatureSystem.Segment).Value; + var a = new AffixProcessRule + { + Name = "cycle_a", + RequiredSyntacticFeatureStruct = FeatureStruct.New(Language.SyntacticFeatureSystem).Symbol("V").Value, + }; + a.Allomorphs.Add( + new AffixProcessAllomorph + { + Lhs = { Pattern.New("1").Annotation(any).OneOrMore.Value }, + Rhs = { new CopyFromInput("1") }, + } + ); + var b = new AffixProcessRule + { + Name = "cycle_b", + RequiredSyntacticFeatureStruct = FeatureStruct.New(Language.SyntacticFeatureSystem).Symbol("V").Value, + }; + b.Allomorphs.Add( + new AffixProcessAllomorph + { + Lhs = { Pattern.New("1").Annotation(any).OneOrMore.Value }, + Rhs = { new CopyFromInput("1") }, + } + ); + Morphophonemic.MorphologicalRules.Add(a); + Morphophonemic.MorphologicalRules.Add(b); + + var diagnostics = GrammarAnalyzer.Analyze(Language); + + Assert.That(diagnostics, Has.Some.Matches(d => d.Code == "HC0008")); + } + + [Test] + public void Analyze_WellBehavedGrammar_ProducesNoDiagnostics() + { + var any = FeatureStruct.New().Symbol(HCFeatureSystem.Segment).Value; + var edSuffix = new AffixProcessRule + { + Name = "ed_suffix", + RequiredSyntacticFeatureStruct = FeatureStruct.New(Language.SyntacticFeatureSystem).Symbol("V").Value, + }; + edSuffix.Allomorphs.Add( + new AffixProcessAllomorph + { + Lhs = { Pattern.New("1").Annotation(any).OneOrMore.Value }, + Rhs = { new CopyFromInput("1"), new 
InsertSegments(Table3, "+d") }, + } + ); + Morphophonemic.MorphologicalRules.Add(edSuffix); + + var diagnostics = GrammarAnalyzer.Analyze(Language); + + Assert.That(diagnostics, Is.Empty); + } +} From 13567446dc3ffaf04c5f0d04239109d1ef00be2b Mon Sep 17 00:00:00 2001 From: John Lambert Date: Thu, 2 Jul 2026 16:17:18 -0400 Subject: [PATCH 4/6] complexity-cap.md: mark Phases 0-3 done, record commit hashes Bookkeeping only - the status header and phase table still said "Plan (not started)" after Phases 0-3 were implemented and committed. --- complexity-cap.md | 18 ++++++++++-------- 1 file changed, 10 insertions(+), 8 deletions(-) diff --git a/complexity-cap.md b/complexity-cap.md index 75fe21ae..2fa4ffdc 100644 --- a/complexity-cap.md +++ b/complexity-cap.md @@ -1,6 +1,8 @@ # Complexity Cap: Bounding Pathological HermitCrab Parses -**Status:** Plan (not started) — sequencing and defaults decided, see §8/§10 +**Status:** Phases 0–3 implemented and committed on `complexity-cap` (stacked on `hc-rustify`, +see §8); Phase 4 (FieldWorks integration) is a separate follow-up in the FW repo. Sena +calibration is a ~1% sample pending a full-corpus re-baseline — see §10 items 7–8. **Author:** drafted 2026-07-02 **Related:** PR #446 (hc-rustify performance work), FieldWorks out-of-process HC worker (FW PR #983) @@ -411,13 +413,13 @@ across rustify's 100-file rewrite is not. Concretely: ## 9. Phases -| Phase | Deliverable | Depends on | Est. size | -|---|---|---|---| -| 0 | Branch off `hc-rustify`. 
Baseline `indonesian`/`sena` on rustify (max steps/time observed → derive generous `MaxParseSteps`/`ParseTimeout` defaults); build 1–2 pathological variants of the indonesian grammar; repro harness | `hc-rustify` | S | -| 1 | `ParseContext`, `MaxParseSteps` + `ParseTimeout`, soft-stop checks, `ParseDiagnostics` overload, breach re-run with per-rule counters | 0 | M | -| 2 | `MaxRuleApplicationsPerWord`, `MaxAnalysisShapeGrowth`, cascade depth cap | 1 (shares `ParseContext`) | M | -| 3 | `GrammarAnalyzer` + HC0001–HC0008, CLI, "Writing performant HC grammars" guide | — (parallelizable) | M–L | -| 4 | FieldWorks follow-ups: worker DTO status field, FLEx "diagnose word" + parser-report lint surfacing, set conservative caps in HCLoader | 1–3, FW repo | separate effort | +| Phase | Deliverable | Depends on | Est. size | Status | +|---|---|---|---|---| +| 0 | Branch off `hc-rustify`. Baseline `indonesian`/`sena` on rustify (max steps/time observed → derive generous `MaxParseSteps`/`ParseTimeout` defaults); build 1–2 pathological variants of the indonesian grammar; repro harness | `hc-rustify` | S | **Done** (Indonesian fully baselined; Sena ~1% sampled — see §10.7) | +| 1 | `ParseContext`, `MaxParseSteps` + `ParseTimeout`, soft-stop checks, `ParseDiagnostics` overload, breach re-run with per-rule counters | 0 | M | **Done** (commit b3fd2b55) | +| 2 | `MaxRuleApplicationsPerWord`, `MaxAnalysisShapeGrowth`, cascade depth cap | 1 (shares `ParseContext`) | M | **Done** (commit e68f0984) | +| 3 | `GrammarAnalyzer` + HC0001–HC0008, CLI, "Writing performant HC grammars" guide | — (parallelizable) | M–L | **Done** (commit c8a39aeb) | +| 4 | FieldWorks follow-ups: worker DTO status field, FLEx "diagnose word" + parser-report lint surfacing, set conservative caps in HCLoader | 1–3, FW repo | separate effort | Not started (separate repo) | ## 10. 
Open questions From 343515b183ed28e2b3fff0e822eba035207fdca6 Mon Sep 17 00:00:00 2001 From: John Lambert Date: Thu, 2 Jul 2026 16:18:33 -0400 Subject: [PATCH 5/6] ComplexityCapCorpusTests: report top 5 words by step count Small addition to the ad hoc Phase 0 calibration harness, left uncommitted from the corpus investigation: keeps a running top-5 (by StepsUsed) instead of only the single max, so a full-corpus re-baseline (see complexity-cap.md Section 10 item 7) shows the shape of the tail, not just one data point. --- .../ComplexityCapCorpusTests.cs | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/tests/SIL.Machine.Morphology.HermitCrab.Tests/ComplexityCapCorpusTests.cs b/tests/SIL.Machine.Morphology.HermitCrab.Tests/ComplexityCapCorpusTests.cs index ec5151bf..58e61145 100644 --- a/tests/SIL.Machine.Morphology.HermitCrab.Tests/ComplexityCapCorpusTests.cs +++ b/tests/SIL.Machine.Morphology.HermitCrab.Tests/ComplexityCapCorpusTests.cs @@ -73,6 +73,7 @@ private static void RunCorpus(string name) int wordsParsed = 0; int wordsSkipped = 0; var pathologicalWords = new List<(string Word, int Steps)>(); + var topWordsBySteps = new List<(string Word, int Steps, long Ms)>(); foreach (string word in words) { ParseDiagnostics diagnostics; @@ -99,6 +100,11 @@ private static void RunCorpus(string name) if (diagnostics.BudgetExhausted) pathologicalWords.Add((word, diagnostics.StepsUsed)); + topWordsBySteps.Add((word, diagnostics.StepsUsed, wordSw.ElapsedMilliseconds)); + topWordsBySteps.Sort((a, b) => b.Steps.CompareTo(a.Steps)); + if (topWordsBySteps.Count > 5) + topWordsBySteps.RemoveAt(topWordsBySteps.Count - 1); + if (diagnostics.StepsUsed > maxSteps) { maxSteps = diagnostics.StepsUsed; @@ -119,6 +125,10 @@ private static void RunCorpus(string name) + $"suggested default MaxParseSteps (100x observed max) = {Math.Max(maxSteps, 1) * 100}" ); + TestContext.Out.WriteLine($"{name}: top {topWordsBySteps.Count} words by step count:"); + foreach ((string word, int 
steps, long ms) in topWordsBySteps) + TestContext.Out.WriteLine($" '{word}': {steps} steps, {ms}ms"); + if (pathologicalWords.Count > 0) { TestContext.Out.WriteLine( From c1d7db641e0b9f8849d21d1ab5bc0ee1955963ef Mon Sep 17 00:00:00 2001 From: John Lambert Date: Thu, 2 Jul 2026 17:29:31 -0400 Subject: [PATCH 6/6] complexity-cap.md: update Sena calibration open items with parse-optimization findings A separate investigation (sharded Release-mode full-corpus scan, see docs/hermitcrab-parse-algorithm-analysis.md on the sibling parse-optimization branch, not yet committed anywhere) got much further than this branch's own single-threaded Debug-mode recalibration attempt, which was aborted after ~1 hour at 283/7,121 Sena words to avoid burning many more hours on redundant/inferior data. Updates items 7-8 with that scan's numbers (p90 ~2M steps, ~16% of words >1M steps, worst observed >=39.9M steps, 30s ParseTimeout trips on dozens of legitimate words) and adds item 9: cinacemerwa (37.5M steps, 0 valid parses) crashed the NUnit test host outright, apparently from memory pressure independent of the step/timeout budgets - the current Layer 1/2 budgets bound steps and wall-clock but not allocations. --- complexity-cap.md | 37 +++++++++++++++++++++++++++---------- 1 file changed, 27 insertions(+), 10 deletions(-) diff --git a/complexity-cap.md b/complexity-cap.md index 2fa4ffdc..9ce70f7a 100644 --- a/complexity-cap.md +++ b/complexity-cap.md @@ -444,13 +444,30 @@ across rustify's 100-file rewrite is not. Concretely: 6. **HC0004/HC0008 precision**: self-feeding/cycle detection via unification is approximate; acceptable false-positive rate for a Warning? Start conservative (high-confidence patterns only), widen with field feedback. -7. **Sena calibration is based on a ~1% sample (72/7,121 words)**, not a full corpus run - (see §4.1) — the worst-observed-word figures used to set `DefaultMaxParseSteps`/ - `DefaultParseTimeout` are a floor, not a proven ceiling. 
Re-baseline against the full - corpus (accept the multi-hour run, or parallelize it) before treating these as final, - and specifically check whether any word exceeds the current 50,000,000-step default. -8. **`DefaultParseTimeout` = 30s will still truncate some legitimate Sena words** (one - observed at 105s). Whether 30s is the right number — vs. a larger default, vs. no - default timeout with only a step budget, vs. a per-consumer-tunable-only knob with no - shipped default at all — is a real product decision that needs field input, not - something this investigation can resolve alone. +7. **Update 2026-07-02, still open:** a separate investigation (sharded 8-way, Release-mode + scan of the full Sena corpus — see `docs/hermitcrab-parse-algorithm-analysis.md`, + currently uncommitted on a sibling `parse-optimization` branch, not yet landed here) + got much further than this branch's own single-threaded Debug-mode attempt (which was + killed after ~1 hour at 283/7,121 words — some individual words alone took 50+ seconds + at that build/threading combination, and the earlier ~1% sample already showed the + corpus has a long tail). That scan found: p90 ≈ 2,000,000 steps; ~16% of words exceed + 1,000,000 steps; worst observed so far ≥ 39,900,000 steps (`kukucitirani`) — under the + 50,000,000-step default, but with much less headroom than the original ~1% sample + suggested, and still not confirmed as a true corpus-wide maximum. Re-baselining against + a complete, verified full-corpus run (ideally the sharded/Release harness, not this + branch's test-suite-based one) remains open. +8. **Update 2026-07-02, confirmed, not yet resolved:** the same investigation confirms + `DefaultParseTimeout` = 30s trips on *dozens* of legitimate Sena words (single-threaded + times of 100–250s observed), not just the one word noted in the original finding above. + The product-decision question (raise the default? drop it in favor of the step budget + alone? 
make it a no-shipped-default, per-consumer-only knob?) still needs field input. +9. **New finding, 2026-07-02:** the same investigation reports that `cinacemerwa`, one of + Sena's most expensive known words (37.5M steps, and notably a word that yields *zero* + valid parses), crashed the NUnit test host outright, apparently from memory pressure during + candidate-explosion, independent of the step/timeout budgets (which bound *steps*, not + *allocations*). This means the current Layer 1/2 budgets do not fully protect against a + pathological word exhausting memory before it exhausts its step or time budget. Whether + this needs a third bound (e.g. a candidate-count or allocation ceiling) or is + sufficiently addressed by the algorithmic fixes under investigation on `parse-optimization` + (which would shrink the candidate set directly, see that branch's + `docs/hermitcrab-parse-algorithm-analysis.md` §4) is undecided.