473 changes: 473 additions & 0 deletions complexity-cap.md

Large diffs are not rendered by default.

115 changes: 115 additions & 0 deletions docs/hermitcrab-grammar-performance.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,115 @@
# Writing performant HermitCrab grammars

HermitCrab's engine speedups (see the `hc-rustify` work) and its complexity-cap safety net
(`complexity-cap.md`) both help pathological grammars fail *safely* — bounded runtime, a status
flag, and per-rule evidence when a parse gives up. Neither one makes a pathological grammar fast.
The real fix is always at the grammar level. This guide catalogues the rule shapes that reliably
cause combinatorial blowups, keyed by the stable diagnostic codes `GrammarAnalyzer.Analyze`
(`hc lint`) emits, plus the interaction patterns that only show up empirically.

## Static checks (`GrammarAnalyzer` / `hc lint`)

### HC0001 — Error: no overt exponent + `MaxApplicationCount > 1`

An affix rule whose every allomorph's output is a pure copy of the input (no inserted segments)
*and* whose `MaxApplicationCount` has been raised above 1 (the XML `multipleApplication`
attribute) will unapply to every word, every time, with nothing to ever make it stop. Analysis
keeps "peeling off" a rule that changed nothing, over and over, up to the configured cap.

**Fix:** give the rule a real, overt exponent (an inserted segment or boundary), or drop
`MaxApplicationCount` back to the default of 1.
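
As a rough illustration of why this shape never stops on its own, here is a minimal toy sketch. It is not the HermitCrab API: the `Candidates` helper, the example word, and the string-based "unapplication" are all assumptions made for illustration. An overt suffix runs out after a few unapplications, while an empty exponent only stops when the cap is reached.

```csharp
using System;
using System.Collections.Generic;

// Toy model (hypothetical, not HermitCrab): unapplying a suffix rule strips its
// exponent from the end of the shape. An empty exponent strips nothing, so the
// shape always "licenses" one more unapplication.
static List<string> Candidates(string surface, string exponent, int maxApplicationCount)
{
    var candidates = new List<string> { surface };
    string current = surface;
    for (int i = 0; i < maxApplicationCount; i++)
    {
        if (!current.EndsWith(exponent, StringComparison.Ordinal))
            break; // the shape no longer licenses another unapplication
        current = current.Substring(0, current.Length - exponent.Length);
        candidates.Add(current);
    }
    return candidates;
}

List<string> overt = Candidates("banana", "na", 5); // stops once "na" runs out
List<string> zero = Candidates("banana", "", 5);    // only the cap stops it
Console.WriteLine($"overt exponent: {overt.Count} candidates");
Console.WriteLine($"zero exponent:  {zero.Count} candidates, {new HashSet<string>(zero).Count} distinct shape(s)");
```

In this toy model every extra candidate from the zero-exponent rule has the same shape as the surface form, so raising the cap only adds redundant work.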

### HC0002 — Warning: no overt exponent, single application

Same "adds nothing" shape as HC0001, but capped at one application. It still doubles the candidate
count at every cascade position where it is considered, for no linguistic payoff. Often this is an
unintentional gap in a grammar rather than a deliberate zero-exponent rule (e.g. a rule that's
purely feature-changing).

**Fix:** add an overt exponent if one is missing, or confirm the zero-exponent shape is
intentional (e.g. modeling a floating feature) and leave it. HC0002 is a warning, not a hard
error.
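
The doubling can be sketched with a toy model. This is an assumption-laden illustration, not engine code, and the rule names `R1` to `R4` are hypothetical. Each zero-exponent rule leaves the shape untouched, so both the "unapplied" and the "skipped" branch survive at every position.

```csharp
using System;
using System.Collections.Generic;

// Toy model (hypothetical, not HermitCrab): an analysis is just the list of
// rules hypothesized so far. A zero-exponent rule never changes the shape, so
// nothing ever prunes either branch.
var analyses = new List<List<string>> { new List<string>() };
string[] zeroExponentRules = { "R1", "R2", "R3", "R4" }; // hypothetical rule names

foreach (string rule in zeroExponentRules)
{
    var next = new List<List<string>>();
    foreach (List<string> analysis in analyses)
    {
        next.Add(analysis);                            // rule skipped
        next.Add(new List<string>(analysis) { rule }); // rule unapplied once
    }
    analyses = next;
    Console.WriteLine($"after {rule}: {analyses.Count} candidates for the same shape");
}
```

Under these assumptions the candidate count is 2 to the power of the number of such rules, which is why even single-application zero-exponent rules are worth auditing.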

### HC0003 — Warning: `MaxApplicationCount` raised

Flags the opt-in itself, on any affix rule, independent of whether it has an overt exponent. This
is exactly the knob a pathological grammar reaches for. It's not wrong to raise it — some
agglutinative languages need real recursive affixation — but every raised value should be
justified by an actual attested word shape, not left at "big enough."

**Fix:** set it to the smallest value that covers real words in the language, not a round number
picked for headroom.

### HC0004 — Warning: self-feeding rewrite rule

A `Simultaneous`-mode phonological rule whose output can satisfy its own environment again. Before
complexity-cap's Layer 1, this specific shape (`ReapplyType.SelfOpaquing` in `AnalysisRewriteRule`)
had **no reapplication bound at all** — an unconditional infinite loop the first time a grammar
hit it. Layer 1's step budget now catches it, but it's still wasted work every single parse.

**Fix:** add an environment constraint that excludes the rule's own output (so a second
application can't match), or switch to `Iterative` mode if repeated application really is the
intent — iterative mode terminates naturally once the pattern stops matching.
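
The effect of the environment constraint can be sketched on plain strings. This is a simplified stand-in under stated assumptions: the rule `b → ab`, the `ApplyToFixedPoint` helper, and the step budget of 20 are hypothetical, and real HermitCrab rules operate on feature structures rather than regexes.

```csharp
using System;
using System.Text.RegularExpressions;

// Reapply a rewrite until nothing changes or the step budget runs out.
static (string Result, int Steps, bool Exhausted) ApplyToFixedPoint(
    Func<string, string> rule, string input, int maxSteps)
{
    string current = input;
    for (int step = 0; step < maxSteps; step++)
    {
        string next = rule(current);
        if (next == current)
            return (current, step, false); // the pattern stopped matching
        current = next;
    }
    return (current, maxSteps, true); // budget exhausted, still changing
}

// Self-feeding: every "b" becomes "ab", and the output still contains a "b".
Func<string, string> selfFeeding = s => Regex.Replace(s, "b", "ab");
// Constrained: only a "b" that is not already preceded by "a" is rewritten.
Func<string, string> constrained = s => Regex.Replace(s, "(?<!a)b", "ab");

var unbounded = ApplyToFixedPoint(selfFeeding, "b", 20);
var bounded = ApplyToFixedPoint(constrained, "b", 20);
Console.WriteLine($"self-feeding: exhausted={unbounded.Exhausted}, length={unbounded.Result.Length}");
Console.WriteLine($"constrained:  exhausted={bounded.Exhausted}, result={bounded.Result}, steps={bounded.Steps}");
```

The self-feeding variant only stops because of the budget, while the constrained variant reaches a fixed point after a single rewrite.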

### HC0005 — Warning: unconstrained deletion

A deletion phonological rule (synthesis removes more material than it keeps) with no left or
right environment constraint at all. During analysis, HermitCrab must hypothesize that the deleted
segment could have been anywhere satisfying the (empty) environment — i.e. everywhere — and
`Morpher.DeletionReapplications` governs how many times it's willing to keep re-guessing.

**Fix:** add a left and/or right environment constraint so reinsertion is only considered where
deletion could plausibly have applied.
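
To get a feel for the difference an environment makes, the following toy sketch counts the shapes analysis would have to hypothesize when undoing a deletion. Everything in it is an assumption for illustration: the deleted segment `h`, the surface form `naio`, the `Hypotheses` helper, and the "between two vowels" environment. It does not use the HermitCrab API.

```csharp
using System;
using System.Collections.Generic;

// Toy model (hypothetical): analysis undoes a deletion of 'h' by reinserting it
// wherever the rule's environment allows, for a fixed number of rounds.
static HashSet<string> Hypotheses(string surface, int reapplications, Func<string, int, bool> environment)
{
    var seen = new HashSet<string> { surface };
    var frontier = new List<string> { surface };
    for (int round = 0; round < reapplications; round++)
    {
        var next = new List<string>();
        foreach (string shape in frontier)
        {
            for (int i = 0; i <= shape.Length; i++)
            {
                if (!environment(shape, i))
                    continue;
                string candidate = shape.Insert(i, "h");
                if (seen.Add(candidate))
                    next.Add(candidate);
            }
        }
        frontier = next;
    }
    return seen;
}

static bool IsVowel(char c) => "aeiou".IndexOf(c) >= 0;

// No environment: every position is a possible deletion site.
Func<string, int, bool> anywhere = (shape, i) => true;
// Constrained: 'h' is only assumed to have been deleted between two vowels.
Func<string, int, bool> betweenVowels = (shape, i) =>
    i > 0 && i < shape.Length && IsVowel(shape[i - 1]) && IsVowel(shape[i]);

int unconstrainedCount = Hypotheses("naio", 2, anywhere).Count;
int constrainedCount = Hypotheses("naio", 2, betweenVowels).Count;
Console.WriteLine($"unconstrained: {unconstrainedCount} hypotheses");
Console.WriteLine($"constrained:   {constrainedCount} hypotheses");
```

The number of rounds plays the role that `Morpher.DeletionReapplications` plays in the text: each extra round multiplies the unconstrained candidates again.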

### HC0006 — Warning: unconstrained compounding

A compounding rule that constrains the part of speech of neither the head nor the non-head. Every
stem in the lexicon becomes a candidate on *both* sides — a cross-product that interacts with
`Morpher.MaxStemCount` and grows fast with lexicon size.

**Fix:** constrain `HeadRequiredSyntacticFeatureStruct` and/or `NonHeadRequiredSyntacticFeatureStruct`
to the parts of speech that can actually compound in the language.
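
A back-of-the-envelope sketch of the cross-product, using a hypothetical six-entry lexicon and plain part-of-speech strings instead of real feature structures (the `CountPairs` helper is illustrative only):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical mini-lexicon: (stem, part of speech).
var lexicon = new List<(string Stem, string Pos)>
{
    ("book", "N"), ("shelf", "N"), ("house", "N"),
    ("run", "V"), ("read", "V"), ("blue", "A"),
};

// Count the head/non-head pairings a compounding rule would have to consider.
int CountPairs(Func<string, bool> headOk, Func<string, bool> nonHeadOk) =>
    lexicon.Count(e => headOk(e.Pos)) * lexicon.Count(e => nonHeadOk(e.Pos));

int unconstrained = CountPairs(pos => true, pos => true);
int nounNoun = CountPairs(pos => pos == "N", pos => pos == "N");
Console.WriteLine($"unconstrained: {unconstrained} pairs, N+N only: {nounNoun} pairs");
```

The unconstrained count grows with the square of the lexicon size, whereas the constrained count grows only with the square of the eligible subset.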

### HC0007 — Info: adjacent optional/iterative lexical patterns

A lexical guess pattern (e.g. `([Seg])([Seg])`) with two or more optional/iterative segments back
to back. `Morpher.LexicalGuess`'s own comments already note this produces spurious ambiguity:
multiple paths through the pattern match the same literal string, multiplying candidates without
adding coverage.

**Fix:** prefer a single Kleene-star class (`[Seg]*`) over back-to-back optional groups when the
intent is "zero or more of these."
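
The spurious ambiguity can be counted directly. The sketch below assumes every segment matches the class and models each pattern element as a hypothetical `(Min, Max)` repetition range; the pattern spellings in the comments are informal notation, not HermitCrab syntax.

```csharp
using System;

// Count the distinct ways a sequence of pattern elements can consume exactly
// `length` segments, assuming every segment matches the class.
// Each element is (Min, Max) repetitions; Max = -1 means unbounded.
static long CountPaths((int Min, int Max)[] elements, int index, int length)
{
    if (index == elements.Length)
        return length == 0 ? 1 : 0;
    (int min, int max) = elements[index];
    int upper = max < 0 ? length : Math.Min(max, length);
    long paths = 0;
    for (int take = min; take <= upper; take++)
        paths += CountPaths(elements, index + 1, length - take);
    return paths;
}

var twoOptional = new[] { (Min: 0, Max: 1), (Min: 0, Max: 1) };    // two optional groups
var twoIterative = new[] { (Min: 0, Max: -1), (Min: 0, Max: -1) }; // two iterative groups
var singleStar = new[] { (Min: 0, Max: -1) };                      // one Kleene-star class

long optionalPaths = CountPaths(twoOptional, 0, 1);
long iterativePaths = CountPaths(twoIterative, 0, 4);
long starPaths = CountPaths(singleStar, 0, 4);
Console.WriteLine($"two optional groups, 1 segment:   {optionalPaths} paths");
Console.WriteLine($"two iterative groups, 4 segments: {iterativePaths} paths");
Console.WriteLine($"single star, 4 segments:          {starPaths} paths");
```

Every path beyond the first matches the same literal string, so the extra candidates add ambiguity without adding coverage.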

### HC0008 — Info: cyclic feeding pair (best-effort)

Two affix rules that each add no overt exponent, where each rule's output syntactic category is
compatible with the other's input requirement. Structurally, this is the shape of an
`A → B → A → B → ...` cycle that never terminates via a shape change — the specific loophole that
`Morpher.MaxRuleApplicationsPerWord` exists to close, since neither rule's own
`MaxApplicationCount` will ever trip on its own.

This check is intentionally conservative (high-confidence pairs only, per an open question in
complexity-cap.md §10). It will miss cycles in which a rule has an overt exponent but the pair
still loops via some other mechanism, and it won't catch cycles longer than two rules.

**Fix:** verify the two rules can't actually chain into each other indefinitely; if they
legitimately can (rare), set a `MaxRuleApplicationsPerWord` cap.
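
The shape this check looks for can be sketched as a pairwise scan. This is not the `GrammarAnalyzer` implementation: the rule summaries are hypothetical, and category "compatibility" is reduced to string equality, whereas the real check works on syntactic feature structures.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical rule summaries: input category, output category, overt exponent?
var rules = new List<(string Name, string InCat, string OutCat, bool HasOvertExponent)>
{
    ("nominalizer", "V", "N", false),
    ("verbalizer", "N", "V", false),
    ("plural", "N", "N", true),
};

// Report pairs where neither rule adds material and each one feeds the other.
var cyclicPairs = new List<(string, string)>();
for (int i = 0; i < rules.Count; i++)
{
    for (int j = i + 1; j < rules.Count; j++)
    {
        var a = rules[i];
        var b = rules[j];
        bool bothZero = !a.HasOvertExponent && !b.HasOvertExponent;
        bool feedEachOther = a.OutCat == b.InCat && b.OutCat == a.InCat;
        if (bothZero && feedEachOther)
            cyclicPairs.Add((a.Name, b.Name));
    }
}

foreach ((string first, string second) in cyclicPairs)
    Console.WriteLine($"possible cycle: {first} <-> {second}");
```

Like the check described above, this sketch only looks at pairs, so a three-rule cycle would pass through it unnoticed.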

## What static analysis can't catch

Individually reasonable rules can still combine into exponential blowups. This is an inherent
limit of static analysis over a rule *set*, not a specific bug in `GrammarAnalyzer`. When a word breaches
`Morpher.MaxParseSteps`/`ParseTimeout`, use `Morpher.RerunWithDiagnostics` to re-parse that one word
with per-rule counters enabled and get an empirical top-offender report: *"word X exceeded N
steps; rule Y accounted for most of the applications."* That rule is where to start — check it
against the codes above even if the static pass didn't flag it standalone, since the empirical
report is often revealing an *interaction*, not a single bad rule.

## Layered defense, not a substitute for grammar fixes

None of `MaxParseSteps`, `ParseTimeout`, `MaxRuleApplicationsPerWord`, or `MaxAnalysisShapeGrowth`
makes a pathological grammar parse faster or more correctly. They bound the damage (a soft stop
with partial results, never a hang, never an exception) while the grammar gets fixed. A grammar
that regularly needs those caps to fire is a grammar that needs fixing, not a grammar that's
"handled." Treat a budget breach as a bug report against the grammar, using the codes and the
empirical report above to find the specific rule to fix.
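
The soft-stop behaviour can be pictured with a toy step budget. This is a hypothetical sketch, not the real parse context: `Step`, `maxSteps`, and the deliberately explosive search are assumptions chosen to show that a budget bounds the work without making the search any smarter.

```csharp
using System;
using System.Collections.Generic;

// Toy step budget (hypothetical): Step() returns false once the budget is
// spent, and the search stops with whatever it has collected so far.
int maxSteps = 50;
int stepsUsed = 0;
bool exhausted = false;

bool Step()
{
    if (stepsUsed >= maxSteps)
    {
        exhausted = true;
        return false;
    }
    stepsUsed++;
    return true;
}

// A deliberately explosive search: every candidate spawns two more.
var results = new List<string>();
var queue = new Queue<string>();
queue.Enqueue("x");
while (queue.Count > 0)
{
    if (!Step())
        break; // soft stop: keep partial results, throw nothing
    string candidate = queue.Dequeue();
    results.Add(candidate);
    queue.Enqueue(candidate + "a");
    queue.Enqueue(candidate + "b");
}

Console.WriteLine($"{results.Count} partial result(s), steps={stepsUsed}, exhausted={exhausted}");
```

Raising `maxSteps` in this sketch only produces more partial results; the search still never finishes, which is the sense in which a cap is damage control and not a fix.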
68 changes: 68 additions & 0 deletions src/SIL.Machine.Morphology.HermitCrab.Tool/LintCommand.cs
@@ -0,0 +1,68 @@
using System.Linq;
using ManyConsole;

namespace SIL.Machine.Morphology.HermitCrab;

/// <summary>
/// Thin CLI wrapper around <see cref="GrammarAnalyzer.Analyze"/> (complexity-cap.md §6.3) — lets
/// machine.py users and CI-style grammar validation run the static lint outside FLEx.
/// </summary>
internal class LintCommand : ConsoleCommand
{
    private readonly HCContext _context;
    private string _severity;

    public LintCommand(HCContext context)
    {
        _context = context;

        IsCommand("lint", "Runs static grammar analysis and reports diagnostics (see complexity-cap.md).");
        SkipsCommandSummaryBeforeRunning();
        HasOption(
            "s|severity=",
            "minimum severity to report: info, warning, or error (default: info)",
            o => _severity = o
        );
    }

    public override int Run(string[] remainingArguments)
    {
        DiagnosticSeverity minSeverity = ParseSeverity(_severity);
        var diagnostics = GrammarAnalyzer
            .Analyze(_context.Language)
            .Where(d => d.Severity >= minSeverity)
            .OrderBy(d => d.Code)
            .ToList();

        if (diagnostics.Count == 0)
        {
            _context.Out.WriteLine("No grammar diagnostics found.");
        }
        else
        {
            foreach (GrammarDiagnostic diagnostic in diagnostics)
            {
                _context.Out.WriteLine("{0} [{1}] {2}", diagnostic.Code, diagnostic.Severity, diagnostic.Message);
                _context.Out.WriteLine(" Suggestion: {0}", diagnostic.Suggestion);
            }
            _context.Out.WriteLine();
            _context.Out.WriteLine("{0} diagnostic(s).", diagnostics.Count);
        }

        _context.Out.WriteLine();
        return 0;
    }

    private static DiagnosticSeverity ParseSeverity(string severity)
    {
        switch (severity?.ToLowerInvariant())
        {
            case "warning":
                return DiagnosticSeverity.Warning;
            case "error":
                return DiagnosticSeverity.Error;
            default:
                return DiagnosticSeverity.Info;
        }
    }
}
57 changes: 56 additions & 1 deletion src/SIL.Machine.Morphology.HermitCrab.Tool/ParseCommand.cs
@@ -1,4 +1,5 @@
using System.Collections.Generic;
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using ManyConsole;
@@ -8,6 +9,7 @@ namespace SIL.Machine.Morphology.HermitCrab;
internal class ParseCommand : ConsoleCommand
{
private readonly HCContext _context;
private bool _diagnose;

public ParseCommand(HCContext context)
{
@@ -16,11 +18,18 @@ public ParseCommand(HCContext context)
IsCommand("parse", "Parses a word");
SkipsCommandSummaryBeforeRunning();
HasAdditionalArguments(1, "<word>");
HasOption(
"d|diagnose",
"reports step budget usage and the top offending rules for this word (see complexity-cap.md)",
o => _diagnose = true
);
}

public override int Run(string[] remainingArguments)
{
string word = remainingArguments[0];
if (_diagnose)
return RunDiagnose(word);
try
{
_context.ParseCount++;
@@ -58,6 +67,52 @@ public override int Run(string[] remainingArguments)
_context.Out.WriteLine();
return 1;
}
finally
{
_diagnose = false;
}
}

private int RunDiagnose(string word)
{
try
{
ParseDiagnostics diagnostics = _context.Morpher.RerunWithDiagnostics(word, out IEnumerable<Word> results);
int resultCount = results.Count();
_context.Out.WriteLine(
"\"{0}\": {1} result(s), {2} step(s), {3:F1}ms, budget exhausted: {4}{5}",
word,
resultCount,
diagnostics.StepsUsed,
diagnostics.Elapsed.TotalMilliseconds,
diagnostics.BudgetExhausted,
diagnostics.BudgetExhausted ? $" ({diagnostics.Reason})" : ""
);
_context.Out.WriteLine("Top rules by application count:");
foreach ((IHCRule rule, int applications) in diagnostics.TopRules.Take(10))
{
double pct = 100.0 * applications / Math.Max(diagnostics.StepsUsed, 1);
_context.Out.WriteLine(
" {0,8} ({1,5:F1}%) {2} '{3}'",
applications,
pct,
rule.GetType().Name,
rule.Name
);
}
_context.Out.WriteLine();
return 0;
}
catch (InvalidShapeException ise)
{
_context.Out.WriteLine("The word contains an invalid segment at position {0}.", ise.Position + 1);
_context.Out.WriteLine();
return 1;
}
finally
{
_diagnose = false;
}
}

private void PrintTrace(Trace trace, int indent, HashSet<int> lineIndices)
1 change: 1 addition & 0 deletions src/SIL.Machine.Morphology.HermitCrab.Tool/Program.cs
@@ -92,6 +92,7 @@ public static int Main(string[] args)
new TracingCommand(context),
new TestCommand(context),
new StatsCommand(context),
new LintCommand(context),
};

string input;
@@ -31,6 +31,9 @@ public AnalysisAffixTemplateRule(Morpher morpher, AffixTemplate template)

public IEnumerable<Word> Apply(Word input)
{
if (input.ParseContext?.Step(_template) == false)
return Enumerable.Empty<Word>();

if (!_morpher.RuleSelector(_template))
return Enumerable.Empty<Word>();

3 changes: 3 additions & 0 deletions src/SIL.Machine.Morphology.HermitCrab/AnalysisLanguageRule.cs
@@ -26,6 +26,9 @@ public IEnumerable<Word> Apply(Word input)
var results = new HashSet<Word>(FreezableEqualityComparer<Word>.Default);
for (int i = 0; i < _rules.Count && inputSet.Count > 0; i++)
{
if (input.ParseContext?.Exhausted == true)
break;

if (!_morpher.RuleSelector(_strata[i]))
continue;

33 changes: 32 additions & 1 deletion src/SIL.Machine.Morphology.HermitCrab/AnalysisStratumRule.cs
@@ -11,6 +11,7 @@ namespace SIL.Machine.Morphology.HermitCrab
internal class AnalysisStratumRule : IRule<Word, int>
{
private readonly IRule<Word, int> _mrulesRule;
private readonly PermutationRuleCascade<Word, int> _permutationCascade;
private readonly IRule<Word, int> _prulesRule;
private readonly IRule<Word, int> _templatesRule;
private readonly Stratum _stratum;
@@ -39,11 +40,12 @@ public AnalysisStratumRule(Morpher morpher, Stratum stratum)
// because morphological rules should be considered optional
// during unapplication (they are obligatory during application,
// but we don't know they have been applied during unapplication).
_mrulesRule = new PermutationRuleCascade<Word, int>(
_permutationCascade = new PermutationRuleCascade<Word, int>(
mrules,
true,
FreezableEqualityComparer<Word>.Default
);
_mrulesRule = _permutationCascade;
break;
case MorphologicalRuleOrder.Unordered:
// Single-threaded when the caller caps within-word parallelism (e.g. it
@@ -106,8 +108,24 @@ private IRule<Word, int> CompilePhonologicalRule(IPhonologicalRule prule, Morphe
}
}

private bool ExceedsShapeGrowth(Word word)
{
return _morpher.MaxAnalysisShapeGrowth >= 0
&& word.ParseContext != null
&& word.Shape.Count > word.ParseContext.SurfaceLength + _morpher.MaxAnalysisShapeGrowth;
}

public IEnumerable<Word> Apply(Word input)
{
// Re-synced on every call rather than baked in at compile time: MaxRuleApplicationsPerWord
// is a mutable Morpher property that callers set via object-initializer syntax after
// construction (the same pattern MaxParseSteps/ParseTimeout use), which runs after this
// rule was already compiled. No new knob per complexity-cap.md §5.3 — derived from the
// existing per-word unapplication cap (0/unlimited maps to no depth limit).
if (_permutationCascade != null)
_permutationCascade.MaxDepth =
_morpher.MaxRuleApplicationsPerWord > 0 ? _morpher.MaxRuleApplicationsPerWord : -1;

if (_morpher.TraceManager.IsTracing)
_morpher.TraceManager.BeginUnapplyStratum(_stratum, input);

@@ -132,6 +150,19 @@ public IEnumerable<Word> Apply(Word input)
_morpher.TraceManager.EndUnapplyStratum(_stratum, input);
foreach (Word mruleOutWord in mruleOutWords)
{
// Once the budget is gone, stop collecting outputs immediately rather than draining the
// rest of an already-in-flight (but now-empty-yielding) rule cascade.
if (input.ParseContext?.Exhausted == true)
break;

// Prune candidates whose hypothesized underlying shape has grown too far past the
// surface form — the truly unbounded generator (undone deletions, empty exponents).
// Pruned here so they never reach lexical lookup or the next stratum.
if (ExceedsShapeGrowth(mruleOutWord))
{
continue;
}

// Skip intermediate sources from phonological rules, templates, and morphological rules.
mruleOutWord.Source = origInput;
if (mergeEquivalentAnalyses)