Merged
65 commits
b397356
WIP issue #39: total parse/edit with cst.errors - one residual equiva…
johnsoncodehk Jun 11, 2026
dc10568
Total parse/edit complete: a latent Pratt watermark hole closed, equi…
johnsoncodehk Jun 11, 2026
e4fc2f3
Gate the expression-splitting ';' injection class
johnsoncodehk Jun 11, 2026
05c6284
Cross-grammar incremental gate: all 7 grammars, edit ≡ fresh + self-c…
johnsoncodehk Jun 11, 2026
3e7f1d6
Missing-token synthesis: tsc-style "expected 'x'" with structure pres…
johnsoncodehk Jun 11, 2026
bf771a1
Missing-nonterminal synthesis: the tsc "Expression expected" analog
johnsoncodehk Jun 11, 2026
2245f0b
Broken-state edits go incremental: recovering adoption under bar purity
johnsoncodehk Jun 11, 2026
ee1890d
Cross-attempt memo survival: bar-free windows are context-free
johnsoncodehk Jun 11, 2026
b37e1cc
Conditional lexer resync: depth-shift adoption kills the transition c…
johnsoncodehk Jun 11, 2026
4248105
Recovering surgery: bar-clear splices keep the error tree incremental
johnsoncodehk Jun 11, 2026
668f8f5
Diagnostics: viable-set messages + paired-opener related info
johnsoncodehk Jun 11, 2026
2c6e593
Head-to-head bench: Monogram vs tsc updateSourceFile vs tree-sitter
johnsoncodehk Jun 11, 2026
71e14a7
Error-recovery conformance metric: bidirectional agreement vs tsc
johnsoncodehk Jun 11, 2026
f0d2c75
Reject unterminated templates and colon-less case clauses
johnsoncodehk Jun 11, 2026
25b78ba
Formal write-up + bounded-exhaustive edit gate
johnsoncodehk Jun 11, 2026
397a76d
Attribute the transition-edit cost to what profiling actually shows
johnsoncodehk Jun 11, 2026
476ab69
Row-level taint + reject body-less class expressions
johnsoncodehk Jun 11, 2026
d61726b
O(1) shifted-resync check at depth 0 via a pop-on-empty index list
johnsoncodehk Jun 11, 2026
3d8f494
Block bare statement keywords as expressions; for-in takes comma objects
johnsoncodehk Jun 11, 2026
f8a5742
Roadmap: enumerate the parser-acceptance long tail vs tsc
johnsoncodehk Jun 11, 2026
d37332b
Decorators prefix class members; orphan and post-modifier decorators …
johnsoncodehk Jun 11, 2026
d77b803
A ';'-less class field rejects a same-line decorator after it
johnsoncodehk Jun 11, 2026
777fe21
Lexer resync also validates the candidate's leading-trivia flags
johnsoncodehk Jun 11, 2026
aa15e91
Class-member commitment: tsc's parse-time rules, end to end
johnsoncodehk Jun 13, 2026
943be84
Interface heritage: parse repeated extends clauses
johnsoncodehk Jun 13, 2026
2c6ee57
Decl parser-surface: modifier-prefix, ambient module shorthand, globa…
johnsoncodehk Jun 13, 2026
ce69032
Widen decl modifier-prefix to the full accessibility/static set: zero…
johnsoncodehk Jun 13, 2026
aff69bf
Tolerate const class-member modifier and body-less object-literal acc…
johnsoncodehk Jun 13, 2026
d8a0ed1
Index signatures: optional value type and trailing comma
johnsoncodehk Jun 13, 2026
7ec8951
Support optional typed calls: a?.<T>(args)
johnsoncodehk Jun 13, 2026
472691c
await/yield fork: foundation — ctx markers + the name-fork transform
johnsoncodehk Jun 13, 2026
e9fd860
await/yield fork: canon plumbing in the parsers (no-op until forks ex…
johnsoncodehk Jun 13, 2026
3ede30e
await/yield fork: apply withAwaitYield inside the two parsers (no-op)
johnsoncodehk Jun 13, 2026
a51aacc
await/yield fork: cst-match RULE_CANON (no-op until forks exist)
johnsoncodehk Jun 13, 2026
b57e838
await/yield fork: wire async arrows + the reserve mechanism (JS, prov…
johnsoncodehk Jun 13, 2026
fe543ff
await/yield fork: wire TS async arrows — 9 over-accepts cleared
johnsoncodehk Jun 13, 2026
4daaa3f
await/yield fork: wire JS async function expressions (await family)
johnsoncodehk Jun 13, 2026
95e0502
await/yield fork: wire async/generator function declarations + close …
johnsoncodehk Jun 13, 2026
6e8b945
await/yield fork: reserve the single-identifier arrow parameter
johnsoncodehk Jun 13, 2026
e6dd7b3
await/yield fork: class static block body is [+Await]
johnsoncodehk Jun 13, 2026
67f91ee
await/yield fork: reserve at expression position only, keep bindings …
johnsoncodehk Jun 13, 2026
580e589
await/yield fork: method 4-way split (class + object) with order-free…
johnsoncodehk Jun 13, 2026
2bd176d
incremental: gate the [Await]/[Yield] fork under context-flipping edits
johnsoncodehk Jun 13, 2026
9c04bc0
using-declaration binds a BindingIdentifier only (ASI-companion)
johnsoncodehk Jun 13, 2026
b9dba19
Revert the UsingBinding refinement (9c04bc0): net-negative
johnsoncodehk Jun 13, 2026
acad7cb
statement-ASI lands with its companion surface: we-accept 73 -> 50, 0…
johnsoncodehk Jun 13, 2026
eb6162c
over-accept: reject legacy-octal/leading-zero numerics + newline-spli…
johnsoncodehk Jun 13, 2026
b16cb22
over-accept: commit `let [`, reject `new <`, reserve labels, index-si…
johnsoncodehk Jun 13, 2026
dd789de
over-accept: type-parameter name `in` is reserved (only `out` is a co…
johnsoncodehk Jun 13, 2026
c0cb01c
check.ts: run gates concurrently (serial sum -> ~slowest gate)
johnsoncodehk Jun 13, 2026
48d916f
Tighten class/param over-accepts: this-param, extends head, constructor
johnsoncodehk Jun 14, 2026
7113e37
over-accept: object type literal members require a separator
johnsoncodehk Jun 14, 2026
713a2d6
over-accept: type-predicate position, duplicate-static, tuple separators
johnsoncodehk Jun 14, 2026
4b0f36e
incremental: re-derive bar-ending recovery-made rows on adopt (fixes …
johnsoncodehk Jun 14, 2026
4be0336
over-accept: update/assignment operand must be a LeftHandSideExpression
johnsoncodehk Jun 14, 2026
a9bc3e6
over-accept: an arrow function may not be a binary/conditional operand
johnsoncodehk Jun 14, 2026
4a093a8
over-accept: `new` needs a target; optional chain may not contain a p…
johnsoncodehk Jun 14, 2026
635f957
over-accept: a binary/relational expression is not an assignment target
johnsoncodehk Jun 14, 2026
3a84a0d
parser: a CST producer models syntax, not static semantics
johnsoncodehk Jun 14, 2026
af76674
lexer: `/*` is never a regex start — an unterminated block comment is…
johnsoncodehk Jun 14, 2026
a1051f9
parser: `for (using of of …)` has no parse tree
johnsoncodehk Jun 14, 2026
360458b
parser: an optional chain may not follow a bare `new` expression
johnsoncodehk Jun 14, 2026
bd2ea42
parser: keyword/literal types are not `.`-qualifiable (`void.x` has n…
johnsoncodehk Jun 14, 2026
114ff70
parser: a `using` declaration binding is a BindingIdentifier, not a p…
johnsoncodehk Jun 14, 2026
471ec29
docs: README states correctness = the productions, not `tsc`
johnsoncodehk Jun 14, 2026
36 changes: 34 additions & 2 deletions README.md
@@ -34,7 +34,7 @@ A TextMate grammar is a pile of regexes guessing at a language's structure. It's

Take `typeof x < y`. A regex highlighter has to guess whether `<` opens a generic argument list or is a less-than comparison — and it guesses wrong somewhere, forever. A **parser** doesn't guess; the grammar already decides. Monogram inverts the dependency:

1. **Write the grammar, then prove it.** The grammar is executable — Monogram runs it as a recursive-descent + [Pratt](https://en.wikipedia.org/wiki/Operator-precedence_parser) (operator-precedence) parser over the TypeScript conformance suite, measured *bidirectionally*: it must **accept** every input `tsc` accepts **and reject** every input it rejects.
1. **Write the grammar, then prove it.** The grammar is executable — Monogram runs it as a recursive-descent + [Pratt](https://en.wikipedia.org/wiki/Operator-precedence_parser) (operator-precedence) parser over the TypeScript conformance suite, measured *bidirectionally*: it **accepts** what `tsc` accepts and **rejects** what `tsc` rejects — with `tsc` the [oracle, not the definition](#correctness-the-productions-not-tsc), the two diverging only where `tsc` itself does.

2. **Derive the highlighters from that proven grammar**, never hand-write them. The TextMate, tree-sitter, and Monarch outputs are all generated from the one parser-validated definition, so their correctness is underwritten by the conformance run, not by regex tuning.

@@ -49,6 +49,21 @@ Two numbers answer two different questions — read them together, not against e

So the two aren't in tension: a near-tie in the broad table can sit right next to a lopsided ledger — the broad average dilutes the difference with easy tokens, while the ledger zooms in on the hard cases it buries.

### Correctness: the productions, not `tsc`

The conformance run measures Monogram against `tsc`, but `tsc` is the **oracle, not the definition**. What the grammar models is the language's **syntactic productions** — and the parser produces a [CST](#what-you-get), which is *pre-semantic*: whether an expression is a valid assignment target, or a `using` binding is an identifier rather than a pattern, is a **static-semantic** rule. That belongs to a CST *consumer* — the CST→AST lowering, or a validator that walks the tree — not to the parser. The parser's one job is to accept exactly the strings the productions derive.

This matters because `tsc`'s *parser* is not the same thing as the language. It draws its own parse-vs-check line, and on a handful of inputs it diverges from the grammar — and from the other engines (V8, Babel) — in **both** directions. Driving Monogram's accept/reject to *exactly* `tsc` would mean reproducing those quirks; instead it follows the productions:

| Input | Monogram | `tsc` parser | V8 / Babel | Why |
|---|---|:--:|:--:|---|
| `obj?.#field` | accept | reject | accept | A private member in an optional chain is valid current ECMAScript — V8 and Babel both accept it; `tsc`'s parser is the lone rejecter. |
| `let v: void.x` | reject | accept | reject | A qualified type name's root is an `IdentifierReference`; `void` is a keyword type, so no production qualifies it. (`undefined.x` *is* valid — `undefined` is identifier-rooted.) |
| `using {a} = b` | reject | accept | reject | A `using` binding is a `BindingIdentifier` (`BindingList[~Pattern]`); the object pattern has no production. `using [a] = b` *is* valid — there `using` is an identifier and `[a]` is an element access. |
| `++ -x` | accept | reject | reject | `++ UnaryExpression` derives it; "operand must be a simple target" is a static-semantic early error, which the parser leaves to a consumer. |

`tsc` rejecting the first and accepting the next two (its parser doesn't enforce those productions until the checker) is exactly why "match `tsc`" can't *be* the definition of correct — only the measurement oracle.
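
The parser/consumer split can be sketched as follows. This is a minimal illustration, not Monogram's API: the `Cst` node shape, the rule names (`UpdateExpression`, `UnaryExpression`, `Identifier`, `MemberExpression`) and `checkUpdateTargets` are all hypothetical. It shows a validator walking an already-accepted tree for `++ -x` and reporting the early error that the parser deliberately leaves to a consumer.

```typescript
// Hypothetical CST shape for illustration only; Monogram's real node types may differ.
type Cst =
  | { kind: "token"; text: string }
  | { kind: "rule"; rule: string; children: Cst[] };

interface Diagnostic { rule: string; message: string }

const tok = (text: string): Cst => ({ kind: "token", text });
const node = (rule: string, ...children: Cst[]): Cst => ({ kind: "rule", rule, children });

// Rules assumed (hypothetically) to be simple assignment targets.
const SIMPLE_TARGETS = new Set(["Identifier", "MemberExpression"]);

// Walk the tree; flag every update expression whose operand is not a simple target.
function checkUpdateTargets(root: Cst, out: Diagnostic[] = []): Diagnostic[] {
  if (root.kind === "token") return out;
  if (root.rule === "UpdateExpression") {
    const operand = root.children.find((c) => c.kind === "rule");
    if (operand && operand.kind === "rule" && !SIMPLE_TARGETS.has(operand.rule)) {
      out.push({ rule: operand.rule, message: "operand of an update must be a simple target" });
    }
  }
  for (const child of root.children) checkUpdateTargets(child, out);
  return out;
}

// `++ -x`: derivable by `++ UnaryExpression`, so a tree exists; the consumer rejects it.
const plusPlusMinusX = node("UpdateExpression", tok("++"),
  node("UnaryExpression", tok("-"), node("Identifier", tok("x"))));
// `++x`: same production, and the static-semantic check passes.
const plusPlusX = node("UpdateExpression", tok("++"), node("Identifier", tok("x")));

const bad = checkUpdateTargets(plusPlusMinusX);
const good = checkUpdateTargets(plusPlusX);
console.log(bad.length, good.length);
```

Keeping the check outside the parser means both inputs get a tree, and a consumer decides how strict to be about it.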

### Broad agreement vs the official grammar

**Parser** (Monogram vs the official parser, [`test/src-coverage.ts`](test/src-coverage.ts)) — **agree** = the same accept/reject verdict on each corpus file (for HTML, full **parse-tree equality** via parse5); **covered** = how much of the official parser's own branches the corpus exercises, so read `agree` as "on the covered portion." (For the non-HTML grammars `agree` is accept/reject; their parse-*tree* correctness is exercised by the Highlighter axis, whose roles are read off the tree.) **Highlighter** (Monogram's derived TextMate grammar vs the official one, [`test/scope-gap.ts`](test/scope-gap.ts)) — both graded against the parser's per-token roles, the [vscode#203212](https://github.com/microsoft/vscode/issues/203212) comparison.
@@ -227,12 +242,29 @@ The **only-Monogram** wins above are all disambiguations that are *TextMate-expr

"TextMate can't express X" is not a guess or an assertion; it is a claim to be **proven from the model**. TextMate is a line-oriented matcher whose only cross-line memory is a finite stack of scope contexts, so a proof exhibits an X whose correct highlighting provably needs memory that model lacks — unbounded lookback to a token that is not an enclosing context. A failed *attempt* to derive a pattern is not such a proof: a cleverer pattern may exist, and most "impossible for TextMate" folklore is exactly this error — the multiline / nested-generic cases turn out TM-expressible once a parser supplies the pattern, which is why the derived grammar gets them right. Where a construct provably exceeds the model, Monogram's **tree-sitter** target — a real parser over the whole tree — resolves it.

### Total parsing under edits — measured against tsc and tree-sitter

The handle API (`createParser()`) is **total**: every text yields a tree plus `cst.errors`, with tsc-grade diagnostics (`expected ',' or ']'` where every listed token is *provably* still accepted at that position, `to match this '('` related info, zero-width `$missing` nodes that keep a call's shape when its `)` is missing). Two structural guarantees back it:

- **The valid path is byte-identical to the strict parser** — recovery runs only after a strict pass has rejected, so error tolerance costs valid input nothing, by construction.
- **Every edited re-parse is byte-identical to a fresh parse** of the same text — tree *and* errors, broken states included, held exact by generative edit scripts across all seven grammars in CI (`test/incremental-grammars.ts`).
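
The second guarantee is the kind of property a generative edit gate checks. The sketch below is a toy under stated assumptions, not the code in `test/incremental-grammars.ts`: `ToyHandle`, `freshParse` and `runGate` are hypothetical names, and the "grammar" is a trivial per-line parser. It applies a seeded random edit script and compares every incremental result, tree and errors, against a fresh parse of the same text.

```typescript
// Hypothetical result shape: a tree plus its errors, compared as a whole.
interface ParseResult { tree: string[][]; errors: string[] }
interface LineInfo { text: string; words: string[]; balanced: boolean }

// Toy grammar: a line is a list of words; unbalanced parentheses make it an error.
function parseLine(text: string): LineInfo {
  let depth = 0;
  for (const ch of text) {
    if (ch === "(") depth++;
    if (ch === ")") depth--;
    if (depth < 0) break;
  }
  return { text, words: text.split(/\s+/).filter((w) => w.length > 0), balanced: depth === 0 };
}

function assemble(lines: LineInfo[]): ParseResult {
  const errors: string[] = [];
  lines.forEach((l, i) => { if (!l.balanced) errors.push(`line ${i + 1}: unbalanced parentheses`); });
  return { tree: lines.map((l) => l.words), errors };
}

function freshParse(text: string): ParseResult {
  return assemble(text.split("\n").map(parseLine));
}

// Incremental handle: reuses cached lines in the unchanged prefix and suffix.
class ToyHandle {
  private text: string;
  private lines: LineInfo[];
  constructor(text: string) {
    this.text = text;
    this.lines = text.split("\n").map(parseLine);
  }
  edit(offset: number, deleteCount: number, insert: string): ParseResult {
    this.text = this.text.slice(0, offset) + insert + this.text.slice(offset + deleteCount);
    const next = this.text.split("\n");
    const old = this.lines;
    let head = 0;
    while (head < next.length && head < old.length && next[head] === old[head]?.text) head++;
    let tail = 0;
    while (tail < next.length - head && tail < old.length - head &&
           next[next.length - 1 - tail] === old[old.length - 1 - tail]?.text) tail++;
    this.lines = [
      ...old.slice(0, head),
      ...next.slice(head, next.length - tail).map(parseLine), // only the changed window is re-parsed
      ...old.slice(old.length - tail),
    ];
    return assemble(this.lines);
  }
}

// Seeded edit script: after every edit, the incremental result must equal a fresh parse.
function runGate(seed: number, steps: number): number {
  let state = seed >>> 0;
  const rand = (n: number): number => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state % n;
  };
  const alphabet = ["a", "b", " ", "(", ")", "\n"];
  let text = "f (a)\ng (b\nh c)";
  const handle = new ToyHandle(text);
  let mismatches = 0;
  for (let i = 0; i < steps; i++) {
    const offset = rand(text.length + 1);
    const deleteCount = Math.min(rand(3), text.length - offset);
    const insert = alphabet[rand(alphabet.length)] ?? "";
    text = text.slice(0, offset) + insert + text.slice(offset + deleteCount);
    const incremental = handle.edit(offset, deleteCount, insert);
    if (JSON.stringify(incremental) !== JSON.stringify(freshParse(text))) mismatches++;
  }
  return mismatches;
}

const mismatches = runGate(39, 500);
console.log(mismatches);
```

The alphabet includes `(`, `)` and newlines on purpose, so the script keeps moving the text in and out of broken states, which is where reuse bugs tend to hide.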

One 9 MB TypeScript document, identical single-character edit scripts (`test/head-to-head.ts`, node v24, Apple silicon; ✎ = per keystroke, median):

| engine | fresh parse | valid ✎ | breaking ✎ | while-broken ✎ | fixing ✎ |
|---|---:|---:|---:|---:|---:|
| **Monogram** | **167 ms** | 0.37 ms | 12 ms | **0.22 ms** | 2.2 ms |
| tsc `updateSourceFile` | 207 ms | 35 ms | 12.0 ms | 11.9 ms | 11.9 ms |
| tree-sitter (official) | 430 ms | **0.18 ms** | **0.29 ms** | 0.30 ms | **0.22 ms** |

Monogram matches or beats tsc on every phase (valid typing ~100×, while-broken ~50×). Against tree-sitter it wins the fresh parse and the while-broken edits, stays within about 2× on valid typing (0.37 ms vs 0.18 ms), and trails clearly only on the two **transition** edits (break/fix). Profiling attributes those almost entirely to the bench's 4.5 MB cursor jump: token-column offsets are EOF-relative-biased so that local typing never rewrites the suffix (that is what makes the valid keystroke 0.37 ms), and the bias boundary moves with the cursor. A far jump pays once, proportional to the jump distance; repeated break/fix transitions at that position then settle to **~1.6–2 ms** (the parser passes measure under 1 ms of that).
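
The cost shape described above can be illustrated with a small sketch. It is one plausible reading of "EOF-relative-biased" offsets, not Monogram's implementation: the `BiasedOffsets` class, its fields and the `rewrites` counter are hypothetical. Offsets before a boundary are stored relative to the document start, offsets after it relative to the document end, so an insertion at the boundary only changes the document length.

```typescript
// A sketch only: hypothetical data structure, not Monogram's code.
class BiasedOffsets {
  // Entries [0, boundary) hold the distance from the document start;
  // entries [boundary, n) hold the distance from the document end.
  private stored: number[];
  private boundary: number;
  private docLength: number;
  rewrites = 0; // number of stored offsets rewritten, as a stand-in for cost

  constructor(starts: number[], docLength: number) {
    this.stored = starts.slice();
    this.boundary = starts.length; // initially everything is start-relative
    this.docLength = docLength;
  }

  // Absolute start of token i, whichever side of the boundary it is on.
  start(i: number): number {
    const v = this.stored[i] ?? 0;
    return i < this.boundary ? v : this.docLength - v;
  }

  // Move the boundary to the first token that starts at or after `cursor`.
  private moveBoundary(cursor: number): void {
    while (this.boundary > 0 && this.start(this.boundary - 1) >= cursor) {
      const i = this.boundary - 1;
      this.stored[i] = this.docLength - this.start(i); // start-relative -> EOF-relative
      this.boundary--;
      this.rewrites++;
    }
    while (this.boundary < this.stored.length && this.start(this.boundary) < cursor) {
      const i = this.boundary;
      this.stored[i] = this.start(i); // EOF-relative -> start-relative
      this.boundary++;
      this.rewrites++;
    }
  }

  // Insert `length` characters at `cursor`; tokens at or after it shift implicitly.
  insert(cursor: number, length: number): void {
    this.moveBoundary(cursor);
    this.docLength += length; // no suffix entry is touched
  }
}

// 1000 tokens at offsets 0, 10, 20, ... in a 10000-character document.
const starts = Array.from({ length: 1000 }, (_, i) => i * 10);
const offsets = new BiasedOffsets(starts, 10000);

offsets.insert(5000, 1); // far jump: the boundary travels across 500 tokens
const jumpCost = offsets.rewrites;
offsets.insert(5001, 1); // local typing right after the jump
const localCost = offsets.rewrites - jumpCost;

console.log(jumpCost, localCost, offsets.start(499), offsets.start(500));
```

In this toy the jump rewrites one entry per token crossed, while the following keystroke rewrites none, which is the same "pay once per jump, then settle" pattern the bench paragraph describes.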

## What you get

From one grammar definition (a small TypeScript combinator API), five outputs are **fully functional**:

- **A lexer** — tokenizes source straight from the grammar's token definitions; usable on its own (`createLexer(grammar).tokenize`).
- **A CST parser** — recursive descent + Pratt precedence on top of the lexer, producing a **CST** (concrete syntax tree): every token is a node, including punctuation and keywords — roughly 2× an AST's nodes, by design, which is exactly what the highlighter and lossless source reconstruction need.
- **A CST parser** — recursive descent + Pratt precedence on top of the lexer, producing a **CST** (concrete syntax tree): every token is a node, including punctuation and keywords — roughly 2× an AST's nodes, by design, which is exactly what the highlighter and lossless source reconstruction need. A CST is *pre-semantic* (it models the productions, not static semantics — see [Correctness](#correctness-the-productions-not-tsc)).
- **A TextMate grammar** — a `.tmLanguage.json` for VS Code / Sublime syntax highlighting, derived from the same rules, including derived **JSDoc-body** and **regex-internal** sub-grammars. (TextMate *scopes* are the dot-separated labels — `entity.name.function`, `keyword.control` — that a theme maps to colors.)
- **A VS Code language configuration** — `language-configuration.json` (comments, bracket pairs, auto-close/surround, folding) derived from the same tokens.
- **CST node types** — a TypeScript discriminated union (keyed by rule) for typed tree consumers.
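
As a rough illustration of what "every token is a node" buys a consumer, the sketch below uses hypothetical node types (`TokenNode`, `RuleNode`, a `leadingTrivia` field and the three rule names are assumptions, not the generated union). Because nothing is dropped, concatenating the tokens in order reproduces the source text exactly.

```typescript
// Hypothetical node types for illustration; the generated union may look different.
type TokenNode = { type: "token"; text: string; leadingTrivia: string };
type RuleNode = {
  type: "rule";
  rule: "CallExpression" | "Arguments" | "Identifier";
  children: SyntaxNode[];
};
type SyntaxNode = TokenNode | RuleNode;

const t = (text: string, leadingTrivia = ""): TokenNode => ({ type: "token", text, leadingTrivia });
const r = (rule: RuleNode["rule"], ...children: SyntaxNode[]): RuleNode =>
  ({ type: "rule", rule, children });

// Lossless reconstruction: trivia and punctuation are in the tree, so nothing is guessed.
function sourceOf(n: SyntaxNode): string {
  return n.type === "token" ? n.leadingTrivia + n.text : n.children.map((c) => sourceOf(c)).join("");
}

// Count nodes, tokens included, to show where the extra size over an AST comes from.
function countNodes(n: SyntaxNode): number {
  return n.type === "token" ? 1 : 1 + n.children.reduce((sum, c) => sum + countNodes(c), 0);
}

// Tree for the text `f( x )`.
const call = r("CallExpression",
  r("Identifier", t("f")),
  r("Arguments", t("("), r("Identifier", t("x", " ")), t(")", " ")));

const rebuilt = sourceOf(call);
const total = countNodes(call);
console.log(JSON.stringify(rebuilt), total);
```

A discriminated union like this lets a typed consumer switch on `type` and `rule` without casts.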
3 changes: 3 additions & 0 deletions ROADMAP.md
@@ -26,6 +26,9 @@ Three parser-grounded layers (in `test/`), each comparing against the language's

## What's next

- **Parser-acceptance long tail vs tsc** (measured by `test/recovery-conformance.ts`: recall 61.2%, 108 conformance files we parse-accept that tsc's parser rejects). The remainder is fully enumerated, two buckets:
  - **`[Await]`/`[Yield]` parameter contexts** (31 files): `await`/`yield` must be reserved *inside* async/generator bodies and parameter lists, identifiers elsewhere. Needs a context-threading mechanism in the engine, the same shape as `exclude('in', …)` for the no-`in` context, but suppressing identifier *texts* over a subtree. Designed direction, not yet built.
  - **Per-shape strictness** (77 files, each class small and named): declaration-modifier ordering (`public @dec method`), private names outside classes (`const #foo`), strict-mode octal literals (`001`), member declarations with `var` (`class C { var x }`), paren-less `new` arguments (`new C0 32`), reserved words in dotted namespace tails, template-literal module names, `extends void`, `super<T>` tagged templates. Each wants the same treatment that landed for `case`/`class`/statement keywords: fix, then prove FN=0 with the accept/reject flip-scan against the corpus.
- **More vscode#203212 bundles** — low-effort first (ini, diff, git config, xml); the large ones (ruby, perl, c/c++, groovy) each need an instrumentable official parser (WASM / native-coverage) + a corpus.
- **Field labels** in the grammar DSL → richer named-field AST types.
- **Highlighter long tail** — the few remaining per-language divergences are documented (in the PR) as either the shared TextMate-vs-parser ceiling or proven architectural floors; where a construct provably exceeds the TextMate model, the derived **tree-sitter** target (a real whole-tree parser) resolves it.
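
The `[Await]`/`[Yield]` item above comes down to threading two flags through the parse. A minimal sketch, assuming sloppy-mode script code (modules and strict mode reserve more) and using hypothetical names (`Ctx`, `FnKind`, `enter`, `isIdentifierHere`) that are not part of the engine:

```typescript
// Hypothetical context flags mirroring the spec's [Await]/[Yield] grammar parameters.
interface Ctx { await: boolean; yield: boolean }

type FnKind = "normal" | "async" | "generator" | "asyncGenerator";

// Entering a function resets the context according to its kind (the 4-way split).
function enter(kind: FnKind): Ctx {
  return {
    await: kind === "async" || kind === "asyncGenerator",
    yield: kind === "generator" || kind === "asyncGenerator",
  };
}

// `await`/`yield` are plain identifiers only where the matching flag is off.
function isIdentifierHere(word: string, ctx: Ctx): boolean {
  if (word === "await") return !ctx.await;
  if (word === "yield") return !ctx.yield;
  return true;
}

const awaitInNormal = isIdentifierHere("await", enter("normal"));        // function f() { var await; }
const awaitInAsync = isIdentifierHere("await", enter("async"));          // async function f() { var await; }
const yieldInGenerator = isIdentifierHere("yield", enter("generator"));  // function* g() { var yield; }
const yieldInAsync = isIdentifierHere("yield", enter("async"));          // async function f() { var yield; }

console.log(awaitInNormal, awaitInAsync, yieldInGenerator, yieldInAsync);
```

The real mechanism has to apply this over a whole subtree, parameter lists included, which is why the roadmap compares it to the existing no-`in` exclusion.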