Commit 8c34b4e

Merge pull request #274 from jetstreamapp/feat/replace-chevrotain
feat: replace chevrotain with hand-rolled parser
2 parents da6997b + 8e83827 commit 8c34b4e

21 files changed

Lines changed: 9071 additions & 11520 deletions

.claude/settings.local.json

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
1+
{
2+
"permissions": {
3+
"allow": [
4+
"Bash(npx tsc:*)",
5+
"Bash(npx ts-node -e \":*)",
6+
"Bash(npx tsx -e \":*)",
7+
"Bash(node -e \":*)",
8+
"Bash(node _test_parser.js)",
9+
"Bash(npm test:*)",
10+
"Bash(npx vitest:*)",
11+
"Bash(npm run:*)"
12+
]
13+
}
14+
}

CHANGELOG.md

Lines changed: 26 additions & 0 deletions
@@ -1,5 +1,31 @@
11
# Changelog
22

3+
## 7.0.0
4+
5+
Mar 14, 2026
6+
7+
🚀🚀🚀🚀🚀 Performed a complete rewrite using a hand-rolled parser, compliments of Claude Code, which did 100% of the work.
8+
9+
The new parser is **~4,600x faster** and has a **75% smaller bundle size** since there are no longer any external dependencies.
10+
11+
| | Old Parser (Chevrotain) | New Parser (Hand-rolled) |
12+
| ------------------------------- | ----------------------- | ------------------------ |
13+
| **Parse time per iteration** | 3,224 ms | 0.7 ms |
14+
| **Bundle size (ESM, minified)** | 194 KB | 48 KB |
15+
| **External dependencies** | 2 (chevrotain, lodash) | 0 |
16+
17+
In addition, we rewrote the CLI to drop all third-party dependencies, since the CLI is very simple and doesn't need a library to manage it.
18+
19+
### 💥Breaking changes💥
20+
21+
This library no longer exports some of the types used by the previous parser, shown below. These types were only exported as they were used in some public APIs.
22+
23+
This is very unlikely to impact most users, and if it does, the required changes should be very minimal.
24+
25+
```typescript
26+
export type { CstNode, CstParser, ILexingError, IRecognitionException } from 'chevrotain';
27+
```
28+
329
## 6.3.1
430

531
Dec 15, 2025

PARSER_REWRITE_PLAN.md

Lines changed: 254 additions & 0 deletions
@@ -0,0 +1,254 @@
1+
# Replace Chevrotain with Hand-Rolled Recursive Descent Parser
2+
3+
## Context
4+
5+
The SOQL parser library currently uses chevrotain (~500KB bundled) for lexing, parsing (CST generation), and CST-to-AST visitor transformation. This is a 3-pass pipeline: `SOQL → Tokens → CST → AST`. We want to replace it with a hand-rolled recursive descent parser that produces the AST directly in 2 passes: `SOQL → Tokens → AST`. This eliminates the entire CST intermediate layer and the visitor, yielding better performance, smaller bundle size, and zero runtime dependencies (besides `commander` for the CLI).
6+
7+
This is a **semver major** release since we're dropping 4 re-exported chevrotain types from the public API.
8+
9+
## Architecture Overview
10+
11+
### What changes
12+
| File | Action | Lines (current) |
13+
|---|---|---|
14+
| `src/parser/lexer.ts` | **Rewrite** - Hand-rolled tokenizer | 1,074 |
15+
| `src/parser/parser.ts` | **Rewrite** - Recursive descent parser with direct AST output | 861 |
16+
| `src/parser/visitor.ts` | **Delete** - Logic absorbed into parser.ts | 993 |
17+
| `src/models.ts` | **Delete** - Chevrotain CST context types, no longer needed | 331 |
18+
| `src/utils.ts` | **Update** - Remove chevrotain import + `isToken()` | minor |
19+
| `src/index.ts` | **Update** - Remove chevrotain re-exports, update import paths | minor |
20+
| `src/composer/composer.ts` | **Update** - Change `import { parseQuery }` path from `visitor` to `parser` | 1 line |
21+
| `package.json` | **Update** - Remove `chevrotain`, `lodash.get` from dependencies | minor |
22+
| `test/test.spec.ts` | **Update** - Change `isQueryValid` import from `visitor` to `parser` | 1 line |
23+
24+
### What stays unchanged (the compatibility contract)
25+
- `src/api/api-models.ts` - All AST output types (Query, Subquery, Field*, Condition*, etc.)
26+
- `src/api/public-utils.ts` - Utility functions (getField, getFlattenedFields, type guards)
27+
- `src/composer/composer.ts` - AST-to-SOQL composition (only import path changes)
28+
- `src/formatter/formatter.ts` - Pretty-printing
29+
- All test case data files (131 parse cases, 161 validity cases, 18 format cases, etc.)
30+
31+
---
32+
33+
## Phase 1: Lexer (`src/parser/lexer.ts`)
34+
35+
### Token Representation
36+
```typescript
37+
const enum TokenKind {
38+
// Keywords, operators, symbols, literals, identifiers
39+
// ~150 members, arranged contiguously by category for fast range checks
40+
SELECT, FROM, WHERE, AND, OR, NOT, /* ...all keywords... */
41+
EQUAL, NOT_EQUAL, LT, GT, LTE, GTE, /* operators */
42+
L_PAREN, R_PAREN, DOT, COMMA, COLON, /* symbols */
43+
UNSIGNED_INTEGER, SIGNED_INTEGER, /* ...numeric types... */
44+
STRING_LITERAL, DATETIME, DATE, IDENTIFIER, EOF
45+
}
46+
47+
interface Token {
48+
kind: TokenKind;
49+
text: string; // original matched text
50+
start: number; // offset in source
51+
}
52+
```
53+
54+
Using `const enum` for TokenKind compiles to integer literals - token kind checks become integer comparisons instead of object identity checks.
55+
56+
### Tokenization Algorithm
57+
Single-pass character scanner (`tokenize(input: string): Token[]`):
58+
1. Skip whitespace (no token emitted)
59+
2. Match in priority order:
60+
- Multi-char operators first: `!=`, `<>`, `<=`, `>=`
61+
- Single-char symbols: `(`, `)`, `[`, `]`, `,`, `:`, `;`, `*`, `+`, `-`, `.`, `=`, `<`, `>`
62+
- String literal: `'...'` with `\\` escape handling
63+
- Numbers: unsigned integer/decimal only at lexer level (signed numbers handled in parser by combining MINUS + number)
64+
- Currency-prefixed numbers: 3 alpha chars + digits (before identifier matching since they start with alpha)
65+
- DateTime: `\d{4}-\d{2}-\d{2}T...` (attempt before Date)
66+
- Date: `\d{4}-\d{2}-\d{2}` (only if not followed by `T`)
67+
- Identifiers/Keywords: `[a-zA-Z][a-zA-Z0-9_]*` with optional dot-separated parts (no `..`). After matching, case-insensitive lookup in keyword `Map<string, TokenKind>`
68+
69+
3. Multi-word keywords: After matching `ORDER`/`GROUP`, peek ahead for whitespace + `BY`. If found, emit single `ORDER_BY`/`GROUP_BY` token. Same for `NOT IN` and `DATA CATEGORY`.
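The multi-word keyword step above can be sketched as follows. This is a minimal illustration and not the committed lexer: it handles only words and whitespace, and the names `tokenizeWords`, `MULTI_WORD`, and the string-valued `Kind` type are hypothetical stand-ins for the plan's `tokenize()` and `TokenKind`.

```typescript
// Hypothetical, reduced token model: string kinds instead of the plan's TokenKind enum.
type Kind = 'ORDER_BY' | 'GROUP_BY' | 'NOT_IN' | 'WORD';
interface Tok { kind: Kind; text: string; start: number; }

// First word -> [required second word, combined kind]. Assumed table, for illustration only.
const MULTI_WORD = new Map<string, [string, Kind]>([
  ['ORDER', ['BY', 'ORDER_BY']],
  ['GROUP', ['BY', 'GROUP_BY']],
  ['NOT', ['IN', 'NOT_IN']],
]);

const isSpace = (c: string): boolean => c === ' ' || c === '\t' || c === '\n' || c === '\r';
const isWordChar = (c: string): boolean => /[A-Za-z0-9_]/.test(c);

// Returns the offset just past the word starting at `from` (equal to `from` if there is none).
function readWord(input: string, from: number): number {
  let end = from;
  while (end < input.length && isWordChar(input.charAt(end))) end++;
  return end;
}

function tokenizeWords(input: string): Tok[] {
  const out: Tok[] = [];
  let pos = 0;
  while (pos < input.length) {
    if (isSpace(input.charAt(pos))) { pos++; continue; } // whitespace emits no token
    const start = pos;
    let end = readWord(input, pos);
    if (end === start) throw new Error(`Unexpected character '${input.charAt(pos)}' at ${pos}`);
    let kind: Kind = 'WORD';
    const rule = MULTI_WORD.get(input.slice(start, end).toUpperCase());
    if (rule !== undefined) {
      // Peek ahead: at least one whitespace character, then exactly the second word.
      let look = end;
      while (look < input.length && isSpace(input.charAt(look))) look++;
      const lookEnd = readWord(input, look);
      if (look > end && input.slice(look, lookEnd).toUpperCase() === rule[0]) {
        kind = rule[1];
        end = lookEnd; // consume both words as a single token
      }
    }
    out.push({ kind, text: input.slice(start, end), start });
    pos = end;
  }
  return out;
}
```

Because the second word is read in full before comparing, input such as `ORDER BYE` stays as two separate `WORD` tokens in this sketch.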
70+
71+
### Category Helpers (replace chevrotain token categories)
72+
```typescript
73+
function isDateFunction(kind: TokenKind): boolean { /* range check */ }
74+
function isAggregateFunction(kind: TokenKind): boolean { /* range check */ }
75+
function isIdentifierLike(kind: TokenKind): boolean { /* IDENTIFIER or any non-reserved keyword */ }
76+
// etc.
77+
```
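
As a rough sketch of what a range check could look like: the member names, numeric values, and ordering below are assumptions chosen for illustration, and a plain `as const` object stands in for the plan's `const enum` so that the snippet also runs under type-stripping runtimes.

```typescript
// Hypothetical subset of token kinds. Members of one category are numbered
// contiguously, so a category test needs only two integer comparisons.
const TokenKind = {
  SELECT: 0,
  FROM: 1,
  // Date functions occupy one contiguous range (assumed ordering).
  CALENDAR_MONTH: 2,
  CALENDAR_YEAR: 3,
  DAY_ONLY: 4,
  // Aggregate functions occupy the next contiguous range (assumed ordering).
  AVG: 5,
  COUNT: 6,
  MAX: 7,
  IDENTIFIER: 8,
} as const;

type TokenKind = (typeof TokenKind)[keyof typeof TokenKind];

function isDateFunction(kind: TokenKind): boolean {
  return kind >= TokenKind.CALENDAR_MONTH && kind <= TokenKind.DAY_ONLY;
}

function isAggregateFunction(kind: TokenKind): boolean {
  return kind >= TokenKind.AVG && kind <= TokenKind.MAX;
}
```

The trade-off of this layout is that adding a member in the middle of a category shifts every later value, so the range bounds have to name the first and last member of each category rather than hard-coded numbers.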
78+
79+
### Exports
80+
`TokenKind`, `Token`, `tokenize()`, category helper functions
81+
82+
---
83+
84+
## Phase 2: Parser (`src/parser/parser.ts`)
85+
86+
### Structure
87+
```typescript
88+
class SoqlParser {
89+
private tokens: Token[];
90+
private pos: number;
91+
private config: ParseQueryConfig;
92+
93+
// Navigation helpers
94+
private peek(): Token
95+
private advance(): Token
96+
private expect(kind: TokenKind): Token
97+
private match(kind: TokenKind): boolean
98+
private check(kind: TokenKind): boolean
99+
private isAtEnd(): boolean
100+
101+
// ~30 parse methods, each directly returning AST nodes
102+
}
103+
```
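
The navigation helpers listed above might be implemented along these lines. This is a sketch under assumptions: `TokenCursor`, `parseFieldNames`, and the string-valued `Kind` type are hypothetical, and the real `SoqlParser` keeps these helpers private and uses `TokenKind`.

```typescript
// Hypothetical, reduced token model for illustration.
type Kind = 'SELECT' | 'FROM' | 'IDENTIFIER' | 'COMMA' | 'EOF';
interface Token { kind: Kind; text: string; start: number; }

class TokenCursor {
  private tokens: Token[];
  private pos = 0;

  constructor(tokens: Token[]) {
    // Requiring a trailing EOF token means peek() never reads past the array.
    if (tokens.length === 0 || tokens[tokens.length - 1]!.kind !== 'EOF') {
      throw new Error('Token stream must end with EOF');
    }
    this.tokens = tokens;
  }

  peek(): Token { return this.tokens[this.pos]!; }
  isAtEnd(): boolean { return this.peek().kind === 'EOF'; }
  check(kind: Kind): boolean { return this.peek().kind === kind; }

  // Returns the current token and moves forward, but never past EOF.
  advance(): Token {
    const token = this.peek();
    if (!this.isAtEnd()) this.pos++;
    return token;
  }

  // Consumes the token only if it has the given kind.
  match(kind: Kind): boolean {
    if (!this.check(kind)) return false;
    this.advance();
    return true;
  }

  // Consumes the token or throws with expected vs. actual kind.
  expect(kind: Kind): Token {
    if (!this.check(kind)) {
      const actual = this.peek();
      throw new Error(`Expected ${kind} but found ${actual.kind} at offset ${actual.start}`);
    }
    return this.advance();
  }
}

// Example parse method built only from the helpers: SELECT a, b, ... FROM
function parseFieldNames(cursor: TokenCursor): string[] {
  cursor.expect('SELECT');
  const fields = [cursor.expect('IDENTIFIER').text];
  while (cursor.match('COMMA')) fields.push(cursor.expect('IDENTIFIER').text);
  cursor.expect('FROM');
  return fields;
}
```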
104+
105+
### Grammar Rule → Method Mapping
106+
107+
Each parse method directly constructs and returns the AST types from `api-models.ts`:
108+
109+
**Entry points:**
110+
- `parseSelectStatement()` → `Query | Subquery`
111+
- `parseSelectStatementPartial()` → `Query` (partial, with optional SELECT/FROM)
112+
113+
**SELECT clause:**
114+
- `parseSelectClause()` → `FieldType[]`
115+
- `parseSelectField()` → `FieldType` (disambiguates via 2-token lookahead: function name + `(` = function call, `(` + SELECT = subquery, TYPEOF = typeof, else identifier)
116+
- `parseSelectClauseFunctionIdentifier()` → `FieldFunctionExpression`
117+
- `parseSelectClauseSubqueryIdentifier()` → `FieldSubquery`
118+
- `parseSelectClauseTypeOf()` → `FieldTypeOf`
119+
- `parseSelectClauseIdentifier()` → `Field | FieldWithAlias | FieldRelationship | FieldRelationshipWithAlias`
120+
121+
**FROM clause:**
122+
- `parseFromClause()` → `{ sObject, alias?, sObjectPrefix? }`
123+
- Alias disambiguation: don't consume next token as alias if it's `OFFSET <number>` (the OFFSET clause)
124+
125+
**WHERE/HAVING (most complex area):**
126+
- `parseWhereClause()` → `WhereClause` (right-recursive linked list)
127+
- `parseConditionExpression()` → chain of conditions linked by AND/OR
128+
- `parseExpression()` → individual `Condition` with `openParen`/`closeParen` counts
129+
- Parenthesis tracking via mutable counter passed through parse methods
130+
- NOT/negation: consume `NOT` prefix, track as `WhereClauseWithoutNegationOperator`
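
To make the right-recursive linked list concrete, here is a deliberately simplified sketch. It assumes pre-split word tokens and three-word conditions, counts parentheses per condition rather than threading a shared mutable counter, and uses hypothetical names (`WhereNode`, `SimpleCondition`, `parseConditions`) that only approximate the shapes in `api-models.ts`.

```typescript
// Simplified, assumed shapes; the real types live in src/api/api-models.ts.
interface SimpleCondition {
  field: string;
  operator: string;
  value: string;
  openParen?: number;
  closeParen?: number;
}
interface WhereNode {
  left: SimpleCondition;
  operator?: 'AND' | 'OR';
  right?: WhereNode;
}

// Parses `( field op value ) AND|OR ...` from pre-split words into a linked list.
function parseConditions(words: readonly string[], start = 0): WhereNode {
  let pos = start;
  let openParen = 0;
  while (words[pos] === '(') { openParen++; pos++; }

  const field = words[pos];
  const operator = words[pos + 1];
  const value = words[pos + 2];
  if (field === undefined || operator === undefined || value === undefined) {
    throw new Error(`Incomplete condition at word ${pos}`);
  }
  pos += 3;

  let closeParen = 0;
  while (words[pos] === ')') { closeParen++; pos++; }

  const left: SimpleCondition = { field, operator, value };
  if (openParen > 0) left.openParen = openParen;
  if (closeParen > 0) left.closeParen = closeParen;

  const next = words[pos];
  const logical: 'AND' | 'OR' | undefined = next === 'AND' ? 'AND' : next === 'OR' ? 'OR' : undefined;
  if (logical !== undefined) {
    // Right recursion: the rest of the chain hangs off `right`.
    return { left, operator: logical, right: parseConditions(words, pos + 1) };
  }
  if (next !== undefined) throw new Error(`Unexpected '${next}' at word ${pos}`);
  return { left };
}
```

For `( A = 1 OR B = 2 ) AND C = 3`, this sketch records `openParen: 1` on the `A` condition and `closeParen: 1` on the `B` condition, so the grouping survives even though the list itself is flat.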
131+
132+
**Operators:**
133+
- `parseRelationalOperator()` → `=`, `!=`, `<`, `>`, `<=`, `>=`, `LIKE`
134+
- `parseSetOperator()` → `IN`, `NOT IN`, `INCLUDES`, `EXCLUDES`
135+
136+
**Values:**
137+
- `parseAtomicExpression()` → `{ value, literalType, dateLiteralVariable? }`
138+
- Handles: strings, numbers (combining MINUS + unsigned if needed), booleans, null, dates, date literals, date N-literals with `:N`, apex bind variables, subqueries
139+
- `parseArrayExpression()` → array of values for IN/INCLUDES/EXCLUDES
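
The "combining MINUS + unsigned" step could look roughly like this. It is a sketch under assumptions: `parseNumericValue`, the `UNSIGNED_DECIMAL` kind, and the `'INTEGER' | 'DECIMAL'` literal types are illustrative names, not confirmed members of the real lexer or `api-models.ts`.

```typescript
// Hypothetical, reduced token model for illustration.
type Kind = 'MINUS' | 'UNSIGNED_INTEGER' | 'UNSIGNED_DECIMAL' | 'IDENTIFIER';
interface Token { kind: Kind; text: string; start: number; }

interface NumericValue {
  value: string;                       // source text, including any sign
  literalType: 'INTEGER' | 'DECIMAL';  // assumed literal type names
  consumed: number;                    // how many tokens were used (1 or 2)
}

// Reads an optionally negated number starting at tokens[pos].
function parseNumericValue(tokens: readonly Token[], pos: number): NumericValue {
  let index = pos;
  let sign = '';
  const first = tokens[index];
  if (first !== undefined && first.kind === 'MINUS') {
    // The lexer only emits unsigned numbers, so the sign arrives as its own token.
    sign = first.text;
    index++;
  }
  const num = tokens[index];
  if (num === undefined || (num.kind !== 'UNSIGNED_INTEGER' && num.kind !== 'UNSIGNED_DECIMAL')) {
    throw new Error(`Expected a number at token index ${index}`);
  }
  return {
    value: sign + num.text,
    literalType: num.kind === 'UNSIGNED_INTEGER' ? 'INTEGER' : 'DECIMAL',
    consumed: index - pos + 1,
  };
}
```

Keeping the sign out of the lexer means `-` never needs to know whether it sits in a value position; only the parser, which already knows it is reading a value, makes that decision.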
140+
141+
**Functions:**
142+
- `parseDateFunction()` → `FunctionExp | FieldFunctionExpression`
143+
- `parseAggregateFunction()` → `FunctionExp | FieldFunctionExpression`
144+
- `parseLocationFunction()` → handles DISTANCE(field, GEOLOCATION(...), unit)
145+
- `parseOtherFunction()` → FORMAT, TOLABEL, CONVERTTIMEZONE, CONVERTCURRENCY, GROUPING
146+
- `parseFieldsFunction()` → FIELDS(ALL|CUSTOM|STANDARD)
147+
- `parseFunctionExpression()` → recursive parameter parsing (functions can nest)
148+
149+
**Other clauses:**
150+
- `parseGroupByClause()` → `GroupByClause[]` (handles CUBE, ROLLUP, date functions, fields)
151+
- `parseHavingClause()` → `HavingClause` (same linked-list structure as WHERE)
152+
- `parseOrderByClause()` → `OrderByClause[]` (field or function, ASC/DESC, NULLS FIRST/LAST)
153+
- `parseWithClause()` → `SECURITY_ENFORCED | USER_MODE | SYSTEM_MODE | DATA CATEGORY`
154+
- `parseLimitClause()` → `number`
155+
- `parseOffsetClause()` → `number`
156+
- `parseUsingScopeClause()` → `string`
157+
- `parseForViewOrReference()` → `ForClause`
158+
- `parseUpdateTrackingViewstat()` → `UpdateClause`
159+
160+
**Apex bind variables (when `allowApexBindVariables` is true):**
161+
- `parseApexBindVariableExpression()` → string representation
162+
- Handles: `:varName`, `:obj.field`, `:new Class<T>().method()[0].field`, `:fn(args)`
163+
164+
### Error Handling
165+
166+
**Normal mode:** Throw `Error` with descriptive message including expected vs actual token.
167+
168+
**`ignoreParseErrors` mode:** Wrap each clause-level parse in try-catch. On error, call `synchronize()` which advances to the next clause keyword (WHERE, ORDER, GROUP, HAVING, LIMIT, OFFSET, FOR, UPDATE, WITH, USING) or EOF.
169+
170+
**`logErrors` mode:** `console.log` errors before throwing.
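
A minimal sketch of the recovery idea for `ignoreParseErrors` mode follows. The function names and signatures (`synchronize` taking a token array and index, `parseClauseSafely`) are assumptions for illustration; the plan only specifies that `synchronize()` advances to the next clause keyword or EOF.

```typescript
// Hypothetical, reduced token model; clause keyword names follow the list above.
type Kind =
  | 'WHERE' | 'ORDER' | 'GROUP' | 'HAVING' | 'LIMIT' | 'OFFSET'
  | 'FOR' | 'UPDATE' | 'WITH' | 'USING' | 'EOF' | 'OTHER';
interface Token { kind: Kind; text: string; start: number; }

const CLAUSE_STARTS: ReadonlySet<Kind> = new Set<Kind>([
  'WHERE', 'ORDER', 'GROUP', 'HAVING', 'LIMIT', 'OFFSET', 'FOR', 'UPDATE', 'WITH', 'USING',
]);

// Returns the index of the next clause keyword or EOF at or after `pos`.
function synchronize(tokens: readonly Token[], pos: number): number {
  let index = pos;
  while (index < tokens.length) {
    const kind = tokens[index]!.kind;
    if (kind === 'EOF' || CLAUSE_STARTS.has(kind)) break;
    index++;
  }
  return index;
}

// Runs one clause-level parse; on failure, records the error and skips the clause.
function parseClauseSafely<T>(
  tokens: readonly Token[],
  pos: number,
  parseClause: (pos: number) => { value: T; next: number },
  errors: Error[],
): { value?: T; next: number } {
  try {
    return parseClause(pos);
  } catch (err) {
    errors.push(err instanceof Error ? err : new Error(String(err)));
    // Start after the failing clause's own keyword so recovery always makes progress.
    return { next: synchronize(tokens, pos + 1) };
  }
}
```

Starting the scan at `pos + 1` is a design choice in this sketch: if the scan began at the failing clause's keyword, it would stop immediately and the caller could loop on the same clause.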
171+
172+
### Exports
173+
`parseQuery()`, `isQueryValid()`, `ParseQueryConfig` interface
174+
175+
---
176+
177+
## Phase 3: Integration & Cleanup
178+
179+
1. **Update `src/index.ts`:**
180+
- Change: `export { parseQuery, isQueryValid } from './parser/parser'` (was `./parser/visitor`)
181+
- Remove: `export type { CstNode, CstParser, ILexingError, IRecognitionException } from 'chevrotain'`
182+
- Keep all other exports unchanged
183+
184+
2. **Update `src/utils.ts`:**
185+
- Remove `import { IToken } from 'chevrotain'`
186+
- Remove `isToken()` function (only used by deleted visitor.ts)
187+
188+
3. **Update `src/composer/composer.ts`:**
189+
- Change import: `from '../parser/visitor'` → `from '../parser/parser'`
190+
191+
4. **Delete files:**
192+
- `src/parser/visitor.ts`
193+
- `src/models.ts`
194+
195+
5. **Update `package.json`:**
196+
- Remove from dependencies: `chevrotain`, `lodash.get`
197+
- Move `lodash.get` to devDependencies (still used in test/public-utils.spec.ts)
198+
- Remove from devDependencies: `@types/lodash.get`
199+
200+
6. **Update test imports:**
201+
- `test/test.spec.ts`: change `isQueryValid` import from `../src/parser/visitor` to `../src/parser/parser`
202+
203+
---
204+
205+
## Phase 4: Verification
206+
207+
1. **Run full test suite:** `npm test` - all 330+ tests must pass
208+
2. **Run build:** `npm run build` - verify all 5 build targets (ESM, CJS, LWC, CLI, declarations)
209+
3. **Bundle size comparison:**
210+
- Current CJS: ~190KB minified, ESM: ~427KB
211+
- Target: ~30-50KB CJS, ~80-120KB ESM (60-80% reduction)
212+
4. **Performance benchmark:** Un-skip `test/performance-test.spec.ts`, run 131 queries x 1000 iterations, compare against baseline (run baseline first on current code)
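
The benchmark in step 4 could be approximated with a small harness like the one below. This is a sketch, not the contents of `test/performance-test.spec.ts`: the `benchmark` function and its parameters are hypothetical, the parser is passed in as a callback, and `Date.now()` is used instead of a high-resolution timer so the snippet has no environment-specific dependencies.

```typescript
// Parses every query `iterations` times and returns the average wall-clock
// milliseconds per iteration (one iteration = one pass over all queries).
function benchmark(
  queries: readonly string[],
  iterations: number,
  parse: (soql: string) => unknown, // assumed to be the parser entry point under test
): number {
  if (!Number.isInteger(iterations) || iterations <= 0) {
    throw new Error('iterations must be a positive integer');
  }
  const startedAt = Date.now();
  for (let i = 0; i < iterations; i++) {
    for (const soql of queries) parse(soql);
  }
  return (Date.now() - startedAt) / iterations;
}
```

Running the same harness against the old and new parser with identical queries and iteration counts gives comparable per-iteration averages; `Date.now()` has millisecond resolution, so the average is only meaningful when the total run is long enough.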
213+
214+
---
215+
216+
## Agent Delegation Strategy
217+
218+
### Agent 1: Lexer
219+
- Build complete `src/parser/lexer.ts`
220+
- All token types, tokenizer, category helpers
221+
- Can be tested independently by tokenizing the 131 test case SOQL strings
222+
- ~400-500 lines
223+
224+
### Agent 2: Parser (depends on Agent 1)
225+
- Build complete `src/parser/parser.ts`
226+
- All ~30 parse methods producing AST directly
227+
- Reference `src/api/api-models.ts` for exact output types
228+
- Reference current `src/parser/parser.ts` (grammar rules) and `src/parser/visitor.ts` (AST construction logic) for behavior
229+
- ~900-1100 lines
230+
231+
### Agent 3: Integration & Testing (depends on Agent 2)
232+
- Wire up all imports, delete old files, update package.json
233+
- Run tests, fix failures iteratively
234+
- Performance benchmark
235+
236+
### Agent 4: (if needed) Fix remaining test failures
237+
- Targeted fixes based on specific failing test cases
238+
239+
---
240+
241+
## Key Design Decisions
242+
243+
1. **Signed numbers handled in parser, not lexer** - Avoids lexer context-sensitivity. Parser combines `MINUS` + `UNSIGNED_INTEGER` when in value position.
244+
2. **Multi-word keywords combined in lexer** - `ORDER BY`, `GROUP BY`, `NOT IN`, `DATA CATEGORY` emitted as single tokens for simpler parser logic.
245+
3. **Non-reserved keywords lexed as keyword kinds** - Parser uses `isIdentifierLike()` to accept them where identifiers are expected.
246+
4. **No intermediate CST** - Parser directly constructs `api-models.ts` types, eliminating an entire transformation pass.
247+
5. **Re-usable parser instance** - Module-level instance, reset state per parse call (matches current pattern).
248+
249+
## Critical Files Reference
250+
- `src/api/api-models.ts` - **THE** contract. Every AST type the parser must produce.
251+
- `test/test-cases.ts` - 131 test cases with exact expected AST output. The compatibility oracle.
252+
- `src/parser/parser.ts` (current) - Grammar rules to replicate.
253+
- `src/parser/visitor.ts` (current) - AST construction logic to absorb into new parser.
254+
- `src/parser/lexer.ts` (current) - Token universe to replicate.
