| 1 | +# Replace Chevrotain with Hand-Rolled Recursive Descent Parser |
| 2 | + |
| 3 | +## Context |
| 4 | + |
| 5 | +The SOQL parser library currently uses chevrotain (~500KB bundled) for lexing, parsing (CST generation), and CST-to-AST visitor transformation. This is a 3-pass pipeline: `SOQL → Tokens → CST → AST`. We want to replace it with a hand-rolled recursive descent parser that produces the AST directly in 2 passes: `SOQL → Tokens → AST`. This eliminates the entire CST intermediate layer and the visitor, yielding better performance, smaller bundle size, and zero runtime dependencies (besides `commander` for the CLI). |
| 6 | + |
| 7 | +This is a **semver major** release since we're dropping 4 re-exported chevrotain types from the public API. |
| 8 | + |
| 9 | +## Architecture Overview |
| 10 | + |
| 11 | +### What changes |
| 12 | +| File | Action | Lines (current) | |
| 13 | +|---|---|---| |
| 14 | +| `src/parser/lexer.ts` | **Rewrite** - Hand-rolled tokenizer | 1,074 | |
| 15 | +| `src/parser/parser.ts` | **Rewrite** - Recursive descent parser with direct AST output | 861 | |
| 16 | +| `src/parser/visitor.ts` | **Delete** - Logic absorbed into parser.ts | 993 | |
| 17 | +| `src/models.ts` | **Delete** - Chevrotain CST context types, no longer needed | 331 | |
| 18 | +| `src/utils.ts` | **Update** - Remove chevrotain import + `isToken()` | minor | |
| 19 | +| `src/index.ts` | **Update** - Remove chevrotain re-exports, update import paths | minor | |
| 20 | +| `src/composer/composer.ts` | **Update** - Change `import { parseQuery }` path from `visitor` to `parser` | 1 line | |
| 21 | +| `package.json` | **Update** - Remove `chevrotain`, `lodash.get` from dependencies | minor | |
| 22 | +| `test/test.spec.ts` | **Update** - Change `isQueryValid` import from `visitor` to `parser` | 1 line | |
| 23 | + |
| 24 | +### What stays unchanged (the compatibility contract) |
| 25 | +- `src/api/api-models.ts` - All AST output types (Query, Subquery, Field*, Condition*, etc.) |
| 26 | +- `src/api/public-utils.ts` - Utility functions (getField, getFlattenedFields, type guards) |
| 27 | +- `src/composer/composer.ts` - AST-to-SOQL composition (only import path changes) |
| 28 | +- `src/formatter/formatter.ts` - Pretty-printing |
| 29 | +- All test case data files (131 parse cases, 161 validity cases, 18 format cases, etc.) |
| 30 | + |
| 31 | +--- |
| 32 | + |
| 33 | +## Phase 1: Lexer (`src/parser/lexer.ts`) |
| 34 | + |
| 35 | +### Token Representation |
| 36 | +```typescript |
| 37 | +const enum TokenKind { |
| 38 | + // Keywords, operators, symbols, literals, identifiers |
| 39 | + // ~150 members, arranged contiguously by category for fast range checks |
| 40 | + SELECT, FROM, WHERE, AND, OR, NOT, /* ...all keywords... */ |
| 41 | + EQUAL, NOT_EQUAL, LT, GT, LTE, GTE, /* operators */ |
| 42 | + L_PAREN, R_PAREN, DOT, COMMA, COLON, /* symbols */ |
| 43 | + UNSIGNED_INTEGER, /* ...other unsigned numeric types; signed values are assembled in the parser... */ |
| 44 | + STRING_LITERAL, DATETIME, DATE, IDENTIFIER, EOF |
| 45 | +} |
| 46 | + |
| 47 | +interface Token { |
| 48 | + kind: TokenKind; |
| 49 | + text: string; // original matched text |
| 50 | + start: number; // offset in source |
| 51 | +} |
| 52 | +``` |
| 53 | + |
| 54 | +Because `TokenKind` is a `const enum`, the compiler inlines each member as an integer literal, so token kind checks become integer comparisons instead of object identity checks. |
| 55 | + |
| 56 | +### Tokenization Algorithm |
| 57 | +Single-pass character scanner (`tokenize(input: string): Token[]`): |
| 58 | +1. Skip whitespace (no token emitted) |
| 59 | +2. Match in priority order: |
| 60 | + - Multi-char operators first: `!=`, `<>`, `<=`, `>=` |
| 61 | + - Single-char symbols: `(`, `)`, `[`, `]`, `,`, `:`, `;`, `*`, `+`, `-`, `.`, `=`, `<`, `>` |
| 62 | + - String literal: `'...'` with backslash escape handling (e.g. `\'` and `\\`) |
| 63 | + - Numbers: unsigned integer/decimal only at lexer level (signed numbers handled in parser by combining MINUS + number) |
| 64 | + - Currency-prefixed numbers: 3 alpha chars + digits (before identifier matching since they start with alpha) |
| 65 | + - DateTime: `\d{4}-\d{2}-\d{2}T...` (attempt before Date) |
| 66 | + - Date: `\d{4}-\d{2}-\d{2}` (only if not followed by `T`) |
| 67 | + - Identifiers/Keywords: `[a-zA-Z][a-zA-Z0-9_]*` with optional dot-separated parts (no `..`). After matching, case-insensitive lookup in keyword `Map<string, TokenKind>` |
| 68 | + |
| 69 | +3. Multi-word keywords: After matching `ORDER`/`GROUP`, peek ahead for whitespace + `BY`. If found, emit single `ORDER_BY`/`GROUP_BY` token. Same for `NOT IN` and `DATA CATEGORY`. |
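The peek-ahead in step 3 could look roughly like the helper below. This is a minimal sketch, not the planned implementation: `peekSecondWord` and `isIdentChar` are hypothetical names, and it assumes the first word (`ORDER`, `GROUP`, `NOT`, `DATA`) has already been matched, ending at offset `afterFirst`.

```typescript
// Hypothetical helper: true for characters that can continue an identifier.
function isIdentChar(ch: string): boolean {
  return /[A-Za-z0-9_]/.test(ch);
}

// After the first word has been matched (ending at `afterFirst`), look for
// whitespace followed by `second` (given in upper case). Returns the offset
// just past the second word, or -1 if the pair does not form a keyword.
function peekSecondWord(input: string, afterFirst: number, second: string): number {
  let i = afterFirst;
  while (i < input.length && /\s/.test(input.charAt(i))) i++;
  if (i === afterFirst) return -1; // at least one whitespace char is required
  const end = i + second.length;
  if (input.slice(i, end).toUpperCase() !== second) return -1;
  // Reject longer identifiers that merely start with the word, e.g. "BYE".
  if (end < input.length && isIdentChar(input.charAt(end))) return -1;
  return end;
}
```

When the helper returns a non-negative offset, the tokenizer would emit one combined token spanning both words and continue scanning from that offset; otherwise the first word is emitted on its own.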
| 70 | + |
| 71 | +### Category Helpers (replace chevrotain token categories) |
| 72 | +```typescript |
| 73 | +function isDateFunction(kind: TokenKind): boolean { /* range check */ } |
| 74 | +function isAggregateFunction(kind: TokenKind): boolean { /* range check */ } |
| 75 | +function isIdentifierLike(kind: TokenKind): boolean { /* IDENTIFIER or any non-reserved keyword */ } |
| 76 | +// etc. |
| 77 | +``` |
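A range check of this kind might be sketched as follows. The enum here is a small hypothetical subset named `Kind` (the member names and their order are assumptions for illustration); the only requirement is that every member of a category is declared contiguously.

```typescript
// Hypothetical subset of the token kinds; the real enum would have ~150 members.
const enum Kind {
  SELECT, FROM, WHERE,
  // Date functions, declared contiguously
  CALENDAR_MONTH, CALENDAR_YEAR, DAY_ONLY,
  // Aggregate functions, declared contiguously
  AVG, COUNT, MAX, MIN, SUM,
  IDENTIFIER, EOF,
}

// Two integer comparisons against the first and last member of the category.
function isDateFunctionKind(kind: Kind): boolean {
  return kind >= Kind.CALENDAR_MONTH && kind <= Kind.DAY_ONLY;
}

function isAggregateFunctionKind(kind: Kind): boolean {
  return kind >= Kind.AVG && kind <= Kind.SUM;
}
```

The trade-off is that inserting a new member in the wrong place silently changes category membership, so the first and last members of each range need to be kept in sync with the helpers.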
| 78 | + |
| 79 | +### Exports |
| 80 | +`TokenKind`, `Token`, `tokenize()`, category helper functions |
| 81 | + |
| 82 | +--- |
| 83 | + |
| 84 | +## Phase 2: Parser (`src/parser/parser.ts`) |
| 85 | + |
| 86 | +### Structure |
| 87 | +```typescript |
| 88 | +class SoqlParser { |
| 89 | + private tokens: Token[]; |
| 90 | + private pos: number; |
| 91 | + private config: ParseQueryConfig; |
| 92 | + |
| 93 | + // Navigation helpers |
| 94 | + private peek(): Token |
| 95 | + private advance(): Token |
| 96 | + private expect(kind: TokenKind): Token |
| 97 | + private match(kind: TokenKind): boolean |
| 98 | + private check(kind: TokenKind): boolean |
| 99 | + private isAtEnd(): boolean |
| 100 | + |
| 101 | + // ~30 parse methods, each directly returning AST nodes |
| 102 | +} |
| 103 | +``` |
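The navigation helpers could be implemented along these lines. This is a sketch under assumptions: `Tok` and `TokenCursor` are illustrative names, string kinds stand in for `TokenKind`, the methods are public only so the example can be exercised, and the lexer is assumed to always append a trailing `EOF` token.

```typescript
interface Tok { kind: string; text: string; start: number }

class TokenCursor {
  private tokens: Tok[];
  private pos = 0;

  constructor(tokens: Tok[]) {
    // Assumes a non-empty token list that ends with an EOF token.
    this.tokens = tokens;
  }

  // Current token, without consuming it; clamps to the final (EOF) token.
  peek(): Tok {
    return this.tokens[Math.min(this.pos, this.tokens.length - 1)]!;
  }

  isAtEnd(): boolean {
    return this.peek().kind === 'EOF';
  }

  // Consume and return the current token; never moves past EOF.
  advance(): Tok {
    const current = this.peek();
    if (!this.isAtEnd()) this.pos++;
    return current;
  }

  // Test the current token's kind without consuming it.
  check(kind: string): boolean {
    return this.peek().kind === kind;
  }

  // Consume the current token only if it has the given kind.
  match(kind: string): boolean {
    if (!this.check(kind)) return false;
    this.advance();
    return true;
  }

  // Consume a token of the given kind or fail with expected vs actual.
  expect(kind: string): Tok {
    if (!this.check(kind)) {
      const actual = this.peek();
      throw new Error(`Expected ${kind} but found ${actual.kind} ('${actual.text}') at offset ${actual.start}`);
    }
    return this.advance();
  }
}
```

`check` is for lookahead decisions, `match` for optional tokens, and `expect` for tokens the grammar requires at that point.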
| 104 | + |
| 105 | +### Grammar Rule → Method Mapping |
| 106 | + |
| 107 | +Each parse method directly constructs and returns the AST types from `api-models.ts`: |
| 108 | + |
| 109 | +**Entry points:** |
| 110 | +- `parseSelectStatement()` → `Query | Subquery` |
| 111 | +- `parseSelectStatementPartial()` → `Query` (partial, with optional SELECT/FROM) |
| 112 | + |
| 113 | +**SELECT clause:** |
| 114 | +- `parseSelectClause()` → `FieldType[]` |
| 115 | +- `parseSelectField()` → `FieldType` (disambiguates via 2-token lookahead: function name + `(` = function call, `(` + SELECT = subquery, TYPEOF = typeof, else identifier) |
| 116 | +- `parseSelectClauseFunctionIdentifier()` → `FieldFunctionExpression` |
| 117 | +- `parseSelectClauseSubqueryIdentifier()` → `FieldSubquery` |
| 118 | +- `parseSelectClauseTypeOf()` → `FieldTypeOf` |
| 119 | +- `parseSelectClauseIdentifier()` → `Field | FieldWithAlias | FieldRelationship | FieldRelationshipWithAlias` |
| 120 | + |
| 121 | +**FROM clause:** |
| 122 | +- `parseFromClause()` → `{ sObject, alias?, sObjectPrefix? }` |
| 123 | +- Alias disambiguation: don't consume next token as alias if it's `OFFSET <number>` (the OFFSET clause) |
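The alias check could be expressed as a two-token lookahead, roughly as below. `shouldConsumeAlias` and `LookTok` are hypothetical names, string kinds stand in for `TokenKind`, and the sketch assumes `OFFSET` is one of the non-reserved keywords that may otherwise serve as an alias.

```typescript
interface LookTok { kind: string; text: string }

// Decide whether the token after the sObject name should be taken as an alias.
// `next` is the token following the sObject, `afterNext` the one after that.
function shouldConsumeAlias(next: LookTok | undefined, afterNext: LookTok | undefined): boolean {
  if (next === undefined || next.kind === 'EOF') return false;
  // Assumption: only plain identifiers and OFFSET are considered here; the
  // real parser would use isIdentifierLike() for the full non-reserved set.
  const identifierLike = next.kind === 'IDENTIFIER' || next.kind === 'OFFSET';
  if (!identifierLike) return false;
  // `OFFSET <number>` is the OFFSET clause, not an alias named "offset".
  if (next.kind === 'OFFSET' && afterNext?.kind === 'UNSIGNED_INTEGER') return false;
  return true;
}
```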
| 124 | + |
| 125 | +**WHERE/HAVING (most complex area):** |
| 126 | +- `parseWhereClause()` → `WhereClause` (right-recursive linked list) |
| 127 | +- `parseConditionExpression()` → chain of conditions linked by AND/OR |
| 128 | +- `parseExpression()` → individual `Condition` with `openParen`/`closeParen` counts |
| 129 | +- Parenthesis tracking via mutable counter passed through parse methods |
| 130 | +- NOT/negation: consume `NOT` prefix, track as `WhereClauseWithoutNegationOperator` |
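To illustrate the right-recursive linked-list shape, here is a deliberately reduced sketch. `SimpleCondition` and `SimpleWhere` are simplified stand-ins for the types in `api-models.ts`, the input is a pre-split string array rather than real tokens, and parentheses and `NOT` are left out.

```typescript
interface SimpleCondition { field: string; operator: string; value: string }
interface SimpleWhere {
  left: SimpleCondition;
  operator?: 'AND' | 'OR';
  right?: SimpleWhere;
}

// Parses input such as ['A', '=', '1', 'AND', 'B', '=', '2'] starting at `pos`.
function parseConditionChain(tokens: string[], pos = 0): SimpleWhere {
  const field = tokens[pos];
  const operator = tokens[pos + 1];
  const value = tokens[pos + 2];
  if (field === undefined || operator === undefined || value === undefined) {
    throw new Error(`Incomplete condition at token index ${pos}`);
  }
  const node: SimpleWhere = { left: { field, operator, value } };

  const next = tokens[pos + 3];
  if (next === undefined) return node; // last condition: no operator, no right

  const logical = next.toUpperCase();
  if (logical === 'AND' || logical === 'OR') {
    node.operator = logical;
    // Right recursion: the rest of the chain hangs off `right`.
    node.right = parseConditionChain(tokens, pos + 4);
    return node;
  }
  throw new Error(`Expected AND or OR but found '${next}'`);
}
```

Each node holds exactly one condition, and the remainder of the expression nests under `right`, so `A AND B OR C` becomes three nodes chained left to right.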
| 131 | + |
| 132 | +**Operators:** |
| 133 | +- `parseRelationalOperator()` → `=`, `!=`, `<`, `>`, `<=`, `>=`, `LIKE` |
| 134 | +- `parseSetOperator()` → `IN`, `NOT IN`, `INCLUDES`, `EXCLUDES` |
| 135 | + |
| 136 | +**Values:** |
| 137 | +- `parseAtomicExpression()` → `{ value, literalType, dateLiteralVariable? }` |
| 138 | +- Handles: strings, numbers (combining MINUS + unsigned if needed), booleans, null, dates, date literals, date N-literals with `:N`, apex bind variables, subqueries |
| 139 | +- `parseArrayExpression()` → array of values for IN/INCLUDES/EXCLUDES |
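Combining a sign token with an unsigned number in value position might look like the sketch below. The names are assumptions: `NumTok`, `parseSignedNumber`, the `PLUS` and `UNSIGNED_DECIMAL` kinds, and the `'INTEGER'`/`'DECIMAL'` literal types are illustrative, and the value is kept as a string on the assumption that the AST stores numeric text verbatim.

```typescript
interface NumTok {
  kind: 'MINUS' | 'PLUS' | 'UNSIGNED_INTEGER' | 'UNSIGNED_DECIMAL' | 'OTHER';
  text: string;
}

interface NumericValue {
  value: string;                      // sign and digits as written in the query
  literalType: 'INTEGER' | 'DECIMAL';
  next: number;                       // index of the first token after the number
}

// Reads an optional sign followed by an unsigned number, starting at `pos`.
function parseSignedNumber(tokens: NumTok[], pos: number): NumericValue {
  let index = pos;
  let sign = '';
  const first = tokens[index];
  if (first !== undefined && (first.kind === 'MINUS' || first.kind === 'PLUS')) {
    sign = first.text;
    index++;
  }
  const num = tokens[index];
  if (num === undefined || (num.kind !== 'UNSIGNED_INTEGER' && num.kind !== 'UNSIGNED_DECIMAL')) {
    throw new Error(`Expected a number at token index ${index}`);
  }
  return {
    value: sign + num.text,
    literalType: num.kind === 'UNSIGNED_INTEGER' ? 'INTEGER' : 'DECIMAL',
    next: index + 1,
  };
}
```

Because the sign is only merged here, a `-` elsewhere in the grammar stays an ordinary `MINUS` token and the lexer never needs to know what context it is in.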
| 140 | + |
| 141 | +**Functions:** |
| 142 | +- `parseDateFunction()` → `FunctionExp | FieldFunctionExpression` |
| 143 | +- `parseAggregateFunction()` → `FunctionExp | FieldFunctionExpression` |
| 144 | +- `parseLocationFunction()` → handles DISTANCE(field, GEOLOCATION(...), unit) |
| 145 | +- `parseOtherFunction()` → FORMAT, TOLABEL, CONVERTTIMEZONE, CONVERTCURRENCY, GROUPING |
| 146 | +- `parseFieldsFunction()` → FIELDS(ALL|CUSTOM|STANDARD) |
| 147 | +- `parseFunctionExpression()` → recursive parameter parsing (functions can nest) |
| 148 | + |
| 149 | +**Other clauses:** |
| 150 | +- `parseGroupByClause()` → `GroupByClause[]` (handles CUBE, ROLLUP, date functions, fields) |
| 151 | +- `parseHavingClause()` → `HavingClause` (same linked-list structure as WHERE) |
| 152 | +- `parseOrderByClause()` → `OrderByClause[]` (field or function, ASC/DESC, NULLS FIRST/LAST) |
| 153 | +- `parseWithClause()` → `SECURITY_ENFORCED | USER_MODE | SYSTEM_MODE | DATA CATEGORY` |
| 154 | +- `parseLimitClause()` → `number` |
| 155 | +- `parseOffsetClause()` → `number` |
| 156 | +- `parseUsingScopeClause()` → `string` |
| 157 | +- `parseForViewOrReference()` → `ForClause` |
| 158 | +- `parseUpdateTrackingViewstat()` → `UpdateClause` |
| 159 | + |
| 160 | +**Apex bind variables (when `allowApexBindVariables` is true):** |
| 161 | +- `parseApexBindVariableExpression()` → string representation |
| 162 | +- Handles: `:varName`, `:obj.field`, `:new Class<T>().method()[0].field`, `:fn(args)` |
| 163 | + |
| 164 | +### Error Handling |
| 165 | + |
| 166 | +**Normal mode:** Throw `Error` with descriptive message including expected vs actual token. |
| 167 | + |
| 168 | +**`ignoreParseErrors` mode:** Wrap each clause-level parse in try-catch. On error, call `synchronize()` which advances to the next clause keyword (WHERE, ORDER, GROUP, HAVING, LIMIT, OFFSET, FOR, UPDATE, WITH, USING) or EOF. |
| 169 | + |
| 170 | +**`logErrors` mode:** `console.log` errors before throwing. |
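The recovery step for `ignoreParseErrors` could be sketched as below. This is an illustration only: it works on an array of kind names instead of real tokens, and it assumes the combined `ORDER_BY`/`GROUP_BY` kinds emitted by the lexer are what `synchronize()` looks for.

```typescript
// Kinds that start a clause; ORDER_BY/GROUP_BY assume the lexer's combined tokens.
const CLAUSE_START_KINDS = new Set([
  'WHERE', 'ORDER_BY', 'GROUP_BY', 'HAVING', 'LIMIT',
  'OFFSET', 'FOR', 'UPDATE', 'WITH', 'USING',
]);

// Returns the index of the next clause keyword or EOF at or after `pos`.
function synchronize(kinds: string[], pos: number): number {
  let index = pos;
  while (index < kinds.length) {
    const kind = kinds[index] ?? 'EOF';
    if (kind === 'EOF' || CLAUSE_START_KINDS.has(kind)) break;
    index++;
  }
  return index;
}

// Runs one clause-level parse; on failure reports the error and yields undefined
// so the caller can resynchronize and continue with the next clause.
function tryParseClause<T>(parse: () => T, onError: (error: Error) => void): T | undefined {
  try {
    return parse();
  } catch (error) {
    onError(error instanceof Error ? error : new Error(String(error)));
    return undefined;
  }
}
```

A caller would wrap each clause in `tryParseClause` and, when it returns `undefined`, move its position to the index returned by `synchronize` before attempting the next clause.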
| 171 | + |
| 172 | +### Exports |
| 173 | +`parseQuery()`, `isQueryValid()`, `ParseQueryConfig` interface |
| 174 | + |
| 175 | +--- |
| 176 | + |
| 177 | +## Phase 3: Integration & Cleanup |
| 178 | + |
| 179 | +1. **Update `src/index.ts`:** |
| 180 | + - Change: `export { parseQuery, isQueryValid } from './parser/parser'` (was `./parser/visitor`) |
| 181 | + - Remove: `export type { CstNode, CstParser, ILexingError, IRecognitionException } from 'chevrotain'` |
| 182 | + - Keep all other exports unchanged |
| 183 | + |
| 184 | +2. **Update `src/utils.ts`:** |
| 185 | + - Remove `import { IToken } from 'chevrotain'` |
| 186 | + - Remove `isToken()` function (only used by deleted visitor.ts) |
| 187 | + |
| 188 | +3. **Update `src/composer/composer.ts`:** |
| 189 | + - Change import: `from '../parser/visitor'` → `from '../parser/parser'` |
| 190 | + |
| 191 | +4. **Delete files:** |
| 192 | + - `src/parser/visitor.ts` |
| 193 | + - `src/models.ts` |
| 194 | + |
| 195 | +5. **Update `package.json`:** |
| 196 | + - Remove from dependencies: `chevrotain` |
| 197 | + - Move `lodash.get` from dependencies to devDependencies (still used in test/public-utils.spec.ts) |
| 198 | + - Remove from devDependencies: `@types/lodash.get` |
| 199 | + |
| 200 | +6. **Update test imports:** |
| 201 | + - `test/test.spec.ts`: change `isQueryValid` import from `../src/parser/visitor` to `../src/parser/parser` |
| 202 | + |
| 203 | +--- |
| 204 | + |
| 205 | +## Phase 4: Verification |
| 206 | + |
| 207 | +1. **Run full test suite:** `npm test` - all 330+ tests must pass |
| 208 | +2. **Run build:** `npm run build` - verify all 5 build targets (ESM, CJS, LWC, CLI, declarations) |
| 209 | +3. **Bundle size comparison:** |
| 210 | + - Current CJS: ~190KB minified, ESM: ~427KB |
| 211 | + - Target: ~30-50KB CJS, ~80-120KB ESM (roughly 70-85% reduction) |
| 212 | +4. **Performance benchmark:** Un-skip `test/performance-test.spec.ts`, run 131 queries x 1000 iterations, compare against baseline (run baseline first on current code) |
| 213 | + |
| 214 | +--- |
| 215 | + |
| 216 | +## Agent Delegation Strategy |
| 217 | + |
| 218 | +### Agent 1: Lexer |
| 219 | +- Build complete `src/parser/lexer.ts` |
| 220 | +- All token types, tokenizer, category helpers |
| 221 | +- Can be tested independently by tokenizing the 131 test case SOQL strings |
| 222 | +- ~400-500 lines |
| 223 | + |
| 224 | +### Agent 2: Parser (depends on Agent 1) |
| 225 | +- Build complete `src/parser/parser.ts` |
| 226 | +- All ~30 parse methods producing AST directly |
| 227 | +- Reference `src/api/api-models.ts` for exact output types |
| 228 | +- Reference current `src/parser/parser.ts` (grammar rules) and `src/parser/visitor.ts` (AST construction logic) for behavior |
| 229 | +- ~900-1100 lines |
| 230 | + |
| 231 | +### Agent 3: Integration & Testing (depends on Agent 2) |
| 232 | +- Wire up all imports, delete old files, update package.json |
| 233 | +- Run tests, fix failures iteratively |
| 234 | +- Performance benchmark |
| 235 | + |
| 236 | +### Agent 4: (if needed) Fix remaining test failures |
| 237 | +- Targeted fixes based on specific failing test cases |
| 238 | + |
| 239 | +--- |
| 240 | + |
| 241 | +## Key Design Decisions |
| 242 | + |
| 243 | +1. **Signed numbers handled in parser, not lexer** - Avoids lexer context-sensitivity. Parser combines `MINUS` + `UNSIGNED_INTEGER` when in value position. |
| 244 | +2. **Multi-word keywords combined in lexer** - `ORDER BY`, `GROUP BY`, `NOT IN`, `DATA CATEGORY` emitted as single tokens for simpler parser logic. |
| 245 | +3. **Non-reserved keywords lexed as keyword kinds** - Parser uses `isIdentifierLike()` to accept them where identifiers are expected. |
| 246 | +4. **No intermediate CST** - Parser directly constructs `api-models.ts` types, eliminating an entire transformation pass. |
| 247 | +5. **Re-usable parser instance** - Module-level instance, reset state per parse call (matches current pattern). |
| 248 | + |
| 249 | +## Critical Files Reference |
| 250 | +- `src/api/api-models.ts` - **THE** contract. Every AST type the parser must produce. |
| 251 | +- `test/test-cases.ts` - 131 test cases with exact expected AST output. The compatibility oracle. |
| 252 | +- `src/parser/parser.ts` (current) - Grammar rules to replicate. |
| 253 | +- `src/parser/visitor.ts` (current) - AST construction logic to absorb into new parser. |
| 254 | +- `src/parser/lexer.ts` (current) - Token universe to replicate. |