Commit fe7673b
fix(sqlite): correct StmtLocation/StmtLen for non-ASCII characters in comments

ANTLR4-go stores its input stream as []rune, so all token positions
returned by GetStart().GetStart() and GetStop().GetStop() are rune
indices, not byte offsets. The SQLite parser was storing these values
directly as StmtLocation and StmtLen, which are later consumed by
source.Pluck() using byte-based Go string slicing (source[head:tail]).

For source files that contain multi-byte UTF-8 characters (non-ASCII)
in comments, the rune index diverges from the byte offset, causing the
plucked query text to be truncated. Each 2-byte character (e.g. Ü, é)
caused one byte to be dropped from the end of the query; each 3-byte
character (e.g. ♥) caused two bytes to be dropped; and so on.
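
The truncation described above can be reproduced in isolation. The following is a minimal sketch, not code from this commit: pluckByRuneCount is a hypothetical helper that misuses a rune count as a byte offset, the same way the unconverted ANTLR positions were misused as the tail of source[head:tail].

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// pluckByRuneCount is a hypothetical helper for illustration only.
// It slices the source with a byte-based expression but passes the
// rune count as the tail, which is the mismatch described above.
func pluckByRuneCount(source string) string {
	tail := utf8.RuneCountInString(source) // rune index, not a byte offset
	return source[0:tail]                  // byte-based slicing
}

func main() {
	// The comment holds one 2-byte character (Ü) and one 3-byte character (♥),
	// so the rune count is 1 + 2 = 3 smaller than the byte length.
	src := "-- Ünicode ♥\nSELECT 1;"

	fmt.Println(utf8.RuneCountInString(src), len(src)) // 22 25
	fmt.Printf("%q\n", pluckByRuneCount(src))          // the trailing " 1;" is lost
}
```

Here the three missing bytes match the rule in the message: one byte for the 2-byte character plus two bytes for the 3-byte character.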

Fix this by building a rune-index to byte-offset map from the source
string before processing the ANTLR parse tree, then converting the
ANTLR rune positions to byte offsets before storing them in the AST.
The internal loc tracking variable continues to use rune indices (for
consistency with the ANTLR token positions), while only the values
written into StmtLocation and StmtLen are converted to byte offsets.
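
One way such a rune-index to byte-offset map could be built is sketched below. This is an assumption about the general technique, not the implementation in this commit: runeToByteOffsets and the variable names are hypothetical, and the extra final entry (equal to len(s)) is a design choice of this sketch so that an inclusive stop position can be turned into an exclusive byte tail.

```go
package main

import "fmt"

// runeToByteOffsets is a hypothetical helper for illustration only.
// It returns a slice m where m[i] is the byte offset of the i-th rune
// of s, plus a final entry equal to len(s).
func runeToByteOffsets(s string) []int {
	offsets := make([]int, 0, len(s)+1)
	// Ranging over a string yields the starting byte index of each rune.
	for byteIdx := range s {
		offsets = append(offsets, byteIdx)
	}
	return append(offsets, len(s))
}

func main() {
	src := "-- ♥\nSELECT 1;"
	m := runeToByteOffsets(src)

	// Rune positions as a rune-based token stream would report them:
	// an inclusive start index and an inclusive stop index.
	startRune := 0
	stopRune := len(m) - 2 // index of the last rune (the ';')

	head := m[startRune]  // byte offset where the statement begins
	tail := m[stopRune+1] // byte offset just past the last rune

	// Byte-based slicing now returns the full text, including "1;".
	fmt.Printf("%q\n", src[head:tail])
}
```

With byte offsets in hand, a location and length pair can be derived as head and tail - head, while any running position that is compared against token indices can stay in rune units.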

Add TestParseNonASCIIComment covering 2-, 3-, and 4-byte characters in
dash comments, multiple non-ASCII characters, and the multi-statement
case where an incorrect loc for one statement would propagate and
corrupt the StmtLocation of the following statement.

1 parent ce83d3f commit fe7673b
2 files changed, +125 -5 lines changed