fix: avoid splitting surrogate pairs when truncating wide characters #360
Open
greymoth-jp wants to merge 1 commit into
truncateWidth() takes a substr/slice fast path when str.length === strlen(str), which is also true for surrogate-pair characters such as CJK Extension B or emoji (2 code units, 2 columns). Cutting by code unit on the truncation boundary can leave a lone surrogate. Exclude surrogate pairs from the fast path and trim by code point so a wide character is never split.
truncateWidth takes a fast path when str.length === strlen(str). The idea is "every character is one code unit and one column, so cutting by code unit is safe." That holds for ASCII, but the same equality also holds for surrogate-pair characters such as CJK Extension B (e.g. 𠮷, used in the surname 𠮷田) or emoji: they are two code units and two columns, so length and strlen stay equal. substr (and the slice(0, -1) loop underneath) cuts by code unit, so truncating on the boundary of such a character leaves a lone surrogate. In a table this shows up as a � in the cell whenever a wide character lands on the truncation point.

The change keeps the fast path for plain strings, skips it when a surrogate is present, and trims by code point in the slow path so a wide character is never split. BMP input (including the existing full-width CJK cases) is unaffected. Added tests for a CJK Extension B character and an emoji.
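
For reviewers who want to reproduce the issue outside the library, here is a minimal sketch of the failure and of code-point-based trimming. It is not the code in this PR: hasSurrogate, truncateByCodePoint, and toyWidth are hypothetical names, and toyWidth is a deliberately crude stand-in for the real strlen (it treats every astral code point as two columns and everything else as one).

```javascript
// Illustrative only; these helpers are assumptions, not the library's API.

// True if the string contains any UTF-16 surrogate code unit,
// i.e. the code-unit fast path is not safe for it.
function hasSurrogate(str) {
  return /[\uD800-\uDFFF]/.test(str);
}

// Toy width function (assumption): astral code points count as 2 columns,
// everything else as 1. A real implementation would use a proper
// display-width routine instead.
function toyWidth(ch) {
  return ch.codePointAt(0) > 0xffff ? 2 : 1;
}

// Keep whole code points while they fit in maxWidth columns.
// for...of iterates a string by code point, so a surrogate pair
// is always kept or dropped as a unit.
function truncateByCodePoint(str, maxWidth, widthOf) {
  let out = '';
  let used = 0;
  for (const ch of str) {
    const w = widthOf(ch);
    if (used + w > maxWidth) break;
    out += ch;
    used += w;
  }
  return out;
}

// U+20BB7 (𠮷) is stored as two code units: 0xD842 0xDFB7.
const sample = 'a\u{20BB7}b';

// Cutting by code unit splits the pair and leaves a lone high surrogate.
const unsafeCut = sample.slice(0, 2);

// Cutting by code point drops the whole character instead.
const safeCut = truncateByCodePoint(sample, 2, toyWidth);
```

Here unsafeCut ends in an unpaired 0xD842, which terminals render as �, while safeCut is just 'a'. Checking hasSurrogate first keeps the cheap slice path available for strings where it is actually safe.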