perf: use getBytes for faster ASCII char narrowing in base64 encode #893
Open
He-Pin wants to merge 2 commits into
Replace the per-char charAt(i).toByte loop with a single getBytes(0, len, dst, 0) call for narrowing ASCII chars to bytes in encodeStringToString.

Benchmark results (Scala Native vs jrsonnet):
- std.base64 (string): 2.17x slower -> 1.21x slower (slowdown ratio about 44% lower)
- std.base64Decode: 1.58x slower -> 1.47x slower (improved)
- std.base64DecodeBytes: 1.19x faster -> 1.14x faster (stable)
- std.base64_byte_array: 1.38x faster -> 1.31x faster (stable)

The getBytes method copies the bytes in a single bulk call, which is faster than a per-char loop. Zone is preserved for the output buffer to maintain allocation efficiency.
Motivation:
All three base64 functions (encodeToString, encodeStringToString, decode) were copying source data from GC arrays to zone-allocated buffers before passing it to the C library. This intermediate copy is unnecessary because Scala Native's GC does not move objects during foreign calls: the array pointer from `.at(0)` remains valid throughout the C function execution.

Modification:
- Pass `srcBytes.at(0)` / `input.at(0)` directly to base64_encode/decode
- Remove the intermediate zone source allocation and memcpy in all 3 methods
- Zone is still used for output buffers (C writes into them)

Result:
Eliminates one allocation + one memcpy per base64 call for source data. Output-side zone allocation is preserved since the C library writes into it and we need to copy results to GC-managed arrays afterward.
Motivation
The `std.base64` function for string input was 2.61x slower than jrsonnet (Rust implementation) on Scala Native. The main bottleneck was the per-character `charAt(i).toByte` loop used to narrow ASCII chars to bytes before passing them to the aklomp/base64 SIMD encoder.
Key Design Decision
Use `String.getBytes(0, len, dst, 0)` instead of a per-character `charAt(i).toByte` loop for ASCII char narrowing. This is a single bulk call that copies bytes faster than a per-char loop. Zone is preserved for the output buffer to maintain allocation efficiency.
Modification
Changed `sjsonnet/src-native/sjsonnet/stdlib/PlatformBase64.scala`:
- `charAt(i).toByte` loop replaced with a `getBytes(0, len, srcBytes, 0)` call
- `Array[Byte]` for source data
- `memcpy` to copy from Array[Byte] to Zone buffer
- `@nowarn("cat=deprecation")` for the deprecated `getBytes` method
Benchmark Results
Scala Native vs jrsonnet (hyperfine)
JMH (JVM)
Baseline JMH benchmarks are stable; the change only affects the Scala Native path.
Analysis
The `getBytes(0, len, dst, 0)` method copies bytes in a single bulk call, which is faster than a per-char loop. For a 3.5KB Lorem-ipsum-style input, this avoids two full passes over the data before the SIMD encoder sees it.
The optimization is conservative: Zone is still used for the output buffer to maintain allocation efficiency. Only the source data preparation is optimized.
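As a rough illustration of the narrowing step discussed above, the two approaches can be compared side by side in plain Scala. This is a minimal sketch, not the code from `PlatformBase64.scala`: the object and method names (`AsciiNarrowing`, `narrowPerChar`, `narrowBulk`) are hypothetical, and it assumes the caller has already established that the input is ASCII, since both variants keep only the low-order 8 bits of each char.

```scala
import scala.annotation.nowarn

// Hypothetical helper names, for illustration only.
object AsciiNarrowing {

  // Baseline: narrow one char at a time.
  def narrowPerChar(s: String): Array[Byte] = {
    val len = s.length
    val dst = new Array[Byte](len)
    var i = 0
    while (i < len) {
      dst(i) = s.charAt(i).toByte // keeps the low-order 8 bits
      i += 1
    }
    dst
  }

  // Bulk variant: String.getBytes(srcBegin, srcEnd, dst, dstBegin) is
  // deprecated because it drops the high-order bits of each char, which
  // is harmless when the input is known to be ASCII.
  @nowarn("cat=deprecation")
  def narrowBulk(s: String): Array[Byte] = {
    val len = s.length
    val dst = new Array[Byte](len)
    s.getBytes(0, len, dst, 0)
    dst
  }

  def main(args: Array[String]): Unit = {
    val input = "Lorem ipsum dolor sit amet"
    // Both variants should produce identical bytes for ASCII input.
    assert(java.util.Arrays.equals(narrowPerChar(input), narrowBulk(input)))
  }
}
```

The bulk variant hands the whole copy to the runtime's `String` implementation, so any speedup depends on how that method is implemented on the target platform; the sketch only demonstrates that the results match.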
Result
- `./mill __.test`
- `./mill __.reformat`