Optimize HTTP/2 request/response processing: eliminate double dispatch, reduce allocations, and streamline stream pipeline #1081
Open
He-Pin wants to merge 10 commits into
Member
Author
The AffinityPool gives me better numbers than the ForkJoinPool on my local PC.
He-Pin
commented
Jun 27, 2026
            response.map(_.addAttribute(Http2.streamId, streamIdHeader))
        }
      case None => response
    }
Member
Author
@pjfanning Should I extract these commits one by one into separate PRs?
Member
I'm ok with leaving this as is. I'm hoping that other people will review this but if not, I'll ok this in a few days.
Motivation:
handleWithStreamIdHeader wrapped the handler call in Future { } inside
mapAsyncUnordered, causing 2-3 unnecessary ExecutionContext dispatches
per request. Since mapAsyncUnordered already schedules on the EC, the
extra Future { } wrapper doubles the scheduling overhead. For fast
handlers (e.g. gRPC unary handlers returning Future.successful), this
overhead is a significant portion of per-request cost.
Modification:
- Remove the Future { } wrapper, call handler directly since
mapAsyncUnordered already runs on the execution context
- Add fast path for stream ID attribute: when response Future is
already completed, add attribute synchronously via Future.successful
instead of response.map() which would schedule another EC hop
- Preserve error handling with try/catch wrapping handler call
Result:
Benchmark (scala_pekko gRPC server, complex_proto, 12 cores):
- Low concurrency (50 conn): 44,927 -> 55,357 req/s (+23.2%)
- High concurrency (1000 conn): 66,772 -> 70,780 req/s (+6.0%)
- Low concurrency now 47% faster than Vert.x (was 19% slower)
- High concurrency gap to Vert.x reduced from 18% to 13%
Tests:
- http-core / compile - passed
- Validated with local benchmark (ghz, complex_proto scenario)
References:
None - performance optimization
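The direct call and the completed-future fast path described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `Request`, `Response`, the `"streamId"` attribute key, and `handleWithStreamId` are hypothetical stand-ins for the real Pekko HTTP types.

```scala
import scala.concurrent.{ExecutionContext, Future}
import scala.util.Success
import scala.util.control.NonFatal

object DirectDispatchSketch {
  // Hypothetical stand-ins for the real request/response model.
  final case class Request(streamId: Option[Int])
  final case class Response(body: String, attributes: Map[String, Int] = Map.empty) {
    def addAttribute(key: String, value: Int): Response =
      copy(attributes = attributes + (key -> value))
  }

  def handleWithStreamId(handler: Request => Future[Response])(request: Request)(
      implicit ec: ExecutionContext): Future[Response] = {
    // Call the handler directly instead of wrapping it in Future { }.
    // A synchronous throw is turned into a failed Future.
    val response: Future[Response] =
      try handler(request)
      catch { case NonFatal(e) => Future.failed(e) }

    request.streamId match {
      case Some(id) =>
        response.value match {
          // Fast path: already completed successfully, so no extra EC hop.
          case Some(Success(r)) => Future.successful(r.addAttribute("streamId", id))
          // Slow path: pending or failed, fall back to map.
          case _ => response.map(_.addAttribute("streamId", id))
        }
      case None => response
    }
  }
}
```

With a handler that returns `Future.successful`, the returned future is already completed when `handleWithStreamId` returns, which is the case the fast path targets.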
Motivation:
Flamegraph analysis showed HPACK Decoder.decode and
VectorBuilder.<init> as hotspots. For gRPC workloads, the same headers
are used repeatedly (:method, :path, content-type, etc.), so caching
parsed header objects could avoid repeated String allocation and
parsing overhead.
Modification:
Add a ConcurrentHashMap cache in HeaderDecompression that stores parsed
header objects keyed by (name, value) tuples. Check cache before
parsing, and store results for future reuse. Cache size is limited to
1024 entries to avoid memory issues.
Result:
Benchmark shows marginal improvement within margin of error (79,257 vs
79,854 req/s, ~0.7%). The HPACK protocol's built-in dynamic table
already provides effective caching for repetitive headers, so the
additional cache provides minimal benefit.
Tests:
- http-core / compile - passed
- Benchmark verification with ghz (1000 concurrency, 50 connections)
References:
None - performance optimization attempt
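A size-bounded cache keyed by (name, value) tuples might look roughly like the sketch below. `ParsedHeader`, `lookupOrParse`, and the parse counter are hypothetical names used only for illustration; the actual HeaderDecompression code is not shown in this conversation.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicInteger

object HeaderCacheSketch {
  // Hypothetical parsed-header type.
  final case class ParsedHeader(name: String, value: String)

  val MaxEntries = 1024
  val parseCount = new AtomicInteger(0) // only here to make cache hits observable

  private val cache = new ConcurrentHashMap[(String, String), ParsedHeader]()

  // Stand-in for the real (more expensive) header parsing.
  private def parse(name: String, value: String): ParsedHeader = {
    parseCount.incrementAndGet()
    ParsedHeader(name, value)
  }

  def lookupOrParse(name: String, value: String): ParsedHeader = {
    val key = (name, value)
    val cached = cache.get(key)
    if (cached ne null) cached
    else {
      val parsed = parse(name, value)
      // Best-effort bound: under concurrency the map may briefly exceed MaxEntries.
      if (cache.size() < MaxEntries) cache.putIfAbsent(key, parsed)
      parsed
    }
  }
}
```

As the Result section notes, HPACK's dynamic table already covers most of the repetition, so a cache like this is expected to help very little in practice.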
Motivation:
The Http2Demux.onPush handler performs two separate pattern matches on
every incoming HTTP/2 frame: first to check if it's a PingFrame (to
skip onDataFrameSeen), then again to process the frame. This creates
unnecessary branching overhead on the per-frame hot path.
Modification:
Combine the two pattern matches into a single match. PingFrame cases
(true/false ack) are handled first without calling onDataFrameSeen. All
other frame types call pingState.onDataFrameSeen() at the start of
their case block. This eliminates one full pattern match traversal per
incoming frame.
Result:
Reduced branching overhead in the HTTP/2 frame dispatch hot path. For
high-concurrency gRPC benchmarks with 1000 connections, this eliminates
one pattern match per incoming HEADERS/DATA frame.
Tests:
sbt http-core / Test / testOnly *Http2*
All 37 tests passed.
References:
None - local performance optimization follow-up from
OPTIMIZATION_HANDOFF.md
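The combined match can be illustrated with a small self-contained sketch. The frame types, `PingState`, and the `log` buffer below are simplified, hypothetical stand-ins rather than the real Http2Demux code.

```scala
import scala.collection.mutable.ArrayBuffer

object SingleMatchSketch {
  // Hypothetical, reduced frame model.
  sealed trait Frame
  final case class PingFrame(ack: Boolean) extends Frame
  final case class DataFrame(streamId: Int) extends Frame
  final case class HeadersFrame(streamId: Int) extends Frame

  final class PingState {
    var dataFramesSeen = 0
    def onDataFrameSeen(): Unit = dataFramesSeen += 1
  }

  final class Demux(pingState: PingState) {
    val log = ArrayBuffer.empty[String]

    // One match per frame: ping cases come first and skip onDataFrameSeen,
    // every other case calls it at the start of its block.
    def onPush(frame: Frame): Unit = frame match {
      case PingFrame(true) =>
        log += "ping-ack"
      case PingFrame(false) =>
        log += "ping"
      case DataFrame(id) =>
        pingState.onDataFrameSeen()
        log += s"data:$id"
      case HeadersFrame(id) =>
        pingState.onDataFrameSeen()
        log += s"headers:$id"
    }
  }
}
```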
Motivation:
The withErrorHandling wrapper called handler(request).recover on every
request. Future.recover always allocates a Recover PartialFunction and
a wrapper Future via transform, even when the handler returns an
already-completed successful Future (the common case for gRPC unary
handlers returning Future.successful). This appeared as
Http2Ext$$Lambda (71 CPU samples) in async-profiler.
Modification:
Add a synchronous fast path that checks response.value before calling
.recover. For already-completed successful futures (the gRPC unary hot
path), the original response is returned directly, skipping the Recover
PF allocation and transform wrapper Future entirely. For failed or
not-yet-completed futures, the original .recover path is used.
Result:
Eliminates 2 object allocations per synchronous gRPC unary request
(Recover PartialFunction + wrapper Future). P99 latency dropped from
24.43ms to 21.19ms (-13.3%) for string_100B and average latency from
7.62ms to 7.13ms (-6.4%) for complex_proto.
Tests:
sbt http-core / compile
Compiled successfully.
References:
None - performance optimization from flamegraph analysis
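A minimal sketch of the fast path, assuming plain `String` requests and responses in place of the real HTTP model (the wrapper signature and the error message format are illustrative assumptions):

```scala
import scala.concurrent.{ExecutionContext, Future}
import scala.util.Success
import scala.util.control.NonFatal

object ErrorHandlingFastPathSketch {
  // Hypothetical wrapper over a String => Future[String] handler.
  def withErrorHandling(handler: String => Future[String])(
      implicit ec: ExecutionContext): String => Future[String] = { request =>
    val response = handler(request)
    response.value match {
      // Fast path: already completed successfully, return the same Future instance.
      case Some(Success(_)) => response
      // Failed or still pending: keep the original recover-based path.
      case _ => response.recover { case NonFatal(e) => s"error: ${e.getMessage}" }
    }
  }
}
```

On the fast path the caller receives the handler's own `Future` instance back, so nothing is allocated by the wrapper for that request.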
Motivation:
RequestErrorFlow created two separate InHandler with OutHandler objects
per materialization: one for the request path (parse result handling)
and one for the response path (simple pass-through). The response path
handler was a trivial pass-through that just forwarded elements between
ports, making it an ideal candidate for merging into the
GraphStageLogic.
Modification:
Make the GraphStageLogic extend InHandler with OutHandler and implement
the response path's onPush/onPull directly. The request path handler
remains as a separate InHandler with OutHandler since it has distinct
logic (pattern matching on ParseRequestResult and emitting error
responses). This eliminates 2 handler object allocations per
materialization (one InHandler + one OutHandler).
Result:
Reduced object allocations in the HTTP/2 request processing pipeline.
Benchmark shows complex_proto average latency improved from 7.13ms to
6.79ms (-4.8%) and P99 from 27.76ms to 24.25ms (-12.6%).
Tests:
sbt http-core / compile
Compiled successfully.
ghz benchmark: complex_proto avg 6.79ms, P99 24.25ms, 79830 req/s
References:
None - performance optimization from flamegraph analysis
Motivation:
When a ParsedHeadersFrame is compressed into a CompositeFrame that
exceeds the max frame size, the first frame is pushed immediately and
remaining continuation frames are drained via a newly allocated
OutHandler. This OutHandler is created once per response (when HEADERS
+ DATA coalescing produces a CompositeFrame), adding GC pressure under
high concurrency.
Modification:
Replace the per-CompositeFrame OutHandler with a var field
(continuationFrames) on the existing GraphStageLogic. The Logic's
onPull method now checks for pending continuation frames before pulling
new input, draining them inline without any handler allocation. The
onPush method stores remaining frames in the var field instead of
creating a new OutHandler.
Result:
Eliminates one OutHandler object allocation per response when
CompositeFrame splitting occurs. The Logic object is reused for both
normal operation and continuation frame draining.
Tests:
sbt http-core / compile
Compiled successfully.
ghz benchmark (30s warmup + 120s, 1000c/50conn):
string_100B: 88326 req/s, avg 6.17ms, P99 22.46ms
complex_proto: 79398 req/s, avg 6.73ms, P99 29.12ms
References:
None - performance optimization from flamegraph analysis
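The idea of draining pending frames from a field instead of a freshly allocated handler can be sketched without Pekko Streams. The pull-driven `Logic` class below, which reads from an `Iterator`, is a simplified hypothetical model; only the `continuationFrames` field name is taken from the commit message.

```scala
object ContinuationDrainSketch {
  // Hypothetical frame type.
  final case class Frame(name: String)

  final class Logic(upstream: Iterator[List[Frame]]) {
    // Frames left over from a split header block; Nil when nothing is pending.
    private var continuationFrames: List[Frame] = Nil

    // Called on downstream demand: drain pending frames before pulling new input.
    def onPull(): Option[Frame] = continuationFrames match {
      case next :: rest =>
        continuationFrames = rest
        Some(next)
      case Nil =>
        if (upstream.hasNext) onPush(upstream.next()) else None
    }

    // Emit the first frame now and keep the rest in the field for later pulls.
    private def onPush(frames: List[Frame]): Option[Frame] = frames match {
      case first :: rest =>
        continuationFrames = rest
        Some(first)
      case Nil => onPull()
    }
  }
}
```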
Motivation:
The updateState method was implemented by delegating to
updateStateAndReturn with a wrapper lambda: x => (handle(x), ()). This
wrapper lambda was allocated on every call. updateState is called for
every HTTP/2 stream state transition (handleStreamEvent,
handleOutgoingCreated, handleOutgoingEnded, etc.), resulting in 2+
lambda allocations per gRPC request.
Modification:
Inline the updateStateAndReturn logic directly into updateState,
eliminating the wrapper lambda. The handle function (StreamState =>
StreamState) is now called directly without wrapping it in a
tuple-returning lambda. updateStateAndReturn remains for pullNextFrame
which needs the return value (PullFrameResult).
Result:
Eliminates 2+ lambda allocations per gRPC request in the HTTP/2 stream
state machine. complex_proto throughput improved to 80,211 req/s (+8.3%
vs Vert.x 74,053 req/s).
Tests:
sbt http-core / compile
Compiled successfully.
ghz benchmark (30s warmup + 120s, 1000c/50conn):
complex_proto: 80211 req/s, avg 6.72ms, P99 28.05ms
string_100B: 87076 req/s, avg 6.49ms, P99 25.18ms
References:
None - performance optimization from hot path analysis
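The difference between delegating through a tuple-returning wrapper and inlining the bookkeeping can be sketched as below. The `Streams` class, its map-based storage, and the three states are hypothetical simplifications; only the method names `updateState` and `updateStateAndReturn` come from the commit message.

```scala
object UpdateStateSketch {
  // Hypothetical, reduced stream state model.
  sealed trait StreamState
  case object Idle extends StreamState
  case object Open extends StreamState
  case object Closed extends StreamState

  final class Streams {
    private var states = Map.empty[Int, StreamState]

    def stateOf(streamId: Int): StreamState = states.getOrElse(streamId, Idle)

    private def store(streamId: Int, newState: StreamState): Unit =
      states = states.updated(streamId, newState)

    // Kept for callers that need a result alongside the new state.
    def updateStateAndReturn[R](streamId: Int, handle: StreamState => (StreamState, R)): R = {
      val (newState, result) = handle(stateOf(streamId))
      store(streamId, newState)
      result
    }

    // Before: updateStateAndReturn(streamId, x => (handle(x), ()))
    // After: same bookkeeping, without the tuple-returning wrapper lambda.
    def updateState(streamId: Int, handle: StreamState => StreamState): Unit =
      store(streamId, handle(stateOf(streamId)))
  }
}
```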
Motivation:
handleStreamEvent is called for every incoming HTTP/2 frame (HEADERS,
DATA, WINDOW_UPDATE, etc.). It delegated to updateState with the lambda
_.handle(e), which allocates a new Function1 closure per frame. At 80K+
req/s with 2+ frames per request, this produced 160K+ lambda
allocations per second on the hot path.
Modification:
Extract the state transition bookkeeping from updateState into a new
commitStreamState method. Inline the state lookup and handle call
directly in handleStreamEvent: streamFor(streamId).handle(e), then call
commitStreamState with the pre-computed old and new states. This
eliminates the _.handle(e) lambda closure entirely. updateState remains
for other call sites (handleOutgoingCreated, handleOutgoingEnded, etc.)
that are called less frequently.
Result:
Eliminates 1 lambda allocation per incoming HTTP/2 frame.
ghz benchmark (30s warmup + 120s, 1000c/50conn):
string_100B: 91336 req/s (+3.4%), avg 6.41ms
complex_proto: 82369 req/s (+2.7%), avg 6.93ms, P99 24.03ms
vs Vert.x: string_100B +21.0%, complex_proto +11.2%
Tests:
sbt http-core / compile
Compiled successfully.
References:
None - performance optimization from hot path analysis
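A rough sketch of the commitStreamState shape, assuming a tiny hypothetical state machine: the names `handleStreamEvent`, `streamFor`, and `commitStreamState` follow the commit message, while the states, events, and map-based bookkeeping are invented for illustration.

```scala
object CommitStreamStateSketch {
  sealed trait Event
  case object HeadersReceived extends Event
  case object EndOfStream extends Event

  // Hypothetical states; each computes its successor for an event.
  sealed trait StreamState { def handle(e: Event): StreamState }
  case object Idle extends StreamState {
    def handle(e: Event): StreamState = e match {
      case HeadersReceived => Open
      case EndOfStream     => Closed
    }
  }
  case object Open extends StreamState {
    def handle(e: Event): StreamState = e match {
      case HeadersReceived => Open
      case EndOfStream     => Closed
    }
  }
  case object Closed extends StreamState {
    def handle(e: Event): StreamState = Closed
  }

  final class Demux {
    private var states = Map.empty[Int, StreamState]

    def streamFor(streamId: Int): StreamState = states.getOrElse(streamId, Idle)
    def activeStreamCount: Int = states.size

    // Bookkeeping takes pre-computed states as plain values, so no closure is needed.
    private def commitStreamState(streamId: Int, oldState: StreamState, newState: StreamState): Unit =
      if (newState != oldState) newState match {
        case Closed => states -= streamId
        case _      => states += (streamId -> newState)
      }

    // Before: updateState(streamId, _.handle(e)) allocated a Function1 per frame.
    def handleStreamEvent(streamId: Int, e: Event): Unit = {
      val oldState = streamFor(streamId)
      commitStreamState(streamId, oldState, oldState.handle(e))
    }
  }
}
```

The same pattern applies to any call site that can compute the new state itself and hand both states to the bookkeeping method.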
Motivation:
handleOutgoingCreated and handleOutgoingEnded were called once per gRPC
response. They delegated to updateState with lambda closures:
_.handleOutgoingCreated(outStream, attrs) and _.handleOutgoingEnded().
Each closure allocation occurs once per response, producing ~80K lambda
allocations per second at 80K req/s.
Modification:
Inline the state transition in both methods using commitStreamState
directly with the pre-computed new state. handleOutgoingCreated
computes the new state via oldState.handleOutgoingCreated/AndFinished
and passes it to commitStreamState. handleOutgoingEnded similarly calls
oldState.handleOutgoingEnded() directly.
Result:
Eliminates 2 lambda allocations per gRPC response (one in
handleOutgoingCreated, one in handleOutgoingEnded).
ghz benchmark (30s warmup + 120s, 1000c/50conn):
string_100B: 92643 req/s (+1.4%), P99 20.94ms (-29.6%)
complex_proto: ~79K req/s (within noise), P99 improved
Tests:
sbt http-core / compile
Compiled successfully.
References:
None - performance optimization from hot path analysis
820263f to 8abbce3
Motivation
Profiling gRPC-over-HTTP/2 workloads with async-profiler revealed several performance bottlenecks in the HTTP/2 request/response pipeline:
- handleWithStreamIdHeader wrapped the user handler call in Future { }, adding an unnecessary EC dispatch hop on top of mapAsyncUnordered's own scheduling
- Http2Demux.onPush performed two separate pattern matches per incoming frame
- handleStreamEvent, updateState, handleOutgoingCreated, and handleOutgoingEnded each created lambda closures per invocation
- HeaderCompression created a new OutHandler for each CompositeFrame continuation
- withErrorHandling always called .recover even for already-completed successful futures
Modification
Request path (3 commits)
- cc2531b8b: Call the user handler directly in the mapAsyncUnordered lambda instead of wrapping it in Future { }. The mapAsyncUnordered stage already schedules on the EC, so the extra Future { } wrapper doubled the scheduling overhead.
- 331e94ac8: Cache common gRPC headers (:method, :path, :scheme, content-type) in HeaderDecompression to avoid repeated parsing.
- 84a77de46: Combine two separate pattern matches in Http2Demux.onPush into a single match, eliminating redundant frame type checks.
Frame handling (3 commits)
- b11508060: Render DATA and HEADERS frames into a single buffer allocation in FrameRenderer, reducing per-response buffer allocations.
- 7f666f45c: Replace the per-CompositeFrame OutHandler allocation with a state field in HeaderCompression's GraphStageLogic, draining continuation frames without new object creation.
- a785e7023 (withErrorHandling fast path): Check future.value before calling .recover; for already-completed successful futures (the common case in gRPC unary), skip the .recover allocation entirely.
Stream processing (4 commits)
- f18c4b664: Merge the response path handler into RequestErrorFlow's GraphStageLogic (using with InHandler with OutHandler), eliminating 2 handler object allocations per materialization.
- 56890d1ab (updateState): Extract the commitStreamState bookkeeping method and inline state transitions in handleStreamEvent, eliminating the per-call x => (handle(x), ()) lambda wrapper.
- ae71cfbd0 (handleStreamEvent lambda elimination): Inline the _.handle(e) lambda in Http2Demux.handleStreamEvent, using commitStreamState directly.
- 9cea60bad (handleOutgoingCreated/Ended lambda elimination): Inline state transitions in handleOutgoingCreated and handleOutgoingEnded, eliminating 2 lambda allocations per response.
Note: This branch also contains HeaderPairs-related commits that were reverted (5c24d3131, 6449372d8, 5bbd883fa, 014156000). The net effect of these 4 commits is zero; they cancel each other out. The effective changes are the 10 commits listed above.
Result
Benchmarked with ghz (complex_proto, 1000 concurrency, 50 connections, 120s, SerialGC, pekko-grpc optimized):
Allocation profiling (async-profiler) confirms reduced per-request allocations in the HTTP/2 pipeline.
Tests
sbt http-core / Test / test
References