perf: batch hash join chain-traversal probe lookups for memory-level parallelism #22677
Dandandan wants to merge 1 commit into
…parallelism

In the chained (non-unique-key) path of `get_matched_indices_with_limit_offset`, each probe row's `map.find` cache miss fed directly into the chain walk that consumed it, so the hash-table probes were serialized one row at a time.

Process probe rows in windows of 16: first resolve every row's head-of-chain index (`map.find`), then traverse the chains. Separating lookup from traversal lets the independent hash-table probes (the dominant cache miss here) have several misses outstanding at once (memory-level parallelism) instead of each one stalling the row that follows it.

`traverse_chain` remains the sole authority for the output limit and resume offset and is still called in probe-row order, so the resume protocol is unchanged. Heads looked up for rows past the limit are discarded and recomputed on the next call.

Adds a unit test asserting the windowed lookup produces identical output to a single unbounded call across the window boundary, for several limits including ones that split mid-chain.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
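The lookup/traversal split described in the commit message can be sketched roughly as follows. This is a toy model, not the DataFusion code: `ChainedMap`, `probe_per_row`, and `probe_windowed` are hypothetical names, and a `std::collections::HashMap` stands in for the real hash table. It shows only the control-flow change (per-row lookup-then-walk versus per-window lookup followed by walks) and says nothing about measured speedups.

```rust
use std::collections::HashMap;

/// Window size taken from the description above.
const WINDOW: usize = 16;

/// Toy chained map (hypothetical, not DataFusion's `JoinHashMap`).
/// `heads` maps a hash to a 1-based head index; `next[i - 1]` holds the
/// following 1-based index in the chain, and 0 terminates the chain.
struct ChainedMap {
    heads: HashMap<u64, usize>,
    next: Vec<usize>,
}

impl ChainedMap {
    fn build(build_hashes: &[u64]) -> Self {
        let mut heads = HashMap::new();
        let mut next = vec![0; build_hashes.len()];
        for (row, &h) in build_hashes.iter().enumerate() {
            // The new row becomes the head; the previous head (or 0) follows it.
            let prev = heads.insert(h, row + 1).unwrap_or(0);
            next[row] = prev;
        }
        Self { heads, next }
    }

    /// Baseline: each lookup feeds straight into the chain walk for that row.
    fn probe_per_row(&self, probe: &[u64]) -> Vec<(usize, usize)> {
        let mut out = Vec::new();
        for (p, h) in probe.iter().enumerate() {
            let mut idx = self.heads.get(h).copied().unwrap_or(0);
            while idx != 0 {
                out.push((p, idx - 1));
                idx = self.next[idx - 1];
            }
        }
        out
    }

    /// Windowed: resolve every head in the window first, then walk the chains.
    fn probe_windowed(&self, probe: &[u64]) -> Vec<(usize, usize)> {
        let mut out = Vec::new();
        for (w, chunk) in probe.chunks(WINDOW).enumerate() {
            // Phase 1: independent lookups into a small stack array.
            let mut head_buf = [0usize; WINDOW];
            for (slot, h) in head_buf.iter_mut().zip(chunk) {
                *slot = self.heads.get(h).copied().unwrap_or(0);
            }
            // Phase 2: traverse the chains in probe-row order.
            for (i, &head) in head_buf[..chunk.len()].iter().enumerate() {
                let mut idx = head;
                while idx != 0 {
                    out.push((w * WINDOW + i, idx - 1));
                    idx = self.next[idx - 1];
                }
            }
        }
        out
    }
}
```

Both functions emit `(probe_row, build_row)` pairs in the same order, which is the property the windowed version has to preserve.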
run benchmarks
🤖 Benchmark running (GKE). Comparing perf/batch-chain-traversal (8abebd8) to 0da8961 (merge-base) using: clickbench_partitioned
🤖 Benchmark running (GKE). Comparing perf/batch-chain-traversal (8abebd8) to 0da8961 (merge-base) using: tpch
🤖 Benchmark running (GKE). Comparing perf/batch-chain-traversal (8abebd8) to 0da8961 (merge-base) using: tpcds
🤖 Benchmark completed (GKE): tpch, base (merge-base) vs. branch
🤖 Benchmark completed (GKE): tpcds, base (merge-base) vs. branch
🤖 Benchmark completed (GKE): clickbench_partitioned, base (merge-base) vs. branch
Which issue does this PR close?
Rationale for this change
In the hash join probe path, `JoinHashMap::get_matched_indices_with_limit_offset` handles the non-unique-key (chained) case by, for each probe row, looking up the head of the collision chain (`map.find`) and immediately walking that chain. Because the `map.find` result feeds straight into the chain walk that consumes it, the hash-table probes are effectively serialized: each row's hash-table cache miss must resolve before the next row's lookup begins.

The `map.find` miss is the dominant cost in this path, and these lookups are independent across probe rows, which makes them an ideal candidate for memory-level parallelism.

What changes are included in this PR?
Process the probe rows in windows of 16 with two phases per window:
1. Resolve the head-of-chain index (`map.find`) for every row in the window into a small stack array. These probes are independent, so their cache misses overlap (several outstanding at once) instead of each stalling the next row.
2. Traverse the chains for the rows in the window, in probe-row order.

`traverse_chain` remains the sole authority for the per-call output limit and resume offset, and is still invoked in probe-row order, so the limit/offset resume protocol is unchanged. Heads looked up for rows past the output limit are simply discarded and recomputed on the next call (the lookup is a pure function of the hash). The unique-key fast path and the mid-chain resume handling are untouched.

Are these changes tested?
Yes:
- `test_limit_offset_window_boundary_matches_unbounded` asserts the windowed lookup yields output identical to a single unbounded call, across more than one window (20 probe rows) with chains/singletons/misses, for several limits, including ones that split mid-chain and don't divide the window size.
- `cargo test -p datafusion-physical-plan --lib joins::`: 968 tests pass.
- `cargo test -p datafusion-sqllogictest --test sqllogictests -- joins`: passes.

Are there any user-facing changes?
No. Internal performance optimization with identical results.
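
For readers who want to see why the limit/offset resume protocol is unaffected by the windowing, here is a minimal self-contained sketch. It is not the DataFusion implementation: `ChainedMap`, `Offset`, and `probe_with_limit` are hypothetical, a `std::collections::HashMap` stands in for the real table, and the `traverse_chain` here only mirrors the role described above (the single place where the limit is enforced and the resume offset produced). Heads resolved for rows past the limit are dropped and looked up again on the next call.

```rust
use std::collections::HashMap;

const WINDOW: usize = 16;

/// Hypothetical resume point: (probe row, chain index to continue at if mid-chain).
type Offset = (usize, Option<usize>);

/// Toy chained map: `heads` holds 1-based head indices, `next[i - 1]` the
/// following 1-based index in the chain, with 0 as the terminator.
struct ChainedMap {
    heads: HashMap<u64, usize>,
    next: Vec<usize>,
}

impl ChainedMap {
    fn build(build_hashes: &[u64]) -> Self {
        let mut heads = HashMap::new();
        let mut next = vec![0; build_hashes.len()];
        for (row, &h) in build_hashes.iter().enumerate() {
            next[row] = heads.insert(h, row + 1).unwrap_or(0);
        }
        Self { heads, next }
    }

    /// Walk one chain; stop and return a resume offset once `out` is full.
    fn traverse_chain(
        &self,
        row: usize,
        start: usize,
        limit: usize,
        out: &mut Vec<(usize, usize)>,
    ) -> Option<Offset> {
        let mut idx = start;
        while idx != 0 {
            if out.len() == limit {
                return Some((row, Some(idx)));
            }
            out.push((row, idx - 1));
            idx = self.next[idx - 1];
        }
        None
    }

    /// Windowed probe that emits at most `limit` pairs per call.
    fn probe_with_limit(
        &self,
        probe: &[u64],
        limit: usize,
        offset: Offset,
    ) -> (Vec<(usize, usize)>, Option<Offset>) {
        let mut out = Vec::new();
        let (mut row, resume) = offset;
        // Finish a chain that an earlier call left half-walked.
        if let Some(idx) = resume {
            if let Some(off) = self.traverse_chain(row, idx, limit, &mut out) {
                return (out, Some(off));
            }
            row += 1;
        }
        while row < probe.len() {
            let end = (row + WINDOW).min(probe.len());
            // Phase 1: resolve every head in the window.
            let mut head_buf = [0usize; WINDOW];
            for (slot, h) in head_buf.iter_mut().zip(&probe[row..end]) {
                *slot = self.heads.get(h).copied().unwrap_or(0);
            }
            // Phase 2: walk chains in probe-row order; the remaining heads
            // are discarded if the limit is reached.
            for (i, &head) in head_buf[..end - row].iter().enumerate() {
                if let Some(off) = self.traverse_chain(row + i, head, limit, &mut out) {
                    return (out, Some(off));
                }
            }
            row = end;
        }
        (out, None)
    }
}
```

Calling `probe_with_limit` repeatedly with a small limit and feeding each returned offset into the next call should reproduce the output of one unbounded call, which is the same shape of check the new unit test is described as making.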