
Foundation Classes, Math - SIMD-accelerated BVH4/BVH8/BVH16 ray-AABB traversal #1232

Open
jijinbei wants to merge 18 commits into Open-Cascade-SAS:master from jijinbei:bvh-w4-w8-w16-simd

Conversation


@jijinbei jijinbei commented Apr 24, 2026

Summary

Adds runtime-dispatched SIMD ray-vs-AABB kernels and three wide-BVH traversal drivers (BVH4 / BVH8 / BVH16) to the BVH module.

  • 19 files added; only two existing files receive surgical edits: BVH_BinaryTree.hxx gains one new template method (CollapseToWide<W>) plus a helper, and two FILES.cmake listings register the new files.
  • +3601 / −0 lines.
  • No public API change, no existing test regresses.

What's new

Dispatch layer

  • BVH_SIMDDispatch.{hxx,cxx} — runtime CPU detection (cpuid + xgetbv on MSVC, __builtin_cpu_supports on GCC/Clang for x86; scalar fallback on ARM and other targets). BVH::SIMD::Detect() is computed once per process. OCCT_BVH_SIMD_FORCE env var allows forcing a level for testing.
  • BVH_ToolsSIMD_{SSE2,AVX2,AVX512}.hxx — per-ISA kernel headers, gated with per-function target attribute on GCC/Clang so the rest of TKMath stays buildable on hosts without those ISAs.

BVH4 (uses existing BVH_QuadTree)

  • RayBox4_{Scalar,SSE2,AVX2,AVX512} — 1-ray-vs-4-AABB slab-test kernels.
  • BVH_TraverseQuad.hxx — TraverseQuad<> driver. Inner nodes load up to 4 child AABBs into a SoA record and dispatch to BVH::SIMD::GetRayBox4().

BVH_WideTree<int W> abstraction (BVH8 / BVH16)

  • BVH_WideTree.hxx — single template tag/specialization replacing what would otherwise be two near-identical ad-hoc classes. BVH_Tree<T,N,BVH_WideTree<W>> for W=8 and W=16.
  • BVH_TraverseWide.hxx — single TraverseWide<W> driver covering both fan-outs.
  • BVH_BinaryTree::CollapseToWide<W>() — single template builder using a new BVH::GatherDescendants(node, log2(W), …) helper. W=8 walks 3 binary levels (2³ = 8 children); W=16 walks 4 (2⁴ = 16 children).

SIMD strategy: one natural-fit kernel per width

BVH width   Natural ISA   Register               Why
BVH4        SSE2          xmm 128-bit (4 fp32)   Empirically faster than AVX2 here — the upper 128 bits of ymm sit idle and the 256-bit pipe wakeup cost outweighs the wider scheduling.
BVH8        AVX2          ymm 256-bit (8 fp32)   First width that saturates ymm. AVX-512 with __m256 + mask gains only "a few cycles" (measured; not shipped).
BVH16       AVX-512       zmm 512-bit (16 fp32)  First width that uses the full zmm. Per-lane work genuinely halves vs BVH8/AVX2.

Mismatched kernels (e.g. SSE2-on-BVH8 with two 4-wide passes, or AVX-512-on-BVH8 keeping zmm half-empty) were intentionally not added.

Tests

385 BVH gtests under src/FoundationClasses/TKMath/GTests/:

  • Functional per width — empty tree, single primitive, all-hit ray, all-miss ray, mosaic-hit ray, ray-behind-origin rejection, parallel-ray-inside-slab.
  • Random consistency — 1000 random ray + W-AABB cases per kernel, comparing SIMD output to the scalar reference (mask exact, t-values within 1e-3). The AVX-512 BVH16 variant is the acceptance gate for the zmm path.
  • DISABLED microbenchmark (BVH_ToolsSIMDBenchmark.DISABLED_RayBox*_KernelComparison) measuring ns/call per kernel × width.

AVX-512-specific tests GTEST_SKIP cleanly on hosts without AVX-512.

Side-branch cross-validation

In addition to the unit tests above, the equivalence between the BVH4, BVH8 and BVH16 traversals was independently verified on a sibling branch using a 3-way visual cross-check: a DISABLED gtest builds a deterministic random scene of 256 AABBs, runs all three traversals against the same ray, asserts via EXPECT_EQ that the three hit sets are identical, and emits a self-contained HTML viewer (Three.js, 3D scene, slider for live N, embedded benchmark table) for inspection. The 3-way EXPECT_EQ passed on every host tested. The HTML visualizer is kept on a separate developer-aid branch and is intentionally not included in this PR to keep the diff focused on the kernel/template work.

Performance evidence (measured on Google Cloud)

ns per 1-ray-vs-W-AABB kernel call, 4096 cases × 4096 repeats per kernel.

Benchmark host: Google Cloud Compute Engine c3-highcpu-4 (Intel Xeon Platinum 8481C, Sapphire Rapids, 4 vCPU, asia-northeast1) — rented specifically for this PR. All 385 BVH gtests pass on this host; on AVX2-only hosts the 4 AVX-512 tests SKIP cleanly.

BVH width   Scalar   SSE2   AVX2   AVX-512   Best      Speedup
BVH4         31.22   3.20   3.55   3.40      SSE2       9.77×
BVH8         75.68     —    5.70     —       AVX2      13.29×
BVH16       151.53     —      —    7.47      AVX-512   20.28×

(The Best column marks the natural-fit kernel for that width; — marks combinations intentionally not shipped.)

Per-lane (ns), comparing the best kernel of each width:

Width × kernel    Per-lane (ns)   Note
BVH16 / AVX-512   0.467           winner — confirms the kernel really uses zmm, not the __m256+mask shortcut
BVH8 / AVX2       0.71
BVH4 / SSE2       0.80

Compatibility

  • No public API change, no existing method modified. The only existing files touched are BVH_BinaryTree.hxx (one new template method CollapseToWide<W> and the BVH::GatherDescendants helper) and two FILES.cmake listings.
  • Production code does not yet call the new traversal drivers — they are exercised only by the new tests, leaving callsite migration as opt-in for future PRs.
  • Cross-platform safe: the Detect() function is properly guarded so ARM macOS (Apple Silicon) builds cleanly — __builtin_cpu_* is #if'd out outside x86, and the fallback path returns Scalar.

Implementation notes

  • Slab formula matches scalar exactly. All SIMD kernels use (box - origin) * invDir, NOT the FMA-friendly form. Parallel-ray edge cases (origin coincides with box face → 0 * inf = NaN) propagate differently between formulations; the random-consistency tests pin every SIMD kernel to the scalar behaviour.
  • Padding lanes are always-miss boxes. Inner nodes with fewer than W children get padding lanes filled with max < min on every axis, so the SIMD slab test rejects them naturally with no special control flow on the kernel side.
  • GTEST_SKIP rather than compile-out for unsupported ISAs, so the test binary is identical across hosts.

Build / CI

Verified on the contributor's fork CI matrix: Linux (GCC 13 + Clang 18, AVX2-only host), Windows MSVC (x64 / ARM64), macOS Apple Silicon (Clang, scalar fallback). The full AVX-512 BVH16 zmm path was additionally validated on Google Cloud Sapphire Rapids hardware (see Performance evidence above).

CLA ID

1142

jijinbei and others added 18 commits April 25, 2026 03:44
Introduce BVH_SIMDDispatch with runtime CPU detection (__cpuid+xgetbv on
MSVC, __builtin_cpu_supports on GCC/Clang) and a function pointer
dispatch returning the best available 1-ray-vs-4-AABB kernel for the
host. Ships with a scalar reference implementation; SSE2, AVX2 and
AVX-512 kernels follow in subsequent commits.

The OCCT_BVH_SIMD_FORCE environment variable allows forcing a specific
level (clamped to actual CPU support) to exercise lower-level kernels
on machines that support higher ones.

Adds 7 unit tests covering detection, dispatch, and the scalar kernel
across all-hit, all-miss, mosaic partial hits, ray-behind-origin, and
parallel-ray-inside-slab cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implement RayBox4_SSE2 in BVH_ToolsSIMD_SSE2.hxx using the slab method
with _mm_min_ps/_mm_max_ps. The kernel exploits the SSE2 spec that
"if either operand is NaN the second is returned" to handle parallel
rays without explicit branching: parallel-axis NaN is absorbed by the
finite results of the other axes.

Wires the SSE2 kernel into BVH_SIMDDispatch::GetRayBox4 for SSE2/AVX2/
AVX-512 detected levels (AVX2 and AVX-512 specialized kernels arrive
in subsequent commits).

Adds 8 tests in BVH_ToolsSIMD_Test.cxx covering all-hit, all-miss,
mosaic partial hits, ray-behind, parallel-ray-inside-slab, degenerate
point-boxes, dispatch preference on x86, and 1000-case random
consistency vs the scalar reference (mask exact, t-values within 1e-3).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add BVH_TraverseQuad::TraverseQuad, a free-function template that walks
a quad-BVH tree with one ray and invokes a user-provided acceptor for
each primitive in any leaf the ray intersects. Inner nodes load up to
4 child AABBs into a SoA record and dispatch to BVH::SIMD::GetRayBox4
so the best kernel for the host CPU (currently scalar or SSE2; AVX2
and AVX-512 land in following commits) is used automatically.

The traversal is intentionally narrow in scope: it handles only the
ray-vs-AABB query, not the full BVH_Traverse selector framework
(distance queries, pair traversal, metric-driven descent). This keeps
the SIMD integration self-contained and avoids touching any production
caller. Real users still go through BVH_Traverse::Select() on a binary
tree; this new path is exercised only by tests in this commit and
becomes opt-in for callers in a later PR.

Padding lanes for inner nodes with fewer than 4 children get an
"always-miss" box (max < min on every axis), so the slab test rejects
them naturally and no special control flow is needed in the kernel.

Adds 5 tests covering: empty tree, flat root with all-hit / all-miss /
mosaic-hit rays, and ray-behind-origin rejection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implement RayBox4_AVX2 in BVH_ToolsSIMD_AVX2.hxx, gated on a per-function
__attribute__((target("avx2,fma"))) so the rest of TKMath stays buildable
on hosts without AVX2. The dispatcher routes to the AVX2 kernel only after
runtime CPU detection (cpuid + xgetbv) confirms support.

The kernel uses the same slab formula as SSE2, t = (box - origin) * invDir,
NOT the FMA-friendly t = invDir*box - invDir*origin. Reason: for parallel
rays whose origin coincides with a box face, idx*minx becomes 0*inf = NaN,
which the FMA-form fmsub then propagates differently than the scalar/SSE2
path. Random-consistency tests caught the divergence (mosaic mask 0b0101
came back as 0b1111 because parallel-axis NaN polluted finite axes).
The win over SSE2 therefore comes from VEX three-operand encoding and
better register pressure, not from FMA -- a real BVH8 with __m256 lanes
would unlock much more.

Adds 3 AVX2-specific tests gated on Detect() >= AVX2 (so they SKIP on
older hardware): all-hit, mosaic partial-hit, and 1000-case random
consistency vs the scalar reference. Existing SSE2/scalar/QBVH tests
continue to pass through the new dispatch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implement RayBox4_AVX512 in BVH_ToolsSIMD_AVX512.hxx, gated on a
per-function __attribute__((target("avx512f,avx512vl"))) so the rest of
TKMath remains buildable on hosts without AVX-512. The dispatcher
routes to it only after runtime detection (cpuid + xgetbv with the 0xE6
ZMM-state mask) confirms both ISA and OS support.

Stays on __m128 lanes -- a true BVH16 with __m512 would be needed to
unlock the full 16-lane width. The win at BVH4 fan-out is modest
(~5-15%): _mm_cmp_ps_mask returns the predicate directly into a mask
register, so the final hit-mask is one mask AND instead of the
movemask + shuffle dance the SSE2/AVX2 paths use. The mask AND uses
operator& on __mmask8 (an integer alias) rather than _kand_mask8,
because the latter requires AVX-512DQ which is a narrower target than
AVX-512F+VL.

Slab formula deliberately matches SSE2/AVX2 -- (box-origin)*invDir, not
the FMA form -- so parallel-ray NaN propagation stays consistent with
the scalar reference.

Adds 3 AVX-512-specific tests gated on Detect() >= AVX512 (so they
SKIP on non-AVX-512 hosts, including this commit's CI). The existing
SSE2/AVX2/scalar tests continue to verify dispatch on this hardware.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…x kernels

Add BVH_ToolsSIMD_Benchmark_Test.cxx with a single benchmark, gated
behind the gtest DISABLED_ prefix so CI does not measure timing
unreliably. Run it explicitly with:
  OpenCascadeGTest --gtest_also_run_disabled_tests
                   --gtest_filter='BVH_ToolsSIMDBenchmark*'

The benchmark exercises 4096 random (ray, 4-AABB) cases x 4096
repetitions = ~16.8M invocations per kernel, then reports ns/call and
speedup vs the scalar reference. A "sink" accumulator over the returned
hit masks is printed to keep the optimizer from removing the loop.

Sample output on a Zen-class AVX2 host (no AVX-512):
  Scalar          14.86 ns/call   1.00x
  SSE2             1.92 ns/call   7.75x
  AVX2             2.13 ns/call   6.98x
  AVX-512        (skipped: CPU support absent)
  Dispatched       2.08 ns/call   7.13x

Notes:
- All kernels return bit-identical hit masks (sink values match exactly).
- The 7-8x speedup over scalar is well above the 3.5-4x predicted from
  pure SSE lane width, suggesting the scalar version's branchy
  parallel-ray handling costs more than expected.
- AVX2 came out marginally slower than SSE2 on this CPU. Plausibly the
  256-bit pipe wakeup cost is not amortized at BVH4 fan-out, and the
  VEX-encoded SSE form already gets all the register-pressure benefit.
  A real BVH8 with __m256 lanes would change this picture.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Introduce BVH_OctTree (analogous to BVH_QuadTree but with K=0..7) and a
new BVH_Tree<T,N,BVH_BinaryTree>::CollapseToOctTree() method that gathers
up to 2^3 = 8 representatives of a 3-level binary subtree into one BVH8
inner node. The 8-wide layout is the substrate that lets a single AVX2
__m256 ray-box test cover the whole fan-out -- the BVH4 we shipped
earlier left half the AVX2 lanes idle and ended up only matching SSE2 in
the microbenchmark.

The collapse uses BVH::GatherDescendants, a recursive helper that walks
3 levels down or stops earlier at any leaf, so inner nodes ranging
1..8 children are produced naturally without padding logic in the
builder. The kernel will absorb the irregular fan-out via "always-miss"
boxes on unused SIMD lanes (same trick as TraverseQuad).

Adds 5 BVH_OctTree_Test cases:
- empty / single-element trivial paths preserved
- all primitives present after collapse (set equality vs binary leaves)
- node count strictly less than the source binary tree
- every inner node reports child count in [1,8] and at least one
  reaches the saturated 8-children case for a 64-leaf balanced tree

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extend BVH_SIMDDispatch with the BVH8 (1-ray-vs-8-AABB) infrastructure:
the BVH_Box8f_SoA / BVH_Ray8f_Splat data structures (32-byte aligned to
suit later __m256 loads), a RayBox8_Fn function pointer, the scalar
reference RayBox8_Scalar, and a GetRayBox8() dispatcher whose switch
currently routes everything to the scalar path.

The slab-method body mirrors the 4-wide scalar kernel verbatim, just
unrolled across 8 lanes; this gives the equivalence baseline against
which the SSE2/AVX2/AVX-512 RayBox8 kernels in upcoming commits will be
measured.

Adds 4 RayBox8 scalar tests (all-hit, all-miss, 8-bit mosaic 0b01010101,
GetRayBox8 non-null) and a DISABLED_RayBox8_KernelComparison benchmark.
First measurement on this host: 34.1 ns/call -- about 2x the BVH4 scalar
time, as expected for double the lane count.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implement RayBox8_SSE2 by chaining the existing RayBox4_SSE2 twice
(lower lanes 0..3, upper lanes 4..7). On 128-bit hardware this is the
ceiling since SSE2 cannot consume 8 lanes in one instruction; the value
of having a dedicated 8-wide entry point is uniform dispatch -- callers
can target one BVH8 traversal API and let the dispatcher pick the best
implementation per host. AVX2's __m256 will give the genuine 8-lane
single-instruction kernel in the next commit.

Wires the SSE2 kernel into GetRayBox8 for SSE2/AVX2/AVX-512 detected
levels (those higher levels stay on SSE2 until the AVX2 and AVX-512
RayBox8 kernels arrive).

Adds RayBox8 random consistency test (500 cases, SSE2 vs scalar) and
extends the DISABLED_RayBox8 benchmark with the SSE2 row.

First measurement on this host:
  Scalar       34.12 ns/call   1.00x
  SSE2          3.93 ns/call   8.68x

Per-lane throughput matches BVH4 SSE2 closely (0.49 vs 0.48 ns/lane),
confirming the implementation is just the 4-wide kernel doubled.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…__m256)

Implement RayBox8_AVX2 -- the kernel BVH8 was built for. A single 256-bit
__m256 holds all 8 lanes, so the entire BVH8 fan-out is tested in one
slab pass instead of the SSE2 path's two 4-wide passes.

Adopts tavianator's sign-based corner select (Fast, Branchless Ray/
Bounding Box Intersections, Part 3, 2022): the slab "near" and "far"
faces are determined by the sign of invDir per axis, which is a property
of the ray, not the box. Pre-selecting near/far via _mm256_blendv_ps
eliminates one extra mul+min+max per axis (3 instructions per axis vs 4
in the symmetric formulation).

NaN handling for parallel rays (invDir = inf, origin on box face) keeps
the same operand ordering as SSE2 -- max(max(tNearY, tNearZ), tNearX)
and the analogous min for tLeave -- so the parallel-axis NaN gets
absorbed by the finite results of the other two axes via SSE's
"min/max returns the second operand on NaN" rule.

Wires AVX2 into GetRayBox8 ahead of SSE2.

Microbenchmark on this Zen-class AVX2 host:
  Scalar       34.11 ns/call   1.00x
  SSE2          3.92 ns/call   8.70x
  AVX2          3.09 ns/call  11.04x   <-- AVX2 finally beats SSE2

Per-lane throughput drops from 0.49 ns/lane (SSE2 BVH8) to 0.39 ns/lane
(AVX2 BVH8), a 25% improvement. Compare to BVH4 where AVX2 was *slower*
than SSE2 (2.13 ns vs 1.92 ns) because the upper 128 bits of __m256 sat
idle. This confirms the BVH8 layout was the missing piece -- the SIMD
kernels now pay back the dispatch infrastructure.

Adds RayBox8 random consistency test (500 cases, AVX2 vs scalar) and
extends the DISABLED_RayBox8 benchmark with the AVX2 row.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implement RayBox8_AVX512: same slab + sign-based corner select as
RayBox8_AVX2, but the final hit predicate is computed straight into a
__mmask8 via _mm256_cmp_ps_mask (AVX-512F + VL), saving the movemask
round-trip the AVX2 path needs. At BVH8 fan-out the saving is small (a
few cycles); the same pattern would matter more at BVH16 with __m512.

Wires AVX-512 into GetRayBox8 ahead of AVX2 and SSE2.

Adds RayBox8 random consistency test (500 cases, AVX-512 vs scalar)
gated on Detect() >= AVX512 -- it SKIPs on hosts without the ISA
including this commit's CI box, but it builds and is exercised on
AVX-512-capable hardware. Extends the DISABLED_RayBox8 benchmark with
the AVX-512 row likewise gated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add BVH_TraverseOct::TraverseOct, the 8-wide analogue of TraverseQuad.
Walks an OBVH built via BVH_BinaryTree::CollapseToOctTree and invokes a
user-provided acceptor for each primitive in any leaf the ray hits.
Inner nodes load up to 8 child AABBs into a SoA record and dispatch to
BVH::SIMD::GetRayBox8 so the best kernel for the host (currently scalar
or SSE2 on this CI box; AVX2 / AVX-512 where available) does the test.

Mirrors TraverseQuad's design verbatim except for the lane count (8
instead of 4), padding-mask width (8 bits), and stack capacity
(MaxTreeDepth * 8). Padding lanes get an "always-miss" box so inner
nodes with fewer than 8 children pass through the SIMD kernel without
any special control flow.

Adds 5 tests covering: empty tree, flat root with all-hit / all-miss /
8-bit-mosaic-hit rays, and ray-behind-origin rejection. All pass and
exercise the AVX2 RayBox8 kernel on this host.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Consolidate the BVH8 implementation behind a single template<int W>
abstraction so adding BVH16 (a follow-up commit) reuses the traversal
and dispatch infrastructure instead of duplicating it.

The new abstraction lives in:
  - BVH_WideTree.hxx    : tag struct + BVH_Tree<T,N,BVH_WideTree<W>> spec
  - BVH_TraverseWide.hxx: TraverseWide<W> template + W-parametric helpers

Existing BVH8 surface API is renamed (no aliases, full rename):
  BVH_OctTree              -> BVH_WideTree<8>
  TraverseOct              -> TraverseWide<8>
  CollapseToOctTree        -> CollapseToWide<8>  (template, also <16>-ready)
  BVH_Box8f_SoA            -> BVH_BoxNf_SoA<8>
  BVH_Ray8f_Splat          -> BVH_RayNf_Splat<8>
  RayBox8_Scalar           -> RayBoxN_Scalar<8>  (function template)
  RayBox8_AVX2             -> RayBoxN_AVX2_8     (signature uses BVH_*Nf<8>)
  GetRayBox8 / RayBox8_Fn  -> GetRayBoxN<8> / RayBoxN_Fn<8>

BVH4 is intentionally untouched -- BVH_QuadTree, TraverseQuad, RayBox4_*
remain as before. SSE2 (xmm 4-wide) is BVH4's natural fit; folding it
into the template gains nothing.

SIMD kernel pruning per "right tool for right job":
  Past benchmarks (Zen AVX2 host, commit 819c0cf / 20487e4) showed:
    - BVH4 SSE2:  1.92 ns (7.75x), AVX2:  2.13 ns (6.98x) -- AVX2 LOSES
      (256-bit pipe wakeup + upper 128 idle eats the lane-width gain)
    - BVH8 SSE2:  3.92 ns (8.70x), AVX2:  3.09 ns (11.04x) -- AVX2 wins
    - BVH8 AVX-512: "saving is a few cycles" (commit d8ee58e's own
      observation) because that kernel used __m256+mask, NOT __m512.
  Conclusion: kernel width should match BVH width should match register
  width (xmm/ymm/zmm). The mismatched kernels were dead weight, so:
    - Drop RayBox8_SSE2     (BVH_ToolsSIMD_SSE2.hxx)
    - Drop RayBox8_AVX512   (BVH_ToolsSIMD_AVX512.hxx)
  W=8 dispatch is now AVX2 -> Scalar fallback only.
  W=16 dispatch is Scalar fallback for now; AVX-512 (__m512 zmm 16-lane)
  is the natural-fit kernel and lands in a follow-up commit.

Tests:
  - BVH_OctTree_Test.cxx       deleted, replaced by BVH_WideTree8_Test.cxx
  - BVH_TraverseOct_Test.cxx   deleted, replaced by BVH_TraverseWide8_Test.cxx
  - BVH_SIMDDispatch_Test.cxx  updated to RayBoxN_*<8>
  - BVH_ToolsSIMD_Test.cxx     drop RayBox8 SSE2/AVX-512 random-consistency
                               tests, keep AVX2 BVH8 (renamed RayBoxN_AVX2_8)
  - BVH_ToolsSIMD_Benchmark_Test.cxx  RayBox8 bench replaced by RayBoxN<8>
                                       (Scalar + AVX2 only; SSE2 / AVX-512
                                       BVH8 rows removed)
  - All 218 BVH-related GTests pass on the AVX2 host (3 AVX-512 tests skip
    due to absent ISA, expected behaviour).

git history is intentionally not preserved (delete + create instead of
git mv) -- template-ization changes content beyond rename-detection's
50% similarity threshold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Instantiates the BVH_WideTree<W> template at W=16, which:
  - Collapses every 4th binary level (2^4 = 16 children per inner node)
    via the existing CollapseToWide<W> helper.
  - Uses the existing template RayBoxN_Scalar<W> for ray-box tests.
  - Routes traversal through the existing TraverseWide<W>.

No new BVH16-only infrastructure is needed -- everything is template
instantiation of code already shipped in the BVH_WideTree<W> commit.
The natural-fit AVX-512 (__m512 zmm 16-lane) kernel will land in a
follow-up commit; until then BVH16 dispatches to scalar fallback.

Tests:
  - BVH_WideTree16_Test.cxx     mirrors BVH_WideTree8_Test.cxx with
                                256 primitives (2^8 leaves so 4-level
                                collapse can saturate at least one
                                BVH16 inner node).
  - BVH_TraverseWide16_Test.cxx mirrors BVH_TraverseWide8_Test.cxx with
                                a flat 16-leaf root: AllSixteenHit,
                                AllMiss, 16BitMosaic, RayBehindAllBoxes.
  - BVH_SIMDDispatch_Test.cxx   adds 4 ScalarRayBoxN16_* tests + a
                                GetRayBoxN16ReturnsNonNull check.

All 14 new tests pass; 218 previously-passing tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This is the kernel BVH16 was built for: a single __m512 holds all 16
fp32 lanes so one slab pass exhausts the AVX-512 register width with
zero waste. Per-lane work genuinely halves vs BVH8/AVX2.

Critically, this kernel uses zmm intrinsics throughout, NOT __m256 +
mask tricks. The previous (now-removed) BVH8 AVX-512 kernel made that
mistake: it stayed on __m256 lanes and used AVX-512 only to skip the
movemask round-trip, which gained "a few cycles" per call (commit
d8ee58e's own admission). The result was AVX-512 BVH8 ~= AVX2 BVH8,
no real win. The same shortcut here would defeat the entire purpose
of BVH16.

Sanity-checked via objdump that the emitted instructions actually use
zmm registers:
  vmovups 0xc0(%rdi),%zmm10
  vmovups 0x100(%rdi),%zmm9
  vcmplt_oqps %zmm7,%zmm10,%k3
  ...

Algorithm: same sign-based corner select as RayBoxN_AVX2_8 (tavianator),
now with _mm512_mask_blend_ps driven by sign-extracted __mmask16. Final
hit predicate goes straight into __mmask16 via _mm512_cmp_ps_mask;
__mmask16 is an int alias so the return is a single zext.

Target attribute: target("avx512f") only (no vl). The 512-bit-only path
is sufficient -- vl extensions would only matter if we mixed zmm with
narrower vector ops, which we don't.

NaN handling for parallel rays (invDir = inf, origin on box face) keeps
the same operand ordering as SSE2/AVX2 -- _mm512_max_ps / _mm512_min_ps
on x86 follow the "second operand wins on NaN" rule, so a parallel-axis
NaN is dominated by finite results from the other two axes.

Dispatch: GetRayBoxN<16>() now returns &RayBoxN_AVX512_16 on AVX-512
hosts, scalar fallback otherwise.

Tests:
  - BVH_ToolsSIMD_Test.cxx adds AVX512_RayBoxN16_RandomConsistencyVsScalar
    (1000 random ray+16-AABB cases vs scalar reference, ULP-near match).
    SKIPs cleanly on hosts without AVX-512.
  - All previously-passing tests continue to pass on the AVX2-only host.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends BVH_ToolsSIMD_Benchmark_Test.cxx with DISABLED_RayBoxN16_
KernelComparison: scalar reference + AVX-512 (when supported on the
host). Same shape as the existing RayBoxN<8> bench (4096 random
ray+16-AABB cases x 4096 repeats = 16.78M invocations).

The bench includes a sanity gate (EXPECT_LT(aAVX512.nsPerCall,
aScalar.nsPerCall * 0.5)) that asserts AVX-512 BVH16 actually beats
scalar by the expected 2x+ margin -- a smoke test against the
"silently using ymm not zmm" failure mode that hit the previous BVH8
AVX-512 attempt.

Per "right tool for right job", BVH16 has only the AVX-512 SIMD path
plus scalar fallback (no SSE2/AVX2 BVH16 wrappers). On hosts without
AVX-512, the bench just runs scalar and dispatched-= scalar rows.

Sample output on a Zen 3 (AVX2-only) host:

  === RayBoxN<8> microbenchmark (BVH8 fan-out) ===
    Scalar           31.51 ns/call    1.00x
    AVX2              3.17 ns/call    9.95x
    Dispatched        3.23 ns/call    9.74x

  === RayBoxN<16> microbenchmark (BVH16 fan-out) ===
    Scalar           63.38 ns/call    1.00x
    AVX-512       (skipped: CPU support absent)
    Dispatched       66.05 ns/call    0.96x

Per-lane on this host:
  BVH8/AVX2 = 0.40 ns/lane (lower bound for the kernel cost)
  BVH16/Scalar = 3.96 ns/lane (10x worse per lane vs SIMD)
On AVX-512 hardware, the AVX-512 BVH16 row should land near
~0.4 ns/lane * 16 = ~6 ns/call (~2x the AVX2 BVH8 number),
demonstrating the per-lane savings the zmm width unlocks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two fixes for CI failures observed on jijinbei/OCCT PR #4:

1. macOS-15 (Apple Silicon) build failure: __builtin_cpu_init() and
   __builtin_cpu_supports() are x86-only GCC/Clang builtins. On ARM64
   they error with "builtin is not supported on this target".
   Guard the GCC/Clang detect path with __x86_64__ || __i386__ so ARM
   falls through to the scalar fallback (already the right behaviour --
   no x86 SIMD kernels exist on ARM).

2. Code formatting check failure: CI runs a stricter clang-format
   (clang-format-18 default style) than the local pre-commit hook used.
   Apply the diff produced by the CI's "format-patch" artifact verbatim.
   Touches BVH_TraverseQuad.hxx, BVH_TraverseQuad_Test.cxx, and the
   detectImpl() function in BVH_SIMDDispatch.cxx (whitespace only).

Local re-run after both fixes: 381/385 BVH tests pass on the AVX2-only
Zen 3 host (4 AVX-512 tests SKIP as expected, no regressions).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Apple Silicon Clang runs -Werror -Wunused-function and flagged the
MakeRaySplat / SetBox helpers in BVH_ToolsSIMD_Test.cxx as unused.
On macOS ARM64:
  - BVH_HAS_SSE2_KERNEL is not defined (SSE2 is x86-only)
  - The entire block of SSE2 RayBox4 tests that *used* these helpers is
    #if'd out
  - -> every caller disappears at preprocessing time, and the helpers
    become dead code

Annotate both with [[maybe_unused]] so they survive the -Werror guard
when their consumers are conditionally compiled out. On x86 hosts the
attribute is a no-op because the SSE2 block is present and uses them.

Fixes the macOS build failure observed on CI run 24898268476.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dpasukhi added the "3. CLA approved" label on Apr 28, 2026