Foundation Classes, Math - SIMD-accelerated BVH4/BVH8/BVH16 ray-AABB traversal#1232
Open
jijinbei wants to merge 18 commits into Open-Cascade-SAS:master from
Conversation
Introduce BVH_SIMDDispatch with runtime CPU detection (__cpuid + xgetbv on MSVC, __builtin_cpu_supports on GCC/Clang) and a function-pointer dispatch returning the best available 1-ray-vs-4-AABB kernel for the host. Ships with a scalar reference implementation; SSE2, AVX2 and AVX-512 kernels follow in subsequent commits.

The OCCT_BVH_SIMD_FORCE environment variable allows forcing a specific level (clamped to actual CPU support) to exercise lower-level kernels on machines that support higher ones.

Adds 7 unit tests covering detection, dispatch, and the scalar kernel across all-hit, all-miss, mosaic partial-hit, ray-behind-origin, and parallel-ray-inside-slab cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implement RayBox4_SSE2 in BVH_ToolsSIMD_SSE2.hxx using the slab method with _mm_min_ps/_mm_max_ps. The kernel exploits the SSE2 spec that "if either operand is NaN the second is returned" to handle parallel rays without explicit branching: parallel-axis NaN is absorbed by the finite results of the other axes.

Wires the SSE2 kernel into BVH_SIMDDispatch::GetRayBox4 for SSE2/AVX2/AVX-512 detected levels (AVX2 and AVX-512 specialized kernels arrive in subsequent commits).

Adds 8 tests in BVH_ToolsSIMD_Test.cxx covering all-hit, all-miss, mosaic partial hits, ray-behind, parallel-ray-inside-slab, degenerate point-boxes, dispatch preference on x86, and 1000-case random consistency vs the scalar reference (mask exact, t-values within 1e-3).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add BVH_TraverseQuad::TraverseQuad, a free-function template that walks a quad-BVH tree with one ray and invokes a user-provided acceptor for each primitive in any leaf the ray intersects. Inner nodes load up to 4 child AABBs into a SoA record and dispatch to BVH::SIMD::GetRayBox4 so the best kernel for the host CPU (currently scalar or SSE2; AVX2 and AVX-512 land in following commits) is used automatically.

The traversal is intentionally narrow in scope: it handles only the ray-vs-AABB query, not the full BVH_Traverse selector framework (distance queries, pair traversal, metric-driven descent). This keeps the SIMD integration self-contained and avoids touching any production caller. Real users still go through BVH_Traverse::Select() on a binary tree; this new path is exercised only by tests in this commit and becomes opt-in for callers in a later PR.

Padding lanes for inner nodes with fewer than 4 children get an "always-miss" box (max < min on every axis), so the slab test rejects them naturally and no special control flow is needed in the kernel.

Adds 5 tests covering: empty tree, flat root with all-hit / all-miss / mosaic-hit rays, and ray-behind-origin rejection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implement RayBox4_AVX2 in BVH_ToolsSIMD_AVX2.hxx, gated on a per-function
__attribute__((target("avx2,fma"))) so the rest of TKMath stays buildable
on hosts without AVX2. The dispatcher routes to the AVX2 kernel only after
runtime CPU detection (cpuid + xgetbv) confirms support.
The kernel uses the same slab formula as SSE2, t = (box - origin) * invDir,
NOT the FMA-friendly t = invDir*box - invDir*origin. Reason: for parallel
rays whose origin coincides with a box face, idx*minx becomes 0*inf = NaN,
which the FMA-form fmsub then propagates differently than the scalar/SSE2
path. Random-consistency tests caught the divergence (mosaic mask 0b0101
came back as 0b1111 because parallel-axis NaN polluted finite axes).
The win over SSE2 therefore comes from VEX three-operand encoding and
better register pressure, not from FMA -- a real BVH8 with __m256 lanes
would unlock much more.
Adds 3 AVX2-specific tests gated on Detect() >= AVX2 (so they SKIP on
older hardware): all-hit, mosaic partial-hit, and 1000-case random
consistency vs the scalar reference. Existing SSE2/scalar/QBVH tests
continue to pass through the new dispatch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implement RayBox4_AVX512 in BVH_ToolsSIMD_AVX512.hxx, gated on a
per-function __attribute__((target("avx512f,avx512vl"))) so the rest of
TKMath remains buildable on hosts without AVX-512. The dispatcher
routes to it only after runtime detection (cpuid + xgetbv with the 0xE6
ZMM-state mask) confirms both ISA and OS support.
Stays on __m128 lanes -- a true BVH16 with __m512 would be needed to
unlock the full 16-lane width. The win at BVH4 fan-out is modest
(~5-15%): _mm_cmp_ps_mask returns the predicate directly into a mask
register, so the final hit-mask is one mask AND instead of the
movemask + shuffle dance the SSE2/AVX2 paths use. The mask AND uses
operator& on __mmask8 (an integer alias) rather than _kand_mask8,
because the latter requires AVX-512DQ which is a narrower target than
AVX-512F+VL.
Slab formula deliberately matches SSE2/AVX2 -- (box-origin)*invDir, not
the FMA form -- so parallel-ray NaN propagation stays consistent with
the scalar reference.
Adds 3 AVX-512-specific tests gated on Detect() >= AVX512 (so they
SKIP on non-AVX-512 hosts, including this commit's CI). The existing
SSE2/AVX2/scalar tests continue to verify dispatch on this hardware.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…x kernels
Add BVH_ToolsSIMD_Benchmark_Test.cxx with a single benchmark, gated
behind the gtest DISABLED_ prefix so CI does not measure timing
unreliably. Run it explicitly with:
OpenCascadeGTest --gtest_also_run_disabled_tests
--gtest_filter='BVH_ToolsSIMDBenchmark*'
The benchmark exercises 4096 random (ray, 4-AABB) cases x 4096
repetitions = ~16.8M invocations per kernel, then reports ns/call and
speedup vs the scalar reference. A "sink" accumulator over the returned
hit masks is printed to keep the optimizer from removing the loop.
Sample output on a Zen-class AVX2 host (no AVX-512):
Scalar 14.86 ns/call 1.00x
SSE2 1.92 ns/call 7.75x
AVX2 2.13 ns/call 6.98x
AVX-512 (skipped: CPU support absent)
Dispatched 2.08 ns/call 7.13x
Notes:
- All kernels return bit-identical hit masks (sink values match exactly).
- The 7-8x speedup over scalar is well above the 3.5-4x predicted from
pure SSE lane width, suggesting the scalar version's branchy
parallel-ray handling costs more than expected.
- AVX2 came out marginally slower than SSE2 on this CPU. Plausibly the
256-bit pipe wakeup cost is not amortized at BVH4 fan-out, and the
VEX-encoded SSE form already gets all the register-pressure benefit.
A real BVH8 with __m256 lanes would change this picture.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Introduce BVH_OctTree (analogous to BVH_QuadTree but with K=0..7) and a new BVH_Tree<T,N,BVH_BinaryTree>::CollapseToOctTree() method that gathers up to 2^3 = 8 representatives of a 3-level binary subtree into one BVH8 inner node. The 8-wide layout is the substrate that lets a single AVX2 __m256 ray-box test cover the whole fan-out -- the BVH4 we shipped earlier left half the AVX2 lanes idle and ended up only matching SSE2 in the microbenchmark.

The collapse uses BVH::GatherDescendants, a recursive helper that walks 3 levels down or stops earlier at any leaf, so inner nodes with 1..8 children are produced naturally without padding logic in the builder. The kernel will absorb the irregular fan-out via "always-miss" boxes on unused SIMD lanes (same trick as TraverseQuad).

Adds 5 BVH_OctTree_Test cases:
- empty / single-element trivial paths preserved
- all primitives present after collapse (set equality vs binary leaves)
- node count strictly less than the source binary tree
- every inner node reports child count in [1,8], and at least one reaches the saturated 8-children case for a 64-leaf balanced tree

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extend BVH_SIMDDispatch with the BVH8 (1-ray-vs-8-AABB) infrastructure: the BVH_Box8f_SoA / BVH_Ray8f_Splat data structures (32-byte aligned to suit later __m256 loads), a RayBox8_Fn function pointer, the scalar reference RayBox8_Scalar, and a GetRayBox8() dispatcher whose switch currently routes everything to the scalar path.

The slab-method body mirrors the 4-wide scalar kernel verbatim, just unrolled across 8 lanes; this gives the equivalence baseline against which the SSE2/AVX2/AVX-512 RayBox8 kernels in upcoming commits will be measured.

Adds 4 RayBox8 scalar tests (all-hit, all-miss, 8-bit mosaic 0b01010101, GetRayBox8 non-null) and a DISABLED_RayBox8_KernelComparison benchmark. First measurement on this host: 34.1 ns/call -- about 2x the BVH4 scalar time, as expected for double the lane count.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implement RayBox8_SSE2 by chaining the existing RayBox4_SSE2 twice (lower lanes 0..3, upper lanes 4..7). On 128-bit hardware this is the ceiling, since SSE2 cannot consume 8 lanes in one instruction; the value of having a dedicated 8-wide entry point is uniform dispatch -- callers can target one BVH8 traversal API and let the dispatcher pick the best implementation per host. AVX2's __m256 will give the genuine 8-lane single-instruction kernel in the next commit.

Wires the SSE2 kernel into GetRayBox8 for SSE2/AVX2/AVX-512 detected levels (those higher levels stay on SSE2 until the AVX2 and AVX-512 RayBox8 kernels arrive).

Adds a RayBox8 random consistency test (500 cases, SSE2 vs scalar) and extends the DISABLED_RayBox8 benchmark with the SSE2 row. First measurement on this host:

Scalar 34.12 ns/call 1.00x
SSE2 3.93 ns/call 8.68x

Per-lane throughput matches BVH4 SSE2 closely (0.49 vs 0.48 ns/lane), confirming the implementation is just the 4-wide kernel doubled.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implement RayBox8_AVX2 -- the kernel BVH8 was built for. A single 256-bit __m256 holds all 8 lanes, so the entire BVH8 fan-out is tested in one slab pass instead of the SSE2 path's two 4-wide passes.

Adopts tavianator's sign-based corner select (Fast, Branchless Ray/Bounding Box Intersections, Part 3, 2022): the slab "near" and "far" faces are determined by the sign of invDir per axis, which is a property of the ray, not the box. Pre-selecting near/far via _mm256_blendv_ps eliminates one extra mul+min+max per axis (3 instructions per axis vs 4 in the symmetric formulation).

NaN handling for parallel rays (invDir = inf, origin on box face) keeps the same operand ordering as SSE2 -- max(max(tNearY, tNearZ), tNearX) and the analogous min for tLeave -- so the parallel-axis NaN gets absorbed by the finite results of the other two axes via SSE's "min/max returns the second operand on NaN" rule.

Wires AVX2 into GetRayBox8 ahead of SSE2. Microbenchmark on this Zen-class AVX2 host:

Scalar 34.11 ns/call 1.00x
SSE2 3.92 ns/call 8.70x
AVX2 3.09 ns/call 11.04x <-- AVX2 finally beats SSE2

Per-lane throughput drops from 0.49 ns/lane (SSE2 BVH8) to 0.39 ns/lane (AVX2 BVH8), a 25% improvement. Compare to BVH4, where AVX2 was *slower* than SSE2 (2.13 ns vs 1.92 ns) because the upper 128 bits of __m256 sat idle. This confirms the BVH8 layout was the missing piece -- the SIMD kernels now pay back the dispatch infrastructure.

Adds a RayBox8 random consistency test (500 cases, AVX2 vs scalar) and extends the DISABLED_RayBox8 benchmark with the AVX2 row.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implement RayBox8_AVX512: same slab + sign-based corner select as RayBox8_AVX2, but the final hit predicate is computed straight into a __mmask8 via _mm256_cmp_ps_mask (AVX-512F + VL), saving the movemask round-trip the AVX2 path needs. At BVH8 fan-out the saving is small (a few cycles); the same pattern would matter more at BVH16 with __m512.

Wires AVX-512 into GetRayBox8 ahead of AVX2 and SSE2.

Adds a RayBox8 random consistency test (500 cases, AVX-512 vs scalar) gated on Detect() >= AVX512 -- it SKIPs on hosts without the ISA, including this commit's CI box, but it builds and is exercised on AVX-512-capable hardware. Extends the DISABLED_RayBox8 benchmark with the AVX-512 row, likewise gated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add BVH_TraverseOct::TraverseOct, the 8-wide analogue of TraverseQuad. Walks an OBVH built via BVH_BinaryTree::CollapseToOctTree and invokes a user-provided acceptor for each primitive in any leaf the ray hits. Inner nodes load up to 8 child AABBs into a SoA record and dispatch to BVH::SIMD::GetRayBox8 so the best kernel for the host (currently scalar or SSE2 on this CI box; AVX2 / AVX-512 where available) does the test.

Mirrors TraverseQuad's design verbatim except for the lane count (8 instead of 4), padding-mask width (8 bits), and stack capacity (MaxTreeDepth * 8). Padding lanes get an "always-miss" box so inner nodes with fewer than 8 children pass through the SIMD kernel without any special control flow.

Adds 5 tests covering: empty tree, flat root with all-hit / all-miss / 8-bit-mosaic-hit rays, and ray-behind-origin rejection. All pass and exercise the AVX2 RayBox8 kernel on this host.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Consolidate the BVH8 implementation behind a single template<int W> abstraction so adding BVH16 (a follow-up commit) reuses the traversal and dispatch infrastructure instead of duplicating it. The new abstraction lives in:

- BVH_WideTree.hxx: tag struct + BVH_Tree<T,N,BVH_WideTree<W>> spec
- BVH_TraverseWide.hxx: TraverseWide<W> template + W-parametric helpers

Existing BVH8 surface API is renamed (no aliases, full rename):

- BVH_OctTree -> BVH_WideTree<8>
- TraverseOct -> TraverseWide<8>
- CollapseToOctTree -> CollapseToWide<8> (template, also <16>-ready)
- BVH_Box8f_SoA -> BVH_BoxNf_SoA<8>
- BVH_Ray8f_Splat -> BVH_RayNf_Splat<8>
- RayBox8_Scalar -> RayBoxN_Scalar<8> (function template)
- RayBox8_AVX2 -> RayBoxN_AVX2_8 (signature uses BVH_*Nf<8>)
- GetRayBox8 / RayBox8_Fn -> GetRayBoxN<8> / RayBoxN_Fn<8>

BVH4 is intentionally untouched -- BVH_QuadTree, TraverseQuad, RayBox4_* remain as before. SSE2 (xmm 4-wide) is BVH4's natural fit; folding it into the template gains nothing.

SIMD kernel pruning per "right tool for right job". Past benchmarks (Zen AVX2 host, commits 819c0cf / 20487e4) showed:

- BVH4 SSE2: 1.92 ns (7.75x), AVX2: 2.13 ns (6.98x) -- AVX2 LOSES (256-bit pipe wakeup + upper 128 idle eats the lane-width gain)
- BVH8 SSE2: 3.92 ns (8.70x), AVX2: 3.09 ns (11.04x) -- AVX2 wins
- BVH8 AVX-512: "saving is a few cycles" (commit d8ee58e's own observation) because that kernel used __m256 + mask, NOT __m512.

Conclusion: kernel width should match BVH width should match register width (xmm/ymm/zmm). The mismatched kernels were dead weight, so:

- Drop RayBox8_SSE2 (BVH_ToolsSIMD_SSE2.hxx)
- Drop RayBox8_AVX512 (BVH_ToolsSIMD_AVX512.hxx)

W=8 dispatch is now AVX2 -> Scalar fallback only. W=16 dispatch is Scalar fallback for now; AVX-512 (__m512 zmm 16-lane) is the natural-fit kernel and lands in a follow-up commit.
Tests:
- BVH_OctTree_Test.cxx deleted, replaced by BVH_WideTree8_Test.cxx
- BVH_TraverseOct_Test.cxx deleted, replaced by BVH_TraverseWide8_Test.cxx
- BVH_SIMDDispatch_Test.cxx updated to RayBoxN_*<8>
- BVH_ToolsSIMD_Test.cxx drops the RayBox8 SSE2/AVX-512 random-consistency tests, keeps AVX2 BVH8 (renamed RayBoxN_AVX2_8)
- BVH_ToolsSIMD_Benchmark_Test.cxx RayBox8 bench replaced by RayBoxN<8> (Scalar + AVX2 only; SSE2 / AVX-512 BVH8 rows removed)
- All 218 BVH-related GTests pass on the AVX2 host (3 AVX-512 tests skip due to absent ISA, expected behaviour).

git history is intentionally not preserved (delete + create instead of git mv) -- template-ization changes content beyond rename-detection's 50% similarity threshold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Instantiates the BVH_WideTree<W> template at W=16, which:
- Collapses every 4th binary level (2^4 = 16 children per inner node)
via the existing CollapseToWide<W> helper.
- Uses the existing template RayBoxN_Scalar<W> for ray-box tests.
- Routes traversal through the existing TraverseWide<W>.
No new BVH16-only infrastructure is needed -- everything is template
instantiation of code already shipped in the BVH_WideTree<W> commit.
The natural-fit AVX-512 (__m512 zmm 16-lane) kernel will land in a
follow-up commit; until then BVH16 dispatches to scalar fallback.
Tests:
- BVH_WideTree16_Test.cxx mirrors BVH_WideTree8_Test.cxx with 256 primitives (2^8 leaves, so the 4-level collapse can saturate at least one BVH16 inner node).
- BVH_TraverseWide16_Test.cxx mirrors BVH_TraverseWide8_Test.cxx with a flat 16-leaf root: AllSixteenHit, AllMiss, 16BitMosaic, RayBehindAllBoxes.
- BVH_SIMDDispatch_Test.cxx adds 4 ScalarRayBoxN16_* tests + a GetRayBoxN16ReturnsNonNull check.
All 14 new tests pass; 218 previously-passing tests still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This is the kernel BVH16 was built for: a single __m512 holds all 16 fp32 lanes, so one slab pass exhausts the AVX-512 register width with zero waste. Per-lane work genuinely halves vs BVH8/AVX2.

Critically, this kernel uses zmm intrinsics throughout, NOT __m256 + mask tricks. The previous (now-removed) BVH8 AVX-512 kernel made that mistake: it stayed on __m256 lanes and used AVX-512 only to skip the movemask round-trip, which gained "a few cycles" per call (commit d8ee58e's own admission). The result was AVX-512 BVH8 ~= AVX2 BVH8, no real win. The same shortcut here would defeat the entire purpose of BVH16.

Sanity-checked via objdump that the emitted instructions actually use zmm registers:

vmovups 0xc0(%rdi),%zmm10
vmovups 0x100(%rdi),%zmm9
vcmplt_oqps %zmm7,%zmm10,%k3
...

Algorithm: same sign-based corner select as RayBoxN_AVX2_8 (tavianator), now with _mm512_mask_blend_ps driven by a sign-extracted __mmask16. The final hit predicate goes straight into __mmask16 via _mm512_cmp_ps_mask; __mmask16 is an int alias, so the return is a single zext.

Target attribute: target("avx512f") only (no vl). The 512-bit-only path is sufficient -- vl extensions would only matter if we mixed zmm with narrower vector ops, which we don't.

NaN handling for parallel rays (invDir = inf, origin on box face) keeps the same operand ordering as SSE2/AVX2 -- _mm512_max_ps / _mm512_min_ps on x86 follow the "second operand wins on NaN" rule, so a parallel-axis NaN is dominated by finite results from the other two axes.

Dispatch: GetRayBoxN<16>() now returns &RayBoxN_AVX512_16 on AVX-512 hosts, scalar fallback otherwise.

Tests:
- BVH_ToolsSIMD_Test.cxx adds AVX512_RayBoxN16_RandomConsistencyVsScalar (1000 random ray+16-AABB cases vs the scalar reference, ULP-near match). SKIPs cleanly on hosts without AVX-512.
- All previously-passing tests continue to pass on the AVX2-only host.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends BVH_ToolsSIMD_Benchmark_Test.cxx with DISABLED_RayBoxN16_KernelComparison: scalar reference + AVX-512 (when supported on the host). Same shape as the existing RayBoxN<8> bench (4096 random ray+16-AABB cases x 4096 repeats = 16.78M invocations).
The bench includes a sanity gate (EXPECT_LT(aAVX512.nsPerCall, aScalar.nsPerCall * 0.5)) that asserts AVX-512 BVH16 actually beats scalar by the expected 2x+ margin -- a smoke test against the "silently using ymm not zmm" failure mode that hit the previous BVH8 AVX-512 attempt.
Per "right tool for right job", BVH16 has only the AVX-512 SIMD path plus scalar fallback (no SSE2/AVX2 BVH16 wrappers). On hosts without AVX-512, the bench just runs the Scalar and Dispatched (= scalar) rows.
Sample output on a Zen 3 (AVX2-only) host:
=== RayBoxN<8> microbenchmark (BVH8 fan-out) ===
Scalar 31.51 ns/call 1.00x
AVX2 3.17 ns/call 9.95x
Dispatched 3.23 ns/call 9.74x
=== RayBoxN<16> microbenchmark (BVH16 fan-out) ===
Scalar 63.38 ns/call 1.00x
AVX-512 (skipped: CPU support absent)
Dispatched 66.05 ns/call 0.96x
Per-lane on this host:
BVH8/AVX2 = 0.40 ns/lane (lower bound for the kernel cost)
BVH16/Scalar = 3.96 ns/lane (10x worse per lane vs SIMD)
On AVX-512 hardware, the AVX-512 BVH16 row should land near
~0.4 ns/lane * 16 = ~6 ns/call (~2x the AVX2 BVH8 number),
demonstrating the per-lane savings the zmm width unlocks.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two fixes for CI failures observed on jijinbei/OCCT PR #4:

1. macOS-15 (Apple Silicon) build failure: __builtin_cpu_init() and __builtin_cpu_supports() are x86-only GCC/Clang builtins. On ARM64 they error with "builtin is not supported on this target". Guard the GCC/Clang detect path with __x86_64__ || __i386__ so ARM falls through to the scalar fallback (already the right behaviour -- no x86 SIMD kernels exist on ARM).

2. Code formatting check failure: CI runs a stricter clang-format (clang-format-18 default style) than the local pre-commit hook used. Apply the diff produced by the CI's "format-patch" artifact verbatim. Touches BVH_TraverseQuad.hxx, BVH_TraverseQuad_Test.cxx, and the detectImpl() function in BVH_SIMDDispatch.cxx (whitespace only).

Local re-run after both fixes: 381/385 BVH tests pass on the AVX2-only Zen 3 host (4 AVX-512 tests SKIP as expected, no regressions).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Apple Silicon Clang runs -Werror -Wunused-function and flagged the
MakeRaySplat / SetBox helpers in BVH_ToolsSIMD_Test.cxx as unused.
On macOS ARM64:
- BVH_HAS_SSE2_KERNEL is not defined (SSE2 is x86-only)
- The entire block of SSE2 RayBox4 tests that *used* these helpers is
#if'd out
- -> every caller disappears at preprocessing time, and the helpers
become dead code
Annotate both with [[maybe_unused]] so they survive the -Werror guard
when their consumers are conditionally compiled out. On x86 hosts the
attribute is a no-op because the SSE2 block is present and uses them.
Fixes the macOS build failure observed on CI run 24898268476.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds runtime-dispatched SIMD ray-vs-AABB kernels and three wide-BVH traversal drivers (BVH4 / BVH8 / BVH16) to the BVH module.
BVH_BinaryTree.hxx gets one new template method (CollapseToWide<W>) and a helper, and two FILES.cmake listings register the new files.

What's new
Dispatch layer
- BVH_SIMDDispatch.{hxx,cxx} — runtime CPU detection (cpuid + xgetbv on MSVC, __builtin_cpu_supports on GCC/Clang for x86; scalar fallback on ARM and other targets). BVH::SIMD::Detect() is computed once per process. The OCCT_BVH_SIMD_FORCE env var allows forcing a level for testing.
- BVH_ToolsSIMD_{SSE2,AVX2,AVX512}.hxx — per-ISA kernel headers, gated with a per-function target attribute on GCC/Clang so the rest of TKMath stays buildable on hosts without those ISAs.

BVH4 (uses existing BVH_QuadTree)

- RayBox4_{Scalar,SSE2,AVX2,AVX512} — 1-ray-vs-4-AABB slab-test kernels.
- BVH_TraverseQuad.hxx — TraverseQuad<> driver. Inner nodes load up to 4 child AABBs into a SoA record and dispatch to BVH::SIMD::GetRayBox4().

BVH_WideTree<int W> abstraction (BVH8 / BVH16)

- BVH_WideTree.hxx — single template tag/specialization replacing what would otherwise be two near-identical ad-hoc classes: BVH_Tree<T,N,BVH_WideTree<W>> for W=8 and W=16.
- BVH_TraverseWide.hxx — single TraverseWide<W> driver covering both fan-outs.
- BVH_BinaryTree::CollapseToWide<W>() — single template builder using a new BVH::GatherDescendants(node, log2(W), …) helper. W=8 walks 3 binary levels (2³ = 8 children); W=16 walks 4 (2⁴ = 16 children).

SIMD strategy: one natural-fit kernel per width

A __m256+mask BVH8 AVX-512 kernel gains only "a few cycles" (measured; not shipped). Mismatched kernels (e.g. SSE2-on-BVH8 with two 4-wide passes, or AVX-512-on-BVH8 keeping zmm half-empty) were intentionally not added.
Tests
385 BVH gtests under src/FoundationClasses/TKMath/GTests/:
- Benchmarks (BVH_ToolsSIMDBenchmark.DISABLED_RayBox*_KernelComparison) measuring ns/call per kernel × width.
- AVX-512-specific tests GTEST_SKIP cleanly on hosts without AVX-512.

Side-branch cross-validation

In addition to the unit tests above, the equivalence between the BVH4, BVH8 and BVH16 traversals was independently verified on a sibling branch using a 3-way visual cross-check: a DISABLED gtest builds a deterministic random scene of 256 AABBs, runs all three traversals against the same ray, asserts via EXPECT_EQ that the three hit sets are identical, and emits a self-contained HTML viewer (Three.js, 3D scene, slider for live N, embedded benchmark table) for inspection. The 3-way EXPECT_EQ passed on every host tested. The HTML visualizer is kept on a separate developer-aid branch and is intentionally not included in this PR to keep the diff focused on the kernel/template work.

Performance evidence (measured on Google Cloud)
ns per 1-ray-vs-W-AABB kernel call, 4096 cases × 4096 repeats per kernel.
Benchmark host: Google Cloud Compute Engine c3-highcpu-4 (Intel Xeon Platinum 8481C, Sapphire Rapids, 4 vCPU, asia-northeast1) — rented specifically for this PR. All 385 BVH gtests pass on this host; on AVX2-only hosts the 4 AVX-512 tests SKIP cleanly.

(Bold = natural-fit kernel for that width; — = combinations intentionally not shipped.)
[ns/call table not preserved in this text extract]

Per-lane (ns), comparing the best kernel of each width:
[per-lane table not preserved; one row noted the __m256+mask shortcut]

Compatibility
- BVH_BinaryTree.hxx (one new template method CollapseToWide<W> and the BVH::GatherDescendants helper) and two FILES.cmake listings are the only existing files touched.
- The Detect() function is properly guarded so ARM macOS (Apple Silicon) builds cleanly — __builtin_cpu_* is #if'd out outside x86, and the fallback path returns Scalar.

Implementation notes

- Slab formula is (box - origin) * invDir, NOT the FMA-friendly form. Parallel-ray edge cases (origin coincides with box face → 0 * inf = NaN) propagate differently between formulations; the random-consistency tests pin every SIMD kernel to the scalar behaviour.
- Padding lanes get an "always-miss" box: max < min on every axis, so the SIMD slab test rejects them naturally with no special control flow on the kernel side.
- Tests GTEST_SKIP rather than compile-out for unsupported ISAs, so the test binary is identical across hosts.

Build / CI
Verified on the contributor's fork CI matrix: Linux (GCC 13 + Clang 18, AVX2-only host), Windows MSVC (x64 / ARM64), macOS Apple Silicon (Clang, scalar fallback). The full AVX-512 BVH16 zmm path was additionally validated on Google Cloud Sapphire Rapids hardware (see Performance evidence above).
CLA ID
1142