
Foundation Classes, Math - SIMD-accelerated BVH4/BVH8/BVH16 ray-AABB traversal #1232

Open
jijinbei wants to merge 18 commits into Open-Cascade-SAS:master from jijinbei:bvh-w4-w8-w16-simd

Conversation


@jijinbei jijinbei commented Apr 24, 2026

Summary

Adds runtime-dispatched SIMD ray-vs-AABB kernels and three wide-BVH traversal drivers (BVH4 / BVH8 / BVH16) to the BVH module.

  • 19 files added; only two existing files receive surgical edits: BVH_BinaryTree.hxx gains one new template method (CollapseToWide<W>) plus a helper, and two FILES.cmake listings register the new files.
  • +3601 / −0 lines.
  • No public API change, no existing test regresses.

What's new

Dispatch layer

  • BVH_SIMDDispatch.{hxx,cxx} — runtime CPU detection (cpuid + xgetbv on MSVC, __builtin_cpu_supports on GCC/Clang for x86; scalar fallback on ARM and other targets). BVH::SIMD::Detect() is computed once per process. OCCT_BVH_SIMD_FORCE env var allows forcing a level for testing.
  • BVH_ToolsSIMD_{SSE2,AVX2,AVX512}.hxx — per-ISA kernel headers, gated with per-function target attribute on GCC/Clang so the rest of TKMath stays buildable on hosts without those ISAs.

BVH4 (uses existing BVH_QuadTree)

  • RayBox4_{Scalar,SSE2,AVX2,AVX512} — 1-ray-vs-4-AABB slab-test kernels.
  • BVH_TraverseQuad.hxx — TraverseQuad<> driver. Inner nodes load up to 4 child AABBs into a SoA record and dispatch to BVH::SIMD::GetRayBox4().

BVH_WideTree<int W> abstraction (BVH8 / BVH16)

  • BVH_WideTree.hxx — single template tag/specialization replacing what would otherwise be two near-identical ad-hoc classes. BVH_Tree<T,N,BVH_WideTree<W>> for W=8 and W=16.
  • BVH_TraverseWide.hxx — single TraverseWide<W> driver covering both fan-outs.
  • BVH_BinaryTree::CollapseToWide<W>() — single template builder using a new BVH::GatherDescendants(node, log2(W), …) helper. W=8 walks 3 binary levels (2³ = 8 children); W=16 walks 4 (2⁴ = 16 children).

SIMD strategy: one natural-fit kernel per width

BVH width   Natural ISA   Register               Why
BVH4        SSE2          xmm 128-bit (4 fp32)   Empirically faster than AVX2 here — the upper 128 bits of ymm sit idle and the 256-bit pipe wakeup cost outweighs the wider scheduling.
BVH8        AVX2          ymm 256-bit (8 fp32)   First width that saturates ymm. AVX-512 with __m256 + mask gains only "a few cycles" (measured; not shipped).
BVH16       AVX-512       zmm 512-bit (16 fp32)  First width that uses the full zmm. Per-lane work genuinely halves vs BVH8/AVX2.

Mismatched kernels (e.g. SSE2-on-BVH8 with two 4-wide passes, or AVX-512-on-BVH8 keeping zmm half-empty) were intentionally not added.

Tests

385 BVH gtests under src/FoundationClasses/TKMath/GTests/:

  • Functional per width — empty tree, single primitive, all-hit ray, all-miss ray, mosaic-hit ray, ray-behind-origin rejection, parallel-ray-inside-slab.
  • Random consistency — 1000 random ray + W-AABB cases per kernel, comparing SIMD output to the scalar reference (mask exact, t-values within 1e-3). The AVX-512 BVH16 variant is the acceptance gate for the zmm path.
  • DISABLED microbenchmark (BVH_ToolsSIMDBenchmark.DISABLED_RayBox*_KernelComparison) measuring ns/call per kernel × width.

AVX-512-specific tests GTEST_SKIP cleanly on hosts without AVX-512.

Side-branch cross-validation

In addition to the unit tests above, the equivalence between the BVH4, BVH8 and BVH16 traversals was independently verified on a sibling branch using a 3-way visual cross-check: a DISABLED gtest builds a deterministic random scene of 256 AABBs, runs all three traversals against the same ray, asserts via EXPECT_EQ that the three hit sets are identical, and emits a self-contained HTML viewer (Three.js, 3D scene, slider for live N, embedded benchmark table) for inspection. The 3-way EXPECT_EQ passed on every host tested. The HTML visualizer is kept on a separate developer-aid branch and is intentionally not included in this PR to keep the diff focused on the kernel/template work.

Performance evidence (measured on Google Cloud)

ns per 1-ray-vs-W-AABB kernel call, 4096 cases × 4096 repeats per kernel.

Benchmark host: Google Cloud Compute Engine c3-highcpu-4 (Intel Xeon Platinum 8481C, Sapphire Rapids, 4 vCPU, asia-northeast1) — rented specifically for this PR. All 385 BVH gtests pass on this host; on AVX2-only hosts the 4 AVX-512 tests SKIP cleanly.

BVH width   Scalar   SSE2   AVX2   AVX-512   Best      Speedup
BVH4         31.22   3.20   3.55   3.40      SSE2       9.77×
BVH8         75.68     —    5.70     —       AVX2      13.29×
BVH16       151.53     —      —    7.47      AVX-512   20.28×

(The Best column marks the natural-fit kernel for that width; — marks combinations intentionally not shipped.)

Per-lane (ns), comparing the best kernel of each width:

Width × kernel    Per-lane (ns)   Note
BVH16 / AVX-512   0.467           winner — confirms the kernel really uses zmm, not the __m256+mask shortcut
BVH8 / AVX2       0.71
BVH4 / SSE2       0.80

Compatibility

  • No public API change, no existing method modified. The only existing files touched are BVH_BinaryTree.hxx (one new template method CollapseToWide<W> and the BVH::GatherDescendants helper) and two FILES.cmake listings.
  • Production code does not yet call the new traversal drivers — they are exercised only by the new tests, leaving callsite migration as opt-in for future PRs.
  • Cross-platform safe: the Detect() function is properly guarded so ARM macOS (Apple Silicon) builds cleanly — __builtin_cpu_* is #if'd out outside x86, and the fallback path returns Scalar.

Implementation notes

  • Slab formula matches scalar exactly. All SIMD kernels use (box - origin) * invDir, NOT the FMA-friendly form. Parallel-ray edge cases (origin coincides with box face → 0 * inf = NaN) propagate differently between formulations; the random-consistency tests pin every SIMD kernel to the scalar behaviour.
  • Padding lanes are always-miss boxes. Inner nodes with fewer than W children get padding lanes filled with max < min on every axis, so the SIMD slab test rejects them naturally with no special control flow on the kernel side.
  • GTEST_SKIP rather than compile-out for unsupported ISAs, so the test binary is identical across hosts.

Build / CI

Verified on the contributor's fork CI matrix: Linux (GCC 13 + Clang 18, AVX2-only host), Windows MSVC (x64 / ARM64), macOS Apple Silicon (Clang, scalar fallback). The full AVX-512 BVH16 zmm path was additionally validated on Google Cloud Sapphire Rapids hardware (see Performance evidence above).

CLA ID

1142

jijinbei and others added 18 commits April 25, 2026 03:44
Introduce BVH_SIMDDispatch with runtime CPU detection (__cpuid+xgetbv on
MSVC, __builtin_cpu_supports on GCC/Clang) and a function pointer
dispatch returning the best available 1-ray-vs-4-AABB kernel for the
host. Ships with a scalar reference implementation; SSE2, AVX2 and
AVX-512 kernels follow in subsequent commits.

The OCCT_BVH_SIMD_FORCE environment variable allows forcing a specific
level (clamped to actual CPU support) to exercise lower-level kernels
on machines that support higher ones.

Adds 7 unit tests covering detection, dispatch, and the scalar kernel
across all-hit, all-miss, mosaic partial hits, ray-behind-origin, and
parallel-ray-inside-slab cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implement RayBox4_SSE2 in BVH_ToolsSIMD_SSE2.hxx using the slab method
with _mm_min_ps/_mm_max_ps. The kernel exploits the SSE2 spec that
"if either operand is NaN the second is returned" to handle parallel
rays without explicit branching: parallel-axis NaN is absorbed by the
finite results of the other axes.

Wires the SSE2 kernel into BVH_SIMDDispatch::GetRayBox4 for SSE2/AVX2/
AVX-512 detected levels (AVX2 and AVX-512 specialized kernels arrive
in subsequent commits).

Adds 8 tests in BVH_ToolsSIMD_Test.cxx covering all-hit, all-miss,
mosaic partial hits, ray-behind, parallel-ray-inside-slab, degenerate
point-boxes, dispatch preference on x86, and 1000-case random
consistency vs the scalar reference (mask exact, t-values within 1e-3).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add BVH_TraverseQuad::TraverseQuad, a free-function template that walks
a quad-BVH tree with one ray and invokes a user-provided acceptor for
each primitive in any leaf the ray intersects. Inner nodes load up to
4 child AABBs into a SoA record and dispatch to BVH::SIMD::GetRayBox4
so the best kernel for the host CPU (currently scalar or SSE2; AVX2
and AVX-512 land in following commits) is used automatically.

The traversal is intentionally narrow in scope: it handles only the
ray-vs-AABB query, not the full BVH_Traverse selector framework
(distance queries, pair traversal, metric-driven descent). This keeps
the SIMD integration self-contained and avoids touching any production
caller. Real users still go through BVH_Traverse::Select() on a binary
tree; this new path is exercised only by tests in this commit and
becomes opt-in for callers in a later PR.

Padding lanes for inner nodes with fewer than 4 children get an
"always-miss" box (max < min on every axis), so the slab test rejects
them naturally and no special control flow is needed in the kernel.

Adds 5 tests covering: empty tree, flat root with all-hit / all-miss /
mosaic-hit rays, and ray-behind-origin rejection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implement RayBox4_AVX2 in BVH_ToolsSIMD_AVX2.hxx, gated on a per-function
__attribute__((target("avx2,fma"))) so the rest of TKMath stays buildable
on hosts without AVX2. The dispatcher routes to the AVX2 kernel only after
runtime CPU detection (cpuid + xgetbv) confirms support.

The kernel uses the same slab formula as SSE2, t = (box - origin) * invDir,
NOT the FMA-friendly t = invDir*box - invDir*origin. Reason: for parallel
rays whose origin coincides with a box face, idx*minx becomes 0*inf = NaN,
which the FMA-form fmsub then propagates differently than the scalar/SSE2
path. Random-consistency tests caught the divergence (mosaic mask 0b0101
came back as 0b1111 because parallel-axis NaN polluted finite axes).
The win over SSE2 therefore comes from VEX three-operand encoding and
better register pressure, not from FMA -- a real BVH8 with __m256 lanes
would unlock much more.

Adds 3 AVX2-specific tests gated on Detect() >= AVX2 (so they SKIP on
older hardware): all-hit, mosaic partial-hit, and 1000-case random
consistency vs the scalar reference. Existing SSE2/scalar/QBVH tests
continue to pass through the new dispatch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implement RayBox4_AVX512 in BVH_ToolsSIMD_AVX512.hxx, gated on a
per-function __attribute__((target("avx512f,avx512vl"))) so the rest of
TKMath remains buildable on hosts without AVX-512. The dispatcher
routes to it only after runtime detection (cpuid + xgetbv with the 0xE6
ZMM-state mask) confirms both ISA and OS support.

Stays on __m128 lanes -- a true BVH16 with __m512 would be needed to
unlock the full 16-lane width. The win at BVH4 fan-out is modest
(~5-15%): _mm_cmp_ps_mask returns the predicate directly into a mask
register, so the final hit-mask is one mask AND instead of the
movemask + shuffle dance the SSE2/AVX2 paths use. The mask AND uses
operator& on __mmask8 (an integer alias) rather than _kand_mask8,
because the latter requires AVX-512DQ which is a narrower target than
AVX-512F+VL.

Slab formula deliberately matches SSE2/AVX2 -- (box-origin)*invDir, not
the FMA form -- so parallel-ray NaN propagation stays consistent with
the scalar reference.

Adds 3 AVX-512-specific tests gated on Detect() >= AVX512 (so they
SKIP on non-AVX-512 hosts, including this commit's CI). The existing
SSE2/AVX2/scalar tests continue to verify dispatch on this hardware.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…x kernels

Add BVH_ToolsSIMD_Benchmark_Test.cxx with a single benchmark, gated
behind the gtest DISABLED_ prefix so CI does not measure timing
unreliably. Run it explicitly with:
  OpenCascadeGTest --gtest_also_run_disabled_tests
                   --gtest_filter='BVH_ToolsSIMDBenchmark*'

The benchmark exercises 4096 random (ray, 4-AABB) cases x 4096
repetitions = ~16.8M invocations per kernel, then reports ns/call and
speedup vs the scalar reference. A "sink" accumulator over the returned
hit masks is printed to keep the optimizer from removing the loop.

Sample output on a Zen-class AVX2 host (no AVX-512):
  Scalar          14.86 ns/call   1.00x
  SSE2             1.92 ns/call   7.75x
  AVX2             2.13 ns/call   6.98x
  AVX-512        (skipped: CPU support absent)
  Dispatched       2.08 ns/call   7.13x

Notes:
- All kernels return bit-identical hit masks (sink values match exactly).
- The 7-8x speedup over scalar is well above the 3.5-4x predicted from
  pure SSE lane width, suggesting the scalar version's branchy
  parallel-ray handling costs more than expected.
- AVX2 came out marginally slower than SSE2 on this CPU. Plausibly the
  256-bit pipe wakeup cost is not amortized at BVH4 fan-out, and the
  VEX-encoded SSE form already gets all the register-pressure benefit.
  A real BVH8 with __m256 lanes would change this picture.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Introduce BVH_OctTree (analogous to BVH_QuadTree but with K=0..7) and a
new BVH_Tree<T,N,BVH_BinaryTree>::CollapseToOctTree() method that gathers
up to 2^3 = 8 representatives of a 3-level binary subtree into one BVH8
inner node. The 8-wide layout is the substrate that lets a single AVX2
__m256 ray-box test cover the whole fan-out -- the BVH4 we shipped
earlier left half the AVX2 lanes idle and ended up only matching SSE2 in
the microbenchmark.

The collapse uses BVH::GatherDescendants, a recursive helper that walks
3 levels down or stops earlier at any leaf, so inner nodes ranging
1..8 children are produced naturally without padding logic in the
builder. The kernel will absorb the irregular fan-out via "always-miss"
boxes on unused SIMD lanes (same trick as TraverseQuad).

Adds 5 BVH_OctTree_Test cases:
- empty / single-element trivial paths preserved
- all primitives present after collapse (set equality vs binary leaves)
- node count strictly less than the source binary tree
- every inner node reports child count in [1,8] and at least one
  reaches the saturated 8-children case for a 64-leaf balanced tree

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extend BVH_SIMDDispatch with the BVH8 (1-ray-vs-8-AABB) infrastructure:
the BVH_Box8f_SoA / BVH_Ray8f_Splat data structures (32-byte aligned to
suit later __m256 loads), a RayBox8_Fn function pointer, the scalar
reference RayBox8_Scalar, and a GetRayBox8() dispatcher whose switch
currently routes everything to the scalar path.

The slab-method body mirrors the 4-wide scalar kernel verbatim, just
unrolled across 8 lanes; this gives the equivalence baseline against
which the SSE2/AVX2/AVX-512 RayBox8 kernels in upcoming commits will be
measured.

Adds 4 RayBox8 scalar tests (all-hit, all-miss, 8-bit mosaic 0b01010101,
GetRayBox8 non-null) and a DISABLED_RayBox8_KernelComparison benchmark.
First measurement on this host: 34.1 ns/call -- about 2x the BVH4 scalar
time, as expected for double the lane count.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implement RayBox8_SSE2 by chaining the existing RayBox4_SSE2 twice
(lower lanes 0..3, upper lanes 4..7). On 128-bit hardware this is the
ceiling since SSE2 cannot consume 8 lanes in one instruction; the value
of having a dedicated 8-wide entry point is uniform dispatch -- callers
can target one BVH8 traversal API and let the dispatcher pick the best
implementation per host. AVX2's __m256 will give the genuine 8-lane
single-instruction kernel in the next commit.

Wires the SSE2 kernel into GetRayBox8 for SSE2/AVX2/AVX-512 detected
levels (those higher levels stay on SSE2 until the AVX2 and AVX-512
RayBox8 kernels arrive).

Adds RayBox8 random consistency test (500 cases, SSE2 vs scalar) and
extends the DISABLED_RayBox8 benchmark with the SSE2 row.

First measurement on this host:
  Scalar       34.12 ns/call   1.00x
  SSE2          3.93 ns/call   8.68x

Per-lane throughput matches BVH4 SSE2 closely (0.49 vs 0.48 ns/lane),
confirming the implementation is just the 4-wide kernel doubled.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…__m256)

Implement RayBox8_AVX2 -- the kernel BVH8 was built for. A single 256-bit
__m256 holds all 8 lanes, so the entire BVH8 fan-out is tested in one
slab pass instead of the SSE2 path's two 4-wide passes.

Adopts tavianator's sign-based corner select (Fast, Branchless Ray/
Bounding Box Intersections, Part 3, 2022): the slab "near" and "far"
faces are determined by the sign of invDir per axis, which is a property
of the ray, not the box. Pre-selecting near/far via _mm256_blendv_ps
eliminates one extra mul+min+max per axis (3 instructions per axis vs 4
in the symmetric formulation).

NaN handling for parallel rays (invDir = inf, origin on box face) keeps
the same operand ordering as SSE2 -- max(max(tNearY, tNearZ), tNearX)
and the analogous min for tLeave -- so the parallel-axis NaN gets
absorbed by the finite results of the other two axes via SSE's
"min/max returns the second operand on NaN" rule.

Wires AVX2 into GetRayBox8 ahead of SSE2.

Microbenchmark on this Zen-class AVX2 host:
  Scalar       34.11 ns/call   1.00x
  SSE2          3.92 ns/call   8.70x
  AVX2          3.09 ns/call  11.04x   <-- AVX2 finally beats SSE2

Per-lane throughput drops from 0.49 ns/lane (SSE2 BVH8) to 0.39 ns/lane
(AVX2 BVH8), a 25% improvement. Compare to BVH4 where AVX2 was *slower*
than SSE2 (2.13 ns vs 1.92 ns) because the upper 128 bits of __m256 sat
idle. This confirms the BVH8 layout was the missing piece -- the SIMD
kernels now pay back the dispatch infrastructure.

Adds RayBox8 random consistency test (500 cases, AVX2 vs scalar) and
extends the DISABLED_RayBox8 benchmark with the AVX2 row.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implement RayBox8_AVX512: same slab + sign-based corner select as
RayBox8_AVX2, but the final hit predicate is computed straight into a
__mmask8 via _mm256_cmp_ps_mask (AVX-512F + VL), saving the movemask
round-trip the AVX2 path needs. At BVH8 fan-out the saving is small (a
few cycles); the same pattern would matter more at BVH16 with __m512.

Wires AVX-512 into GetRayBox8 ahead of AVX2 and SSE2.

Adds RayBox8 random consistency test (500 cases, AVX-512 vs scalar)
gated on Detect() >= AVX512 -- it SKIPs on hosts without the ISA
including this commit's CI box, but it builds and is exercised on
AVX-512-capable hardware. Extends the DISABLED_RayBox8 benchmark with
the AVX-512 row likewise gated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add BVH_TraverseOct::TraverseOct, the 8-wide analogue of TraverseQuad.
Walks an OBVH built via BVH_BinaryTree::CollapseToOctTree and invokes a
user-provided acceptor for each primitive in any leaf the ray hits.
Inner nodes load up to 8 child AABBs into a SoA record and dispatch to
BVH::SIMD::GetRayBox8 so the best kernel for the host (currently scalar
or SSE2 on this CI box; AVX2 / AVX-512 where available) does the test.

Mirrors TraverseQuad's design verbatim except for the lane count (8
instead of 4), padding-mask width (8 bits), and stack capacity
(MaxTreeDepth * 8). Padding lanes get an "always-miss" box so inner
nodes with fewer than 8 children pass through the SIMD kernel without
any special control flow.

Adds 5 tests covering: empty tree, flat root with all-hit / all-miss /
8-bit-mosaic-hit rays, and ray-behind-origin rejection. All pass and
exercise the AVX2 RayBox8 kernel on this host.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Consolidate the BVH8 implementation behind a single template<int W>
abstraction so adding BVH16 (a follow-up commit) reuses the traversal
and dispatch infrastructure instead of duplicating it.

The new abstraction lives in:
  - BVH_WideTree.hxx    : tag struct + BVH_Tree<T,N,BVH_WideTree<W>> spec
  - BVH_TraverseWide.hxx: TraverseWide<W> template + W-parametric helpers

Existing BVH8 surface API is renamed (no aliases, full rename):
  BVH_OctTree              -> BVH_WideTree<8>
  TraverseOct              -> TraverseWide<8>
  CollapseToOctTree        -> CollapseToWide<8>  (template, also <16>-ready)
  BVH_Box8f_SoA            -> BVH_BoxNf_SoA<8>
  BVH_Ray8f_Splat          -> BVH_RayNf_Splat<8>
  RayBox8_Scalar           -> RayBoxN_Scalar<8>  (function template)
  RayBox8_AVX2             -> RayBoxN_AVX2_8     (signature uses BVH_*Nf<8>)
  GetRayBox8 / RayBox8_Fn  -> GetRayBoxN<8> / RayBoxN_Fn<8>

BVH4 is intentionally untouched -- BVH_QuadTree, TraverseQuad, RayBox4_*
remain as before. SSE2 (xmm 4-wide) is BVH4's natural fit; folding it
into the template gains nothing.

SIMD kernel pruning per "right tool for right job":
  Past benchmarks (Zen AVX2 host, commit 819c0cf / 20487e4) showed:
    - BVH4 SSE2:  1.92 ns (7.75x), AVX2:  2.13 ns (6.98x) -- AVX2 LOSES
      (256-bit pipe wakeup + upper 128 idle eats the lane-width gain)
    - BVH8 SSE2:  3.92 ns (8.70x), AVX2:  3.09 ns (11.04x) -- AVX2 wins
    - BVH8 AVX-512: "saving is a few cycles" (commit d8ee58e's own
      observation) because that kernel used __m256+mask, NOT __m512.
  Conclusion: kernel width should match BVH width should match register
  width (xmm/ymm/zmm). The mismatched kernels were dead weight, so:
    - Drop RayBox8_SSE2     (BVH_ToolsSIMD_SSE2.hxx)
    - Drop RayBox8_AVX512   (BVH_ToolsSIMD_AVX512.hxx)
  W=8 dispatch is now AVX2 -> Scalar fallback only.
  W=16 dispatch is Scalar fallback for now; AVX-512 (__m512 zmm 16-lane)
  is the natural-fit kernel and lands in a follow-up commit.

Tests:
  - BVH_OctTree_Test.cxx       deleted, replaced by BVH_WideTree8_Test.cxx
  - BVH_TraverseOct_Test.cxx   deleted, replaced by BVH_TraverseWide8_Test.cxx
  - BVH_SIMDDispatch_Test.cxx  updated to RayBoxN_*<8>
  - BVH_ToolsSIMD_Test.cxx     drop RayBox8 SSE2/AVX-512 random-consistency
                               tests, keep AVX2 BVH8 (renamed RayBoxN_AVX2_8)
  - BVH_ToolsSIMD_Benchmark_Test.cxx  RayBox8 bench replaced by RayBoxN<8>
                                       (Scalar + AVX2 only; SSE2 / AVX-512
                                       BVH8 rows removed)
  - All 218 BVH-related GTests pass on the AVX2 host (3 AVX-512 tests skip
    due to absent ISA, expected behaviour).

git history is intentionally not preserved (delete + create instead of
git mv) -- template-ization changes content beyond rename-detection's
50% similarity threshold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Instantiates the BVH_WideTree<W> template at W=16, which:
  - Collapses every 4th binary level (2^4 = 16 children per inner node)
    via the existing CollapseToWide<W> helper.
  - Uses the existing template RayBoxN_Scalar<W> for ray-box tests.
  - Routes traversal through the existing TraverseWide<W>.

No new BVH16-only infrastructure is needed -- everything is template
instantiation of code already shipped in the BVH_WideTree<W> commit.
The natural-fit AVX-512 (__m512 zmm 16-lane) kernel will land in a
follow-up commit; until then BVH16 dispatches to scalar fallback.

Tests:
  - BVH_WideTree16_Test.cxx     mirrors BVH_WideTree8_Test.cxx with
                                256 primitives (2^8 leaves so 4-level
                                collapse can saturate at least one
                                BVH16 inner node).
  - BVH_TraverseWide16_Test.cxx mirrors BVH_TraverseWide8_Test.cxx with
                                a flat 16-leaf root: AllSixteenHit,
                                AllMiss, 16BitMosaic, RayBehindAllBoxes.
  - BVH_SIMDDispatch_Test.cxx   adds 4 ScalarRayBoxN16_* tests + a
                                GetRayBoxN16ReturnsNonNull check.

All 14 new tests pass; 218 previously-passing tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This is the kernel BVH16 was built for: a single __m512 holds all 16
fp32 lanes so one slab pass exhausts the AVX-512 register width with
zero waste. Per-lane work genuinely halves vs BVH8/AVX2.

Critically, this kernel uses zmm intrinsics throughout, NOT __m256 +
mask tricks. The previous (now-removed) BVH8 AVX-512 kernel made that
mistake: it stayed on __m256 lanes and used AVX-512 only to skip the
movemask round-trip, which gained "a few cycles" per call (commit
d8ee58e's own admission). The result was AVX-512 BVH8 ~= AVX2 BVH8,
no real win. The same shortcut here would defeat the entire purpose
of BVH16.

Sanity-checked via objdump that the emitted instructions actually use
zmm registers:
  vmovups 0xc0(%rdi),%zmm10
  vmovups 0x100(%rdi),%zmm9
  vcmplt_oqps %zmm7,%zmm10,%k3
  ...

Algorithm: same sign-based corner select as RayBoxN_AVX2_8 (tavianator),
now with _mm512_mask_blend_ps driven by sign-extracted __mmask16. Final
hit predicate goes straight into __mmask16 via _mm512_cmp_ps_mask;
__mmask16 is an int alias so the return is a single zext.

Target attribute: target("avx512f") only (no vl). The 512-bit-only path
is sufficient -- vl extensions would only matter if we mixed zmm with
narrower vector ops, which we don't.

NaN handling for parallel rays (invDir = inf, origin on box face) keeps
the same operand ordering as SSE2/AVX2 -- _mm512_max_ps / _mm512_min_ps
on x86 follow the "second operand wins on NaN" rule, so a parallel-axis
NaN is dominated by finite results from the other two axes.

Dispatch: GetRayBoxN<16>() now returns &RayBoxN_AVX512_16 on AVX-512
hosts, scalar fallback otherwise.

Tests:
  - BVH_ToolsSIMD_Test.cxx adds AVX512_RayBoxN16_RandomConsistencyVsScalar
    (1000 random ray+16-AABB cases vs scalar reference, ULP-near match).
    SKIPs cleanly on hosts without AVX-512.
  - All previously-passing tests continue to pass on the AVX2-only host.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends BVH_ToolsSIMD_Benchmark_Test.cxx with DISABLED_RayBoxN16_
KernelComparison: scalar reference + AVX-512 (when supported on the
host). Same shape as the existing RayBoxN<8> bench (4096 random
ray+16-AABB cases x 4096 repeats = 16.78M invocations).

The bench includes a sanity gate (EXPECT_LT(aAVX512.nsPerCall,
aScalar.nsPerCall * 0.5)) that asserts AVX-512 BVH16 actually beats
scalar by the expected 2x+ margin -- a smoke test against the
"silently using ymm not zmm" failure mode that hit the previous BVH8
AVX-512 attempt.

Per "right tool for right job", BVH16 has only the AVX-512 SIMD path
plus scalar fallback (no SSE2/AVX2 BVH16 wrappers). On hosts without
AVX-512, the bench just runs scalar and dispatched-= scalar rows.

Sample output on a Zen 3 (AVX2-only) host:

  === RayBoxN<8> microbenchmark (BVH8 fan-out) ===
    Scalar           31.51 ns/call    1.00x
    AVX2              3.17 ns/call    9.95x
    Dispatched        3.23 ns/call    9.74x

  === RayBoxN<16> microbenchmark (BVH16 fan-out) ===
    Scalar           63.38 ns/call    1.00x
    AVX-512       (skipped: CPU support absent)
    Dispatched       66.05 ns/call    0.96x

Per-lane on this host:
  BVH8/AVX2 = 0.40 ns/lane (lower bound for the kernel cost)
  BVH16/Scalar = 3.96 ns/lane (10x worse per lane vs SIMD)
On AVX-512 hardware, the AVX-512 BVH16 row should land near
~0.4 ns/lane * 16 = ~6 ns/call (~2x the AVX2 BVH8 number),
demonstrating the per-lane savings the zmm width unlocks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two fixes for CI failures observed on jijinbei/OCCT PR #4:

1. macOS-15 (Apple Silicon) build failure: __builtin_cpu_init() and
   __builtin_cpu_supports() are x86-only GCC/Clang builtins. On ARM64
   they error with "builtin is not supported on this target".
   Guard the GCC/Clang detect path with __x86_64__ || __i386__ so ARM
   falls through to the scalar fallback (already the right behaviour --
   no x86 SIMD kernels exist on ARM).

2. Code formatting check failure: CI runs a stricter clang-format
   (clang-format-18 default style) than the local pre-commit hook used.
   Apply the diff produced by the CI's "format-patch" artifact verbatim.
   Touches BVH_TraverseQuad.hxx, BVH_TraverseQuad_Test.cxx, and the
   detectImpl() function in BVH_SIMDDispatch.cxx (whitespace only).

Local re-run after both fixes: 381/385 BVH tests pass on the AVX2-only
Zen 3 host (4 AVX-512 tests SKIP as expected, no regressions).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Apple Silicon Clang runs -Werror -Wunused-function and flagged the
MakeRaySplat / SetBox helpers in BVH_ToolsSIMD_Test.cxx as unused.
On macOS ARM64:
  - BVH_HAS_SSE2_KERNEL is not defined (SSE2 is x86-only)
  - The entire block of SSE2 RayBox4 tests that *used* these helpers is
    #if'd out
  - -> every caller disappears at preprocessing time, and the helpers
    become dead code

Annotate both with [[maybe_unused]] so they survive the -Werror guard
when their consumers are conditionally compiled out. On x86 hosts the
attribute is a no-op because the SSE2 block is present and uses them.

Fixes the macOS build failure observed on CI run 24898268476.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dpasukhi added the "3. CLA approved" label on Apr 28, 2026