Add fixed_capacity_map to cudax#7705

Open
srinivasyadav18 wants to merge 84 commits into NVIDIA:main from srinivasyadav18:cuco_static_map

@srinivasyadav18 srinivasyadav18 commented Feb 18, 2026

Contributor

Description

closes #7463

This PR migrates cuCollections static_map into cudax as cuda::experimental::cuco::static_map.

Minimal scope: implements insert, contains, clear, and trivial accessors, with capacity validation provided by make_valid_capacity and is_valid_capacity. Tests mirror the cuCollections layout and use a parameterized matrix covering key type, probing scheme, CG size, and bucket size.

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

copy-pr-bot Bot commented Feb 18, 2026

Contributor

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Progress in CCCL Feb 18, 2026
@PointKernel PointKernel self-requested a review February 18, 2026 20:10
Comment thread cudax/include/cuda/experimental/__cuco/__detail/bitwise_compare.cuh Outdated
Comment thread cudax/include/cuda/experimental/__cuco/__detail/equal_wrapper.cuh Outdated
Comment thread cudax/include/cuda/experimental/__cuco/detail/prime.hpp
Comment thread cudax/include/cuda/experimental/__cuco/__detail/utils.cuh Outdated
Comment thread cudax/include/cuda/experimental/__cuco/__detail/utils.hpp Outdated
Comment thread cudax/include/cuda/experimental/__cuco/__static_map/kernels.cuh Outdated
Comment thread cudax/include/cuda/experimental/__cuco/static_map.cuh Outdated
Comment thread cudax/include/cuda/experimental/__cuco/static_map.cuh Outdated
Comment thread cudax/include/cuda/experimental/__cuco/static_map.cuh Outdated
Comment thread cudax/include/cuda/experimental/__cuco/__detail/extent.cuh Outdated

copy-pr-bot Bot commented May 21, 2026

Contributor

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@PointKernel PointKernel marked this pull request as ready for review June 3, 2026 18:11
@PointKernel PointKernel requested a review from a team as a code owner June 3, 2026 18:11
@PointKernel PointKernel requested a review from andralex June 3, 2026 18:11
@cccl-authenticator-app cccl-authenticator-app Bot moved this from In Progress to In Review in CCCL Jun 3, 2026

coderabbitai Bot commented Jun 3, 2026

Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds a complete open-addressing hash table infrastructure for CUDA Experimental, comprising device reference operations, grid kernels, host orchestration, and a public static_map container with static/dynamic capacity modes and optional key erasure, plus comprehensive test coverage.
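For readers unfamiliar with the technique, the core idea of an open-addressing table with an empty-key sentinel and CAS-based insertion can be sketched on the host. This is a simplified illustration only, not the PR's device code: `toy_open_addressing_set`, `kEmpty`, and the use of `std::atomic` with linear probing are assumptions made for the example.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical sentinel marking an unused slot; it must never be inserted as a key.
constexpr std::uint32_t kEmpty = 0xFFFFFFFFu;

// Hypothetical host-side model of a fixed-capacity open-addressing key set.
class toy_open_addressing_set
{
public:
  explicit toy_open_addressing_set(std::size_t capacity)
      : slots_(capacity)
  {
    clear();
  }

  // Reset every slot to the sentinel.
  void clear()
  {
    for (auto& slot : slots_)
    {
      slot.store(kEmpty, std::memory_order_relaxed);
    }
  }

  // Returns true only if this call claimed a slot for `key`.
  bool insert(std::uint32_t key)
  {
    std::size_t idx = std::hash<std::uint32_t>{}(key) % slots_.size();
    for (std::size_t probes = 0; probes < slots_.size(); ++probes)
    {
      std::uint32_t expected = kEmpty;
      if (slots_[idx].compare_exchange_strong(expected, key))
      {
        return true; // the CAS swapped the sentinel for our key
      }
      if (expected == key)
      {
        return false; // the slot already holds this key
      }
      idx = (idx + 1) % slots_.size(); // linear probing: try the next slot
    }
    return false; // every slot was probed and none was free
  }

  bool contains(std::uint32_t key) const
  {
    std::size_t idx = std::hash<std::uint32_t>{}(key) % slots_.size();
    for (std::size_t probes = 0; probes < slots_.size(); ++probes)
    {
      const std::uint32_t seen = slots_[idx].load();
      if (seen == key)
      {
        return true;
      }
      if (seen == kEmpty)
      {
        return false; // an empty slot ends the probe sequence
      }
      idx = (idx + 1) % slots_.size();
    }
    return false;
  }

private:
  std::vector<std::atomic<std::uint32_t>> slots_;
};
```

Because there is no erase operation in this sketch, an empty slot reliably terminates a lookup; supporting erasure would need a separate tombstone value, which is the role the walkthrough assigns to the erased-key sentinel.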

Changes

Open-addressing and static_map port

Layer: Type traits and bitwise comparison
File(s): cudax/include/cuda/experimental/__cuco/traits.hpp, cudax/include/cuda/experimental/__cuco/__detail/bitwise_compare.cuh
Summary: is_bitwise_comparable, is_tuple_like traits and aligned __bitwise_compare template support bitwise-safe type detection and fast equality paths (4/8-byte specializations via reinterpretation, general memcmp fallback).

Layer: Prime utilities and capacity rounding
File(s): cudax/include/cuda/experimental/__cuco/__detail/prime.hpp, cudax/include/cuda/experimental/__cuco/capacity.cuh
Summary: Deterministic 64-bit primality testing via trial division + Miller–Rabin, modular arithmetic with __int128 fast path, and make_valid_capacity rounding for linear/double-hashing with overflow guards.

Layer: Probing schemes and iterator base
File(s): cudax/include/cuda/experimental/__cuco/__detail/probing_scheme_base.cuh, cudax/include/cuda/experimental/__cuco/probing_scheme.cuh
Summary: __probing_scheme_base<CgSize> and __probing_iterator for bucket traversal; public linear_probing and double_hashing templates with cooperative-group tile-rank stride distribution.

Layer: Sentinel types and kernel utilities
File(s): cudax/include/cuda/experimental/__cuco/types.cuh, cudax/include/cuda/experimental/__cuco/__detail/types.cuh, cudax/include/cuda/experimental/__cuco/__detail/utils.cuh, cudax/include/cuda/experimental/__cuco/__detail/utils.hpp
Summary: Strong-type sentinel wrappers (empty_key, empty_value, erased_key), mdspan extent aliases, and grid-launch helpers (global thread ID, grid stride, occupancy sizing, tile-size traits).

Layer: Equality wrapper for probing
File(s): cudax/include/cuda/experimental/__cuco/__detail/equal_wrapper.cuh
Summary: Combines __bitwise_compare sentinel checks with key equality, returning three-way results and branching on insert vs. query mode for duplicate control.

Layer: Slot storage and device reference core
File(s): cudax/include/cuda/experimental/__cuco/__open_addressing/slot_storage_ref.cuh, cudax/include/cuda/experimental/__cuco/__open_addressing/open_addressing_ref_impl.cuh
Summary: __slot_storage_ref non-owning bucket view and __open_addressing_ref_impl device-side operations (probing, CAS-based insert with packed_cas/back_to_back_cas/cas_dependent_write dispatch, contains, cooperative-group variants).

Layer: Grid kernels for bulk operations
File(s): cudax/include/cuda/experimental/__cuco/__open_addressing/kernels.cuh
Summary: Grid-stride conditional __insert_if_n, __fill, and __contains_if_n kernels with _CgSize==1 direct vs. _CgSize!=1 tiled cooperative execution paths.

Layer: Host orchestration and memory
File(s): cudax/include/cuda/experimental/__cuco/__open_addressing/open_addressing_impl.cuh
Summary: Device-allocated slot buffer, async/sync clear/insert/contains with stream refs, device counter for success counting, bucket-count computation from capacity or load factor.

Layer: Public static_map container
File(s): cudax/include/cuda/experimental/__cuco/static_map.cuh, cudax/include/cuda/experimental/__cuco/static_map_ref.cuh
Summary: SFINAE-selected constructors for static/dynamic capacity and erasure modes; clear, insert, contains forwarding; device-side static_map_ref with trivially-copyable ref semantics.

Layer: Capacity, insert, and sentinel tests
File(s): cudax/test/cuco/static_map/test_capacity.cu, cudax/test/cuco/static_map/test_insert_and_contains.cu, cudax/test/cuco/static_map/test_key_sentinel.cu, cudax/test/cuco/static_map/test_shared_memory.cu, cudax/test/cuco/utility/test_capacity.cu, cudax/test/CMakeLists.txt
Summary: Validates dynamic/static capacity computation, insert/contains workflows, shared-memory sizing via capacity_v, sentinel handling, and load-factor rounding; updates strong_type.cuh documentation.
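The "Prime utilities and capacity rounding" entry can be illustrated with a small host-side sketch of a deterministic 64-bit Miller-Rabin test followed by rounding a requested capacity up to a prime number of probe strides. This is not the PR's code: all names here (`mul_mod`, `pow_mod`, `is_prime`, `hypothetical_valid_capacity`) are invented for illustration, the rounding rule is an assumption about what such a helper might do, and `unsigned __int128` is a GCC/Clang extension.

```cpp
#include <cstdint>

using u64  = std::uint64_t;
using u128 = unsigned __int128; // compiler extension (GCC/Clang), assumed available

// (a * b) mod m without overflow, using a 128-bit intermediate.
u64 mul_mod(u64 a, u64 b, u64 m)
{
  return static_cast<u64>(static_cast<u128>(a) * b % m);
}

// (base ^ exp) mod m by square-and-multiply.
u64 pow_mod(u64 base, u64 exp, u64 m)
{
  u64 result = 1 % m;
  base %= m;
  while (exp > 0)
  {
    if (exp & 1)
    {
      result = mul_mod(result, base, m);
    }
    base = mul_mod(base, base, m);
    exp >>= 1;
  }
  return result;
}

// Trial division by small primes, then Miller-Rabin with the first twelve
// primes as witnesses, a set known to be deterministic for all 64-bit inputs.
bool is_prime(u64 n)
{
  const u64 bases[] = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37};
  if (n < 2)
  {
    return false;
  }
  for (u64 p : bases)
  {
    if (n % p == 0)
    {
      return n == p;
    }
  }
  u64 d = n - 1;
  int r = 0;
  while ((d & 1) == 0) // write n - 1 as d * 2^r with d odd
  {
    d >>= 1;
    ++r;
  }
  for (u64 a : bases)
  {
    u64 x = pow_mod(a, d, n);
    if (x == 1 || x == n - 1)
    {
      continue;
    }
    bool composite = true;
    for (int i = 1; i < r; ++i)
    {
      x = mul_mod(x, x, n);
      if (x == n - 1)
      {
        composite = false;
        break;
      }
    }
    if (composite)
    {
      return false;
    }
  }
  return true;
}

// Smallest prime >= n (no overflow guard in this sketch).
u64 next_prime_at_least(u64 n)
{
  while (!is_prime(n))
  {
    ++n;
  }
  return n;
}

// Hypothetical rounding rule: a prime number of strides, where one stride is
// cg_size * bucket_size slots, so the result is always a multiple of the stride.
u64 hypothetical_valid_capacity(u64 requested, u64 cg_size, u64 bucket_size)
{
  const u64 stride = cg_size * bucket_size;
  u64 groups       = (requested + stride - 1) / stride; // ceiling division
  if (groups < 2)
  {
    groups = 2;
  }
  return next_prime_at_least(groups) * stride;
}
```

Under these assumptions, a request for 100 slots with `cg_size = 4` and `bucket_size = 2` rounds up to 13 strides of 8 slots. A production helper would also need the overflow guards the summary mentions, which this sketch omits.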

Assessment against linked issues

Objective Addressed Explanation
Port OpenAddressing [#7463]
Port static_map [#7463]

Suggested labels

cudax

Suggested reviewers

  • andralex
  • pciolkosz
  • gevtushenko

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 18

🧹 Nitpick comments (3)
cudax/include/cuda/experimental/__cuco/__detail/extent.cuh (1)

131-148: ⚡ Quick win

suggestion: Mark these header variable templates inline. They are namespace-scope constexpr definitions in a header, and the local CCCL rule requires the explicit inline spelling for this pattern.

As per coding guidelines, "All constexpr variables at namespace/global scope must use inline, including template variables."
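As a generic illustration of the pattern this nitpick asks for (C++17 or later), header-level constexpr variables and variable templates spelled with an explicit `inline` look like the following. The names are hypothetical and are not taken from extent.cuh.

```cpp
#include <cstddef>

// Hypothetical namespace-scope constant. Without `inline`, a constexpr variable
// has internal linkage, so each translation unit gets its own copy.
inline constexpr std::size_t default_bucket_size = 1;

// Hypothetical variable templates with the explicit `inline` spelling.
template <class T>
inline constexpr bool is_small_v = sizeof(T) <= 8;

template <std::size_t N>
inline constexpr std::size_t doubled_v = 2 * N;

static_assert(default_bucket_size == 1, "constant is usable at compile time");
static_assert(is_small_v<char>, "char fits in 8 bytes");
static_assert(doubled_v<21> == 42, "variable template evaluates at compile time");
```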

cudax/include/cuda/experimental/__cuco/probing_scheme.cuh (1)

24-31: ⚡ Quick win

suggestion: Wrap this header with the standard CCCL prologue/epilogue pair. The file enters code directly after its includes and never closes with #include <cuda/std/__cccl/epilogue.h>, unlike the other new headers in this cohort.

As per coding guidelines, "The last included header before code must be #include <cuda/std/__cccl/prologue.h>, and #include <cuda/std/__cccl/epilogue.h> must be at the end of a file."

Also applies to: 264-264

cudax/include/cuda/experimental/__cuco/__static_map/kernels.cuh (1)

119-235: suggestion: Please attach benchmark results for this fast path before merge. This shared-memory kernel adds a new execution path and tuning heuristic, so we need the perf numbers that justify it on the supported toolchains and architectures. As per coding guidelines, "Do not commit SASS code changes without running benchmarks to check for performance regressions."


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d2eb011c-f333-4929-a09b-f09102640ec3

📥 Commits

Reviewing files that changed from the base of the PR and between 75c7b14 and 134736e.

📒 Files selected for processing (22)
  • cudax/include/cuda/experimental/__cuco/__detail/bitwise_compare.cuh
  • cudax/include/cuda/experimental/__cuco/__detail/equal_wrapper.cuh
  • cudax/include/cuda/experimental/__cuco/__detail/extent.cuh
  • cudax/include/cuda/experimental/__cuco/__detail/prime.hpp
  • cudax/include/cuda/experimental/__cuco/__detail/probing_scheme_base.cuh
  • cudax/include/cuda/experimental/__cuco/__detail/types.cuh
  • cudax/include/cuda/experimental/__cuco/__detail/utils.cuh
  • cudax/include/cuda/experimental/__cuco/__detail/utils.hpp
  • cudax/include/cuda/experimental/__cuco/__open_addressing/functors.cuh
  • cudax/include/cuda/experimental/__cuco/__open_addressing/kernels.cuh
  • cudax/include/cuda/experimental/__cuco/__open_addressing/open_addressing_impl.cuh
  • cudax/include/cuda/experimental/__cuco/__open_addressing/open_addressing_ref_impl.cuh
  • cudax/include/cuda/experimental/__cuco/__open_addressing/slot_storage_ref.cuh
  • cudax/include/cuda/experimental/__cuco/__open_addressing/types.cuh
  • cudax/include/cuda/experimental/__cuco/__static_map/kernels.cuh
  • cudax/include/cuda/experimental/__cuco/__utility/strong_type.cuh
  • cudax/include/cuda/experimental/__cuco/probing_scheme.cuh
  • cudax/include/cuda/experimental/__cuco/static_map.cuh
  • cudax/include/cuda/experimental/__cuco/static_map_ref.cuh
  • cudax/include/cuda/experimental/__cuco/traits.hpp
  • cudax/test/CMakeLists.txt
  • cudax/test/cuco/static_map/test_static_map.cu

Comment thread cudax/include/cuda/experimental/__cuco/detail/prime.hpp
Comment thread cudax/include/cuda/experimental/__cuco/__detail/utils.cuh Outdated
Comment thread cudax/include/cuda/experimental/__cuco/detail/utility/cuda.cuh
Comment thread cudax/include/cuda/experimental/__cuco/detail/utility/cuda.cuh
Comment thread cudax/include/cuda/experimental/__cuco/__open_addressing/functors.cuh Outdated
Comment thread cudax/include/cuda/experimental/__cuco/static_map_ref.cuh Outdated
Comment thread cudax/include/cuda/experimental/__cuco/static_map_ref.cuh Outdated
Comment thread cudax/include/cuda/experimental/__cuco/static_map_ref.cuh Outdated
Comment thread cudax/include/cuda/experimental/__cuco/static_map.cuh Outdated
Comment thread cudax/include/cuda/experimental/__cuco/traits.hpp Outdated

@PointKernel PointKernel left a comment

Member

Ready for another round of review

Comment thread cudax/include/cuda/experimental/__cuco/__detail/bitwise_compare.cuh Outdated
Comment thread cudax/include/cuda/experimental/__cuco/__detail/prime.hpp Outdated
Comment on lines +386 to +392
const auto __src_lane = __ffs(__group_contains_available) - 1;
auto __status = __insert_result::__continue;
if (__group.thread_rank() == __src_lane)
{
  __status =
    __attempt_insert(__get_slot_ptr(*__probing_iter, __intra_bucket_index), empty_slot_sentinel(), __val);
}

Member

Good to know. I ran some tests locally with Claude Code and didn't observe any noticeable impact, but I'm happy to revisit this if needed.

Comment on lines +103 to +106
using __impl_type = ::cuda::experimental::cuco::__open_addressing::
__open_addressing_impl<_Key, value_type, _Scope, _KeyEqual, _ProbingScheme, _BucketSize, _MemoryResource>;

::cuda::std::unique_ptr<__impl_type> __impl;

Member

PImpl is intentional here. While the Core Guidelines primarily discuss PImpl in the context of ABI stability, it is also a well-established technique for reducing header dependencies and isolating implementation details.
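For context, the general shape of the PImpl idiom referred to here can be sketched in plain host C++. This is a generic illustration with hypothetical names (`toy_map`, `toy_map_impl`) and `std::unique_ptr`; it is not the layout used in static_map.cuh.

```cpp
#include <cstddef>
#include <memory>
#include <unordered_set>
#include <utility>

// Public side: only a forward declaration of the implementation is visible.
class toy_map_impl;

class toy_map
{
public:
  explicit toy_map(std::size_t capacity);
  ~toy_map(); // defined below, where toy_map_impl is a complete type
  toy_map(toy_map&&) noexcept;
  toy_map& operator=(toy_map&&) noexcept;

  bool insert(int key);
  bool contains(int key) const;

private:
  std::unique_ptr<toy_map_impl> impl_; // owning pointer to the hidden state
};

// Implementation side: would normally live in a separate detail header or source file.
class toy_map_impl
{
public:
  explicit toy_map_impl(std::size_t cap)
      : capacity(cap)
  {}

  std::size_t capacity;
  std::unordered_set<int> keys;
};

toy_map::toy_map(std::size_t capacity)
    : impl_(std::make_unique<toy_map_impl>(capacity))
{}
toy_map::~toy_map()                             = default;
toy_map::toy_map(toy_map&&) noexcept            = default;
toy_map& toy_map::operator=(toy_map&&) noexcept = default;

// The public operations simply forward to the implementation object.
bool toy_map::insert(int key)
{
  if (impl_->keys.size() >= impl_->capacity)
  {
    return false; // fixed capacity reached
  }
  return impl_->keys.insert(key).second;
}

bool toy_map::contains(int key) const
{
  return impl_->keys.count(key) != 0;
}
```

The destructor and move operations are declared in the class and defaulted only after `toy_map_impl` is complete, which `std::unique_ptr` requires when it holds an incomplete type.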

{
  using __size_type = typename _Capacity::index_type;
  using __step_extent = ::cuda::std::extents<__size_type, _BucketSize>;
  const __size_type __init = __hash(__probe_key) % (__cap.extent(0) / _BucketSize) * _BucketSize;

Member

round_down(extent, B) isn't equivalent. The current (__hash % (extent / _BucketSize)) * _BucketSize picks a bucket (hash % num_buckets) then scales to its first slot, so the start is bucket-aligned.
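A small worked example makes the difference concrete. The `round_down`-based form below is only one guess at how such a rewrite might look, since the original suggestion is not shown in this thread; the function names and numbers are illustrative.

```cpp
#include <cstddef>

// Largest multiple of `multiple` that is <= value.
constexpr std::size_t round_down(std::size_t value, std::size_t multiple)
{
  return value - value % multiple;
}

// The expression quoted in the review: pick a bucket, then scale to its first slot.
constexpr std::size_t bucket_aligned_start(std::size_t hash, std::size_t extent, std::size_t bucket_size)
{
  return hash % (extent / bucket_size) * bucket_size;
}

// Hypothetical alternative: reduce the hash modulo the rounded slot count.
// The result is a slot index that need not fall on a bucket boundary.
constexpr std::size_t slot_start(std::size_t hash, std::size_t extent, std::size_t bucket_size)
{
  return hash % round_down(extent, bucket_size);
}

// extent = 12 slots, bucket_size = 4, so there are 3 buckets starting at slots 0, 4, 8.
static_assert(bucket_aligned_start(13, 12, 4) == 4, "13 % 3 = 1, scaled to slot 4");
static_assert(slot_start(13, 12, 4) == 1, "13 % 12 = 1, which is inside bucket 0");
```

With hash 13 the quoted expression starts at slot 4, the first slot of bucket 1, whereas the alternative starts at slot 1 in the middle of bucket 0.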

using __step_extent = ::cuda::std::extents<__size_type, ::cuda::std::dynamic_extent>;
return ::cuda::experimental::cuco::__detail::__probing_iterator<_Capacity, __step_extent>{
__size_type{__hash1(__probe_key)} % (__cap.extent(0) / _BucketSize) * _BucketSize,
__step_extent{__size_type{(__hash2(__probe_key) % (__cap.extent(0) / _BucketSize - 1) + 1) * _BucketSize}},

Member

Not needed: extent(0) (the slot capacity) is always a multiple of the probe stride (cg_size * _BucketSize), so extent / _BucketSize is exact.
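The arithmetic in the quoted double-hashing expression can be checked with a host-side sketch. It models only the `cg_size == 1` case, assumes the extent is an exact multiple of the bucket size as stated above, and uses hypothetical names (`probe_sequence`, `make_probe_sequence`, `visited_buckets`).

```cpp
#include <cstddef>
#include <vector>

// Start slot and step (in slots) of one probe sequence.
struct probe_sequence
{
  std::size_t start;
  std::size_t step;
};

// Mirrors the quoted expressions: the start is bucket-aligned, and the step is
// a nonzero multiple of bucket_size that is strictly smaller than the extent.
constexpr probe_sequence
make_probe_sequence(std::size_t hash1, std::size_t hash2, std::size_t extent, std::size_t bucket_size)
{
  const std::size_t num_buckets = extent / bucket_size; // exact by assumption
  return {hash1 % num_buckets * bucket_size, (hash2 % (num_buckets - 1) + 1) * bucket_size};
}

// Bucket indices visited during the first num_buckets probes.
std::vector<std::size_t> visited_buckets(probe_sequence seq, std::size_t extent, std::size_t bucket_size)
{
  const std::size_t num_buckets = extent / bucket_size;
  std::vector<std::size_t> order;
  std::size_t slot = seq.start;
  for (std::size_t i = 0; i < num_buckets; ++i)
  {
    order.push_back(slot / bucket_size);
    slot = (slot + seq.step) % extent; // wrap around the slot array
  }
  return order;
}
```

For example, with `extent = 14` and `bucket_size = 2` there are 7 buckets; `hash1 = 10` and `hash2 = 17` give a start slot of 6 and a step of 12 slots. Because 7 is prime, every nonzero step visits each bucket exactly once before the sequence repeats.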

@PointKernel PointKernel left a comment

Member

Looks good from my end

Labels

cuco cuCollections

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

[FEA]: Migrate cuCollections OpenAddressing and static_map

4 participants