Skip to content

optimize blockReduceWarp#9101

Open
charan-003 wants to merge 21 commits into
NVIDIA:mainfrom
charan-003:optimize_blockreduce
Open

optimize blockReduceWarp#9101
charan-003 wants to merge 21 commits into
NVIDIA:mainfrom
charan-003:optimize_blockreduce

Conversation

@charan-003

@charan-003 charan-003 commented May 21, 2026

Copy link
Copy Markdown
Contributor

Closes #5167

Replace sequential aggregation with cooperative warp reduction

Benchmark Comparison: optimize_blockreduce vs origin/main

  • GPU targets: rtxa6000, h100, rtxpro6000
  • Base label: origin/main

rtxa6000 — cub.bench.reduce.min.base

[0] NVIDIA RTX A6000

T{ct} OffsetT{ct} Elements{io} Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I8 I32 2^16 6.443 us 4.14% 6.203 us 4.59% -0.240 us -3.72% 🔵 SAME
I8 I32 2^20 8.773 us 3.33% 8.464 us 3.29% -0.309 us -3.52% 🟢 FAST
I8 I32 2^24 39.725 us 0.74% 39.222 us 0.42% -0.502 us -1.26% 🟢 FAST
I8 I32 2^28 452.705 us 3.39% 452.478 us 3.39% -0.227 us -0.05% 🔵 SAME
I8 I64 2^16 5.825 us 1.63% 6.076 us 4.36% 0.251 us 4.30% 🔴 SLOW
I8 I64 2^20 8.547 us 3.22% 8.245 us 1.35% -0.302 us -3.53% 🟢 FAST
I8 I64 2^24 39.567 us 0.76% 39.019 us 0.82% -0.548 us -1.39% 🟢 FAST
I8 I64 2^28 454.296 us 3.35% 457.069 us 3.51% 2.772 us 0.61% 🔵 SAME
I16 I32 2^16 6.109 us 1.00% 6.319 us 4.23% 0.210 us 3.44% 🔴 SLOW
I16 I32 2^20 10.274 us 1.19% 10.686 us 2.50% 0.412 us 4.01% 🔴 SLOW
I16 I32 2^24 67.186 us 0.43% 67.283 us 0.41% 0.097 us 0.14% 🔵 SAME
I16 I32 2^28 877.823 us 2.47% 878.647 us 2.45% 0.824 us 0.09% 🔵 SAME
I16 I64 2^16 6.241 us 3.21% 6.456 us 4.51% 0.215 us 3.44% 🔴 SLOW
I16 I64 2^20 10.270 us 1.17% 10.706 us 2.48% 0.436 us 4.24% 🔴 SLOW
I16 I64 2^24 67.225 us 0.42% 67.119 us 0.41% -0.106 us -0.16% 🔵 SAME
I16 I64 2^28 877.483 us 2.36% 874.667 us 2.47% -2.816 us -0.32% 🔵 SAME
I32 I32 2^16 5.929 us 3.68% 6.264 us 4.57% 0.335 us 5.65% 🔴 SLOW
I32 I32 2^20 14.811 us 1.72% 14.903 us 1.88% 0.092 us 0.62% 🔵 SAME
I32 I32 2^24 119.517 us 0.24% 120.028 us 0.24% 0.512 us 0.43% 🔴 SLOW
I32 I32 2^28 1.741 ms 1.67% 1.728 ms 1.67% -13.048 us -0.75% 🔵 SAME
I32 I64 2^16 6.156 us 4.72% 6.371 us 4.64% 0.215 us 3.49% 🔵 SAME
I32 I64 2^20 14.744 us 1.89% 14.891 us 1.85% 0.147 us 1.00% 🔵 SAME
I32 I64 2^24 119.648 us 0.24% 120.237 us 0.24% 0.589 us 0.49% 🔴 SLOW
I32 I64 2^28 1.740 ms 1.58% 1.727 ms 1.52% -13.147 us -0.76% 🔵 SAME
I64 I32 2^16 6.913 us 4.20% 7.250 us 4.32% 0.337 us 4.87% 🔴 SLOW
I64 I32 2^20 23.589 us 1.26% 23.737 us 1.25% 0.148 us 0.63% 🔵 SAME
I64 I32 2^24 224.202 us 0.07% 224.576 us 0.13% 0.374 us 0.17% 🔴 SLOW
I64 I32 2^28 3.381 ms 1.39% 3.380 ms 1.23% -1.249 us -0.04% 🔵 SAME
I64 I64 2^16 6.894 us 4.03% 7.262 us 4.04% 0.368 us 5.34% 🔴 SLOW
I64 I64 2^20 23.668 us 1.41% 23.933 us 1.35% 0.265 us 1.12% 🔵 SAME
I64 I64 2^24 224.323 us 0.07% 224.810 us 0.14% 0.487 us 0.22% 🔴 SLOW
I64 I64 2^28 3.380 ms 1.29% 3.381 ms 1.28% 0.657 us 0.02% 🔵 SAME
I128 I32 2^16 9.324 us 3.10% 9.900 us 3.05% 0.576 us 6.18% 🔴 SLOW
I128 I32 2^20 45.408 us 1.61% 45.897 us 1.67% 0.490 us 1.08% 🔵 SAME
I128 I32 2^24 485.106 us 0.10% 485.032 us 0.11% -0.073 us -0.02% 🔵 SAME
I128 I32 2^28 7.370 ms 0.53% 7.379 ms 0.55% 9.664 us 0.13% 🔵 SAME
I128 I64 2^16 9.057 us 3.30% 9.905 us 2.94% 0.849 us 9.37% 🔴 SLOW
I128 I64 2^20 45.708 us 0.86% 47.293 us 0.91% 1.585 us 3.47% 🔴 SLOW
I128 I64 2^24 477.357 us 0.12% 484.804 us 0.12% 7.447 us 1.56% 🔴 SLOW
I128 I64 2^28 7.264 ms 0.54% 7.379 ms 0.55% 115.104 us 1.58% 🔴 SLOW
F32 I32 2^16 6.192 us 4.45% 6.626 us 4.16% 0.434 us 7.01% 🔴 SLOW
F32 I32 2^20 14.638 us 2.21% 14.973 us 1.92% 0.335 us 2.29% 🔴 SLOW
F32 I32 2^24 121.718 us 2.43% 121.970 us 2.44% 0.252 us 0.21% 🔵 SAME
F32 I32 2^28 1.740 ms 1.62% 1.741 ms 1.59% 0.217 us 0.01% 🔵 SAME
F32 I64 2^16 6.085 us 4.04% 6.625 us 4.51% 0.541 us 8.89% 🔴 SLOW
F32 I64 2^20 14.932 us 1.90% 15.288 us 1.93% 0.356 us 2.38% 🔴 SLOW
F32 I64 2^24 119.788 us 0.11% 120.366 us 0.23% 0.578 us 0.48% 🔴 SLOW
F32 I64 2^28 1.740 ms 1.64% 1.741 ms 1.55% 0.480 us 0.03% 🔵 SAME
F64 I32 2^16 8.012 us 2.96% 8.324 us 3.70% 0.311 us 3.89% 🔴 SLOW
F64 I32 2^20 24.851 us 1.27% 24.846 us 1.06% -0.005 us -0.02% 🔵 SAME
F64 I32 2^24 227.362 us 0.29% 227.741 us 0.30% 0.379 us 0.17% 🔵 SAME
F64 I32 2^28 3.384 ms 1.19% 3.384 ms 1.19% 0.661 us 0.02% 🔵 SAME
F64 I64 2^16 7.965 us 3.48% 8.399 us 3.79% 0.435 us 5.46% 🔴 SLOW
F64 I64 2^20 24.876 us 1.22% 24.879 us 1.11% 0.003 us 0.01% 🔵 SAME
F64 I64 2^24 227.410 us 0.28% 227.751 us 0.31% 0.341 us 0.15% 🔵 SAME
F64 I64 2^28 3.384 ms 1.21% 3.384 ms 1.13% 0.053 us 0.00% 🔵 SAME

Summary: Total Matches: 56 — Pass: 28 / Failure: 28


rtxa6000 — cub.bench.reduce.sum.base

[0] NVIDIA RTX A6000

T{ct} OffsetT{ct} Elements{io} Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I8 I32 2^16 5.876 us 4.47% 5.950 us 4.51% 0.074 us 1.25% 🔵 SAME
I8 I32 2^20 8.329 us 3.37% 8.336 us 3.38% 0.007 us 0.08% 🔵 SAME
I8 I32 2^24 38.631 us 0.77% 38.648 us 0.75% 0.017 us 0.04% 🔵 SAME
I8 I32 2^28 449.642 us 3.18% 449.834 us 3.23% 0.192 us 0.04% 🔵 SAME
I8 I64 2^16 5.745 us 1.33% 5.807 us 1.53% 0.062 us 1.08% 🔵 SAME
I8 I64 2^20 8.365 us 3.26% 8.359 us 3.15% -0.006 us -0.07% 🔵 SAME
I8 I64 2^24 38.645 us 0.77% 39.173 us 0.83% 0.529 us 1.37% 🔴 SLOW
I8 I64 2^28 448.990 us 3.13% 450.452 us 3.14% 1.462 us 0.33% 🔵 SAME
I16 I32 2^16 5.889 us 1.42% 5.955 us 2.22% 0.066 us 1.12% 🔵 SAME
I16 I32 2^20 10.251 us 1.25% 10.309 us 1.30% 0.058 us 0.57% 🔵 SAME
I16 I32 2^24 66.399 us 0.43% 66.490 us 0.44% 0.090 us 0.14% 🔵 SAME
I16 I32 2^28 870.659 us 2.41% 872.107 us 2.55% 1.448 us 0.17% 🔵 SAME
I16 I64 2^16 5.886 us 2.92% 6.114 us 3.20% 0.228 us 3.88% 🔴 SLOW
I16 I64 2^20 10.208 us 1.16% 10.270 us 1.29% 0.062 us 0.61% 🔵 SAME
I16 I64 2^24 66.243 us 0.42% 66.344 us 0.45% 0.101 us 0.15% 🔵 SAME
I16 I64 2^28 869.308 us 2.41% 870.345 us 2.51% 1.037 us 0.12% 🔵 SAME
I32 I32 2^16 5.970 us 4.51% 6.211 us 4.01% 0.240 us 4.02% 🔴 SLOW
I32 I32 2^20 14.947 us 1.78% 14.981 us 1.87% 0.034 us 0.23% 🔵 SAME
I32 I32 2^24 119.569 us 0.24% 119.462 us 0.15% -0.107 us -0.09% 🔵 SAME
I32 I32 2^28 1.740 ms 1.63% 1.740 ms 1.60% -0.307 us -0.02% 🔵 SAME
I32 I64 2^16 6.236 us 4.93% 6.310 us 4.63% 0.074 us 1.19% 🔵 SAME
I32 I64 2^20 14.908 us 1.85% 14.916 us 1.94% 0.008 us 0.05% 🔵 SAME
I32 I64 2^24 119.607 us 0.25% 119.553 us 0.13% -0.055 us -0.05% 🔵 SAME
I32 I64 2^28 1.740 ms 1.64% 1.741 ms 1.62% 1.034 us 0.06% 🔵 SAME
I64 I32 2^16 6.868 us 4.08% 7.189 us 3.73% 0.321 us 4.67% 🔴 SLOW
I64 I32 2^20 23.427 us 1.23% 23.785 us 1.24% 0.359 us 1.53% 🔴 SLOW
I64 I32 2^24 224.040 us 0.07% 224.369 us 0.08% 0.330 us 0.15% 🔴 SLOW
I64 I32 2^28 3.381 ms 1.31% 3.380 ms 1.29% -0.244 us -0.01% 🔵 SAME
I64 I64 2^16 6.829 us 4.11% 7.191 us 3.99% 0.362 us 5.30% 🔴 SLOW
I64 I64 2^20 23.590 us 1.42% 23.996 us 1.33% 0.405 us 1.72% 🔴 SLOW
I64 I64 2^24 224.270 us 0.07% 224.592 us 0.08% 0.321 us 0.14% 🔴 SLOW
I64 I64 2^28 3.380 ms 1.28% 3.380 ms 1.24% -0.026 us -0.00% 🔵 SAME
I128 I32 2^16 8.735 us 3.44% 9.323 us 3.09% 0.587 us 6.72% 🔴 SLOW
I128 I32 2^20 44.029 us 1.52% 44.583 us 1.64% 0.554 us 1.26% 🔵 SAME
I128 I32 2^24 477.328 us 0.11% 477.997 us 0.11% 0.668 us 0.14% 🔴 SLOW
I128 I32 2^28 7.264 ms 0.55% 7.269 ms 0.56% 5.533 us 0.08% 🔵 SAME
I128 I64 2^16 8.739 us 3.36% 9.230 us 3.36% 0.490 us 5.61% 🔴 SLOW
I128 I64 2^20 45.594 us 0.82% 45.775 us 1.00% 0.181 us 0.40% 🔵 SAME
I128 I64 2^24 477.836 us 0.74% 478.603 us 0.63% 0.767 us 0.16% 🔵 SAME
I128 I64 2^28 7.263 ms 0.54% 7.268 ms 0.56% 4.916 us 0.07% 🔵 SAME
F32 I32 2^16 6.235 us 5.00% 6.506 us 4.57% 0.271 us 4.34% 🔵 SAME
F32 I32 2^20 14.930 us 1.79% 15.291 us 1.93% 0.362 us 2.42% 🔴 SLOW
F32 I32 2^24 122.061 us 2.49% 122.098 us 2.39% 0.036 us 0.03% 🔵 SAME
F32 I32 2^28 1.741 ms 1.65% 1.740 ms 1.61% -0.754 us -0.04% 🔵 SAME
F32 I64 2^16 6.389 us 5.17% 6.558 us 4.64% 0.169 us 2.65% 🔵 SAME
F32 I64 2^20 15.105 us 1.78% 15.388 us 1.98% 0.283 us 1.87% 🔴 SLOW
F32 I64 2^24 119.939 us 0.13% 120.246 us 0.13% 0.308 us 0.26% 🔴 SLOW
F32 I64 2^28 1.740 ms 1.65% 1.740 ms 1.62% -0.060 us -0.00% 🔵 SAME
F64 I32 2^16 8.118 us 3.46% 8.126 us 3.56% 0.008 us 0.10% 🔵 SAME
F64 I32 2^20 24.565 us 1.11% 24.792 us 1.19% 0.227 us 0.92% 🔵 SAME
F64 I32 2^24 227.150 us 0.31% 227.366 us 0.28% 0.216 us 0.09% 🔵 SAME
F64 I32 2^28 3.382 ms 1.13% 3.382 ms 1.10% 0.331 us 0.01% 🔵 SAME
F64 I64 2^16 7.981 us 3.60% 8.149 us 3.45% 0.167 us 2.10% 🔵 SAME
F64 I64 2^20 24.562 us 1.14% 24.865 us 1.22% 0.303 us 1.24% 🔴 SLOW
F64 I64 2^24 227.165 us 0.31% 227.344 us 0.28% 0.179 us 0.08% 🔵 SAME
F64 I64 2^28 3.382 ms 1.11% 3.383 ms 1.16% 1.073 us 0.03% 🔵 SAME
C32 I32 2^16 6.935 us 4.19% 7.185 us 4.28% 0.250 us 3.61% 🔵 SAME
C32 I32 2^20 23.635 us 1.29% 23.941 us 1.28% 0.306 us 1.29% 🔴 SLOW
C32 I32 2^24 224.485 us 0.09% 224.780 us 0.09% 0.296 us 0.13% 🔴 SLOW
C32 I32 2^28 3.385 ms 1.25% 3.385 ms 1.26% 0.629 us 0.02% 🔵 SAME
C32 I64 2^16 6.881 us 4.42% 7.291 us 5.41% 0.410 us 5.95% 🔴 SLOW
C32 I64 2^20 23.273 us 0.51% 23.866 us 1.33% 0.593 us 2.55% 🔴 SLOW
C32 I64 2^24 224.375 us 0.09% 224.649 us 0.10% 0.274 us 0.12% 🔴 SLOW
C32 I64 2^28 3.385 ms 1.26% 3.385 ms 1.23% 0.134 us 0.00% 🔵 SAME

Summary: Total Matches: 64 — Pass: 43 / Failure: 21


h100 — cub.bench.reduce.sum.base

[0] NVIDIA H100

T{ct} OffsetT{ct} Elements{io} Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I8 I64 2^20 8.739 us 5.05% 8.476 us 4.89% -0.264 us -3.02% 🔵 SAME
I8 I64 2^24 21.068 us 42.13% 21.222 us 2.23% 0.154 us 0.73% 🔵 SAME
I8 I64 2^28 161.963 us 7.21% 163.600 us 4.34% 1.637 us 1.01% 🔵 SAME
I16 I32 2^16 7.938 us 7.45% 7.763 us 5.96% -0.175 us -2.20% 🔵 SAME
I16 I32 2^20 9.850 us 8.05% 9.428 us 4.54% -0.422 us -4.28% 🔵 SAME
I16 I32 2^24 32.457 us 25.77% 31.726 us 1.63% -0.731 us -2.25% 🟢 FAST
I16 I32 2^28 293.976 us 4.24% 293.674 us 3.98% -0.301 us -0.10% 🔵 SAME
I16 I64 2^16 8.083 us 6.78% 7.739 us 6.00% -0.344 us -4.26% 🔵 SAME
I16 I64 2^20 10.175 us 76.66% 9.402 us 4.41% -0.773 us -7.60% 🟢 FAST
I16 I64 2^24 32.163 us 1.80% 31.786 us 1.63% -0.377 us -1.17% 🔵 SAME
I16 I64 2^28 294.298 us 4.21% 294.192 us 4.28% -0.106 us -0.04% 🔵 SAME
I32 I32 2^16 8.181 us 7.22% 7.884 us 6.05% -0.297 us -3.63% 🔵 SAME
I32 I32 2^20 11.596 us 4.39% 11.157 us 3.77% -0.439 us -3.79% 🟢 FAST
I32 I32 2^24 55.043 us 1.06% 54.673 us 0.93% -0.370 us -0.67% 🔵 SAME
I32 I32 2^28 565.576 us 2.90% 565.113 us 2.82% -0.463 us -0.08% 🔵 SAME
I32 I64 2^16 8.084 us 6.82% 7.729 us 6.36% -0.355 us -4.39% 🔵 SAME
I32 I64 2^20 12.060 us 73.83% 11.165 us 3.61% -0.895 us -7.42% 🟢 FAST
I32 I64 2^24 55.545 us 15.00% 54.731 us 0.95% -0.814 us -1.47% 🟢 FAST
I32 I64 2^28 565.279 us 2.73% 565.306 us 3.00% 0.028 us 0.00% 🔵 SAME
I64 I32 2^16 8.431 us 6.53% 8.358 us 5.91% -0.073 us -0.86% 🔵 SAME
I64 I32 2^20 14.090 us 3.88% 14.787 us 4.04% 0.696 us 4.94% 🔴 SLOW
I64 I32 2^24 91.217 us 0.54% 91.856 us 0.58% 0.638 us 0.70% 🔴 SLOW
I64 I32 2^28 1.105 ms 1.54% 1.106 ms 1.67% 1.200 us 0.11% 🔵 SAME
I64 I64 2^16 8.225 us 6.34% 8.747 us 6.12% 0.522 us 6.35% 🔴 SLOW
I64 I64 2^20 14.722 us 59.99% 14.868 us 4.07% 0.147 us 1.00% 🔵 SAME
I64 I64 2^24 91.654 us 6.13% 92.419 us 8.04% 0.765 us 0.83% 🔵 SAME
I64 I64 2^28 1.106 ms 1.63% 1.106 ms 1.63% 0.819 us 0.07% 🔵 SAME
I128 I32 2^16 9.449 us 5.28% 10.697 us 88.03% 1.247 us 13.20% 🔴 SLOW
I128 I32 2^20 22.965 us 2.24% 24.167 us 2.25% 1.201 us 5.23% 🔴 SLOW
I128 I32 2^24 172.162 us 1.18% 173.467 us 4.79% 1.305 us 0.76% 🔵 SAME
I128 I32 2^28 2.262 ms 0.76% 2.264 ms 0.66% 1.960 us 0.09% 🔵 SAME
I128 I64 2^16 9.545 us 6.67% 10.363 us 5.76% 0.818 us 8.57% 🔴 SLOW
I128 I64 2^20 22.477 us 2.73% 24.252 us 2.66% 1.775 us 7.90% 🔴 SLOW
I128 I64 2^24 170.802 us 0.46% 173.582 us 4.79% 2.780 us 1.63% 🔴 SLOW
I128 I64 2^28 2.252 ms 0.79% 2.265 ms 0.71% 12.514 us 0.56% 🔵 SAME
F32 I32 2^16 7.861 us 6.10% 8.414 us 6.42% 0.553 us 7.03% 🔴 SLOW
F32 I32 2^20 11.435 us 4.73% 11.911 us 4.87% 0.476 us 4.16% 🔵 SAME
F32 I32 2^24 55.044 us 1.00% 55.574 us 0.99% 0.530 us 0.96% 🔵 SAME
F32 I32 2^28 564.879 us 2.36% 565.664 us 2.46% 0.784 us 0.14% 🔵 SAME
F32 I64 2^16 7.928 us 6.25% 8.423 us 5.81% 0.495 us 6.24% 🔴 SLOW
F32 I64 2^20 11.575 us 4.65% 11.961 us 4.92% 0.386 us 3.33% 🔵 SAME
F32 I64 2^24 55.634 us 14.31% 55.761 us 0.96% 0.127 us 0.23% 🔵 SAME
F32 I64 2^28 565.203 us 2.41% 565.898 us 2.51% 0.695 us 0.12% 🔵 SAME
F64 I32 2^16 8.284 us 6.00% 8.948 us 5.75% 0.664 us 8.01% 🔴 SLOW
F64 I32 2^20 14.381 us 3.67% 15.284 us 52.02% 0.904 us 6.28% 🔴 SLOW
F64 I32 2^24 91.342 us 0.57% 92.236 us 6.37% 0.893 us 0.98% 🔴 SLOW
F64 I32 2^28 1.106 ms 1.75% 1.106 ms 1.45% 0.345 us 0.03% 🔵 SAME
F64 I64 2^16 8.312 us 6.01% 8.975 us 5.79% 0.663 us 7.97% 🔴 SLOW
F64 I64 2^20 14.360 us 3.90% 14.904 us 3.86% 0.544 us 3.79% 🔵 SAME
F64 I64 2^24 91.844 us 6.53% 92.333 us 7.93% 0.489 us 0.53% 🔵 SAME
F64 I64 2^28 1.105 ms 1.55% 1.106 ms 1.55% 0.772 us 0.07% 🔵 SAME
C32 I32 2^16 8.265 us 5.99% 8.934 us 19.28% 0.668 us 8.09% 🔴 SLOW
C32 I32 2^20 14.543 us 3.51% 14.829 us 56.70% 0.287 us 1.97% 🔵 SAME
C32 I32 2^24 92.780 us 9.84% 92.890 us 11.84% 0.110 us 0.12% 🔵 SAME
C32 I32 2^28 1.111 ms 1.61% 1.111 ms 1.73% -0.152 us -0.01% 🔵 SAME
C32 I64 2^16 8.719 us 6.60% 8.390 us 5.76% -0.329 us -3.77% 🔵 SAME
C32 I64 2^20 14.666 us 3.41% 14.528 us 2.84% -0.138 us -0.94% 🔵 SAME
C32 I64 2^24 92.549 us 0.70% 92.593 us 6.68% 0.045 us 0.05% 🔵 SAME
C32 I64 2^28 1.109 ms 1.63% 1.112 ms 1.73% 2.142 us 0.19% 🔵 SAME

Summary: Total Matches: 64 — Pass: 44 / Failure: 20


Details
h100 — cub.bench.reduce.min.base

[0] NVIDIA H100 PCIe

T{ct} OffsetT{ct} Elements{io} Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I8 I32 2^16 7.673 us 5.54% 7.957 us 5.95% 0.284 us 3.70% 🔵 SAME
I8 I32 2^20 8.761 us 4.68% 9.050 us 5.50% 0.289 us 3.29% 🔵 SAME
I8 I32 2^24 21.094 us 4.29% 21.367 us 2.73% 0.273 us 1.29% 🔵 SAME
I8 I32 2^28 160.741 us 6.56% 160.666 us 4.63% -0.075 us -0.05% 🔵 SAME
I8 I64 2^16 8.053 us 5.79% 8.324 us 6.21% 0.271 us 3.37% 🔵 SAME
I8 I64 2^20 8.836 us 4.86% 9.111 us 5.18% 0.275 us 3.11% 🔵 SAME
I8 I64 2^24 21.093 us 2.38% 21.411 us 2.54% 0.318 us 1.51% 🔵 SAME
I8 I64 2^28 161.292 us 8.75% 162.265 us 4.46% 0.972 us 0.60% 🔵 SAME
I16 I32 2^16 8.036 us 6.36% 8.355 us 6.63% 0.319 us 3.97% 🔵 SAME
I16 I32 2^20 9.670 us 4.21% 10.064 us 5.19% 0.394 us 4.07% 🔵 SAME
I16 I32 2^24 32.701 us 1.54% 33.055 us 1.78% 0.354 us 1.08% 🔵 SAME
I16 I32 2^28 294.188 us 3.90% 294.416 us 4.69% 0.228 us 0.08% 🔵 SAME
I16 I64 2^16 8.170 us 5.84% 8.525 us 6.56% 0.354 us 4.34% 🔵 SAME
I16 I64 2^20 9.730 us 3.97% 10.055 us 5.08% 0.326 us 3.35% 🔵 SAME
I16 I64 2^24 32.622 us 1.56% 33.548 us 26.67% 0.926 us 2.84% 🔴 SLOW
I16 I64 2^28 293.772 us 3.44% 294.375 us 4.08% 0.603 us 0.21% 🔵 SAME
I32 I32 2^16 7.899 us 6.04% 8.272 us 6.45% 0.373 us 4.72% 🔵 SAME
I32 I32 2^20 11.173 us 3.89% 11.654 us 4.13% 0.481 us 4.31% 🔴 SLOW
I32 I32 2^24 55.840 us 16.38% 56.054 us 1.08% 0.213 us 0.38% 🔵 SAME
I32 I32 2^28 565.451 us 2.51% 566.066 us 2.61% 0.615 us 0.11% 🔵 SAME
I32 I64 2^16 7.889 us 6.14% 8.372 us 6.58% 0.484 us 6.13% 🔵 SAME
I32 I64 2^20 11.235 us 4.48% 12.055 us 74.60% 0.820 us 7.30% 🔴 SLOW
I32 I64 2^24 55.539 us 0.82% 55.952 us 1.10% 0.413 us 0.74% 🔵 SAME
I32 I64 2^28 565.411 us 2.28% 566.152 us 2.57% 0.741 us 0.13% 🔵 SAME
I64 I32 2^16 8.303 us 5.75% 8.894 us 6.34% 0.591 us 7.11% 🔴 SLOW
I64 I32 2^20 14.358 us 2.96% 14.493 us 3.29% 0.136 us 0.94% 🔵 SAME
I64 I32 2^24 91.794 us 7.29% 92.191 us 9.30% 0.398 us 0.43% 🔵 SAME
I64 I32 2^28 1.105 ms 1.43% 1.105 ms 1.62% 0.249 us 0.02% 🔵 SAME
I64 I64 2^16 8.312 us 5.78% 8.614 us 5.98% 0.301 us 3.63% 🔵 SAME
I64 I64 2^20 14.441 us 10.97% 14.504 us 3.22% 0.064 us 0.44% 🔵 SAME
I64 I64 2^24 92.056 us 10.12% 92.195 us 6.91% 0.138 us 0.15% 🔵 SAME
I64 I64 2^28 1.106 ms 1.55% 1.105 ms 1.51% -0.302 us -0.03% 🔵 SAME
I128 I32 2^16 9.894 us 4.51% 10.278 us 4.99% 0.384 us 3.88% 🔵 SAME
I128 I32 2^20 23.362 us 1.83% 24.036 us 5.00% 0.674 us 2.89% 🔴 SLOW
I128 I32 2^24 173.052 us 3.55% 173.327 us 1.19% 0.275 us 0.16% 🔵 SAME
I128 I32 2^28 2.263 ms 0.70% 2.264 ms 0.65% 1.579 us 0.07% 🔵 SAME
I128 I64 2^16 10.151 us 4.95% 10.350 us 5.66% 0.199 us 1.96% 🔵 SAME
I128 I64 2^20 23.356 us 1.95% 24.002 us 2.16% 0.646 us 2.77% 🔴 SLOW
I128 I64 2^24 173.181 us 4.12% 173.343 us 0.43% 0.162 us 0.09% 🔵 SAME
I128 I64 2^28 2.263 ms 0.71% 2.265 ms 0.73% 2.060 us 0.09% 🔵 SAME
F32 I32 2^16 7.887 us 6.01% 8.640 us 126.34% 0.753 us 9.55% 🔴 SLOW
F32 I32 2^20 11.519 us 4.35% 11.599 us 4.62% 0.080 us 0.69% 🔵 SAME
F32 I32 2^24 55.401 us 14.79% 55.256 us 0.93% -0.145 us -0.26% 🔵 SAME
F32 I32 2^28 565.799 us 3.11% 564.983 us 2.34% -0.816 us -0.14% 🔵 SAME
F32 I64 2^16 7.948 us 5.84% 8.147 us 5.81% 0.199 us 2.50% 🔵 SAME
F32 I64 2^20 11.540 us 4.35% 11.712 us 14.32% 0.172 us 1.49% 🔵 SAME
F32 I64 2^24 55.490 us 14.90% 55.666 us 12.94% 0.176 us 0.32% 🔵 SAME
F32 I64 2^28 564.998 us 2.31% 565.210 us 2.40% 0.212 us 0.04% 🔵 SAME
F64 I32 2^16 8.649 us 5.03% 8.829 us 6.18% 0.180 us 2.08% 🔵 SAME
F64 I32 2^20 14.651 us 3.39% 14.696 us 3.64% 0.045 us 0.31% 🔵 SAME
F64 I32 2^24 92.702 us 8.63% 92.654 us 7.04% -0.048 us -0.05% 🔵 SAME
F64 I32 2^28 1.106 ms 1.52% 1.107 ms 1.80% 0.195 us 0.02% 🔵 SAME
F64 I64 2^16 8.742 us 22.41% 8.766 us 5.53% 0.024 us 0.27% 🔵 SAME
F64 I64 2^20 14.709 us 3.53% 14.765 us 3.75% 0.056 us 0.38% 🔵 SAME
F64 I64 2^24 92.838 us 11.14% 92.346 us 0.57% -0.491 us -0.53% 🔵 SAME
F64 I64 2^28 1.107 ms 1.58% 1.107 ms 1.66% 0.257 us 0.02% 🔵 SAME

Summary: Total Matches: 56 — Pass: 49 / Failure: 7


rtxpro6000 — cub.bench.reduce.sum.base

[0] NVIDIA RTX Pro 6000

T{ct} OffsetT{ct} Elements{io} Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I8 I32 2^16 7.164 us 1.67% 7.417 us 9.00% 0.252 us 3.52% 🔴 SLOW
I8 I32 2^20 9.216 us 0.00% 9.209 us 0.16% -0.007 us -0.07% 🟢 FAST
I8 I32 2^24 27.542 us 1.66% 27.648 us 0.00% 0.106 us 0.38% 🔴 SLOW
I8 I32 2^28 243.102 us 0.34% 243.887 us 0.43% 0.785 us 0.32% 🔵 SAME
I8 I64 2^16 7.169 us 0.10% 7.168 us 0.00% -0.001 us -0.01% 🟡 ????
I8 I64 2^20 9.216 us 0.00% 9.214 us 0.12% -0.002 us -0.02% 🟢 FAST
I8 I64 2^24 27.648 us 0.00% 27.648 us 0.00% 0.000 us 0.00% 🔵 SAME
I8 I64 2^28 242.726 us 0.12% 242.688 us 0.00% -0.038 us -0.02% 🟢 FAST
I16 I32 2^16 7.169 us 0.07% 7.168 us 0.00% -0.001 us -0.01% 🟡 ????
I16 I32 2^20 7.532 us 10.38% 7.573 us 10.75% 0.041 us 0.55% 🔵 SAME
I16 I32 2^24 43.470 us 2.08% 43.623 us 1.95% 0.153 us 0.35% 🔵 SAME
I16 I32 2^28 416.768 us 0.00% 416.849 us 0.10% 0.081 us 0.02% 🔴 SLOW
I16 I64 2^16 7.168 us 0.00% 7.167 us 0.10% -0.001 us -0.02% 🟡 ????
I16 I64 2^20 10.058 us 10.09% 10.037 us 10.01% -0.021 us -0.20% 🔵 SAME
I16 I64 2^24 41.989 us 0.71% 42.049 us 0.96% 0.060 us 0.14% 🔵 SAME
I16 I64 2^28 416.855 us 0.10% 416.768 us 0.00% -0.087 us -0.02% 🟢 FAST
I32 I32 2^16 7.171 us 0.65% 7.168 us 0.09% -0.002 us -0.03% 🔵 SAME
I32 I32 2^20 11.232 us 0.00% 11.250 us 0.55% 0.018 us 0.16% 🔴 SLOW
I32 I32 2^24 71.678 us 1.44% 71.789 us 1.43% 0.111 us 0.15% 🔵 SAME
I32 I32 2^28 765.366 us 0.11% 765.689 us 0.14% 0.323 us 0.04% 🔵 SAME
I32 I64 2^16 7.166 us 0.41% 7.195 us 3.22% 0.029 us 0.40% 🔵 SAME
I32 I64 2^20 13.280 us 0.00% 13.675 us 5.92% 0.395 us 2.98% 🟡 ????
I32 I64 2^24 72.567 us 0.72% 72.704 us 0.41% 0.138 us 0.19% 🔵 SAME
I32 I64 2^28 765.021 us 0.06% 765.032 us 0.06% 0.011 us 0.00% 🔵 SAME
I64 I32 2^16 8.329 us 12.18% 8.880 us 8.50% 0.551 us 6.62% 🔵 SAME
I64 I32 2^20 19.403 us 1.09% 19.424 us 0.04% 0.022 us 0.11% 🔴 SLOW
I64 I32 2^24 132.381 us 0.54% 132.541 us 0.64% 0.159 us 0.12% 🔵 SAME
I64 I32 2^28 1.462 ms 0.07% 1.462 ms 0.08% 0.040 us 0.00% 🔵 SAME
I64 I64 2^16 8.764 us 9.75% 9.216 us 0.00% 0.452 us 5.15% 🔴 SLOW
I64 I64 2^20 19.424 us 0.00% 19.425 us 0.03% 0.001 us 0.00% 🔴 SLOW
I64 I64 2^24 132.507 us 0.62% 132.959 us 0.77% 0.452 us 0.34% 🔵 SAME
I64 I64 2^28 1.462 ms 0.07% 1.461 ms 0.00% -0.494 us -0.03% 🟡 ????
I128 I32 2^16 9.209 us 0.16% 9.282 us 3.86% 0.073 us 0.79% 🔴 SLOW
I128 I32 2^20 28.267 us 3.31% 28.614 us 3.54% 0.348 us 1.23% 🔵 SAME
I128 I32 2^24 244.688 us 0.20% 244.766 us 0.15% 0.078 us 0.03% 🔵 SAME
I128 I32 2^28 2.900 ms 0.11% 2.901 ms 0.11% 0.684 us 0.02% 🔵 SAME
I128 I64 2^16 9.274 us 3.76% 9.502 us 7.53% 0.228 us 2.46% 🔵 SAME
I128 I64 2^20 26.877 us 3.75% 27.426 us 2.73% 0.549 us 2.04% 🔵 SAME
I128 I64 2^24 244.737 us 0.00% 245.792 us 0.10% 1.055 us 0.43% 🔴 SLOW
I128 I64 2^28 2.901 ms 0.11% 2.901 ms 0.11% -0.020 us -0.00% 🔵 SAME
F32 I32 2^16 7.349 us 8.02% 7.384 us 8.63% 0.035 us 0.47% 🔵 SAME
F32 I32 2^20 9.210 us 1.28% 9.216 us 0.00% 0.006 us 0.06% 🔴 SLOW
F32 I32 2^24 71.870 us 1.43% 72.086 us 1.34% 0.216 us 0.30% 🔵 SAME
F32 I32 2^28 764.665 us 0.12% 764.805 us 0.10% 0.140 us 0.02% 🔵 SAME
F32 I64 2^16 7.168 us 0.00% 7.168 us 0.00% 0.000 us 0.00% 🟡 ????
F32 I64 2^20 9.197 us 0.65% 9.216 us 0.00% 0.019 us 0.21% 🔴 SLOW
F32 I64 2^24 71.964 us 1.38% 72.170 us 1.25% 0.206 us 0.29% 🔵 SAME
F32 I64 2^28 765.109 us 0.08% 765.139 us 0.08% 0.030 us 0.00% 🔵 SAME
F64 I32 2^16 11.264 us 0.05% 9.317 us 4.73% -1.947 us -17.29% 🟢 FAST
F64 I32 2^20 17.404 us 1.40% 16.109 us 6.23% -1.295 us -7.44% 🟢 FAST
F64 I32 2^24 135.731 us 0.66% 134.560 us 0.63% -1.171 us -0.86% 🟢 FAST
F64 I32 2^28 1.468 ms 0.13% 1.467 ms 0.12% -0.986 us -0.07% 🔵 SAME
F64 I64 2^16 9.216 us 0.00% 9.213 us 0.10% -0.003 us -0.03% 🟢 FAST
F64 I64 2^20 17.312 us 2.16% 16.344 us 6.32% -0.968 us -5.59% 🟢 FAST
F64 I64 2^24 135.920 us 0.57% 134.968 us 0.75% -0.951 us -0.70% 🟢 FAST
F64 I64 2^28 1.466 ms 0.11% 1.466 ms 0.10% -0.839 us -0.06% 🔵 SAME
C32 I32 2^16 7.168 us 0.00% 7.168 us 0.00% 0.000 us 0.00% 🟡 ????
C32 I32 2^20 19.424 us 0.00% 19.456 us 0.00% 0.032 us 0.16% 🟡 ????
C32 I32 2^24 132.304 us 0.48% 132.298 us 0.46% -0.005 us -0.00% 🔵 SAME
C32 I32 2^28 1.462 ms 0.09% 1.462 ms 0.06% -0.089 us -0.01% 🔵 SAME
C32 I64 2^16 8.686 us 10.32% 9.019 us 6.70% 0.333 us 3.83% 🔵 SAME
C32 I64 2^20 17.376 us 0.00% 17.405 us 0.05% 0.029 us 0.17% 🔴 SLOW
C32 I64 2^24 132.290 us 0.46% 132.411 us 0.57% 0.121 us 0.09% 🔵 SAME
C32 I64 2^28 1.462 ms 0.05% 1.462 ms 0.07% 0.198 us 0.01% 🔵 SAME

Summary: Total Matches: 64 — Pass: 34 / Unknown: 8 / Failure: 22


rtxpro6000 — cub.bench.reduce.min.base

[0] NVIDIA RTX PRO 6000 Blackwell Server Edition

T{ct} OffsetT{ct} Elements{io} Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I8 I32 2^16 9.215 us 0.09% 9.216 us 0.00% 0.001 us 0.01% 🔴 SLOW
I8 I32 2^20 9.215 us 0.08% 9.208 us 0.48% -0.006 us -0.07% 🔵 SAME
I8 I32 2^24 27.973 us 2.66% 27.966 us 2.65% -0.008 us -0.03% 🔵 SAME
I8 I32 2^28 243.828 us 0.43% 243.800 us 0.42% -0.028 us -0.01% 🔵 SAME
I8 I64 2^16 7.233 us 4.96% 7.310 us 7.16% 0.077 us 1.06% 🔵 SAME
I8 I64 2^20 9.216 us 0.00% 9.216 us 0.00% 0.000 us 0.00% 🔵 SAME
I8 I64 2^24 27.740 us 1.55% 27.816 us 2.04% 0.076 us 0.27% 🔵 SAME
I8 I64 2^28 242.720 us 0.12% 242.735 us 0.12% 0.015 us 0.01% 🔵 SAME
I16 I32 2^16 7.183 us 2.42% 7.168 us 0.00% -0.015 us -0.21% 🟡 ????
I16 I32 2^20 7.168 us 0.00% 7.681 us 11.53% 0.513 us 7.15% 🟡 ????
I16 I32 2^24 43.763 us 1.55% 43.833 us 1.37% 0.069 us 0.16% 🔵 SAME
I16 I32 2^28 416.825 us 0.08% 416.851 us 0.15% 0.026 us 0.01% 🔵 SAME
I16 I64 2^16 7.168 us 0.00% 7.165 us 0.16% -0.003 us -0.04% 🟡 ????
I16 I64 2^20 10.093 us 10.06% 10.346 us 9.85% 0.253 us 2.51% 🔵 SAME
I16 I64 2^24 42.041 us 1.00% 42.111 us 1.24% 0.070 us 0.17% 🔵 SAME
I16 I64 2^28 416.811 us 0.07% 416.768 us 0.00% -0.043 us -0.01% 🟢 FAST
I32 I32 2^16 7.227 us 4.86% 7.223 us 4.66% -0.004 us -0.06% 🔵 SAME
I32 I32 2^20 11.232 us 0.00% 11.256 us 0.13% 0.024 us 0.21% 🔴 SLOW
I32 I32 2^24 71.695 us 1.44% 71.818 us 1.43% 0.123 us 0.17% 🔵 SAME
I32 I32 2^28 765.879 us 0.14% 765.892 us 0.14% 0.013 us 0.00% 🔵 SAME
I32 I64 2^16 7.168 us 0.00% 7.263 us 6.01% 0.095 us 1.33% 🟡 ????
I32 I64 2^20 13.746 us 6.21% 13.662 us 5.87% -0.083 us -0.61% 🔵 SAME
I32 I64 2^24 72.672 us 0.00% 72.673 us 0.44% 0.001 us 0.00% 🟡 ????
I32 I64 2^28 765.394 us 0.12% 765.369 us 0.11% -0.025 us -0.00% 🔵 SAME
I64 I32 2^16 7.421 us 9.08% 7.701 us 11.65% 0.280 us 3.77% 🔵 SAME
I64 I32 2^20 19.299 us 2.53% 19.370 us 1.70% 0.071 us 0.37% 🔵 SAME
I64 I32 2^24 134.162 us 0.15% 134.144 us 0.00% -0.018 us -0.01% 🟢 FAST
I64 I32 2^28 1.465 ms 0.09% 1.465 ms 0.07% -0.007 us -0.00% 🔵 SAME
I64 I64 2^16 7.723 us 11.78% 8.393 us 11.96% 0.671 us 8.68% 🔵 SAME
I64 I64 2^20 19.355 us 1.90% 19.375 us 1.66% 0.020 us 0.10% 🔵 SAME
I64 I64 2^24 134.153 us 0.10% 134.162 us 0.14% 0.009 us 0.01% 🔵 SAME
I64 I64 2^28 1.465 ms 0.09% 1.465 ms 0.07% 0.154 us 0.01% 🔵 SAME
I128 I32 2^16 9.264 us 3.81% 9.384 us 6.03% 0.119 us 1.29% 🔵 SAME
I128 I32 2^20 28.721 us 3.56% 28.736 us 3.53% 0.014 us 0.05% 🔵 SAME
I128 I32 2^24 244.736 us 0.00% 244.816 us 0.16% 0.080 us 0.03% 🔴 SLOW
I128 I32 2^28 2.901 ms 0.11% 2.901 ms 0.11% -0.085 us -0.00% 🔵 SAME
I128 I64 2^16 9.343 us 5.23% 9.407 us 6.93% 0.064 us 0.68% 🔵 SAME
I128 I64 2^20 27.628 us 3.40% 27.591 us 2.31% -0.037 us -0.13% 🔵 SAME
I128 I64 2^24 245.814 us 0.26% 245.713 us 0.09% -0.101 us -0.04% 🔵 SAME
I128 I64 2^28 2.901 ms 0.11% 2.901 ms 0.12% -0.310 us -0.01% 🔵 SAME
F32 I32 2^16 7.180 us 2.72% 7.256 us 5.76% 0.076 us 1.06% 🔵 SAME
F32 I32 2^20 9.188 us 0.64% 9.216 us 0.00% 0.028 us 0.31% 🔴 SLOW
F32 I32 2^24 71.986 us 1.35% 72.206 us 1.25% 0.221 us 0.31% 🔵 SAME
F32 I32 2^28 764.928 us 0.00% 765.158 us 0.09% 0.230 us 0.03% 🔴 SLOW
F32 I64 2^16 7.168 us 0.00% 7.168 us 0.00% 0.000 us 0.00% 🟡 ????
F32 I64 2^20 9.200 us 1.31% 9.247 us 2.58% 0.048 us 0.52% 🔵 SAME
F32 I64 2^24 72.045 us 1.35% 72.251 us 1.19% 0.206 us 0.29% 🔵 SAME
F32 I64 2^28 765.104 us 0.08% 765.176 us 0.09% 0.072 us 0.01% 🔵 SAME
F64 I32 2^16 9.216 us 0.00% 9.215 us 0.07% -0.001 us -0.01% 🟢 FAST
F64 I32 2^20 15.328 us 0.00% 15.455 us 3.22% 0.127 us 0.83% 🟡 ????
F64 I32 2^24 136.364 us 0.42% 136.192 us 0.00% -0.172 us -0.13% 🟡 ????
F64 I32 2^28 1.472 ms 0.14% 1.472 ms 0.14% -0.426 us -0.03% 🔵 SAME
F64 I64 2^16 9.216 us 0.00% 9.216 us 0.00% 0.000 us 0.00% 🔵 SAME
F64 I64 2^20 17.193 us 3.40% 16.548 us 6.11% -0.645 us -3.75% 🟢 FAST
F64 I64 2^24 137.111 us 0.74% 136.584 us 0.59% -0.528 us -0.39% 🔵 SAME
F64 I64 2^28 1.471 ms 0.13% 1.471 ms 0.14% -0.540 us -0.04% 🔵 SAME

Summary: Total Matches: 56 — Pass: 39 / Unknown: 8 / Failure: 9

@charan-003 charan-003 requested a review from a team as a code owner May 21, 2026 20:05
@charan-003 charan-003 requested a review from NaderAlAwar May 21, 2026 20:05
@github-project-automation github-project-automation Bot moved this to Todo in CCCL May 21, 2026
@copy-pr-bot

copy-pr-bot Bot commented May 21, 2026

Copy link
Copy Markdown
Contributor

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL May 21, 2026
@coderabbitai

coderabbitai Bot commented May 21, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

suggestion:

Walkthrough

Refactors BlockReduce warp-aggregate application so warp 0 deterministically cooperates to reduce per-warp aggregates (reduce_op_sync fast path, shuffle fallback), removes the atomic non-deterministic path, consolidates Sum/Reduce, and adds NVBench benchmark plus CI bench filter/GPU updates.

Changes

Warp-0 Cooperative Reduction

Layer / File(s) Summary
Dependencies and documentation updates
cub/cub/block/specializations/block_reduce_warp_reductions.cuh
Header includes reordered to add thread-operator and target utilities; comment updated to state warp 0 cooperatively reduces per-warp aggregates.
ApplyWarpAggregates warp-0 cooperative reduction
cub/cub/block/specializations/block_reduce_warp_reductions.cuh
ApplyWarpAggregates reimplemented: compute effective num_warps, warp 0 loads each warp's aggregate per lane, then reduces via target-gated reduce_op_sync (fast path for eligible operator/type on supported SMs when FullTile) or WarpReduceShfl fallback.
Sum and Reduce entry point consolidation
cub/cub/block/specializations/block_reduce_warp_reductions.cuh
Sum and Reduce now unconditionally invoke ApplyWarpAggregates<FullTile>, removing the IsDeterministic compile-time branch and the atomic-backed non-deterministic implementation.

Benchmarks and CI

Layer / File(s) Summary
NVBench base benchmark
cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh
Adds device functor benchmark_op_t<BlockThreads>, benchmark launcher block_reduce_warp_reductions, occupancy-based grid sizing with zero-grid guard, and NVBench registration.
Sum benchmark TU
cub/benchmarks/bench/reduce/block_reduce_warp_reductions_sum.cu
Adds value_types list and op_t = ::cuda::std::plus<>, includes the shared base implementation.
CI benchmark filters & GPU list
ci/bench.yaml
Adds two reduce benchmark filter regexes for new benchmarks and enables rtxa6000 in the CI GPU selection list.

Assessment against linked issues

Objective Addressed Explanation
Replace sequential per-warp aggregation with low-latency warp reduction (#5167)
Use __reduce_sync_op or equivalent on supported SM targets (#5167)
Reduce compile-time overhead from recursive ApplyWarpAggregates calls (#5167) The IsDeterministic branch and one non-deterministic function were removed, but the change retains templated FullTile/type/operator paths; net compile-time instantiation reduction is unclear from diffs provided.

Out-of-scope changes

Code Change Explanation
Add NVBench benchmark base (cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh, lines 1-54) Benchmark/test artifact; not required by the BlockReduce optimization objective.
Add benchmark TU for sum (cub/benchmarks/bench/reduce/block_reduce_warp_reductions_sum.cu, lines 1-35) Benchmark wiring and type-list instantiation are testing artifacts, not part of core optimization.
Update CI bench filters and GPU list (ci/bench.yaml, lines 35-36 and 49-53) CI configuration updates are operational/test routing changes unrelated to BlockReduce implementation goals.

Comment @coderabbitai help to get the list of available commands.

@fbusato

fbusato commented May 21, 2026

Copy link
Copy Markdown
Contributor

@charan-003, thanks a lot for the contribution.

Even before starting the review, I would like to see if there are actual performance benefits for this approach.
I would also encourage you to write a device-side benchmark for it, similarly to WarpReduce

@charan-003

Copy link
Copy Markdown
Contributor Author

@charan-003, thanks a lot for the contribution.

Even before starting the review, I would like to see if there are actual performance benefits for this approach. I would also encourage you to write a device-side benchmark for it, similarly to WarpReduce

sure, working on it

@charan-003 charan-003 force-pushed the optimize_blockreduce branch from 0b48748 to cb6dd6c Compare May 21, 2026 22:10

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
cub/cub/block/specializations/block_reduce_warp_reductions.cuh (2)

125-125: 💤 Low value

suggestion: Use static_cast<int>(warps) instead of C-style cast int(warps) for consistency with the static_cast used in the same expression.

-      const int num_warps = FullTile ? int(warps) : static_cast<int>(::cuda::ceil_div(num_valid, logical_warp_size));
+      const int num_warps = FullTile ? static_cast<int>(warps) : static_cast<int>(::cuda::ceil_div(num_valid, logical_warp_size));

148-148: 💤 Low value

suggestion: Per coding guidelines, variables not modified should be const. dummy_storage is only passed to the constructor.

-      NullType dummy_storage;
+      const NullType dummy_storage{};

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ffc63f63-2430-4ac6-8a11-ae99c579665d

📥 Commits

Reviewing files that changed from the base of the PR and between 0b48748 and cb6dd6c.

📒 Files selected for processing (1)
  • cub/cub/block/specializations/block_reduce_warp_reductions.cuh

@fbusato

fbusato commented May 21, 2026

Copy link
Copy Markdown
Contributor

@charan-003 thanks for reporting them but I think a better approach would be:

  • Add the code for the device-side benchmark.
  • Modify cccl/ci/bench.yaml to select the GPUs where to run the benchmarks.
  • Add a commit with the message [bench-only] to run them.
  • Report only the comparison results, e.g. SLOW, FAST, etc.

@charan-003

charan-003 commented May 22, 2026

Copy link
Copy Markdown
Contributor Author

@charan-003 thanks for reporting them but I think a better approach would be:

  • Add the code for the device-side benchmark.
  • Modify cccl/ci/bench.yaml to select the GPUs where to run the benchmarks.
  • Add a commit with the message [bench-only] to run them.
  • Report only the comparison results, e.g. SLOW, FAST, etc.

Thanks a lot for the guidance! I didn't know the exact process...
Let me add device-side benchmark code

@charan-003 charan-003 requested review from a team as code owners May 22, 2026 01:11
@charan-003 charan-003 requested a review from wmaxey May 22, 2026 01:11
@charan-003 charan-003 force-pushed the optimize_blockreduce branch from 830c2b4 to a949e20 Compare May 22, 2026 01:13

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 53e3ccfa-d3a5-484a-85fd-7cf18e522152

📥 Commits

Reviewing files that changed from the base of the PR and between cb6dd6c and 830c2b4.

📒 Files selected for processing (3)
  • ci/bench.yaml
  • cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh
  • cub/benchmarks/bench/reduce/block_reduce_warp_reductions_sum.cu
✅ Files skipped from review due to trivial changes (1)
  • cub/benchmarks/bench/reduce/block_reduce_warp_reductions_sum.cu

Comment thread ci/bench.yaml Outdated
template <int BlockThreads>
struct benchmark_op_t
{
template <typename T>

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

suggestion: Line 17 and Line 35 use a single-letter template parameter (T). Use _Tp-style naming here to satisfy the template-parameter naming rule for this codebase.

As per coding guidelines "Avoid single letter names for template parameters; use _Tp instead of _T".

Also applies to: 35-35

Comment on lines +18 to +19
__device__ __forceinline__ T operator()(T thread_data) const
{

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Read-only verification: confirm missing CCCL API annotations / noexcept in this file
rg -nP '^\s*(template\s*<[^>]+>\s*)?(?!.*_CCCL_(HOST_API|DEVICE_API|API)).*\b(operator\(\)|block_reduce_warp_reductions)\b' cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh
rg -nP '\b(operator\(\)|block_reduce_warp_reductions)\b(?!.*noexcept)' cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh

Repository: NVIDIA/cccl

Length of output: 391


🏁 Script executed:

#!/bin/bash
set -euo pipefail

sed -n '1,120p' cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh | nl -ba | sed -n '10,80p'

Repository: NVIDIA/cccl

Length of output: 98


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n "operator\(\)|block_reduce_warp_reductions" cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh

Repository: NVIDIA/cccl

Length of output: 289


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Capture the full function signatures/headers for the two targets (best-effort)
perl -0777 -ne '
my $f=$_; 
if($f =~ /(operator\(\)\s*const[^\n]*\n\s*\{)/s){ print "FOUND operator():\n$1\n"; }
if($f =~ /(block_reduce_warp_reductions[^\n]*\n\s*\{)/s){ print "FOUND block_reduce_warp_reductions():\n$1\n"; }
' cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh

Repository: NVIDIA/cccl

Length of output: 170


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show the relevant top portion with line numbers (no nl dependency)
awk '{printf "%5d:%s\n", NR, $0}' cub/benchmarks/bench/reduce/block_reduce_warp_warp_reductions_base.cuh | sed -n '1,110p'

Repository: NVIDIA/cccl

Length of output: 196


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check for noexcept / CCCL API tokens near the declarations
rg -n "operator\(\)|block_reduce_warp_reductions" -n cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh
rg -n --context=3 "operator\(\)|block_reduce_warp_reductions" cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh
rg -n --context=0 "_CCCL_(HOST_API|DEVICE_API|API)" cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh

Repository: NVIDIA/cccl

Length of output: 1049


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Look for exception-throwing in the relevant functions/bodies
rg -n "\bthrow\b" cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh

Repository: NVIDIA/cccl

Length of output: 37


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh"
# Print a small, line-numbered window around operator() and block_reduce_warp_reductions
sed -n '10,60p' "$FILE" | awk '{printf "%5d:%s\n", NR+10-1, $0}'

Repository: NVIDIA/cccl

Length of output: 2048


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh"
# Check for _CCCL_* usage and noexcept in/near the relevant symbols
rg -n "_CCCL_(HOST_API|DEVICE_API|API)" "$FILE"
rg -n "noexcept" "$FILE"
rg -n --context=2 "operator\(\)" "$FILE"
rg -n --context=2 "block_reduce_warp_reductions" "$FILE"

Repository: NVIDIA/cccl

Length of output: 37


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh"
# Look for exception throwing inside this file
rg -n "\bthrow\b" "$FILE"

Repository: NVIDIA/cccl

Length of output: 37


🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh"

echo "== _CCCL_* tokens =="
rg -n "_CCCL_(HOST_API|DEVICE_API|API)" "$FILE" || true

echo "== noexcept tokens =="
rg -n "noexcept" "$FILE" || true

echo "== throw tokens =="
rg -n "\bthrow\b" "$FILE" || true

Repository: NVIDIA/cccl

Length of output: 117


important: Add required _CCCL_* API annotations and noexcept to benchmark functions.

  • benchmark_op_t::operator() (line 18) lacks _CCCL_DEVICE_API and noexcept.
  • block_reduce_warp_reductions (line 36) lacks _CCCL_HOST_API and noexcept.
Suggested change
-  __device__ __forceinline__ T operator()(T thread_data) const
+  _CCCL_DEVICE_API __forceinline__ auto operator()(_Tp thread_data) const noexcept -> _Tp
-template <typename T>
-void block_reduce_warp_reductions(nvbench::state& state, nvbench::type_list<T>)
+template <typename _Tp>
+_CCCL_HOST_API void block_reduce_warp_reductions(nvbench::state& state, nvbench::type_list<_Tp>) noexcept

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 519c38f3-5544-4f5a-9750-fc04a66589e0

📥 Commits

Reviewing files that changed from the base of the PR and between 830c2b4 and a949e20.

📒 Files selected for processing (3)
  • ci/bench.yaml
  • cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh
  • cub/benchmarks/bench/reduce/block_reduce_warp_reductions_sum.cu
✅ Files skipped from review due to trivial changes (1)
  • ci/bench.yaml

Comment thread cub/benchmarks/bench/reduce/block_reduce_base.cuh Outdated
@fbusato

fbusato commented May 22, 2026

Copy link
Copy Markdown
Contributor

/ok to test 80b5ac2

@github-actions

This comment has been minimized.

@charan-003

Copy link
Copy Markdown
Contributor Author

@fbusato the benchmarks shows a slowdown. can we use bit_ceil(warps) for the logical warp size. Does this approach look reasonable, or would you suggest a different direction?

@fbusato

fbusato commented May 22, 2026

Copy link
Copy Markdown
Contributor

The code looks good and I really like the simplifications. Related to performance, some high-level thoughts:

@fbusato

fbusato commented May 28, 2026

Copy link
Copy Markdown
Contributor

/ok to test dae2e4a

@fbusato

fbusato commented May 28, 2026

Copy link
Copy Markdown
Contributor

@charan-003 sorry for the delay. Please ping me every time you need to trigger the CI

@github-actions

This comment has been minimized.

@charan-003

Copy link
Copy Markdown
Contributor Author

@fbusato can you please trigger the CI?
Thank you.

@fbusato

fbusato commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

/ok to test 4210958

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@charan-003 charan-003 requested a review from fbusato June 6, 2026 05:46
int max_blocks_per_SM = 0;
NVBENCH_CUDA_CALL_NOEXCEPT(
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks_per_SM, kernel, BlockThreads, 0));
const int grid_size = max_blocks_per_SM * num_SMs;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a couple of issues with this benchmarking approach that I tried to address in #8607:

  1. If another (version of the) compiler decides to use more or less registers in a way that changes occupancy, results become very hard to compare for performance regression testing. My first idea was to just include the grid size in the benchmark output, but that didn't help much as it would still not show up in our current infra's benchmark comparisons. So I went for a bigger hammer and just use a hardcoded workload that is big enough that the tail effect should be negligible on current devices (2^28 items same as the biggest workloads for our device primitive benchmarks).
  2. Both the current approach and even more so my solution to 1. is measuring throughput. For block and warp primitives we are also interested in their latency in contexts where they do not make up the main workload of a kernel, but are still in the critical path. Therefore I added a second benchmark where only a single warp is launched.
  3. This one might be less important, but it was a particular concern for WarpReduceBatched: I also added a benchmarking kernel with full launch bounds to force the compiler to deal with a lot of register pressure without pushing occupancy below a certain value. If a code change or toolchain change should affect the register allocation, this benchmark should show it very clearly since it is much more prone to spilling to local memory.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pauleonix thanks for the comments

so from what i understand for this PR

  1. we also want to replace occupancy based grids with 2^28 workloads.
  2. Add a single warp latency benchmark alongside the throughput one?
  3. Add a WarpReduceBatched benchmark with full launch bounds?

is that right? anything else you'd like to see?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. and 2. we want. For 3. I'm not yet sure if it justifies the additional compilation and benchmarking time.

@charan-003

Copy link
Copy Markdown
Contributor Author

@fbusato @pauleonix, do we want to add anything else?

@fbusato

fbusato commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

/ok to test 88b1da0

@github-actions

Copy link
Copy Markdown
Contributor

❌ Benchmark Results

Benchmark comparison had failures.

Results
Artifacts

@github-actions

Copy link
Copy Markdown
Contributor

🥳 CI Workflow Results

🟩 Finished in 2h 29m: Pass: 100%/287 | Total: 12d 17h | Max: 2h 18m | Hits: 22%/974831

See results here.

@charan-003

charan-003 commented Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

@fbusato ping

@fbusato

fbusato commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

@charan-003 I don't have any additional comment. There are two things to do before merging the PRs:

  • revert bench.yaml
  • report the benchmark results, essentially the tables of the nvbench comparison script, on a comment or PR description

@charan-003

Copy link
Copy Markdown
Contributor Author

@charan-003 I don't have any additional comment. There are two things to do before merging the PRs:

  • revert bench.yaml
  • report the benchmark results, essentially the tables of the nvbench comparison script, on a comment or PR description

@fbusato
Done,
Thanks a lot

@fbusato

fbusato commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

these results don't show a clear benefits. I would add the results for all 3 SMs for DeviceReduce and BlockReduce benchmarks.
Nit, you can report the results in sections https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/organizing-information-with-collapsed-sections

@charan-003

Copy link
Copy Markdown
Contributor Author

these results don't show a clear benefits. I would add the results for all 3 SMs for DeviceReduce and BlockReduce benchmarks. Nit, you can report the results in sections https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/organizing-information-with-collapsed-sections

@fbusato
done, sorry for the delay

@fbusato

fbusato commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

the results show minimal differences. Could you also please add the results for device-level benchmarks?

@charan-003

Copy link
Copy Markdown
Contributor Author

the results show minimal differences. Could you also please add the results for device-level benchmarks?

added device-level benchmark
agree there's no clear end-to-end win at large sizes

@fbusato

fbusato commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

sorry @charan-003, maybe I didn't explain it well.
The optimization that you added in this PR is hard to observe for the full routine.
You correctly added a new device-side benchmark in block_reduce_warp_reductions_sum.cu for this reason. These results can be useful to understand if the performance improvement can be observed at block-level.

Comment thread cub/benchmarks/bench/reduce/block_reduce_sum.cu
Comment thread cub/benchmarks/bench/reduce/warp_reduce_base.cuh Outdated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

[FEA]: Optimize BlockReduce

4 participants