optimize blockReduceWarp by charan-003 · Pull Request #9101 · NVIDIA/cccl

charan-003 · 2026-05-21T20:05:32Z

Replace sequential aggregation with cooperative warp reduction

Benchmark Comparison: `optimize_blockreduce` vs `origin/main`

GPU targets: rtxa6000, h100, rtxpro6000
Base label: origin/main

rtxa6000 — cub.bench.reduce.min.base

[0] NVIDIA RTX A6000

T{ct}	OffsetT{ct}	Elements{io}	Ref Time	Ref Noise	Cmp Time	Cmp Noise	Diff	%Diff	Status
I8	I32	2^16	6.443 us	4.14%	6.203 us	4.59%	-0.240 us	-3.72%	🔵 SAME
I8	I32	2^20	8.773 us	3.33%	8.464 us	3.29%	-0.309 us	-3.52%	🟢 FAST
I8	I32	2^24	39.725 us	0.74%	39.222 us	0.42%	-0.502 us	-1.26%	🟢 FAST
I8	I32	2^28	452.705 us	3.39%	452.478 us	3.39%	-0.227 us	-0.05%	🔵 SAME
I8	I64	2^16	5.825 us	1.63%	6.076 us	4.36%	0.251 us	4.30%	🔴 SLOW
I8	I64	2^20	8.547 us	3.22%	8.245 us	1.35%	-0.302 us	-3.53%	🟢 FAST
I8	I64	2^24	39.567 us	0.76%	39.019 us	0.82%	-0.548 us	-1.39%	🟢 FAST
I8	I64	2^28	454.296 us	3.35%	457.069 us	3.51%	2.772 us	0.61%	🔵 SAME
I16	I32	2^16	6.109 us	1.00%	6.319 us	4.23%	0.210 us	3.44%	🔴 SLOW
I16	I32	2^20	10.274 us	1.19%	10.686 us	2.50%	0.412 us	4.01%	🔴 SLOW
I16	I32	2^24	67.186 us	0.43%	67.283 us	0.41%	0.097 us	0.14%	🔵 SAME
I16	I32	2^28	877.823 us	2.47%	878.647 us	2.45%	0.824 us	0.09%	🔵 SAME
I16	I64	2^16	6.241 us	3.21%	6.456 us	4.51%	0.215 us	3.44%	🔴 SLOW
I16	I64	2^20	10.270 us	1.17%	10.706 us	2.48%	0.436 us	4.24%	🔴 SLOW
I16	I64	2^24	67.225 us	0.42%	67.119 us	0.41%	-0.106 us	-0.16%	🔵 SAME
I16	I64	2^28	877.483 us	2.36%	874.667 us	2.47%	-2.816 us	-0.32%	🔵 SAME
I32	I32	2^16	5.929 us	3.68%	6.264 us	4.57%	0.335 us	5.65%	🔴 SLOW
I32	I32	2^20	14.811 us	1.72%	14.903 us	1.88%	0.092 us	0.62%	🔵 SAME
I32	I32	2^24	119.517 us	0.24%	120.028 us	0.24%	0.512 us	0.43%	🔴 SLOW
I32	I32	2^28	1.741 ms	1.67%	1.728 ms	1.67%	-13.048 us	-0.75%	🔵 SAME
I32	I64	2^16	6.156 us	4.72%	6.371 us	4.64%	0.215 us	3.49%	🔵 SAME
I32	I64	2^20	14.744 us	1.89%	14.891 us	1.85%	0.147 us	1.00%	🔵 SAME
I32	I64	2^24	119.648 us	0.24%	120.237 us	0.24%	0.589 us	0.49%	🔴 SLOW
I32	I64	2^28	1.740 ms	1.58%	1.727 ms	1.52%	-13.147 us	-0.76%	🔵 SAME
I64	I32	2^16	6.913 us	4.20%	7.250 us	4.32%	0.337 us	4.87%	🔴 SLOW
I64	I32	2^20	23.589 us	1.26%	23.737 us	1.25%	0.148 us	0.63%	🔵 SAME
I64	I32	2^24	224.202 us	0.07%	224.576 us	0.13%	0.374 us	0.17%	🔴 SLOW
I64	I32	2^28	3.381 ms	1.39%	3.380 ms	1.23%	-1.249 us	-0.04%	🔵 SAME
I64	I64	2^16	6.894 us	4.03%	7.262 us	4.04%	0.368 us	5.34%	🔴 SLOW
I64	I64	2^20	23.668 us	1.41%	23.933 us	1.35%	0.265 us	1.12%	🔵 SAME
I64	I64	2^24	224.323 us	0.07%	224.810 us	0.14%	0.487 us	0.22%	🔴 SLOW
I64	I64	2^28	3.380 ms	1.29%	3.381 ms	1.28%	0.657 us	0.02%	🔵 SAME
I128	I32	2^16	9.324 us	3.10%	9.900 us	3.05%	0.576 us	6.18%	🔴 SLOW
I128	I32	2^20	45.408 us	1.61%	45.897 us	1.67%	0.490 us	1.08%	🔵 SAME
I128	I32	2^24	485.106 us	0.10%	485.032 us	0.11%	-0.073 us	-0.02%	🔵 SAME
I128	I32	2^28	7.370 ms	0.53%	7.379 ms	0.55%	9.664 us	0.13%	🔵 SAME
I128	I64	2^16	9.057 us	3.30%	9.905 us	2.94%	0.849 us	9.37%	🔴 SLOW
I128	I64	2^20	45.708 us	0.86%	47.293 us	0.91%	1.585 us	3.47%	🔴 SLOW
I128	I64	2^24	477.357 us	0.12%	484.804 us	0.12%	7.447 us	1.56%	🔴 SLOW
I128	I64	2^28	7.264 ms	0.54%	7.379 ms	0.55%	115.104 us	1.58%	🔴 SLOW
F32	I32	2^16	6.192 us	4.45%	6.626 us	4.16%	0.434 us	7.01%	🔴 SLOW
F32	I32	2^20	14.638 us	2.21%	14.973 us	1.92%	0.335 us	2.29%	🔴 SLOW
F32	I32	2^24	121.718 us	2.43%	121.970 us	2.44%	0.252 us	0.21%	🔵 SAME
F32	I32	2^28	1.740 ms	1.62%	1.741 ms	1.59%	0.217 us	0.01%	🔵 SAME
F32	I64	2^16	6.085 us	4.04%	6.625 us	4.51%	0.541 us	8.89%	🔴 SLOW
F32	I64	2^20	14.932 us	1.90%	15.288 us	1.93%	0.356 us	2.38%	🔴 SLOW
F32	I64	2^24	119.788 us	0.11%	120.366 us	0.23%	0.578 us	0.48%	🔴 SLOW
F32	I64	2^28	1.740 ms	1.64%	1.741 ms	1.55%	0.480 us	0.03%	🔵 SAME
F64	I32	2^16	8.012 us	2.96%	8.324 us	3.70%	0.311 us	3.89%	🔴 SLOW
F64	I32	2^20	24.851 us	1.27%	24.846 us	1.06%	-0.005 us	-0.02%	🔵 SAME
F64	I32	2^24	227.362 us	0.29%	227.741 us	0.30%	0.379 us	0.17%	🔵 SAME
F64	I32	2^28	3.384 ms	1.19%	3.384 ms	1.19%	0.661 us	0.02%	🔵 SAME
F64	I64	2^16	7.965 us	3.48%	8.399 us	3.79%	0.435 us	5.46%	🔴 SLOW
F64	I64	2^20	24.876 us	1.22%	24.879 us	1.11%	0.003 us	0.01%	🔵 SAME
F64	I64	2^24	227.410 us	0.28%	227.751 us	0.31%	0.341 us	0.15%	🔵 SAME
F64	I64	2^28	3.384 ms	1.21%	3.384 ms	1.13%	0.053 us	0.00%	🔵 SAME

Summary: Total Matches: 56 — Pass: 28 / Failure: 28

rtxa6000 — cub.bench.reduce.sum.base

[0] NVIDIA RTX A6000

T{ct}	OffsetT{ct}	Elements{io}	Ref Time	Ref Noise	Cmp Time	Cmp Noise	Diff	%Diff	Status
I8	I32	2^16	5.876 us	4.47%	5.950 us	4.51%	0.074 us	1.25%	🔵 SAME
I8	I32	2^20	8.329 us	3.37%	8.336 us	3.38%	0.007 us	0.08%	🔵 SAME
I8	I32	2^24	38.631 us	0.77%	38.648 us	0.75%	0.017 us	0.04%	🔵 SAME
I8	I32	2^28	449.642 us	3.18%	449.834 us	3.23%	0.192 us	0.04%	🔵 SAME
I8	I64	2^16	5.745 us	1.33%	5.807 us	1.53%	0.062 us	1.08%	🔵 SAME
I8	I64	2^20	8.365 us	3.26%	8.359 us	3.15%	-0.006 us	-0.07%	🔵 SAME
I8	I64	2^24	38.645 us	0.77%	39.173 us	0.83%	0.529 us	1.37%	🔴 SLOW
I8	I64	2^28	448.990 us	3.13%	450.452 us	3.14%	1.462 us	0.33%	🔵 SAME
I16	I32	2^16	5.889 us	1.42%	5.955 us	2.22%	0.066 us	1.12%	🔵 SAME
I16	I32	2^20	10.251 us	1.25%	10.309 us	1.30%	0.058 us	0.57%	🔵 SAME
I16	I32	2^24	66.399 us	0.43%	66.490 us	0.44%	0.090 us	0.14%	🔵 SAME
I16	I32	2^28	870.659 us	2.41%	872.107 us	2.55%	1.448 us	0.17%	🔵 SAME
I16	I64	2^16	5.886 us	2.92%	6.114 us	3.20%	0.228 us	3.88%	🔴 SLOW
I16	I64	2^20	10.208 us	1.16%	10.270 us	1.29%	0.062 us	0.61%	🔵 SAME
I16	I64	2^24	66.243 us	0.42%	66.344 us	0.45%	0.101 us	0.15%	🔵 SAME
I16	I64	2^28	869.308 us	2.41%	870.345 us	2.51%	1.037 us	0.12%	🔵 SAME
I32	I32	2^16	5.970 us	4.51%	6.211 us	4.01%	0.240 us	4.02%	🔴 SLOW
I32	I32	2^20	14.947 us	1.78%	14.981 us	1.87%	0.034 us	0.23%	🔵 SAME
I32	I32	2^24	119.569 us	0.24%	119.462 us	0.15%	-0.107 us	-0.09%	🔵 SAME
I32	I32	2^28	1.740 ms	1.63%	1.740 ms	1.60%	-0.307 us	-0.02%	🔵 SAME
I32	I64	2^16	6.236 us	4.93%	6.310 us	4.63%	0.074 us	1.19%	🔵 SAME
I32	I64	2^20	14.908 us	1.85%	14.916 us	1.94%	0.008 us	0.05%	🔵 SAME
I32	I64	2^24	119.607 us	0.25%	119.553 us	0.13%	-0.055 us	-0.05%	🔵 SAME
I32	I64	2^28	1.740 ms	1.64%	1.741 ms	1.62%	1.034 us	0.06%	🔵 SAME
I64	I32	2^16	6.868 us	4.08%	7.189 us	3.73%	0.321 us	4.67%	🔴 SLOW
I64	I32	2^20	23.427 us	1.23%	23.785 us	1.24%	0.359 us	1.53%	🔴 SLOW
I64	I32	2^24	224.040 us	0.07%	224.369 us	0.08%	0.330 us	0.15%	🔴 SLOW
I64	I32	2^28	3.381 ms	1.31%	3.380 ms	1.29%	-0.244 us	-0.01%	🔵 SAME
I64	I64	2^16	6.829 us	4.11%	7.191 us	3.99%	0.362 us	5.30%	🔴 SLOW
I64	I64	2^20	23.590 us	1.42%	23.996 us	1.33%	0.405 us	1.72%	🔴 SLOW
I64	I64	2^24	224.270 us	0.07%	224.592 us	0.08%	0.321 us	0.14%	🔴 SLOW
I64	I64	2^28	3.380 ms	1.28%	3.380 ms	1.24%	-0.026 us	-0.00%	🔵 SAME
I128	I32	2^16	8.735 us	3.44%	9.323 us	3.09%	0.587 us	6.72%	🔴 SLOW
I128	I32	2^20	44.029 us	1.52%	44.583 us	1.64%	0.554 us	1.26%	🔵 SAME
I128	I32	2^24	477.328 us	0.11%	477.997 us	0.11%	0.668 us	0.14%	🔴 SLOW
I128	I32	2^28	7.264 ms	0.55%	7.269 ms	0.56%	5.533 us	0.08%	🔵 SAME
I128	I64	2^16	8.739 us	3.36%	9.230 us	3.36%	0.490 us	5.61%	🔴 SLOW
I128	I64	2^20	45.594 us	0.82%	45.775 us	1.00%	0.181 us	0.40%	🔵 SAME
I128	I64	2^24	477.836 us	0.74%	478.603 us	0.63%	0.767 us	0.16%	🔵 SAME
I128	I64	2^28	7.263 ms	0.54%	7.268 ms	0.56%	4.916 us	0.07%	🔵 SAME
F32	I32	2^16	6.235 us	5.00%	6.506 us	4.57%	0.271 us	4.34%	🔵 SAME
F32	I32	2^20	14.930 us	1.79%	15.291 us	1.93%	0.362 us	2.42%	🔴 SLOW
F32	I32	2^24	122.061 us	2.49%	122.098 us	2.39%	0.036 us	0.03%	🔵 SAME
F32	I32	2^28	1.741 ms	1.65%	1.740 ms	1.61%	-0.754 us	-0.04%	🔵 SAME
F32	I64	2^16	6.389 us	5.17%	6.558 us	4.64%	0.169 us	2.65%	🔵 SAME
F32	I64	2^20	15.105 us	1.78%	15.388 us	1.98%	0.283 us	1.87%	🔴 SLOW
F32	I64	2^24	119.939 us	0.13%	120.246 us	0.13%	0.308 us	0.26%	🔴 SLOW
F32	I64	2^28	1.740 ms	1.65%	1.740 ms	1.62%	-0.060 us	-0.00%	🔵 SAME
F64	I32	2^16	8.118 us	3.46%	8.126 us	3.56%	0.008 us	0.10%	🔵 SAME
F64	I32	2^20	24.565 us	1.11%	24.792 us	1.19%	0.227 us	0.92%	🔵 SAME
F64	I32	2^24	227.150 us	0.31%	227.366 us	0.28%	0.216 us	0.09%	🔵 SAME
F64	I32	2^28	3.382 ms	1.13%	3.382 ms	1.10%	0.331 us	0.01%	🔵 SAME
F64	I64	2^16	7.981 us	3.60%	8.149 us	3.45%	0.167 us	2.10%	🔵 SAME
F64	I64	2^20	24.562 us	1.14%	24.865 us	1.22%	0.303 us	1.24%	🔴 SLOW
F64	I64	2^24	227.165 us	0.31%	227.344 us	0.28%	0.179 us	0.08%	🔵 SAME
F64	I64	2^28	3.382 ms	1.11%	3.383 ms	1.16%	1.073 us	0.03%	🔵 SAME
C32	I32	2^16	6.935 us	4.19%	7.185 us	4.28%	0.250 us	3.61%	🔵 SAME
C32	I32	2^20	23.635 us	1.29%	23.941 us	1.28%	0.306 us	1.29%	🔴 SLOW
C32	I32	2^24	224.485 us	0.09%	224.780 us	0.09%	0.296 us	0.13%	🔴 SLOW
C32	I32	2^28	3.385 ms	1.25%	3.385 ms	1.26%	0.629 us	0.02%	🔵 SAME
C32	I64	2^16	6.881 us	4.42%	7.291 us	5.41%	0.410 us	5.95%	🔴 SLOW
C32	I64	2^20	23.273 us	0.51%	23.866 us	1.33%	0.593 us	2.55%	🔴 SLOW
C32	I64	2^24	224.375 us	0.09%	224.649 us	0.10%	0.274 us	0.12%	🔴 SLOW
C32	I64	2^28	3.385 ms	1.26%	3.385 ms	1.23%	0.134 us	0.00%	🔵 SAME

Summary: Total Matches: 64 — Pass: 43 / Failure: 21

h100 — cub.bench.reduce.sum.base

[0] NVIDIA H100

T{ct}	OffsetT{ct}	Elements{io}	Ref Time	Ref Noise	Cmp Time	Cmp Noise	Diff	%Diff	Status
I8	I64	2^20	8.739 us	5.05%	8.476 us	4.89%	-0.264 us	-3.02%	🔵 SAME
I8	I64	2^24	21.068 us	42.13%	21.222 us	2.23%	0.154 us	0.73%	🔵 SAME
I8	I64	2^28	161.963 us	7.21%	163.600 us	4.34%	1.637 us	1.01%	🔵 SAME
I16	I32	2^16	7.938 us	7.45%	7.763 us	5.96%	-0.175 us	-2.20%	🔵 SAME
I16	I32	2^20	9.850 us	8.05%	9.428 us	4.54%	-0.422 us	-4.28%	🔵 SAME
I16	I32	2^24	32.457 us	25.77%	31.726 us	1.63%	-0.731 us	-2.25%	🟢 FAST
I16	I32	2^28	293.976 us	4.24%	293.674 us	3.98%	-0.301 us	-0.10%	🔵 SAME
I16	I64	2^16	8.083 us	6.78%	7.739 us	6.00%	-0.344 us	-4.26%	🔵 SAME
I16	I64	2^20	10.175 us	76.66%	9.402 us	4.41%	-0.773 us	-7.60%	🟢 FAST
I16	I64	2^24	32.163 us	1.80%	31.786 us	1.63%	-0.377 us	-1.17%	🔵 SAME
I16	I64	2^28	294.298 us	4.21%	294.192 us	4.28%	-0.106 us	-0.04%	🔵 SAME
I32	I32	2^16	8.181 us	7.22%	7.884 us	6.05%	-0.297 us	-3.63%	🔵 SAME
I32	I32	2^20	11.596 us	4.39%	11.157 us	3.77%	-0.439 us	-3.79%	🟢 FAST
I32	I32	2^24	55.043 us	1.06%	54.673 us	0.93%	-0.370 us	-0.67%	🔵 SAME
I32	I32	2^28	565.576 us	2.90%	565.113 us	2.82%	-0.463 us	-0.08%	🔵 SAME
I32	I64	2^16	8.084 us	6.82%	7.729 us	6.36%	-0.355 us	-4.39%	🔵 SAME
I32	I64	2^20	12.060 us	73.83%	11.165 us	3.61%	-0.895 us	-7.42%	🟢 FAST
I32	I64	2^24	55.545 us	15.00%	54.731 us	0.95%	-0.814 us	-1.47%	🟢 FAST
I32	I64	2^28	565.279 us	2.73%	565.306 us	3.00%	0.028 us	0.00%	🔵 SAME
I64	I32	2^16	8.431 us	6.53%	8.358 us	5.91%	-0.073 us	-0.86%	🔵 SAME
I64	I32	2^20	14.090 us	3.88%	14.787 us	4.04%	0.696 us	4.94%	🔴 SLOW
I64	I32	2^24	91.217 us	0.54%	91.856 us	0.58%	0.638 us	0.70%	🔴 SLOW
I64	I32	2^28	1.105 ms	1.54%	1.106 ms	1.67%	1.200 us	0.11%	🔵 SAME
I64	I64	2^16	8.225 us	6.34%	8.747 us	6.12%	0.522 us	6.35%	🔴 SLOW
I64	I64	2^20	14.722 us	59.99%	14.868 us	4.07%	0.147 us	1.00%	🔵 SAME
I64	I64	2^24	91.654 us	6.13%	92.419 us	8.04%	0.765 us	0.83%	🔵 SAME
I64	I64	2^28	1.106 ms	1.63%	1.106 ms	1.63%	0.819 us	0.07%	🔵 SAME
I128	I32	2^16	9.449 us	5.28%	10.697 us	88.03%	1.247 us	13.20%	🔴 SLOW
I128	I32	2^20	22.965 us	2.24%	24.167 us	2.25%	1.201 us	5.23%	🔴 SLOW
I128	I32	2^24	172.162 us	1.18%	173.467 us	4.79%	1.305 us	0.76%	🔵 SAME
I128	I32	2^28	2.262 ms	0.76%	2.264 ms	0.66%	1.960 us	0.09%	🔵 SAME
I128	I64	2^16	9.545 us	6.67%	10.363 us	5.76%	0.818 us	8.57%	🔴 SLOW
I128	I64	2^20	22.477 us	2.73%	24.252 us	2.66%	1.775 us	7.90%	🔴 SLOW
I128	I64	2^24	170.802 us	0.46%	173.582 us	4.79%	2.780 us	1.63%	🔴 SLOW
I128	I64	2^28	2.252 ms	0.79%	2.265 ms	0.71%	12.514 us	0.56%	🔵 SAME
F32	I32	2^16	7.861 us	6.10%	8.414 us	6.42%	0.553 us	7.03%	🔴 SLOW
F32	I32	2^20	11.435 us	4.73%	11.911 us	4.87%	0.476 us	4.16%	🔵 SAME
F32	I32	2^24	55.044 us	1.00%	55.574 us	0.99%	0.530 us	0.96%	🔵 SAME
F32	I32	2^28	564.879 us	2.36%	565.664 us	2.46%	0.784 us	0.14%	🔵 SAME
F32	I64	2^16	7.928 us	6.25%	8.423 us	5.81%	0.495 us	6.24%	🔴 SLOW
F32	I64	2^20	11.575 us	4.65%	11.961 us	4.92%	0.386 us	3.33%	🔵 SAME
F32	I64	2^24	55.634 us	14.31%	55.761 us	0.96%	0.127 us	0.23%	🔵 SAME
F32	I64	2^28	565.203 us	2.41%	565.898 us	2.51%	0.695 us	0.12%	🔵 SAME
F64	I32	2^16	8.284 us	6.00%	8.948 us	5.75%	0.664 us	8.01%	🔴 SLOW
F64	I32	2^20	14.381 us	3.67%	15.284 us	52.02%	0.904 us	6.28%	🔴 SLOW
F64	I32	2^24	91.342 us	0.57%	92.236 us	6.37%	0.893 us	0.98%	🔴 SLOW
F64	I32	2^28	1.106 ms	1.75%	1.106 ms	1.45%	0.345 us	0.03%	🔵 SAME
F64	I64	2^16	8.312 us	6.01%	8.975 us	5.79%	0.663 us	7.97%	🔴 SLOW
F64	I64	2^20	14.360 us	3.90%	14.904 us	3.86%	0.544 us	3.79%	🔵 SAME
F64	I64	2^24	91.844 us	6.53%	92.333 us	7.93%	0.489 us	0.53%	🔵 SAME
F64	I64	2^28	1.105 ms	1.55%	1.106 ms	1.55%	0.772 us	0.07%	🔵 SAME
C32	I32	2^16	8.265 us	5.99%	8.934 us	19.28%	0.668 us	8.09%	🔴 SLOW
C32	I32	2^20	14.543 us	3.51%	14.829 us	56.70%	0.287 us	1.97%	🔵 SAME
C32	I32	2^24	92.780 us	9.84%	92.890 us	11.84%	0.110 us	0.12%	🔵 SAME
C32	I32	2^28	1.111 ms	1.61%	1.111 ms	1.73%	-0.152 us	-0.01%	🔵 SAME
C32	I64	2^16	8.719 us	6.60%	8.390 us	5.76%	-0.329 us	-3.77%	🔵 SAME
C32	I64	2^20	14.666 us	3.41%	14.528 us	2.84%	-0.138 us	-0.94%	🔵 SAME
C32	I64	2^24	92.549 us	0.70%	92.593 us	6.68%	0.045 us	0.05%	🔵 SAME
C32	I64	2^28	1.109 ms	1.63%	1.112 ms	1.73%	2.142 us	0.19%	🔵 SAME

Summary: Total Matches: 64 — Pass: 44 / Failure: 20

Details

h100 — cub.bench.reduce.min.base

[0] NVIDIA H100 PCIe

T{ct}	OffsetT{ct}	Elements{io}	Ref Time	Ref Noise	Cmp Time	Cmp Noise	Diff	%Diff	Status
I8	I32	2^16	7.673 us	5.54%	7.957 us	5.95%	0.284 us	3.70%	🔵 SAME
I8	I32	2^20	8.761 us	4.68%	9.050 us	5.50%	0.289 us	3.29%	🔵 SAME
I8	I32	2^24	21.094 us	4.29%	21.367 us	2.73%	0.273 us	1.29%	🔵 SAME
I8	I32	2^28	160.741 us	6.56%	160.666 us	4.63%	-0.075 us	-0.05%	🔵 SAME
I8	I64	2^16	8.053 us	5.79%	8.324 us	6.21%	0.271 us	3.37%	🔵 SAME
I8	I64	2^20	8.836 us	4.86%	9.111 us	5.18%	0.275 us	3.11%	🔵 SAME
I8	I64	2^24	21.093 us	2.38%	21.411 us	2.54%	0.318 us	1.51%	🔵 SAME
I8	I64	2^28	161.292 us	8.75%	162.265 us	4.46%	0.972 us	0.60%	🔵 SAME
I16	I32	2^16	8.036 us	6.36%	8.355 us	6.63%	0.319 us	3.97%	🔵 SAME
I16	I32	2^20	9.670 us	4.21%	10.064 us	5.19%	0.394 us	4.07%	🔵 SAME
I16	I32	2^24	32.701 us	1.54%	33.055 us	1.78%	0.354 us	1.08%	🔵 SAME
I16	I32	2^28	294.188 us	3.90%	294.416 us	4.69%	0.228 us	0.08%	🔵 SAME
I16	I64	2^16	8.170 us	5.84%	8.525 us	6.56%	0.354 us	4.34%	🔵 SAME
I16	I64	2^20	9.730 us	3.97%	10.055 us	5.08%	0.326 us	3.35%	🔵 SAME
I16	I64	2^24	32.622 us	1.56%	33.548 us	26.67%	0.926 us	2.84%	🔴 SLOW
I16	I64	2^28	293.772 us	3.44%	294.375 us	4.08%	0.603 us	0.21%	🔵 SAME
I32	I32	2^16	7.899 us	6.04%	8.272 us	6.45%	0.373 us	4.72%	🔵 SAME
I32	I32	2^20	11.173 us	3.89%	11.654 us	4.13%	0.481 us	4.31%	🔴 SLOW
I32	I32	2^24	55.840 us	16.38%	56.054 us	1.08%	0.213 us	0.38%	🔵 SAME
I32	I32	2^28	565.451 us	2.51%	566.066 us	2.61%	0.615 us	0.11%	🔵 SAME
I32	I64	2^16	7.889 us	6.14%	8.372 us	6.58%	0.484 us	6.13%	🔵 SAME
I32	I64	2^20	11.235 us	4.48%	12.055 us	74.60%	0.820 us	7.30%	🔴 SLOW
I32	I64	2^24	55.539 us	0.82%	55.952 us	1.10%	0.413 us	0.74%	🔵 SAME
I32	I64	2^28	565.411 us	2.28%	566.152 us	2.57%	0.741 us	0.13%	🔵 SAME
I64	I32	2^16	8.303 us	5.75%	8.894 us	6.34%	0.591 us	7.11%	🔴 SLOW
I64	I32	2^20	14.358 us	2.96%	14.493 us	3.29%	0.136 us	0.94%	🔵 SAME
I64	I32	2^24	91.794 us	7.29%	92.191 us	9.30%	0.398 us	0.43%	🔵 SAME
I64	I32	2^28	1.105 ms	1.43%	1.105 ms	1.62%	0.249 us	0.02%	🔵 SAME
I64	I64	2^16	8.312 us	5.78%	8.614 us	5.98%	0.301 us	3.63%	🔵 SAME
I64	I64	2^20	14.441 us	10.97%	14.504 us	3.22%	0.064 us	0.44%	🔵 SAME
I64	I64	2^24	92.056 us	10.12%	92.195 us	6.91%	0.138 us	0.15%	🔵 SAME
I64	I64	2^28	1.106 ms	1.55%	1.105 ms	1.51%	-0.302 us	-0.03%	🔵 SAME
I128	I32	2^16	9.894 us	4.51%	10.278 us	4.99%	0.384 us	3.88%	🔵 SAME
I128	I32	2^20	23.362 us	1.83%	24.036 us	5.00%	0.674 us	2.89%	🔴 SLOW
I128	I32	2^24	173.052 us	3.55%	173.327 us	1.19%	0.275 us	0.16%	🔵 SAME
I128	I32	2^28	2.263 ms	0.70%	2.264 ms	0.65%	1.579 us	0.07%	🔵 SAME
I128	I64	2^16	10.151 us	4.95%	10.350 us	5.66%	0.199 us	1.96%	🔵 SAME
I128	I64	2^20	23.356 us	1.95%	24.002 us	2.16%	0.646 us	2.77%	🔴 SLOW
I128	I64	2^24	173.181 us	4.12%	173.343 us	0.43%	0.162 us	0.09%	🔵 SAME
I128	I64	2^28	2.263 ms	0.71%	2.265 ms	0.73%	2.060 us	0.09%	🔵 SAME
F32	I32	2^16	7.887 us	6.01%	8.640 us	126.34%	0.753 us	9.55%	🔴 SLOW
F32	I32	2^20	11.519 us	4.35%	11.599 us	4.62%	0.080 us	0.69%	🔵 SAME
F32	I32	2^24	55.401 us	14.79%	55.256 us	0.93%	-0.145 us	-0.26%	🔵 SAME
F32	I32	2^28	565.799 us	3.11%	564.983 us	2.34%	-0.816 us	-0.14%	🔵 SAME
F32	I64	2^16	7.948 us	5.84%	8.147 us	5.81%	0.199 us	2.50%	🔵 SAME
F32	I64	2^20	11.540 us	4.35%	11.712 us	14.32%	0.172 us	1.49%	🔵 SAME
F32	I64	2^24	55.490 us	14.90%	55.666 us	12.94%	0.176 us	0.32%	🔵 SAME
F32	I64	2^28	564.998 us	2.31%	565.210 us	2.40%	0.212 us	0.04%	🔵 SAME
F64	I32	2^16	8.649 us	5.03%	8.829 us	6.18%	0.180 us	2.08%	🔵 SAME
F64	I32	2^20	14.651 us	3.39%	14.696 us	3.64%	0.045 us	0.31%	🔵 SAME
F64	I32	2^24	92.702 us	8.63%	92.654 us	7.04%	-0.048 us	-0.05%	🔵 SAME
F64	I32	2^28	1.106 ms	1.52%	1.107 ms	1.80%	0.195 us	0.02%	🔵 SAME
F64	I64	2^16	8.742 us	22.41%	8.766 us	5.53%	0.024 us	0.27%	🔵 SAME
F64	I64	2^20	14.709 us	3.53%	14.765 us	3.75%	0.056 us	0.38%	🔵 SAME
F64	I64	2^24	92.838 us	11.14%	92.346 us	0.57%	-0.491 us	-0.53%	🔵 SAME
F64	I64	2^28	1.107 ms	1.58%	1.107 ms	1.66%	0.257 us	0.02%	🔵 SAME

Summary: Total Matches: 56 — Pass: 49 / Failure: 7

rtxpro6000 — cub.bench.reduce.sum.base

[0] NVIDIA RTX Pro 6000

T{ct}	OffsetT{ct}	Elements{io}	Ref Time	Ref Noise	Cmp Time	Cmp Noise	Diff	%Diff	Status
I8	I32	2^16	7.164 us	1.67%	7.417 us	9.00%	0.252 us	3.52%	🔴 SLOW
I8	I32	2^20	9.216 us	0.00%	9.209 us	0.16%	-0.007 us	-0.07%	🟢 FAST
I8	I32	2^24	27.542 us	1.66%	27.648 us	0.00%	0.106 us	0.38%	🔴 SLOW
I8	I32	2^28	243.102 us	0.34%	243.887 us	0.43%	0.785 us	0.32%	🔵 SAME
I8	I64	2^16	7.169 us	0.10%	7.168 us	0.00%	-0.001 us	-0.01%	🟡 ????
I8	I64	2^20	9.216 us	0.00%	9.214 us	0.12%	-0.002 us	-0.02%	🟢 FAST
I8	I64	2^24	27.648 us	0.00%	27.648 us	0.00%	0.000 us	0.00%	🔵 SAME
I8	I64	2^28	242.726 us	0.12%	242.688 us	0.00%	-0.038 us	-0.02%	🟢 FAST
I16	I32	2^16	7.169 us	0.07%	7.168 us	0.00%	-0.001 us	-0.01%	🟡 ????
I16	I32	2^20	7.532 us	10.38%	7.573 us	10.75%	0.041 us	0.55%	🔵 SAME
I16	I32	2^24	43.470 us	2.08%	43.623 us	1.95%	0.153 us	0.35%	🔵 SAME
I16	I32	2^28	416.768 us	0.00%	416.849 us	0.10%	0.081 us	0.02%	🔴 SLOW
I16	I64	2^16	7.168 us	0.00%	7.167 us	0.10%	-0.001 us	-0.02%	🟡 ????
I16	I64	2^20	10.058 us	10.09%	10.037 us	10.01%	-0.021 us	-0.20%	🔵 SAME
I16	I64	2^24	41.989 us	0.71%	42.049 us	0.96%	0.060 us	0.14%	🔵 SAME
I16	I64	2^28	416.855 us	0.10%	416.768 us	0.00%	-0.087 us	-0.02%	🟢 FAST
I32	I32	2^16	7.171 us	0.65%	7.168 us	0.09%	-0.002 us	-0.03%	🔵 SAME
I32	I32	2^20	11.232 us	0.00%	11.250 us	0.55%	0.018 us	0.16%	🔴 SLOW
I32	I32	2^24	71.678 us	1.44%	71.789 us	1.43%	0.111 us	0.15%	🔵 SAME
I32	I32	2^28	765.366 us	0.11%	765.689 us	0.14%	0.323 us	0.04%	🔵 SAME
I32	I64	2^16	7.166 us	0.41%	7.195 us	3.22%	0.029 us	0.40%	🔵 SAME
I32	I64	2^20	13.280 us	0.00%	13.675 us	5.92%	0.395 us	2.98%	🟡 ????
I32	I64	2^24	72.567 us	0.72%	72.704 us	0.41%	0.138 us	0.19%	🔵 SAME
I32	I64	2^28	765.021 us	0.06%	765.032 us	0.06%	0.011 us	0.00%	🔵 SAME
I64	I32	2^16	8.329 us	12.18%	8.880 us	8.50%	0.551 us	6.62%	🔵 SAME
I64	I32	2^20	19.403 us	1.09%	19.424 us	0.04%	0.022 us	0.11%	🔴 SLOW
I64	I32	2^24	132.381 us	0.54%	132.541 us	0.64%	0.159 us	0.12%	🔵 SAME
I64	I32	2^28	1.462 ms	0.07%	1.462 ms	0.08%	0.040 us	0.00%	🔵 SAME
I64	I64	2^16	8.764 us	9.75%	9.216 us	0.00%	0.452 us	5.15%	🔴 SLOW
I64	I64	2^20	19.424 us	0.00%	19.425 us	0.03%	0.001 us	0.00%	🔴 SLOW
I64	I64	2^24	132.507 us	0.62%	132.959 us	0.77%	0.452 us	0.34%	🔵 SAME
I64	I64	2^28	1.462 ms	0.07%	1.461 ms	0.00%	-0.494 us	-0.03%	🟡 ????
I128	I32	2^16	9.209 us	0.16%	9.282 us	3.86%	0.073 us	0.79%	🔴 SLOW
I128	I32	2^20	28.267 us	3.31%	28.614 us	3.54%	0.348 us	1.23%	🔵 SAME
I128	I32	2^24	244.688 us	0.20%	244.766 us	0.15%	0.078 us	0.03%	🔵 SAME
I128	I32	2^28	2.900 ms	0.11%	2.901 ms	0.11%	0.684 us	0.02%	🔵 SAME
I128	I64	2^16	9.274 us	3.76%	9.502 us	7.53%	0.228 us	2.46%	🔵 SAME
I128	I64	2^20	26.877 us	3.75%	27.426 us	2.73%	0.549 us	2.04%	🔵 SAME
I128	I64	2^24	244.737 us	0.00%	245.792 us	0.10%	1.055 us	0.43%	🔴 SLOW
I128	I64	2^28	2.901 ms	0.11%	2.901 ms	0.11%	-0.020 us	-0.00%	🔵 SAME
F32	I32	2^16	7.349 us	8.02%	7.384 us	8.63%	0.035 us	0.47%	🔵 SAME
F32	I32	2^20	9.210 us	1.28%	9.216 us	0.00%	0.006 us	0.06%	🔴 SLOW
F32	I32	2^24	71.870 us	1.43%	72.086 us	1.34%	0.216 us	0.30%	🔵 SAME
F32	I32	2^28	764.665 us	0.12%	764.805 us	0.10%	0.140 us	0.02%	🔵 SAME
F32	I64	2^16	7.168 us	0.00%	7.168 us	0.00%	0.000 us	0.00%	🟡 ????
F32	I64	2^20	9.197 us	0.65%	9.216 us	0.00%	0.019 us	0.21%	🔴 SLOW
F32	I64	2^24	71.964 us	1.38%	72.170 us	1.25%	0.206 us	0.29%	🔵 SAME
F32	I64	2^28	765.109 us	0.08%	765.139 us	0.08%	0.030 us	0.00%	🔵 SAME
F64	I32	2^16	11.264 us	0.05%	9.317 us	4.73%	-1.947 us	-17.29%	🟢 FAST
F64	I32	2^20	17.404 us	1.40%	16.109 us	6.23%	-1.295 us	-7.44%	🟢 FAST
F64	I32	2^24	135.731 us	0.66%	134.560 us	0.63%	-1.171 us	-0.86%	🟢 FAST
F64	I32	2^28	1.468 ms	0.13%	1.467 ms	0.12%	-0.986 us	-0.07%	🔵 SAME
F64	I64	2^16	9.216 us	0.00%	9.213 us	0.10%	-0.003 us	-0.03%	🟢 FAST
F64	I64	2^20	17.312 us	2.16%	16.344 us	6.32%	-0.968 us	-5.59%	🟢 FAST
F64	I64	2^24	135.920 us	0.57%	134.968 us	0.75%	-0.951 us	-0.70%	🟢 FAST
F64	I64	2^28	1.466 ms	0.11%	1.466 ms	0.10%	-0.839 us	-0.06%	🔵 SAME
C32	I32	2^16	7.168 us	0.00%	7.168 us	0.00%	0.000 us	0.00%	🟡 ????
C32	I32	2^20	19.424 us	0.00%	19.456 us	0.00%	0.032 us	0.16%	🟡 ????
C32	I32	2^24	132.304 us	0.48%	132.298 us	0.46%	-0.005 us	-0.00%	🔵 SAME
C32	I32	2^28	1.462 ms	0.09%	1.462 ms	0.06%	-0.089 us	-0.01%	🔵 SAME
C32	I64	2^16	8.686 us	10.32%	9.019 us	6.70%	0.333 us	3.83%	🔵 SAME
C32	I64	2^20	17.376 us	0.00%	17.405 us	0.05%	0.029 us	0.17%	🔴 SLOW
C32	I64	2^24	132.290 us	0.46%	132.411 us	0.57%	0.121 us	0.09%	🔵 SAME
C32	I64	2^28	1.462 ms	0.05%	1.462 ms	0.07%	0.198 us	0.01%	🔵 SAME

Summary: Total Matches: 64 — Pass: 34 / Unknown: 8 / Failure: 22

rtxpro6000 — cub.bench.reduce.min.base

[0] NVIDIA RTX PRO 6000 Blackwell Server Edition

T{ct}	OffsetT{ct}	Elements{io}	Ref Time	Ref Noise	Cmp Time	Cmp Noise	Diff	%Diff	Status
I8	I32	2^16	9.215 us	0.09%	9.216 us	0.00%	0.001 us	0.01%	🔴 SLOW
I8	I32	2^20	9.215 us	0.08%	9.208 us	0.48%	-0.006 us	-0.07%	🔵 SAME
I8	I32	2^24	27.973 us	2.66%	27.966 us	2.65%	-0.008 us	-0.03%	🔵 SAME
I8	I32	2^28	243.828 us	0.43%	243.800 us	0.42%	-0.028 us	-0.01%	🔵 SAME
I8	I64	2^16	7.233 us	4.96%	7.310 us	7.16%	0.077 us	1.06%	🔵 SAME
I8	I64	2^20	9.216 us	0.00%	9.216 us	0.00%	0.000 us	0.00%	🔵 SAME
I8	I64	2^24	27.740 us	1.55%	27.816 us	2.04%	0.076 us	0.27%	🔵 SAME
I8	I64	2^28	242.720 us	0.12%	242.735 us	0.12%	0.015 us	0.01%	🔵 SAME
I16	I32	2^16	7.183 us	2.42%	7.168 us	0.00%	-0.015 us	-0.21%	🟡 ????
I16	I32	2^20	7.168 us	0.00%	7.681 us	11.53%	0.513 us	7.15%	🟡 ????
I16	I32	2^24	43.763 us	1.55%	43.833 us	1.37%	0.069 us	0.16%	🔵 SAME
I16	I32	2^28	416.825 us	0.08%	416.851 us	0.15%	0.026 us	0.01%	🔵 SAME
I16	I64	2^16	7.168 us	0.00%	7.165 us	0.16%	-0.003 us	-0.04%	🟡 ????
I16	I64	2^20	10.093 us	10.06%	10.346 us	9.85%	0.253 us	2.51%	🔵 SAME
I16	I64	2^24	42.041 us	1.00%	42.111 us	1.24%	0.070 us	0.17%	🔵 SAME
I16	I64	2^28	416.811 us	0.07%	416.768 us	0.00%	-0.043 us	-0.01%	🟢 FAST
I32	I32	2^16	7.227 us	4.86%	7.223 us	4.66%	-0.004 us	-0.06%	🔵 SAME
I32	I32	2^20	11.232 us	0.00%	11.256 us	0.13%	0.024 us	0.21%	🔴 SLOW
I32	I32	2^24	71.695 us	1.44%	71.818 us	1.43%	0.123 us	0.17%	🔵 SAME
I32	I32	2^28	765.879 us	0.14%	765.892 us	0.14%	0.013 us	0.00%	🔵 SAME
I32	I64	2^16	7.168 us	0.00%	7.263 us	6.01%	0.095 us	1.33%	🟡 ????
I32	I64	2^20	13.746 us	6.21%	13.662 us	5.87%	-0.083 us	-0.61%	🔵 SAME
I32	I64	2^24	72.672 us	0.00%	72.673 us	0.44%	0.001 us	0.00%	🟡 ????
I32	I64	2^28	765.394 us	0.12%	765.369 us	0.11%	-0.025 us	-0.00%	🔵 SAME
I64	I32	2^16	7.421 us	9.08%	7.701 us	11.65%	0.280 us	3.77%	🔵 SAME
I64	I32	2^20	19.299 us	2.53%	19.370 us	1.70%	0.071 us	0.37%	🔵 SAME
I64	I32	2^24	134.162 us	0.15%	134.144 us	0.00%	-0.018 us	-0.01%	🟢 FAST
I64	I32	2^28	1.465 ms	0.09%	1.465 ms	0.07%	-0.007 us	-0.00%	🔵 SAME
I64	I64	2^16	7.723 us	11.78%	8.393 us	11.96%	0.671 us	8.68%	🔵 SAME
I64	I64	2^20	19.355 us	1.90%	19.375 us	1.66%	0.020 us	0.10%	🔵 SAME
I64	I64	2^24	134.153 us	0.10%	134.162 us	0.14%	0.009 us	0.01%	🔵 SAME
I64	I64	2^28	1.465 ms	0.09%	1.465 ms	0.07%	0.154 us	0.01%	🔵 SAME
I128	I32	2^16	9.264 us	3.81%	9.384 us	6.03%	0.119 us	1.29%	🔵 SAME
I128	I32	2^20	28.721 us	3.56%	28.736 us	3.53%	0.014 us	0.05%	🔵 SAME
I128	I32	2^24	244.736 us	0.00%	244.816 us	0.16%	0.080 us	0.03%	🔴 SLOW
I128	I32	2^28	2.901 ms	0.11%	2.901 ms	0.11%	-0.085 us	-0.00%	🔵 SAME
I128	I64	2^16	9.343 us	5.23%	9.407 us	6.93%	0.064 us	0.68%	🔵 SAME
I128	I64	2^20	27.628 us	3.40%	27.591 us	2.31%	-0.037 us	-0.13%	🔵 SAME
I128	I64	2^24	245.814 us	0.26%	245.713 us	0.09%	-0.101 us	-0.04%	🔵 SAME
I128	I64	2^28	2.901 ms	0.11%	2.901 ms	0.12%	-0.310 us	-0.01%	🔵 SAME
F32	I32	2^16	7.180 us	2.72%	7.256 us	5.76%	0.076 us	1.06%	🔵 SAME
F32	I32	2^20	9.188 us	0.64%	9.216 us	0.00%	0.028 us	0.31%	🔴 SLOW
F32	I32	2^24	71.986 us	1.35%	72.206 us	1.25%	0.221 us	0.31%	🔵 SAME
F32	I32	2^28	764.928 us	0.00%	765.158 us	0.09%	0.230 us	0.03%	🔴 SLOW
F32	I64	2^16	7.168 us	0.00%	7.168 us	0.00%	0.000 us	0.00%	🟡 ????
F32	I64	2^20	9.200 us	1.31%	9.247 us	2.58%	0.048 us	0.52%	🔵 SAME
F32	I64	2^24	72.045 us	1.35%	72.251 us	1.19%	0.206 us	0.29%	🔵 SAME
F32	I64	2^28	765.104 us	0.08%	765.176 us	0.09%	0.072 us	0.01%	🔵 SAME
F64	I32	2^16	9.216 us	0.00%	9.215 us	0.07%	-0.001 us	-0.01%	🟢 FAST
F64	I32	2^20	15.328 us	0.00%	15.455 us	3.22%	0.127 us	0.83%	🟡 ????
F64	I32	2^24	136.364 us	0.42%	136.192 us	0.00%	-0.172 us	-0.13%	🟡 ????
F64	I32	2^28	1.472 ms	0.14%	1.472 ms	0.14%	-0.426 us	-0.03%	🔵 SAME
F64	I64	2^16	9.216 us	0.00%	9.216 us	0.00%	0.000 us	0.00%	🔵 SAME
F64	I64	2^20	17.193 us	3.40%	16.548 us	6.11%	-0.645 us	-3.75%	🟢 FAST
F64	I64	2^24	137.111 us	0.74%	136.584 us	0.59%	-0.528 us	-0.39%	🔵 SAME
F64	I64	2^28	1.471 ms	0.13%	1.471 ms	0.14%	-0.540 us	-0.04%	🔵 SAME

Summary: Total Matches: 56 — Pass: 39 / Unknown: 8 / Failure: 9

copy-pr-bot · 2026-05-21T20:05:36Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-05-21T20:08:41Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

suggestion:

Walkthrough

Refactors BlockReduce warp-aggregate application so warp 0 deterministically cooperates to reduce per-warp aggregates (reduce_op_sync fast path, shuffle fallback), removes the atomic non-deterministic path, consolidates Sum/Reduce, and adds NVBench benchmark plus CI bench filter/GPU updates.

Changes

Warp-0 Cooperative Reduction

Layer / File(s)	Summary
Dependencies and documentation updates `cub/cub/block/specializations/block_reduce_warp_reductions.cuh`	Header includes reordered to add thread-operator and target utilities; comment updated to state warp 0 cooperatively reduces per-warp aggregates.
ApplyWarpAggregates warp-0 cooperative reduction `cub/cub/block/specializations/block_reduce_warp_reductions.cuh`	ApplyWarpAggregates reimplemented: compute effective num_warps, warp 0 loads each warp's aggregate per lane, then reduces via target-gated `reduce_op_sync` (fast path for eligible operator/type on supported SMs when FullTile) or `WarpReduceShfl` fallback.
Sum and Reduce entry point consolidation `cub/cub/block/specializations/block_reduce_warp_reductions.cuh`	Sum and Reduce now unconditionally invoke `ApplyWarpAggregates<FullTile>`, removing the IsDeterministic compile-time branch and the atomic-backed non-deterministic implementation.

Benchmarks and CI

Layer / File(s)	Summary
NVBench base benchmark `cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh`	Adds device functor `benchmark_op_t<BlockThreads>`, benchmark launcher `block_reduce_warp_reductions`, occupancy-based grid sizing with zero-grid guard, and NVBench registration.
Sum benchmark TU `cub/benchmarks/bench/reduce/block_reduce_warp_reductions_sum.cu`	Adds `value_types` list and `op_t = ::cuda::std::plus<>`, includes the shared base implementation.
CI benchmark filters & GPU list `ci/bench.yaml`	Adds two reduce benchmark filter regexes for new benchmarks and enables `rtxa6000` in the CI GPU selection list.

Assessment against linked issues

Objective	Addressed	Explanation
Replace sequential per-warp aggregation with low-latency warp reduction (`#5167`)	✅
Use `__reduce_sync_op` or equivalent on supported SM targets (`#5167`)	✅
Reduce compile-time overhead from recursive `ApplyWarpAggregates` calls (`#5167`)	❓	The IsDeterministic branch and one non-deterministic function were removed, but the change retains templated FullTile/type/operator paths; net compile-time instantiation reduction is unclear from diffs provided.

Out-of-scope changes

Code Change	Explanation
Add NVBench benchmark base (`cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh`, lines 1-54)	Benchmark/test artifact; not required by the BlockReduce optimization objective.
Add benchmark TU for sum (`cub/benchmarks/bench/reduce/block_reduce_warp_reductions_sum.cu`, lines 1-35)	Benchmark wiring and type-list instantiation are testing artifacts, not part of core optimization.
Update CI bench filters and GPU list (`ci/bench.yaml`, lines 35-36 and 49-53)	CI configuration updates are operational/test routing changes unrelated to BlockReduce implementation goals.

_{Comment @coderabbitai help to get the list of available commands.}

fbusato · 2026-05-21T21:25:44Z

@charan-003, thanks a lot for the contribution.

Even before starting the review, I would like to see if there are actual performance benefits for this approach.
I would also encourage you to write a device-side benchmark for it, similarly to WarpReduce

charan-003 · 2026-05-21T21:28:07Z

@charan-003, thanks a lot for the contribution.

Even before starting the review, I would like to see if there are actual performance benefits for this approach. I would also encourage you to write a device-side benchmark for it, similarly to WarpReduce

sure, working on it

coderabbitai

🧹 Nitpick comments (2)

cub/cub/block/specializations/block_reduce_warp_reductions.cuh (2)
125-125: 💤 Low value

suggestion: Use static_cast<int>(warps) instead of C-style cast int(warps) for consistency with the static_cast used in the same expression.
-      const int num_warps = FullTile ? int(warps) : static_cast<int>(::cuda::ceil_div(num_valid, logical_warp_size));
+      const int num_warps = FullTile ? static_cast<int>(warps) : static_cast<int>(::cuda::ceil_div(num_valid, logical_warp_size));
148-148: 💤 Low value

suggestion: Per coding guidelines, variables not modified should be const. dummy_storage is only passed to the constructor.
-      NullType dummy_storage;
+      const NullType dummy_storage{};

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ffc63f63-2430-4ac6-8a11-ae99c579665d

📥 Commits

Reviewing files that changed from the base of the PR and between 0b48748 and cb6dd6c.

📒 Files selected for processing (1)

cub/cub/block/specializations/block_reduce_warp_reductions.cuh

fbusato · 2026-05-21T23:21:08Z

@charan-003 thanks for reporting them but I think a better approach would be:

Add the code for the device-side benchmark.
Modify cccl/ci/bench.yaml to select the GPUs where to run the benchmarks.
Add a commit with the message [bench-only] to run them.
Report only the comparison results, e.g. SLOW, FAST, etc.

charan-003 · 2026-05-22T00:56:02Z

@charan-003 thanks for reporting them but I think a better approach would be:

Add the code for the device-side benchmark.

Modify cccl/ci/bench.yaml to select the GPUs where to run the benchmarks.

Add a commit with the message [bench-only] to run them.

Report only the comparison results, e.g. SLOW, FAST, etc.

Thanks a lot for the guidance! I didn't know the exact process...
Let me add device-side benchmark code

coderabbitai

Actionable comments posted: 3

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 53e3ccfa-d3a5-484a-85fd-7cf18e522152

📥 Commits

Reviewing files that changed from the base of the PR and between cb6dd6c and 830c2b4.

📒 Files selected for processing (3)

ci/bench.yaml
cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh
cub/benchmarks/bench/reduce/block_reduce_warp_reductions_sum.cu

✅ Files skipped from review due to trivial changes (1)

cub/benchmarks/bench/reduce/block_reduce_warp_reductions_sum.cu

coderabbitai · 2026-05-22T01:16:22Z

+template <int BlockThreads>
+struct benchmark_op_t
+{
+  template <typename T>


🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

suggestion: Line 17 and Line 35 use a single-letter template parameter (T). Use _Tp-style naming here to satisfy the template-parameter naming rule for this codebase.

As per coding guidelines "Avoid single letter names for template parameters; use _Tp instead of _T".

Also applies to: 35-35

coderabbitai · 2026-05-22T01:16:22Z

+  __device__ __forceinline__ T operator()(T thread_data) const
+  {


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Read-only verification: confirm missing CCCL API annotations / noexcept in this file rg -nP '^\s*(template\s*<[^>]+>\s*)?(?!.*_CCCL_(HOST_API|DEVICE_API|API)).*\b(operator|block_reduce_warp_reductions)\b' cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh rg -nP '\b(operator|block_reduce_warp_reductions)\b(?!.*noexcept)' cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh

Repository: NVIDIA/cccl

Length of output: 391

🏁 Script executed:

#!/bin/bash set -euo pipefail sed -n '1,120p' cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh | nl -ba | sed -n '10,80p'

Repository: NVIDIA/cccl

Length of output: 98

🏁 Script executed:

#!/bin/bash set -euo pipefail rg -n "operator|block_reduce_warp_reductions" cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh

Repository: NVIDIA/cccl

Length of output: 289

🏁 Script executed:

#!/bin/bash set -euo pipefail # Capture the full function signatures/headers for the two targets (best-effort) perl -0777 -ne ' my $f=$_; if($f =~ /(operator\s*const[^\n]*\n\s*\{)/s){ print "FOUND operator():\n$1\n"; } if($f =~ /(block_reduce_warp_reductions[^\n]*\n\s*\{)/s){ print "FOUND block_reduce_warp_reductions():\n$1\n"; } ' cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh

Repository: NVIDIA/cccl

Length of output: 170

🏁 Script executed:

#!/bin/bash set -euo pipefail # Show the relevant top portion with line numbers (no nl dependency) awk '{printf "%5d:%s\n", NR, $0}' cub/benchmarks/bench/reduce/block_reduce_warp_warp_reductions_base.cuh | sed -n '1,110p'

Repository: NVIDIA/cccl

Length of output: 196

🏁 Script executed:

#!/bin/bash set -euo pipefail # Check for noexcept / CCCL API tokens near the declarations rg -n "operator|block_reduce_warp_reductions" -n cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh rg -n --context=3 "operator|block_reduce_warp_reductions" cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh rg -n --context=0 "_CCCL_(HOST_API|DEVICE_API|API)" cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh

Repository: NVIDIA/cccl

Length of output: 1049

🏁 Script executed:

#!/bin/bash set -euo pipefail # Look for exception-throwing in the relevant functions/bodies rg -n "\bthrow\b" cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh

Repository: NVIDIA/cccl

Length of output: 37

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh" # Print a small, line-numbered window around operator() and block_reduce_warp_reductions sed -n '10,60p' "$FILE" | awk '{printf "%5d:%s\n", NR+10-1, $0}'

Repository: NVIDIA/cccl

Length of output: 2048

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh" # Check for _CCCL_* usage and noexcept in/near the relevant symbols rg -n "_CCCL_(HOST_API|DEVICE_API|API)" "$FILE" rg -n "noexcept" "$FILE" rg -n --context=2 "operator" "$FILE" rg -n --context=2 "block_reduce_warp_reductions" "$FILE"

Repository: NVIDIA/cccl

Length of output: 37

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh" # Look for exception throwing inside this file rg -n "\bthrow\b" "$FILE"

Repository: NVIDIA/cccl

Length of output: 37

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh" echo "== _CCCL_* tokens ==" rg -n "_CCCL_(HOST_API|DEVICE_API|API)" "$FILE" || true echo "== noexcept tokens ==" rg -n "noexcept" "$FILE" || true echo "== throw tokens ==" rg -n "\bthrow\b" "$FILE" || true

Repository: NVIDIA/cccl

Length of output: 117

important: Add required _CCCL_* API annotations and noexcept to benchmark functions.

benchmark_op_t::operator() (line 18) lacks _CCCL_DEVICE_API and noexcept.

block_reduce_warp_reductions (line 36) lacks _CCCL_HOST_API and noexcept.

Suggested change

- __device__ __forceinline__ T operator()(T thread_data) const + _CCCL_DEVICE_API __forceinline__ auto operator()(_Tp thread_data) const noexcept -> _Tp

-template <typename T> -void block_reduce_warp_reductions(nvbench::state& state, nvbench::type_list<T>) +template <typename _Tp> +_CCCL_HOST_API void block_reduce_warp_reductions(nvbench::state& state, nvbench::type_list<_Tp>) noexcept

coderabbitai

Actionable comments posted: 1

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 519c38f3-5544-4f5a-9750-fc04a66589e0

📥 Commits

Reviewing files that changed from the base of the PR and between 830c2b4 and a949e20.

📒 Files selected for processing (3)

ci/bench.yaml
cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh
cub/benchmarks/bench/reduce/block_reduce_warp_reductions_sum.cu

✅ Files skipped from review due to trivial changes (1)

ci/bench.yaml

fbusato · 2026-05-22T01:32:21Z

/ok to test 80b5ac2

charan-003 · 2026-05-22T02:17:13Z

@fbusato the benchmarks shows a slowdown. can we use bit_ceil(warps) for the logical warp size. Does this approach look reasonable, or would you suggest a different direction?

fbusato · 2026-05-22T17:12:00Z

The code looks good and I really like the simplifications. Related to performance, some high-level thoughts:

It is not worthwhile to use Redux/Warp-Shuffle Reduction without a full mask because it falls back to a slower implementation. You can use the identity element in some cases: https://nvidia.github.io/cccl/unstable/libcudacxx/extended_api/functional/operator_properties.html#identity-element
Warp-shuffle reduction could be useful only with large thread blocks. The mental model is that warp-shuffle is faster than shared memory but only marginally. bit_ceil(warps) for the logical warp size is a good idea
I would dispatch between the new and old implementation depending on the above conditions.

fbusato · 2026-05-28T21:42:21Z

/ok to test dae2e4a

fbusato · 2026-05-28T21:43:53Z

@charan-003 sorry for the delay. Please ping me every time you need to trigger the CI

charan-003 · 2026-06-05T21:24:39Z

@fbusato can you please trigger the CI?
Thank you.

fbusato · 2026-06-05T21:51:17Z

/ok to test 4210958

pauleonix · 2026-06-07T00:45:41Z

+  int max_blocks_per_SM       = 0;
+  NVBENCH_CUDA_CALL_NOEXCEPT(
+    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks_per_SM, kernel, BlockThreads, 0));
+  const int grid_size = max_blocks_per_SM * num_SMs;


There are a couple of issues with this benchmarking approach that I tried to address in #8607:

If another (version of the) compiler decides to use more or less registers in a way that changes occupancy, results become very hard to compare for performance regression testing. My first idea was to just include the grid size in the benchmark output, but that didn't help much as it would still not show up in our current infra's benchmark comparisons. So I went for a bigger hammer and just use a hardcoded workload that is big enough that the tail effect should be negligible on current devices (2^28 items same as the biggest workloads for our device primitive benchmarks).

Both the current approach and even more so my solution to 1. is measuring throughput. For block and warp primitives we are also interested in their latency in contexts where they do not make up the main workload of a kernel, but are still in the critical path. Therefore I added a second benchmark where only a single warp is launched.

This one might be less important, but it was a particular concern for WarpReduceBatched: I also added a benchmarking kernel with full launch bounds to force the compiler to deal with a lot of register pressure without pushing occupancy below a certain value. If a code change or toolchain change should affect the register allocation, this benchmark should show it very clearly since it is much more prone to spilling to local memory.

@pauleonix thanks for the comments

so from what i understand for this PR

we also want to replace occupancy based grids with 2^28 workloads.

Add a single warp latency benchmark alongside the throughput one?

Add a WarpReduceBatched benchmark with full launch bounds?

is that right? anything else you'd like to see?

and 2. we want. For 3. I'm not yet sure if it justifies the additional compilation and benchmarking time.

charan-003 · 2026-06-09T23:56:08Z

@fbusato @pauleonix, do we want to add anything else?

fbusato · 2026-06-10T00:31:45Z

/ok to test 88b1da0

github-actions · 2026-06-10T00:43:44Z

❌ Benchmark Results

Benchmark comparison had failures.

Results
Artifacts

github-actions · 2026-06-10T03:06:20Z

🥳 CI Workflow Results

🟩 Finished in 2h 29m: Pass: 100%/287 | Total: 12d 17h | Max: 2h 18m | Hits: 22%/974831

See results here.

charan-003 · 2026-06-12T18:17:37Z

@fbusato ping

fbusato · 2026-06-12T19:32:16Z

@charan-003 I don't have any additional comment. There are two things to do before merging the PRs:

revert bench.yaml
report the benchmark results, essentially the tables of the nvbench comparison script, on a comment or PR description

charan-003 · 2026-06-15T01:07:00Z

@charan-003 I don't have any additional comment. There are two things to do before merging the PRs:

revert bench.yaml

report the benchmark results, essentially the tables of the nvbench comparison script, on a comment or PR description

@fbusato
Done,
Thanks a lot

fbusato · 2026-06-15T18:55:57Z

these results don't show a clear benefits. I would add the results for all 3 SMs for DeviceReduce and BlockReduce benchmarks.
Nit, you can report the results in sections https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/organizing-information-with-collapsed-sections

charan-003 · 2026-06-17T01:07:29Z

these results don't show a clear benefits. I would add the results for all 3 SMs for DeviceReduce and BlockReduce benchmarks. Nit, you can report the results in sections https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/organizing-information-with-collapsed-sections

@fbusato
done, sorry for the delay

fbusato · 2026-06-22T19:53:22Z

the results show minimal differences. Could you also please add the results for device-level benchmarks?

charan-003 · 2026-06-23T06:25:27Z

the results show minimal differences. Could you also please add the results for device-level benchmarks?

added device-level benchmark
agree there's no clear end-to-end win at large sizes

fbusato · 2026-06-23T22:52:46Z

sorry @charan-003, maybe I didn't explain it well.
The optimization that you added in this PR is hard to observe for the full routine.
You correctly added a new device-side benchmark in block_reduce_warp_reductions_sum.cu for this reason. These results can be useful to understand if the performance improvement can be observed at block-level.

charan-003 requested a review from a team as a code owner May 21, 2026 20:05

charan-003 requested a review from NaderAlAwar May 21, 2026 20:05

github-project-automation Bot added this to CCCL May 21, 2026

github-project-automation Bot moved this to Todo in CCCL May 21, 2026

cccl-authenticator-app Bot moved this from Todo to In Review in CCCL May 21, 2026

optimize blockReduceWarp

cb6dd6c

charan-003 force-pushed the optimize_blockreduce branch from 0b48748 to cb6dd6c Compare May 21, 2026 22:10

coderabbitai Bot reviewed May 21, 2026

View reviewed changes

charan-003 requested review from a team as code owners May 22, 2026 01:11

charan-003 requested a review from wmaxey May 22, 2026 01:11

[bench-only]

a949e20

charan-003 force-pushed the optimize_blockreduce branch from 830c2b4 to a949e20 Compare May 22, 2026 01:13

coderabbitai Bot reviewed May 22, 2026

View reviewed changes

Comment thread cub/benchmarks/bench/reduce/block_reduce_base.cuh Outdated

[bench-only]

80b5ac2

This comment has been minimized.

Sign in to view

[bench-only]

dae2e4a

This comment has been minimized.

Sign in to view

charan-003 added 2 commits June 6, 2026 00:45

fix name space

ac8ad90

Merge branch 'main' into optimize_blockreduce

4079348

charan-003 requested a review from fbusato June 6, 2026 05:46

pauleonix reviewed Jun 7, 2026

View reviewed changes

charan-003 added 2 commits June 9, 2026 01:55

add latency benchmarks

7500113

Merge branch 'main' into optimize_blockreduce

88b1da0

fbusato mentioned this pull request Jun 9, 2026

Add warp-aggregated CUB block histogram algorithms #9222

Open

charan-003 requested a review from pauleonix June 9, 2026 23:54

pauleonix mentioned this pull request Jun 12, 2026

Add CUB WarpReduceBroadcast #9320

Open

charan-003 added 2 commits June 14, 2026 20:01

final changes

33ce2e6

Merge branch 'main' into optimize_blockreduce

13b716c

fbusato reviewed Jun 23, 2026

View reviewed changes

Comment thread cub/benchmarks/bench/reduce/block_reduce_sum.cu

Comment thread cub/benchmarks/bench/reduce/warp_reduce_base.cuh Outdated

charan-003 added 2 commits June 24, 2026 00:14

change the file name

c5429c2

Merge branch 'main' into optimize_blockreduce

d56381d

		__device__ __forceinline__ T operator()(T thread_data) const
		{

Conversation

charan-003 commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmark Comparison: optimize_blockreduce vs origin/main

[0] NVIDIA RTX A6000

[0] NVIDIA RTX A6000

[0] NVIDIA H100

[0] NVIDIA H100 PCIe

[0] NVIDIA RTX Pro 6000

[0] NVIDIA RTX PRO 6000 Blackwell Server Edition

Uh oh!

copy-pr-bot Bot commented May 21, 2026

Uh oh!

coderabbitai Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Assessment against linked issues

Out-of-scope changes

Uh oh!

fbusato commented May 21, 2026

Uh oh!

charan-003 commented May 21, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

fbusato commented May 21, 2026

Uh oh!

charan-003 commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

fbusato commented May 22, 2026

Uh oh!

This comment has been minimized.

charan-003 commented May 22, 2026

Uh oh!

fbusato commented May 22, 2026

Uh oh!

fbusato commented May 28, 2026

Uh oh!

fbusato commented May 28, 2026

Uh oh!

This comment has been minimized.

charan-003 commented Jun 5, 2026

Uh oh!

fbusato commented Jun 5, 2026

Uh oh!

This comment has been minimized.

This comment has been minimized.

pauleonix Jun 7, 2026

Choose a reason for hiding this comment

Uh oh!

charan-003 Jun 7, 2026

Choose a reason for hiding this comment

Uh oh!

pauleonix Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

charan-003 commented Jun 9, 2026

Uh oh!

fbusato commented Jun 10, 2026

Uh oh!

github-actions Bot commented Jun 10, 2026

❌ Benchmark Results

charan-003 commented May 21, 2026 •

edited

Loading

Benchmark Comparison: `optimize_blockreduce` vs `origin/main`

coderabbitai Bot commented May 21, 2026 •

edited

Loading

charan-003 commented May 22, 2026 •

edited

Loading

charan-003 commented Jun 12, 2026 •

edited

Loading