triton-kernels: add vendor-neutral Triton skill #675
Add a new skill for writing portable Triton kernels that run on NVIDIA and AMD GPUs without modification. Covers core DSL patterns, autotune, numerics, benchmarking, and HuggingFace Kernels Hub integration. Complements cuda-kernels (raw CUDA) and sits alongside rocm-kernels/xpu-kernels as the vendor-neutral base they can specialize from.
Contents:
- SKILL.md: main instruction file with reference kernels (softmax, matmul, rmsnorm)
- references/: autotune guide, kernel patterns, benchmarking guide
- scripts/: benchmark and correctness test templates
- examples/: fused softmax with test + benchmark harness
Closes huggingface#302
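As a rough illustration of the "numerics" item above, here is a minimal plain-Python sketch of the safe-softmax idea (subtract the row maximum before exponentiating). It is not code from the skill and not a Triton kernel; the function names `naive_softmax`, `safe_softmax`, and `allclose` are hypothetical, and the default tolerance simply mirrors the atol=1e-3 quoted in the PR description.

```python
import math


def naive_softmax(row):
    """Textbook softmax; math.exp overflows once inputs get large."""
    exps = [math.exp(v) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def safe_softmax(row):
    """Softmax with the row maximum subtracted first.

    The largest exponent becomes exp(0) == 1, so the sum cannot overflow.
    The result is mathematically identical to the naive version.
    """
    row_max = max(row)
    exps = [math.exp(v - row_max) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def allclose(a, b, atol=1e-3):
    """Element-wise absolute-tolerance comparison (hypothetical helper)."""
    return len(a) == len(b) and all(abs(x - y) <= atol for x, y in zip(a, b))


if __name__ == "__main__":
    small = [0.5, -1.0, 2.0]
    # Both versions agree where the naive one is still representable.
    print(allclose(naive_softmax(small), safe_softmax(small)))
    # The safe version still works where the naive one would overflow.
    print(safe_softmax([1000.0, 1000.0]))
```

In a real kernel the same subtraction would happen per row inside the block, and the comparison would be run against torch.softmax rather than a Python reference.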
|
Hi @jaygala223, thanks for your interest in contributing! This project requires that pull request authors are vouched, and you are not in the list of vouched users. This PR will be closed automatically. See https://github.com/huggingface/kernels/blob/main/CONTRIBUTING.md for more details. |
|
Can you provide an example kernel that you wrote with this skill and tested? Feel free to apply to be able to publish |
|
Hi @sayakpaul, I tested the skill by using it as context for an AI agent and asking it to write a LayerNorm kernel (which is not one of the reference kernels already in the skill). The agent followed the patterns from the SKILL.md: row-wise reduction, fp32 accumulation, masked loads, contiguity assert, num_warps heuristic.
Results on V100 (fp32, M=4096):
Correctness: All shapes pass against
Performance: ~1.45x faster than PyTorch LayerNorm across all tested N values:
The generated kernel + test is here: examples/skill_test_layernorm.py (8d7da2e)
I also applied for kernel repo access on the Hub. |
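To make the patterns named in this comment concrete, the following is a plain-Python emulation of what a row-wise LayerNorm program might do for a single row. It is a sketch under stated assumptions, not the generated kernel: the helper names are hypothetical, the num_warps thresholds are one commonly used heuristic rather than the skill's confirmed values, and Python floats are double precision, so dtype casts and fp32 accumulation are not modeled.

```python
import math


def next_power_of_2(n: int) -> int:
    """Smallest power of two >= n (block sizes are typically powers of two)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def pick_num_warps(block_size: int) -> int:
    """Assumed heuristic: more warps for wider rows. Thresholds are illustrative."""
    if block_size >= 4096:
        return 16
    if block_size >= 2048:
        return 8
    return 4


def layernorm_row(row, weight, bias, eps=1e-5):
    """Emulate one program instance normalizing one row of length n."""
    n = len(row)
    block = next_power_of_2(n)
    # Masked load: lanes past the end of the row read 0.0 instead of memory.
    x = [row[i] if i < n else 0.0 for i in range(block)]
    # Row-wise reduction: divide by the true length n, not the padded block.
    mean = sum(x) / n
    # Re-apply the mask so padded lanes do not contribute (0 - mean)^2.
    diff = [x[i] - mean if i < n else 0.0 for i in range(block)]
    var = sum(d * d for d in diff) / n
    rstd = 1.0 / math.sqrt(var + eps)
    # Masked store: only the first n lanes are written back.
    return [diff[i] * rstd * weight[i] + bias[i] for i in range(n)]


if __name__ == "__main__":
    n = 781  # irregular width, not a power of two
    block = next_power_of_2(n)
    print(block, pick_num_warps(block))
    print(layernorm_row([1.0, 2.0, 3.0], [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]))
```

The second mask on `diff` is the step that is easiest to forget: without it, every padded lane would add `mean ** 2` to the variance for any row whose length is not a power of two.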
|
What is your Hub username? |
|
jaygala223 |
|
You have access to publish the kernels as proper kernel repos on the Hub. Could we please see the kernels generated with this skill on the Hub? |
|
Yes. Published the layernorm kernel generated using the skill as a kernel repo: https://huggingface.co/kernels/jaygala223/triton-layernorm |
|
Are the results in #675 (comment) for https://huggingface.co/kernels/jaygala223/triton-layernorm? |
|
Yes they are from the same kernel
|
sayakpaul
left a comment
We should also add the skill to our docs.
|
Hi @sayakpaul added |
|
Okay done! Added a tip and example kernel. |
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
Related issue
Closes #302
What does this PR do?
Adds a new triton-kernels skill to kernel-builder/skills/, providing vendor-neutral guidance for writing portable Triton kernels that run on NVIDIA and AMD GPUs without modification.
Motivation
The existing skills are either raw CUDA (cuda-kernels) or backend-specific Triton (rocm-kernels, xpu-kernels). There's no shared base covering portable Triton patterns that the backend-specific skills can build on. This was identified in #302, and the skills/cuda-kernels link in the original issue is now dead since skills moved to kernel-builder/skills/ in #456.
Changes
- kernel-builder/skills/triton-kernels/SKILL.md: core DSL patterns (program_id, masked loads, 2D tiling, reductions, tl.dot), autotune guidance, numerics (fp32 accumulation, safe softmax, dtype handling), reference kernel implementations (softmax, matmul, rmsnorm), benchmarking patterns, get_kernel integration, transformers patching, and common pitfalls table
- references/autotune-guide.md: deep dive on configs, key parameter, num_warps/num_stages selection, when NOT to autotune, practical config sets
- references/kernel-patterns.md: dropout (in-SRAM mask), fused add+rmsnorm, SiLU, SwiGLU, group-major PID ordering, pointer sliding
- references/benchmarking-guide.md: GB/s vs TFLOPS, roofline model, do_bench usage, when to expect gains over PyTorch
- scripts/benchmark_template.py: reusable benchmark harness with correctness check
- scripts/correctness_template.py: multi-shape, multi-dtype test suite template
- examples/fused_softmax.py: complete working kernel with test + benchmark
- manifest.txt and CHANGELOG.md
Testing
- examples/fused_softmax.py passes correctness tests against torch.softmax on V100 with irregular shapes (1823x781) at atol=1e-3
- rocm-kernels (SKILL.md, manifest.txt, CHANGELOG.md, references/, scripts/, examples/)
Checklist