triton-kernels: add vendor-neutral Triton skill #675

Open
jaygala223 wants to merge 6 commits into huggingface:main from jaygala223:triton-kernels-skill
@jaygala223

Related issue

Closes #302

What does this PR do?

Adds a new triton-kernels skill to kernel-builder/skills/, providing vendor-neutral guidance for writing portable Triton kernels that run on NVIDIA and AMD GPUs without modification.

Motivation

The existing skills are either raw CUDA (cuda-kernels) or backend-specific Triton (rocm-kernels, xpu-kernels). There is no shared base covering portable Triton patterns that the backend-specific skills can build on. This gap was identified in #302. The skills/cuda-kernels link in the original issue is now dead because skills moved to kernel-builder/skills/ in #456.

Changes

  • Add kernel-builder/skills/triton-kernels/SKILL.md: core DSL patterns (program_id, masked loads, 2D tiling, reductions, tl.dot), autotune guidance, numerics (fp32 accumulation, safe softmax, dtype handling), reference kernel implementations (softmax, matmul, rmsnorm), benchmarking patterns, get_kernel integration, transformers patching, and common pitfalls table
  • Add references/autotune-guide.md: deep dive on configs, key parameter, num_warps/num_stages selection, when NOT to autotune, practical config sets
  • Add references/kernel-patterns.md: dropout (in-SRAM mask), fused add+rmsnorm, SiLU, SwiGLU, group-major PID ordering, pointer sliding
  • Add references/benchmarking-guide.md: GB/s vs TFLOPS, roofline model, do_bench usage, when to expect gains over PyTorch
  • Add scripts/benchmark_template.py: reusable benchmark harness with correctness check
  • Add scripts/correctness_template.py: multi-shape, multi-dtype test suite template
  • Add examples/fused_softmax.py: complete working kernel with test + benchmark
  • Add manifest.txt and CHANGELOG.md
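To illustrate the program_id and masked-load pattern listed among the core DSL patterns, here is a minimal sketch in plain Python rather than Triton, so it runs without a GPU. It emulates what each program instance would do in a 1D elementwise kernel. The function names `add_program` and `launch_add` are hypothetical and do not come from the skill; the comments note the Triton calls each step is meant to mirror.

```python
def add_program(pid, x, y, out, n, block_size):
    """Emulate one program instance handling one block of elements."""
    start = pid * block_size                           # mirrors tl.program_id(0) * BLOCK_SIZE
    offsets = [start + i for i in range(block_size)]   # mirrors start + tl.arange(0, BLOCK_SIZE)
    mask = [off < n for off in offsets]                # guards the ragged final block
    for off, in_bounds in zip(offsets, mask):
        if in_bounds:                                  # masked load and masked store
            out[off] = x[off] + y[off]


def launch_add(x, y, block_size=4):
    """Emulate a 1D launch grid: one program per block, run sequentially here."""
    n = len(x)
    out = [0.0] * n
    grid = (n + block_size - 1) // block_size          # ceiling division, like triton.cdiv
    for pid in range(grid):
        add_program(pid, x, y, out, n, block_size)
    return out


# 7 elements with block_size=4: the second program has one masked-off lane.
result = launch_add([1, 2, 3, 4, 5, 6, 7], [10] * 7, block_size=4)
```

Without the mask, the last program would index past the end of the inputs whenever the length is not a multiple of the block size, which is why irregular shapes are a useful correctness test.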

Testing

  • The example kernel (examples/fused_softmax.py) passes correctness tests against torch.softmax on V100 with irregular shapes (1823x781) at atol=1e-3
  • Benchmark template runs and produces GB/s measurements
  • Verified the skill follows the same directory structure as rocm-kernels (SKILL.md, manifest.txt, CHANGELOG.md, references/, scripts/, examples/)
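The correctness check described above can be sketched in plain Python, assuming a max-subtracted ("safe") softmax as the candidate and a naive softmax as the reference. This is not the PR's test code: the helper names, the input range, and the shapes (fewer rows than the 1823x781 case, to keep the pure-Python run short) are illustrative assumptions. Only the tolerance of 1e-3 and the irregular row length 781 are taken from the description.

```python
import math
import random


def safe_softmax(row):
    """Softmax that subtracts the row max first, so exp() never overflows."""
    row_max = max(row)
    exps = [math.exp(v - row_max) for v in row]
    total = math.fsum(exps)
    return [e / total for e in exps]


def naive_softmax(row):
    """Reference softmax; fine for small inputs, overflows for large ones."""
    exps = [math.exp(v) for v in row]
    total = math.fsum(exps)
    return [e / total for e in exps]


def max_abs_error(candidate, reference, shapes, seed=0):
    """Largest elementwise difference over random rows of each (rows, cols) shape."""
    rng = random.Random(seed)
    worst = 0.0
    for rows, cols in shapes:
        for _ in range(rows):
            row = [rng.uniform(-5.0, 5.0) for _ in range(cols)]
            got, want = candidate(row), reference(row)
            worst = max(worst, max(abs(g - w) for g, w in zip(got, want)))
    return worst


SHAPES = [(1, 1), (3, 781), (2, 1024)]  # single element, irregular, power of two
ATOL = 1e-3
worst_error = max_abs_error(safe_softmax, naive_softmax, SHAPES)
passed = worst_error <= ATOL
```

The max subtraction matters for large inputs: `safe_softmax([1000.0, 1000.0])` returns equal probabilities, whereas `naive_softmax` on the same row raises `OverflowError` from `math.exp`.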

Checklist

  • This PR is linked to an issue that was discussed and approved
  • I have tested these changes locally
  • New/changed functionality has test coverage
  • LLM disclosure:
    • I did not use an LLM to create this PR.
    • I used an LLM for assistance while creating this PR.
    • This PR was mostly or completely generated by an LLM.

Add a new skill for writing portable Triton kernels that run on NVIDIA
and AMD GPUs without modification. Covers core DSL patterns, autotune,
numerics, benchmarking, and HuggingFace Kernels Hub integration.

Complements cuda-kernels (raw CUDA) and sits alongside rocm-kernels/
xpu-kernels as the vendor-neutral base they can specialize from.

Contents:
- SKILL.md: main instruction file with reference kernels (softmax, matmul, rmsnorm)
- references/: autotune guide, kernel patterns, benchmarking guide
- scripts/: benchmark and correctness test templates
- examples/: fused softmax with test + benchmark harness

Closes huggingface#302
@github-actions

Hi @jaygala223, thanks for your interest in contributing!

This project requires that pull request authors are vouched, and you are not in the list of vouched users.

This PR will be closed automatically. See https://github.com/huggingface/kernels/blob/main/CONTRIBUTING.md for more details.

@sayakpaul

Member

Can you provide an example kernel that you wrote with this skill and tested? Feel free to apply to be able to publish kernel repo types on the Hub.

@jaygala223

Author

Hi @sayakpaul, I tested the skill by using it as context for an AI agent and asking it to write a LayerNorm kernel (which is not one of the reference kernels already in the skill). The agent followed the patterns from SKILL.md: row-wise reduction, fp32 accumulation, masked loads, contiguity assert, num_warps heuristic.

Results on V100 (fp32, M=4096):

Correctness: All shapes pass against torch.nn.functional.layer_norm at atol=1e-3, including irregular shapes (1823x781), single element, and bf16.

Performance: roughly 1.33x to 1.44x faster than PyTorch LayerNorm across the tested N values:

   N    Triton (GB/s)    PyTorch (GB/s)
 256    497              365
1024    727              548
2048    771              534
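As a rough sketch of how figures like these relate to each other, the snippet below recomputes the per-N speedup from the reported GB/s values and shows one common way to derive effective bandwidth from a timing. The byte-count model (one full read plus one full write of the tensor, ignoring the weight and bias vectors) is an assumption; the thread does not state how the GB/s values were computed. The function name `effective_gbps` is hypothetical.

```python
def effective_gbps(rows, cols, elem_bytes, elapsed_ms):
    """Effective bandwidth, assuming one read and one write of the full tensor."""
    bytes_moved = 2 * rows * cols * elem_bytes
    return bytes_moved / (elapsed_ms * 1e-3) / 1e9


# GB/s figures reported in the comment above, keyed by N: (Triton, PyTorch).
reported = {256: (497, 365), 1024: (727, 548), 2048: (771, 534)}

# Both sides move the same bytes, so the bandwidth ratio equals the speedup.
speedups = {n: round(triton / torch, 2) for n, (triton, torch) in reported.items()}

# Example: an fp32 tensor with M=4096, N=1024 processed in a hypothetical 1.0 ms.
example_gbps = effective_gbps(4096, 1024, 4, 1.0)
```

For a memory-bound operation such as LayerNorm, GB/s is usually a more meaningful metric than TFLOPS, since runtime is dominated by data movement rather than arithmetic.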

The generated kernel + test is here: examples/skill_test_layernorm.py (8d7da2e)

I also applied for kernel repo access on the Hub.

@sayakpaul

Member

What is your Hub username?

@jaygala223

Author

jaygala223

@sayakpaul

Member

You have access to publish the kernels as proper kernel repos on the Hub. Could we please see the kernels generated with this skill on the Hub?

@jaygala223

Author

Yes. Published the layernorm kernel generated using the skill as a kernel repo: https://huggingface.co/kernels/jaygala223/triton-layernorm

@sayakpaul

Member

@jaygala223

jaygala223 commented Jun 24, 2026 via email

Author

@sayakpaul sayakpaul reopened this Jun 24, 2026

@sayakpaul sayakpaul left a comment

Member

We should also add the skill to our docs.

@jaygala223

Author

Hi @sayakpaul, I added triton-kernels to the supported skills list in docs/source/cli-skills.md.

@sayakpaul sayakpaul left a comment

Member

There should be a couple of changes akin to #614.

@jaygala223

Author

Okay done!

Added a tip and example kernel.

Comment thread docs/source/cli-skills.md
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@jaygala223 jaygala223 requested a review from sayakpaul June 27, 2026 04:40
Development

Successfully merging this pull request may close these issues.

[FEATURE] Triton Skill

3 participants