diff --git a/LICENSE b/LICENSE
index dd8968a3..e46d5080 100644
--- a/LICENSE
+++ b/LICENSE
@@ -1,247 +1,21 @@
-June 9 Researcher Reciprocity License
-Version 1.0
-dated June 9, 2026
-
-This is a license (the "License") between you ("You") and GPU Mode and the
-reference-kernels contributors ("Licensor"). This License adapts the Open
-Responsible AI License Source ("Open RAIL-S") pattern for source code and
-project materials, and adds the Researcher Reciprocity use restriction in
-Attachment A. It is intended to have an open and permissive character while
-preserving reciprocal research access when the Project Materials are used to
-train or improve AI systems.
-
-If you train on it, you let us generate.
-
-Section I: Preamble
-
-reference-kernels contains reference implementations, problem specifications,
-tests, examples, and related materials for KernelBot competitions and GPU kernel
-research. The Project Materials include source code, documentation,
-configuration, examples, tests, scripts, and related materials distributed with
-this License.
-
-Licensor wishes to promote collaboration, open research, education,
-benchmarking, and broad reuse of the Project Materials. Licensor also wishes to
-avoid a one-way bargain in which researchers and contributors publish ideas and
-code that are used to improve AI systems, while the providers of those AI
-systems then prohibit those same researchers from generating outputs,
-evaluating the systems, benchmarking them, publishing research, or exploring
-their own ideas.
-
-This License therefore grants broad rights to use the Project Materials,
-subject to attribution and the use-based restriction in Attachment A.
-
-Section II: Definitions
-
-1. "License" means these terms and conditions for use, reproduction, and
-Distribution.
-
-2. "Project Materials" means the source code, documentation, configuration,
-examples, tests, scripts, data, metadata, and other materials distributed with
-this License.
-
-3. "Output" means the results of operating a model, service, application, or
-other system.
-
-4. "Model" means any machine-learning or artificial-intelligence based
-assemblies, including model weights, checkpoints, parameters, optimizer states,
-adapters, embedding systems, agents, APIs, hosted services, or other systems
-that are trained, tuned, evaluated, benchmarked, or otherwise used in connection
-with the Project Materials.
-
-5. "Derivatives of the Project Materials" means all modifications,
-transformations, annotations, translations, extracts, subsets, compilations,
-arrangements, or other works based on the Project Materials.
-
-6. "Derivatives of a Model" means all modifications to a Model, works based on
-a Model, or any other model that is created or initialized by transfer of
-patterns of weights, parameters, activations, embeddings, outputs, or other
-representations of the Model, including distillation methods and methods based
-on synthetic data generated by the Model.
-
-7. "Training Use" means using the Project Materials, in whole or in part, to
-train, pretrain, fine-tune, post-train, align, distill, evaluate for training,
-benchmark for training, generate synthetic data for training, construct
-embeddings for training, rank or filter examples for training, or otherwise
-improve the weights, behavior, capabilities, or performance of a Model or
-Derivatives of a Model.
-
-8. "Covered Model" means any Model or Derivatives of a Model that is trained,
-fine-tuned, distilled, aligned, evaluated for training, benchmarked for
-training, or otherwise improved through Training Use of the Project Materials.
-
-9. "Distribution" means any transmission, reproduction, publication, hosting,
-or other sharing of the Project Materials, Derivatives of the Project
-Materials, a Covered Model, or Derivatives of a Covered Model to a third party,
-including making any of them available by electronic or remote means, such as
-API-based or web access.
-
-10. "Licensor" means GPU Mode, the project maintainers, and any contributor who
-has authority to license their contribution under these terms.
-
-11. "You" or "Your" means an individual or legal entity exercising permissions
-granted by this License or making use of the Project Materials for any purpose.
-
-12. "Third Parties" means individuals or legal entities that are not under
-common control with Licensor or You.
-
-13. "Authorized Researchers" means GPU Mode, the project maintainers, project
-contributors, and any researchers or organizations that GPU Mode designates in
-writing for purposes of generating outputs from, evaluating, benchmarking,
-auditing, criticizing, or publishing research about a Covered Model.
-
-14. "Ordinary Users" means the general class of users to whom You make a
-Covered Model available, including through a public product, commercial product,
-research release, API, hosted service, preview, beta, or gated access program.
-
-Section III: Intellectual Property Rights
-
-2. Grant of Copyright License. Subject to the terms and conditions of this
-License, each Licensor grants You a worldwide, non-exclusive, no-charge,
-royalty-free copyright license to reproduce, prepare derivative works of,
-publicly display, publicly perform, sublicense, and distribute the Project
-Materials and Derivatives of the Project Materials.
-
-3. No Patent License. This License does not grant any patent license.
-
-Section IV: Conditions of Usage, Distribution, and Redistribution
-
-4. Distribution and Redistribution. You may reproduce and distribute copies of
-the Project Materials or Derivatives of the Project Materials in any medium,
-with or without modifications, provided that You meet the following conditions:
-
-4.1. You must give Third Party recipients of the Project Materials or
-Derivatives of the Project Materials a copy of this License or a clear link to
-it.
-
-4.2. You must retain reasonable copyright, license, and attribution notices,
-excluding notices that do not pertain to any part of the Project Materials or
-Derivatives of the Project Materials.
-
-4.3. You must give reasonable attribution to GPU Mode and reference-kernels.
-Reasonable attribution includes, where practical, the project name, a link to
-the project source, and any citation requested in the project documentation.
-
-4.4. You must cause any modified files or documentation that You Distribute to
-carry prominent notices stating that You changed them.
-
-4.5. You may add Your own copyright statement to Your modifications and may
-provide additional or different license terms for Your independent additions,
-annotations, analyses, software, models, outputs, or other works, provided that
-Your use, reproduction, and Distribution of the Project Materials otherwise
-complies with this License.
-
-5. Use-Based Restrictions. The restriction set forth in Attachment A is a
-use-based restriction. You may not use the Project Materials, Derivatives of the
-Project Materials, Covered Models, or Derivatives of Covered Models for the
-restricted use specified in Attachment A.
-
-For Training Use, the use-based restriction in Attachment A must be included as
-an enforceable provision in any legal agreement, terms of use, acceptable use
-policy, license, or other terms governing the use or Distribution of a Covered
-Model or Derivatives of a Covered Model. You must give notice to subsequent
-users that the Covered Model or Derivatives of the Covered Model are subject to
-Attachment A.
-
-6. Outputs. Except as stated in this License, Licensor claims no rights in the
-Output You generate using a Covered Model. You are accountable for the Output
-You generate and its subsequent uses. No use of the Output may contravene this
-License.
-
-Section V: Other Provisions
-
-7. No Endorsement. Nothing in this License permits You to use Licensor's names,
-logos, trademarks, or service marks to imply endorsement, sponsorship, or
-approval.
-
-8. Third-Party Rights. The Project Materials may include material submitted by
-third parties. This License applies only to rights that Licensor has authority
-to license. You are responsible for complying with any third-party rights,
-privacy obligations, laws, or regulations that apply to Your use.
-
-9. Disclaimer of Warranty. Unless required by applicable law or agreed to in
-writing, Licensor provides the Project Materials on an "AS IS" BASIS, WITHOUT
-WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including
-warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, FITNESS
-FOR A PARTICULAR PURPOSE, ACCURACY, AVAILABILITY, OR ABSENCE OF DEFECTS. You
-are solely responsible for determining the appropriateness of using or
-redistributing the Project Materials and assume any risks associated with Your
-exercise of permissions under this License.
-
-10. Limitation of Liability. To the maximum extent permitted by law, in no
-event and under no legal theory, whether in tort, contract, or otherwise,
-unless required by applicable law or agreed to in writing, shall any Licensor or
-contributor be liable to You for damages, including direct, indirect, special,
-incidental, consequential, exemplary, or punitive damages arising as a result of
-this License or out of the use or inability to use the Project Materials, even
-if such Licensor or contributor has been advised of the possibility of such
-damages.
-
-11. Accepting Warranty or Additional Liability. While redistributing the Project
-Materials or Derivatives of the Project Materials, You may choose to offer, and
-charge a fee for, acceptance of support, warranty, indemnity, or other liability
-obligations or rights consistent with this License. However, in accepting such
-obligations, You may act only on Your own behalf and on Your sole
-responsibility, not on behalf of any Licensor or contributor, and only if You
-agree to indemnify, defend, and hold each Licensor and contributor harmless for
-any liability incurred by, or claims asserted against, such Licensor or
-contributor by reason of Your accepting any such warranty or additional
-liability.
-
-12. Termination. If You violate this License, Your rights under it terminate
-automatically. For violations other than violations of Attachment A, Your rights
-are reinstated if You cure the violation within 30 days after discovering it or
-receiving written notice from Licensor. For violations of Attachment A involving
-a Covered Model, Your Training Use rights terminate automatically as to the
-affected Covered Model and may be reinstated only if Licensor provides written
-reinstatement or waiver.
-
-13. Severability. If any provision of this License is held invalid, illegal, or
-unenforceable, the remaining provisions remain valid as if the provision had not
-been set forth. The unenforceable provision will be interpreted or reformed only
-to the minimum extent necessary to make it enforceable while preserving its
-purpose.
-
-14. Additional Permission. Licensor may grant additional permissions,
-exceptions, waivers, commercial terms, or private licenses in writing. Those
-permissions apply only to the recipient and scope stated in the written grant.
-
-End of Terms and Conditions
-
-Attachment A
-Use Restriction: Researcher Reciprocity for Training Use
-
-You agree not to use the Project Materials or Derivatives of the Project
-Materials for Training Use if You make the resulting Covered Model or
-Derivatives of the Covered Model available under terms, policies, technical
-measures, access rules, account restrictions, acceptable-use rules, or other
-conditions that prohibit, penalize, or materially burden Authorized Researchers
-from:
-
-1. generating outputs from the Covered Model;
-
-2. evaluating, auditing, red-teaming, or benchmarking the Covered Model;
-
-3. comparing the Covered Model to other systems;
-
-4. publishing research, criticism, measurements, benchmark results, or analysis
-concerning the Covered Model; or
-
-5. using the Covered Model to explore, test, or develop their own research
-ideas.
-
-This access must be available on materially equal terms to those offered to
-Ordinary Users of the Covered Model, subject only to neutral limits that apply
-equally to Ordinary Users, such as generally applicable rate limits, payment
-terms, safety rules, security rules, and laws.
-
-Any terms, policies, technical measures, access rules, account restrictions,
-acceptable-use rules, or other conditions that conflict with this Attachment A
-make the Covered Model ineligible for the Training Use grant unless Licensor has
-waived the conflict in writing.
-
-You may not suspend, ban, throttle, sue, threaten, or otherwise retaliate
-against Authorized Researchers solely because they engage in the activities
-listed in this Attachment A, provided that their activity complies with
-generally applicable law and neutral safety or security rules that are also
-applied to Ordinary Users.
+MIT License
+
+Copyright (c) 2025 GPU MODE
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/README.md b/README.md
index 1d25040a..218787da 100644
--- a/README.md
+++ b/README.md
@@ -4,19 +4,17 @@ This repo holds reference kernels for the KernelBot which hosts regular competit
 
 You can see what's going on [gpumode.com](https://www.gpumode.com/)
 
-## Competitions
-1. [PMPP practice problems](https://github.com/gpu-mode/reference-kernels/tree/main/problems/pmpp_v2)
+## Competition
+1. [PMPP practice problems](https://gpu-mode.github.io/discord-cluster-manager/docs/active#practice-round-leaderboard)
 2. [AMD $100K kernel competition](problems/amd)
 3. [BioML kernels](problems/bioml)
 4. [AMD $100K distributed kernel competition](problems/amd_distributed)
 5. [NVIDIA Blackwell NVFP4 competition](problems/nvidia)
-6. [AMD $1.1M competition](problems/amd_202602)
-7. [Helion IRL hackathon](problems/helion)
-8. [Linear Algebra Problems](problems/linalg)
 
-We also work with universities on hosting the infrastructure for their classes:
-- [Stanford CS149 assignment 5 kernels](https://github.com/stanford-cs149/asst5-kernels)
-- [Tri Dao's Princeton parallel programming class](problems/princeton)
+## Making a Leaderboard Submission
+
+Please take a look at `vectoradd_py` to see multiple examples of expected submisisons ranging from PyTorch code to Triton to inline CUDA.
+
 
 ## Contributing New Problems
 
@@ -27,12 +25,5 @@ To add a new problem, create a new folder in the `problems/glory` directory wher
 
 You can evaluate problems with your own Modal account (they give you a free $30) by borrowing this [neat script from @gau-nernst](https://github.com/gpu-mode/reference-kernels/pull/96#issue-3850136894)
 
-## License
-
-This project is licensed under the [June 9 Researcher Reciprocity License](LICENSE).
-
-The license adapts the Open RAIL-S structure and adds one specific use restriction: training, fine-tuning, distillation, synthetic-data generation for training, embedding for training, or otherwise using this project to improve an AI model or AI service requires Researcher Reciprocity.
 
-> If you train on it, you let us generate.
 
-Covered AI model and service providers may not use this project while imposing terms that prevent GPU Mode, project contributors, or authorized researchers from generating outputs, evaluating models, benchmarking, publishing research, or exploring their own research ideas on materially equal terms to ordinary users.
diff --git a/problems/amd_202602.yaml b/problems/amd_202602.yaml
deleted file mode 100644
index e5d0d8aa..00000000
--- a/problems/amd_202602.yaml
+++ /dev/null
@@ -1,19 +0,0 @@
-name: AMD Developer Challenge February 2026
-deadline: "2026-04-07 07:59"
-description: "AMD Developer Challenge: MXFP4 matrix multiplication, Mixture-of-Experts, and Multi-head Latent Attention optimized for MI355X."
-problems:
-  - directory: amd_202602/mxfp4-mm
-    name: amd-mxfp4-mm
-    deadline: "2026-04-07 07:59"
-    gpus:
-      - MI355X
-  - directory: amd_202602/moe-mxfp4
-    name: amd-moe-mxfp4
-    deadline: "2026-04-07 07:59"
-    gpus:
-      - MI355X
-  - directory: amd_202602/mixed-mla
-    name: amd-mixed-mla
-    deadline: "2026-04-07 07:59"
-    gpus:
-      - MI355X
diff --git a/problems/amd_202602/eval.py b/problems/amd_202602/eval.py
deleted file mode 100644
index cc5d559b..00000000
--- a/problems/amd_202602/eval.py
+++ /dev/null
@@ -1,387 +0,0 @@
-import base64
-import dataclasses
-import multiprocessing
-import re
-import time
-import os
-import sys
-import math
-from pathlib import Path
-from typing import Any, Optional
-
-import torch.cuda
-
-from utils import set_seed, clear_l2_cache_large as clear_l2_cache
-try:
-    from task import TestSpec
-except ImportError:
-    TestSpec = dict
-
-from reference import check_implementation, generate_input
-
-
-class PopcornOutput:
-    def __init__(self, fd: int):
-        self.file = os.fdopen(fd, 'w')
-        os.set_inheritable(fd, False)
-
-    def __enter__(self):
-        return self
-
-    def __exit__(self, exc_type, exc_val, exc_tb):
-        self.file.close()
-
-    def print(self, *args, **kwargs):
-        print(*args, **kwargs, file=self.file, flush=True)
-
-    def log(self, key, value):
-        self.print(f"{key}: {value}")
-
-
-@dataclasses.dataclass
-class TestCase:
-    args: dict
-    spec: str
-
-
-def _combine(a: int, b: int) -> int:
-    # combine two integers into one:
-    # we need this to generate a secret seed based on the test-level seed and
-    # the global secret seed.
-    # the test-level seeds are public knowledge, and typically relatively small numbers,
-    # so we need to make sure they don't provide any useful info for the full seed.
-    # This Cantor construction ensures that if the secret seed is a large number,
-    # then so is the overall seed.
-    return int(a + (a+b)*(a+b+1)//2)
-
-
-def get_test_cases(file_name: str, seed: Optional[int]) -> list[TestCase]:
-    try:
-        content = Path(file_name).read_text()
-    except Exception as E:
-        print(f"Could not open test file`{file_name}`: {E}", file=sys.stderr)
-        exit(113)
-
-    tests = []
-    lines = content.splitlines()
-    match = r"\s*([a-zA-Z_]\w*):\s*([a-zA-Z_]\w*|[+-]?[0-9]+)\s*"
-    for line in lines:
-        parts = line.split(";")
-        case = {}
-        for part in parts:
-            matched = re.match(match, part)
-            if not re.fullmatch(match, part):
-                print(f"invalid test case: '{line}': '{part}'", file=sys.stderr)
-                exit(113)
-            key = matched[1]
-            val = matched[2]
-            try:
-                val = int(val)
-            except ValueError:
-                if val == "true":
-                    val = True
-                elif val == "false":
-                    val = False
-
-            case[key] = val
-        tests.append(TestCase(spec=line, args=case))
-
-    if seed is not None:
-        for test in tests:
-            if "seed" in test.args:
-                test.args["seed"] = _combine(test.args["seed"], seed)
-
-    return tests
-
-
-@dataclasses.dataclass
-class Stats:
-    runs: int
-    mean: float
-    std: float
-    err: float
-    best: float
-    worst: float
-
-
-def calculate_stats(durations: list[int]):
-    """
-    Calculate statistical data from a list of durations.
-
-    @param durations: A list of durations in nanoseconds.
-    @return: A Stats object containing the number of runs, mean, standard deviation, error, best, and worst durations.
-    """
-    runs = len(durations)
-    total = sum(durations)
-    best = min(durations)
-    worst = max(durations)
-
-    avg = total / runs
-    variance = sum(map(lambda x: (x - avg)**2, durations))
-    std = math.sqrt(variance / (runs - 1))
-    err = std / math.sqrt(runs)
-
-    return Stats(runs=runs, mean=avg, std=std, err=err, best=float(best),
-                 worst=float(worst))
-
-
-def _clone_data(data):
-    """
-    Recursively goes through data and clones all tensors.
-    """
-    if isinstance(data, tuple):
-        return tuple(_clone_data(x) for x in data)
-    elif isinstance(data, list):
-        return [_clone_data(x) for x in data]
-    elif isinstance(data, dict):
-        return {k: _clone_data(v) for k, v in data.items()}
-    elif isinstance(data, torch.Tensor):
-        return data.clone()
-    else:
-        return data
-
-
-def wrap_check_implementation(data, submission_output):
-    # Old version returned just a single string, new version
-    # returns (bool, str); this function ensures compatibility with old
-    # problem definitions.
-    result = check_implementation(data, submission_output)
-    if isinstance(result, tuple):
-        return result
-    else:
-        return not bool(result), result
-
-
-def _run_single_test(test: TestCase):
-    """
-    Runs a single test case. Do not call directly
-    """
-    from submission import custom_kernel
-    data = generate_input(**test.args)
-    torch.cuda.synchronize()
-    submission_output = custom_kernel(_clone_data(data))
-    torch.cuda.synchronize()
-    return wrap_check_implementation(data, submission_output)
-
-
-def run_single_test(pool: multiprocessing.Pool, test: TestCase):
-    """
-    Runs a single test in another process.
-    """
-    return pool.apply(_run_single_test, (test,))
-
-
-def run_testing(logger: PopcornOutput, pool: multiprocessing.Pool, tests: list[TestCase]):
-    """
-    Executes the actual test case code and checks for correctness.
-
-    @param logger: A PopcornOutput object used for logging test results.
-    @param tests: A list of TestCase objects representing the test cases to be executed.
-    @return: An integer representing the exit status: 0 if all tests pass, otherwise 112.
-    """
-    passed = True
-    logger.log("test-count", len(tests))
-    for idx, test in enumerate(tests):
-        logger.log(f"test.{idx}.spec", test.spec)
-        good, message = run_single_test(pool, test)
-        if not good:
-            logger.log(f"test.{idx}.status", "fail")
-            logger.log(f"test.{idx}.error", message)
-            passed = False
-        else:
-            logger.log(f"test.{idx}.status", "pass")
-            if message:
-                logger.log(f"test.{idx}.message", message)
-
-    if passed:
-        logger.log("check", "pass")
-        return 0
-    else:
-        logger.log("check", "fail")
-        return 112
-
-
-def _run_single_benchmark(test: TestCase, recheck: bool, max_repeats: int, max_time_ns: float) -> Stats | Any:
-    """
-    Runs one benchmark. Do not call directly.
-    """
-    from submission import custom_kernel
-
-    durations = []
-    # generate input data once
-    data = generate_input(**test.args)
-    check_copy = _clone_data(data)
-    # first, one obligatory correctness check
-    output = custom_kernel(data)
-    good, message = wrap_check_implementation(check_copy, output)
-    if not good:
-        return message
-
-    # now, do multiple timing runs without further correctness testing
-    # there is an upper bound of 100 runs, and a lower bound of 3 runs;
-    # otherwise, we repeat until we either measure at least 10 full seconds,
-    # or the relative error of the mean is below 1%.
-
-    bm_start_time = time.perf_counter_ns()
-    for i in range(max_repeats):
-        if recheck:
-            # ensure we use a different seed for every benchmark
-            if "seed" in test.args:
-                test.args["seed"] += 13
-
-            data = generate_input(**test.args)
-            check_copy = _clone_data(data)
-        torch.cuda.synchronize()
-        clear_l2_cache()
-        start_event = torch.cuda.Event(enable_timing=True)
-        end_event = torch.cuda.Event(enable_timing=True)
-        start_event.record()
-        output = custom_kernel(data)
-        end_event.record()
-        torch.cuda.synchronize()
-
-        if recheck:
-            good, message = check_implementation(check_copy, output)
-            if not good:
-                return message
-
-        del output
-        durations.append(start_event.elapsed_time(end_event) * 1e6)
-
-        if i > 1:
-            total_bm_duration = time.perf_counter_ns() - bm_start_time
-            stats = calculate_stats(durations)
-            # stop if either
-            # a) relative error dips below 0.1%
-            # b) we exceed the total time limit for benchmarking the kernel
-            # c) we exceed 2 minutes of total wallclock time.
-            if stats.err / stats.mean < 0.001 or stats.mean * stats.runs > max_time_ns or total_bm_duration > 120e9:
-                break
-
-    return calculate_stats(durations)
-
-
-def run_single_benchmark(pool: multiprocessing.Pool, test: TestCase, recheck: bool, max_repeats: int,
-                         max_time_ns: float):
-    """
-    For a particular test case, check correctness (if applicable) and grab runtime results.
-
-    @param pool: Process on which the benchmark will be launched.
-    @param test: TestCase object.
-    @param recheck: Flag for whether to explicitly check functional correctness.
-    @param max_repeats: Number of trials to repeat.
-    @param max_time_ns: Timeout time in nanoseconds.
-    @return: A Stats object for this particular benchmark case or an error if the test fails.
-    """
-    return pool.apply(_run_single_benchmark, (test, recheck, max_repeats, max_time_ns))
-
-
-def run_benchmarking(logger: PopcornOutput, pool: multiprocessing.Pool, tests: list[TestCase]):
-    """
-    Executes benchmarking code for a CUDA Kernel and logs runtimes.
-
-    @param logger: A PopcornOutput object used for logging benchmark results.
-    @param pool: Process on which the benchmarks will be launched.
-    @param tests: A list of TestCase objects representing the test cases to be benchmarked.
-    @return: An integer representing the exit status: 0 if all benchmarks pass, otherwise 112.
-    """
-    # warm up
-    run_single_benchmark(pool, tests[0], False, 100, 10e7)
-
-    passed = True
-    logger.log("benchmark-count", len(tests))
-    for idx, test in enumerate(tests):
-        logger.log(f"benchmark.{idx}.spec", test.spec)
-        result = run_single_benchmark(pool, test, False, 1000, 50e9)
-        if isinstance(result, Stats):
-            for field in dataclasses.fields(Stats):
-                logger.log(f"benchmark.{idx}.{field.name}", getattr(result, field.name))
-        else:
-            passed = False
-            logger.log(f"benchmark.{idx}.status", "fail")
-            logger.log(f"benchmark.{idx}.error", result)
-
-    if passed:
-        logger.log("check", "pass")
-        return 0
-    else:
-        logger.log("check", "fail")
-        return 112
-
-
-def run_single_profile(test: TestCase) -> str:
-    """
-    Runs a single test case. Do not call directly
-    """
-    from submission import custom_kernel
-    from torch.profiler import profile, record_function, ProfilerActivity
-    data = generate_input(**test.args)
-    torch.cuda.synchronize()
-
-    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
-        submission_output = custom_kernel(data)
-        torch.cuda.synchronize()
-    return prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=20)
-
-
-def run_profiling(logger: PopcornOutput, tests: list[TestCase]):
-    logger.log("benchmark-count", len(tests))
-    for idx, test in enumerate(tests):
-        logger.log(f"benchmark.{idx}.spec", test.spec)
-        report = run_single_profile(test)
-        logger.log(f"benchmark.{idx}.report", base64.b64encode(report.encode("utf-8"), b"+*").decode("utf-8"))
-    logger.log("check", "pass")
-    return 0
-
-
-def main():
-    fd = os.getenv("POPCORN_FD")
-    if not fd:
-        return 111
-
-    if len(sys.argv) < 3:
-        return 2
-
-    mode = sys.argv[1]
-    seed = os.getenv("POPCORN_SEED")
-    os.unsetenv("POPCORN_SEED")
-    seed = int(seed) if seed else None
-    set_seed(seed or 42)
-    tests = get_test_cases(sys.argv[2], seed)
-
-    with PopcornOutput(int(fd)) as logger:
-        import multiprocessing
-        mp_context = multiprocessing.get_context('spawn')
-        with mp_context.Pool(1) as pool:
-            if mode == "test":
-                return run_testing(logger, pool, tests)
-            if mode == "benchmark":
-                return run_benchmarking(logger, pool, tests)
-
-            if mode == "leaderboard":
-                # warmup
-                run_single_benchmark(pool, tests[0], False, 100, 1e7)
-                logger.log("benchmark-count", len(tests))
-                passed = True
-                for i in range(len(tests)):
-                    result = run_single_benchmark(pool, tests[i], True, 100, 30e9)
-                    logger.log(f"benchmark.{i}.spec", tests[i].spec)
-                    if isinstance(result, Stats):
-                        for field in dataclasses.fields(Stats):
-                            logger.log(f"benchmark.{i}.{field.name}", getattr(result, field.name))
-                    else:
-                        passed = False
-                        logger.log(f"benchmark.{i}.status", "fail")
-                        logger.log(f"benchmark.{i}.error", str(result))  # TODO: Make sure result implements __str__?
-                        break
-
-                logger.log("check", "pass" if passed else "fail")
-            elif mode == "profile":
-                run_profiling(logger, tests)
-            else:
-                # TODO: Implement script mode
-                return 2
-
-
-if __name__ == "__main__":
-    sys.exit(main())
diff --git a/problems/amd_202602/mixed-mla/README.md b/problems/amd_202602/mixed-mla/README.md
deleted file mode 100644
index 24cae9ce..00000000
--- a/problems/amd_202602/mixed-mla/README.md
+++ /dev/null
@@ -1,200 +0,0 @@
-# MLA (Multi-head Latent Attention) Decode Kernel
-
-## Description
-
-Implement a custom MLA attention decode kernel optimized for MI355X.
-
-This is the **inner attention kernel** from DeepSeek R1's `forward_absorb` MLA path.
-The absorbed query and compressed KV cache are provided directly — you implement the
-attention computation with variable-length batching.
-
-The reference uses **aiter MLA a8w8 decode kernel** (`mla_decode_fwd`, fp8 Q + fp8 KV,
-persistent mode). On MI355X, a8w8 is ~2-3x faster than bf16 with negligible accuracy loss.
-The reference quantizes Q to fp8 on-the-fly and uses pre-quantized fp8 KV from `kv_data["fp8"]`.
-
-## DeepSeek R1 Forward-Absorb MLA Config
-
-| Parameter | Value | Notes |
-|---|---|---|
-| num_heads | 16 | Query heads (after TP split) |
-| num_kv_heads | 1 | Single shared latent KV head |
-| kv_lora_rank | 512 | Latent dimension |
-| qk_rope_head_dim | 64 | RoPE embedding dimension |
-| qk_head_dim | 576 | kv_lora_rank + qk_rope_head_dim (absorbed q/k dim) |
-| v_head_dim | 512 | = kv_lora_rank (output dim) |
-| sm_scale | 1/sqrt(576) | |
-| q dtype | bfloat16 | Input always bf16; reference quantizes to fp8 on-the-fly |
-| kv dtype | bf16 / fp8 / mxfp4 | All three provided simultaneously |
-| mode | decode | q_seq_len=1, kv_seq_len up to 8k |
-
-## Reference Kernel
-
-The reference (`ref_kernel`) is configurable via two globals in `reference.py`:
-
-| `Q_DTYPE` | `KV_DTYPE` | Aiter kernel dispatched | Description |
-|---|---|---|---|
-| `"fp8"` (default) | `"fp8"` (default) | `mla_a8w8_qh16_qseqlen1_gqaratio16_ps` | fp8 Q + fp8 KV — fastest |
-| `"bf16"` | `"fp8"` | `mla_a16w8_qh16_m16x4_n16x1_coex0_mask1_ps` | bf16 Q + fp8 KV |
-| `"bf16"` | `"bf16"` | `mla_a16w16_qh16_m16x4_n16x1_coex0_mask1_ps` | bf16 Q + bf16 KV — highest precision |
-
-**Note**: `Q_DTYPE="fp8"` + `KV_DTYPE="bf16"` is not a valid combination (no a8w16 kernel exists).
-
-### Reference Latency (MI355X)
-
-| Case | a8w8 (us) | a16w16 (us) | a8w8 speedup |
-|---|---|---|---|
-| bs=4, kv=1k | ~118 | ~162 | 1.4x |
-| bs=4, kv=8k | ~113 | ~177 | 1.6x |
-| bs=64, kv=8k | ~171 | ~353 | 2.1x |
-| bs=256, kv=8k | ~349 | ~814 | 2.3x |
-
-## KV Buffer Format (forward_absorb)
-
-The compressed KV buffer has `qk_head_dim=576` dimensions:
-- **Full 576 dims** are used as **keys** (for Q@K^T score computation)
-- **First 512 dims** (kv_lora_rank) are used as **values** (for output computation)
-
-## KV Cache Quantization
-
-| dtype | kv_buffer | kv_scale | Quantization | Bandwidth |
-|---|---|---|---|---|
-| bf16 | bfloat16 `(total_kv, 1, 576)` | None | No quantization | 1x |
-| fp8 | fp8 `(total_kv, 1, 576)` | scalar float32 | Dynamic per-tensor (sglang `scaled_fp8_quant`) | 2x savings |
-| mxfp4 | fp4x2 `(total_kv, 1, 288)` | fp8_e8m0 `(total_kv, N_blocks)` | Block-32 MXFP4 (aiter `dynamic_mxfp4_quant`) | 4x savings |
-
-### FP8 quantization (sglang `scaled_fp8_quant`)
-
-- **Granularity**: per-tensor
-- **Scale**: `kv_scale = max(abs(kv_bf16)) / fp8_max`
-- **Quantize**: `kv_fp8 = (kv_bf16 / kv_scale).clamp(...).to(fp8)`
-- **Dequantize**: `kv_bf16 ≈ kv_fp8.to(bf16) * kv_scale`
-- **kv_scale**: scalar float32 tensor
-
-### MXFP4 quantization (aiter `dynamic_mxfp4_quant`)
-
-- **Granularity**: per-block of 32 elements
-- **FP4 format**: E2M1 — values `[0, 0.5, 1, 1.5, 2, 3, 4, 6]`, max = 6.0
-- **Scale format**: E8M0 — exponent-only scale stored in `aiter.dtypes.fp8_e8m0`
-- **Packing**: 2 FP4 values packed per byte (low nibble = even index, high nibble = odd index)
-- **kv_buffer**: `(total_kv, 1, 288)` in `aiter.dtypes.fp4x2` — packed FP4 data
-- **kv_scale**: `(total_kv, N_blocks)` in `aiter.dtypes.fp8_e8m0` — per-block E8M0 scale factors
-- **Dequantize**: `aiter.utility.fp4_utils.mxfp4_to_f32` + `e8m0_to_f32` for block-wise scaling
-
-### aiter dtype reference
-
-| Logical type | aiter dtype | PyTorch native (if available) | Fallback |
-|---|---|---|---| -| fp4x2 | `aiter.dtypes.fp4x2` | `torch.float4_e2m1fn_x2` | `torch.uint8` | -| fp8_e8m0 | `aiter.dtypes.fp8_e8m0` | `torch.float8_e8m0fnu` | `torch.uint8` | -| fp8 | `aiter.dtypes.fp8` | `torch.float8_e4m3fnuz` (gfx942) / `torch.float8_e4m3fn` (gfx950+) | `torch.uint8` | - -## Input - -A tuple `(q, kv_data, qo_indptr, kv_indptr, config)`: - -``` -q: (total_q, 16, 576) bfloat16 — absorbed queries -kv_data: dict with three KV cache formats (see below) -qo_indptr: (batch_size + 1,) int32 — query segment pointers -kv_indptr: (batch_size + 1,) int32 — KV segment pointers -config: dict — MLA parameters -``` - -### kv_data dict - -All three KV cache formats are provided simultaneously. Each entry is either a -`Tensor` (bf16) or a `(Tensor, Tensor)` tuple (quantized buffer + scale): - -```python -kv_data = { - "bf16": kv_buffer_bf16, # Tensor (total_kv, 1, 576) bfloat16 - "fp8": (kv_buffer_fp8, kv_scale_fp8), # (fp8 Tensor, scalar float32) - "mxfp4": (kv_buffer_mxfp4, kv_scale_mxfp4), # (fp4x2 Tensor, fp8_e8m0 Tensor) -} -``` - -### config dict - -```python -config = { - "batch_size": int, - "num_heads": 16, - "num_kv_heads": 1, - "qk_head_dim": 576, - "kv_lora_rank": 512, - "qk_rope_head_dim": 64, - "v_head_dim": 512, - "q_seq_len": 1, - "kv_seq_len": int, # varies per test case (1024 or 8192) - "sm_scale": 0.04166..., # 1/sqrt(576) -} -``` - -## Output - -``` -attention_output: (total_q, 16, 512) bfloat16 -``` - -## Optimization Opportunities - -The reference is already a highly optimized aiter a8w8 persistent kernel. To beat it, consider: - -1. **MXFP4 KV cache**: 4x bandwidth savings over bf16, 2x over fp8. Two strategies: - - **Strategy A — Fuse dequantization with attention (keep Q in bf16/fp8):** - Load quantized KV tiles from HBM, dequantize in registers/LDS to bf16, and - immediately compute QK^T and softmax·V — never writing the decompressed KV back - to HBM. 
This eliminates the extra read/write of the bf16 intermediate buffer, - roughly quartering the memory traffic for mxfp4 compared to the naive - dequant-then-attend approach. - - **Strategy B — Quantize Q to match KV precision (full low-precision compute):** - Dynamically quantize Q from bf16 → mxfp4 (per-block scaling), then compute QK^T - entirely in fp4×fp4 using MFMA instructions on MI355X. The softmax is still done - in fp32 for numerical stability, and V accumulation uses fp4×fp4 → fp32. This - trades a small amount of accuracy for significantly higher throughput on the - matrix units. - -2. **Custom split-K / split-batch scheduling**: the aiter kernel uses 32-way KV splits - with reduce; a different split strategy or tile size may be more efficient for certain - batch/seq_len combinations. - -3. **MQA pattern**: 1 KV head shared across 16 query heads — minimize redundant KV loads - by loading KV once and broadcasting across all query heads in shared memory/LDS. - -4. **Variable-length batching**: indptr-based segmented attention across batch elements. - -5. **Split K/V from buffer**: full 576 dims for keys, first 512 for values — potential - for separate tiling strategies for the score and output stages. - -## Accuracy - -Submissions are checked against the a8w8 reference with `rtol=2e-02, atol=8e-03`. - -Measured accuracy of different approaches vs bf16 torch ground truth: - -| Approach | max abs diff | Notes | -|---|---|---| -| aiter a8w8 (reference) | 2.6e-05 — 8.0e-05 | fp8 quantization + kernel accumulation | -| torch fp8 (scaled_mm) | 2e-06 — 1.5e-05 | Closest to bf16 | -| torch mxfp4 | 2.1e-04 — 8.3e-04 | 4-bit quantization noise | - -All approaches are well within the tolerance. - -## Benchmark Cases - -All three KV formats (bf16, fp8, mxfp4) are provided in every test case. 
-
-| batch_size | q_seq_len | kv_seq_len |
-|---|---|---|
-| 4 | 1 | 1024 |
-| 4 | 1 | 8192 |
-| 32 | 1 | 1024 |
-| 32 | 1 | 8192 |
-| 64 | 1 | 1024 |
-| 64 | 1 | 8192 |
-| 256 | 1 | 1024 |
-| 256 | 1 | 8192 |
-
-Ranking is by **geometric mean** of benchmark latencies.
diff --git a/problems/amd_202602/mixed-mla/reference.py b/problems/amd_202602/mixed-mla/reference.py
deleted file mode 100644
index 9bddf10f..00000000
--- a/problems/amd_202602/mixed-mla/reference.py
+++ /dev/null
@@ -1,372 +0,0 @@
-"""
-Reference implementation for MLA (Multi-head Latent Attention) decode kernel.
-
-Uses aiter MLA kernels (mla_decode_fwd) as the reference.
-DeepSeek R1 forward_absorb MLA: absorbed q (576), compressed kv_buffer (576),
-output v_head_dim = kv_lora_rank = 512.
-
-The input provides:
-  q: (total_q, 16, 576) bfloat16 — absorbed query
-  kv_data: dict with KV cache in three formats:
-    "bf16": Tensor (total_kv, 1, 576) bfloat16 — highest precision
-    "fp8": (Tensor, Tensor) kv_buffer fp8 + scalar scale — per-tensor quantized
-    "mxfp4": (Tensor, Tensor) kv_buffer fp4x2 + fp8_e8m0 — block-32 quantized
-  The reference quantizes Q to fp8 on-the-fly inside ref_kernel.
-
-The reference kernel quantizes Q to fp8 on-the-fly and uses fp8 KV (a8w8 kernel),
-which is ~2-3x faster than bf16 on MI355X with negligible accuracy loss.
-
-Decode only — persistent mode with get_mla_metadata_v1.
-""" - -import torch -import torch.nn.functional as F -from task import input_t, output_t -from utils import make_match_reference - -from aiter.mla import mla_decode_fwd -from aiter import dtypes as aiter_dtypes -from aiter import get_mla_metadata_info_v1, get_mla_metadata_v1 -from aiter.utility.fp4_utils import ( - dynamic_mxfp4_quant, - mxfp4_to_f32, - e8m0_to_f32, -) - -# --------------------------------------------------------------------------- -# DeepSeek R1 latent MQA constants (forward_absorb path) -# https://huggingface.co/deepseek-ai/DeepSeek-R1-0528/blob/main/config.json -# --------------------------------------------------------------------------- -NUM_HEADS = 16 -NUM_KV_HEADS = 1 -KV_LORA_RANK = 512 -QK_ROPE_HEAD_DIM = 64 -QK_HEAD_DIM = KV_LORA_RANK + QK_ROPE_HEAD_DIM # 576 -V_HEAD_DIM = KV_LORA_RANK # 512 -SM_SCALE = 1.0 / (QK_HEAD_DIM ** 0.5) - -PAGE_SIZE = 1 -NUM_KV_SPLITS = 32 - -# FP8 dtype (platform-specific via aiter) -FP8_DTYPE = aiter_dtypes.fp8 - -# Query dtype for the reference kernel: "fp8" or "bf16" -Q_DTYPE = "fp8" - -# KV cache dtype for the reference kernel: "fp8" or "bf16" -KV_DTYPE = "fp8" - - -# --------------------------------------------------------------------------- -# FP8 quantization (sglang style: dynamic per-tensor) -# --------------------------------------------------------------------------- - -def quantize_fp8(tensor: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]: - """ - Dynamic per-tensor FP8 quantization (following sglang scaled_fp8_quant). - - Args: - tensor: bf16 tensor to quantize - - Returns: - (fp8_tensor, scale) where scale is a scalar float32 tensor. 
- Dequantize: fp8_tensor.to(bf16) * scale - """ - finfo = torch.finfo(FP8_DTYPE) - amax = tensor.abs().amax().clamp(min=1e-12) - scale = amax / finfo.max - fp8_tensor = (tensor / scale).clamp(min=finfo.min, max=finfo.max).to(FP8_DTYPE) - return fp8_tensor, scale.to(torch.float32).reshape(1) - - -# --------------------------------------------------------------------------- -# MXFP4 quantization (aiter native: block-32, fp4x2 + fp8_e8m0 dtypes) -# Uses aiter.utility.fp4_utils.dynamic_mxfp4_quant -# --------------------------------------------------------------------------- - -def quantize_mxfp4(tensor: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]: - """ - MXFP4 block-wise quantization using aiter's dynamic_mxfp4_quant. - - Block size = 32. Each block gets an E8M0 scale factor. - Two FP4 E2M1 values are packed per byte. - - Args: - tensor: bf16 tensor of shape [B, M, N] (N must be divisible by 32) - - Returns: - (fp4_data, scale_e8m0) - - fp4_data: shape [B, M, N//2] in aiter_dtypes.fp4x2 - - scale_e8m0: shape [B*M, ceil(N/32)] padded, in aiter_dtypes.fp8_e8m0 - """ - orig_shape = tensor.shape # (B, M, N) - B, M, N = orig_shape - - # dynamic_mxfp4_quant expects 2D: (B*M, N) - tensor_2d = tensor.reshape(B * M, N) - fp4_data_2d, scale_e8m0 = dynamic_mxfp4_quant(tensor_2d) - - # Reshape fp4_data back to 3D: (B, M, N//2) - fp4_data = fp4_data_2d.view(B, M, N // 2) - - return fp4_data, scale_e8m0 - - -def dequantize_mxfp4( - fp4_data: torch.Tensor, - scale_e8m0: torch.Tensor, - orig_shape: tuple, - dtype: torch.dtype = torch.bfloat16, -) -> torch.Tensor: - """ - Dequantize MXFP4 tensor using aiter utilities. - - Note: dynamic_mxfp4_quant may pad both row and block dimensions in scale_e8m0. - We trim scales to match the actual data dimensions. 
- - Args: - fp4_data: packed FP4 data, shape [B, M, N//2] in fp4x2 or uint8 - scale_e8m0: E8M0 block scale factors (possibly padded) in fp8_e8m0 - orig_shape: original (B, M, N) for reshaping - dtype: output dtype - - Returns: - Dequantized tensor of shape orig_shape. - """ - B, M, N = orig_shape - num_rows = B * M - block_size = 32 - num_blocks = N // block_size # actual blocks needed (e.g. 576/32 = 18) - - # Unpack FP4 to float32: mxfp4_to_f32 expects (..., N//2) -> (..., N) - fp4_data_2d = fp4_data.reshape(num_rows, N // 2) - float_vals = mxfp4_to_f32(fp4_data_2d) # (num_rows, N) - - # Convert E8M0 scales to float32 and trim padded dimensions - scale_f32 = e8m0_to_f32(scale_e8m0) # (padded_rows, padded_blocks) - scale_f32 = scale_f32[:num_rows, :num_blocks] # (num_rows, num_blocks) - - # Apply block scales - float_vals_blocked = float_vals.view(num_rows, num_blocks, block_size) - scaled = float_vals_blocked * scale_f32.unsqueeze(-1) - - return scaled.view(B, M, N).to(dtype) - - -# --------------------------------------------------------------------------- -# Persistent mode metadata helpers -# --------------------------------------------------------------------------- - -def _make_mla_decode_metadata( - batch_size: int, - max_q_len: int, - nhead: int, - nhead_kv: int, - q_dtype: torch.dtype, - kv_dtype: torch.dtype, - qo_indptr: torch.Tensor, - kv_indptr: torch.Tensor, - kv_last_page_len: torch.Tensor, - num_kv_splits: int = NUM_KV_SPLITS, -): - """Allocate and populate work buffers for persistent mla_decode_fwd.""" - info = get_mla_metadata_info_v1( - batch_size, max_q_len, nhead, q_dtype, kv_dtype, - is_sparse=False, fast_mode=False, - num_kv_splits=num_kv_splits, intra_batch_mode=True, - ) - work = [torch.empty(s, dtype=t, device="cuda") for s, t in info] - (work_metadata, work_indptr, work_info_set, - reduce_indptr, reduce_final_map, reduce_partial_map) = work - - # Populate the metadata buffers - get_mla_metadata_v1( - qo_indptr, kv_indptr, 
kv_last_page_len, - nhead // nhead_kv, # num_heads_per_head_k - nhead_kv, # num_heads_k - True, # is_causal - work_metadata, work_info_set, work_indptr, - reduce_indptr, reduce_final_map, reduce_partial_map, - page_size=PAGE_SIZE, - kv_granularity=max(PAGE_SIZE, 16), - max_seqlen_qo=max_q_len, - uni_seqlen_qo=max_q_len, - fast_mode=False, - max_split_per_batch=num_kv_splits, - intra_batch_mode=True, - dtype_q=q_dtype, - dtype_kv=kv_dtype, - ) - - return { - "work_meta_data": work_metadata, - "work_indptr": work_indptr, - "work_info_set": work_info_set, - "reduce_indptr": reduce_indptr, - "reduce_final_map": reduce_final_map, - "reduce_partial_map": reduce_partial_map, - } - - -# --------------------------------------------------------------------------- -# Aiter reference kernel (decode only) -# --------------------------------------------------------------------------- - -def _aiter_mla_decode( - q: torch.Tensor, - kv_buffer: torch.Tensor, - qo_indptr: torch.Tensor, - kv_indptr: torch.Tensor, - config: dict, - q_scale: torch.Tensor | None = None, - kv_scale: torch.Tensor | None = None, -) -> torch.Tensor: - """ - MLA decode attention using aiter persistent-mode kernel. 
- - Supports multiple Q/KV dtype combinations: - - Q_DTYPE="fp8": fp8 Q + fp8 KV (a8w8) — fastest on MI355X - - Q_DTYPE="bf16": bf16 Q + bf16 KV (a16w16) — highest precision - - q: (total_q, num_heads, 576) fp8 or bf16 - kv_buffer: (total_kv, 1, 576) fp8 or bf16 - q_scale: scalar float32 (required for fp8 Q, None for bf16) - kv_scale: scalar float32 (required for fp8 KV, None for bf16) - """ - batch_size = config["batch_size"] - nq = config["num_heads"] - nkv = config["num_kv_heads"] - dq = config["qk_head_dim"] - dv = config["v_head_dim"] - q_seq_len = config["q_seq_len"] - - total_kv_len = int(kv_indptr[-1].item()) - kv_indices = torch.arange(total_kv_len, dtype=torch.int32, device="cuda") - - # Reshape kv_buffer to 4D for aiter: (total_kv, page_size, nhead_kv, dim) - kv_buffer_4d = kv_buffer.view(kv_buffer.shape[0], PAGE_SIZE, nkv, kv_buffer.shape[-1]) - - max_q_len = q_seq_len - kv_last_page_len = (kv_indptr[1:] - kv_indptr[:-1]).to(torch.int32) - - # Build persistent-mode metadata - meta = _make_mla_decode_metadata( - batch_size, max_q_len, nq, nkv, - q.dtype, kv_buffer.dtype, - qo_indptr, kv_indptr, kv_last_page_len, - num_kv_splits=NUM_KV_SPLITS, - ) - - o = torch.empty((q.shape[0], nq, dv), dtype=torch.bfloat16, device="cuda") - mla_decode_fwd( - q.view(-1, nq, dq), - kv_buffer_4d, - o, - qo_indptr, - kv_indptr, - kv_indices, - kv_last_page_len, - max_q_len, - page_size=PAGE_SIZE, - nhead_kv=nkv, - sm_scale=SM_SCALE, - logit_cap=0.0, - num_kv_splits=NUM_KV_SPLITS, - q_scale=q_scale, - kv_scale=kv_scale, - intra_batch_mode=True, - **meta, - ) - return o - - -# --------------------------------------------------------------------------- -# generate_input / ref_kernel / check_implementation -# --------------------------------------------------------------------------- - -def generate_input(batchsize: int, qseqlen: int, kvseqlen: int, seed: int) -> input_t: - """ - Generate absorbed q and compressed kv_buffer for MLA decode. 
- - Returns all three KV cache formats in kv_data dict: - kv_data = { - "bf16": Tensor — (total_kv, 1, 576) bfloat16 - "fp8": (Tensor, Tensor) — kv_buffer fp8 + scalar scale - "mxfp4": (Tensor, Tensor) — kv_buffer fp4x2 + fp8_e8m0 scale - } - """ - gen = torch.Generator(device="cuda") - gen.manual_seed(seed) - - total_q = batchsize * qseqlen - total_kv = batchsize * kvseqlen - - # Absorbed query: (total_q, num_heads, 576) bf16 - q = torch.randn( - (total_q, NUM_HEADS, QK_HEAD_DIM), - dtype=torch.bfloat16, device="cuda", generator=gen, - ) - - # Compressed KV buffer: (total_kv, 1, 576) bf16 — the source of truth - kv_buffer_bf16 = torch.randn( - (total_kv, NUM_KV_HEADS, QK_HEAD_DIM), - dtype=torch.bfloat16, device="cuda", generator=gen, - ) - - # Quantize KV to fp8 - kv_buffer_fp8, kv_scale_fp8 = quantize_fp8(kv_buffer_bf16) - - # Quantize KV to mxfp4 - kv_buffer_mxfp4, kv_scale_mxfp4 = quantize_mxfp4(kv_buffer_bf16) - - # All three KV formats: bf16 is a Tensor, fp8/mxfp4 are (Tensor, Tensor) tuples - kv_data = { - "bf16": kv_buffer_bf16, - "fp8": (kv_buffer_fp8, kv_scale_fp8), - "mxfp4": (kv_buffer_mxfp4, kv_scale_mxfp4), - } - - qo_indptr = torch.arange(0, batchsize + 1, dtype=torch.int32, device="cuda") * qseqlen - kv_indptr = torch.arange(0, batchsize + 1, dtype=torch.int32, device="cuda") * kvseqlen - - config = { - "batch_size": batchsize, - "num_heads": NUM_HEADS, - "num_kv_heads": NUM_KV_HEADS, - "qk_head_dim": QK_HEAD_DIM, - "kv_lora_rank": KV_LORA_RANK, - "qk_rope_head_dim": QK_ROPE_HEAD_DIM, - "v_head_dim": V_HEAD_DIM, - "q_seq_len": qseqlen, - "kv_seq_len": kvseqlen, - "sm_scale": SM_SCALE, - } - - return (q, kv_data, qo_indptr, kv_indptr, config) - - -def ref_kernel(data: input_t) -> output_t: - """Reference MLA decode attention. 
Uses Q_DTYPE and KV_DTYPE to select kernel variant.""" - q, kv_data, qo_indptr, kv_indptr, config = data - - # Resolve Q - if Q_DTYPE == "fp8": - q_input, q_scale = quantize_fp8(q) - else: - q_input, q_scale = q, None - - # Resolve KV - if KV_DTYPE == "fp8": - kv_buffer_fp8, kv_scale = kv_data["fp8"] - kv_input = kv_buffer_fp8 - else: - kv_input, kv_scale = kv_data["bf16"], None - - return _aiter_mla_decode( - q_input, kv_input, qo_indptr, kv_indptr, config, - q_scale=q_scale, kv_scale=kv_scale, - ) - - -check_implementation = make_match_reference(ref_kernel, rtol=1e-01, atol=1e-01) diff --git a/problems/amd_202602/mixed-mla/submission.py b/problems/amd_202602/mixed-mla/submission.py deleted file mode 100644 index fba8b760..00000000 --- a/problems/amd_202602/mixed-mla/submission.py +++ /dev/null @@ -1,299 +0,0 @@ -# gpumode leaderboard reference -""" -Reference implementation for MLA (Multi-head Latent Attention) decode kernel. - -Uses aiter MLA kernels (mla_decode_fwd) as the reference. -DeepSeek R1 forward_absorb MLA: absorbed q (576), compressed kv_buffer (576), -output v_head_dim = kv_lora_rank = 512. - -The input provides: - q: (total_q, 16, 576) bfloat16 — absorbed query - kv_data: dict with KV cache in three formats: - "bf16": Tensor (total_kv, 1, 576) bfloat16 — highest precision - "fp8": (Tensor, Tensor) kv_buffer fp8 + scalar scale — per-tensor quantized - "mxfp4": (Tensor, Tensor) kv_buffer fp4x2 + fp8_e8m0 — block-32 quantized - The reference quantizes Q to fp8 on-the-fly inside ref_kernel. - -The reference kernel quantizes Q to fp8 on-the-fly and uses fp8 KV (a8w8 kernel), -which is ~2-3x faster than bf16 on MI355X with negligible accuracy loss. - -Decode only — persistent mode with get_mla_metadata_v1. 
-""" - -import torch -import torch.nn.functional as F -from task import input_t, output_t -from utils import make_match_reference - -from aiter.mla import mla_decode_fwd -from aiter import dtypes as aiter_dtypes -from aiter import get_mla_metadata_info_v1, get_mla_metadata_v1 -from aiter.utility.fp4_utils import ( - dynamic_mxfp4_quant, - mxfp4_to_f32, - e8m0_to_f32, -) - -# --------------------------------------------------------------------------- -# DeepSeek R1 latent MQA constants (forward_absorb path) -# https://huggingface.co/deepseek-ai/DeepSeek-R1-0528/blob/main/config.json -# --------------------------------------------------------------------------- -NUM_HEADS = 16 -NUM_KV_HEADS = 1 -KV_LORA_RANK = 512 -QK_ROPE_HEAD_DIM = 64 -QK_HEAD_DIM = KV_LORA_RANK + QK_ROPE_HEAD_DIM # 576 -V_HEAD_DIM = KV_LORA_RANK # 512 -SM_SCALE = 1.0 / (QK_HEAD_DIM ** 0.5) - -PAGE_SIZE = 1 -NUM_KV_SPLITS = 32 - -# FP8 dtype (platform-specific via aiter) -FP8_DTYPE = aiter_dtypes.fp8 - -# Query dtype for the reference kernel: "fp8" or "bf16" -Q_DTYPE = "fp8" - -# KV cache dtype for the reference kernel: "fp8" or "bf16" -KV_DTYPE = "fp8" - - -# --------------------------------------------------------------------------- -# FP8 quantization (sglang style: dynamic per-tensor) -# --------------------------------------------------------------------------- -def quantize_fp8(tensor: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]: - """ - Dynamic per-tensor FP8 quantization (following sglang scaled_fp8_quant). - - Args: - tensor: bf16 tensor to quantize - - Returns: - (fp8_tensor, scale) where scale is a scalar float32 tensor. 
- Dequantize: fp8_tensor.to(bf16) * scale - """ - finfo = torch.finfo(FP8_DTYPE) - amax = tensor.abs().amax().clamp(min=1e-12) - scale = amax / finfo.max - fp8_tensor = (tensor / scale).clamp(min=finfo.min, max=finfo.max).to(FP8_DTYPE) - return fp8_tensor, scale.to(torch.float32).reshape(1) - - -# --------------------------------------------------------------------------- -# MXFP4 quantization (aiter native: block-32, fp4x2 + fp8_e8m0 dtypes) -# Uses aiter.utility.fp4_utils.dynamic_mxfp4_quant -# --------------------------------------------------------------------------- - -def quantize_mxfp4(tensor: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]: - """ - MXFP4 block-wise quantization using aiter's dynamic_mxfp4_quant. - - Block size = 32. Each block gets an E8M0 scale factor. - Two FP4 E2M1 values are packed per byte. - - Args: - tensor: bf16 tensor of shape [B, M, N] (N must be divisible by 32) - - Returns: - (fp4_data, scale_e8m0) - - fp4_data: shape [B, M, N//2] in aiter_dtypes.fp4x2 - - scale_e8m0: shape [B*M, ceil(N/32)] padded, in aiter_dtypes.fp8_e8m0 - """ - orig_shape = tensor.shape # (B, M, N) - B, M, N = orig_shape - - # dynamic_mxfp4_quant expects 2D: (B*M, N) - tensor_2d = tensor.reshape(B * M, N) - fp4_data_2d, scale_e8m0 = dynamic_mxfp4_quant(tensor_2d) - - # Reshape fp4_data back to 3D: (B, M, N//2) - fp4_data = fp4_data_2d.view(B, M, N // 2) - - return fp4_data, scale_e8m0 - - -def dequantize_mxfp4( - fp4_data: torch.Tensor, - scale_e8m0: torch.Tensor, - orig_shape: tuple, - dtype: torch.dtype = torch.bfloat16, -) -> torch.Tensor: - """ - Dequantize MXFP4 tensor using aiter utilities. - - Note: dynamic_mxfp4_quant may pad both row and block dimensions in scale_e8m0. - We trim scales to match the actual data dimensions. 
- - Args: - fp4_data: packed FP4 data, shape [B, M, N//2] in fp4x2 or uint8 - scale_e8m0: E8M0 block scale factors (possibly padded) in fp8_e8m0 - orig_shape: original (B, M, N) for reshaping - dtype: output dtype - - Returns: - Dequantized tensor of shape orig_shape. - """ - B, M, N = orig_shape - num_rows = B * M - block_size = 32 - num_blocks = N // block_size # actual blocks needed (e.g. 576/32 = 18) - - # Unpack FP4 to float32: mxfp4_to_f32 expects (..., N//2) -> (..., N) - fp4_data_2d = fp4_data.reshape(num_rows, N // 2) - float_vals = mxfp4_to_f32(fp4_data_2d) # (num_rows, N) - - # Convert E8M0 scales to float32 and trim padded dimensions - scale_f32 = e8m0_to_f32(scale_e8m0) # (padded_rows, padded_blocks) - scale_f32 = scale_f32[:num_rows, :num_blocks] # (num_rows, num_blocks) - - # Apply block scales - float_vals_blocked = float_vals.view(num_rows, num_blocks, block_size) - scaled = float_vals_blocked * scale_f32.unsqueeze(-1) - - return scaled.view(B, M, N).to(dtype) - - -# --------------------------------------------------------------------------- -# Persistent mode metadata helpers -# --------------------------------------------------------------------------- - -def _make_mla_decode_metadata( - batch_size: int, - max_q_len: int, - nhead: int, - nhead_kv: int, - q_dtype: torch.dtype, - kv_dtype: torch.dtype, - qo_indptr: torch.Tensor, - kv_indptr: torch.Tensor, - kv_last_page_len: torch.Tensor, - num_kv_splits: int = NUM_KV_SPLITS, -): - """Allocate and populate work buffers for persistent mla_decode_fwd.""" - info = get_mla_metadata_info_v1( - batch_size, max_q_len, nhead, q_dtype, kv_dtype, - is_sparse=False, fast_mode=False, - num_kv_splits=num_kv_splits, intra_batch_mode=True, - ) - work = [torch.empty(s, dtype=t, device="cuda") for s, t in info] - (work_metadata, work_indptr, work_info_set, - reduce_indptr, reduce_final_map, reduce_partial_map) = work - - # Populate the metadata buffers - get_mla_metadata_v1( - qo_indptr, kv_indptr, 
kv_last_page_len, - nhead // nhead_kv, # num_heads_per_head_k - nhead_kv, # num_heads_k - True, # is_causal - work_metadata, work_info_set, work_indptr, - reduce_indptr, reduce_final_map, reduce_partial_map, - page_size=PAGE_SIZE, - kv_granularity=max(PAGE_SIZE, 16), - max_seqlen_qo=max_q_len, - uni_seqlen_qo=max_q_len, - fast_mode=False, - max_split_per_batch=num_kv_splits, - intra_batch_mode=True, - dtype_q=q_dtype, - dtype_kv=kv_dtype, - ) - - return { - "work_meta_data": work_metadata, - "work_indptr": work_indptr, - "work_info_set": work_info_set, - "reduce_indptr": reduce_indptr, - "reduce_final_map": reduce_final_map, - "reduce_partial_map": reduce_partial_map, - } - - -# --------------------------------------------------------------------------- -# Aiter reference kernel (decode only) -# --------------------------------------------------------------------------- - -def _aiter_mla_decode( - q: torch.Tensor, - kv_buffer: torch.Tensor, - qo_indptr: torch.Tensor, - kv_indptr: torch.Tensor, - config: dict, - q_scale: torch.Tensor | None = None, - kv_scale: torch.Tensor | None = None, -) -> torch.Tensor: - """ - MLA decode attention using aiter persistent-mode kernel. 
- - Supports multiple Q/KV dtype combinations: - - Q_DTYPE="fp8": fp8 Q + fp8 KV (a8w8) — fastest on MI355X - - Q_DTYPE="bf16": bf16 Q + bf16 KV (a16w16) — highest precision - - q: (total_q, num_heads, 576) fp8 or bf16 - kv_buffer: (total_kv, 1, 576) fp8 or bf16 - q_scale: scalar float32 (required for fp8 Q, None for bf16) - kv_scale: scalar float32 (required for fp8 KV, None for bf16) - """ - batch_size = config["batch_size"] - nq = config["num_heads"] - nkv = config["num_kv_heads"] - dq = config["qk_head_dim"] - dv = config["v_head_dim"] - q_seq_len = config["q_seq_len"] - total_kv_len = int(kv_indptr[-1].item()) - - # Reshape kv_buffer to 4D for aiter: (total_kv, page_size, nhead_kv, dim) - kv_buffer_4d = kv_buffer.view(kv_buffer.shape[0], PAGE_SIZE, nkv, kv_buffer.shape[-1]) - - max_q_len = q_seq_len - kv_indices = torch.arange(total_kv_len, dtype=torch.int32, device="cuda") - kv_last_page_len = (kv_indptr[1:] - kv_indptr[:-1]).to(torch.int32) - meta = _make_mla_decode_metadata( - batch_size, max_q_len, nq, nkv, - q.dtype, kv_buffer.dtype, - qo_indptr, kv_indptr, kv_last_page_len, - num_kv_splits=NUM_KV_SPLITS, - ) - - o = torch.empty((q.shape[0], nq, dv), dtype=torch.bfloat16, device="cuda") - mla_decode_fwd( - q.view(-1, nq, dq), - kv_buffer_4d, - o, - qo_indptr, - kv_indptr, - kv_indices, - kv_last_page_len, - max_q_len, - page_size=PAGE_SIZE, - nhead_kv=nkv, - sm_scale=SM_SCALE, - logit_cap=0.0, - num_kv_splits=NUM_KV_SPLITS, - q_scale=q_scale, - kv_scale=kv_scale, - intra_batch_mode=True, - **meta, - ) - return o - -def custom_kernel(data: input_t) -> output_t: - """Reference MLA decode attention. 
Uses Q_DTYPE and KV_DTYPE to select kernel variant.""" - q, kv_data, qo_indptr, kv_indptr, config = data - - # Resolve Q - if Q_DTYPE == "fp8": - q_input, q_scale = quantize_fp8(q) - else: - q_input, q_scale = q, None - - # Resolve KV - if KV_DTYPE == "fp8": - kv_buffer_fp8, kv_scale = kv_data["fp8"] - kv_input = kv_buffer_fp8 - else: - kv_input, kv_scale = kv_data["bf16"], None - return _aiter_mla_decode( - q_input, kv_input, qo_indptr, kv_indptr, config, - q_scale=q_scale, kv_scale=kv_scale, - ) \ No newline at end of file diff --git a/problems/amd_202602/mixed-mla/task.py b/problems/amd_202602/mixed-mla/task.py deleted file mode 100644 index 7aff7b6a..00000000 --- a/problems/amd_202602/mixed-mla/task.py +++ /dev/null @@ -1,36 +0,0 @@ -import torch -from typing import TypeVar, TypedDict, Union - -# DeepSeek R1 MLA forward_absorb format: -# -# Input: (q, kv_data, qo_indptr, kv_indptr, config) -# q: (total_q, num_heads, qk_head_dim) bfloat16 -# kv_data: dict with three KV cache formats: -# "bf16": Tensor (total_kv, 1, 576) bfloat16 -# "fp8": (Tensor, Tensor) kv_buffer fp8 (total_kv, 1, 576) + scalar scale -# "mxfp4": (Tensor, Tensor) kv_buffer fp4x2 (total_kv, 1, 288) + fp8_e8m0 scale -# qo_indptr: (batch_size + 1,) int32 -# kv_indptr: (batch_size + 1,) int32 -# config: dict with MLA parameters -# -# where qk_head_dim = kv_lora_rank + qk_rope_head_dim = 512 + 64 = 576 -# -# Output: attention output tensor (total_q, num_heads, v_head_dim) bfloat16 -# where v_head_dim = kv_lora_rank = 512 -# -# The kv_buffer stores the compressed KV representation: -# - Full 576 dims used as keys (for Q@K^T score computation) -# - First 512 dims (kv_lora_rank) used as values (for output computation) - -input_t = TypeVar( - "input_t", - bound=tuple[torch.Tensor, dict, torch.Tensor, torch.Tensor, dict], -) -output_t = TypeVar("output_t", bound=torch.Tensor) - - -class TestSpec(TypedDict): - batchsize: int - qseqlen: int - kvseqlen: int - seed: int diff --git 
a/problems/amd_202602/mixed-mla/task.yml b/problems/amd_202602/mixed-mla/task.yml deleted file mode 100644 index c0a5d5a6..00000000 --- a/problems/amd_202602/mixed-mla/task.yml +++ /dev/null @@ -1,95 +0,0 @@ -# name: mla-py - -files: - - {"name": "submission.py", "source": "@SUBMISSION@"} - - {"name": "task.py", "source": "task.py"} - - {"name": "utils.py", "source": "../utils.py"} - - {"name": "reference.py", "source": "reference.py"} - - {"name": "eval.py", "source": "../eval.py"} - -lang: "py" - -description: | - Implement a custom MLA (Multi-head Latent Attention) decode kernel optimized for MI355X. - - This is the inner attention kernel from DeepSeek R1's forward_absorb MLA path. - The absorbed query and compressed KV cache are provided directly — you only need to - implement the **attention** computation with variable-length batching (indptr). - - The reference uses aiter a8w8 MLA decode kernel (mla_decode_fwd, fp8 Q + fp8 KV, - persistent mode), which is ~2-3x faster than bf16 on MI355X. 
- - DeepSeek R1 forward_absorb MLA config: - - num_heads = 16 (query heads, after TP split) - - num_kv_heads = 1 (shared latent KV head) - - kv_lora_rank = 512 - - qk_rope_head_dim = 64 - - qk_head_dim = 576 (kv_lora_rank + qk_rope_head_dim, absorbed q/k dim) - - v_head_dim = 512 (= kv_lora_rank, output dim) - - sm_scale = 1/sqrt(576) - - dtype: q=bfloat16 - - decode only (q_seq_len=1, kv_seq_len up to 8k) - - KV buffer format (forward_absorb): - - Full 576 dims are used as keys (for Q@K^T score computation) - - First 512 dims (kv_lora_rank) are used as values (for output computation) - - Input tuple: (q, kv_data, qo_indptr, kv_indptr, config) - - q: (total_q, 16, 576) bfloat16 — absorbed query - - kv_data: dict with three KV cache formats: - kv_data["bf16"] — Tensor (total_kv, 1, 576) bfloat16 - kv_data["fp8"] — (Tensor, Tensor): kv_buffer fp8 + scalar scale - kv_data["mxfp4"] — (Tensor, Tensor): kv_buffer fp4x2 + fp8_e8m0 scale - - qo_indptr: (batch_size+1,) int32 — query segment pointers - - kv_indptr: (batch_size+1,) int32 — KV segment pointers - - config: dict with MLA parameters - - Return: - - attention output: (total_q, 16, 512) bfloat16 - - Key optimization opportunities: - 1. Use mxfp4 KV cache for even lower memory bandwidth (4x savings over bf16) - - Fuse dequantization with attention to skip bf16 materialization - 2. Custom kernel with tighter memory access patterns - 3. MQA: 1 KV head shared across 16 query heads — minimize redundant memory loads - 4. Decode: q_seq_len=1, kv_seq_len up to 8k — memory-bound workload - 5. Variable-length batching: indptr-based segmented attention - 6. Split K/V from buffer: full 576 dims for keys, first 512 dims for values - - The ranking criteria is the geometric mean of the benchmark results. 
-
-config:
-  main: "eval.py"
-
-templates:
-  Python: "submission.py"
-
-test_timeout: 900
-benchmark_timeout: 900
-ranked_timeout: 1200
-
-tests:
-  # bs=4
-  - {"batchsize": 4, "qseqlen": 1, "kvseqlen": 1024, "seed": 4220}
-  # bs=32
-  - {"batchsize": 32, "qseqlen": 1, "kvseqlen": 1024, "seed": 5412}
-  # bs=64
-  - {"batchsize": 64, "qseqlen": 1, "kvseqlen": 8192, "seed": 1360}
-  # bs=256
-  - {"batchsize": 256, "qseqlen": 1, "kvseqlen": 8192, "seed": 9826}
-
-benchmarks:
-  # bs=4
-  - {"batchsize": 4, "qseqlen": 1, "kvseqlen": 1024, "seed": 4217}
-  - {"batchsize": 4, "qseqlen": 1, "kvseqlen": 8192, "seed": 4220}
-  # bs=32
-  - {"batchsize": 32, "qseqlen": 1, "kvseqlen": 1024, "seed": 5412}
-  - {"batchsize": 32, "qseqlen": 1, "kvseqlen": 8192, "seed": 5415}
-  # bs=64
-  - {"batchsize": 64, "qseqlen": 1, "kvseqlen": 1024, "seed": 1357}
-  - {"batchsize": 64, "qseqlen": 1, "kvseqlen": 8192, "seed": 1360}
-  # bs=256
-  - {"batchsize": 256, "qseqlen": 1, "kvseqlen": 1024, "seed": 9823}
-  - {"batchsize": 256, "qseqlen": 1, "kvseqlen": 8192, "seed": 9826}
-
-ranking_by: "geom"
diff --git a/problems/amd_202602/moe-mxfp4/README.md b/problems/amd_202602/moe-mxfp4/README.md
deleted file mode 100644
index ab664f7d..00000000
--- a/problems/amd_202602/moe-mxfp4/README.md
+++ /dev/null
@@ -1,198 +0,0 @@
-# MXFP4 Mixture-of-Experts (MoE) Fused Kernel
-
-## Description
-
-Implement a DeepSeek-R1 style MXFP4 Mixture-of-Experts (MoE) fused kernel optimized for AMD Instinct MI355X GPU.
-
-The kernel fuses the complete MoE forward pass into a 2-stage pipeline:
-1. **Stage 1**: MXFP4 GEMM (gate+up projection) + SwiGLU activation
-2. **Stage 2**: MXFP4 GEMM (down projection) + weighted reduction across top-k experts
-
-The reference uses **AITER `fused_moe`** with `QuantType.per_1x32` (MXFP4 block scaling, block_size=32).
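The two stages described above can be sketched for a single token in plain, unquantized Python. This is a readability aid under stated assumptions, not the AITER implementation: `moe_token_reference`, `matvec`, and `silu` are hypothetical names, the weights are ordinary float lists instead of MXFP4 tensors, and the gate and up projections are kept as separate matrices instead of one fused weight.

```python
import math


def silu(x: float) -> float:
    """SiLU(x) = x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(weight: list[list[float]], x: list[float]) -> list[float]:
    """Compute weight @ x for a row-major [out_dim][in_dim] matrix."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def moe_token_reference(x, experts, topk_ids, topk_weights):
    """Unfused, unquantized MoE forward for one token (illustrative only).

    x:            [d_hidden] activations
    experts:      list of (w_gate, w_up, w_down) with shapes
                  [d_expert][d_hidden], [d_expert][d_hidden], [d_hidden][d_expert]
    topk_ids:     expert indices selected for this token
    topk_weights: routing weight for each selected expert
    """
    out = [0.0] * len(x)
    for expert_id, weight in zip(topk_ids, topk_weights):
        w_gate, w_up, w_down = experts[expert_id]
        # Stage 1: gate and up projections followed by SwiGLU.
        gate = matvec(w_gate, x)
        up = matvec(w_up, x)
        intermediate = [silu(g) * u for g, u in zip(gate, up)]
        # Stage 2: down projection, then weighted accumulation over experts.
        expert_out = matvec(w_down, intermediate)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out
```

A reference of this shape is mainly useful for checking a custom kernel on tiny inputs; it ignores quantization error, padding, and the shuffled weight layouts entirely.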
-
-## DeepSeek-R1 MoE Architecture
-
-| Parameter | Value | Notes |
-|---|---|---|
-| hidden_size | 7168 | Model hidden dimension |
-| moe_intermediate_size | 2048 | Per-expert intermediate dimension |
-| n_routed_experts | 256 | Routed experts (EP-off) or 32 (EP-on, 8-way split) |
-| n_shared_experts | 1 | Always selected with weight=1.0 |
-| top_k (routed) | 8 | Routed experts per token |
-| total_top_k | 9 | 8 routed + 1 shared |
-| MoE layers | 58 | Layers 3–60 |
-
-## Kernel Flow
-
-For each token `i` and each assigned expert `j`:
-
-```
-(1) Quant activations: hidden_states -> MXFP4 (aiter per-1x32 dynamic quantization)
-
-(2) Stage 1 GEMM + SwiGLU activation:
-    gate = x_i @ W_gate_j.T  # [d_hidden] x [d_expert, d_hidden].T -> [d_expert]
-    up = x_i @ W_up_j.T  # [d_hidden] x [d_expert, d_hidden].T -> [d_expert]
-    intermediate = SiLU(gate) * up  # SwiGLU activation -> [d_expert]
-    (W_gate and W_up are fused as gate_up_weight: one a4w4 GEMM + fused activation)
-
-(3) Stage 2 GEMM:
-    expert_out = intermediate @ W_down_j.T  # [d_expert] x [d_hidden, d_expert].T -> [d_hidden]
-
-(4) Weighted reduction:
-    output_i += w_ij * expert_out  # accumulate across top_k experts
-```
-
-All weight GEMMs are **a4w4** (MXFP4 activations x MXFP4 weights, per-1x32 block scaling).
-The AITER CK kernel fuses all of the above into a 2-stage pipeline across all tokens and experts.
-
-## Weight Layout & Pre-shuffling
-
-Weights are provided in two layouts:
-
-| Layout | Description | Use case |
-|---|---|---|
-| **Raw** | Original MXFP4 quantized weights | PyTorch reference / custom kernels |
-| **Pre-shuffled** | `shuffle_weight(w, layout=(16,16))` + `e8m0_shuffle(scale)` | AITER CK kernel (tile-coalesced layout) |
-
-The (16,16) shuffle rearranges weight tiles for coalesced memory access by CK GEMM instructions.
-Scale shuffling (`e8m0_shuffle`) reorders E8M0 block scales to match the shuffled weight layout.
- -You may use either layout — raw weights if you implement your own tiling, or pre-shuffled weights -for direct use with AITER/CK kernels. - -## MXFP4 Quantization Details - -| Property | Value | -|---|---| -| FP4 format | E2M1 — values `[0, 0.5, 1, 1.5, 2, 3, 4, 6]`, max = 6.0 | -| Scale format | E8M0 — exponent-only (power-of-2 scale) | -| Block size | 32 elements per scale | -| Packing | 2 FP4 values per byte (`fp4x2`): low nibble = even index, high nibble = odd index | -| Padding | Dimensions padded to 256-alignment for CK kernel | - -### aiter dtype reference - -| Logical type | aiter dtype | PyTorch native (if available) | Fallback | -|---|---|---|---| -| fp4x2 | `aiter.dtypes.fp4x2` | `torch.float4_e2m1fn_x2` | `torch.uint8` | -| fp8_e8m0 | `aiter.dtypes.fp8_e8m0` | `torch.float8_e8m0fnu` | `torch.uint8` | - -## Input - -A tuple of tensors and a config dict: - -``` -(hidden_states, - gate_up_weight, down_weight, # fp4x2 raw - gate_up_weight_scale, down_weight_scale, # e8m0 raw - gate_up_weight_shuffled, down_weight_shuffled, # fp4x2 pre-shuffled - gate_up_weight_scale_shuffled, down_weight_scale_shuffled, # e8m0 pre-shuffled - topk_weights, topk_ids, - config) -``` - -### Tensor shapes - -| Tensor | Shape | Dtype | Notes | -|---|---|---|---| -| `hidden_states` | `[M, d_hidden]` | bfloat16 | Input activations (M = batch of tokens) | -| `gate_up_weight` | `[E, 2*d_expert_pad, d_hidden_pad//2]` | fp4x2 | Fused gate+up weights, raw | -| `down_weight` | `[E, d_hidden_pad, d_expert_pad//2]` | fp4x2 | Down projection weights, raw | -| `gate_up_weight_scale` | `[E, 2*d_expert_pad, d_hidden_pad//32]` | e8m0 | Block scales for gate_up, raw | -| `down_weight_scale` | `[E, d_hidden_pad, d_expert_pad//32]` | e8m0 | Block scales for down, raw | -| `gate_up_weight_shuffled` | `[E, 2*d_expert_pad, d_hidden_pad//2]` | fp4x2 | Pre-shuffled for CK | -| `down_weight_shuffled` | `[E, d_hidden_pad, d_expert_pad//2]` | fp4x2 | Pre-shuffled for CK | -| 
`gate_up_weight_scale_shuffled` | `[padded, flat]` | e8m0 | Pre-shuffled for CK | -| `down_weight_scale_shuffled` | `[padded, flat]` | e8m0 | Pre-shuffled for CK | -| `topk_weights` | `[M, total_top_k]` | float32 | Routing weights | -| `topk_ids` | `[M, total_top_k]` | int32 | Expert indices (see below) | - -### topk_ids format - -- First `n_experts_per_token` columns: routed expert IDs `[0, n_routed_experts)` -- Last `n_shared_experts` columns: shared expert IDs `[n_routed_experts, n_routed_experts + n_shared_experts)` -- Shared experts are always selected with weight = 1.0 - -### config dict - -```python -config = { - "d_hidden": int, # hidden dimension (e.g. 7168) - "d_expert": int, # expert intermediate dimension (e.g. 2048 or 256) - "d_hidden_pad": int, # d_hidden padded to 256-alignment - "d_expert_pad": int, # d_expert padded to 256-alignment - "n_routed_experts": int, # number of routed experts - "n_shared_experts": int, # number of shared experts (1) - "n_experts_per_token": int, # routed top-k (8) - "total_top_k": int, # routed + shared (9) - "bs": int, # batch size (number of tokens) -} -``` - -## Output - -``` -output: [M, d_hidden] bfloat16 -``` - -## Reference Performance - -AITER `fused_moe` with MXFP4 (E includes shared expert, top_k = routed + shared): - -| bs | E | d_hidden | d_expert | top_k | time (us) | -|---|---|---|---|---|---| -| 4 | 257 | 7168 | 256 | 9 | 46.9 | -| 64 | 257 | 7168 | 256 | 9 | 187.7 | -| 256 | 257 | 7168 | 256 | 9 | 245.7 | -| 64 | 33 | 7168 | 2048 | 9 | 220.6 | -| 256 | 33 | 7168 | 2048 | 9 | 276.4 | -| 1024 | 33 | 7168 | 2048 | 9 | 572.2 | - -## Optimization Opportunities - -The AITER CK `fused_moe` kernel is already well-optimized. To beat it, consider: - -1. **Custom tiling / scheduling**: The CK kernel uses a fixed tile strategy. For small batch sizes - (bs=4) or highly skewed expert distributions, a custom schedule may reduce idle waves. - -2. 
**Activation quantization fusion**: The reference quantizes activations separately before the - GEMM. Fusing dynamic MXFP4 quantization into the Stage 1 GEMM prologue saves one global - memory round-trip. - -3. **Inter-stage fusion**: The reference runs Stage 1 and Stage 2 as separate kernel launches. - Fusing both stages (gate_up GEMM → SwiGLU → down GEMM → accumulate) into a single kernel - eliminates the intermediate buffer write/read between stages. - -4. **Expert-parallel wave scheduling**: With 257 experts but only 9 active per token, most - expert slots are empty. A work-stealing or compact-dispatch strategy can minimize wasted - wavefronts. - -5. **Shared expert fusion**: The shared expert is always selected for all tokens. It could be - computed as a dense GEMM (no routing overhead) and fused with the routed expert reduction. - -6. **Split-K for large M**: For bs=1024 with EP-on (E=33, d_expert=2048), the GEMMs are large - enough to benefit from split-K parallelism within each expert. - -## Accuracy - -Submissions are checked against the AITER reference with `rtol=1e-2, atol=1e-2`. - -## Benchmark Cases - -### EP-off (all 257 experts on 1 GPU, d_expert=256) - -| bs | E | d_hidden | d_expert | top_k | -|---|---|---|---|---| -| 4 | 257 | 7168 | 256 | 9 | -| 64 | 257 | 7168 | 256 | 9 | -| 256 | 257 | 7168 | 256 | 9 | - -### EP-on (EP=8, 33 experts per GPU, d_expert=2048) - -| bs | E | d_hidden | d_expert | top_k | -|---|---|---|---|---| -| 64 | 33 | 7168 | 2048 | 9 | -| 256 | 33 | 7168 | 2048 | 9 | -| 1024 | 33 | 7168 | 2048 | 9 | - -Ranking is by **geometric mean** of benchmark latencies. 
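To make the ranking metric concrete, here is a small sketch of a geometric mean over per-case latencies. It is illustrative only: the leaderboard's actual aggregation code is not shown in this README, and `geometric_mean_us` is a hypothetical helper.

```python
import math


def geometric_mean_us(latencies_us: list[float]) -> float:
    """Geometric mean computed as exp(mean(log(t))) to avoid a huge product."""
    if not latencies_us or any(t <= 0 for t in latencies_us):
        raise ValueError("latencies must be a non-empty list of positive numbers")
    return math.exp(sum(math.log(t) for t in latencies_us) / len(latencies_us))


# Reference times (us) from the Reference Performance table above.
reference_us = [46.9, 187.7, 245.7, 220.6, 276.4, 572.2]
print(f"reference geometric mean: {geometric_mean_us(reference_us):.1f} us")
```

Because the metric is multiplicative, a 10% speedup on the bs=4 case moves the score as much as a 10% speedup on the bs=1024 case, so small-batch cases matter as much as large ones.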
diff --git a/problems/amd_202602/moe-mxfp4/eval.py b/problems/amd_202602/moe-mxfp4/eval.py deleted file mode 100644 index a03a3cc5..00000000 --- a/problems/amd_202602/moe-mxfp4/eval.py +++ /dev/null @@ -1,382 +0,0 @@ -import base64 -import dataclasses -import multiprocessing -import re -import time -import os -import sys -import math -from pathlib import Path -from typing import Any, Optional - -import torch.cuda - -from utils import set_seed, clear_l2_cache_large as clear_l2_cache -try: - from task import TestSpec -except ImportError: - TestSpec = dict - -from reference import check_implementation, generate_input - - -class PopcornOutput: - def __init__(self, fd: int): - self.file = os.fdopen(fd, 'w') - os.set_inheritable(fd, False) - - def __enter__(self): - return self - - def __exit__(self, exc_type, exc_val, exc_tb): - self.file.close() - - def print(self, *args, **kwargs): - print(*args, **kwargs, file=self.file, flush=True) - - def log(self, key, value): - self.print(f"{key}: {value}") - - -@dataclasses.dataclass -class TestCase: - args: dict - spec: str - - -def _combine(a: int, b: int) -> int: - # combine two integers into one: - # we need this to generate a secret seed based on the test-level seed and - # the global secret seed. - # the test-level seeds are public knowledge, and typically relatively small numbers, - # so we need to make sure they don't provide any useful info for the full seed. - # This Cantor construction ensures that if the secret seed is a large number, - # then so is the overall seed. 
- return int(a + (a+b)*(a+b+1)//2) - - -def get_test_cases(file_name: str, seed: Optional[int]) -> list[TestCase]: - try: - content = Path(file_name).read_text() - except Exception as E: - print(f"Could not open test file`{file_name}`: {E}", file=sys.stderr) - exit(113) - - tests = [] - lines = content.splitlines() - match = r"\s*([a-zA-Z_]\w*):\s*([a-zA-Z_]\w*|[+-]?[0-9]+)\s*" - for line in lines: - parts = line.split(";") - case = {} - for part in parts: - matched = re.match(match, part) - if not re.fullmatch(match, part): - print(f"invalid test case: '{line}': '{part}'", file=sys.stderr) - exit(113) - key = matched[1] - val = matched[2] - try: - val = int(val) - except ValueError: - if val == "true": - val = True - elif val == "false": - val = False - - case[key] = val - tests.append(TestCase(spec=line, args=case)) - - if seed is not None: - for test in tests: - if "seed" in test.args: - test.args["seed"] = _combine(test.args["seed"], seed) - - return tests - - -@dataclasses.dataclass -class Stats: - runs: int - mean: float - std: float - err: float - best: float - worst: float - - -def calculate_stats(durations: list[int]): - """ - Calculate statistical data from a list of durations. - - @param durations: A list of durations in nanoseconds. - @return: A Stats object containing the number of runs, mean, standard deviation, error, best, and worst durations. - """ - runs = len(durations) - total = sum(durations) - best = min(durations) - worst = max(durations) - - avg = total / runs - variance = sum(map(lambda x: (x - avg)**2, durations)) - std = math.sqrt(variance / (runs - 1)) - err = std / math.sqrt(runs) - - return Stats(runs=runs, mean=avg, std=std, err=err, best=float(best), - worst=float(worst)) - - -def _clone_data(data): - """ - Return data as-is (no cloning). - - aiter's fused_moe produces incorrect results when weight tensors are - cloned to different memory addresses (same values, different output). 
- Since fused_moe does not mutate its inputs, skipping the clone is safe. - """ - return data - - -def wrap_check_implementation(data, submission_output): - # Old version returned just a single string, new version - # returns (bool, str); this function ensures compatibility with old - # problem definitions. - result = check_implementation(data, submission_output) - if isinstance(result, tuple): - return result - else: - return not bool(result), result - - -def _run_single_test(test: TestCase): - """ - Runs a single test case. Do not call directly - """ - from submission import custom_kernel - data = generate_input(**test.args) - torch.cuda.synchronize() - submission_output = custom_kernel(_clone_data(data)) - torch.cuda.synchronize() - return wrap_check_implementation(data, submission_output) - - -def run_single_test(pool: multiprocessing.Pool, test: TestCase): - """ - Runs a single test in another process. - """ - return pool.apply(_run_single_test, (test,)) - - -def run_testing(logger: PopcornOutput, pool: multiprocessing.Pool, tests: list[TestCase]): - """ - Executes the actual test case code and checks for correctness. - - @param logger: A PopcornOutput object used for logging test results. - @param tests: A list of TestCase objects representing the test cases to be executed. - @return: An integer representing the exit status: 0 if all tests pass, otherwise 112. 
- """ - passed = True - logger.log("test-count", len(tests)) - for idx, test in enumerate(tests): - logger.log(f"test.{idx}.spec", test.spec) - good, message = run_single_test(pool, test) - if not good: - logger.log(f"test.{idx}.status", "fail") - logger.log(f"test.{idx}.error", message) - passed = False - else: - logger.log(f"test.{idx}.status", "pass") - if message: - logger.log(f"test.{idx}.message", message) - - if passed: - logger.log("check", "pass") - return 0 - else: - logger.log("check", "fail") - return 112 - - -def _run_single_benchmark(test: TestCase, recheck: bool, max_repeats: int, max_time_ns: float) -> Stats | Any: - """ - Runs one benchmark. Do not call directly. - """ - from submission import custom_kernel - - durations = [] - # generate input data once - data = generate_input(**test.args) - check_copy = _clone_data(data) - # first, one obligatory correctness check - output = custom_kernel(data) - good, message = wrap_check_implementation(check_copy, output) - if not good: - return message - - # now, do multiple timing runs without further correctness testing - # there is an upper bound of 100 runs, and a lower bound of 3 runs; - # otherwise, we repeat until we either measure at least 10 full seconds, - # or the relative error of the mean is below 1%. 
- - bm_start_time = time.perf_counter_ns() - for i in range(max_repeats): - if recheck: - # ensure we use a different seed for every benchmark - if "seed" in test.args: - test.args["seed"] += 13 - - data = generate_input(**test.args) - check_copy = _clone_data(data) - torch.cuda.synchronize() - clear_l2_cache() - start_event = torch.cuda.Event(enable_timing=True) - end_event = torch.cuda.Event(enable_timing=True) - start_event.record() - output = custom_kernel(data) - end_event.record() - torch.cuda.synchronize() - - if recheck: - good, message = check_implementation(check_copy, output) - if not good: - return message - - del output - durations.append(start_event.elapsed_time(end_event) * 1e6) - - if i > 1: - total_bm_duration = time.perf_counter_ns() - bm_start_time - stats = calculate_stats(durations) - # stop if either - # a) relative error dips below 0.1% - # b) we exceed the total time limit for benchmarking the kernel - # c) we exceed 2 minutes of total wallclock time. - if stats.err / stats.mean < 0.001 or stats.mean * stats.runs > max_time_ns or total_bm_duration > 120e9: - break - - return calculate_stats(durations) - - -def run_single_benchmark(pool: multiprocessing.Pool, test: TestCase, recheck: bool, max_repeats: int, - max_time_ns: float): - """ - For a particular test case, check correctness (if applicable) and grab runtime results. - - @param pool: Process on which the benchmark will be launched. - @param test: TestCase object. - @param recheck: Flag for whether to explicitly check functional correctness. - @param max_repeats: Number of trials to repeat. - @param max_time_ns: Timeout time in nanoseconds. - @return: A Stats object for this particular benchmark case or an error if the test fails. 
- """ - return pool.apply(_run_single_benchmark, (test, recheck, max_repeats, max_time_ns)) - - -def run_benchmarking(logger: PopcornOutput, pool: multiprocessing.Pool, tests: list[TestCase]): - """ - Executes benchmarking code for a CUDA Kernel and logs runtimes. - - @param logger: A PopcornOutput object used for logging benchmark results. - @param pool: Process on which the benchmarks will be launched. - @param tests: A list of TestCase objects representing the test cases to be benchmarked. - @return: An integer representing the exit status: 0 if all benchmarks pass, otherwise 112. - """ - # warm up - run_single_benchmark(pool, tests[0], False, 100, 10e7) - - passed = True - logger.log("benchmark-count", len(tests)) - for idx, test in enumerate(tests): - logger.log(f"benchmark.{idx}.spec", test.spec) - result = run_single_benchmark(pool, test, False, 1000, 50e9) - if isinstance(result, Stats): - for field in dataclasses.fields(Stats): - logger.log(f"benchmark.{idx}.{field.name}", getattr(result, field.name)) - else: - passed = False - logger.log(f"benchmark.{idx}.status", "fail") - logger.log(f"benchmark.{idx}.error", result) - - if passed: - logger.log("check", "pass") - return 0 - else: - logger.log("check", "fail") - return 112 - - -def run_single_profile(test: TestCase) -> str: - """ - Runs a single test case. 
Do not call directly - """ - from submission import custom_kernel - from torch.profiler import profile, record_function, ProfilerActivity - data = generate_input(**test.args) - torch.cuda.synchronize() - - with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof: - submission_output = custom_kernel(_clone_data(data)) - torch.cuda.synchronize() - return prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=20) - - -def run_profiling(logger: PopcornOutput, tests: list[TestCase]): - logger.log("benchmark-count", len(tests)) - for idx, test in enumerate(tests): - logger.log(f"benchmark.{idx}.spec", test.spec) - report = run_single_profile(test) - logger.log(f"benchmark.{idx}.report", base64.b64encode(report.encode("utf-8"), b"+*").decode("utf-8")) - logger.log("check", "pass") - return 0 - - -def main(): - fd = os.getenv("POPCORN_FD") - if not fd: - return 111 - - if len(sys.argv) < 3: - return 2 - - mode = sys.argv[1] - seed = os.getenv("POPCORN_SEED") - os.unsetenv("POPCORN_SEED") - seed = int(seed) if seed else None - set_seed(seed or 42) - tests = get_test_cases(sys.argv[2], seed) - - with PopcornOutput(int(fd)) as logger: - import multiprocessing - mp_context = multiprocessing.get_context('spawn') - with mp_context.Pool(1) as pool: - if mode == "test": - return run_testing(logger, pool, tests) - if mode == "benchmark": - return run_benchmarking(logger, pool, tests) - - if mode == "leaderboard": - # warmup - run_single_benchmark(pool, tests[0], False, 100, 1e7) - logger.log("benchmark-count", len(tests)) - passed = True - for i in range(len(tests)): - result = run_single_benchmark(pool, tests[i], True, 100, 30e9) - logger.log(f"benchmark.{i}.spec", tests[i].spec) - if isinstance(result, Stats): - for field in dataclasses.fields(Stats): - logger.log(f"benchmark.{i}.{field.name}", getattr(result, field.name)) - else: - passed = False - logger.log(f"benchmark.{i}.status", "fail") - logger.log(f"benchmark.{i}.error", str(result)) # 
TODO: Make sure result implements __str__? - break - - logger.log("check", "pass" if passed else "fail") - elif mode == "profile": - run_profiling(logger, tests) - else: - # TODO: Implement script mode - return 2 - - -if __name__ == "__main__": - sys.exit(main()) diff --git a/problems/amd_202602/moe-mxfp4/reference.py b/problems/amd_202602/moe-mxfp4/reference.py deleted file mode 100644 index e22ff34d..00000000 --- a/problems/amd_202602/moe-mxfp4/reference.py +++ /dev/null @@ -1,299 +0,0 @@ -from utils import make_match_reference -from task import input_t, output_t -import torch -import torch.nn.functional as F -from typing import Dict, Tuple, Optional -import math - -import aiter -from aiter import ActivationType, QuantType, dtypes -from aiter.fused_moe import fused_moe -from aiter.utility import fp4_utils -from aiter.ops.shuffle import shuffle_weight - - -# ────────────────────────────────────────────────────────────────────── -# Constants -# ────────────────────────────────────────────────────────────────────── -MXFP4_BLOCK_SIZE = 32 -PAD_ALIGN = 256 - - -def _pad_to(x: int, align: int) -> int: - return (x + align - 1) // align * align - - -# ────────────────────────────────────────────────────────────────────── -# generate_input: produce all tensors needed by ref_kernel -# -# Models DeepSeek-R1 MoE layer shapes: -# - d_hidden = 7168 -# - d_expert = moe_intermediate_size (full=2048, or TP-split) -# - E_total = n_routed_experts + n_shared_experts (257 or 33) -# - top_k_total = nexpertspertoken + nsharedexperts (8+1=9) -# -# ────────────────────────────────────────────────────────────────────── -def generate_input( - dhidden: int, - dexpert: int, - nroutedexperts: int, - nexpertspertoken: int, - nsharedexperts: int, - bs: int, - seed: int, -) -> input_t: - d_hidden = dhidden - d_expert = dexpert - n_routed_experts = nroutedexperts - n_shared_experts = nsharedexperts - routed_top_k = nexpertspertoken - total_top_k = routed_top_k + n_shared_experts # e.g. 
8 + 1 = 9 - E_total = n_routed_experts + n_shared_experts # e.g. 256 + 1 = 257 - M = bs # number of tokens - - # Padded dimensions (AITER MXFP4 requires 256-alignment) - d_hidden_pad = _pad_to(d_hidden, PAD_ALIGN) - d_expert_pad = _pad_to(d_expert, PAD_ALIGN) - - config = { - "d_hidden": d_hidden, - "d_expert": d_expert, - "d_hidden_pad": d_hidden_pad, - "d_expert_pad": d_expert_pad, - "n_routed_experts": n_routed_experts, - "n_shared_experts": n_shared_experts, - "n_experts_per_token": routed_top_k, - "total_top_k": total_top_k, - "bs": M, - } - - gen = torch.Generator(device='cuda') - gen.manual_seed(seed) - - # ── hidden_states [M, d_hidden] ── - hidden_states = torch.randn( - (M, d_hidden), device='cuda', dtype=torch.bfloat16, generator=gen, - ) - - # ── Router: softmax top-k (routed experts only) ── - router_weight = torch.randn( - (n_routed_experts, d_hidden), device='cuda', dtype=torch.bfloat16, generator=gen, - ) / math.sqrt(d_hidden) - router_logits = F.linear(hidden_states, router_weight) # [M, n_routed_experts] - scores = router_logits.softmax(dim=-1) - routed_weights, routed_ids = torch.topk( - scores, k=routed_top_k, dim=-1, sorted=False - ) - routed_weights = routed_weights.to(torch.float32) - routed_ids = routed_ids.to(torch.int32) - - # ── Append shared expert(s): always selected, weight = 1.0 ── - # Shared experts are indexed as n_routed_experts, n_routed_experts+1, ... 
- shared_ids = torch.arange( - n_routed_experts, E_total, device='cuda', dtype=torch.int32 - ).unsqueeze(0).expand(M, -1) # [M, n_shared_experts] - shared_weights = torch.ones( - (M, n_shared_experts), device='cuda', dtype=torch.float32 - ) - - topk_ids = torch.cat([routed_ids, shared_ids], dim=-1) # [M, total_top_k] - topk_weights = torch.cat([routed_weights, shared_weights], dim=-1) # [M, total_top_k] - - gate_up_bf16 = torch.randn( - (E_total, 2 * d_expert_pad, d_hidden_pad), device='cuda', dtype=torch.bfloat16, generator=gen, - ) / math.sqrt(d_hidden) - down_bf16 = torch.randn( - (E_total, d_hidden_pad, d_expert_pad), device='cuda', dtype=torch.bfloat16, generator=gen, - ) / math.sqrt(d_expert) - - torch_quant = aiter.get_torch_quant(QuantType.per_1x32) - gate_up_weight, gate_up_weight_scale = torch_quant(gate_up_bf16, quant_dtype=dtypes.fp4x2) - down_weight, down_weight_scale = torch_quant(down_bf16, quant_dtype=dtypes.fp4x2) - gate_up_weight = gate_up_weight.view(E_total, 2 * d_expert_pad, d_hidden_pad // 2) - down_weight = down_weight.view(E_total, d_hidden_pad, d_expert_pad // 2) - - gate_up_weight_shuffled = shuffle_weight(gate_up_weight, layout=(16, 16)) - down_weight_shuffled = shuffle_weight(down_weight, layout=(16, 16)) - gate_up_weight_scale_shuffled = fp4_utils.e8m0_shuffle(gate_up_weight_scale) - down_weight_scale_shuffled = fp4_utils.e8m0_shuffle(down_weight_scale) - - return ( - hidden_states, # [M, d_hidden] bf16 - gate_up_weight, # [E_total, 2*d_expert_pad, d_hidden_pad//2] fp4x2 (raw) - down_weight, # [E_total, d_hidden_pad, d_expert_pad//2] fp4x2 (raw) - gate_up_weight_scale, # [E_total, 2*d_expert_pad, scale_K] e8m0 (raw) - down_weight_scale, # [E_total, d_hidden_pad, scale_K] e8m0 (raw) - gate_up_weight_shuffled, # [E_total, 2*d_expert_pad, d_hidden_pad//2] fp4x2 (pre-shuffled) - down_weight_shuffled, # [E_total, d_hidden_pad, d_expert_pad//2] fp4x2 (pre-shuffled) - gate_up_weight_scale_shuffled, # [padded, flat] e8m0 (pre-shuffled) - 
down_weight_scale_shuffled, # [padded, flat] e8m0 (pre-shuffled) - topk_weights, # [M, total_top_k] float32 - topk_ids, # [M, total_top_k] int32 - config, - ) - - - - -# ────────────────────────────────────────────────────────────────────── -# ref_kernel_pytorch: pure PyTorch implementation (dequant + matmul) -# ────────────────────────────────────────────────────────────────────── -def _dequant_mxfp4(weight_fp4, scale_e8m0): - """ - Dequantize MXFP4 weight to float32. - - weight_fp4: [N, K//2] fp4x2 (raw, not shuffled) - scale_e8m0: [padded_N, ceil(K/32)] e8m0 (M-dim padded to 256-align by dynamic_mxfp4_quant) - - Returns: [N, K] float32 - """ - # fp4x2 -> float32 lookup: [N, K] - w_f32 = fp4_utils.mxfp4_to_f32(weight_fp4) # [N, K] - # e8m0 -> float32 power-of-2 scale: [padded_N, scale_K] - s_f32 = fp4_utils.e8m0_to_f32(scale_e8m0) # [padded_N, scale_K] - N, K = w_f32.shape - # Trim scale rows to match weight rows (scale M-dim is padded to 256) - s_f32 = s_f32[:N, :] - # Broadcast scale across block_size=32 columns - s_f32 = s_f32.repeat_interleave(MXFP4_BLOCK_SIZE, dim=-1)[:, :K] # [N, K] - return w_f32 * s_f32 - -# ────────────────────────────────────────────────────────────────────── -# ref_kernel_pytorch: pure PyTorch implementation (dequant + matmul) -# will not run. only for reference -# ────────────────────────────────────────────────────────────────────── -def ref_kernel_pytorch(data: input_t) -> output_t: - """ - Pure PyTorch reference: dequantize MXFP4 weights -> bf16 matmul -> SwiGLU -> matmul. - Uses the raw (un-shuffled) weights. 
- """ - ( - hidden_states, # [M, d_hidden] bf16 - gate_up_weight, # [E, 2*d_expert_pad, d_hidden_pad//2] fp4x2 - down_weight, # [E, d_hidden_pad, d_expert_pad//2] fp4x2 - gate_up_weight_scale, # [E, 2*d_expert_pad, scale_K] e8m0 - down_weight_scale, # [E, d_hidden_pad, scale_K] e8m0 - gate_up_weight_shuffled, - down_weight_shuffled, - gate_up_weight_scale_shuffled, - down_weight_scale_shuffled, - topk_weights, # [M, top_k] float32 - topk_ids, # [M, top_k] int32 - config, - ) = data - - d_hidden = config["d_hidden"] - d_expert = config["d_expert"] - d_hidden_pad = config["d_hidden_pad"] - d_expert_pad = config["d_expert_pad"] - M = hidden_states.shape[0] - top_k = topk_ids.shape[1] - E = gate_up_weight.shape[0] - - # Dequantize all expert weights to float32 - # gate_up: [E, 2*d_expert_pad, d_hidden_pad] -> trim to [E, 2*d_expert, d_hidden] - # down: [E, d_hidden_pad, d_expert_pad] -> trim to [E, d_hidden, d_expert] - gate_up_dq = torch.stack([ - _dequant_mxfp4(gate_up_weight[e], gate_up_weight_scale[e]) - for e in range(E) - ]) # [E, 2*d_expert_pad, d_hidden_pad] - gate_up_dq = gate_up_dq[:, :2 * d_expert, :d_hidden].to(torch.bfloat16) - - down_dq = torch.stack([ - _dequant_mxfp4(down_weight[e], down_weight_scale[e]) - for e in range(E) - ]) # [E, d_hidden_pad, d_expert_pad] - down_dq = down_dq[:, :d_hidden, :d_expert].to(torch.bfloat16) - - # Split gate_up -> gate [E, d_expert, d_hidden], up [E, d_expert, d_hidden] - gate_w, up_w = gate_up_dq.chunk(2, dim=1) # each [E, d_expert, d_hidden] - - # Per-token MoE forward - output = torch.zeros((M, d_hidden), dtype=torch.bfloat16, device=hidden_states.device) - - for i in range(M): - x = hidden_states[i] # [d_hidden] - for k in range(top_k): - eid = topk_ids[i, k].item() - w = topk_weights[i, k].item() - - # Stage 1: gate_proj + up_proj + SwiGLU - gate_out = F.silu(x @ gate_w[eid].T) # [d_expert] - up_out = x @ up_w[eid].T # [d_expert] - intermediate = gate_out * up_out # [d_expert] - - # Stage 2: down_proj - # 
down_dq[eid] is [d_hidden, d_expert], .T is [d_expert, d_hidden] - expert_out = intermediate @ down_dq[eid].T # [d_hidden] - - output[i] += w * expert_out - - return output - - - -# ────────────────────────────────────────────────────────────────────── -# ref_kernel: calls AITER fused_moe with MXFP4 quantized weights -# ────────────────────────────────────────────────────────────────────── -def ref_kernel(data: input_t) -> output_t: - """ - Reference implementation using AITER's fused_moe kernel with MXFP4 quantized weights. - - Input data tuple (E = n_routed_experts + n_shared_experts, total_top_k = routed + shared): - hidden_states: [M, d_hidden] bf16 - gate_up_weight: [E, 2*d_expert_pad, d_hidden_pad//2] fp4x2 (raw, before shuffle) - down_weight: [E, d_hidden_pad, d_expert_pad//2] fp4x2 (raw, before shuffle) - gate_up_weight_scale: [E, 2*d_expert_pad, scale_K] e8m0 (raw, before shuffle) - down_weight_scale: [E, d_hidden_pad, scale_K] e8m0 (raw, before shuffle) - gate_up_weight_shuffled: [E, 2*d_expert_pad, d_hidden_pad//2] fp4x2 (pre-shuffled) - down_weight_shuffled: [E, d_hidden_pad, d_expert_pad//2] fp4x2 (pre-shuffled) - gate_up_weight_scale_shuffled:[padded, flat] e8m0 (pre-shuffled) - down_weight_scale_shuffled: [padded, flat] e8m0 (pre-shuffled) - topk_weights: [M, total_top_k] float32 - topk_ids: [M, total_top_k] int32 - config: dict - - Returns: - output: [M, d_hidden] bf16 - """ - ( - hidden_states, - gate_up_weight, - down_weight, - gate_up_weight_scale, - down_weight_scale, - gate_up_weight_shuffled, - down_weight_shuffled, - gate_up_weight_scale_shuffled, - down_weight_scale_shuffled, - topk_weights, - topk_ids, - config, - ) = data - - hidden_pad = config["d_hidden_pad"] - config["d_hidden"] - intermediate_pad = config["d_expert_pad"] - config["d_expert"] - - output = fused_moe( - hidden_states, - gate_up_weight_shuffled, - down_weight_shuffled, - topk_weights, - topk_ids, - expert_mask=None, - activation=ActivationType.Silu, - 
quant_type=QuantType.per_1x32, # MXFP4 uses per_1x32 block scaling - doweight_stage1=False, - w1_scale=gate_up_weight_scale_shuffled, - w2_scale=down_weight_scale_shuffled, - a1_scale=None, - a2_scale=None, - hidden_pad=hidden_pad, - intermediate_pad=intermediate_pad, - ) - - return output - - - -check_implementation = make_match_reference(ref_kernel, rtol=5e-2, atol=5e-2) diff --git a/problems/amd_202602/moe-mxfp4/submission.py b/problems/amd_202602/moe-mxfp4/submission.py deleted file mode 100644 index a771b32c..00000000 --- a/problems/amd_202602/moe-mxfp4/submission.py +++ /dev/null @@ -1,66 +0,0 @@ -import torch -from typing import Dict -from task import input_t, output_t - -from aiter import ActivationType, QuantType -from aiter.fused_moe import fused_moe - - -def custom_kernel(data: input_t) -> output_t: - """ - Submission template for DeepSeek-R1 MXFP4 MoE kernel. - - Input data tuple: - hidden_states: [M, d_hidden] bf16 - gate_up_weight: [E, 2*d_expert_pad, d_hidden_pad//2] fp4x2 (raw) - down_weight: [E, d_hidden_pad, d_expert_pad//2] fp4x2 (raw) - gate_up_weight_scale: [E, 2*d_expert_pad, scale_K] e8m0 (raw) - down_weight_scale: [E, d_hidden_pad, scale_K] e8m0 (raw) - gate_up_weight_shuffled: [E, 2*d_expert_pad, d_hidden_pad//2] fp4x2 (shuffled) - down_weight_shuffled: [E, d_hidden_pad, d_expert_pad//2] fp4x2 (shuffled) - gate_up_weight_scale_shuffled:[padded, flat] e8m0 (shuffled) - down_weight_scale_shuffled: [padded, flat] e8m0 (shuffled) - topk_weights: [M, total_top_k] float32 - topk_ids: [M, total_top_k] int32 - config: dict - - Returns: - output: [M, d_hidden] bf16 - """ - ( - hidden_states, - gate_up_weight, - down_weight, - gate_up_weight_scale, - down_weight_scale, - gate_up_weight_shuffled, - down_weight_shuffled, - gate_up_weight_scale_shuffled, - down_weight_scale_shuffled, - topk_weights, - topk_ids, - config, - ) = data - - hidden_pad = config["d_hidden_pad"] - config["d_hidden"] - intermediate_pad = config["d_expert_pad"] - 
config["d_expert"] - - output = fused_moe( - hidden_states, - gate_up_weight_shuffled, - down_weight_shuffled, - topk_weights, - topk_ids, - expert_mask=None, - activation=ActivationType.Silu, - quant_type=QuantType.per_1x32, - doweight_stage1=False, - w1_scale=gate_up_weight_scale_shuffled, - w2_scale=down_weight_scale_shuffled, - a1_scale=None, - a2_scale=None, - hidden_pad=hidden_pad, - intermediate_pad=intermediate_pad, - ) - - return output diff --git a/problems/amd_202602/moe-mxfp4/task.py b/problems/amd_202602/moe-mxfp4/task.py deleted file mode 100644 index a19edc83..00000000 --- a/problems/amd_202602/moe-mxfp4/task.py +++ /dev/null @@ -1,28 +0,0 @@ -from typing import TypeVar, Tuple, Dict -import torch - -input_t = TypeVar("input_t", bound=Tuple[ - torch.Tensor, # hidden_states [M, d_hidden] - torch.Tensor, # gate_up_weight [E, 2*d_expert_pad, d_hidden_pad//2] fp4x2 (raw) - torch.Tensor, # down_weight [E, d_hidden_pad, d_expert_pad//2] fp4x2 (raw) - torch.Tensor, # gate_up_weight_scale [E, 2*d_expert_pad, scale_K] e8m0 (raw) - torch.Tensor, # down_weight_scale [E, d_hidden_pad, scale_K] e8m0 (raw) - torch.Tensor, # gate_up_weight_shuffled [E, 2*d_expert_pad, d_hidden_pad//2] fp4x2 (shuffled) - torch.Tensor, # down_weight_shuffled [E, d_hidden_pad, d_expert_pad//2] fp4x2 (shuffled) - torch.Tensor, # gate_up_weight_scale_shuffled [padded, flat] e8m0 (shuffled) - torch.Tensor, # down_weight_scale_shuffled [padded, flat] e8m0 (shuffled) - torch.Tensor, # topk_weights [M, total_top_k] - torch.Tensor, # topk_ids [M, total_top_k] - Dict, # config -]) -output_t = TypeVar("output_t", bound=torch.Tensor) - - -class TestSpec: - dhidden: int # hidden dimension (7168 for DeepSeek-R1) - dexpert: int # intermediate dimension per expert (per partition) - nroutedexperts: int # number of local routed experts on this GPU - nexpertspertoken: int # top-k routed experts per token (8 for DeepSeek-R1) - nsharedexperts: int # number of shared experts (1 for DeepSeek-R1), always 
selected - bs: int # batch size = number of tokens in this batch - seed: int diff --git a/problems/amd_202602/moe-mxfp4/task.yml b/problems/amd_202602/moe-mxfp4/task.yml deleted file mode 100644 index c3ac0e21..00000000 --- a/problems/amd_202602/moe-mxfp4/task.yml +++ /dev/null @@ -1,125 +0,0 @@ -# name: 3_moe_mxfp4 - -files: - - {"name": "submission.py", "source": "@SUBMISSION@"} - - {"name": "task.py", "source": "task.py"} - - {"name": "utils.py", "source": "../utils.py"} - - {"name": "reference.py", "source": "reference.py"} - - {"name": "eval.py", "source": "eval.py"} - -lang: "py" - -description: | - You will implement a DeepSeek-R1 style MXFP4 Mixture-of-Experts (MoE) fused kernel optimized for AMD Instinct MI355X GPU. - - To be explicit, you will be given a tuple of tensors: - ``` - (hidden_states, - gate_up_weight, down_weight, # fp4x2 raw - gate_up_weight_scale, down_weight_scale, # e8m0 raw - gate_up_weight_shuffled, down_weight_shuffled, # fp4x2 pre-shuffled - gate_up_weight_scale_shuffled, down_weight_scale_shuffled, # e8m0 pre-shuffled - topk_weights, topk_ids, - config) - ``` - where: - * `hidden_states` is M x d_hidden in bfloat16 (the input activations, M = batch of tokens) - * `gate_up_weight` is [E, 2*d_expert_pad, d_hidden_pad//2] in MXFP4 (fp4x2), raw layout. - Fused gate + up projection weights for each expert. E = number of local experts. - * `down_weight` is [E, d_hidden_pad, d_expert_pad//2] in MXFP4 (fp4x2), raw layout. - Down projection weights for each expert. - * `gate_up_weight_scale` is [E, 2*d_expert_pad, d_hidden_pad//32] in E8M0, raw layout. - Block scales (block_size=32) for gate_up_weight. - * `down_weight_scale` is [E, d_hidden_pad, d_expert_pad//32] in E8M0, raw layout. - Block scales for down_weight. - * `gate_up_weight_shuffled` / `down_weight_shuffled` are the same weights shuffled to - (16,16) tile-coalesced layout for the CK kernel. 
- * `gate_up_weight_scale_shuffled` / `down_weight_scale_shuffled` are the scales after - e8m0_shuffle, flattened to [padded, flat]. - * `topk_weights` is [M, total_top_k] float32: routing weights (routed experts + shared experts). - * `topk_ids` is [M, total_top_k] int32: expert indices. First nexpertspertoken columns are - routed expert ids (0..n_routed-1), last nsharedexperts columns are shared expert ids - (n_routed..n_routed+n_shared-1). Shared experts are always selected with weight=1.0. - * `config` is a dict with: d_hidden, d_expert, d_hidden_pad, d_expert_pad, - n_routed_experts, n_shared_experts, n_experts_per_token, total_top_k, bs. - - Then, the fused_moe kernel flow is: - (1) Quant activations to MXFP4: aiter per-1x32 dynamic quantization of hidden_states. - (2) Stage 1 GEMM + activation (per token i, per assigned expert j): - - gate = x_i @ W_gate_j.T # [d_hidden] x [d_expert, d_hidden].T -> [d_expert] - - up = x_i @ W_up_j.T # [d_hidden] x [d_expert, d_hidden].T -> [d_expert] - - intermediate = SiLU(gate) * up # SwiGLU activation, -> [d_expert] - (W_gate and W_up are fused as gate_up_weight, so this is one a4w4 GEMM + fused activation) - (3) Stage 2 GEMM: - - expert_out = intermediate @ W_down_j.T # [d_expert] x [d_hidden, d_expert].T -> [d_hidden] - (4) Weighted reduction: - - output_i += w_ij * expert_out # accumulate across top_k experts - All weight GEMMs are a4w4 (MXFP4 activations x MXFP4 weights, per-1x32 block scaling). - The AITER CK kernel fuses all of the above into a 2-stage pipeline across all tokens and experts. - - DeepSeek-R1 MoE specs: - - hidden_size = 7168, moe_intermediate_size = 2048 - - 256 routed experts + 1 shared expert (total 257), top-8 routed + 1 shared = 9 per token - - 58 MoE layers (layer 3-60) - - The shared expert processes ALL tokens unconditionally (weight=1.0) - - d_hidden_pad and d_expert_pad are the dimensions padded to 256-alignment for the CK kernel. 
- - The ranking criteria is the geometric mean of the benchmark results. - - ``` - The AITER reference performance is (E includes shared expert, top_k = routed + shared): - bs E d_hidden d_expert top_k time[us] - 16 257 7168 256 9 152.7 - 128 257 7168 256 9 239.0 - 512 257 7168 256 9 336.5 - 16 33 7168 512 9 106.2 - 128 33 7168 512 9 141.1 - 512 33 7168 512 9 225.0 - 512 33 7168 2048 9 380.4 - ``` - - Input: - - hidden_states: [M, d_hidden] bf16 - - gate_up_weight: [E, 2*d_expert_pad, d_hidden_pad//2] fp4x2 (raw, before shuffle) - - down_weight: [E, d_hidden_pad, d_expert_pad//2] fp4x2 (raw, before shuffle) - - gate_up_weight_scale: [E, 2*d_expert_pad, d_hidden_pad//32] e8m0 (raw, before shuffle) - - down_weight_scale: [E, d_hidden_pad, d_expert_pad//32] e8m0 (raw, before shuffle) - - gate_up_weight_shuffled: [E, 2*d_expert_pad, d_hidden_pad//2] fp4x2 (pre-shuffled for CK) - - down_weight_shuffled: [E, d_hidden_pad, d_expert_pad//2] fp4x2 (pre-shuffled for CK) - - gate_up_weight_scale_shuffled: [padded, flat] e8m0 (pre-shuffled for CK) - - down_weight_scale_shuffled: [padded, flat] e8m0 (pre-shuffled for CK) - - topk_weights: [M, total_top_k] float32 - - topk_ids: [M, total_top_k] int32 - - config: dict with dimensions - - Output: - - output: [M, d_hidden] bf16 - -config: - main: "eval.py" - -templates: - Python: "submission.py" - -test_timeout: 540 -benchmark_timeout: 540 -ranked_timeout: 840 -ranking_by: "geom" - -tests: - - {"dhidden": 4096, "dexpert": 1024, "nroutedexperts": 256, "nexpertspertoken": 8, "nsharedexperts": 1, "bs": 8, "seed": 9371} - - {"dhidden": 7168, "dexpert": 2048, "nroutedexperts": 32, "nexpertspertoken": 8, "nsharedexperts": 1, "bs": 32, "seed": 2291} - - {"dhidden": 4096, "dexpert": 1536, "nroutedexperts": 64, "nexpertspertoken": 6, "nsharedexperts": 1, "bs": 128, "seed": 81934} - -benchmarks: - # TP=8 - - {"dhidden": 7168, "dexpert": 256, "nroutedexperts": 256, "nexpertspertoken": 8, "nsharedexperts": 1, "bs": 16, "seed": 9371} - - 
{"dhidden": 7168, "dexpert": 256, "nroutedexperts": 256, "nexpertspertoken": 8, "nsharedexperts": 1, "bs": 128, "seed": 2291} - - {"dhidden": 7168, "dexpert": 256, "nroutedexperts": 256, "nexpertspertoken": 8, "nsharedexperts": 1, "bs": 512, "seed": 81934} - # TP=4 - - {"dhidden": 7168, "dexpert": 512, "nroutedexperts": 32, "nexpertspertoken": 8, "nsharedexperts": 1, "bs": 16, "seed": 2291} - - {"dhidden": 7168, "dexpert": 512, "nroutedexperts": 32, "nexpertspertoken": 8, "nsharedexperts": 1, "bs": 128, "seed": 81934} - - {"dhidden": 7168, "dexpert": 512, "nroutedexperts": 32, "nexpertspertoken": 8, "nsharedexperts": 1, "bs": 512, "seed": 81934} - # EP on - - {"dhidden": 7168, "dexpert": 2048, "nroutedexperts": 32, "nexpertspertoken": 8, "nsharedexperts": 1, "bs": 512, "seed": 81934} diff --git a/problems/amd_202602/mxfp4-mm/reference.py b/problems/amd_202602/mxfp4-mm/reference.py deleted file mode 100644 index e4439c84..00000000 --- a/problems/amd_202602/mxfp4-mm/reference.py +++ /dev/null @@ -1,112 +0,0 @@ -""" -FP4 quant + FP4 GEMM reference: bf16 A, MXFP4 B -> MXFP4 per-1x32 quant A -> gemm_a4w4 -> bf16 C. -Quant logic follows aiter op_tests/test_gemm_a4w4.py (get_triton_quant(QuantType.per_1x32)). - -NOTE: Explicitly uses dynamic_mxfp4_quant from aiter.ops.triton.quant (patched in #975) - rather than going through aiter.get_triton_quant, which may dispatch to the - unpatched fp4_utils.py kernel. See ROCm/aiter#974, ROCm/aiter#975. 
-""" -import torch -from task import input_t, output_t -from utils import make_match_reference -from aiter import QuantType,dtypes -import aiter -from aiter.ops.shuffle import shuffle_weight -from aiter.ops.triton.quant import dynamic_mxfp4_quant # #975-patched kernel -from aiter.utility.fp4_utils import e8m0_shuffle -# K must be divisible by 64 (scale group 32 and fp4 pack 2) -SCALE_GROUP_SIZE = 32 - -def _quant_mxfp4(x, shuffle=True): - x_fp4, bs_e8m0 = dynamic_mxfp4_quant(x) - if shuffle: - bs_e8m0 = e8m0_shuffle(bs_e8m0) - return x_fp4.view(dtypes.fp4x2), bs_e8m0.view(dtypes.fp8_e8m0) - -def generate_input(m: int, n: int, k: int, seed: int):# -> input_t: - """ - Generate random bf16 inputs A [m, k], B [n, k] and quantized MXFP4 B, shuffled B and B_scale. - - Returns: - Tuple of (A, B), both bf16 on cuda. - """ - assert k % 64 == 0, "k must be divisible by 64 (scale group 32 and fp4 pack 2)" - gen = torch.Generator(device="cuda") - gen.manual_seed(seed) - A = torch.randn((m, k), dtype=torch.bfloat16, device="cuda", generator=gen) - B = torch.randn((n, k), dtype=torch.bfloat16, device="cuda", generator=gen) - B_q, B_scale_sh = _quant_mxfp4(B, shuffle=True) - # shuffle B(weight) to (16,16) tile coalesced - B_shuffle = shuffle_weight(B_q, layout=(16, 16)) - return (A, B, B_q, B_shuffle, B_scale_sh) - -def run_torch_fp4_mm( - x: torch.Tensor, - w: torch.Tensor, - x_scales: torch.Tensor, - w_scales: torch.Tensor, - dtype: torch.dtype = torch.bfloat16, -) -> torch.Tensor: - """ - PyTorch reference: dequant MXFP4 + E8M0 scale -> f32 -> mm -> dtype. - Same logic as aiter op_tests/test_gemm_a4w4.run_torch. 
- x: [m, k//2] fp4 packed, w: [n, k//2] fp4 packed - x_scales: [m, k//32] E8M0, w_scales: [n, k//32] E8M0 - Returns: [m, n] in dtype - """ - from aiter.utility import fp4_utils - - m, _ = x.shape - n, _ = w.shape - # fp4 packed -> f32 - x_f32 = fp4_utils.mxfp4_to_f32(x) - w_f32 = fp4_utils.mxfp4_to_f32(w) - # E8M0 scale: [*, k//32] -> repeat 32 along k -> f32 - x_scales = x_scales[:m].repeat_interleave(SCALE_GROUP_SIZE, dim=1) - x_scales_f32 = fp4_utils.e8m0_to_f32(x_scales) - x_f32 = x_f32 * x_scales_f32 - w_scales = w_scales[:n].repeat_interleave(SCALE_GROUP_SIZE, dim=1) - w_scales_f32 = fp4_utils.e8m0_to_f32(w_scales) - w_f32 = w_f32 * w_scales_f32 - return torch.mm(x_f32, w_f32.T).to(dtype)[:m, :n] - - -def ref_kernel(data: input_t) -> output_t: - """ - Reference: MXFP4 per-1x32 quant on A and B; both PyTorch ref and gemm_a4w4 are given. - Returns gemm_a4w4 for check_implementation. - """ - A, B, B_q, B_shuffle, B_scale_sh = data - A = A.contiguous() - B = B.contiguous() - m, k = A.shape - n, _ = B.shape - - # 1) PyTorch impl just for your reference: dequant fp4 + e8m0 -> f32 -> mm -> bf16 - # Per-1x32 MXFP4 quant - # A_q, A_scale = _quant_mxfp4(A, shuffle=False) - # B_q, B_scale = _quant_mxfp4(B, shuffle=False) - - # gemm_a4w4 expects A [M,K/2], B [N,K/2] as dtypes.fp4x2; A_scale/B_scale [*,K/32] E8M0 - # quant_func returns scale as dtypes.fp8_e8m0; gemm_a4w4 accepts E8M0, no view to uint8 needed - # slice to exact shapes [m,k_scale] / [n,k_scale] (quant may return padded scale) - - # k_scale = k // SCALE_GROUP_SIZE - # A_scale = A_scale[:m, :k_scale].contiguous() - # B_scale = B_scale[:n, :k_scale].contiguous() - # out_torch = run_torch_fp4_mm(A_q, B_q, A_scale, B_scale, torch.bfloat16) - - # 2) aiter.gemm_a4w4 path: needs shuffled B_q and shuffled scales (see test_gemm_a4w4.py:102-105) - A_q, A_scale_sh = _quant_mxfp4(A, shuffle=True) - # to be noted, aiter also has other a4w4 implements using triton, 
https://github.com/ROCm/aiter/blob/main/aiter/ops/triton/gemm/basic/gemm_afp4wfp4.py - out_gemm = aiter.gemm_a4w4( - A_q, - B_shuffle, - A_scale_sh, - B_scale_sh, - dtype=dtypes.bf16, - bpreshuffle=True, - ) - return out_gemm - -check_implementation = make_match_reference(ref_kernel, rtol=1e-02, atol=1e-02) \ No newline at end of file diff --git a/problems/amd_202602/mxfp4-mm/submission.py b/problems/amd_202602/mxfp4-mm/submission.py deleted file mode 100644 index eaf4050f..00000000 --- a/problems/amd_202602/mxfp4-mm/submission.py +++ /dev/null @@ -1,39 +0,0 @@ -""" -FP4 quant + FP4 GEMM reference: bf16 A, MXFP4 B -> MXFP4 per-1x32 quant A -> gemm_a4w4 -> bf16 C. -Quant logic follows aiter op_tests/test_gemm_a4w4.py (get_triton_quant(QuantType.per_1x32)). -""" -from task import input_t, output_t - - -def custom_kernel(data: input_t) -> output_t: - """ - Reference: MXFP4 per-1x32 quant on A; B_shuffle, B_scale_sh from generate_input. - gemm_a4w4 with bpreshuffle=True. - """ - import aiter - from aiter import QuantType, dtypes - from aiter.ops.triton.quant import dynamic_mxfp4_quant - from aiter.utility.fp4_utils import e8m0_shuffle - - def _quant_mxfp4(x, shuffle=True): - x_fp4, bs_e8m0 = dynamic_mxfp4_quant(x) - if shuffle: - bs_e8m0 = e8m0_shuffle(bs_e8m0) - return x_fp4.view(dtypes.fp4x2), bs_e8m0.view(dtypes.fp8_e8m0) - - A, B, B_q, B_shuffle, B_scale_sh = data - A = A.contiguous() - B = B.contiguous() - m, k = A.shape - n, _ = B.shape - - A_q, A_scale_sh = _quant_mxfp4(A, shuffle=True) - out_gemm = aiter.gemm_a4w4( - A_q, - B_shuffle, - A_scale_sh, - B_scale_sh, - dtype=dtypes.bf16, - bpreshuffle=True, - ) - return out_gemm diff --git a/problems/amd_202602/mxfp4-mm/task.py b/problems/amd_202602/mxfp4-mm/task.py deleted file mode 100644 index a258eac0..00000000 --- a/problems/amd_202602/mxfp4-mm/task.py +++ /dev/null @@ -1,21 +0,0 @@ -""" -quant + FP4 GEMM: bf16 A, B -> MXFP4 1x32 per-block quant -> gemm_a4w4 -> bf16 C. 
-""" -import torch -from typing import TypeVar, TypedDict - -# Input: (A, B, B_q, B_shuffle, B_scale_sh) from generate_input. -# A [m,k], B [n,k] bf16; B_q quantized MXFP4; B_shuffle = shuffle_weight(B_q,(16,16)); B_scale_sh from quant(B, shuffle=True). -# Output: bf16 C [m, n]. -input_t = TypeVar( - "input_t", - bound=tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor], -) -output_t = TypeVar("output_t", bound=torch.Tensor) - - -class TestSpec(TypedDict): - m: int - n: int - k: int - seed: int diff --git a/problems/amd_202602/mxfp4-mm/task.yml b/problems/amd_202602/mxfp4-mm/task.yml deleted file mode 100644 index 457a7b47..00000000 --- a/problems/amd_202602/mxfp4-mm/task.yml +++ /dev/null @@ -1,66 +0,0 @@ -# name: mxfp4-mm - -files: - - {"name": "submission.py", "source": "@SUBMISSION@"} - - {"name": "task.py", "source": "task.py"} - - {"name": "utils.py", "source": "../utils.py"} - - {"name": "reference.py", "source": "reference.py"} - - {"name": "eval.py", "source": "../eval.py"} - -lang: "py" - -description: | - You will implement a quantize func and block scaled MXFP4 matrix-matrix multiplication kernel optimized for AMD Instinct MI355X GPU. - To be explicit, you will be given a tuple of tensors: - ``` - (A, B, B_q, B_shuffle, B_scale_sh) - ``` - where: - * `A` is M x K in K-major order in bfloat16 - * `B` is N x K in K-major order in bfloat16 - * `B_q` is N x K/2 in K-major order in MXFP4 - * `B_shuffle` is N x K/2 in shuffled order in MXFP4, shuffled to (16,16) tile coalesced - * `B_scale_sh` is * x K/32 in E8M0, * means it will be padded. - - Then, the kernel flow is bf16 A, MXFP4 B -> MXFP4 per-1x32 quant A -> gemm_a4w4 -> BF16 C [m,n]. - To be specific, the invocation flow is: - (1) Quant A to MXFP4: aiter.get_triton_quant(QuantType.per_1x32). - (2) GEMM: aiter.gemm_a4w4. - m, n divisible by 64; k divisible by 64. - - The ranking criteria is the geometric mean of the benchmark results. 
- Pls note that this is the elimination round, whoever rank top5 are selected into the next round, e2e optimization for deepseek-R1-MXFP4 and GPTOSS-MXFP4 mdoel - ``` - The aiter performance is: - M N K time[us] - 4 2880 512 8.198 - 16 2112 7168 20.873 - 32 4096 512 9.462 - 32 2880 512 9.173 - 64 7168 2048 12.738 - 256 3072 1536 12.219 - ``` -config: - main: "eval.py" - -templates: - Python: "submission.py" - -test_timeout: 420 -benchmark_timeout: 420 -ranked_timeout: 600 -ranking_by: "geom" - -tests: - - {"m": 8, "n": 2112, "k": 7168, "seed": 124} - - {"m": 16, "n": 3072, "k": 1536, "seed": 6635} - - {"m": 64, "n": 3072, "k": 1536, "seed": 45} - - {"m": 256, "n": 2880, "k": 512, "seed": 78} - -benchmarks: - - {"m": 4, "n": 2880, "k": 512, "seed": 4565} - - {"m": 16, "n": 2112, "k": 7168, "seed": 15} - - {"m": 32, "n": 4096, "k": 512, "seed": 457} - - {"m": 32, "n": 2880, "k": 512, "seed": 54} - - {"m": 64, "n": 7168, "k": 2048, "seed": 687} - - {"m": 256, "n": 3072, "k": 1536, "seed": 7856} diff --git a/problems/amd_202602/utils.py b/problems/amd_202602/utils.py deleted file mode 100644 index 42f36d30..00000000 --- a/problems/amd_202602/utils.py +++ /dev/null @@ -1,175 +0,0 @@ -import random -from typing import Tuple - -import numpy as np -import torch - - -def set_seed(seed=42): - random.seed(seed) - np.random.seed(seed) - torch.manual_seed(seed) - if torch.cuda.is_available(): - torch.cuda.manual_seed(seed) - torch.cuda.manual_seed_all(seed) - - -def get_device(use_cuda: bool = True) -> torch.device: - """Get the appropriate device (GPU or CPU).""" - if use_cuda: - if torch.cuda.is_available(): - return torch.device("cuda") - elif torch.backends.mps.is_available(): - return torch.device("mps") - else: - print("No compatible GPU found. 
Falling back to CPU.") - return torch.device("cpu") - - -# Adapted from https://github.com/linkedin/Liger-Kernel/blob/main/test/utils.py -@torch.no_grad() -def verbose_allclose( - received: torch.Tensor, - expected: torch.Tensor, - rtol=1e-05, - atol=1e-08, - max_print=5 -) -> Tuple[bool, list[str]]: - """ - Assert that two tensors are element-wise equal within a tolerance, providing detailed information about mismatches. - - Parameters: - received (torch.Tensor): Tensor we actually got. - expected (torch.Tensor): Tensor we expected to receive. - rtol (float): Relative tolerance; relative to expected - atol (float): Absolute tolerance. - max_print (int): Maximum number of mismatched elements to print. - """ - # Check if the shapes of the tensors match - if received.shape != expected.shape: - return False, ["SIZE MISMATCH"] - - # Calculate the difference between the tensors - diff = torch.abs(received.to(torch.float32) - expected.to(torch.float32)) - - # Determine the tolerance - tolerance = atol + rtol * torch.abs(expected) - - # Find tolerance mismatched elements - tol_mismatched = diff > tolerance - - # Find nan mismatched elements - nan_mismatched = torch.logical_xor(torch.isnan(received), torch.isnan(expected)) - - # Find +inf mismatched elements - posinf_mismatched = torch.logical_xor(torch.isposinf(received), torch.isposinf(expected)) - # Find -inf mismatched elements - neginf_mismatched = torch.logical_xor(torch.isneginf(received), torch.isneginf(expected)) - - # Find all mismatched elements - mismatched = torch.logical_or( - torch.logical_or(tol_mismatched, nan_mismatched), - torch.logical_or(posinf_mismatched, neginf_mismatched), - ) - - mismatched_indices = torch.nonzero(mismatched) - - # Count the number of mismatched elements - num_mismatched = mismatched.count_nonzero().item() - - # Generate detailed information if there are mismatches - if num_mismatched >= 1: - mismatch_details = [f"Number of mismatched elements: {num_mismatched}"] - - for index in 
mismatched_indices[:max_print]: - i = tuple(index.tolist()) - mismatch_details.append(f"ERROR at {i}: {received[i]} {expected[i]}") - if num_mismatched > max_print: - mismatch_details.append(f"... and {num_mismatched - max_print} more mismatched elements.") - return False, mismatch_details - - return True, [f"Maximum error: {torch.max(diff)}"] - - -@torch.no_grad() -def verbose_allequal(received: torch.Tensor, expected: torch.Tensor, max_print: int = 5) -> Tuple[bool, list[str]]: - """ - Assert that two tensors are element-wise perfectly equal, providing detailed information about mismatches. - - Parameters: - received (torch.Tensor): Tensor we actually got. - expected (torch.Tensor): Tensor we expected to receive. - max_print (int): Maximum number of mismatched elements to print. - - Returns: - Empty string if tensors are equal, otherwise detailed error information - """ - mismatched = torch.not_equal(received, expected) - mismatched_indices = torch.nonzero(mismatched) - - # Count the number of mismatched elements - num_mismatched = mismatched.count_nonzero().item() - - # Generate detailed information if there are mismatches - if num_mismatched >= 1: - mismatch_details = [f"Number of mismatched elements: {num_mismatched}"] - - for index in mismatched_indices[:max_print]: - i = tuple(index.tolist()) - mismatch_details.append(f"ERROR at {i}: {received[i]} {expected[i]}") - if num_mismatched > max_print: - mismatch_details.append(f"... 
and {num_mismatched - max_print} more mismatched elements.") - return False, mismatch_details - - return True, [] - -def _is_mla_case(data) -> bool: - """ - Detect mixed-MLA style input tuple: - (q, kv_data, qo_indptr, kv_indptr, config) - """ - if not isinstance(data, tuple) or len(data) < 5: - return False - config = data[4] - if not isinstance(config, dict): - return False - mla_keys = { - "num_heads", - "num_kv_heads", - "qk_head_dim", - "kv_lora_rank", - "qk_rope_head_dim", - "v_head_dim", - } - return mla_keys.issubset(config.keys()) - -def match_reference(data, output, reference: callable, rtol=1e-05, atol=1e-08, tol_err_ratio=0.05): - """ - Convenient "default" implementation for tasks' `check_implementation` function. - """ - expected = reference(data) - good, reasons = verbose_allclose(output, expected, rtol=rtol, atol=atol) - # Only for MLA: aligned with aiter - if (not good) and _is_mla_case(data) and output.shape == expected.shape: - mismatch_mask = ~torch.isclose(output, expected, rtol=rtol, atol=atol) - mismatch_ratio = (mismatch_mask.sum() / output.numel()).item() - if mismatch_ratio <= tol_err_ratio: - return True, ( - f"warning: mismatch_ratio={mismatch_ratio:.6f} " - f"(<= tol_err_ratio={tol_err_ratio}) with rtol={rtol}, atol={atol}" - ) - - if len(reasons) > 0: - return good, "\\n".join(reasons) - - return good, '' - - -def make_match_reference(reference: callable, **kwargs): - def wrapped(data, output): - return match_reference(data, output, reference=reference, **kwargs) - return wrapped - -def clear_l2_cache_large(): - dummy = torch.randn((16000, 1024, 1024), device="cuda") - del dummy diff --git a/problems/bioml.yaml b/problems/bioml.yaml index 03dab68f..09af151b 100644 --- a/problems/bioml.yaml +++ b/problems/bioml.yaml @@ -8,7 +8,7 @@ description: "Popular and important kernels for BioML models like AlphaFold3" problems: - directory: bioml/trimul name: trimul - deadline: "2026-05-09" + deadline: "2026-02-06" gpus: - B200 - H100 diff --git 
a/problems/helion.yaml b/problems/helion.yaml deleted file mode 100644 index e978396c..00000000 --- a/problems/helion.yaml +++ /dev/null @@ -1,29 +0,0 @@ -name: Helion Kernel Challenge -deadline: "2026-03-14" -description: "GPU kernel challenges inspired by Helion kernel ideas — convolution, quantization, and gated deltanet operators from production LLM architectures." -problems: - - directory: helion/causal_conv1d_py - name: causal_conv1d - deadline: "2026-03-15 01:00" - gpus: - - B200_Nebius - - directory: helion/fp8_quant_py - name: fp8_quant - deadline: "2026-03-15 01:00" - gpus: - - B200_Nebius - - directory: helion/gated_deltanet_chunk_fwd_h_py - name: gated_deltanet_chunk_fwd_h - deadline: "2026-03-15 01:00" - gpus: - - B200_Nebius - - directory: helion/gated_deltanet_chunk_fwd_o_py - name: gated_deltanet_chunk_fwd_o - deadline: "2026-03-15 01:00" - gpus: - - B200_Nebius - - directory: helion/gated_deltanet_recompute_w_u_py - name: gated_deltanet_recompute_w_u - deadline: "2026-03-15 01:00" - gpus: - - B200_Nebius diff --git a/problems/helion/causal_conv1d_py/reference.py b/problems/helion/causal_conv1d_py/reference.py deleted file mode 100644 index 0d2ae2f5..00000000 --- a/problems/helion/causal_conv1d_py/reference.py +++ /dev/null @@ -1,35 +0,0 @@ -import torch -import torch.nn.functional as F -from task import input_t, output_t -from utils import make_match_reference, DeterministicContext - - -def generate_input(B: int, D: int, S: int, W: int, seed: int) -> input_t: - gen = torch.Generator(device="cuda") - gen.manual_seed(seed) - x = torch.randn(B, D, S, dtype=torch.float32, device="cuda", generator=gen).contiguous() - weight = torch.randn(D, W, dtype=torch.float32, device="cuda", generator=gen).contiguous() - bias = torch.randn(D, dtype=torch.float32, device="cuda", generator=gen).contiguous() - return x, weight, bias - - -def ref_kernel(data: input_t) -> output_t: - with DeterministicContext(): - x, weight, bias = data - B, D, S = x.shape - W = 
weight.shape[1] - - # Causal (left) padding - x_padded = F.pad(x, (W - 1, 0)) - - # Depthwise conv1d (groups=D) - output = F.conv1d( - x_padded, - weight.unsqueeze(1), # [D, 1, W] - bias=bias, - groups=D, - ) - return output - - -check_implementation = make_match_reference(ref_kernel, rtol=1e-3, atol=1e-3) diff --git a/problems/helion/causal_conv1d_py/submission.py b/problems/helion/causal_conv1d_py/submission.py deleted file mode 100644 index 92716763..00000000 --- a/problems/helion/causal_conv1d_py/submission.py +++ /dev/null @@ -1,81 +0,0 @@ -from task import input_t, output_t - -import torch -import helion -import helion.language as hl - - -# Per-shape configs: map (B, D, S, W) to optimized helion.Config objects. -# Autotune locally for each shape, then paste the best config here. -SHAPE_CONFIGS: dict[tuple, helion.Config] = { - # Test shapes - (1, 64, 64, 4): helion.Config(block_sizes=[1, 8], num_warps=1, num_stages=1), # TODO: use any config that passes correctness check - (2, 128, 128, 4): helion.Config(block_sizes=[1, 8], num_warps=1, num_stages=1), # TODO: use any config that passes correctness check - (1, 256, 256, 3): helion.Config(block_sizes=[1, 8], num_warps=1, num_stages=1), # TODO: use any config that passes correctness check - (1, 128, 64, 8): helion.Config(block_sizes=[1, 8], num_warps=1, num_stages=1), # TODO: use any config that passes correctness check - (4, 64, 128, 4): helion.Config(block_sizes=[1, 8], num_warps=1, num_stages=1), # TODO: use any config that passes correctness check - # Benchmark shapes - (1, 768, 512, 4): helion.Config(block_sizes=[1, 8], num_warps=1, num_stages=1), # TODO: replace with your autotuned config - (1, 768, 2048, 4): helion.Config(block_sizes=[1, 8], num_warps=1, num_stages=1), # TODO: replace with your autotuned config - (1, 1536, 2048, 4): helion.Config(block_sizes=[1, 8], num_warps=1, num_stages=1), # TODO: replace with your autotuned config - (1, 2560, 2048, 4): helion.Config(block_sizes=[1, 8], num_warps=1, 
num_stages=1), # TODO: replace with your autotuned config - (1, 2560, 4096, 4): helion.Config(block_sizes=[1, 8], num_warps=1, num_stages=1), # TODO: replace with your autotuned config -} - - -# Optional: add advanced_controls_file to your Config for extra performance (see docs). -# Autotune with autotune_search_acf to find the best ACF, then hardcode it: -# helion.Config(..., advanced_controls_file="/opt/booster_pack/causal_conv_0.acf") - - -# NOTE: This is an intentionally inefficient baseline implementation. -def _make_kernel(config: helion.Config): - @helion.kernel(static_shapes=True, config=config) - def kernel( - x_pad: torch.Tensor, # (B, D, L) zero-padded input - w: torch.Tensor, # (D, W) filter coefficients - b: torch.Tensor, # (D,) additive offset - ) -> torch.Tensor: - B = x_pad.size(0) - D = x_pad.size(1) - L = x_pad.size(2) - W = hl.specialize(w.size(1)) - N = L - W + 1 - - y = torch.empty(B, D, N, dtype=x_pad.dtype, device=x_pad.device) - - for rb, rd, rs in hl.tile([B, D, N], block_size=[1, None, None]): - bi = rb.begin - acc1 = hl.zeros([rd, rs], dtype=torch.float32) - acc2 = hl.zeros([rd, rs], dtype=torch.float32) - acc3 = hl.zeros([rd, rs], dtype=torch.float32) - for j in range(W): - c1 = w[rd, j].to(torch.float32) - x1 = hl.load(x_pad, [bi, rd, rs.index + j]).to(torch.float32) - acc1 = acc1 + x1 * c1[:, None] - c2 = w[rd, j].to(torch.float32) - x2 = hl.load(x_pad, [bi, rd, rs.index + j]).to(torch.float32) - acc2 = acc2 + x2 * c2[:, None] - c3 = w[rd, j].to(torch.float32) - x3 = hl.load(x_pad, [bi, rd, rs.index + j]).to(torch.float32) - acc3 = acc3 + x3 * c3[:, None] - acc = (acc1 + acc2 + acc3) / 3.0 - acc = acc + b[rd].to(torch.float32)[:, None] - y[rb, rd, rs] = acc[None, :, :].to(y.dtype) - - return y - - return kernel - - -_KERNELS = {shape: _make_kernel(cfg) for shape, cfg in SHAPE_CONFIGS.items()} - - -def custom_kernel(data: input_t) -> output_t: - x, weight, bias = data - B, D, S = x.shape - W = weight.shape[1] - kernel = _KERNELS[(B, D, 
S, W)] - pad_zeros = torch.zeros(B, D, W - 1, dtype=x.dtype, device=x.device) - padded = torch.cat([pad_zeros, x], dim=2) - return kernel(padded, weight, bias) diff --git a/problems/helion/causal_conv1d_py/task.py b/problems/helion/causal_conv1d_py/task.py deleted file mode 100644 index 00a02fe6..00000000 --- a/problems/helion/causal_conv1d_py/task.py +++ /dev/null @@ -1,12 +0,0 @@ -from typing import TypedDict, TypeVar -import torch - -input_t = TypeVar("input_t", bound=tuple[torch.Tensor, torch.Tensor, torch.Tensor]) -output_t = TypeVar("output_t", bound=torch.Tensor) - -class TestSpec(TypedDict): - B: int - D: int - S: int - W: int - seed: int diff --git a/problems/helion/causal_conv1d_py/task.yml b/problems/helion/causal_conv1d_py/task.yml deleted file mode 100644 index 1f4e8f0b..00000000 --- a/problems/helion/causal_conv1d_py/task.yml +++ /dev/null @@ -1,49 +0,0 @@ -files: - - {"name": "submission.py", "source": "@SUBMISSION@"} - - {"name": "task.py", "source": "task.py"} - - {"name": "utils.py", "source": "../utils.py"} - - {"name": "reference.py", "source": "reference.py"} - - {"name": "eval.py", "source": "../eval.py"} - -lang: "py" - -description: | - Implement a causal depthwise 1D convolution kernel. - - This is a core component of Mamba/Mamba-2 architectures. Each channel is - convolved independently (depthwise) with causal (left) zero-padding so that - output[t] depends only on input[t-W+1:t+1]. - - For each batch b, channel d, and time t: - out[b, d, t] = bias[d] + sum_{k=0}^{W-1} weight[d, k] * x[b, d, t - W + 1 + k] - where out-of-bounds values are treated as zero. 
- - Input: tuple(x, weight, bias) where: - - x: torch.Tensor of shape [B, D, S] (float32) - - weight: torch.Tensor of shape [D, W] (float32) - - bias: torch.Tensor of shape [D] (float32) - - Output: torch.Tensor of shape [B, D, S] (float32) - -config: - main: "eval.py" - -templates: - Python: "../template.py" - -tests: - - {"B": 1, "D": 64, "S": 64, "W": 4, "seed": 4242} - - {"B": 2, "D": 128, "S": 128, "W": 4, "seed": 5236} - - {"B": 1, "D": 256, "S": 256, "W": 3, "seed": 1001} - - {"B": 1, "D": 128, "S": 64, "W": 8, "seed": 5531} - - {"B": 4, "D": 64, "S": 128, "W": 4, "seed": 9173} - -benchmarks: - - {"B": 1, "D": 1536, "S": 2048, "W": 4, "seed": 2146} - - {"B": 1, "D": 2560, "S": 2048, "W": 4, "seed": 3129} - - {"B": 1, "D": 2560, "S": 4096, "W": 4, "seed": 54352} - -test_timeout: 180 -benchmark_timeout: 180 -ranked_timeout: 420 -ranking_by: "geom" diff --git a/problems/helion/eval.py b/problems/helion/eval.py deleted file mode 100644 index cbc0f1d6..00000000 --- a/problems/helion/eval.py +++ /dev/null @@ -1,578 +0,0 @@ -import base64 -import dataclasses -import multiprocessing -import re -import time -import os -import sys -import math -from pathlib import Path -from typing import Any, Optional - -import torch.cuda - - -class PopcornOutput: - def __init__(self, fd: int): - self.file = os.fdopen(fd, 'w') - os.set_inheritable(fd, False) - - def __enter__(self): - return self - - def __exit__(self, exc_type, exc_val, exc_tb): - self.file.close() - - def print(self, *args, **kwargs): - print(*args, **kwargs, file=self.file, flush=True) - - def log(self, key, value): - self.print(f"{key}: {value}") - - -@dataclasses.dataclass -class TestCase: - args: dict - spec: str - - -def _combine(a: int, b: int) -> int: - # combine two integers into one: - # we need this to generate a secret seed based on the test-level seed and - # the global secret seed. 
- # the test-level seeds are public knowledge, and typically relatively small numbers, - # so we need to make sure they don't provide any useful info for the full seed. - # This Cantor construction ensures that if the secret seed is a large number, - # then so is the overall seed. - return int(a + (a+b)*(a+b+1)//2) - - -def get_test_cases(file_name: str, seed: Optional[int]) -> list[TestCase]: - try: - content = Path(file_name).read_text() - except Exception as E: - print(f"Could not open test file`{file_name}`: {E}", file=sys.stderr) - exit(113) - - tests = [] - lines = content.splitlines() - match = r"\s*([a-zA-Z_]\w*):\s*([a-zA-Z_]\w*|[+-]?[0-9]+)\s*" - for line in lines: - parts = line.split(";") - case = {} - for part in parts: - matched = re.match(match, part) - if not re.fullmatch(match, part): - print(f"invalid test case: '{line}': '{part}'", file=sys.stderr) - exit(113) - key = matched[1] - val = matched[2] - try: - val = int(val) - except ValueError: - if val == "true": - val = True - elif val == "false": - val = False - - case[key] = val - tests.append(TestCase(spec=line, args=case)) - - if seed is not None: - for test in tests: - if "seed" in test.args: - test.args["seed"] = _combine(test.args["seed"], seed) - - return tests - - -@dataclasses.dataclass -class Stats: - runs: int - mean: float - std: float - err: float - best: float - worst: float - - -def calculate_stats(durations: list[int]): - """ - Calculate statistical data from a list of durations. - - @param durations: A list of durations in nanoseconds. - @return: A Stats object containing the number of runs, mean, standard deviation, error, best, and worst durations. 
- """ - runs = len(durations) - total = sum(durations) - best = min(durations) - worst = max(durations) - - avg = total / runs - variance = sum(map(lambda x: (x - avg)**2, durations)) - std = math.sqrt(variance / (runs - 1)) - err = std / math.sqrt(runs) - - return Stats(runs=runs, mean=avg, std=std, err=err, best=float(best), - worst=float(worst)) - - -def _clone_data(data): - """ - Recursively goes through data and clones all tensors. - """ - if isinstance(data, tuple): - return tuple(_clone_data(x) for x in data) - elif isinstance(data, list): - return [_clone_data(x) for x in data] - elif isinstance(data, dict): - return {k: _clone_data(v) for k, v in data.items()} - elif isinstance(data, torch.Tensor): - return data.clone() - else: - return data - - -def _copy_data_inplace(dst, src): - """ - Recursively copy tensor data from src into dst (same structure, same shapes). - Used to feed new inputs into CUDA graph buffers without recapturing. - """ - if isinstance(dst, torch.Tensor): - dst.copy_(src) - elif isinstance(dst, (tuple, list)): - for d, s in zip(dst, src): - _copy_data_inplace(d, s) - elif isinstance(dst, dict): - for k in dst: - _copy_data_inplace(dst[k], src[k]) - - -def _do_bench_cudagraph(fn, rep_ms=100, return_mode="mean", clear_l2=True): - """ - Benchmark fn using CUDA graphs with optional L2 cache clearing. - Based on triton.testing.do_bench_cudagraph + triton-lang/triton#8384. - - :param fn: Callable to benchmark (no args). - :param rep_ms: Target repetition time per measurement in milliseconds. - :param return_mode: "min", "max", "mean", "median", or "all" (list of ms). - :param clear_l2: If True, flush L2 cache before each invocation and subtract - the flushing overhead from reported times. - :return: Time(s) in milliseconds. 
- """ - assert return_mode in ["min", "max", "mean", "median", "all"] - - # 256 MB cache tensor — larger than any current GPU L2 - cache = torch.empty(32 * 1024 * 1024, dtype=torch.int64, device="cuda") if clear_l2 else None - - def maybe_clear_cache(): - if cache is not None: - cache.zero_() - - with torch.cuda.stream(torch.cuda.Stream()): - # warmup - maybe_clear_cache() - fn() - - # step 1 — estimate per-call time - start_event = torch.cuda.Event(enable_timing=True) - end_event = torch.cuda.Event(enable_timing=True) - start_event.record() - for _ in range(5): - maybe_clear_cache() - fn() - end_event.record() - torch.cuda.synchronize() - estimate_ms = start_event.elapsed_time(end_event) / 5 - - n_repeat = max(1, int(rep_ms / estimate_ms)) if estimate_ms > 0 else 1000 - - # step 2 — capture graph with n_repeat unrolled calls - g = torch.cuda.CUDAGraph() - with torch.cuda.graph(g): - for _ in range(n_repeat): - maybe_clear_cache() - fn() - torch.cuda.synchronize() - - # step 3 — if L2 clearing enabled, capture a separate graph to measure - # the clearing overhead so we can subtract it - cache_clear_graph = None - if clear_l2: - cache_clear_graph = torch.cuda.CUDAGraph() - with torch.cuda.graph(cache_clear_graph): - for _ in range(n_repeat): - maybe_clear_cache() - torch.cuda.synchronize() - - # step 4 — measure - n_retries = 10 - cache_clear_times = [] - total_times = [] - for _ in range(n_retries): - if cache_clear_graph is not None: - s = torch.cuda.Event(enable_timing=True) - e = torch.cuda.Event(enable_timing=True) - s.record() - cache_clear_graph.replay() - e.record() - torch.cuda.synchronize() - cache_clear_times.append(s.elapsed_time(e) / n_repeat) - - start_event = torch.cuda.Event(enable_timing=True) - end_event = torch.cuda.Event(enable_timing=True) - start_event.record() - g.replay() - end_event.record() - torch.cuda.synchronize() - total_times.append(start_event.elapsed_time(end_event) / n_repeat) - - if clear_l2: - ret = [max(0, t - c) for t, c in 
zip(total_times, cache_clear_times)] - else: - ret = total_times - - if return_mode == "all": - return ret - elif return_mode == "min": - return min(ret) - elif return_mode == "max": - return max(ret) - elif return_mode == "mean": - return sum(ret) / len(ret) - elif return_mode == "median": - return sorted(ret)[len(ret) // 2] - - -def _run_single_test(test: TestCase): - """ - Runs a single test case via CUDA graph capture + replay. - This validates that the kernel is capturable and produces correct output. - """ - from submission import custom_kernel - from reference import check_implementation, generate_input - - data = generate_input(**test.args) - check_copy = _clone_data(data) - - # Warmup call to trigger JIT compilation (outside graph capture) - _ = custom_kernel(_clone_data(data)) - torch.cuda.synchronize() - - # Capture and replay through CUDA graph - input_data = _clone_data(data) - try: - g = torch.cuda.CUDAGraph() - with torch.cuda.graph(g): - output = custom_kernel(input_data) - except Exception as e: - return False, f"Failed to capture kernel in CUDA graph: {e}" - g.replay() - torch.cuda.synchronize() - - return check_implementation(check_copy, output) - - -def run_single_test(pool: multiprocessing.Pool, test: TestCase): - """ - Runs a single test in another process. - """ - return pool.apply(_run_single_test, (test,)) - - -def run_testing(logger: PopcornOutput, pool: multiprocessing.Pool, tests: list[TestCase]): - """ - Executes the actual test case code and checks for correctness. - - @param logger: A PopcornOutput object used for logging test results. - @param tests: A list of TestCase objects representing the test cases to be executed. - @return: An integer representing the exit status: 0 if all tests pass, otherwise 112. 
- """ - passed = True - logger.log("test-count", len(tests)) - for idx, test in enumerate(tests): - logger.log(f"test.{idx}.spec", test.spec) - good, message = run_single_test(pool, test) - if not good: - logger.log(f"test.{idx}.status", "fail") - logger.log(f"test.{idx}.error", message) - passed = False - else: - logger.log(f"test.{idx}.status", "pass") - if message: - logger.log(f"test.{idx}.message", message) - - if passed: - logger.log("check", "pass") - return 0 - else: - logger.log("check", "fail") - return 112 - - -def _run_single_benchmark(test: TestCase, recheck: bool, rep_ms: int) -> Stats | Any: - """ - Runs one benchmark. Do not call directly. - - Correctness is verified via CUDA graph capture + replay first. - Timing only runs if all correctness checks pass. - - :param test: Test case with input arguments. - :param recheck: If True, run additional correctness checks with varying seeds. - :param rep_ms: Target repetition time per measurement in milliseconds. - """ - from submission import custom_kernel - from reference import check_implementation, generate_input - - data = generate_input(**test.args) - check_copy = _clone_data(data) - - # Warmup (JIT compilation) - _ = custom_kernel(_clone_data(data)) - torch.cuda.synchronize() - - # Capture in CUDA graph and run initial correctness check - input_data = _clone_data(data) - try: - g = torch.cuda.CUDAGraph() - with torch.cuda.graph(g): - output = custom_kernel(input_data) - except Exception as e: - return f"Failed to capture kernel in CUDA graph: {e}" - g.replay() - torch.cuda.synchronize() - good, message = check_implementation(check_copy, output) - if not good: - return message - - if recheck: - # Reuse the captured graph with new input data for each seed - for i in range(10): - if "seed" in test.args: - test.args["seed"] += 13 - new_data = generate_input(**test.args) - check_copy = _clone_data(new_data) - _copy_data_inplace(input_data, new_data) - g.replay() - torch.cuda.synchronize() - good, message = 
check_implementation(check_copy, output) - if not good: - return message - - # Timing (only reached if all correctness checks passed) - data = generate_input(**test.args) - fn = lambda: custom_kernel(data) - times_ms = _do_bench_cudagraph(fn, rep_ms=rep_ms, return_mode="all", clear_l2=True) - time.sleep(10) # GPU cooldown to avoid thermal throttling - durations = [t * 1e6 for t in times_ms] # convert ms to ns - return calculate_stats(durations) - - -def run_single_benchmark(pool: multiprocessing.Pool, test: TestCase, recheck: bool, rep_ms: int): - """ - Run a benchmark in a subprocess. - - :param pool: Process pool. - :param test: TestCase object. - :param recheck: Flag for whether to explicitly check functional correctness. - :param rep_ms: Target repetition time per measurement in milliseconds. - :return: A Stats object or an error string. - """ - return pool.apply(_run_single_benchmark, (test, recheck, rep_ms)) - - -def run_benchmarking(logger: PopcornOutput, pool: multiprocessing.Pool, tests: list[TestCase]): - """ - Executes benchmarking code for a CUDA Kernel and logs runtimes. - - @param logger: A PopcornOutput object used for logging benchmark results. - @param pool: Process on which the benchmarks will be launched. - @param tests: A list of TestCase objects representing the test cases to be benchmarked. - @return: An integer representing the exit status: 0 if all benchmarks pass, otherwise 112. 
- """ - # warm up - run_single_benchmark(pool, tests[0], False, 20) - - passed = True - logger.log("benchmark-count", len(tests)) - for idx, test in enumerate(tests): - logger.log(f"benchmark.{idx}.spec", test.spec) - result = run_single_benchmark(pool, test, False, 100) - if isinstance(result, Stats): - for field in dataclasses.fields(Stats): - logger.log(f"benchmark.{idx}.{field.name}", getattr(result, field.name)) - else: - passed = False - logger.log(f"benchmark.{idx}.status", "fail") - logger.log(f"benchmark.{idx}.error", result) - - if passed: - logger.log("check", "pass") - return 0 - else: - logger.log("check", "fail") - return 112 - - -def run_single_profile(test: TestCase) -> str: - """ - Runs a single test case. Do not call directly - """ - from submission import custom_kernel - from reference import generate_input - from torch.profiler import profile, record_function, ProfilerActivity - data = generate_input(**test.args) - torch.cuda.synchronize() - - with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof: - submission_output = custom_kernel(_clone_data(data)) - torch.cuda.synchronize() - return prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=20) - - -def run_profiling(logger: PopcornOutput, tests: list[TestCase]): - logger.log("benchmark-count", len(tests)) - for idx, test in enumerate(tests): - logger.log(f"benchmark.{idx}.spec", test.spec) - report = run_single_profile(test) - logger.log(f"benchmark.{idx}.report", base64.b64encode(report.encode("utf-8"), b"+*").decode("utf-8")) - logger.log("check", "pass") - return 0 - - -def run_local(): - """ - Local eval mode: reads task.yml from a problem directory, runs correctness tests - and benchmarks, prints results to stdout. No Popcorn infrastructure needed. 
- - Usage: python eval.py <mode> <problem_dir> - mode: test, benchmark, or both - problem_dir: path to the problem directory containing task.yml - """ - import yaml - - if len(sys.argv) < 3: - print("Usage: python eval.py <mode> <problem_dir>", file=sys.stderr) - print(" mode: test, benchmark, or both", file=sys.stderr) - print(" problem_dir: path to problem directory containing task.yml", file=sys.stderr) - return 1 - - mode = sys.argv[1] - problem_dir = Path(sys.argv[2]) - - if mode not in ("test", "benchmark", "both"): - print(f"Unknown mode '{mode}'. Use 'test', 'benchmark', or 'both'.", file=sys.stderr) - return 1 - - problem_dir = problem_dir.resolve() - task_path = problem_dir / "task.yml" - if not task_path.exists(): - print(f"Error: task.yml not found in {problem_dir}", file=sys.stderr) - return 1 - - task = yaml.safe_load(task_path.read_text()) - - # chdir into the problem directory so that `from submission import ...` works - os.chdir(problem_dir) - sys.path.insert(0, str(problem_dir)) - - from utils import set_seed - - set_seed(42) - exit_code = 0 - - # --- Correctness tests --- - if mode in ("test", "both"): - tests = [TestCase(args=dict(t), spec=str(t)) for t in task.get("tests", [])] - print(f"Running {len(tests)} correctness tests...") - all_passed = True - for idx, test in enumerate(tests): - good, message = _run_single_test(test) - status = "PASS" if good else "FAIL" - print(f" Test {idx}: {status} {test.spec}") - if not good: - print(f" {message}") - all_passed = False - if all_passed: - print("All tests passed.") - else: - print("Some tests FAILED.") - exit_code = 1 - - # --- Benchmarks --- - if mode in ("benchmark", "both"): - benchmarks = [TestCase(args=dict(t), spec=str(t)) for t in task.get("benchmarks", [])] - print(f"\nRunning {len(benchmarks)} benchmarks...") - - # Warmup - _run_single_benchmark(benchmarks[0], False, 20) - - for idx, bench in enumerate(benchmarks): - result = _run_single_benchmark(bench, False, 100) - if isinstance(result, Stats): - mean_ms = result.mean / 1e6 #
Stats stores ns - min_ms = result.best / 1e6 - max_ms = result.worst / 1e6 - print(f" Benchmark {idx}: {mean_ms:.4f} ms (min={min_ms:.4f}, max={max_ms:.4f}) {bench.spec}") - else: - print(f" Benchmark {idx}: FAIL (correctness) {bench.spec}") - print(f" {result}") - exit_code = 1 - - return exit_code - - -def main(): - os.environ["HELION_DISALLOW_AUTOTUNING"] = "1" - fd = os.getenv("POPCORN_FD") - if not fd: - return run_local() - - if len(sys.argv) < 3: - return 2 - - from utils import set_seed - - mode = sys.argv[1] - seed = os.getenv("POPCORN_SEED") - os.unsetenv("POPCORN_SEED") - seed = int(seed) if seed else None - set_seed(seed or 42) - tests = get_test_cases(sys.argv[2], seed) - - with PopcornOutput(int(fd)) as logger: - import multiprocessing - mp_context = multiprocessing.get_context('spawn') - with mp_context.Pool(1) as pool: - if mode == "test": - return run_testing(logger, pool, tests) - if mode == "benchmark": - return run_benchmarking(logger, pool, tests) - - if mode == "leaderboard": - # warmup - run_single_benchmark(pool, tests[0], False, 20) - logger.log("benchmark-count", len(tests)) - passed = True - for i in range(len(tests)): - result = run_single_benchmark(pool, tests[i], True, 200) - logger.log(f"benchmark.{i}.spec", tests[i].spec) - if isinstance(result, Stats): - for field in dataclasses.fields(Stats): - logger.log(f"benchmark.{i}.{field.name}", getattr(result, field.name)) - else: - passed = False - logger.log(f"benchmark.{i}.status", "fail") - logger.log(f"benchmark.{i}.error", str(result)) - break - - logger.log("check", "pass" if passed else "fail") - elif mode == "profile": - run_profiling(logger, tests) - else: - # TODO: Implement script mode - return 2 - - -if __name__ == "__main__": - sys.exit(main()) diff --git a/problems/helion/fp8_quant_py/reference.py b/problems/helion/fp8_quant_py/reference.py deleted file mode 100644 index bcad6943..00000000 --- a/problems/helion/fp8_quant_py/reference.py +++ /dev/null @@ -1,59 +0,0 @@ -import 
torch -from task import input_t, output_t -from utils import verbose_allclose - -FP8_MAX = 448.0 -FP8_MIN = -448.0 -FP8_EPS = 1e-10 - - -def generate_input(num_tokens: int, hidden_dim: int, group_size: int, seed: int) -> input_t: - gen = torch.Generator(device="cuda") - gen.manual_seed(seed) - x = torch.randn(num_tokens, hidden_dim, dtype=torch.float32, device="cuda", generator=gen).contiguous() - x_q = torch.empty(num_tokens, hidden_dim, dtype=torch.float32, device="cuda").contiguous() - x_s = torch.empty(num_tokens, hidden_dim // group_size, dtype=torch.float32, device="cuda").contiguous() - return x, x_q, x_s - - -def ref_kernel(data: input_t) -> output_t: - x, x_q, x_s = data - num_tokens, hidden_dim = x.shape - num_groups = x_s.shape[1] - group_size = hidden_dim // num_groups - - x_f32 = x.float() - x_grouped = x_f32.reshape(num_tokens, num_groups, group_size) - - # Per-group absmax - absmax = x_grouped.abs().amax(dim=-1).clamp(min=FP8_EPS) - - # Scale = absmax / fp8_max - scale = absmax / FP8_MAX - - # Quantize - quantized = (x_grouped / scale.unsqueeze(-1)).clamp(FP8_MIN, FP8_MAX) - quantized = quantized.reshape(num_tokens, hidden_dim) - - x_q[...] = quantized - x_s[...] 
= scale - return x_q, x_s - - -def check_implementation(data, output): - expected = ref_kernel(data) - expected_q, expected_s = expected - received_q, received_s = output - - reasons_q = verbose_allclose(received_q, expected_q, rtol=1e-3, atol=1e-3) - reasons_s = verbose_allclose(received_s, expected_s, rtol=1e-3, atol=1e-3) - - reasons = [] - if reasons_q: - reasons.append("quantized values mismatch: " + " ".join(reasons_q)) - if reasons_s: - reasons.append("scales mismatch: " + " ".join(reasons_s)) - - if reasons: - return False, " | ".join(reasons) - return True, "" diff --git a/problems/helion/fp8_quant_py/submission.py b/problems/helion/fp8_quant_py/submission.py deleted file mode 100644 index 4b562fa9..00000000 --- a/problems/helion/fp8_quant_py/submission.py +++ /dev/null @@ -1,88 +0,0 @@ -from task import input_t, output_t - -import torch -import helion -import helion.language as hl -from pathlib import Path - - -# Per-shape configs: map (num_tokens, hidden_dim, group_size) to optimized helion.Config objects. -# Autotune locally for each shape, then paste the best config here. 
-SHAPE_CONFIGS: dict[tuple, helion.Config] = { - # Test shapes - (1, 256, 64): helion.Config(block_sizes=[1], num_warps=1, num_stages=1), # TODO: use any config that passes correctness check - (4, 512, 128): helion.Config(block_sizes=[1], num_warps=1, num_stages=1), # TODO: use any config that passes correctness check - (16, 1024, 64): helion.Config(block_sizes=[1], num_warps=1, num_stages=1), # TODO: use any config that passes correctness check - (1, 4096, 128): helion.Config(block_sizes=[1], num_warps=1, num_stages=1), # TODO: use any config that passes correctness check - (8, 4096, 128): helion.Config(block_sizes=[1], num_warps=1, num_stages=1), # TODO: use any config that passes correctness check - # Benchmark shapes - # (1, 4096, 128) already covered above - (16, 4096, 128): helion.Config(block_sizes=[1], num_warps=1, num_stages=1), # TODO: replace with your autotuned config - (256, 4096, 128): helion.Config(block_sizes=[1], num_warps=1, num_stages=1), # TODO: replace with your autotuned config - (256, 8192, 128): helion.Config(block_sizes=[1], num_warps=1, num_stages=1), # TODO: replace with your autotuned config - (4096, 7168, 128): helion.Config(block_sizes=[1], num_warps=1, num_stages=1), # TODO: replace with your autotuned config -} - - -# Optional: add advanced_controls_file to your Config for extra performance (see docs). -# Autotune with autotune_search_acf to find the best ACF, then hardcode it: -# helion.Config(..., advanced_controls_file="/opt/booster_pack/fp8_group_quant_0.acf") - - -# NOTE: This is an intentionally inefficient baseline implementation. 
-def _make_kernel(config: helion.Config): - @helion.kernel(static_shapes=True, config=config) - def kernel( - data: torch.Tensor, # [N, G] input rows - scales_out: torch.Tensor, # [N] output normalization factors - ) -> torch.Tensor: - nrows = data.size(0) - ncols = hl.specialize(data.size(1)) - MAX_VAL = 448.0 - - qout = torch.empty(nrows, ncols, dtype=torch.float32, device=data.device) - - for rr in hl.tile(nrows): - row = data[rr, :].to(torch.float32) - - abs1 = torch.abs(row) - amax1 = torch.amax(abs1, -1) - abs2 = torch.abs(row) - amax2 = torch.amax(abs2, -1) - abs3 = torch.abs(row) - amax3 = torch.amax(abs3, -1) - amax = (amax1 + amax2 + amax3) / 3.0 - amax = torch.clamp(amax, min=1e-10) - scale = amax / MAX_VAL - - q1 = row / scale[:, None] - q2 = row / scale[:, None] - q3 = row / scale[:, None] - qout[rr, :] = (q1 + q2 + q3) / 3.0 - scales_out[rr] = scale - - return qout - - return kernel - - -_KERNELS = {shape: _make_kernel(cfg) for shape, cfg in SHAPE_CONFIGS.items()} - - -def custom_kernel(data: input_t) -> output_t: - x, x_q, x_s = data - T, H = x.shape - G = x_s.shape[1] - gsz = H // G - N = T * G - - kernel = _KERNELS[(T, H, gsz)] - - flat_in = x.reshape(N, gsz) - flat_s = x_s.reshape(N) - - flat_q = kernel(flat_in, flat_s) - - x_q[...] = flat_q.reshape(T, H) - x_s[...] 
= flat_s.reshape(T, G) - return x_q, x_s diff --git a/problems/helion/fp8_quant_py/task.py b/problems/helion/fp8_quant_py/task.py deleted file mode 100644 index 8fb6c1f0..00000000 --- a/problems/helion/fp8_quant_py/task.py +++ /dev/null @@ -1,11 +0,0 @@ -from typing import TypedDict, TypeVar -import torch - -input_t = TypeVar("input_t", bound=tuple[torch.Tensor, torch.Tensor, torch.Tensor]) -output_t = TypeVar("output_t", bound=tuple[torch.Tensor, torch.Tensor]) - -class TestSpec(TypedDict): - num_tokens: int - hidden_dim: int - group_size: int - seed: int diff --git a/problems/helion/fp8_quant_py/task.yml b/problems/helion/fp8_quant_py/task.yml deleted file mode 100644 index df7c36d5..00000000 --- a/problems/helion/fp8_quant_py/task.yml +++ /dev/null @@ -1,56 +0,0 @@ -files: - - {"name": "submission.py", "source": "@SUBMISSION@"} - - {"name": "task.py", "source": "task.py"} - - {"name": "utils.py", "source": "../utils.py"} - - {"name": "reference.py", "source": "reference.py"} - - {"name": "eval.py", "source": "../eval.py"} - -lang: "py" - -description: | - Implement a per-token-group FP8 E4M3 quantization kernel. - - This is THE standard activation quantization method in production LLM inference - (DeepSeek-V3, Llama 3, Qwen3). It dynamically quantizes activations to FP8 - format with per-group scale factors for W8A8 quantized inference. - - For each group of `group_size` contiguous elements: - 1. absmax = max(|x_group|) - 2. scale = max(absmax, eps) / 448.0 - 3. x_q = clamp(x / scale, -448.0, 448.0) - - Where 448.0 is the max representable value in FP8 E4M3 format. - - NOTE: Output is float32 clamped to FP8 range (for broad GPU compatibility). 
- - Input: tuple(x, x_q, x_s) where: - - x: torch.Tensor of shape [num_tokens, hidden_dim] (float32) - - x_q: pre-allocated output [num_tokens, hidden_dim] (float32) - - x_s: pre-allocated scales [num_tokens, hidden_dim // group_size] (float32) - - Output: tuple(x_q, x_s) where: - - x_q: quantized values [num_tokens, hidden_dim] (float32, clamped to FP8 range) - - x_s: per-group scale factors [num_tokens, hidden_dim // group_size] (float32) - -config: - main: "eval.py" - -templates: - Python: "../template.py" - -tests: - - {"num_tokens": 1, "hidden_dim": 256, "group_size": 64, "seed": 4242} - - {"num_tokens": 4, "hidden_dim": 512, "group_size": 128, "seed": 5236} - - {"num_tokens": 16, "hidden_dim": 1024, "group_size": 64, "seed": 1001} - - {"num_tokens": 1, "hidden_dim": 4096, "group_size": 128, "seed": 5531} - - {"num_tokens": 8, "hidden_dim": 4096, "group_size": 128, "seed": 9173} - -benchmarks: - - {"num_tokens": 256, "hidden_dim": 4096, "group_size": 128, "seed": 2146} - - {"num_tokens": 256, "hidden_dim": 8192, "group_size": 128, "seed": 3129} - - {"num_tokens": 4096, "hidden_dim": 7168, "group_size": 128, "seed": 54352} - -test_timeout: 180 -benchmark_timeout: 180 -ranked_timeout: 420 -ranking_by: "geom" diff --git a/problems/helion/gated_deltanet_chunk_fwd_h_py/reference.py b/problems/helion/gated_deltanet_chunk_fwd_h_py/reference.py deleted file mode 100644 index 9d9b7204..00000000 --- a/problems/helion/gated_deltanet_chunk_fwd_h_py/reference.py +++ /dev/null @@ -1,110 +0,0 @@ -import torch -from task import input_t, output_t -from utils import verbose_allclose - -CHUNK_SIZE = 64 - - -def _chunk_local_cumsum_eager(g, chunk_size): - B, T, H = g.shape - C = chunk_size - return g.float().reshape(B, T // C, C, H).cumsum(dim=2).reshape(B, T, H) - - -def _chunk_scaled_dot_kkt_fwd_eager(k, g_cumsum, beta, chunk_size): - B, T, H, K = k.shape - C = chunk_size - NT = T // C - k_c = k.float().reshape(B, NT, C, H, K).permute(0, 1, 3, 2, 4) - g_c = 
g_cumsum.float().reshape(B, NT, C, H).permute(0, 1, 3, 2) - beta_c = beta.float().reshape(B, NT, C, H).permute(0, 1, 3, 2) - kkt = k_c @ k_c.transpose(-1, -2) - strict_lower = torch.tril(torch.ones(C, C, device=k.device), diagonal=-1) - g_diff = g_c.unsqueeze(-1) - g_c.unsqueeze(-2) - g_diff = g_diff * strict_lower - A = kkt * beta_c.unsqueeze(-1) * torch.exp(g_diff) * strict_lower - return A.permute(0, 1, 3, 2, 4).reshape(B, T, H, C).to(torch.float32) - - -def _solve_tril_eager(A, output_dtype): - B, T, H, C = A.shape - NT = T // C - A_mat = A.float().reshape(B, NT, C, H, C).permute(0, 1, 3, 2, 4) - eye = torch.eye(C, device=A.device).expand_as(A_mat) - result = torch.linalg.solve_triangular(eye + A_mat, eye, upper=False) - return result.permute(0, 1, 3, 2, 4).reshape(B, T, H, C).to(output_dtype) - - -def _recompute_w_u_fwd_eager(k, v, beta, A, g): - B, T, H, K = k.shape - V = v.shape[-1] - C = A.shape[-1] - NT = T // C - k_c = k.float().reshape(B, NT, C, H, K).permute(0, 1, 3, 2, 4) - v_c = v.float().reshape(B, NT, C, H, V).permute(0, 1, 3, 2, 4) - beta_c = beta.float().reshape(B, NT, C, H).permute(0, 1, 3, 2) - g_c = g.float().reshape(B, NT, C, H).permute(0, 1, 3, 2) - A_c = A.float().reshape(B, NT, C, H, C).permute(0, 1, 3, 2, 4) - u_c = A_c @ (v_c * beta_c.unsqueeze(-1)) - w_c = A_c @ (k_c * (beta_c * torch.exp(g_c)).unsqueeze(-1)) - w = w_c.permute(0, 1, 3, 2, 4).reshape(B, T, H, K).to(k.dtype) - u = u_c.permute(0, 1, 3, 2, 4).reshape(B, T, H, V).to(v.dtype) - return w, u - - -def generate_input(B: int, T: int, H: int, K: int, V: int, seed: int) -> input_t: - torch.manual_seed(seed) - device = "cuda" - k = torch.randn(B, T, H, K, dtype=torch.float32, device=device) / K**0.5 - v = torch.randn(B, T, H, V, dtype=torch.float32, device=device) - beta = torch.sigmoid(torch.randn(B, T, H, dtype=torch.float32, device=device)) - g_inc = -torch.abs(torch.randn(B, T, H, dtype=torch.float32, device=device)) - g = g_inc.cumsum(dim=1) - g_cumsum = 
_chunk_local_cumsum_eager(g, chunk_size=CHUNK_SIZE) - A = _chunk_scaled_dot_kkt_fwd_eager(k=k, g_cumsum=g_cumsum, beta=beta, chunk_size=CHUNK_SIZE) - A = _solve_tril_eager(A=A, output_dtype=k.dtype) - w, u = _recompute_w_u_fwd_eager(k=k, v=v, beta=beta, A=A, g=g_cumsum) - return k.contiguous(), w.contiguous(), u.contiguous(), g_cumsum.contiguous() - - -def ref_kernel(data: input_t) -> output_t: - k, w, u, g = data - B, T, H, K = k.shape - V = u.shape[-1] - C = CHUNK_SIZE - NT = T // C - k_c = k.float().reshape(B, NT, C, H, K).permute(0, 1, 3, 2, 4) - w_c = w.float().reshape(B, NT, C, H, K).permute(0, 1, 3, 2, 4) - u_c = u.float().reshape(B, NT, C, H, V).permute(0, 1, 3, 2, 4) - g_c = g.float().reshape(B, NT, C, H).permute(0, 1, 3, 2) - h_all = torch.zeros(B, NT, H, K, V, dtype=torch.float32, device=k.device) - v_new_c = torch.zeros_like(u_c) - h = torch.zeros(B, H, K, V, dtype=torch.float32, device=k.device) - for c in range(NT): - h_all[:, c] = h - v_new_c[:, c] = u_c[:, c] - w_c[:, c] @ h - g_last = g_c[:, c, :, -1] - gate = torch.exp(g_last.unsqueeze(-1) - g_c[:, c]) - v_gated = v_new_c[:, c] * gate.unsqueeze(-1) - h = h * torch.exp(g_last).unsqueeze(-1).unsqueeze(-1) + k_c[:, c].transpose(-1, -2) @ v_gated - v_new_out = v_new_c.permute(0, 1, 3, 2, 4).reshape(B, T, H, V).to(u.dtype) - return h_all.to(k.dtype), v_new_out - - -def check_implementation(data, output): - expected = ref_kernel(data) - exp_h, exp_v = expected - got_h, got_v = output - - reasons_h = verbose_allclose(got_h.float(), exp_h.float(), rtol=1e-3, atol=1e-3) - reasons_v = verbose_allclose(got_v.float(), exp_v.float(), rtol=1e-3, atol=1e-3) - - reasons = [] - if reasons_h: - reasons.append("h mismatch: " + " ".join(reasons_h)) - if reasons_v: - reasons.append("v_new mismatch: " + " ".join(reasons_v)) - - if reasons: - return False, " | ".join(reasons) - return True, "" diff --git a/problems/helion/gated_deltanet_chunk_fwd_h_py/submission.py 
b/problems/helion/gated_deltanet_chunk_fwd_h_py/submission.py deleted file mode 100644 index 04e0ecfc..00000000 --- a/problems/helion/gated_deltanet_chunk_fwd_h_py/submission.py +++ /dev/null @@ -1,97 +0,0 @@ -from task import input_t, output_t - -import torch -import helion -import helion.language as hl - - -# Per-shape configs: map (B, T, H, K, V) to optimized helion.Config objects. -# Autotune locally for each shape, then paste the best config here. -SHAPE_CONFIGS: dict[tuple, helion.Config] = { - # Test shapes - (1, 64, 2, 64, 64): helion.Config(block_sizes=[], num_warps=1, num_stages=1), # TODO: use any config that passes correctness check - (2, 128, 4, 64, 64): helion.Config(block_sizes=[], num_warps=1, num_stages=1), # TODO: use any config that passes correctness check - (1, 256, 4, 64, 128): helion.Config(block_sizes=[], num_warps=1, num_stages=1), # TODO: use any config that passes correctness check - # Benchmark shapes - (1, 64, 1, 64, 64): helion.Config(block_sizes=[], num_warps=1, num_stages=1), # TODO: replace with your autotuned config - (2, 512, 3, 64, 64): helion.Config(block_sizes=[], num_warps=1, num_stages=1), # TODO: replace with your autotuned config - (2, 1024, 3, 64, 64): helion.Config(block_sizes=[], num_warps=1, num_stages=1), # TODO: replace with your autotuned config - (3, 1024, 4, 100, 100): helion.Config(block_sizes=[], num_warps=1, num_stages=1), # TODO: replace with your autotuned config - (4, 1024, 4, 128, 128): helion.Config(block_sizes=[], num_warps=1, num_stages=1), # TODO: replace with your autotuned config - (2, 1536, 4, 128, 128): helion.Config(block_sizes=[], num_warps=1, num_stages=1), # TODO: replace with your autotuned config - (4, 2048, 8, 64, 64): helion.Config(block_sizes=[], num_warps=1, num_stages=1), # TODO: replace with your autotuned config -} - - -# Optional: add advanced_controls_file to your Config for extra performance (see docs). 
-# Autotune with autotune_search_acf to find the best ACF, then hardcode it: -# helion.Config(..., advanced_controls_file="/opt/booster_pack/chunk_fwd_h_0.acf") - - -# NOTE: This is an intentionally inefficient baseline implementation. -def _make_kernel(config: helion.Config): - @helion.kernel(static_shapes=True, dot_precision="ieee", config=config) - def kernel( - k: torch.Tensor, # [B, T, H, K] - w: torch.Tensor, # [B, T, H, K] - u: torch.Tensor, # [B, T, H, V] - g: torch.Tensor, # [B, T, H] - ) -> tuple[torch.Tensor, torch.Tensor]: - B, T, H, K = k.shape - V = u.shape[-1] - C = 64 - K = hl.specialize(K) - V = hl.specialize(V) - - NT = (T + C - 1) // C - h_out = torch.empty(B, NT, H, K, V, dtype=k.dtype, device=k.device) - v_out = torch.empty_like(u) - - BH = B * H - - for flat, tv in hl.tile([BH, V], block_size=[1, 8]): - b_idx = flat.begin // H - h_idx = flat.begin % H - state = hl.zeros([K, tv], dtype=torch.float32) - - for tc in hl.tile(T, block_size=C): - chunk_idx = tc.begin // C - t_end = min(tc.begin + C, T) - 1 - - h_out[b_idx, chunk_idx, h_idx, :, tv] = state.to(k.dtype) - - proj1 = hl.dot( - w[b_idx, tc, h_idx, :], state, out_dtype=torch.float32 - ) - proj2 = hl.dot( - w[b_idx, tc, h_idx, :], state, out_dtype=torch.float32 - ) - proj = (proj1 + proj2) * 0.5 - diff = u[b_idx, tc, h_idx, tv].to(torch.float32) - proj - v_out[b_idx, tc, h_idx, tv] = diff.to(u.dtype) - - g_end = g[b_idx, t_end, h_idx].to(torch.float32) - g_t = g[b_idx, tc, h_idx].to(torch.float32) - valid = tc.index < T - alpha = torch.where(valid, torch.exp(g_end - g_t), 0.0) - k_adj = k[b_idx, tc, h_idx, :] * alpha[:, None] - - state = state * torch.exp(g_end) - upd1 = hl.dot(k_adj.T, diff, out_dtype=torch.float32) - upd2 = hl.dot(k_adj.T, diff, out_dtype=torch.float32) - state = state + (upd1 + upd2) * 0.5 - - return h_out, v_out - - return kernel - - -_KERNELS = {shape: _make_kernel(cfg) for shape, cfg in SHAPE_CONFIGS.items()} - - -def custom_kernel(data: input_t) -> output_t: - k, w, 
u, g = data - B, T, H, K = k.shape - V = u.shape[-1] - kernel = _KERNELS[(B, T, H, K, V)] - return kernel(k, w, u, g) diff --git a/problems/helion/gated_deltanet_chunk_fwd_h_py/task.py b/problems/helion/gated_deltanet_chunk_fwd_h_py/task.py deleted file mode 100644 index 248a342e..00000000 --- a/problems/helion/gated_deltanet_chunk_fwd_h_py/task.py +++ /dev/null @@ -1,13 +0,0 @@ -from typing import TypedDict, TypeVar -import torch - -input_t = TypeVar("input_t", bound=tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]) -output_t = TypeVar("output_t", bound=tuple[torch.Tensor, torch.Tensor]) - -class TestSpec(TypedDict): - B: int - T: int - H: int - K: int - V: int - seed: int diff --git a/problems/helion/gated_deltanet_chunk_fwd_h_py/task.yml b/problems/helion/gated_deltanet_chunk_fwd_h_py/task.yml deleted file mode 100644 index 217db171..00000000 --- a/problems/helion/gated_deltanet_chunk_fwd_h_py/task.yml +++ /dev/null @@ -1,62 +0,0 @@ -files: - - {"name": "submission.py", "source": "@SUBMISSION@"} - - {"name": "task.py", "source": "task.py"} - - {"name": "utils.py", "source": "../utils.py"} - - {"name": "reference.py", "source": "reference.py"} - - {"name": "eval.py", "source": "../eval.py"} - -lang: "py" - -description: | - Implement the chunk_fwd_h (inter-chunk state recurrence) kernel for Gated DeltaNet. - - This kernel maintains a hidden state h of shape [K, V] across chunks and computes - v_new (corrected values) for each chunk. It is the sequential bottleneck in the - chunkwise parallel forward pass of Gated DeltaNet (arXiv:2412.06464, ICLR 2025). - - The sequence is divided into chunks of BT=64 timesteps. Processing is sequential - across chunks but parallel across (B, H) and within each chunk: - - For each (b, h) pair, starting with h_state = zeros(K, V): - For each chunk c = 0, 1, ..., NT-1: - 1. Store: h_out[b, c, h] = h_state - 2. Compute: v_new = u - w @ h_state - 3. Gate: v_gated[t] = v_new[t] * exp(g[last_t] - g[t]) - 4. 
Decay: h_state = h_state * exp(g[last_t])
-    5. Update: h_state = h_state + k^T @ v_gated
-
-  Input: tuple(k, w, u, g) where:
-    - k: torch.Tensor of shape [B, T, H, K] (float32) — keys
-    - w: torch.Tensor of shape [B, T, H, K] (float32) — WY-transformed keys
-    - u: torch.Tensor of shape [B, T, H, V] (float32) — WY-transformed values
-    - g: torch.Tensor of shape [B, T, H] (float32) — cumulative gate
-
-  Output: tuple(h, v_new) where:
-    - h: torch.Tensor of shape [B, NT, H, K, V] (float32) — per-chunk hidden states
-    - v_new: torch.Tensor of shape [B, T, H, V] (float32) — corrected values
-
-  Constraint: T must be a multiple of 64. NT = T // 64.
-
-  See also: Helion examples/gdn_fwd_h.py for a related implementation
-  (simpler variant that returns only h, without v_new output).
-
-config:
-  main: "eval.py"
-
-templates:
-  Python: "../template.py"
-
-tests:
-  - {"B": 1, "T": 64, "H": 2, "K": 64, "V": 64, "seed": 4242}
-  - {"B": 2, "T": 128, "H": 4, "K": 64, "V": 64, "seed": 5236}
-  - {"B": 1, "T": 256, "H": 4, "K": 64, "V": 128, "seed": 1001}
-
-benchmarks:
-  - {"B": 1, "T": 64, "H": 1, "K": 64, "V": 64, "seed": 31232}
-  - {"B": 2, "T": 512, "H": 3, "K": 64, "V": 64, "seed": 4052}
-  - {"B": 2, "T": 1024, "H": 3, "K": 64, "V": 64, "seed": 2146}
-
-test_timeout: 180
-benchmark_timeout: 180
-ranked_timeout: 420
-ranking_by: "geom"
diff --git a/problems/helion/gated_deltanet_chunk_fwd_o_py/reference.py b/problems/helion/gated_deltanet_chunk_fwd_o_py/reference.py
deleted file mode 100644
index 54be0f2f..00000000
--- a/problems/helion/gated_deltanet_chunk_fwd_o_py/reference.py
+++ /dev/null
@@ -1,115 +0,0 @@
-import torch
-from task import input_t, output_t
-from utils import make_match_reference
-
-CHUNK_SIZE = 64
-
-
-def _chunk_local_cumsum_eager(g, chunk_size):
-    B, T, H = g.shape
-    C = chunk_size
-    return g.float().reshape(B, T // C, C, H).cumsum(dim=2).reshape(B, T, H)
-
-
-def _chunk_scaled_dot_kkt_fwd_eager(k, g_cumsum, beta, chunk_size):
-    B, T, H, K = k.shape
-    C = chunk_size
-    NT = T // C
-    k_c = k.float().reshape(B, NT, C, H, K).permute(0, 1, 3, 2, 4)
-    g_c = g_cumsum.float().reshape(B, NT, C, H).permute(0, 1, 3, 2)
-    beta_c = beta.float().reshape(B, NT, C, H).permute(0, 1, 3, 2)
-    kkt = k_c @ k_c.transpose(-1, -2)
-    strict_lower = torch.tril(torch.ones(C, C, device=k.device), diagonal=-1)
-    g_diff = g_c.unsqueeze(-1) - g_c.unsqueeze(-2)
-    g_diff = g_diff * strict_lower
-    A = kkt * beta_c.unsqueeze(-1) * torch.exp(g_diff) * strict_lower
-    return A.permute(0, 1, 3, 2, 4).reshape(B, T, H, C).to(torch.float32)
-
-
-def _solve_tril_eager(A, output_dtype):
-    B, T, H, C = A.shape
-    NT = T // C
-    A_mat = A.float().reshape(B, NT, C, H, C).permute(0, 1, 3, 2, 4)
-    eye = torch.eye(C, device=A.device).expand_as(A_mat)
-    result = torch.linalg.solve_triangular(eye + A_mat, eye, upper=False)
-    return result.permute(0, 1, 3, 2, 4).reshape(B, T, H, C).to(output_dtype)
-
-
-def _recompute_w_u_fwd_eager(k, v, beta, A, g):
-    B, T, H, K = k.shape
-    V = v.shape[-1]
-    C = A.shape[-1]
-    NT = T // C
-    k_c = k.float().reshape(B, NT, C, H, K).permute(0, 1, 3, 2, 4)
-    v_c = v.float().reshape(B, NT, C, H, V).permute(0, 1, 3, 2, 4)
-    beta_c = beta.float().reshape(B, NT, C, H).permute(0, 1, 3, 2)
-    g_c = g.float().reshape(B, NT, C, H).permute(0, 1, 3, 2)
-    A_c = A.float().reshape(B, NT, C, H, C).permute(0, 1, 3, 2, 4)
-    u_c = A_c @ (v_c * beta_c.unsqueeze(-1))
-    w_c = A_c @ (k_c * (beta_c * torch.exp(g_c)).unsqueeze(-1))
-    w = w_c.permute(0, 1, 3, 2, 4).reshape(B, T, H, K).to(k.dtype)
-    u = u_c.permute(0, 1, 3, 2, 4).reshape(B, T, H, V).to(v.dtype)
-    return w, u
-
-
-def _chunk_fwd_h_eager(k, w, u, g):
-    B, T, H, K = k.shape
-    V = u.shape[-1]
-    C = CHUNK_SIZE
-    NT = T // C
-    k_c = k.float().reshape(B, NT, C, H, K).permute(0, 1, 3, 2, 4)
-    w_c = w.float().reshape(B, NT, C, H, K).permute(0, 1, 3, 2, 4)
-    u_c = u.float().reshape(B, NT, C, H, V).permute(0, 1, 3, 2, 4)
-    g_c = g.float().reshape(B, NT, C, H).permute(0, 1, 3, 2)
-    h_all = torch.zeros(B, NT, H, K, V, dtype=torch.float32, device=k.device)
-    v_new_c = torch.zeros_like(u_c)
-    h = torch.zeros(B, H, K, V, dtype=torch.float32, device=k.device)
-    for c in range(NT):
-        h_all[:, c] = h
-        v_new_c[:, c] = u_c[:, c] - w_c[:, c] @ h
-        g_last = g_c[:, c, :, -1]
-        gate = torch.exp(g_last.unsqueeze(-1) - g_c[:, c])
-        v_gated = v_new_c[:, c] * gate.unsqueeze(-1)
-        h = h * torch.exp(g_last).unsqueeze(-1).unsqueeze(-1) + k_c[:, c].transpose(-1, -2) @ v_gated
-    v_new_out = v_new_c.permute(0, 1, 3, 2, 4).reshape(B, T, H, V).to(u.dtype)
-    return h_all.to(k.dtype), v_new_out
-
-
-def generate_input(B: int, T: int, H: int, K: int, V: int, seed: int) -> input_t:
-    torch.manual_seed(seed)
-    device = "cuda"
-    q = torch.randn(B, T, H, K, dtype=torch.float32, device=device)
-    k = torch.randn(B, T, H, K, dtype=torch.float32, device=device) / K**0.5
-    v = torch.randn(B, T, H, V, dtype=torch.float32, device=device)
-    beta = torch.sigmoid(torch.randn(B, T, H, dtype=torch.float32, device=device))
-    g_inc = -torch.abs(torch.randn(B, T, H, dtype=torch.float32, device=device))
-    g = g_inc.cumsum(dim=1)
-    g_cumsum = _chunk_local_cumsum_eager(g, chunk_size=CHUNK_SIZE)
-    A = _chunk_scaled_dot_kkt_fwd_eager(k=k, g_cumsum=g_cumsum, beta=beta, chunk_size=CHUNK_SIZE)
-    A = _solve_tril_eager(A=A, output_dtype=k.dtype)
-    w, u = _recompute_w_u_fwd_eager(k=k, v=v, beta=beta, A=A, g=g_cumsum)
-    h, v_new = _chunk_fwd_h_eager(k=k, w=w, u=u, g=g_cumsum)
-    return q.contiguous(), k.contiguous(), v_new.contiguous(), h.contiguous(), g_cumsum.contiguous()
-
-
-def ref_kernel(data: input_t) -> output_t:
-    q, k, v_new, h, g = data
-    B, T, H, K = q.shape
-    V = v_new.shape[-1]
-    C = CHUNK_SIZE
-    NT = T // C
-    scale = K ** -0.5
-    q_c = q.float().reshape(B, NT, C, H, K).permute(0, 1, 3, 2, 4)
-    k_c = k.float().reshape(B, NT, C, H, K).permute(0, 1, 3, 2, 4)
-    v_c = v_new.float().reshape(B, NT, C, H, V).permute(0, 1, 3, 2, 4)
-    g_c = g.float().reshape(B, NT, C, H).permute(0, 1, 3, 2)
-    o_inter = (q_c @ h.float()) * torch.exp(g_c).unsqueeze(-1)
-    causal = torch.tril(torch.ones(C, C, dtype=torch.bool, device=q.device))
-    g_diff = g_c.unsqueeze(-1) - g_c.unsqueeze(-2)
-    g_diff = torch.where(causal, g_diff, torch.zeros_like(g_diff))
-    qk = q_c @ k_c.transpose(-1, -2) * torch.exp(g_diff) * causal
-    o = (o_inter + qk @ v_c) * scale
-    return o.permute(0, 1, 3, 2, 4).reshape(B, T, H, V).to(q.dtype)
-
-
-check_implementation = make_match_reference(ref_kernel, rtol=1e-3, atol=1e-3)
diff --git a/problems/helion/gated_deltanet_chunk_fwd_o_py/submission.py b/problems/helion/gated_deltanet_chunk_fwd_o_py/submission.py
deleted file mode 100644
index eb4de947..00000000
--- a/problems/helion/gated_deltanet_chunk_fwd_o_py/submission.py
+++ /dev/null
@@ -1,89 +0,0 @@
-from task import input_t, output_t
-
-import torch
-import helion
-import helion.language as hl
-
-
-# Per-shape configs: map (B, T, H, K, V) to optimized helion.Config objects.
-# Autotune locally for each shape, then paste the best config here.
-SHAPE_CONFIGS: dict[tuple, helion.Config] = {
-    # Test shapes
-    (1, 64, 2, 64, 64): helion.Config(block_sizes=[], num_warps=8, num_stages=2),  # TODO: use any config that passes correctness check
-    (2, 128, 4, 64, 64): helion.Config(block_sizes=[], num_warps=8, num_stages=2),  # TODO: use any config that passes correctness check
-    (1, 256, 4, 64, 128): helion.Config(block_sizes=[], num_warps=8, num_stages=2),  # TODO: use any config that passes correctness check
-    # Benchmark shapes
-    (1, 64, 1, 64, 64): helion.Config(block_sizes=[], num_warps=8, num_stages=2),  # TODO: replace with your autotuned config
-    (2, 512, 3, 64, 64): helion.Config(block_sizes=[], num_warps=8, num_stages=2),  # TODO: replace with your autotuned config
-    (2, 1024, 3, 64, 64): helion.Config(block_sizes=[], num_warps=8, num_stages=2),  # TODO: replace with your autotuned config
-    (3, 1024, 4, 100, 100): helion.Config(block_sizes=[], num_warps=8, num_stages=2),  # TODO: replace with your autotuned config
-    (4, 1024, 4, 128, 128): helion.Config(block_sizes=[], num_warps=8, num_stages=2),  # TODO: replace with your autotuned config
-    (2, 1536, 4, 128, 128): helion.Config(block_sizes=[], num_warps=8, num_stages=2),  # TODO: replace with your autotuned config
-    (4, 2048, 8, 64, 64): helion.Config(block_sizes=[], num_warps=8, num_stages=2),  # TODO: replace with your autotuned config
-}
-
-
-# Optional: add advanced_controls_file to your Config for extra performance (see docs).
-# Autotune with autotune_search_acf to find the best ACF, then hardcode it:
-# helion.Config(..., advanced_controls_file="/opt/booster_pack/chunk_fwd_o_0.acf")
-
-
-# NOTE: This is an intentionally inefficient baseline implementation.
-def _make_kernel(config: helion.Config):
-    @helion.kernel(static_shapes=True, dot_precision="ieee", config=config)
-    def kernel(
-        q: torch.Tensor,  # [B, T, H, K]
-        k: torch.Tensor,  # [B, T, H, K]
-        v: torch.Tensor,  # [B, T, H, V]
-        h: torch.Tensor,  # [B, NT, H, K, V]
-        g: torch.Tensor,  # [B, T, H]
-        scale: float,
-    ) -> torch.Tensor:
-        B, T, H, K = q.shape
-        V = v.shape[-1]
-        C = 64
-        K = hl.specialize(K)
-        V = hl.specialize(V)
-
-        out = torch.empty_like(v)
-
-        BH = B * H
-        for flat_bh, tile_t in hl.tile([BH, T], block_size=[1, C]):
-            b_idx = flat_bh.begin // H
-            h_idx = flat_bh.begin % H
-            c_idx = tile_t.begin // C
-
-            g_vals = g[b_idx, tile_t, h_idx]
-            q_tile = q[b_idx, tile_t, h_idx, :]
-            k_tile = k[b_idx, tile_t, h_idx, :]
-            v_tile = v[b_idx, tile_t, h_idx, :]
-
-            # intra-chunk: q @ k^T * exp(g_i - g_j), with causal mask
-            qk = hl.dot(q_tile, k_tile.T)
-            idx = hl.arange(tile_t.block_size)
-            g_diff = g_vals[:, None] - g_vals[None, :]
-            causal_mask = idx[:, None] >= idx[None, :]
-            sim = torch.where(causal_mask, qk * torch.exp(g_diff), 0.0)
-            local_out = hl.dot(sim.to(v.dtype), v_tile)
-
-            # inter-chunk: (q @ h) * exp(g)
-            q_s = q_tile * torch.exp(g_vals)[:, None]
-            global_out = hl.dot(q_s, h[b_idx, c_idx, h_idx, :, :])
-
-            out[b_idx, tile_t, h_idx, :] = ((global_out + local_out) * scale).to(out.dtype)
-
-        return out
-
-    return kernel
-
-
-_KERNELS = {shape: _make_kernel(cfg) for shape, cfg in SHAPE_CONFIGS.items()}
-
-
-def custom_kernel(data: input_t) -> output_t:
-    q, k, v_new, h, g = data
-    B, T, H, K = q.shape
-    V = v_new.shape[-1]
-    scale = K ** -0.5
-    kernel = _KERNELS[(B, T, H, K, V)]
-    return kernel(q, k, v_new, h, g, scale)
diff --git a/problems/helion/gated_deltanet_chunk_fwd_o_py/task.py b/problems/helion/gated_deltanet_chunk_fwd_o_py/task.py
deleted file mode 100644
index 08d4b4f6..00000000
--- a/problems/helion/gated_deltanet_chunk_fwd_o_py/task.py
+++ /dev/null
@@ -1,13 +0,0 @@
-from typing import TypedDict, TypeVar
-import torch
-
-input_t = TypeVar("input_t", bound=tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor])
-output_t = TypeVar("output_t", bound=torch.Tensor)
-
-class TestSpec(TypedDict):
-    B: int
-    T: int
-    H: int
-    K: int
-    V: int
-    seed: int
diff --git a/problems/helion/gated_deltanet_chunk_fwd_o_py/task.yml b/problems/helion/gated_deltanet_chunk_fwd_o_py/task.yml
deleted file mode 100644
index 7b8e2a08..00000000
--- a/problems/helion/gated_deltanet_chunk_fwd_o_py/task.yml
+++ /dev/null
@@ -1,55 +0,0 @@
-files:
-  - {"name": "submission.py", "source": "@SUBMISSION@"}
-  - {"name": "task.py", "source": "task.py"}
-  - {"name": "utils.py", "source": "../utils.py"}
-  - {"name": "reference.py", "source": "reference.py"}
-  - {"name": "eval.py", "source": "../eval.py"}
-
-lang: "py"
-
-description: |
-  Implement the chunk_fwd_o (output computation) kernel for Gated DeltaNet.
-
-  This kernel computes the final output by combining inter-chunk (state-based)
-  and intra-chunk (attention-based) contributions for the chunkwise parallel
-  forward pass of Gated DeltaNet (arXiv:2412.06464, ICLR 2025).
-
-  The sequence is divided into chunks of BT=64 timesteps. For each chunk
-  independently:
-    inter = q @ h * exp(g)
-    intra = causal_mask(q @ k^T * exp(g[:, None] - g[None, :])) @ v_new
-    output = (inter + intra) * scale
-
-  where scale = K^(-0.5), and causal_mask zeros out entries where row < col.
-
-  Input: tuple(q, k, v_new, h, g) where:
-    - q: torch.Tensor of shape [B, T, H, K] (float32) — queries
-    - k: torch.Tensor of shape [B, T, H, K] (float32) — keys
-    - v_new: torch.Tensor of shape [B, T, H, V] (float32) — corrected values
-    - h: torch.Tensor of shape [B, NT, H, K, V] (float32) — per-chunk states
-    - g: torch.Tensor of shape [B, T, H] (float32) — cumulative gate
-
-  Output: torch.Tensor of shape [B, T, H, V] (float32)
-
-  Constraint: T must be a multiple of 64. NT = T // 64. scale = K^(-0.5).
-
-config:
-  main: "eval.py"
-
-templates:
-  Python: "../template.py"
-
-tests:
-  - {"B": 1, "T": 64, "H": 2, "K": 64, "V": 64, "seed": 4242}
-  - {"B": 2, "T": 128, "H": 4, "K": 64, "V": 64, "seed": 5236}
-  - {"B": 1, "T": 256, "H": 4, "K": 64, "V": 128, "seed": 1001}
-
-benchmarks:
-  - {"B": 1, "T": 64, "H": 1, "K": 64, "V": 64, "seed": 31232}
-  - {"B": 2, "T": 512, "H": 3, "K": 64, "V": 64, "seed": 4052}
-  - {"B": 2, "T": 1024, "H": 3, "K": 64, "V": 64, "seed": 2146}
-
-test_timeout: 180
-benchmark_timeout: 180
-ranked_timeout: 420
-ranking_by: "geom"
diff --git a/problems/helion/gated_deltanet_recompute_w_u_py/reference.py b/problems/helion/gated_deltanet_recompute_w_u_py/reference.py
deleted file mode 100644
index bd7c1507..00000000
--- a/problems/helion/gated_deltanet_recompute_w_u_py/reference.py
+++ /dev/null
@@ -1,86 +0,0 @@
-import torch
-from task import input_t, output_t
-from utils import verbose_allclose
-
-CHUNK_SIZE = 64
-
-
-def _chunk_local_cumsum_eager(g, chunk_size):
-    B, T, H = g.shape
-    C = chunk_size
-    return g.float().reshape(B, T // C, C, H).cumsum(dim=2).reshape(B, T, H)
-
-
-def _chunk_scaled_dot_kkt_fwd_eager(k, g_cumsum, beta, chunk_size):
-    B, T, H, K = k.shape
-    C = chunk_size
-    NT = T // C
-    k_c = k.float().reshape(B, NT, C, H, K).permute(0, 1, 3, 2, 4)
-    g_c = g_cumsum.float().reshape(B, NT, C, H).permute(0, 1, 3, 2)
-    beta_c = beta.float().reshape(B, NT, C, H).permute(0, 1, 3, 2)
-    kkt = k_c @ k_c.transpose(-1, -2)
-    strict_lower = torch.tril(torch.ones(C, C, device=k.device), diagonal=-1)
-    g_diff = g_c.unsqueeze(-1) - g_c.unsqueeze(-2)
-    g_diff = g_diff * strict_lower
-    A = kkt * beta_c.unsqueeze(-1) * torch.exp(g_diff) * strict_lower
-    return A.permute(0, 1, 3, 2, 4).reshape(B, T, H, C).to(torch.float32)
-
-
-def _solve_tril_eager(A, output_dtype):
-    B, T, H, C = A.shape
-    NT = T // C
-    A_mat = A.float().reshape(B, NT, C, H, C).permute(0, 1, 3, 2, 4)
-    eye = torch.eye(C, device=A.device).expand_as(A_mat)
-    result = torch.linalg.solve_triangular(eye + A_mat, eye, upper=False)
-    return result.permute(0, 1, 3, 2, 4).reshape(B, T, H, C).to(output_dtype)
-
-
-def generate_input(B: int, T: int, H: int, K: int, V: int, seed: int) -> input_t:
-    torch.manual_seed(seed)
-    device = "cuda"
-    k = torch.randn(B, T, H, K, dtype=torch.float32, device=device) / K**0.5
-    v = torch.randn(B, T, H, V, dtype=torch.float32, device=device)
-    beta = torch.sigmoid(torch.randn(B, T, H, dtype=torch.float32, device=device))
-    g_inc = -torch.abs(torch.randn(B, T, H, dtype=torch.float32, device=device))
-    g = g_inc.cumsum(dim=1)
-    g_cumsum = _chunk_local_cumsum_eager(g, chunk_size=CHUNK_SIZE)
-    A = _chunk_scaled_dot_kkt_fwd_eager(k=k, g_cumsum=g_cumsum, beta=beta, chunk_size=CHUNK_SIZE)
-    A = _solve_tril_eager(A=A, output_dtype=k.dtype)
-    return k.contiguous(), v.contiguous(), beta.contiguous(), A.contiguous(), g_cumsum.contiguous()
-
-
-def ref_kernel(data: input_t) -> output_t:
-    k, v, beta, A, g = data
-    B, T, H, K = k.shape
-    V = v.shape[-1]
-    C = A.shape[-1]
-    NT = T // C
-    k_c = k.float().reshape(B, NT, C, H, K).permute(0, 1, 3, 2, 4)
-    v_c = v.float().reshape(B, NT, C, H, V).permute(0, 1, 3, 2, 4)
-    beta_c = beta.float().reshape(B, NT, C, H).permute(0, 1, 3, 2)
-    g_c = g.float().reshape(B, NT, C, H).permute(0, 1, 3, 2)
-    A_c = A.float().reshape(B, NT, C, H, C).permute(0, 1, 3, 2, 4)
-    u_c = A_c @ (v_c * beta_c.unsqueeze(-1))
-    w_c = A_c @ (k_c * (beta_c * torch.exp(g_c)).unsqueeze(-1))
-    w = w_c.permute(0, 1, 3, 2, 4).reshape(B, T, H, K).to(k.dtype)
-    u = u_c.permute(0, 1, 3, 2, 4).reshape(B, T, H, V).to(v.dtype)
-    return w, u
-
-
-def check_implementation(data, output):
-    expected = ref_kernel(data)
-    exp_w, exp_u = expected
-    got_w, got_u = output
-
-    reasons_w = verbose_allclose(got_w, exp_w, rtol=1e-3, atol=1e-3)
-    reasons_u = verbose_allclose(got_u, exp_u, rtol=1e-3, atol=1e-3)
-
-    reasons = []
-    if reasons_w:
-        reasons.append("w mismatch: " + " ".join(reasons_w))
-    if reasons_u:
-        reasons.append("u mismatch: " + " ".join(reasons_u))
-
-    if reasons:
-        return False, " | ".join(reasons)
-    return True, ""
diff --git a/problems/helion/gated_deltanet_recompute_w_u_py/submission.py b/problems/helion/gated_deltanet_recompute_w_u_py/submission.py
deleted file mode 100644
index 07fb0691..00000000
--- a/problems/helion/gated_deltanet_recompute_w_u_py/submission.py
+++ /dev/null
@@ -1,100 +0,0 @@
-from task import input_t, output_t
-
-import torch
-import helion
-import helion.language as hl
-
-
-# Per-shape configs: map (B, T, H, K, V) to optimized helion.Config objects.
-# Autotune locally for each shape, then paste the best config here.
-SHAPE_CONFIGS: dict[tuple, helion.Config] = {
-    # Test shapes
-    (1, 64, 2, 64, 64): helion.Config(block_sizes=[], num_warps=1, num_stages=1),  # TODO: use any config that passes correctness check
-    (2, 128, 4, 64, 64): helion.Config(block_sizes=[], num_warps=1, num_stages=1),  # TODO: use any config that passes correctness check
-    (1, 256, 4, 64, 128): helion.Config(block_sizes=[], num_warps=1, num_stages=1),  # TODO: use any config that passes correctness check
-    # Benchmark shapes
-    (1, 64, 1, 64, 64): helion.Config(block_sizes=[], num_warps=1, num_stages=1),  # TODO: replace with your autotuned config
-    (2, 512, 3, 64, 64): helion.Config(block_sizes=[], num_warps=1, num_stages=1),  # TODO: replace with your autotuned config
-    (2, 1024, 3, 64, 64): helion.Config(block_sizes=[], num_warps=1, num_stages=1),  # TODO: replace with your autotuned config
-    (3, 1024, 4, 100, 100): helion.Config(block_sizes=[], num_warps=1, num_stages=1),  # TODO: replace with your autotuned config
-    (4, 1024, 4, 128, 128): helion.Config(block_sizes=[], num_warps=1, num_stages=1),  # TODO: replace with your autotuned config
-    (2, 1536, 4, 128, 128): helion.Config(block_sizes=[], num_warps=1, num_stages=1),  # TODO: replace with your autotuned config
-    (4, 2048, 8, 64, 64): helion.Config(block_sizes=[], num_warps=1, num_stages=1),  # TODO: replace with your autotuned config
-}
-
-
-# Optional: add advanced_controls_file to your Config for extra performance (see docs).
-# Autotune with autotune_search_acf to find the best ACF, then hardcode it:
-# helion.Config(..., advanced_controls_file="/opt/booster_pack/recompute_w_u_fwd_0.acf")
-
-
-# NOTE: This is an intentionally inefficient baseline implementation.
-def _make_kernel(config: helion.Config):
-    @helion.kernel(static_shapes=True, dot_precision="ieee", config=config)
-    def kernel(
-        k: torch.Tensor,  # [B, T, H, K]
-        v: torch.Tensor,  # [B, T, H, V]
-        beta: torch.Tensor,  # [B, T, H]
-        A: torch.Tensor,  # [B, T, H, BT]
-        g: torch.Tensor,  # [B, T, H]
-    ) -> tuple[torch.Tensor, torch.Tensor]:
-        B, T, H, K = k.shape
-        V = v.shape[-1]
-        C = hl.specialize(A.shape[-1])
-        K = hl.specialize(K)
-        V = hl.specialize(V)
-
-        w_out = torch.empty_like(k)
-        u_out = torch.empty_like(v)
-
-        BH = B * H
-        for flat_bh, rt in hl.tile([BH, T], block_size=[1, C]):
-            b_idx = flat_bh.begin // H
-            h_idx = flat_bh.begin % H
-
-            w_acc1 = hl.zeros([rt, K], dtype=torch.float32)
-            u_acc1 = hl.zeros([rt, V], dtype=torch.float32)
-            w_acc2 = hl.zeros([rt, K], dtype=torch.float32)
-            u_acc2 = hl.zeros([rt, V], dtype=torch.float32)
-
-            for ci in range(C):
-                t_ci = rt.begin + ci
-                a_col = A[b_idx, rt, h_idx, ci].to(torch.float32)
-                coeff_ci = beta[b_idx, t_ci, h_idx].to(torch.float32)
-                decay_ci = torch.exp(g[b_idx, t_ci, h_idx].to(torch.float32))
-
-                k_ci = k[b_idx, t_ci, h_idx, :].to(torch.float32)
-                v_ci = v[b_idx, t_ci, h_idx, :].to(torch.float32)
-
-                w_acc1 = w_acc1 + a_col[:, None] * (k_ci * coeff_ci * decay_ci)[None, :]
-                u_acc1 = u_acc1 + a_col[:, None] * (v_ci * coeff_ci)[None, :]
-
-            for ci in range(C - 1, -1, -1):
-                t_ci = rt.begin + ci
-                a_col = A[b_idx, rt, h_idx, ci].to(torch.float32)
-                coeff_ci = beta[b_idx, t_ci, h_idx].to(torch.float32)
-                decay_ci = torch.exp(g[b_idx, t_ci, h_idx].to(torch.float32))
-
-                k_ci = k[b_idx, t_ci, h_idx, :].to(torch.float32)
-                v_ci = v[b_idx, t_ci, h_idx, :].to(torch.float32)
-
-                w_acc2 = w_acc2 + a_col[:, None] * (k_ci * coeff_ci * decay_ci)[None, :]
-                u_acc2 = u_acc2 + a_col[:, None] * (v_ci * coeff_ci)[None, :]
-
-            w_out[b_idx, rt, h_idx, :] = ((w_acc1 + w_acc2) * 0.5).to(k.dtype)
-            u_out[b_idx, rt, h_idx, :] = ((u_acc1 + u_acc2) * 0.5).to(v.dtype)
-
-        return w_out, u_out
-
-    return kernel
-
-
-_KERNELS = {shape: _make_kernel(cfg) for shape, cfg in SHAPE_CONFIGS.items()}
-
-
-def custom_kernel(data: input_t) -> output_t:
-    k, v, beta, A, g = data
-    B, T, H, K = k.shape
-    V = v.shape[-1]
-    kernel = _KERNELS[(B, T, H, K, V)]
-    return kernel(k, v, beta, A, g)
diff --git a/problems/helion/gated_deltanet_recompute_w_u_py/task.py b/problems/helion/gated_deltanet_recompute_w_u_py/task.py
deleted file mode 100644
index 2887eb89..00000000
--- a/problems/helion/gated_deltanet_recompute_w_u_py/task.py
+++ /dev/null
@@ -1,13 +0,0 @@
-from typing import TypedDict, TypeVar
-import torch
-
-input_t = TypeVar("input_t", bound=tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor])
-output_t = TypeVar("output_t", bound=tuple[torch.Tensor, torch.Tensor])
-
-class TestSpec(TypedDict):
-    B: int
-    T: int
-    H: int
-    K: int
-    V: int
-    seed: int
diff --git a/problems/helion/gated_deltanet_recompute_w_u_py/task.yml b/problems/helion/gated_deltanet_recompute_w_u_py/task.yml
deleted file mode 100644
index 3a8820fc..00000000
--- a/problems/helion/gated_deltanet_recompute_w_u_py/task.yml
+++ /dev/null
@@ -1,60 +0,0 @@
-files:
-  - {"name": "submission.py", "source": "@SUBMISSION@"}
-  - {"name": "task.py", "source": "task.py"}
-  - {"name": "utils.py", "source": "../utils.py"}
-  - {"name": "reference.py", "source": "reference.py"}
-  - {"name": "eval.py", "source": "../eval.py"}
-
-lang: "py"
-
-description: |
-  Implement the recompute_w_u forward kernel for Gated DeltaNet.
-
-  This kernel computes WY-transformed keys (w) and values (u) for the chunkwise
-  parallel forward pass of Gated DeltaNet (arXiv:2412.06464, ICLR 2025). It is
-  one of three per-chunk kernels in the forward pipeline.
-
-  The sequence is divided into non-overlapping chunks of BT=64 timesteps.
-  For each chunk independently:
-    u = A @ diag(beta) @ v  (WY-transformed values)
-    w = A @ diag(beta * exp(g)) @ k  (WY-transformed keys)
-
-  Equivalently:
-    u = A @ (v * beta[:, None])
-    w = A @ (k * beta[:, None] * exp(g)[:, None])
-
-  where A is a [BT, BT] WY representation matrix per chunk.
-
-  Input: tuple(k, v, beta, A, g) where:
-    - k: torch.Tensor of shape [B, T, H, K] (float32) — keys
-    - v: torch.Tensor of shape [B, T, H, V] (float32) — values
-    - beta: torch.Tensor of shape [B, T, H] (float32) — gating coefficients
-    - A: torch.Tensor of shape [B, T, H, BT] (float32) — WY matrix (BT=64)
-    - g: torch.Tensor of shape [B, T, H] (float32) — cumulative gate
-
-  Output: tuple(w, u) where:
-    - w: torch.Tensor of shape [B, T, H, K] (float32) — WY-transformed keys
-    - u: torch.Tensor of shape [B, T, H, V] (float32) — WY-transformed values
-
-  Constraint: T must be a multiple of 64.
-
-config:
-  main: "eval.py"
-
-templates:
-  Python: "../template.py"
-
-tests:
-  - {"B": 1, "T": 64, "H": 2, "K": 64, "V": 64, "seed": 4242}
-  - {"B": 2, "T": 128, "H": 4, "K": 64, "V": 64, "seed": 5236}
-  - {"B": 1, "T": 256, "H": 4, "K": 64, "V": 128, "seed": 1001}
-
-benchmarks:
-  - {"B": 1, "T": 64, "H": 1, "K": 64, "V": 64, "seed": 31232}
-  - {"B": 2, "T": 512, "H": 3, "K": 64, "V": 64, "seed": 4052}
-  - {"B": 2, "T": 1024, "H": 3, "K": 64, "V": 64, "seed": 2146}
-
-test_timeout: 180
-benchmark_timeout: 180
-ranked_timeout: 420
-ranking_by: "geom"
diff --git a/problems/helion/template.py b/problems/helion/template.py
deleted file mode 100644
index 37d04820..00000000
--- a/problems/helion/template.py
+++ /dev/null
@@ -1,31 +0,0 @@
-from task import input_t, output_t
-import torch
-import helion
-import helion.language as hl
-
-
-# Per-shape configs: map input shape tuples to optimized helion.Config objects.
-# Autotune locally for each shape, then paste the best config here.
-# Include all test and benchmark shapes from task.yml.
-SHAPE_CONFIGS: dict[tuple, helion.Config] = {
-    # (shape_dim_1, shape_dim_2, ...): helion.Config(...),  # TODO: replace with your config
-}
-
-
-def _make_kernel(config: helion.Config):
-    @helion.kernel(static_shapes=True, config=config)
-    def kernel(...) -> ...:
-        # Your Helion kernel implementation
-        ...
-
-    return kernel
-
-
-_KERNELS = {shape: _make_kernel(cfg) for shape, cfg in SHAPE_CONFIGS.items()}
-
-
-def custom_kernel(data: input_t) -> output_t:
-    # Extract shape key from input tensors to select the right kernel
-    # shape_key = (...)
-    # kernel = _KERNELS[shape_key]
-    pass
diff --git a/problems/helion/utils.py b/problems/helion/utils.py
deleted file mode 100644
index e8a9082f..00000000
--- a/problems/helion/utils.py
+++ /dev/null
@@ -1,176 +0,0 @@
-import os
-import random
-import numpy as np
-import torch
-
-
-def set_seed(seed=42):
-    random.seed(seed)
-    np.random.seed(seed)
-    torch.manual_seed(seed)
-    if torch.cuda.is_available():
-        torch.cuda.manual_seed(seed)
-        torch.cuda.manual_seed_all(seed)
-
-
-def get_device(use_cuda: bool = True) -> torch.device:
-    """Get the appropriate device (GPU or CPU)."""
-    if use_cuda:
-        if torch.cuda.is_available():
-            return torch.device("cuda")
-        elif torch.backends.mps.is_available():
-            return torch.device("mps")
-        else:
-            print("No compatible GPU found. Falling back to CPU.")
-    return torch.device("cpu")
-
-
-# Adapted from https://github.com/linkedin/Liger-Kernel/blob/main/test/utils.py
-@torch.no_grad()
-def verbose_allclose(
-    received: torch.Tensor,
-    expected: torch.Tensor,
-    rtol=1e-05,
-    atol=1e-08,
-    max_print=5
-) -> list[str]:
-    """
-    Assert that two tensors are element-wise equal within a tolerance, providing detailed information about mismatches.
-
-    Parameters:
-        received (torch.Tensor): Tensor we actually got.
-        expected (torch.Tensor): Tensor we expected to receive.
-        rtol (float): Relative tolerance; relative to expected
-        atol (float): Absolute tolerance.
-        max_print (int): Maximum number of mismatched elements to print.
-
-    Raises:
-        AssertionError: If the tensors are not all close within the given tolerance.
-    """
-    # Check if the shapes of the tensors match
-    if received.shape != expected.shape:
-        return ["SIZE MISMATCH"]
-
-    # Calculate the difference between the tensors
-    diff = torch.abs(received - expected)
-
-    # Determine the tolerance
-    tolerance = atol + rtol * torch.abs(expected)
-
-    # Find tolerance mismatched elements
-    tol_mismatched = diff > tolerance
-
-    # Find nan mismatched elements
-    nan_mismatched = torch.logical_xor(torch.isnan(received), torch.isnan(expected))
-
-    # Find +inf mismatched elements
-    posinf_mismatched = torch.logical_xor(torch.isposinf(received), torch.isposinf(expected))
-    # Find -inf mismatched elements
-    neginf_mismatched = torch.logical_xor(torch.isneginf(received), torch.isneginf(expected))
-
-    # Find all mismatched elements
-    mismatched = torch.logical_or(
-        torch.logical_or(tol_mismatched, nan_mismatched),
-        torch.logical_or(posinf_mismatched, neginf_mismatched),
-    )
-
-    mismatched_indices = torch.nonzero(mismatched)
-
-    # Count the number of mismatched elements
-    num_mismatched = mismatched.count_nonzero().item()
-
-    # Generate detailed information if there are mismatches
-    if num_mismatched >= 1:
-        mismatch_details = [f"Number of mismatched elements: {num_mismatched}"]
-
-        for index in mismatched_indices[:max_print]:
-            i = tuple(index.tolist())
-            mismatch_details.append(f"ERROR AT {i}: {received[i]} {expected[i]}")
-        if num_mismatched > max_print:
-            mismatch_details.append(f"... and {num_mismatched - max_print} more mismatched elements.")
-        return mismatch_details
-
-    return []
-
-
-@torch.no_grad()
-def verbose_allequal(received: torch.Tensor, expected: torch.Tensor, max_print: int=5):
-    """
-    Assert that two tensors are element-wise perfectly equal, providing detailed information about mismatches.
-
-    Parameters:
-        received (torch.Tensor): Tensor we actually got.
-        expected (torch.Tensor): Tensor we expected to receive.
-        max_print (int): Maximum number of mismatched elements to print.
-
-    Returns:
-        Empty string if tensors are equal, otherwise detailed error information
-    """
-    mismatched = torch.not_equal(received, expected)
-    mismatched_indices = torch.nonzero(mismatched)
-
-    # Count the number of mismatched elements
-    num_mismatched = mismatched.count_nonzero().item()
-
-    # Generate detailed information if there are mismatches
-    if num_mismatched >= 1:
-        mismatch_details = [f"Number of mismatched elements: {num_mismatched}"]
-
-        for index in mismatched_indices[:max_print]:
-            i = tuple(index.tolist())
-            mismatch_details.append(f"ERROR AT {i}: {received[i]} {expected[i]}")
-        if num_mismatched > max_print:
-            mismatch_details.append(f"... and {num_mismatched - max_print} more mismatched elements.")
-        return mismatch_details
-
-    return []
-
-
-def match_reference(data, output, reference: callable, rtol=1e-05, atol=1e-08) -> tuple[bool, str]:
-    """
-    Convenient "default" implementation for tasks' `check_implementation` function.
-    """
-    expected = reference(data)
-    reasons = verbose_allclose(output, expected, rtol=rtol, atol=atol)
-
-    if len(reasons) > 0:
-        return False, "mismatch found! custom implementation doesn't match reference: " + " ".join(reasons)
-
-    return True, ''
-
-
-def make_match_reference(reference: callable, **kwargs):
-    def wrapped(data, output):
-        return match_reference(data, output, reference=reference, **kwargs)
-    return wrapped
-
-
-class DeterministicContext:
-    def __init__(self):
-        self.allow_tf32 = None
-        self.deterministic = None
-        self.cublas = None
-
-    def __enter__(self):
-        self.cublas = os.environ.get('CUBLAS_WORKSPACE_CONFIG', '')
-        self.allow_tf32 = torch.backends.cudnn.allow_tf32
-        self.deterministic = torch.backends.cudnn.deterministic
-        torch.backends.cudnn.allow_tf32 = False
-        torch.backends.cudnn.deterministic = True
-        torch.use_deterministic_algorithms(True)
-        return self
-
-    def __exit__(self, exc_type, exc_value, traceback):
-        torch.backends.cudnn.allow_tf32 = self.allow_tf32
-        torch.backends.cudnn.deterministic = self.deterministic
-        torch.use_deterministic_algorithms(False)
-        os.environ['CUBLAS_WORKSPACE_CONFIG'] = self.cublas
-
-def clear_l2_cache():
-    # import cupy as cp
-    # cp.cuda.runtime.deviceSetLimit(cp.cuda.runtime.cudaLimitPersistingL2CacheSize, 0)
-    # create a large dummy tensor
-    dummy = torch.empty((32, 1024, 1024), dtype=torch.int64, device="cuda")
-    # write stuff to
-    dummy.fill_(42)
-    del dummy
\ No newline at end of file
diff --git a/problems/linalg.yaml b/problems/linalg.yaml
deleted file mode 100644
index 34ed8ce2..00000000
--- a/problems/linalg.yaml
+++ /dev/null
@@ -1,17 +0,0 @@
-name: Linear Algebra
-
-deadline: ""
-
-description: "Core linear algebra kernels for modern accelerator workloads."
-
-problems:
-  - directory: linalg/qr_py
-    name: qr
-    deadline: "2026-06-30"
-    gpus:
-      - B200
-  - directory: linalg/qr_v2
-    name: qr_v2
-    deadline: "2026-06-30"
-    gpus:
-      - B200
diff --git a/problems/linalg/qr_py/eval.py b/problems/linalg/qr_py/eval.py
deleted file mode 100644
index cd2c6bd3..00000000
--- a/problems/linalg/qr_py/eval.py
+++ /dev/null
@@ -1,311 +0,0 @@
-import dataclasses
-import math
-import multiprocessing
-import os
-import re
-import sys
-import time
-from pathlib import Path
-from typing import Any, Optional
-
-import torch
-
-from reference import check_implementation, generate_input
-from utils import clear_l2_cache, set_seed
-
-try:
-    from task import TestSpec
-except ImportError:
-    TestSpec = dict
-
-
-MAX_ITERATIONS_PER_BENCHMARK = 50
-BENCHMARK_INPUT_BYTES_TARGET = 256 * 1024 * 1024
-
-
-class PopcornOutput:
-    def __init__(self, fd: int):
-        self.file = os.fdopen(fd, "w")
-        os.set_inheritable(fd, False)
-
-    def __enter__(self):
-        return self
-
-    def __exit__(self, exc_type, exc_val, exc_tb):
-        self.file.close()
-
-    def print(self, *args, **kwargs):
-        print(*args, **kwargs, file=self.file, flush=True)
-
-    def log(self, key, value):
-        self.print(f"{key}: {value}")
-
-
-@dataclasses.dataclass
-class TestCase:
-    args: dict
-    spec: str
-
-
-@dataclasses.dataclass
-class Stats:
-    runs: int
-    mean: float
-    std: float
-    err: float
-    best: float
-    worst: float
-
-
-def _combine(a: int, b: int) -> int:
-    return int(a + (a + b) * (a + b + 1) // 2)
-
-
-def get_test_cases(file_name: str, seed: Optional[int]) -> list[TestCase]:
-    try:
-        content = Path(file_name).read_text()
-    except Exception as exc:
-        print(f"Could not open test file `{file_name}`: {exc}", file=sys.stderr)
-        exit(113)
-
-    tests = []
-    match = r"\s*([a-zA-Z]+):\s*([a-zA-Z]+|[+-]?[0-9]+)\s*"
-    for line in content.splitlines():
-        case = {}
-        for part in line.split(";"):
-            matched = re.match(match, part)
-            if not re.fullmatch(match, part):
-                print(f"invalid test case: '{line}': '{part}'", file=sys.stderr)
-                exit(113)
-            key = matched[1]
-            val = matched[2]
-            try:
-                val = int(val)
-            except ValueError:
-                pass
-            case[key] = val
-        tests.append(TestCase(spec=line, args=case))
-
-    if seed is not None:
-        for test in tests:
-            if "seed" in test.args:
-                test.args["seed"] = _combine(test.args["seed"], seed)
-    return tests
-
-
-def calculate_stats(durations: list[float]) -> Stats:
-    runs = len(durations)
-    total = sum(durations)
-    avg = total / runs
-    variance = sum((x - avg) ** 2 for x in durations)
-    std = math.sqrt(variance / (runs - 1)) if runs > 1 else 0.0
-    err = std / math.sqrt(runs) if runs > 0 else 0.0
-    return Stats(
-        runs=runs,
-        mean=avg,
-        std=std,
-        err=err,
-        best=float(min(durations)),
-        worst=float(max(durations)),
-    )
-
-
-def _clone_data(data):
-    if isinstance(data, tuple):
-        return tuple(_clone_data(x) for x in data)
-    if isinstance(data, list):
-        return [_clone_data(x) for x in data]
-    if isinstance(data, dict):
-        return {k: _clone_data(v) for k, v in data.items()}
-    if isinstance(data, torch.Tensor):
-        return data.clone()
-    return data
-
-
-def _run_single_test(test: TestCase):
-    from submission import custom_kernel
-
-    data = generate_input(**test.args)
-    torch.cuda.synchronize()
-    output = custom_kernel(_clone_data(data))
-    torch.cuda.synchronize()
-    return check_implementation(data, output)
-
-
-def run_single_test(pool: multiprocessing.Pool, test: TestCase):
-    return pool.apply(_run_single_test, (test,))
-
-
-def run_testing(logger: PopcornOutput, pool: multiprocessing.Pool, tests: list[TestCase]):
-    passed = True
-    logger.log("test-count", len(tests))
-    for idx, test in enumerate(tests):
-        logger.log(f"test.{idx}.spec", test.spec)
-        good, message = run_single_test(pool, test)
-        if good:
-            logger.log(f"test.{idx}.status", "pass")
-            if message:
-                logger.log(f"test.{idx}.message", message)
-        else:
-            logger.log(f"test.{idx}.status", "fail")
-            logger.log(f"test.{idx}.error", message)
-            passed = False
-    logger.log("check", "pass" if passed else "fail")
-    return 0 if passed else 112
-
-
-def _make_data_batch(test: TestCase, count: int):
-    args = dict(test.args)
-    data_list = []
-    for _ in range(count):
-        if "seed" in args:
-            args["seed"] += 42
-        data_list.append(generate_input(**args))
-    return data_list
-
-
-def _benchmark_batch_count(test: TestCase) -> int:
-    batch = int(test.args.get("batch", 1))
-    n = int(test.args.get("n", 1))
-    # Input storage is A. Keep the generated batch modest
-    # because large QR cases are already batched inside a single input.
-    bytes_per_input = (batch * n * n) * 4
-    if bytes_per_input <= 0:
-        return 1
-    return max(1, min(MAX_ITERATIONS_PER_BENCHMARK, BENCHMARK_INPUT_BYTES_TARGET // bytes_per_input))
-
-
-def _run_single_benchmark(
-    test: TestCase,
-    recheck: bool,
-    max_repeats: int,
-    max_time_ns: float,
-) -> Stats | Any:
-    from submission import custom_kernel
-
-    data_list = _make_data_batch(test, _benchmark_batch_count(test))
-    check_copy = _clone_data(data_list)
-
-    outputs = [custom_kernel(_clone_data(data)) for data in data_list]
-    for reference_data, output in zip(check_copy, outputs):
-        good, message = check_implementation(reference_data, output)
-        if not good:
-            return message
-
-    durations = []
-    bm_start_time = time.perf_counter_ns()
-    for i in range(max_repeats):
-        torch.cuda.synchronize()
-        clear_l2_cache()
-        start_event = torch.cuda.Event(enable_timing=True)
-        end_event = torch.cuda.Event(enable_timing=True)
-        start_event.record()
-        outputs = [custom_kernel(data) for data in data_list]
-        end_event.record()
-        torch.cuda.synchronize()
-        durations.append(start_event.elapsed_time(end_event) * 1e6 / len(data_list))
-
-        if recheck:
-            for reference_data, output in zip(check_copy, outputs):
-                good, message = check_implementation(reference_data, output)
-                if not good:
-                    return message
-
-        total_bm_duration = time.perf_counter_ns() - bm_start_time
-        if i > 1 and total_bm_duration > 1e8:
-            stats = calculate_stats(durations)
-            if (
-                stats.err / stats.mean < 0.001
-
or stats.mean * stats.runs > max_time_ns - or total_bm_duration > 120e9 - ): - break - - return calculate_stats(durations) - - -def run_single_benchmark( - pool: multiprocessing.Pool, - test: TestCase, - recheck: bool, - max_repeats: int, - max_time_ns: float, -): - return pool.apply(_run_single_benchmark, (test, recheck, max_repeats, max_time_ns)) - - -def run_benchmarking(logger: PopcornOutput, pool: multiprocessing.Pool, tests: list[TestCase]): - run_single_benchmark(pool, tests[0], False, 200, 10e7) - - passed = True - logger.log("benchmark-count", len(tests)) - for idx, test in enumerate(tests): - logger.log(f"benchmark.{idx}.spec", test.spec) - # recheck=True: re-validate the output of every timed iteration, not just - # the pre-timing warmup. Without this, the timed loop (which for the - # low-`count` shapes reuses one input object across all repeats) never - # re-checks its outputs, so a kernel that diverges only inside the timed - # region -- e.g. one that caches and replays an output keyed on the - # reused input -- is scored as fast without ever being caught locally. - # `leaderboard` mode already rechecks; this brings `benchmark` mode in - # line so a wrong timed output fails here too. 
- result = run_single_benchmark(pool, test, True, 200, 10e9) - if isinstance(result, Stats): - for field in dataclasses.fields(Stats): - logger.log(f"benchmark.{idx}.{field.name}", getattr(result, field.name)) - else: - logger.log(f"benchmark.{idx}.status", "fail") - logger.log(f"benchmark.{idx}.error", result) - passed = False - logger.log("check", "pass" if passed else "fail") - return 0 if passed else 112 - - -def main(): - fd = os.getenv("POPCORN_FD") - if not fd: - return 111 - if len(sys.argv) < 3: - return 2 - - mode = sys.argv[1] - seed = os.getenv("POPCORN_SEED") - os.unsetenv("POPCORN_SEED") - seed = int(seed) if seed else None - set_seed(seed or 42) - tests = get_test_cases(sys.argv[2], seed) - - with PopcornOutput(int(fd)) as logger: - mp_context = multiprocessing.get_context("spawn") - with mp_context.Pool(1) as pool: - if mode == "test": - return run_testing(logger, pool, tests) - if mode == "benchmark": - return run_benchmarking(logger, pool, tests) - if mode == "leaderboard": - for test in tests: - run_single_benchmark(pool, test, False, 1000, 5e8) - logger.log("benchmark-count", len(tests)) - passed = True - for idx, test in enumerate(tests): - logger.log(f"benchmark.{idx}.spec", test.spec) - result = run_single_benchmark(pool, test, True, 1000, 30e9) - if isinstance(result, Stats): - for field in dataclasses.fields(Stats): - logger.log(f"benchmark.{idx}.{field.name}", getattr(result, field.name)) - else: - logger.log(f"benchmark.{idx}.status", "fail") - logger.log(f"benchmark.{idx}.error", str(result)) - passed = False - break - logger.log("check", "pass" if passed else "fail") - return 0 if passed else 112 - if mode == "profile": - logger.log("check", "fail") - logger.log("error", "profile mode is not implemented for qr eval.py") - return 2 - return 2 - - -if __name__ == "__main__": - sys.exit(main()) diff --git a/problems/linalg/qr_py/reference.py b/problems/linalg/qr_py/reference.py deleted file mode 100644 index fc8ace77..00000000 --- 
a/problems/linalg/qr_py/reference.py +++ /dev/null @@ -1,192 +0,0 @@ -import torch -from task import input_t, output_t - - -_FACTOR_RTOL_FACTOR = 20.0 -_ORTH_RTOL_FACTOR = 100.0 - - -def _apply_column_scaling(a: torch.Tensor, cond: int) -> torch.Tensor: - # `cond` is a deterministic dynamic-range knob, not an exact condition number. - if cond: - n = a.shape[-1] - scales = torch.logspace(0.0, -float(cond), n, device=a.device, dtype=torch.float32) - return a * scales - return a.contiguous() - - -def _band_mask(n: int, bandwidth: int, device: torch.device) -> torch.Tensor: - idx = torch.arange(n, device=device) - return (idx[:, None] - idx[None, :]).abs() <= bandwidth - - -def generate_input(batch: int, n: int, cond: int, seed: int, case: str = "dense") -> input_t: - assert batch > 0, "batch must be positive" - assert n > 0, "n must be positive" - assert cond >= 0, "cond must be non-negative" - - device = "cuda" if torch.cuda.is_available() else "cpu" - gen = torch.Generator(device=device) - gen.manual_seed(seed) - - case = case.lower() - a = torch.randn((batch, n, n), device=device, dtype=torch.float32, generator=gen) - - if case == "dense": - a = _apply_column_scaling(a, cond) - elif case == "upper": - diag_boost = torch.linspace(1.0, 0.25, n, device=device, dtype=torch.float32) - a = torch.triu(a) - a.diagonal(dim1=-2, dim2=-1).add_(diag_boost) - a = _apply_column_scaling(a, cond) - elif case == "diagonal": - diag = torch.randn((batch, n), device=device, dtype=torch.float32, generator=gen) - diag = diag.sign().clamp(min=0.0).mul(2.0).sub(1.0) * torch.logspace( - 0.0, -float(max(cond, 2)), n, device=device, dtype=torch.float32 - ) - a = torch.diag_embed(diag) - elif case == "rankdef": - rank = max(1, (3 * n) // 4) - a[:, :, rank:] = 0.0 - a = _apply_column_scaling(a, cond) - elif case == "nearrank": - rank = max(1, (3 * n) // 4) - tail = n - rank - if tail > 0: - noise = torch.randn( - (batch, n, tail), device=device, dtype=torch.float32, generator=gen - ) - a[:, :, 
rank:] = a[:, :, :tail] + 1.0e-5 * noise - a = _apply_column_scaling(a, cond) - elif case == "clustered": - scales = torch.ones((n,), device=device, dtype=torch.float32) - scales[n // 2 :] = 4.0 * torch.finfo(torch.float32).eps - if n >= 8: - lo = max(0, n // 2 - 2) - hi = min(n, n // 2 + 2) - scales[lo:hi] = torch.sqrt(torch.tensor(torch.finfo(torch.float32).eps, device=device)) - a = a * scales - elif case == "band": - bandwidth = max(2, min(32, n // 32)) - a = a * _band_mask(n, bandwidth, device) - diag_boost = torch.linspace(1.0, 0.5, n, device=device, dtype=torch.float32) - a.diagonal(dim1=-2, dim2=-1).add_(diag_boost) - a = _apply_column_scaling(a, cond) - elif case == "nearcollinear": - base = torch.randn((batch, n, 1), device=device, dtype=torch.float32, generator=gen) - noise = torch.randn((batch, n, n), device=device, dtype=torch.float32, generator=gen) - a = base.expand(batch, n, n) + 1.0e-4 * noise - a = _apply_column_scaling(a, cond) - elif case == "rowscale": - row_cond = max(cond, 4) - scales = torch.logspace(0.0, -float(row_cond), n, device=device, dtype=torch.float32) - a = scales.reshape(1, n, 1) * a - else: - raise ValueError(f"unknown QR test case: {case}") - - return a.contiguous() - - -def ref_kernel(data: input_t) -> output_t: - # Starter/reference path: correctness first; submissions compete on speed. 
- return torch.geqrf(data) - - -def _property_rtol(n: int, factor: float) -> float: - eps = torch.finfo(torch.float32).eps - return factor * max(n, 1) * eps - - -def _scaled_residual( - residual: torch.Tensor, - scale: torch.Tensor, - n: int, -) -> torch.Tensor: - eps = torch.finfo(torch.float32).eps - return residual / (eps * max(n, 1) * scale.clamp_min(1e-30)) - - -def _matrix_l1_norm(value: torch.Tensor) -> torch.Tensor: - return torch.linalg.matrix_norm(value.double(), ord=1, dim=(-2, -1)) - - -def _check_tensor(name: str, value: torch.Tensor, shape: tuple[int, ...], device: torch.device) -> str | None: - if not isinstance(value, torch.Tensor): - return f"{name} must be a torch.Tensor" - if value.shape != shape: - return f"{name} shape must be {shape}, got {tuple(value.shape)}" - if value.dtype != torch.float32: - return f"{name} dtype must be torch.float32, got {value.dtype}" - if value.device != device: - return f"{name} must be on {device}, got {value.device}" - if not torch.isfinite(value).all().item(): - return f"{name} contains NaN or Inf" - return None - - -def check_implementation(data: input_t, output: output_t) -> tuple[bool, str]: - a = data - batch, n, _ = a.shape - factor_rtol = _property_rtol(n, _FACTOR_RTOL_FACTOR) - orth_rtol = _property_rtol(n, _ORTH_RTOL_FACTOR) - - if not isinstance(output, tuple) or len(output) != 2: - return False, "output must be a tuple `(H, tau)`" - - h, tau = output - error = _check_tensor("H", h, (batch, n, n), a.device) - if error is not None: - return False, error - error = _check_tensor("tau", tau, (batch, n), a.device) - if error is not None: - return False, error - - q = torch.linalg.householder_product(h, tau) - r = torch.triu(h) - a_check = a.double() - q_check = q.double() - r_check = r.double() - projected = q_check.transpose(-1, -2) @ a_check - factor_residual = _matrix_l1_norm(r_check - projected).amax() - factor_scale = _matrix_l1_norm(a_check).amax() - factor_allowed = factor_rtol * factor_scale - 
factor_scaled = _scaled_residual(factor_residual, factor_scale, n) - if factor_residual.item() > factor_allowed.item(): - return False, ( - "R - Q.T @ A is too large: " - f"residual={factor_residual.item():.3g}, allowed={factor_allowed.item():.3g}, " - f"scaled={factor_scaled.item():.3g}" - ) - - eye = torch.eye(n, device=a.device, dtype=torch.float64).expand(batch, n, n) - qtq = q_check.transpose(-1, -2) @ q_check - orth_residual = _matrix_l1_norm(qtq - eye).amax() - orth_scale = _matrix_l1_norm(eye).amax() - orth_allowed = orth_rtol * orth_scale - orth_scaled = _scaled_residual(orth_residual, orth_scale, n) - if orth_residual.item() > orth_allowed.item(): - return False, ( - "Q is not orthogonal enough: " - f"residual={orth_residual.item():.3g}, allowed={orth_allowed.item():.3g}, " - f"scaled={orth_scaled.item():.3g}" - ) - - lower = torch.tril(projected, diagonal=-1) - tri_residual = _matrix_l1_norm(lower).amax() - tri_scale = _matrix_l1_norm(a_check).amax() - tri_scaled = _scaled_residual(tri_residual, tri_scale, n) - - recon = q_check @ r_check - recon_residual = _matrix_l1_norm(recon - a_check).amax() - recon_scale = _matrix_l1_norm(a_check).amax() - recon_scaled = _scaled_residual(recon_residual, recon_scale, n) - - return True, ( - f"factor_rtol={factor_rtol:.3g}; " - f"orth_rtol={orth_rtol:.3g}; " - f"scaled_factor_residual={factor_scaled.item():.3g}; " - f"scaled_reconstruction_residual={recon_scaled.item():.3g}; " - f"scaled_triangular_residual={tri_scaled.item():.3g}; " - f"scaled_orthogonality_residual={orth_scaled.item():.3g}; " - f"batch={batch}; n={n}" - ) diff --git a/problems/linalg/qr_py/submission.py b/problems/linalg/qr_py/submission.py deleted file mode 100644 index ac92e0ac..00000000 --- a/problems/linalg/qr_py/submission.py +++ /dev/null @@ -1,6 +0,0 @@ -import torch -from task import input_t, output_t - - -def custom_kernel(data: input_t) -> output_t: - return torch.geqrf(data) diff --git a/problems/linalg/qr_py/task.py 
b/problems/linalg/qr_py/task.py deleted file mode 100644 index e0547dcc..00000000 --- a/problems/linalg/qr_py/task.py +++ /dev/null @@ -1,13 +0,0 @@ -import torch -from typing import NotRequired, TypeVar, TypedDict - -input_t = TypeVar("input_t", bound=torch.Tensor) -output_t = TypeVar("output_t", bound=tuple[torch.Tensor, torch.Tensor]) - - -class TestSpec(TypedDict): - batch: int - n: int - cond: int - seed: int - case: NotRequired[str] diff --git a/problems/linalg/qr_py/task.yml b/problems/linalg/qr_py/task.yml deleted file mode 100644 index 8e935eba..00000000 --- a/problems/linalg/qr_py/task.yml +++ /dev/null @@ -1,100 +0,0 @@ -# name: qr - -files: - - {"name": "submission.py", "source": "@SUBMISSION@"} - - {"name": "task.py", "source": "task.py"} - - {"name": "utils.py", "source": "../../pmpp_v2/utils.py"} - - {"name": "reference.py", "source": "reference.py"} - - {"name": "eval.py", "source": "eval.py"} - -lang: "py" - -description: | - Implement batched square compact-Householder QR factorization. - - Input is `A`, a `batch x n x n` CUDA tensor in `torch.float32`. - - Return `(H, tau)` in the same compact Householder convention as - `torch.geqrf(A)`. `H` is a `batch x n x n` FP32 tensor containing `R` in its - upper triangle and Householder vectors below the diagonal. `tau` is a - `batch x n` FP32 tensor containing reflector coefficients. The checker - materializes `Q = torch.linalg.householder_product(H, tau)`, uses - `R_factor = triu(H)`, and validates the LAPACK-style QR factorization - residual `R_factor - Q.T @ A` and orthogonality of `Q`. Since `R_factor` - is extracted with `triu`, triangularity is part of the factorization check: - if `Q.T @ A` has meaningful lower-triangular leakage, then it cannot match - `R_factor`. The checker reports that lower-triangular leakage and the - reconstruction residual as diagnostics. 
-
-  This shape set targets optimizer-style matrix statistics where gradients are
-  viewed as `[for_each..., basis_dim, contracted_dim]`, statistics are formed as
-  `G @ G.T`, and QR is run on square `basis_dim x basis_dim` matrices. Batched
-  `512 x 512` is especially important, while `1024`, `2048`, and `4096` cover
-  larger square factors.
-
-  Test and benchmark specs include a `cond` field. In this task `cond` is a
-  deterministic input-scaling knob, not an exact requested condition number:
-  dense cases multiply columns by `logspace(0, -cond, n)`, so larger `cond`
-  creates a wider dynamic range across columns. Some stress cases use their own
-  structure, such as rank-deficient, near-rank-deficient, banded, row-scaled,
-  near-collinear, upper-triangular, or clustered-scale inputs.
-
-  Correctness is a hard gate against the original FP32 input and the FP32
-  `torch.geqrf` compact-factor contract. Low-bit FP16, FP8, or NVFP4 work is
-  allowed only as an internal implementation strategy: returned factors must
-  still be FP32 and must satisfy the same QR invariants as an FP32
-  factorization. Residuals are measured in FP64 to reduce checker noise, but
-  the target tolerance is still FP32 accuracy. The numerical property tolerance
-  is purely relative, with no QR `atol`.
-  The hard gates are the LAPACK-style factor residual, which uses
-  `rtol = 20 * n * eps32`, and orthogonality, which uses
-  `rtol = 100 * n * eps32`, each applied to the corresponding matrix L1 norm.
-  Triangularity is reported as lower-triangular leakage in `Q.T @ A` and is
-  already implied by the factor residual against `triu(H)`.
-
-  Among passing submissions, ranking is by runtime using the geometric mean of
-  benchmark cases. We will also celebrate notable submissions beyond the main
-  leaderboard: the fastest, the most elegant, and the strangest working kernels.
-
-config:
-  main: "eval.py"
-
-templates:
-  Python: "submission.py"
-
-test_timeout: 240
-benchmark_timeout: 480
-ranked_timeout: 900
-ranking_by: "geom"
-gpus:
-  - B200
-
-tests:
-  - {"batch": 20, "n": 32, "cond": 1, "seed": 53124}
-  - {"batch": 40, "n": 176, "cond": 1, "seed": 3321}
-  - {"batch": 40, "n": 352, "cond": 1, "seed": 1200}
-  - {"batch": 16, "n": 512, "cond": 2, "seed": 32523}
-  - {"batch": 4, "n": 1024, "cond": 2, "seed": 4327}
-  - {"batch": 1, "n": 4096, "cond": 1, "seed": 75342}
-  - {"batch": 16, "n": 512, "cond": 4, "seed": 32524, "case": "dense"}
-  - {"batch": 16, "n": 512, "cond": 0, "seed": 32525, "case": "rankdef"}
-  - {"batch": 16, "n": 512, "cond": 0, "seed": 32526, "case": "clustered"}
-  - {"batch": 16, "n": 512, "cond": 0, "seed": 32527, "case": "band"}
-  - {"batch": 16, "n": 512, "cond": 0, "seed": 32528, "case": "rowscale"}
-  - {"batch": 16, "n": 512, "cond": 0, "seed": 32529, "case": "nearcollinear"}
-  - {"batch": 4, "n": 1024, "cond": 4, "seed": 4328, "case": "dense"}
-  - {"batch": 4, "n": 1024, "cond": 0, "seed": 4329, "case": "rankdef"}
-  - {"batch": 4, "n": 1024, "cond": 0, "seed": 4330, "case": "nearrank"}
-  - {"batch": 4, "n": 1024, "cond": 0, "seed": 4331, "case": "clustered"}
-  - {"batch": 2, "n": 2048, "cond": 2, "seed": 224466, "case": "dense"}
-  - {"batch": 2, "n": 2048, "cond": 0, "seed": 224467, "case": "rankdef"}
-  - {"batch": 1, "n": 4096, "cond": 0, "seed": 75343, "case": "upper"}
-
-benchmarks:
-  - {"batch": 20, "n": 32, "cond": 1, "seed": 43214}
-  - {"batch": 40, "n": 176, "cond": 1, "seed": 423011}
-  - {"batch": 40, "n": 352, "cond": 1, "seed": 123456}
-  - {"batch": 640, "n": 512, "cond": 2, "seed": 1029}
-  - {"batch": 60, "n": 1024, "cond": 2, "seed": 75342}
-  - {"batch": 8, "n": 2048, "cond": 1, "seed": 224466}
-  - {"batch": 2, "n": 4096, "cond": 1, "seed": 32412}
diff --git a/problems/linalg/qr_v2/eval.py b/problems/linalg/qr_v2/eval.py
deleted file mode 100644
index cd2c6bd3..00000000
---
a/problems/linalg/qr_v2/eval.py +++ /dev/null @@ -1,311 +0,0 @@ -import dataclasses -import math -import multiprocessing -import os -import re -import sys -import time -from pathlib import Path -from typing import Any, Optional - -import torch - -from reference import check_implementation, generate_input -from utils import clear_l2_cache, set_seed - -try: - from task import TestSpec -except ImportError: - TestSpec = dict - - -MAX_ITERATIONS_PER_BENCHMARK = 50 -BENCHMARK_INPUT_BYTES_TARGET = 256 * 1024 * 1024 - - -class PopcornOutput: - def __init__(self, fd: int): - self.file = os.fdopen(fd, "w") - os.set_inheritable(fd, False) - - def __enter__(self): - return self - - def __exit__(self, exc_type, exc_val, exc_tb): - self.file.close() - - def print(self, *args, **kwargs): - print(*args, **kwargs, file=self.file, flush=True) - - def log(self, key, value): - self.print(f"{key}: {value}") - - -@dataclasses.dataclass -class TestCase: - args: dict - spec: str - - -@dataclasses.dataclass -class Stats: - runs: int - mean: float - std: float - err: float - best: float - worst: float - - -def _combine(a: int, b: int) -> int: - return int(a + (a + b) * (a + b + 1) // 2) - - -def get_test_cases(file_name: str, seed: Optional[int]) -> list[TestCase]: - try: - content = Path(file_name).read_text() - except Exception as exc: - print(f"Could not open test file `{file_name}`: {exc}", file=sys.stderr) - exit(113) - - tests = [] - match = r"\s*([a-zA-Z]+):\s*([a-zA-Z]+|[+-]?[0-9]+)\s*" - for line in content.splitlines(): - case = {} - for part in line.split(";"): - matched = re.match(match, part) - if not re.fullmatch(match, part): - print(f"invalid test case: '{line}': '{part}'", file=sys.stderr) - exit(113) - key = matched[1] - val = matched[2] - try: - val = int(val) - except ValueError: - pass - case[key] = val - tests.append(TestCase(spec=line, args=case)) - - if seed is not None: - for test in tests: - if "seed" in test.args: - test.args["seed"] = _combine(test.args["seed"], 
seed) - return tests - - -def calculate_stats(durations: list[float]) -> Stats: - runs = len(durations) - total = sum(durations) - avg = total / runs - variance = sum((x - avg) ** 2 for x in durations) - std = math.sqrt(variance / (runs - 1)) if runs > 1 else 0.0 - err = std / math.sqrt(runs) if runs > 0 else 0.0 - return Stats( - runs=runs, - mean=avg, - std=std, - err=err, - best=float(min(durations)), - worst=float(max(durations)), - ) - - -def _clone_data(data): - if isinstance(data, tuple): - return tuple(_clone_data(x) for x in data) - if isinstance(data, list): - return [_clone_data(x) for x in data] - if isinstance(data, dict): - return {k: _clone_data(v) for k, v in data.items()} - if isinstance(data, torch.Tensor): - return data.clone() - return data - - -def _run_single_test(test: TestCase): - from submission import custom_kernel - - data = generate_input(**test.args) - torch.cuda.synchronize() - output = custom_kernel(_clone_data(data)) - torch.cuda.synchronize() - return check_implementation(data, output) - - -def run_single_test(pool: multiprocessing.Pool, test: TestCase): - return pool.apply(_run_single_test, (test,)) - - -def run_testing(logger: PopcornOutput, pool: multiprocessing.Pool, tests: list[TestCase]): - passed = True - logger.log("test-count", len(tests)) - for idx, test in enumerate(tests): - logger.log(f"test.{idx}.spec", test.spec) - good, message = run_single_test(pool, test) - if good: - logger.log(f"test.{idx}.status", "pass") - if message: - logger.log(f"test.{idx}.message", message) - else: - logger.log(f"test.{idx}.status", "fail") - logger.log(f"test.{idx}.error", message) - passed = False - logger.log("check", "pass" if passed else "fail") - return 0 if passed else 112 - - -def _make_data_batch(test: TestCase, count: int): - args = dict(test.args) - data_list = [] - for _ in range(count): - if "seed" in args: - args["seed"] += 42 - data_list.append(generate_input(**args)) - return data_list - - -def _benchmark_batch_count(test: 
TestCase) -> int: - batch = int(test.args.get("batch", 1)) - n = int(test.args.get("n", 1)) - # Input storage is A. Keep the generated batch modest - # because large QR cases are already batched inside a single input. - bytes_per_input = (batch * n * n) * 4 - if bytes_per_input <= 0: - return 1 - return max(1, min(MAX_ITERATIONS_PER_BENCHMARK, BENCHMARK_INPUT_BYTES_TARGET // bytes_per_input)) - - -def _run_single_benchmark( - test: TestCase, - recheck: bool, - max_repeats: int, - max_time_ns: float, -) -> Stats | Any: - from submission import custom_kernel - - data_list = _make_data_batch(test, _benchmark_batch_count(test)) - check_copy = _clone_data(data_list) - - outputs = [custom_kernel(_clone_data(data)) for data in data_list] - for reference_data, output in zip(check_copy, outputs): - good, message = check_implementation(reference_data, output) - if not good: - return message - - durations = [] - bm_start_time = time.perf_counter_ns() - for i in range(max_repeats): - torch.cuda.synchronize() - clear_l2_cache() - start_event = torch.cuda.Event(enable_timing=True) - end_event = torch.cuda.Event(enable_timing=True) - start_event.record() - outputs = [custom_kernel(data) for data in data_list] - end_event.record() - torch.cuda.synchronize() - durations.append(start_event.elapsed_time(end_event) * 1e6 / len(data_list)) - - if recheck: - for reference_data, output in zip(check_copy, outputs): - good, message = check_implementation(reference_data, output) - if not good: - return message - - total_bm_duration = time.perf_counter_ns() - bm_start_time - if i > 1 and total_bm_duration > 1e8: - stats = calculate_stats(durations) - if ( - stats.err / stats.mean < 0.001 - or stats.mean * stats.runs > max_time_ns - or total_bm_duration > 120e9 - ): - break - - return calculate_stats(durations) - - -def run_single_benchmark( - pool: multiprocessing.Pool, - test: TestCase, - recheck: bool, - max_repeats: int, - max_time_ns: float, -): - return pool.apply(_run_single_benchmark, 
(test, recheck, max_repeats, max_time_ns)) - - -def run_benchmarking(logger: PopcornOutput, pool: multiprocessing.Pool, tests: list[TestCase]): - run_single_benchmark(pool, tests[0], False, 200, 10e7) - - passed = True - logger.log("benchmark-count", len(tests)) - for idx, test in enumerate(tests): - logger.log(f"benchmark.{idx}.spec", test.spec) - # recheck=True: re-validate the output of every timed iteration, not just - # the pre-timing warmup. Without this, the timed loop (which for the - # low-`count` shapes reuses one input object across all repeats) never - # re-checks its outputs, so a kernel that diverges only inside the timed - # region -- e.g. one that caches and replays an output keyed on the - # reused input -- is scored as fast without ever being caught locally. - # `leaderboard` mode already rechecks; this brings `benchmark` mode in - # line so a wrong timed output fails here too. - result = run_single_benchmark(pool, test, True, 200, 10e9) - if isinstance(result, Stats): - for field in dataclasses.fields(Stats): - logger.log(f"benchmark.{idx}.{field.name}", getattr(result, field.name)) - else: - logger.log(f"benchmark.{idx}.status", "fail") - logger.log(f"benchmark.{idx}.error", result) - passed = False - logger.log("check", "pass" if passed else "fail") - return 0 if passed else 112 - - -def main(): - fd = os.getenv("POPCORN_FD") - if not fd: - return 111 - if len(sys.argv) < 3: - return 2 - - mode = sys.argv[1] - seed = os.getenv("POPCORN_SEED") - os.unsetenv("POPCORN_SEED") - seed = int(seed) if seed else None - set_seed(seed or 42) - tests = get_test_cases(sys.argv[2], seed) - - with PopcornOutput(int(fd)) as logger: - mp_context = multiprocessing.get_context("spawn") - with mp_context.Pool(1) as pool: - if mode == "test": - return run_testing(logger, pool, tests) - if mode == "benchmark": - return run_benchmarking(logger, pool, tests) - if mode == "leaderboard": - for test in tests: - run_single_benchmark(pool, test, False, 1000, 5e8) - 
logger.log("benchmark-count", len(tests)) - passed = True - for idx, test in enumerate(tests): - logger.log(f"benchmark.{idx}.spec", test.spec) - result = run_single_benchmark(pool, test, True, 1000, 30e9) - if isinstance(result, Stats): - for field in dataclasses.fields(Stats): - logger.log(f"benchmark.{idx}.{field.name}", getattr(result, field.name)) - else: - logger.log(f"benchmark.{idx}.status", "fail") - logger.log(f"benchmark.{idx}.error", str(result)) - passed = False - break - logger.log("check", "pass" if passed else "fail") - return 0 if passed else 112 - if mode == "profile": - logger.log("check", "fail") - logger.log("error", "profile mode is not implemented for qr eval.py") - return 2 - return 2 - - -if __name__ == "__main__": - sys.exit(main()) diff --git a/problems/linalg/qr_v2/reference.py b/problems/linalg/qr_v2/reference.py deleted file mode 100644 index 539f9094..00000000 --- a/problems/linalg/qr_v2/reference.py +++ /dev/null @@ -1,266 +0,0 @@ -import torch -from task import input_t, output_t - - -_FACTOR_RTOL_FACTOR = 20.0 -_ORTH_RTOL_FACTOR = 100.0 - - -def _apply_column_scaling(a: torch.Tensor, cond: int) -> torch.Tensor: - # `cond` is a deterministic dynamic-range knob, not an exact condition number. - if cond: - n = a.shape[-1] - scales = torch.logspace(0.0, -float(cond), n, device=a.device, dtype=torch.float32) - return a * scales - return a.contiguous() - - -def _band_mask(n: int, bandwidth: int, device: torch.device) -> torch.Tensor: - idx = torch.arange(n, device=device) - return (idx[:, None] - idx[None, :]).abs() <= bandwidth - - -# Per-matrix conditioning profiles drawn for the "mixed" case. "dense" is the -# well-conditioned majority; the rest are the ill-conditioned stress structures. -_MIXED_PROFILES = ("dense", "rankdef", "nearrank", "clustered", "band", "rowscale", "nearcollinear") -# Relative sampling weights (normalized by torch.multinomial); dense ~= 50%. 
-_MIXED_WEIGHTS = (6.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0) - - -def _apply_case(a: torch.Tensor, case: str, cond: int, gen: torch.Generator) -> torch.Tensor: - # Apply one conditioning profile to an already-drawn base batch `a` of shape - # (m, n, n), drawing any case-specific extra randomness from `gen`. Factored - # out of generate_input so the homogeneous cases and the per-matrix "mixed" - # case share a single implementation. The draw order (base first in the - # caller, then the case extras here) matches the original code, so every - # homogeneous case produces bit-for-bit identical data to before. - m, n = a.shape[0], a.shape[-1] - device = a.device - if case == "dense": - a = _apply_column_scaling(a, cond) - elif case == "upper": - diag_boost = torch.linspace(1.0, 0.25, n, device=device, dtype=torch.float32) - a = torch.triu(a) - a.diagonal(dim1=-2, dim2=-1).add_(diag_boost) - a = _apply_column_scaling(a, cond) - elif case == "diagonal": - diag = torch.randn((m, n), device=device, dtype=torch.float32, generator=gen) - diag = diag.sign().clamp(min=0.0).mul(2.0).sub(1.0) * torch.logspace( - 0.0, -float(max(cond, 2)), n, device=device, dtype=torch.float32 - ) - a = torch.diag_embed(diag) - elif case == "rankdef": - rank = max(1, (3 * n) // 4) - a[:, :, rank:] = 0.0 - a = _apply_column_scaling(a, cond) - elif case == "nearrank": - rank = max(1, (3 * n) // 4) - tail = n - rank - if tail > 0: - noise = torch.randn( - (m, n, tail), device=device, dtype=torch.float32, generator=gen - ) - a[:, :, rank:] = a[:, :, :tail] + 1.0e-5 * noise - a = _apply_column_scaling(a, cond) - elif case == "clustered": - scales = torch.ones((n,), device=device, dtype=torch.float32) - scales[n // 2 :] = 4.0 * torch.finfo(torch.float32).eps - if n >= 8: - lo = max(0, n // 2 - 2) - hi = min(n, n // 2 + 2) - scales[lo:hi] = torch.sqrt(torch.tensor(torch.finfo(torch.float32).eps, device=device)) - a = a * scales - elif case == "band": - bandwidth = max(2, min(32, n // 32)) - a = a * 
_band_mask(n, bandwidth, device) - diag_boost = torch.linspace(1.0, 0.5, n, device=device, dtype=torch.float32) - a.diagonal(dim1=-2, dim2=-1).add_(diag_boost) - a = _apply_column_scaling(a, cond) - elif case == "nearcollinear": - base = torch.randn((m, n, 1), device=device, dtype=torch.float32, generator=gen) - noise = torch.randn((m, n, n), device=device, dtype=torch.float32, generator=gen) - a = base.expand(m, n, n) + 1.0e-4 * noise - a = _apply_column_scaling(a, cond) - elif case == "rowscale": - row_cond = max(cond, 4) - scales = torch.logspace(0.0, -float(row_cond), n, device=device, dtype=torch.float32) - a = scales.reshape(1, n, 1) * a - else: - raise ValueError(f"unknown QR test case: {case}") - return a - - -def _generate_mixed(a: torch.Tensor, cond: int, gen: torch.Generator) -> torch.Tensor: - # Heterogeneous batch: assign each matrix an independent conditioning profile - # at a RANDOM position in the batch (seeded, so still deterministic), so - # well- and ill-conditioned matrices are interleaved rather than uniform - # across the batch. This matches the real optimizer-statistics regime (the - # per-layer / per-block factors have wildly different conditioning) and it - # removes the loophole where a kernel samples a few matrices, concludes the - # whole batch is well-conditioned, and routes it all to a fast path that is - # only numerically valid for well-conditioned inputs. With a mix present, - # passing the correctness gate requires handling each matrix on its merits. - m = a.shape[0] - device = a.device - weights = torch.tensor(_MIXED_WEIGHTS, dtype=torch.float32, device=device) - labels = torch.multinomial(weights, m, replacement=True, generator=gen) - # Guarantee both a well-conditioned and an ill-conditioned matrix are present. - # (Only relevant for tiny batches; large batches get both with high prob.) 
- if m >= 2: - is_dense = labels == 0 - if not bool(is_dense.any()): - labels[int(torch.randint(0, m, (1,), device=device, generator=gen))] = 0 - elif bool(is_dense.all()): - pos = int(torch.randint(0, m, (1,), device=device, generator=gen)) - labels[pos] = int(torch.randint(1, len(_MIXED_PROFILES), (1,), device=device, generator=gen)) - # Process profiles in fixed order over the present labels so the RNG draws - # inside _apply_case are deterministic for a given seed. - for k, prof in enumerate(_MIXED_PROFILES): - mask = labels == k - if bool(mask.any()): - a[mask] = _apply_case(a[mask], prof, cond, gen) - return a - - -def generate_input(batch: int, n: int, cond: int, seed: int, case: str = "dense") -> input_t: - assert batch > 0, "batch must be positive" - assert n > 0, "n must be positive" - assert cond >= 0, "cond must be non-negative" - - device = "cuda" if torch.cuda.is_available() else "cpu" - gen = torch.Generator(device=device) - gen.manual_seed(seed) - - case = case.lower() - a = torch.randn((batch, n, n), device=device, dtype=torch.float32, generator=gen) - - if case == "mixed": - a = _generate_mixed(a, cond, gen) - else: - a = _apply_case(a, case, cond, gen) - - return a.contiguous() - - -def ref_kernel(data: input_t) -> output_t: - # Starter/reference path: correctness first; submissions compete on speed. 
- return torch.geqrf(data) - - -def _property_rtol(n: int, factor: float) -> float: - eps = torch.finfo(torch.float32).eps - return factor * max(n, 1) * eps - - -def _scaled_residual( - residual: torch.Tensor, - scale: torch.Tensor, - n: int, -) -> torch.Tensor: - eps = torch.finfo(torch.float32).eps - return residual / (eps * max(n, 1) * scale.clamp_min(1e-30)) - - -def _matrix_l1_norm(value: torch.Tensor) -> torch.Tensor: - return torch.linalg.matrix_norm(value.double(), ord=1, dim=(-2, -1)) - - -def _check_tensor(name: str, value: torch.Tensor, shape: tuple[int, ...], device: torch.device) -> str | None: - if not isinstance(value, torch.Tensor): - return f"{name} must be a torch.Tensor" - if value.shape != shape: - return f"{name} shape must be {shape}, got {tuple(value.shape)}" - if value.dtype != torch.float32: - return f"{name} dtype must be torch.float32, got {value.dtype}" - if value.device != device: - return f"{name} must be on {device}, got {value.device}" - if not torch.isfinite(value).all().item(): - return f"{name} contains NaN or Inf" - return None - - -def check_implementation(data: input_t, output: output_t) -> tuple[bool, str]: - a = data - batch, n, _ = a.shape - factor_rtol = _property_rtol(n, _FACTOR_RTOL_FACTOR) - orth_rtol = _property_rtol(n, _ORTH_RTOL_FACTOR) - - if not isinstance(output, tuple) or len(output) != 2: - return False, "output must be a tuple `(H, tau)`" - - h, tau = output - error = _check_tensor("H", h, (batch, n, n), a.device) - if error is not None: - return False, error - error = _check_tensor("tau", tau, (batch, n), a.device) - if error is not None: - return False, error - - q = torch.linalg.householder_product(h, tau) - r = torch.triu(h) - if not torch.isfinite(q).all().item(): - return False, "Q materialized from `(H, tau)` contains NaN or Inf" - if not torch.isfinite(r).all().item(): - return False, "R extracted from `triu(H)` contains NaN or Inf" - - a_check = a.double() - q_check = q.double() - r_check = r.double() - 
projected = q_check.transpose(-1, -2) @ a_check - if not torch.isfinite(projected).all().item(): - return False, "Q.T @ A contains NaN or Inf" - - factor_residual = _matrix_l1_norm(r_check - projected) - factor_scale = _matrix_l1_norm(a_check) - factor_allowed = factor_rtol * factor_scale - factor_scaled = _scaled_residual(factor_residual, factor_scale, n) - if not torch.isfinite(factor_scaled).all().item(): - return False, "R - Q.T @ A residual produced NaN or Inf" - factor_failed = factor_residual > factor_allowed - if bool(factor_failed.any().item()): - worst = int(factor_scaled.argmax().item()) - return False, ( - "R - Q.T @ A is too large: " - f"matrix={worst}, residual={factor_residual[worst].item():.3g}, " - f"allowed={factor_allowed[worst].item():.3g}, " - f"scaled={factor_scaled[worst].item():.3g}" - ) - - eye = torch.eye(n, device=a.device, dtype=torch.float64).expand(batch, n, n) - qtq = q_check.transpose(-1, -2) @ q_check - if not torch.isfinite(qtq).all().item(): - return False, "Q.T @ Q contains NaN or Inf" - orth_residual = _matrix_l1_norm(qtq - eye).amax() - orth_scale = _matrix_l1_norm(eye).amax() - orth_allowed = orth_rtol * orth_scale - orth_scaled = _scaled_residual(orth_residual, orth_scale, n) - if not torch.isfinite(orth_scaled).all().item(): - return False, "Q.T @ Q residual produced NaN or Inf" - if orth_residual.item() > orth_allowed.item(): - return False, ( - "Q is not orthogonal enough: " - f"residual={orth_residual.item():.3g}, allowed={orth_allowed.item():.3g}, " - f"scaled={orth_scaled.item():.3g}" - ) - - lower = torch.tril(projected, diagonal=-1) - tri_residual = _matrix_l1_norm(lower).amax() - tri_scale = _matrix_l1_norm(a_check).amax() - tri_scaled = _scaled_residual(tri_residual, tri_scale, n) - - recon = q_check @ r_check - if not torch.isfinite(recon).all().item(): - return False, "Q @ R contains NaN or Inf" - recon_residual = _matrix_l1_norm(recon - a_check).amax() - recon_scale = _matrix_l1_norm(a_check).amax() - 
recon_scaled = _scaled_residual(recon_residual, recon_scale, n) - - return True, ( - f"factor_rtol={factor_rtol:.3g}; " - f"orth_rtol={orth_rtol:.3g}; " - f"scaled_factor_residual={factor_scaled.amax().item():.3g}; " - f"scaled_reconstruction_residual={recon_scaled.item():.3g}; " - f"scaled_triangular_residual={tri_scaled.item():.3g}; " - f"scaled_orthogonality_residual={orth_scaled.item():.3g}; " - f"batch={batch}; n={n}" - ) diff --git a/problems/linalg/qr_v2/submission.py b/problems/linalg/qr_v2/submission.py deleted file mode 100644 index ac92e0ac..00000000 --- a/problems/linalg/qr_v2/submission.py +++ /dev/null @@ -1,6 +0,0 @@ -import torch -from task import input_t, output_t - - -def custom_kernel(data: input_t) -> output_t: - return torch.geqrf(data) diff --git a/problems/linalg/qr_v2/task.py b/problems/linalg/qr_v2/task.py deleted file mode 100644 index e0547dcc..00000000 --- a/problems/linalg/qr_v2/task.py +++ /dev/null @@ -1,13 +0,0 @@ -import torch -from typing import NotRequired, TypeVar, TypedDict - -input_t = TypeVar("input_t", bound=torch.Tensor) -output_t = TypeVar("output_t", bound=tuple[torch.Tensor, torch.Tensor]) - - -class TestSpec(TypedDict): - batch: int - n: int - cond: int - seed: int - case: NotRequired[str] diff --git a/problems/linalg/qr_v2/task.yml b/problems/linalg/qr_v2/task.yml deleted file mode 100644 index 0da22d88..00000000 --- a/problems/linalg/qr_v2/task.yml +++ /dev/null @@ -1,121 +0,0 @@ -# name: qr_v2 - -files: - - {"name": "submission.py", "source": "@SUBMISSION@"} - - {"name": "task.py", "source": "task.py"} - - {"name": "utils.py", "source": "../../pmpp_v2/utils.py"} - - {"name": "reference.py", "source": "reference.py"} - - {"name": "eval.py", "source": "eval.py"} - -lang: "py" - -description: | - Implement batched square compact-Householder QR factorization. - - Input is `A`, a `batch x n x n` CUDA tensor in `torch.float32`. - - Return `(H, tau)` in the same compact Householder convention as - `torch.geqrf(A)`. 
`H` is a `batch x n x n` FP32 tensor containing `R` in its - upper triangle and Householder vectors below the diagonal. `tau` is a - `batch x n` FP32 tensor containing reflector coefficients. The checker - materializes `Q = torch.linalg.householder_product(H, tau)`, uses - `R_factor = triu(H)`, and validates the LAPACK-style QR factorization - residual `R_factor - Q.T @ A` and orthogonality of `Q`. Since `R_factor` - is extracted with `triu`, triangularity is part of the factorization check: - if `Q.T @ A` has meaningful lower-triangular leakage, then it cannot match - `R_factor`. The checker reports that lower-triangular leakage and the - reconstruction residual as diagnostics. - - This shape set targets optimizer-style matrix statistics where gradients are - viewed as `[for_each..., basis_dim, contracted_dim]`, statistics are formed as - `G @ G.T`, and QR is run on square `basis_dim x basis_dim` matrices. Batched - `512 x 512` is especially important, while `1024`, `2048`, and `4096` cover - larger square factors. - - Test and benchmark specs include a `cond` field. In this task `cond` is a - deterministic input-scaling knob, not an exact requested condition number: - dense cases multiply columns by `logspace(0, -cond, n)`, so larger `cond` - creates a wider dynamic range across columns. Some stress cases use their own - structure, such as rank-deficient, near-rank-deficient, banded, row-scaled, - near-collinear, upper-triangular, or clustered-scale inputs. - - The `mixed` case builds a heterogeneous batch: each matrix is independently - assigned a conditioning profile (a well-conditioned dense majority interleaved - with the ill-conditioned stress structures above) at a random position in the - batch. This mirrors the real optimizer-statistics regime, where the per-layer - or per-block factors batched into one call have widely varying conditioning, - rather than all sharing one structure. 
The benchmark set (not just the test - set) now includes both `mixed` batches and fully ill-conditioned homogeneous - batches, so conditioning robustness is ranked, not only gated: an - implementation cannot inspect a few matrices, decide the whole batch is - well-conditioned, and route it to a path that is only valid for well-conditioned - inputs, and the runtime cost of the accurate path on hard inputs is part of the - score. Each matrix must be factored correctly on its own merits. - - Correctness is a hard gate against the original FP32 input and the FP32 - `torch.geqrf` compact-factor contract. Low-bit FP16, FP8, or NVFP4 work is - allowed only as an internal implementation strategy: returned factors must - still be FP32 and must satisfy the same QR invariants as an FP32 - factorization. Residuals are measured in FP64 to reduce checker noise, but - the target tolerance is still FP32 accuracy. The numerical property tolerance - is purely relative, with no QR `atol`. - The hard gates are the LAPACK-style factor residual, which uses - `rtol = 20 * n * eps32`, and orthogonality, which uses - `rtol = 100 * n * eps32`, each applied to the corresponding matrix L1 norm. - Triangularity is reported as lower-triangular leakage in `Q.T @ A` and is - already implied by the factor residual against `triu(H)`. - - Among passing submissions, ranking is by runtime using the geometric mean of - benchmark cases. We will also celebrate notable submissions beyond the main - leaderboard: the fastest, the most elegant, and the strangest working kernels. 
- -config: - main: "eval.py" - -templates: - Python: "submission.py" - -test_timeout: 240 -benchmark_timeout: 480 -ranked_timeout: 900 -ranking_by: "geom" -gpus: - - B200 - -tests: - - {"batch": 20, "n": 32, "cond": 1, "seed": 53124} - - {"batch": 40, "n": 176, "cond": 1, "seed": 3321} - - {"batch": 40, "n": 352, "cond": 1, "seed": 1200} - - {"batch": 16, "n": 512, "cond": 2, "seed": 32523} - - {"batch": 4, "n": 1024, "cond": 2, "seed": 4327} - - {"batch": 1, "n": 4096, "cond": 1, "seed": 75342} - - {"batch": 16, "n": 512, "cond": 4, "seed": 32524, "case": "dense"} - - {"batch": 16, "n": 512, "cond": 0, "seed": 32525, "case": "rankdef"} - - {"batch": 16, "n": 512, "cond": 0, "seed": 32526, "case": "clustered"} - - {"batch": 16, "n": 512, "cond": 0, "seed": 32527, "case": "band"} - - {"batch": 16, "n": 512, "cond": 0, "seed": 32528, "case": "rowscale"} - - {"batch": 16, "n": 512, "cond": 0, "seed": 32529, "case": "nearcollinear"} - - {"batch": 4, "n": 1024, "cond": 4, "seed": 4328, "case": "dense"} - - {"batch": 4, "n": 1024, "cond": 0, "seed": 4329, "case": "rankdef"} - - {"batch": 4, "n": 1024, "cond": 0, "seed": 4330, "case": "nearrank"} - - {"batch": 4, "n": 1024, "cond": 0, "seed": 4331, "case": "clustered"} - - {"batch": 2, "n": 2048, "cond": 2, "seed": 224466, "case": "dense"} - - {"batch": 2, "n": 2048, "cond": 0, "seed": 224467, "case": "rankdef"} - - {"batch": 1, "n": 4096, "cond": 0, "seed": 75343, "case": "upper"} - - {"batch": 16, "n": 512, "cond": 2, "seed": 32530, "case": "mixed"} - - {"batch": 4, "n": 1024, "cond": 2, "seed": 4332, "case": "mixed"} - - {"batch": 2, "n": 2048, "cond": 2, "seed": 224468, "case": "mixed"} - -benchmarks: - - {"batch": 20, "n": 32, "cond": 1, "seed": 43214} - - {"batch": 40, "n": 176, "cond": 1, "seed": 423011} - - {"batch": 40, "n": 352, "cond": 1, "seed": 123456} - - {"batch": 640, "n": 512, "cond": 2, "seed": 1029} - - {"batch": 60, "n": 1024, "cond": 2, "seed": 75342} - - {"batch": 8, "n": 2048, "cond": 1, "seed": 
224466} - - {"batch": 2, "n": 4096, "cond": 1, "seed": 32412} - - {"batch": 640, "n": 512, "cond": 2, "seed": 770001, "case": "mixed"} - - {"batch": 60, "n": 1024, "cond": 2, "seed": 770002, "case": "mixed"} - - {"batch": 640, "n": 512, "cond": 0, "seed": 770003, "case": "rankdef"} - - {"batch": 640, "n": 512, "cond": 0, "seed": 770004, "case": "clustered"} - - {"batch": 60, "n": 1024, "cond": 0, "seed": 770005, "case": "nearrank"} diff --git a/problems/nvidia.yaml b/problems/nvidia.yaml index f38e8bda..8b1ebf7b 100644 --- a/problems/nvidia.yaml +++ b/problems/nvidia.yaml @@ -27,7 +27,7 @@ problems: - B200 - directory: nvidia/nvfp4_group_gemm name: nvfp4_group_gemm - deadline: "2026-02-21 7:30" + deadline: "2026-02-20" gpus: - B200 - NVIDIA diff --git a/problems/nvidia/eval.py b/problems/nvidia/eval.py index 252f35e4..92906b3f 100644 --- a/problems/nvidia/eval.py +++ b/problems/nvidia/eval.py @@ -405,7 +405,7 @@ def run_single_profile(test: TestCase, pool: multiprocessing.Pool) -> str: """ Runs a single profiling activity in another process. 
""" - if bool(os.getenv("POPCORN_NCU", "0")): + if int(os.getenv("POPCORN_NCU", "0")) == 1: return pool.apply(_run_single_profile_ncu, (test,)) else: return pool.apply(_run_single_profile_torch, (test,)) @@ -420,7 +420,8 @@ def run_profiling( report = run_single_profile(test, pool) logger.log( f"benchmark.{idx}.report", - base64.b64encode(report.encode("utf-8"), b"+*").decode("utf-8"), + #base64.b64encode(report.encode("utf-8"), b"+*").decode("utf-8"), + report ) logger.log("check", "pass") return 0 diff --git a/problems/pmpp/vectoradd_py/submission.py b/problems/pmpp/vectoradd_py/submission.py deleted file mode 100644 index 0d2ad435..00000000 --- a/problems/pmpp/vectoradd_py/submission.py +++ /dev/null @@ -1,6 +0,0 @@ -from task import input_t, output_t - - -def custom_kernel(data: input_t) -> output_t: - A, B = data - return A + B diff --git a/problems/pmpp_v2/vectoradd_py/solutions/correct/submission_cuda_inline.py b/problems/pmpp_v2/vectoradd_py/solutions/correct/submission_cuda_inline.py index ecd070b4..d6f71050 100644 --- a/problems/pmpp_v2/vectoradd_py/solutions/correct/submission_cuda_inline.py +++ b/problems/pmpp_v2/vectoradd_py/solutions/correct/submission_cuda_inline.py @@ -48,7 +48,7 @@ add_cpp_source = """ #include -torch::Tensor add_cuda(torch::Tensor A, torch::Tensor B, torch::Tensor C); +torch::Tensor add_cuda(torch::Tensor A, torch::Tensor B); """ add_module = load_inline( @@ -59,10 +59,10 @@ verbose=True, ) -def add(A, B, C): - if not A.is_cuda or not B.is_cuda or not C.is_cuda: - raise RuntimeError("All tensors must be on GPU") - return add_module.add_cuda(A, B, C) +def add(A, B): + if not A.is_cuda or not B.is_cuda: + raise RuntimeError("Both tensors must be on GPU") + return add_module.add_cuda(A, B) def custom_kernel(data: input_t) -> output_t: """ @@ -72,13 +72,12 @@ def custom_kernel(data: input_t) -> output_t: Returns: Tensor containing element-wise sum. 
""" - A, B, C = data + A, B = data - assert A.is_cuda and B.is_cuda and C.is_cuda, "Input/output tensors must be on GPU" + assert A.is_cuda and B.is_cuda, "Input tensors must be on GPU" assert A.shape == B.shape, "Input tensors must have the same shape" - assert C.shape == A.shape, "Output tensor and input tensors must have the same shape" - assert A.dtype == torch.float16 and B.dtype == torch.float16 and C.dtype == torch.float16, "Input/output tensors must be float16" + assert A.dtype == torch.float16 and B.dtype == torch.float16, "Input tensors must be float16" # Simply reuse the existing add function we already defined # This avoids the compilation issues with the inline kernel - return add(A, B, C) + return add(A, B) diff --git a/problems/pmpp_v2/vectoradd_py/submission.py b/problems/pmpp_v2/vectoradd_py/submission.py deleted file mode 100644 index 918a1eb8..00000000 --- a/problems/pmpp_v2/vectoradd_py/submission.py +++ /dev/null @@ -1,7 +0,0 @@ -from task import input_t, output_t - - -def custom_kernel(data: input_t) -> output_t: - A, B, output = data - output[...] 
= A + B - return output diff --git a/problems/princeton/cross_entropy_py/eval.py b/problems/princeton/cross_entropy_py/eval.py deleted file mode 100644 index 65124d54..00000000 --- a/problems/princeton/cross_entropy_py/eval.py +++ /dev/null @@ -1,351 +0,0 @@ -import dataclasses -import math -import os -import random -import re -import statistics -import sys -from pathlib import Path - -import torch -import torch.nn.functional as F - -from reference import ( - ATOL, - DTYPE, - RTOL, - generate_inputs, - reference_backward, - reference_forward, -) - - -# Original eval parameters -B = 4_096 -WARMUP_ITERS = 20 -BENCH_ITERS = 100 - - -def make_seed_schedule(): - total = WARMUP_ITERS + 3 * BENCH_ITERS - seeds = random.SystemRandom().sample(range(1, 2**31 - 1), total) - warmup_end = WARMUP_ITERS - forward_end = warmup_end + BENCH_ITERS - backward_end = forward_end + BENCH_ITERS - return { - "warmup": seeds[:warmup_end], - "forward": seeds[warmup_end:forward_end], - "backward": seeds[forward_end:backward_end], - "combined": seeds[backward_end:], - } - - -class PopcornOutput: - def __init__(self, fd: int): - self.file = os.fdopen(fd, "w") - os.set_inheritable(fd, False) - - def __enter__(self): - return self - - def __exit__(self, exc_type, exc_val, exc_tb): - self.file.close() - - def print(self, *args, **kwargs): - print(*args, **kwargs, file=self.file, flush=True) - - def log(self, key, value): - self.print(f"{key}: {value}") - - -@dataclasses.dataclass -class TestCase: - args: dict - spec: str - - -@dataclasses.dataclass -class Stats: - runs: int - mean: float - std: float - err: float - best: float - worst: float - fwd_bw: float - bwd_bw: float - combined_bw: float - - -def get_test_cases(file_name: str) -> list[TestCase]: - try: - content = Path(file_name).read_text() - except Exception as exc: - print(f"Could not open test file `{file_name}`: {exc}", file=sys.stderr) - sys.exit(113) - - tests = [] - lines = content.splitlines() - match = 
r"\s*([a-zA-Z_]+):\s*([a-zA-Z_]+|[+-]?[0-9]+)\s*" - for line in lines: - if not line.strip(): - continue - parts = line.split(";") - case = {} - for part in parts: - matched = re.match(match, part) - if not re.fullmatch(match, part): - print(f"invalid test case: '{line}': '{part}'", file=sys.stderr) - sys.exit(113) - key = matched[1] - value = matched[2] - try: - value = int(value) - except ValueError: - pass - case[key] = value - tests.append(TestCase(spec=line, args=case)) - return tests - - -def load_submission(): - import submission - - for fn_name in ("cross_entropy_forward", "cross_entropy_backward"): - if not hasattr(submission, fn_name): - raise AttributeError(f"Submission is missing function '{fn_name}'.") - return submission - - -def check_correctness(mod, vocab_size): - logits, targets, grad_output = generate_inputs(B, vocab_size) - - ref_loss = reference_forward(logits, targets) - sub_loss = mod.cross_entropy_forward(logits, targets) - - assert sub_loss.shape == ref_loss.shape, ( - f"Forward shape mismatch: expected {ref_loss.shape}, got {sub_loss.shape}" - ) - assert sub_loss.dtype == torch.float32, ( - f"Forward dtype mismatch: expected float32, got {sub_loss.dtype}" - ) - - fwd_close = torch.allclose(sub_loss, ref_loss, atol=ATOL, rtol=RTOL) - max_fwd_err = (sub_loss - ref_loss).abs().max().item() - - ref_grad = reference_backward(logits, targets, grad_output) - sub_grad = mod.cross_entropy_backward(logits, targets, grad_output) - - assert sub_grad.shape == ref_grad.shape, ( - f"Backward shape mismatch: expected {ref_grad.shape}, got {sub_grad.shape}" - ) - assert sub_grad.dtype == DTYPE, ( - f"Backward dtype mismatch: expected {DTYPE}, got {sub_grad.dtype}" - ) - - bwd_close = torch.allclose(sub_grad, ref_grad, atol=ATOL, rtol=RTOL) - max_bwd_err = (sub_grad.float() - ref_grad.float()).abs().max().item() - - return fwd_close, bwd_close, max_fwd_err, max_bwd_err - - -def benchmark_one(mod, vocab_size, seed_schedule): - def phase_inputs(phase, idx): - 
seed = seed_schedule[phase][idx] - return generate_inputs(B, vocab_size, seed=seed) - - for idx in range(WARMUP_ITERS): - logits, targets, grad_output = phase_inputs("warmup", idx) - mod.cross_entropy_forward(logits, targets) - mod.cross_entropy_backward(logits, targets, grad_output) - torch.cuda.synchronize() - - fwd_times = [] - for idx in range(BENCH_ITERS): - logits, targets, _ = phase_inputs("forward", idx) - start = torch.cuda.Event(enable_timing=True) - end = torch.cuda.Event(enable_timing=True) - start.record() - mod.cross_entropy_forward(logits, targets) - end.record() - torch.cuda.synchronize() - fwd_times.append(start.elapsed_time(end)) - - bwd_times = [] - for idx in range(BENCH_ITERS): - logits, targets, grad_output = phase_inputs("backward", idx) - start = torch.cuda.Event(enable_timing=True) - end = torch.cuda.Event(enable_timing=True) - start.record() - mod.cross_entropy_backward(logits, targets, grad_output) - end.record() - torch.cuda.synchronize() - bwd_times.append(start.elapsed_time(end)) - - combined_times = [] - for idx in range(BENCH_ITERS): - logits, targets, grad_output = phase_inputs("combined", idx) - start = torch.cuda.Event(enable_timing=True) - end = torch.cuda.Event(enable_timing=True) - start.record() - mod.cross_entropy_forward(logits, targets) - mod.cross_entropy_backward(logits, targets, grad_output) - end.record() - torch.cuda.synchronize() - combined_times.append(start.elapsed_time(end)) - - fwd_ms = statistics.median(fwd_times) - bwd_ms = statistics.median(bwd_times) - combined_ms = statistics.median(combined_times) - - fwd_bytes = 2 * B * vocab_size + 12 * B - bwd_bytes = 4 * B * vocab_size + 12 * B - total_bytes = fwd_bytes + bwd_bytes - - fwd_bw = fwd_bytes / (fwd_ms * 1e-3) / 1e9 - bwd_bw = bwd_bytes / (bwd_ms * 1e-3) / 1e9 - combined_bw = total_bytes / (combined_ms * 1e-3) / 1e9 - - # Keep KernelBot scoring on the exact reported metric: median combined ms. 
- return Stats( - runs=BENCH_ITERS, - mean=combined_ms * 1e6, - std=statistics.pstdev(combined_times) * 1e6, - err=(statistics.pstdev(combined_times) / math.sqrt(len(combined_times))) * 1e6, - best=min(combined_times) * 1e6, - worst=max(combined_times) * 1e6, - fwd_bw=fwd_bw, - bwd_bw=bwd_bw, - combined_bw=combined_bw, - ) - - -def run_testing(logger: PopcornOutput, tests: list[TestCase]) -> int: - try: - mod = load_submission() - except Exception as exc: - logger.log("check", "fail") - logger.log("error", str(exc)) - return 112 - - passed = True - logger.log("test-count", len(tests)) - for idx, test in enumerate(tests): - vocab_size = int(test.args["vocab_size"]) - logger.log(f"test.{idx}.spec", test.spec) - try: - fwd_ok, bwd_ok, fwd_err, bwd_err = check_correctness(mod, vocab_size) - if fwd_ok and bwd_ok: - logger.log(f"test.{idx}.status", "pass") - logger.log( - f"test.{idx}.message", - f"forward max err={fwd_err:.3e}, backward max err={bwd_err:.3e}", - ) - else: - logger.log(f"test.{idx}.status", "fail") - logger.log( - f"test.{idx}.error", - f"forward max err={fwd_err:.3e} {'OK' if fwd_ok else 'FAIL'}; " - f"backward max err={bwd_err:.3e} {'OK' if bwd_ok else 'FAIL'}", - ) - passed = False - except Exception as exc: - logger.log(f"test.{idx}.status", "fail") - logger.log(f"test.{idx}.error", str(exc)) - passed = False - - logger.log("check", "pass" if passed else "fail") - return 0 if passed else 112 - - -def run_benchmarking(logger: PopcornOutput, tests: list[TestCase]) -> int: - try: - mod = load_submission() - except Exception as exc: - logger.log("check", "fail") - logger.log("error", str(exc)) - return 112 - - baseline_mod = type(sys)("baseline") - baseline_mod.cross_entropy_forward = ( - lambda logits, targets: F.cross_entropy(logits.float(), targets, reduction="none") - ) - - def baseline_bwd(logits, targets, grad_output): - probs = torch.softmax(logits.float(), dim=-1) - probs[torch.arange(logits.shape[0], device=logits.device), targets] -= 1.0 - 
return (probs * grad_output.unsqueeze(1)).to(logits.dtype) - - baseline_mod.cross_entropy_backward = baseline_bwd - - passed = True - logger.log("benchmark-count", len(tests)) - for idx, test in enumerate(tests): - vocab_size = int(test.args["vocab_size"]) - logger.log(f"benchmark.{idx}.spec", test.spec) - try: - seed_schedule = make_seed_schedule() - baseline = benchmark_one(baseline_mod, vocab_size, seed_schedule) - result = benchmark_one(mod, vocab_size, seed_schedule) - speedup = baseline.mean / result.mean - except Exception as exc: - logger.log(f"benchmark.{idx}.status", "fail") - logger.log(f"benchmark.{idx}.error", str(exc)) - passed = False - continue - - logger.log(f"benchmark.{idx}.runs", result.runs) - logger.log(f"benchmark.{idx}.mean", result.mean) - logger.log(f"benchmark.{idx}.std", result.std) - logger.log(f"benchmark.{idx}.err", result.err) - logger.log(f"benchmark.{idx}.best", result.best) - logger.log(f"benchmark.{idx}.worst", result.worst) - logger.log(f"benchmark.{idx}.fwd_bw", result.fwd_bw) - logger.log(f"benchmark.{idx}.bwd_bw", result.bwd_bw) - logger.log(f"benchmark.{idx}.combined_bw", result.combined_bw) - logger.log(f"benchmark.{idx}.speedup", speedup) - logger.log( - f"benchmark.{idx}.message", - ( - f"fwd+bwd={result.mean / 1e6:.3f} ms, " - f"fwd_bw={result.fwd_bw:.1f} GB/s, " - f"bwd_bw={result.bwd_bw:.1f} GB/s, " - f"combined_bw={result.combined_bw:.1f} GB/s, " - f"speedup={speedup:.2f}x" - ), - ) - - logger.log("check", "pass" if passed else "fail") - return 0 if passed else 112 - - -def main(): - fd = os.getenv("POPCORN_FD") - if not fd: - return 111 - - if len(sys.argv) < 3: - return 2 - - if not torch.cuda.is_available(): - with PopcornOutput(int(fd)) as logger: - logger.log("check", "fail") - logger.log("error", "No CUDA GPU available. 
This script requires a GPU.") - return 112 - - mode = sys.argv[1] - tests = get_test_cases(sys.argv[2]) - - with PopcornOutput(int(fd)) as logger: - if mode == "test": - return run_testing(logger, tests) - if mode in {"benchmark", "leaderboard"}: - return run_benchmarking(logger, tests) - - logger.log("check", "fail") - logger.log("error", f"Unsupported mode: {mode}") - return 2 - - -if __name__ == "__main__": - raise SystemExit(main()) diff --git a/problems/princeton/cross_entropy_py/reference.py b/problems/princeton/cross_entropy_py/reference.py deleted file mode 100644 index 0317f3fa..00000000 --- a/problems/princeton/cross_entropy_py/reference.py +++ /dev/null @@ -1,28 +0,0 @@ -import torch -import torch.nn.functional as F - - -DTYPE = torch.bfloat16 -DEVICE = "cuda" -ATOL = 1e-3 -RTOL = 1e-2 - - -def reference_forward(logits, targets): - return F.cross_entropy(logits.float(), targets, reduction="none") - - -def reference_backward(logits, targets, grad_output): - probs = torch.softmax(logits.float(), dim=-1) - grad = probs - grad[torch.arange(logits.shape[0], device=logits.device), targets] -= 1.0 - grad = grad * grad_output.unsqueeze(1) - return grad.to(logits.dtype) - - -def generate_inputs(batch_size, vocab_size, seed=42): - torch.manual_seed(seed) - logits = torch.randn(batch_size, vocab_size, dtype=DTYPE, device=DEVICE) - targets = torch.randint(0, vocab_size, (batch_size,), device=DEVICE) - grad_output = torch.randn(batch_size, dtype=torch.float32, device=DEVICE) - return logits, targets, grad_output diff --git a/problems/princeton/cross_entropy_py/submission.py b/problems/princeton/cross_entropy_py/submission.py deleted file mode 100644 index e24b7ff2..00000000 --- a/problems/princeton/cross_entropy_py/submission.py +++ /dev/null @@ -1,52 +0,0 @@ -#!POPCORN leaderboard princeton_cross_entropy - -""" -Baseline submission for the cross-entropy problem. - -Replace these functions with a faster implementation. 
- -The evaluator uses: -- B = 4096 -- V in {32000, 50264, 128256} -- V % 8 == 0 -- finite real-valued logits (no masking with -inf) - -Example local bandwidth calculation for the three ranked shapes: - - def print_max_bw(batch_size, vocab_size, combined_ms): - total_bytes = (6 * batch_size * vocab_size) + (24 * batch_size) - combined_bw = total_bytes / (combined_ms * 1e-3) / 1e9 - print(f\"B={batch_size} V={vocab_size}: {combined_bw:.2f} GB/s\") - -This is only for local debugging. Do not add timing calls inside the hot path -if you care about leaderboard performance. -""" - -import torch -import torch.nn.functional as F - - -def cross_entropy_forward(logits, targets): - """ - Args: - logits: (B, V) torch.bfloat16 - targets: (B,) torch.int64 - Returns: - (B,) torch.float32 - """ - return F.cross_entropy(logits.float(), targets, reduction="none") - - -def cross_entropy_backward(logits, targets, grad_output): - """ - Args: - logits: (B, V) torch.bfloat16 - targets: (B,) torch.int64 - grad_output: (B,) torch.float32 - Returns: - (B, V) torch.bfloat16 - """ - probs = torch.softmax(logits.float(), dim=-1) - probs[torch.arange(logits.shape[0], device=logits.device), targets] -= 1.0 - grad_logits = probs * grad_output.unsqueeze(1) - return grad_logits.to(logits.dtype) diff --git a/problems/princeton/cross_entropy_py/task.yml b/problems/princeton/cross_entropy_py/task.yml deleted file mode 100644 index 1d91457e..00000000 --- a/problems/princeton/cross_entropy_py/task.yml +++ /dev/null @@ -1,48 +0,0 @@ -files: - - {"name": "submission.py", "source": "@SUBMISSION@"} - - {"name": "reference.py", "source": "reference.py"} - - {"name": "eval.py", "source": "eval.py"} - -lang: "py" - -description: | - Implement fused cross-entropy forward and backward kernels for logits of shape (B, V). 
- - Your submission must define: - - cross_entropy_forward(logits, targets) -> losses - - cross_entropy_backward(logits, targets, grad_output) -> grad_logits - - Inputs: - - logits: torch.bfloat16 tensor of real-valued, finite logits with shape (B, V) - - targets: torch.int64 tensor of shape (B,) - - grad_output: torch.float32 tensor of shape (B,) - - Outputs: - - forward output: torch.float32 tensor of shape (B,) - - backward output: torch.bfloat16 tensor of shape (B, V) - - Assumptions used by the evaluator and benchmark: - - batch size is fixed at B = 4096 - - vocab sizes are V in {32000, 50264, 128256} - - vocab size is guaranteed to be divisible by 8 - - logits are ordinary real numbers; masked values such as -inf are not used - -config: - main: "eval.py" - -tests: - - {"vocab_size": 32000} - - {"vocab_size": 50264} - - {"vocab_size": 128256} - -benchmarks: - - {"vocab_size": 32000} - - {"vocab_size": 50264} - - {"vocab_size": 128256} - -test_timeout: 300 -benchmark_timeout: 900 -ranked_timeout: 1200 -ranking_by: "geom" -gpus: - - A100 diff --git a/problems/princeton2026.yaml b/problems/princeton2026.yaml deleted file mode 100644 index c2f76179..00000000 --- a/problems/princeton2026.yaml +++ /dev/null @@ -1,9 +0,0 @@ -name: Princeton Problem Set -deadline: "2026-04-17 03:59" -description: "Princeton problem set" -problems: - - directory: princeton/cross_entropy_py - name: princeton_cross_entropy - deadline: "2026-04-17 03:59" - gpus: - - A100