RFC: future-looking emulated ACE (AI Compute Extensions) matrix backend #2397
Draft
czoli1976 wants to merge 1 commit into
Adds a portable, bit-exact software model of the AMD/Intel ACE x86 outer-product matrix ISA (x86 Ecosystem Advisory Group whitepaper v1.0; hardware ~2028, no assembler support yet). This lets tract's packing, kernel structure, dispatch, fused epilogues and numerics be built and validated today, with the inner compute swapping for the real instructions once compilers/assemblers support ACE.

Datatypes (all registered MMM kernels, validated on the host):
- INT8 outer product (top4bssd), reusing the existing PackedI8K4 packing unchanged on both A and B sides.
- BF16 (top2bf16ps) with a portable K=2-inner packer.
- MXFP8 / MXINT8 block-scaled (OCP MX). FusedKerSpec has no scale-register slot, so the per-block E8M0 scales travel inside the packed byte stream: a unified scaled-block packer segregates the scale strip to the panel tail, keeping the element region byte-identical to PackedI8K4 runs (each MX block is a clean 512-byte element run the kernel reads directly).

Also:
- format.rs: OCP MX numeric decodes (E8M0, FP8 E4M3/E5M2, FP4 E2M1, bf16 RNE), the canonical quantize_mx_block encoder, and VUNPACKB / VPERM(I2B) marshalling models (these run on today's AVX-512).
- detect.rs: has_ace() runtime gate (CPUID placeholder + AMX-style XSAVE XCOMP-perm), false on all current hardware; reserved for the future real kernel, deliberately not gating the emulation.
- dummy_ace.S + build.rs: a cfg(tract_ace) assembler-probe seam that fails closed today and lights up when binutils gains ACE.

The single per-kernel compute call is the only swap point.

Validation: INT8 rides tract's standard proptest harness; bf16/MX use dedicated precision-matched differential tests (exact ==) over a K/shape sweep including multi-panel and partial tiles, plus exhaustive 256-entry FP8 decode tables and an independent E8M0 scale-byte oracle. 157 tests, no warnings, builds and tests on aarch64.
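The scale-strip layout described for the MXFP8 / MXINT8 kernels can be pictured with a small sketch. This is not the PR's packer: the constants, the function names (`pack_panel`, `block_elems`, `block_scales`) and the one-scale-byte-per-row-per-block arrangement are assumptions chosen so that a 16-row panel with 32-element MX blocks gives the 512-byte element runs mentioned above.

```rust
// Hypothetical geometry for illustration only (not taken from the PR's code).
const PANEL_ROWS: usize = 16; // rows per packed panel
const MX_BLOCK_K: usize = 32; // OCP MX block length along K
const K_INNER: usize = 4; // K4 interleave, in the style of a PackedI8K4 layout
const BLOCK_BYTES: usize = PANEL_ROWS * MX_BLOCK_K; // 512 element bytes per MX block

/// Pack one panel. `elems[row][k]` holds element bytes, `scales[row][block]`
/// holds one E8M0 scale byte per row and MX block (assumed arrangement).
fn pack_panel(elems: &[Vec<u8>], scales: &[Vec<u8>]) -> Vec<u8> {
    assert_eq!(elems.len(), PANEL_ROWS);
    assert_eq!(scales.len(), PANEL_ROWS);
    let k = elems[0].len();
    assert!(k % MX_BLOCK_K == 0, "K must be a whole number of MX blocks");
    let n_blocks = k / MX_BLOCK_K;
    let mut out = Vec::with_capacity(n_blocks * (BLOCK_BYTES + PANEL_ROWS));

    // Element region: for each group of 4 consecutive k, 16 rows x 4 bytes.
    // No scale bytes are interleaved, so this region looks like a plain
    // K4-interleaved i8 panel.
    for k0 in (0..k).step_by(K_INNER) {
        for row in elems {
            assert_eq!(row.len(), k);
            out.extend_from_slice(&row[k0..k0 + K_INNER]);
        }
    }

    // Scale strip at the panel tail: PANEL_ROWS bytes per MX block.
    for b in 0..n_blocks {
        for row_scales in scales {
            assert_eq!(row_scales.len(), n_blocks);
            out.push(row_scales[b]);
        }
    }
    out
}

/// The 512-byte element run of MX block `b`.
fn block_elems(panel: &[u8], b: usize) -> &[u8] {
    &panel[b * BLOCK_BYTES..(b + 1) * BLOCK_BYTES]
}

/// The per-row scale bytes of MX block `b`, read from the tail strip.
fn block_scales(panel: &[u8], n_blocks: usize, b: usize) -> &[u8] {
    let base = n_blocks * BLOCK_BYTES;
    &panel[base + b * PANEL_ROWS..base + (b + 1) * PANEL_ROWS]
}
```

Under these assumptions a kernel would read `block_elems` exactly as it reads an INT8 panel and fetch the matching scales from the tail with `block_scales`, which is why the element region can stay byte-identical to the unscaled layout.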
Self-contained under linalg/src/ace + linalg/x86_64/ace; the only edits to existing code are `pub mod ace;` in lib.rs and an additive build.rs probe. Not wired into production dispatch (ace::plug is opt-in).

Remaining work is gated on the published ACE spec / silicon and flagged with TODO(ace): the exact CPUID bit, the real mnemonics/intrinsics and .S kernels, and performance tuning (blocked-register kernels).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
RFC / Draft — seeking maintainer guidance before investing further. Opening as a draft because this is deliberately forward-looking; I'd value your view on whether tract wants to carry it (and in what form) before polishing.
What this is
A portable, bit-exact software model of ACE (AI Compute Extensions) — the joint AMD/Intel x86 outer-product matrix ISA standardized via the x86 Ecosystem Advisory Group (whitepaper v1.0). ACE is an outer-product matrix unit (the x86 cousin of ARM SME's SMOPA / IBM Power10 MMA) with 16×16 accumulator tiles, ZMM-fed operands, and the OCP MX block-scaled low-precision formats.
There is no ACE silicon and no assembler/intrinsic support yet (hardware is not expected before ~2028). So this isn't a performance kernel — it's the integration scaffold: the packing layouts, tile geometry, dispatch wiring, fused epilogues and numerics are built and validated now, structured so that when toolchains gain ACE the inner compute is a one-line swap to the real instruction and nothing else changes.
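To make the "one-line swap" idea concrete, here is a minimal sketch of an emulated INT8 outer-product step over a 16×16 accumulator tile. It is not the PR's kernel: the function names are hypothetical, and the semantics (signed × signed i8, a 4-element dot product along K per accumulator cell, wrapping i32 accumulation) are assumptions modelled on existing dot-product matrix instructions rather than on the ACE specification.

```rust
const TILE: usize = 16; // accumulator tile is TILE x TILE
const K_INNER: usize = 4; // K elements consumed per outer-product step

/// Stand-in for the single compute call that a real instruction would replace.
/// For every cell: acc[i][j] += sum over k of a[i][k] * b[j][k].
fn outer_product_i8(
    acc: &mut [[i32; TILE]; TILE],
    a: &[[i8; K_INNER]; TILE],
    b: &[[i8; K_INNER]; TILE],
) {
    for i in 0..TILE {
        for j in 0..TILE {
            let mut dot = 0i32;
            for k in 0..K_INNER {
                dot += a[i][k] as i32 * b[j][k] as i32;
            }
            // Assumption: accumulation wraps rather than saturates.
            acc[i][j] = acc[i][j].wrapping_add(dot);
        }
    }
}

/// Drive the step over K. Each slice entry is one K4 group of a packed panel:
/// 16 rows (A side) or 16 columns (B side), 4 bytes each.
fn matmul_tile_i8(
    a_panel: &[[[i8; K_INNER]; TILE]],
    b_panel: &[[[i8; K_INNER]; TILE]],
) -> [[i32; TILE]; TILE] {
    assert_eq!(a_panel.len(), b_panel.len());
    let mut acc = [[0i32; TILE]; TILE];
    for (a, b) in a_panel.iter().zip(b_panel) {
        // The only line that would change when real hardware is targeted.
        outer_product_i8(&mut acc, a, b);
    }
    acc
}
```

Everything outside the call to `outer_product_i8` (panel iteration, accumulator handling, and whatever epilogue follows) is independent of how the step itself is computed, which is the property the scaffold relies on.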
Why it might be worth carrying
- […] `PackedI8K4` layout unchanged, and a layout for carrying OCP-MX block scales through `FusedKerSpec` (which has no scale-register slot).
What's included
All four ACE v1 datatypes as registered MMM kernels (INT8, BF16, MXFP8, MXINT8), the OCP-MX numeric layer (E8M0 / FP8 E4M3+E5M2 / FP4 E2M1 decodes + `VUNPACKB`/`VPERM` marshalling models, runnable on today's AVX-512), a `has_ace()` runtime gate, and a `cfg(tract_ace)` assembler-probe seam that fails closed today.
Scope / safety
- […] `linalg/src/ace/` + `linalg/x86_64/ace/`. The only edits to existing code are `pub mod ace;` and an additive `build.rs` probe; no changes to shared kernels, packing, dispatch, or test infrastructure.
- […] `ace::plug` is explicit); it can never be selected for a real matmul today.
- […] `TODO(ace)`: the exact CPUID bit, the ratified mnemonics/intrinsics + real `.S` kernels, and performance tuning.
Open questions for maintainers
Happy to adjust scope, gate it behind a feature, trim to just the numeric/format layer, or close it if it's premature. Mainly looking for a steer.
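For reviewers weighing the "numeric/format layer only" option, the decodes listed under "What's included" are small. The sketch below shows what such a layer could look like for E8M0, FP8 E4M3, FP8 E5M2 and bf16 round-to-nearest-even. The function names are hypothetical and are not the ones in `format.rs`; the bit layouts follow the OCP formats as I understand them (E8M0 bias 127 with 0xFF as NaN, E4M3 bias 7 with no infinities, E5M2 bias 15 with IEEE-style infinities and NaNs).

```rust
/// Exact power of two for exponents in the normal f32 range.
fn pow2(e: i32) -> f32 {
    f32::from_bits(((127 + e) as u32) << 23)
}

/// E8M0 shared scale: 2^(bits - 127); 0xFF encodes NaN.
fn e8m0_to_f32(bits: u8) -> f32 {
    match bits {
        0xFF => f32::NAN,
        0 => f32::from_bits(0x0040_0000), // 2^-127 is subnormal in f32
        e => f32::from_bits((e as u32) << 23),
    }
}

/// FP8 E4M3: 1 sign, 4 exponent (bias 7), 3 mantissa bits.
/// No infinities; only S.1111.111 is NaN.
fn fp8_e4m3_to_f32(bits: u8) -> f32 {
    let sign = if bits & 0x80 != 0 { -1.0f32 } else { 1.0 };
    let exp = ((bits >> 3) & 0x0F) as i32;
    let man = (bits & 0x07) as f32;
    if exp == 0x0F && bits & 0x07 == 0x07 {
        return f32::NAN;
    }
    let mag = if exp == 0 {
        man / 8.0 * pow2(-6) // subnormal
    } else {
        (1.0 + man / 8.0) * pow2(exp - 7)
    };
    sign * mag
}

/// FP8 E5M2: 1 sign, 5 exponent (bias 15), 2 mantissa bits, IEEE-style specials.
fn fp8_e5m2_to_f32(bits: u8) -> f32 {
    let sign = if bits & 0x80 != 0 { -1.0f32 } else { 1.0 };
    let exp = ((bits >> 2) & 0x1F) as i32;
    let man = (bits & 0x03) as f32;
    if exp == 0x1F {
        return if man == 0.0 { sign * f32::INFINITY } else { f32::NAN };
    }
    let mag = if exp == 0 {
        man / 4.0 * pow2(-14) // subnormal
    } else {
        (1.0 + man / 4.0) * pow2(exp - 15)
    };
    sign * mag
}

/// f32 -> bf16 with round-to-nearest-even on the 16 discarded bits.
fn f32_to_bf16_rne(x: f32) -> u16 {
    let bits = x.to_bits();
    if x.is_nan() {
        return ((bits >> 16) as u16) | 0x0040; // keep it a quiet NaN
    }
    let round = 0x7FFF + ((bits >> 16) & 1);
    ((bits + round) >> 16) as u16
}
```

Each FP8 decode has only 256 inputs, so an exhaustive table comparison is cheap, which matches the validation approach described in the commit message.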
🤖 Generated with Claude Code