RFC: future-looking emulated ACE (AI Compute Extensions) matrix backend by czoli1976 · Pull Request #2397 · sonos/tract

czoli1976 · 2026-06-21T12:39:50Z

RFC / Draft — seeking maintainer guidance before investing further. Opening as a draft because this is deliberately forward-looking; I'd value your view on whether tract wants to carry it (and in what form) before polishing.

What this is

A portable, bit-exact software model of ACE (AI Compute Extensions) — the joint AMD/Intel x86 outer-product matrix ISA standardized via the x86 Ecosystem Advisory Group (whitepaper v1.0). ACE is an outer-product matrix unit (the x86 cousin of ARM SME's SMOPA / IBM Power10 MMA) with 16×16 accumulator tiles, ZMM-fed operands, and the OCP MX block-scaled low-precision formats.

There is no ACE silicon and no assembler/intrinsic support yet (hardware is not expected before ~2028). So this isn't a performance kernel — it's the integration scaffold: the packing layouts, tile geometry, dispatch wiring, fused epilogues and numerics are built and validated now, structured so that when toolchains gain ACE the inner compute is a one-line swap to the real instruction and nothing else changes.
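For readers unfamiliar with outer-product units, one emulated INT8 step could look roughly like the sketch below. It is not the PR's code. The 16×16 tile comes from the description above. The K-depth of 4 (so one 64-byte ZMM holds 16 rows × 4 int8), the row-major operand layout, the wrapping accumulation and the name `outer_product_i8_k4` are all assumptions made for illustration.

```rust
/// Tile edge: the description above gives 16x16 accumulator tiles.
const TILE: usize = 16;
/// K-depth per step. ASSUMPTION: 4 int8 per row, so 16 rows fill one 64-byte ZMM.
const K4: usize = 4;

/// One emulated outer-product step (hypothetical helper, not tract's API):
/// acc[i][j] += sum over k of a[i][k] * b[j][k], with products widened to i32.
/// `a` and `b` are each 16 rows of 4 int8, stored row after row.
pub fn outer_product_i8_k4(
    acc: &mut [[i32; TILE]; TILE],
    a: &[i8; TILE * K4],
    b: &[i8; TILE * K4],
) {
    for i in 0..TILE {
        for j in 0..TILE {
            let mut dot = 0i32;
            for k in 0..K4 {
                dot += a[i * K4 + k] as i32 * b[j * K4 + k] as i32;
            }
            // ASSUMPTION: wrapping accumulation; the real ISA may saturate instead.
            acc[i][j] = acc[i][j].wrapping_add(dot);
        }
    }
}
```

In a scaffold of this shape, the body of the function is the part that would later be replaced by the real instruction, while the callers and the operand layout stay as they are.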

Why it might be worth carrying

It establishes a bit-exact reference oracle for ACE semantics (the OCP MX / FP8 / FP4 / E8M0 numeric formats, the outer-product accumulation, the block-scale application) — the part that's most bug-prone during real ISA bring-up, validated today against external spec tables.
It pins the packing/data-layout decisions ahead of time: notably, ACE int8 reuses the existing PackedI8K4 layout unchanged, and OCP-MX block scales get a defined layout for travelling in the packed byte stream, since FusedKerSpec has no scale-register slot.
As far as I can tell it'd make tract the first inference engine with any ACE-shaped matmul code.
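To make the "numeric formats" part of the first point concrete, here is a minimal sketch of the decodes as I understand them from the OCP MX and OCP FP8 specifications. It is not the code in format.rs; the function names and the `pow2` helper are hypothetical.

```rust
/// Exact power of two for -126 <= e <= 127, built from the f32 bit pattern.
fn pow2(e: i32) -> f32 {
    f32::from_bits(((e + 127) as u32) << 23)
}

/// E8M0 scale byte: 2^(e - 127); 0xFF encodes NaN.
pub fn decode_e8m0(e: u8) -> f32 {
    match e {
        0xFF => f32::NAN,
        // 2^-127 is below the f32 normal range, so emit the subnormal directly.
        0 => f32::from_bits(0x0040_0000),
        _ => pow2(e as i32 - 127),
    }
}

/// FP8 E4M3: bias 7, no infinities, NaN only at S.1111.111, max finite 448.
pub fn decode_fp8_e4m3(b: u8) -> f32 {
    let sign = if b & 0x80 != 0 { -1.0f32 } else { 1.0 };
    let exp = ((b >> 3) & 0x0F) as i32;
    let man = (b & 0x07) as f32;
    if b & 0x7F == 0x7F {
        return f32::NAN;
    }
    let mag = if exp == 0 {
        man * pow2(-9) // subnormal: (man / 8) * 2^-6
    } else {
        (1.0 + man / 8.0) * pow2(exp - 7)
    };
    sign * mag
}

/// FP8 E5M2: bias 15, IEEE-style infinities and NaNs.
pub fn decode_fp8_e5m2(b: u8) -> f32 {
    let sign = if b & 0x80 != 0 { -1.0f32 } else { 1.0 };
    let exp = ((b >> 2) & 0x1F) as i32;
    let man = (b & 0x03) as f32;
    let mag = match exp {
        0x1F if man == 0.0 => f32::INFINITY,
        0x1F => f32::NAN,
        0 => man * pow2(-16), // subnormal: (man / 4) * 2^-14
        _ => (1.0 + man / 4.0) * pow2(exp - 15),
    };
    sign * mag
}

/// FP4 E2M1 (low nibble): bias 1, magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6.
pub fn decode_fp4_e2m1(nibble: u8) -> f32 {
    let sign = if nibble & 0x8 != 0 { -1.0f32 } else { 1.0 };
    let exp = ((nibble >> 1) & 0x3) as i32;
    let man = (nibble & 0x1) as f32;
    let mag = if exp == 0 { man * 0.5 } else { (1.0 + man / 2.0) * pow2(exp - 1) };
    sign * mag
}
```

Because each FP8 format has only 256 encodings, a decode like this can be checked exhaustively against a table, which is the kind of validation the oracle relies on.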

What's included

All four ACE v1 datatypes as registered MMM kernels (INT8, BF16, MXFP8, MXINT8), the OCP-MX numeric layer (E8M0 / FP8 E4M3+E5M2 / FP4 E2M1 decodes + VUNPACKB/VPERM marshalling models, runnable on today's AVX-512), a has_ace() runtime gate, and a cfg(tract_ace) assembler-probe seam that fails closed today.

Scope / safety

Self-contained: everything lives under linalg/src/ace/ + linalg/x86_64/ace/. The only edits to existing code are pub mod ace; and an additive build.rs probe — no changes to shared kernels, packing, dispatch, or test infrastructure.
Opt-in: not wired into production dispatch (ace::plug is explicit); it can never be selected for a real matmul today.
Validated: 157 tests, no new warnings, builds and tests on this branch's CI targets. INT8 rides the standard proptest harness; bf16/MX use precision-matched differential tests (the harness's f32 tolerance is too tight for them), backed by exhaustive 256-entry FP8 decode tables and an independent E8M0 scale-byte oracle.
Remaining work is genuinely gated on the published spec / silicon and flagged TODO(ace): the exact CPUID bit, the ratified mnemonics/intrinsics + real .S kernels, and performance tuning.
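As an illustration of what "precision-matched" means in the validation point above: the reference has to quantize its inputs the same way the kernel does, otherwise the two legitimately differ by more than an f32 tolerance. A minimal sketch follows, assuming bf16 conversion by round-to-nearest-even and sequential f32 accumulation. The function names are hypothetical and the accumulation order of the real kernels is not claimed here.

```rust
/// f32 -> bf16 with round-to-nearest-even on the 16 discarded bits.
pub fn f32_to_bf16_rne(x: f32) -> u16 {
    let bits = x.to_bits();
    if x.is_nan() {
        // Force a quiet-NaN mantissa bit so truncation cannot produce an infinity.
        return ((bits >> 16) as u16) | 0x0040;
    }
    let lsb = (bits >> 16) & 1;
    // Adding 0x7FFF plus the kept LSB rounds ties toward the even result.
    ((bits + 0x7FFF + lsb) >> 16) as u16
}

/// bf16 -> f32 is exact: the 16 bits become the top half of the f32.
pub fn bf16_to_f32(h: u16) -> f32 {
    f32::from_bits((h as u32) << 16)
}

/// Reference dot product at matched precision: inputs rounded to bf16,
/// products and sums kept in f32 (each bf16 x bf16 product is exact in f32).
pub fn reference_dot_bf16(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).fold(0.0f32, |acc, (&x, &y)| {
        acc + bf16_to_f32(f32_to_bf16_rne(x)) * bf16_to_f32(f32_to_bf16_rne(y))
    })
}
```

With inputs `[1.00390625, 2.0]` and `[1.0, 0.5]`, this reference returns exactly 2.0 because the first operand rounds to 1.0 in bf16, whereas a plain f32 dot product gives 2.00390625. That gap is why an f32-tolerance harness is the wrong yardstick for the bf16 and MX kernels.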

Open questions for maintainers

Is speculative future-ISA scaffolding something tract wants in-tree at all, or would you prefer it live out-of-tree until silicon/toolchain exist?
If in-tree: keep it feature-gated (e.g. behind a cargo feature) so it's clearly experimental?
Is the emulation's value (reference oracle + frozen layout decisions) worth the maintenance surface to you, or only once there's a real compute path?

Happy to adjust scope, gate it behind a feature, trim to just the numeric/format layer, or close it if it's premature. Mainly looking for a steer.

🤖 Generated with Claude Code

Adds a portable, bit-exact software model of the AMD/Intel ACE x86 outer-product matrix ISA (x86 Ecosystem Advisory Group whitepaper v1.0; hardware ~2028, no assembler support yet). This lets tract's packing, kernel structure, dispatch, fused epilogues and numerics be built and validated today, with the inner compute swapping for the real instructions once compilers/assemblers support ACE.

Datatypes (all registered MMM kernels, validated on the host):
- INT8 outer product (top4bssd), reusing the existing PackedI8K4 packing unchanged on both A and B sides.
- BF16 (top2bf16ps) with a portable K=2-inner packer.
- MXFP8 / MXINT8 block-scaled (OCP MX). FusedKerSpec has no scale-register slot, so the per-block E8M0 scales travel inside the packed byte stream: a unified scaled-block packer segregates the scale strip to the panel tail, keeping the element region byte-identical to PackedI8K4 runs (each MX block is a clean 512-byte element run the kernel reads directly).

Also:
- format.rs: OCP MX numeric decodes (E8M0, FP8 E4M3/E5M2, FP4 E2M1, bf16 RNE), the canonical quantize_mx_block encoder, and VUNPACKB / VPERM(I2B) marshalling models (these run on today's AVX-512).
- detect.rs: has_ace() runtime gate (CPUID placeholder + AMX-style XSAVE XCOMP-perm), false on all current hardware; reserved for the future real kernel, deliberately not gating the emulation.
- dummy_ace.S + build.rs: a cfg(tract_ace) assembler-probe seam that fails closed today and lights up when binutils gains ACE. The single per-kernel compute call is the only swap point.

Validation: INT8 rides tract's standard proptest harness; bf16/MX use dedicated precision-matched differential tests (exact ==) over a K/shape sweep including multi-panel and partial tiles, plus exhaustive 256-entry FP8 decode tables and an independent E8M0 scale-byte oracle. 157 tests, no warnings, builds and tests on aarch64.

Self-contained under linalg/src/ace + linalg/x86_64/ace; the only edits to existing code are `pub mod ace;` in lib.rs and an additive build.rs probe. Not wired into production dispatch (ace::plug is opt-in). Remaining work is gated on the published ACE spec / silicon and flagged with TODO(ace): the exact CPUID bit, the real mnemonics/intrinsics and .S kernels, and performance tuning (blocked-register kernels).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
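The scale-strip layout in the commit message can be pictured with a small packer sketch. This is a guess at the general shape rather than the PR's packer. The 32-element MX block size is from the OCP MX spec, and 16 rows × 32 one-byte elements gives the 512-byte run mentioned above. The K4-interleaved element order, the block-major order of the scale strip, and the names `pack_scaled_panel` and `scale_strip_offset` are assumptions.

```rust
const ROWS: usize = 16; // rows per panel (tile edge)
const MX_BLOCK: usize = 32; // elements sharing one E8M0 scale (OCP MX)
const K4: usize = 4; // ASSUMED inner interleave of the int8 layout
const BLOCK_BYTES: usize = ROWS * MX_BLOCK; // 512-byte element run per MX block

/// Pack one 16-row panel (hypothetical helper).
/// `elems`: row-major ROWS x k one-byte elements.
/// `scales`: row-major ROWS x (k / 32) E8M0 bytes.
/// Output: all element runs first, then the scale strip at the panel tail.
pub fn pack_scaled_panel(elems: &[u8], scales: &[u8], k: usize) -> Vec<u8> {
    assert!(k % MX_BLOCK == 0, "sketch handles whole MX blocks only");
    let nblocks = k / MX_BLOCK;
    assert_eq!(elems.len(), ROWS * k);
    assert_eq!(scales.len(), ROWS * nblocks);

    let mut out = Vec::with_capacity(nblocks * (BLOCK_BYTES + ROWS));
    // Element region: per block, per group of 4 along K, 4 bytes from each row.
    for blk in 0..nblocks {
        for kg in 0..MX_BLOCK / K4 {
            for row in 0..ROWS {
                let start = row * k + blk * MX_BLOCK + kg * K4;
                out.extend_from_slice(&elems[start..start + K4]);
            }
        }
    }
    // Scale strip: 16 scale bytes per block, appended after every element run.
    for blk in 0..nblocks {
        for row in 0..ROWS {
            out.push(scales[row * nblocks + blk]);
        }
    }
    out
}

/// Byte offset of the scale strip inside a packed panel of depth `k`.
pub fn scale_strip_offset(k: usize) -> usize {
    (k / MX_BLOCK) * BLOCK_BYTES
}
```

The point of putting the strip at the tail is that the element region stays a plain sequence of 512-byte runs, so a kernel written for the unscaled int8 layout can read it unchanged and fetch scales from a single known offset.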

czoli1976 mentioned this pull request Jun 21, 2026

linalg/ace: future-looking emulated ACE (AI Compute Extensions) backend czoli1976/tract#17

Closed

czoli1976 wants to merge 1 commit into sonos:main from czoli1976:ace-upstream
