RFC: future-looking emulated ACE (AI Compute Extensions) matrix backend #2397

Draft
czoli1976 wants to merge 1 commit into sonos:main from czoli1976:ace-upstream

@czoli1976

Contributor

RFC / Draft — seeking maintainer guidance before investing further. Opening as a draft because this is deliberately forward-looking; I'd value your view on whether tract wants to carry it (and in what form) before polishing.

What this is

A portable, bit-exact software model of ACE (AI Compute Extensions) — the joint AMD/Intel x86 outer-product matrix ISA standardized via the x86 Ecosystem Advisory Group (whitepaper v1.0). ACE is an outer-product matrix unit (the x86 cousin of ARM SME's SMOPA / IBM Power10 MMA) with 16×16 accumulator tiles, ZMM-fed operands, and the OCP MX block-scaled low-precision formats.

There is no ACE silicon and no assembler/intrinsic support yet (hardware is not expected before ~2028). So this isn't a performance kernel — it's the integration scaffold: the packing layouts, tile geometry, dispatch wiring, fused epilogues and numerics are built and validated now, structured so that when toolchains gain ACE the inner compute is a one-line swap to the real instruction and nothing else changes.
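To make the "one-line swap" idea concrete, here is a minimal sketch of an emulated INT8 outer-product step feeding a 16×16 accumulator tile. The function names, the 4-bytes-per-step grouping (suggested by the PackedI8K4 name) and the wrapping accumulate are assumptions for illustration, not the PR's actual code or ratified ACE semantics.

```rust
/// Accumulator tile geometry taken from the PR text: 16 x 16.
const TILE: usize = 16;
/// Assumed inner group: 4 signed bytes per row per step (suggested by the name PackedI8K4).
const K4: usize = 4;

type Acc = [[i32; TILE]; TILE];

/// Hypothetical swap point: the only function that would be replaced by the
/// real instruction once toolchains support it. Each accumulator cell (i, j)
/// receives the dot product of 4 signed bytes from row i of A and column j of B.
fn emulated_outer_product_i8(acc: &mut Acc, a: &[[i8; K4]; TILE], b: &[[i8; K4]; TILE]) {
    for i in 0..TILE {
        for j in 0..TILE {
            let mut dot = 0i32;
            for k in 0..K4 {
                dot += a[i][k] as i32 * b[j][k] as i32;
            }
            // Wrapping add is an assumption; the real overflow behaviour is spec-defined.
            acc[i][j] = acc[i][j].wrapping_add(dot);
        }
    }
}

/// Everything around the swap point (panel walking, tile setup) stays the same
/// whether the inner call is emulated or real.
fn matmul_tile(a_panel: &[[[i8; K4]; TILE]], b_panel: &[[[i8; K4]; TILE]]) -> Acc {
    let mut acc = [[0i32; TILE]; TILE];
    for (a, b) in a_panel.iter().zip(b_panel) {
        emulated_outer_product_i8(&mut acc, a, b);
    }
    acc
}
```

In this shape, packing, tile geometry and the driver loop are exercised today, and only `emulated_outer_product_i8` would change later.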

Why it might be worth carrying

  • It establishes a bit-exact reference oracle for ACE semantics (the OCP MX / FP8 / FP4 / E8M0 numeric formats, the outer-product accumulation, the block-scale application) — the part that's most bug-prone during real ISA bring-up, validated today against external spec tables.
  • It pins the packing/data-layout decisions ahead of time — notably that ACE int8 reuses the existing PackedI8K4 layout unchanged, and a layout for carrying OCP-MX block scales through FusedKerSpec (which has no scale-register slot).
  • As far as I can tell it'd make tract the first inference engine with any ACE-shaped matmul code.
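As an illustration of the scale-carrying layout mentioned in the list above, the sketch below places the element bytes first and the E8M0 scale bytes in a strip at the panel tail. The 16-row panel and 512-byte element run come from this PR; the 32-element block is the OCP MX block size; one scale byte per row per block, and all type and function names, are assumptions.

```rust
const ROWS: usize = 16; // tile rows, from the PR text
const MX_BLOCK: usize = 32; // elements sharing one scale (OCP MX block size)
const BLOCK_BYTES: usize = ROWS * MX_BLOCK; // 512-byte element run per MX block

/// Hypothetical description of one packed panel: [elements ...][scale strip].
struct PanelLayout {
    n_blocks: usize,
}

impl PanelLayout {
    fn element_bytes(&self) -> usize {
        self.n_blocks * BLOCK_BYTES
    }

    /// Offset of the scale byte for (block, row); assumes one scale per row per block.
    fn scale_offset(&self, block: usize, row: usize) -> usize {
        self.element_bytes() + block * ROWS + row
    }

    fn total_bytes(&self) -> usize {
        self.element_bytes() + self.n_blocks * ROWS
    }

    /// The contiguous element run of one MX block, readable without skipping scales.
    fn block_elements<'a>(&self, packed: &'a [u8], block: usize) -> &'a [u8] {
        &packed[block * BLOCK_BYTES..(block + 1) * BLOCK_BYTES]
    }
}

/// Append the scale strip after the untouched element region.
fn pack_panel(elements: &[u8], scales: &[u8], n_blocks: usize) -> Vec<u8> {
    let layout = PanelLayout { n_blocks };
    assert_eq!(elements.len(), layout.element_bytes());
    assert_eq!(scales.len(), n_blocks * ROWS);
    let mut out = Vec::with_capacity(layout.total_bytes());
    out.extend_from_slice(elements);
    out.extend_from_slice(scales);
    out
}
```

Keeping the scales out of the element region is what lets the element bytes stay identical to an unscaled int8 packing.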

What's included

  • All four ACE v1 datatypes as registered MMM kernels: INT8, BF16, MXFP8 and MXINT8.
  • The OCP-MX numeric layer: E8M0, FP8 (E4M3 and E5M2) and FP4 (E2M1) decodes, plus VUNPACKB/VPERM marshalling models that run on today's AVX-512.
  • A has_ace() runtime gate.
  • A cfg(tract_ace) assembler-probe seam that fails closed today.
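For readers unfamiliar with the OCP MX element formats, the decodes can be sketched as below. This follows my reading of the OCP formats (E8M0 bias 127 with 0xFF as NaN; E4M3 bias 7 with a single NaN mantissa pattern and no infinities; E5M2 bias 15 with IEEE-style infinities and NaNs; E2M1 bias 1 with no special values). The function names are hypothetical and are not those in format.rs.

```rust
/// Exact power of two for n in -126..=127, built from the f32 exponent field.
fn pow2(n: i32) -> f32 {
    debug_assert!((-126..=127).contains(&n));
    f32::from_bits(((n + 127) as u32) << 23)
}

/// E8M0 scale: value is 2^(e - 127); 0xFF encodes NaN.
fn decode_e8m0(byte: u8) -> f32 {
    match byte {
        0xFF => f32::NAN,
        0 => f32::from_bits(0x0040_0000), // 2^-127 is subnormal in f32
        e => f32::from_bits((e as u32) << 23),
    }
}

/// FP8 E4M3: 1 sign, 4 exponent (bias 7), 3 mantissa bits.
fn decode_fp8_e4m3(byte: u8) -> f32 {
    let sign = if byte & 0x80 != 0 { -1.0f32 } else { 1.0 };
    let exp = ((byte >> 3) & 0x0F) as i32;
    let man = (byte & 0x07) as f32;
    if byte & 0x7F == 0x7F {
        return f32::NAN; // the only NaN pattern; there is no infinity
    }
    let mag = if exp == 0 { man * pow2(-9) } else { (8.0 + man) * pow2(exp - 10) };
    sign * mag
}

/// FP8 E5M2: 1 sign, 5 exponent (bias 15), 2 mantissa bits.
fn decode_fp8_e5m2(byte: u8) -> f32 {
    let sign = if byte & 0x80 != 0 { -1.0f32 } else { 1.0 };
    let exp = ((byte >> 2) & 0x1F) as i32;
    let man_bits = byte & 0x03;
    let man = man_bits as f32;
    if exp == 0x1F {
        return if man_bits == 0 { sign * f32::INFINITY } else { f32::NAN };
    }
    let mag = if exp == 0 { man * pow2(-16) } else { (4.0 + man) * pow2(exp - 17) };
    sign * mag
}

/// FP4 E2M1 in the low nibble: 1 sign, 2 exponent (bias 1), 1 mantissa bit.
fn decode_fp4_e2m1(nibble: u8) -> f32 {
    let sign = if nibble & 0x8 != 0 { -1.0f32 } else { 1.0 };
    let exp = ((nibble >> 1) & 0x3) as i32;
    let man = (nibble & 0x1) as f32;
    let mag = if exp == 0 { man * 0.5 } else { (2.0 + man) * pow2(exp - 2) };
    sign * mag
}
```

Every intermediate here is a small integer times an exact power of two, so the results are exact in f32, which is what an exhaustive 256-entry table comparison relies on.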

Scope / safety

  • Self-contained: everything lives under linalg/src/ace/ + linalg/x86_64/ace/. The only edits to existing code are `pub mod ace;` and an additive build.rs probe; there are no changes to shared kernels, packing, dispatch, or test infrastructure.
  • Opt-in: not wired into production dispatch (ace::plug is explicit); it can never be selected for a real matmul today.
  • Validated: 157 tests, no new warnings, builds and tests on this branch's CI targets. INT8 rides the standard proptest harness; bf16/MX use precision-matched differential tests (the harness's f32 tolerance is too tight for them), backed by exhaustive 256-entry FP8 decode tables and an independent E8M0 scale-byte oracle.
  • Remaining work is genuinely gated on the published spec / silicon and flagged TODO(ace): the exact CPUID bit, the ratified mnemonics/intrinsics + real .S kernels, and performance tuning.
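A precision-matched differential test, as described in the validation bullet above, can be sketched like this: the reference rounds its inputs to bf16 with round-to-nearest-even and accumulates in the same order as the kernel, so the comparison can be exact rather than tolerance-based. The K=2 pairing and accumulation order are assumptions, and none of these names come from the PR.

```rust
/// f32 -> bf16 bit pattern with round-to-nearest-even.
fn f32_to_bf16_rne(x: f32) -> u16 {
    let bits = x.to_bits();
    if x.is_nan() {
        return ((bits >> 16) as u16) | 0x0040; // keep it a quiet NaN
    }
    let lsb = (bits >> 16) & 1;
    (bits.wrapping_add(0x7FFF + lsb) >> 16) as u16
}

/// bf16 bit pattern -> f32 (exact: bf16 is the top half of an f32).
fn bf16_to_f32(h: u16) -> f32 {
    f32::from_bits((h as u32) << 16)
}

/// "Kernel" side: consumes packed bf16 bits, two K values per step (assumed K=2-inner).
fn dot_bf16_k2(a: &[u16], b: &[u16]) -> f32 {
    let mut acc = 0.0f32;
    for (pa, pb) in a.chunks_exact(2).zip(b.chunks_exact(2)) {
        let p0 = bf16_to_f32(pa[0]) * bf16_to_f32(pb[0]);
        let p1 = bf16_to_f32(pa[1]) * bf16_to_f32(pb[1]);
        acc += p0 + p1;
    }
    acc
}

/// Reference side: starts from f32 data, rounds through bf16, and sums in the
/// same order. Expects an even number of elements.
fn reference_dot(a: &[f32], b: &[f32]) -> f32 {
    let q = |x: f32| bf16_to_f32(f32_to_bf16_rne(x));
    let mut acc = 0.0f32;
    for k in (0..a.len()).step_by(2) {
        acc += q(a[k]) * q(b[k]) + q(a[k + 1]) * q(b[k + 1]);
    }
    acc
}
```

A test would pack the f32 inputs with `f32_to_bf16_rne`, run both functions, and compare the results with `to_bits()` equality.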

Open questions for maintainers

  1. Is speculative future-ISA scaffolding something tract wants in-tree at all, or would you prefer it live out-of-tree until silicon/toolchain exist?
  2. If in-tree: keep it feature-gated (e.g. behind a cargo feature) so it's clearly experimental?
  3. Is the emulation's value (reference oracle + frozen layout decisions) worth the maintenance surface to you, or only once there's a real compute path?

Happy to adjust scope, gate it behind a feature, trim to just the numeric/format layer, or close it if it's premature. Mainly looking for a steer.

🤖 Generated with Claude Code

Adds a portable, bit-exact software model of the AMD/Intel ACE x86
outer-product matrix ISA (x86 Ecosystem Advisory Group whitepaper v1.0;
hardware ~2028, no assembler support yet). This lets tract's packing,
kernel structure, dispatch, fused epilogues and numerics be built and
validated today, with the inner compute swapping for the real instructions
once compilers/assemblers support ACE.

Datatypes (all registered MMM kernels, validated on the host):
- INT8 outer product (top4bssd), reusing the existing PackedI8K4 packing
  unchanged on both A and B sides.
- BF16 (top2bf16ps) with a portable K=2-inner packer.
- MXFP8 / MXINT8 block-scaled (OCP MX). FusedKerSpec has no scale-register
  slot, so the per-block E8M0 scales travel inside the packed byte stream:
  a unified scaled-block packer segregates the scale strip to the panel
  tail, keeping the element region byte-identical to PackedI8K4 runs (each
  MX block is a clean 512-byte element run the kernel reads directly).

Also:
- format.rs: OCP MX numeric decodes (E8M0, FP8 E4M3/E5M2, FP4 E2M1, bf16
  RNE), the canonical quantize_mx_block encoder, and VUNPACKB / VPERM(I2B)
  marshalling models (these run on today's AVX-512).
- detect.rs: has_ace() runtime gate (CPUID placeholder + AMX-style XSAVE
  XCOMP-perm), false on all current hardware; reserved for the future real
  kernel, deliberately not gating the emulation.
- dummy_ace.S + build.rs: a cfg(tract_ace) assembler-probe seam that fails
  closed today and lights up when binutils gains ACE. The single per-kernel
  compute call is the only swap point.

Validation: INT8 rides tract's standard proptest harness; bf16/MX use
dedicated precision-matched differential tests (exact ==) over a K/shape
sweep including multi-panel and partial tiles, plus exhaustive 256-entry
FP8 decode tables and an independent E8M0 scale-byte oracle. 157 tests,
no warnings, builds and tests on aarch64.

Self-contained under linalg/src/ace + linalg/x86_64/ace; the only edits to
existing code are `pub mod ace;` in lib.rs and an additive build.rs probe.
Not wired into production dispatch (ace::plug is opt-in). Remaining work is
gated on the published ACE spec / silicon and flagged with TODO(ace):
the exact CPUID bit, the real mnemonics/intrinsics and .S kernels, and
performance tuning (blocked-register kernels).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>