TFR optimizations + MX support #374
NikhilRout
commented
Jun 19, 2026
- added fpnew FP16/BF16 FEDP backend
- fixed FPGA (fpnew related) makefiles
- optimized TFR FEDP uarch by moving part of max_exp identification to stage 2 and by avoiding two's complement in the stage 3 accumulator with a sign popcount trick
- added dual sparse mask (DSM) generation module before TFR FEDP for dynamic power reduction via clock-gating
- added TCU format support for MXFP8, MXBF8, MXFP4, NVFP4, and MXINT8 in simx + rtl + regression test
- added MX metadata handling through TCU_LD
- split SP/MX TCU SRAM, scoreboard reg, and host runtime/include helpers
- WMMA only for now (WGMMA wip)
- added ci/regression coverage for MX formats under tensor_mx()
- fixed submodule-related actions failures by adding recursive submodule checkout/update and invalidating third_party cache when submodule pins change
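For readers unfamiliar with the MX formats listed above, here is a minimal sketch of how an MXFP4 block decodes, assuming the OCP Microscaling layout (E2M1 elements sharing one E8M0 scale per block). The function names are hypothetical and this is not the simx or RTL code from this PR; it only illustrates why the shared scale can be applied once after the dot product.

```python
# E2M1 (FP4) magnitudes indexed by the low 3 bits: 2 exponent bits, 1 mantissa bit.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_e2m1(nibble: int) -> float:
    """Decode one 4-bit E2M1 element (bit 3 is the sign)."""
    sign = -1.0 if (nibble >> 3) & 1 else 1.0
    return sign * E2M1_MAGNITUDES[nibble & 0x7]


def decode_e8m0(scale_byte: int) -> float:
    """Decode the shared E8M0 scale: a power of two with bias 127, 0xFF is NaN."""
    if scale_byte == 0xFF:
        return float("nan")
    return 2.0 ** (scale_byte - 127)


def decode_mxfp4_block(scale_byte: int, nibbles: list[int]) -> list[float]:
    """Decode a block of E2M1 elements that share one E8M0 scale."""
    scale = decode_e8m0(scale_byte)
    return [scale * decode_e2m1(n) for n in nibbles]


def mx_dot(scale_a: int, a: list[int], scale_b: int, b: list[int]) -> float:
    """Dot product of two MXFP4 blocks.

    The shared scales factor out of the sum, so the element products can be
    accumulated unscaled and the two scales applied once at the end.
    """
    acc = sum(decode_e2m1(x) * decode_e2m1(y) for x, y in zip(a, b))
    return decode_e8m0(scale_a) * decode_e8m0(scale_b) * acc
```

The other formats differ in element encoding and, for NVFP4, in scale type and block size, so this sketch should not be read as covering them.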
tinebp
left a comment
Can you please add synthesis and performance results for the MX benchmarks:
Synthesis: TCU, TCU+MX, TCU+SP+MX+WMMA with NT=NW=16
Test results with perf=1: sgemm_tcu, sgemm_tcu_sp, sgemm_tcu_mx, sgemm_tcu_sp_mx with n=64, NT=8, NW=8 for simx and rtlsim; add the reports to /perf/tcu/
Include the full configuration string and the perf=1 output with each test result so that we can reproduce it.
op_args.tcu.is_first_uop = 1'b0;
op_args.tcu.is_last_uop = 1'b0;
wr_xregs[XREG_0] = 1'b1;
wr_xregs[rd[4] ? XREG_TCU_MX : XREG_TCU_SP] = 1'b1;
You need to generalize the special registers; we cannot afford to reserve them for the TCU only.
XREG_2, XREG_3
makes sense. went back to XREG_0 (for SP scoreboard bits) and used XREG_1 (for MX) now
- added tensor_mx() tests
- CI was caching third_party based on .gitmodules only instead of the pinned submodule SHAs as well. This was restoring stale third_party contents after a submodule version bump, thereby causing build failures. The new hash makes the cache key depend on the submodule commit pins, and forced recursive update ensures restored caches are corrected before build
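The cache-key fix described above can be sketched as follows. This is only an illustration of the idea, not the workflow change in this PR: the function name, key prefix, and the way the pins are collected (for example from the output of `git submodule status`) are assumptions.

```python
import hashlib


def third_party_cache_key(gitmodules_text: str, submodule_pins: dict[str, str]) -> str:
    """Build a cache key from .gitmodules plus the pinned submodule commits.

    submodule_pins maps each submodule path to the commit SHA the parent
    repository pins it to. Hashing only .gitmodules would miss a version
    bump, because a bump changes the pin but not the .gitmodules file.
    """
    h = hashlib.sha256()
    h.update(gitmodules_text.encode())
    # Sort by path so the key does not depend on dictionary ordering.
    for path in sorted(submodule_pins):
        h.update(f"{path}={submodule_pins[path]}\n".encode())
    return "third_party-" + h.hexdigest()[:16]
```

With a key like this, bumping a submodule pin yields a new key, so the stale cache is no longer restored. The forced recursive update then acts as a second line of defense for any cache that is restored anyway.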
.tcu_lmem_if (tcu_lmem_if),
`endif
`ifdef VX_CFG_TCU_SPARSE_ENABLE
`ifdef VX_CFG_TCU_META_ENABLE
parameter `STRING INSTANCE_ID = "",
parameter LATENCY = 0,
parameter N = TCU_TC_K
parameter N = 1,
was indicating how it could be a single-element FEDP (essentially FMA), while TFR is a "genuinely fused" dot product with N=2, similar to how you handled this in the tcu_fedp unittest.
but saying TCU_TC_K does improve readability when connecting fedp backends with VX_tcu_core. rolling back!
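To illustrate the N=1 point, here is a small numerical model, assuming "fused" means the products and the accumulator are summed exactly and rounded only once at the end. The `fedp` function below is a hypothetical reference model built on exact rationals, not the RTL backend; with one element it behaves like an FMA, and with more it is a fused dot product.

```python
from fractions import Fraction


def fedp(a: list[float], b: list[float], c: float) -> float:
    """Fused dot product model: exact sum of a[i]*b[i] plus c, rounded once."""
    assert len(a) == len(b)
    acc = Fraction(c)  # Fraction(float) is exact, so no rounding happens here
    for x, y in zip(a, b):
        acc += Fraction(x) * Fraction(y)
    return float(acc)  # single rounding to the nearest double


# N = 1: same result as a fused multiply-add
x = 1.0 + 2.0 ** -30
c = -(1.0 + 2.0 ** -29)
fused = fedp([x], [x], c)  # keeps the low-order 2**-60 term of x*x
unfused = x * x + c        # x*x is rounded first, so that term is lost

# N = 2: a two-element fused dot product
two_element = fedp([1.0, 2.0], [3.0, 4.0], 0.5)
```

The fused and unfused results differ because the unfused form rounds the product before the addition.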