TFR optimizations + MX support #374

Draft
NikhilRout wants to merge 6 commits into master from feature_mx

Conversation

@NikhilRout

Collaborator
  • added fpnew FP16/BF16 FEDP backend
  • fixed FPGA (fpnew related) makefiles

  • optimized the TFR FEDP microarchitecture by moving part of the max_exp identification to stage 2 and by avoiding two's complement negation in the stage 3 accumulator with a sign popcount trick
  • added dual sparse mask (DSM) generation module before TFR FEDP for dynamic power reduction via clock-gating
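
The two TFR items above can be illustrated with a small behavioral sketch. It is an assumption about what the bullets mean, not the RTL in this PR: the "sign popcount trick" is read as the identity -x = ~x + 1, so negative terms are only bit-inverted and the missing +1s are added once as the count of negative signs; the "dual sparse mask" is read as one bit per lane that is set only when both operands are nonzero. The function names and the `width` parameter are hypothetical.

```python
def to_signed(value, width):
    """Interpret the low `width` bits of value as a two's complement integer."""
    value &= (1 << width) - 1
    return value - (1 << width) if value >> (width - 1) else value


def accumulate_popcount(mags, signs, width):
    """Sum sign-magnitude terms without negating each term individually.

    Assumed reading of the trick: since -x == ~x + 1, a negative term is
    only bit-inverted, and the missing "+1"s are added once as the number
    of negative signs. `width` must be wide enough to hold the signed sum.
    """
    mask = (1 << width) - 1
    total = 0
    for mag, neg in zip(mags, signs):
        total += (mag ^ mask) if neg else mag
    total += sum(signs)  # popcount of the sign bits
    return to_signed(total, width)


def dual_sparse_mask(a, b):
    """Bit i is set only when both a[i] and b[i] are nonzero.

    Lanes with a cleared bit contribute a zero product, so in hardware
    they are candidates for clock gating.
    """
    mask = 0
    for i, (x, y) in enumerate(zip(a, b)):
        if x != 0 and y != 0:
            mask |= 1 << i
    return mask
```

For example, `accumulate_popcount([5, 3, 7], [0, 1, 1], 8)` models 5 - 3 - 7, and `dual_sparse_mask([1, 0, 2, 3], [4, 5, 0, 6])` keeps only lanes 0 and 3 active.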

  • added TCU format support for MXFP8, MXBF8, MXFP4, NVFP4, and MXINT8 in simx + rtl + regression test
  • added MX metadata handling through TCU_LD
  • split SP/MX TCU SRAM, scoreboard reg, and host runtime/include helpers
  • WMMA only for now (WGMMA wip)

  • added ci/regression coverage for MX formats under tensor_mx()
  • fixed submodule-related actions failures by adding recursive submodule checkout/update and invalidating third_party cache when submodule pins change
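
As background for the MX formats listed above, here is a minimal sketch of a block-scaled dot product for the FP4 case. It assumes the element and scale encodings of the OCP Microscaling specification as I understand them (E2M1 elements with exponent bias 1, and a shared E8M0 scale equal to 2^(bits - 127) with 0xFF reserved for NaN). It does not reflect how the PR's RTL or simx code is organized, and all function names are hypothetical.

```python
def decode_e2m1(code):
    """Decode a 4-bit E2M1 value: 1 sign, 2 exponent, 1 mantissa bit."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0x3
    man = code & 0x1
    if exp == 0:
        return sign * man * 0.5  # subnormal: 0 or 0.5
    return sign * (1.0 + 0.5 * man) * 2.0 ** (exp - 1)


def decode_e8m0(scale_bits):
    """Decode a shared E8M0 scale: 2**(bits - 127); 0xFF encodes NaN."""
    if scale_bits == 0xFF:
        return float("nan")
    return 2.0 ** (scale_bits - 127)


def mx_dot(a_codes, a_scale, b_codes, b_scale):
    """Dot product of two blocks that each share one scale.

    The element products are summed first and the two block scales are
    applied once at the end, which is what makes the shared scale cheap.
    """
    acc = sum(decode_e2m1(x) * decode_e2m1(y) for x, y in zip(a_codes, b_codes))
    return acc * decode_e8m0(a_scale) * decode_e8m0(b_scale)
```

With scales 128 (x2) and 126 (x0.5), `mx_dot([0b0010, 0b0111], 128, [0b0100, 0b0001], 126)` computes (1*2 + 6*0.5) * 2 * 0.5.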

@tinebp tinebp left a comment

Collaborator

Can you please add synthesis and performance results for the MX benchmarks:
Synthesis: TCU, TCU+MX, and TCU+SP+MX+WMMA with NT=NW=16.
Test results with perf=1: sgemm_tcu, sgemm_tcu_sp, sgemm_tcu_mx, and sgemm_tcu_sp_mx with n=64, NT=8, NW=8 for simx and rtlsim. Add the reports to /perf/tcu/.
Include the full configuration string and the perf=1 output in each test result so that we can reproduce it.

Comment thread hw/rtl/core/VX_decode.sv Outdated
op_args.tcu.is_first_uop = 1'b0;
op_args.tcu.is_last_uop = 1'b0;
wr_xregs[XREG_0] = 1'b1;
wr_xregs[rd[4] ? XREG_TCU_MX : XREG_TCU_SP] = 1'b1;

Collaborator

You need to generalize the special registers; we cannot afford to reserve them for the TCU only.
XREG_2, XREG_3

@NikhilRout NikhilRout Jun 19, 2026

Collaborator Author

makes sense. went back to XREG_0 (for SP scoreboard bits) and used XREG_1 (for MX) now

Comment thread .github/workflows/ci.yml

Collaborator

Why is this file being changed?

Collaborator Author

  1. added tensor_mx() tests
  2. CI was keying the third_party cache on .gitmodules only, not on the pinned submodule SHAs. After a submodule version bump this restored stale third_party contents and caused build failures. The new hash makes the cache key depend on the submodule commit pins, and a forced recursive update ensures that restored caches are corrected before the build
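
The cache-key change described in item 2 can be sketched as follows. This is an illustration of the idea only, not the workflow in this PR: the function name, the key prefix, and the example path and SHAs are hypothetical, and the real workflow would compute the hash with CI tooling rather than Python.

```python
import hashlib


def third_party_cache_key(gitmodules_text, submodule_pins):
    """Build a cache key from .gitmodules plus the pinned submodule commits.

    `submodule_pins` maps a submodule path to the commit SHA recorded in
    the superproject. Hashing only `gitmodules_text` would miss a version
    bump, because a bump changes the pin but not .gitmodules.
    """
    digest = hashlib.sha256()
    digest.update(gitmodules_text.encode("utf-8"))
    for path in sorted(submodule_pins):  # sorted for a stable key
        digest.update(f"\n{path}={submodule_pins[path]}".encode("utf-8"))
    return "third_party-" + digest.hexdigest()[:16]


# Hypothetical example: same .gitmodules, different pin, different key.
gitmodules = '[submodule "third_party/example"]\n\tpath = third_party/example\n'
old_key = third_party_cache_key(gitmodules, {"third_party/example": "a" * 40})
new_key = third_party_cache_key(gitmodules, {"third_party/example": "b" * 40})
```

Because the pin is part of the hash, `old_key` and `new_key` differ even though .gitmodules is unchanged, so a submodule bump no longer restores the stale cache.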

Comment thread hw/rtl/core/VX_execute.sv
.tcu_lmem_if (tcu_lmem_if),
`endif
`ifdef VX_CFG_TCU_SPARSE_ENABLE
`ifdef VX_CFG_TCU_META_ENABLE

Collaborator

excellent idea!!!

Comment thread hw/rtl/tcu/dpi/VX_tcu_fedp_dpi.sv Outdated
parameter `STRING INSTANCE_ID = "",
parameter LATENCY = 0,
parameter N = TCU_TC_K
parameter N = 1,

Collaborator

N = 1?

Collaborator Author

was indicating how it could be a single-element FEDP (essentially FMA), while TFR is a "genuinely fused" dot product with N=2, similar to how you handled this in the tcu_fedp unittest.

but saying TCU_TC_K does improve readability when connecting fedp backends with VX_tcu_core. rolling back!
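
The distinction drawn above between N=1 (an FMA) and a fused dot product with N=2 can be shown with a behavioral sketch. It assumes that "fused" means the whole sum of products plus the addend is rounded once. It models this in double precision with exact rationals, which is not the FP16/BF16 datapath in this PR, and both function names are hypothetical.

```python
from fractions import Fraction


def fedp_fused(a, b, c):
    """Fused N-element dot product: sum(a[i]*b[i]) + c with a single rounding.

    Fraction arithmetic is exact, so the only rounding is the final
    conversion back to float. With len(a) == 1 this behaves like an FMA.
    """
    exact = Fraction(c)
    for x, y in zip(a, b):
        exact += Fraction(x) * Fraction(y)
    return float(exact)


def fedp_unfused(a, b, c):
    """Same computation, but rounded after every multiply and add."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y
    return acc + c


big = 2.0 ** 53
fused = fedp_fused([big, 1.0], [1.0, 1.0], -big)      # exact sum is 1
unfused = fedp_unfused([big, 1.0], [1.0, 1.0], -big)  # the 1.0 is rounded away
```

In this example the fused result keeps the small term while the unfused one loses it at the intermediate rounding, which is the behavior that distinguishes a fused dot product from a chain of separate operations.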

@NikhilRout NikhilRout marked this pull request as draft June 19, 2026 15:17