Merged
1 change: 1 addition & 0 deletions VX_config.toml
@@ -68,6 +68,7 @@ VX_CFG_NUM_BARRIERS = 8
VX_CFG_MAX_BAR_EVENTS = 32

VX_CFG_IBUF_SIZE = 4
VX_CFG_DISPATCH_QUEUE_SIZE = 4
VX_CFG_ISSUE_WIDTH = "expr: up($VX_CFG_NUM_WARPS / 16)"
VX_CFG_SIMD_WIDTH = "expr: $VX_CFG_NUM_THREADS"
VX_CFG_NUM_OPCS = "expr: up($VX_CFG_NUM_WARPS / (4 * $VX_CFG_ISSUE_WIDTH))"
2 changes: 1 addition & 1 deletion docs/coding_guidelines_cpp.md
@@ -211,4 +211,4 @@ DT(2, "req: wid=" << wid << " pc=0x" << std::hex << pc << std::endl);

## 10. Comment Content & Intent

Comments describe what the adjacent code does and why, not the process that produced it. Prefer self-documenting code — good abstractions and consistent naming — and drop comments on code whose intent is already obvious; keep the rest brief, one or two lines per block as the norm (longer only where genuinely warranted, at the author's discretion), since over-detailed comments obscure the code and drift out of sync with later changes. Never embed development metadata or history (phase/step/version/part/feature/bug numbers, "proposal", "spec"), debugging or change narration ("fixing bug…", "was broken because…" — that is what commit messages are for), or references to design documents. SimX and other host-side models must not reference RTL details in comments or naming: the two evolve independently, so any such reference silently goes stale. These rules apply to every source file and script.
Comments describe what the adjacent code does and why, not the process that produced it. Prefer self-documenting code — good abstractions and consistent naming — and drop comments on code whose intent is already obvious; keep the rest brief, one or two lines per block as the norm (longer only where genuinely warranted, at the author's discretion), since over-detailed comments obscure the code and drift out of sync with later changes. Never embed development metadata or history (phase/step/version/part/feature/bug numbers, "proposal", "spec"), debugging or change narration ("fixing bug…", "was broken because…" — that is what commit messages are for), or references to design documents. Comments and names must not reference the other implementation layer's internals: host-side models (SimX, runtime, drivers) must not name RTL signals or parameters, and RTL must not name host-side/SimX details. The layers evolve independently, so any such reference silently goes stale. These rules apply to every source file and script.
2 changes: 1 addition & 1 deletion docs/coding_guidelines_verilog.md
@@ -221,7 +221,7 @@ Incorrect (space-separated entries):

## 9. Comment Content & Intent

Comments describe what the adjacent code does and why, not the process that produced it. Prefer self-documenting code — good abstractions and consistent naming — and drop comments on code whose intent is already obvious; keep the rest brief, one or two lines per block as the norm (longer only where genuinely warranted, at the author's discretion), since over-detailed comments obscure the code and drift out of sync with later changes. Never embed development metadata or history (phase/step/version/part/feature/bug numbers, "proposal", "spec"), debugging or change narration ("fixing bug…", "was broken because…" — that is what commit messages are for), or references to design documents. Do not anchor comments to volatile cross-component details (e.g., the SimX/software model) that evolve independently and will silently go stale. These rules apply to every source file and script.
Comments describe what the adjacent code does and why, not the process that produced it. Prefer self-documenting code — good abstractions and consistent naming — and drop comments on code whose intent is already obvious; keep the rest brief, one or two lines per block as the norm (longer only where genuinely warranted, at the author's discretion), since over-detailed comments obscure the code and drift out of sync with later changes. Never embed development metadata or history (phase/step/version/part/feature/bug numbers, "proposal", "spec"), debugging or change narration ("fixing bug…", "was broken because…" — that is what commit messages are for), or references to design documents. Comments and names must not reference the other implementation layer's internals: host-side models (SimX, runtime, drivers) must not name RTL signals or parameters, and RTL must not name host-side/SimX details. The layers evolve independently, so any such reference silently goes stale. These rules apply to every source file and script.

## 10. Combinational Logic Depth & Timing Closure

2 changes: 2 additions & 0 deletions hw/rtl/VX_gpu_pkg.sv
@@ -542,6 +542,8 @@ package VX_gpu_pkg;
localparam ISSUE_WIS_BITS = `CLOG2(PER_ISSUE_WARPS);
localparam ISSUE_WIS_W = `UP(ISSUE_WIS_BITS);

localparam DISPATCH_QSIZE = `VX_CFG_DISPATCH_QUEUE_SIZE;

localparam PER_OPC_WARPS = PER_ISSUE_WARPS / `VX_CFG_NUM_OPCS;
localparam PER_OPC_NW_BITS = `CLOG2(PER_OPC_WARPS);
localparam PER_OPC_NW_W = `UP(PER_OPC_NW_BITS);
12 changes: 7 additions & 5 deletions hw/rtl/core/VX_dispatcher.sv
@@ -28,7 +28,7 @@ module VX_dispatcher import VX_gpu_pkg::*; #(
VX_operands_if.slave operands_if,

// outputs
output wire [NUM_EX_UNITS-1:0] dispatch_ready,
output wire [NUM_EX_UNITS-1:0] fu_release,
VX_dispatch_if.master dispatch_if [NUM_EX_UNITS]
);
`UNUSED_SPARAM (INSTANCE_ID)
@@ -39,15 +39,17 @@ module VX_dispatcher import VX_gpu_pkg::*; #(
wire [NUM_EX_UNITS-1:0] operands_ready_in;
assign operands_if.ready = operands_ready_in[operands_if.data.ex_type];

// FU-availability feedback to scoreboard to avoid head-of-line blocking
assign dispatch_ready = operands_ready_in;
// Per-FU dispatch credit returned to the scoreboard on FU accept.
for (genvar i = 0; i < NUM_EX_UNITS; ++i) begin : g_fu_release
assign fu_release[i] = dispatch_if[i].valid && dispatch_if[i].ready;
end

// Non-LSU execution units: pass operand data straight through
for (genvar i = 0; i < NUM_EX_UNITS; ++i) begin : g_buffers
if (i != EX_LSU) begin : g_non_lsu
VX_elastic_buffer #(
.DATAW (OUT_DATAW),
.SIZE (2),
.SIZE (DISPATCH_QSIZE),
.OUT_REG (1)
) buffer (
.clk (clk),
@@ -119,7 +121,7 @@ module VX_dispatcher import VX_gpu_pkg::*; #(
// LSU: substitute effective base address and cleared offset for bulk ops
VX_elastic_buffer #(
.DATAW (OUT_DATAW),
.SIZE (2),
.SIZE (DISPATCH_QSIZE),
.OUT_REG (1)
) lsu_buffer (
.clk (clk),
6 changes: 3 additions & 3 deletions hw/rtl/core/VX_issue_slice.sv
@@ -36,7 +36,7 @@ module VX_issue_slice import VX_gpu_pkg::*; #(
VX_scoreboard_if scoreboard_if();
VX_operands_if operands_if();

wire [NUM_EX_UNITS-1:0] dispatch_ready;
wire [NUM_EX_UNITS-1:0] fu_release;

VX_ibuffer #(
.INSTANCE_ID (`SFORMATF(("%s-ibuffer", INSTANCE_ID))),
@@ -60,7 +60,7 @@ module VX_issue_slice import VX_gpu_pkg::*; #(
`ifdef PERF_ENABLE
.perf_stalls (issue_perf.scb_stalls),
`endif
.dispatch_ready (dispatch_ready),
.fu_release (fu_release),
.writeback_if (writeback_if),
.ibuffer_if (ibuffer_if),
.scoreboard_if (scoreboard_if)
@@ -91,7 +91,7 @@ module VX_issue_slice import VX_gpu_pkg::*; #(
.perf_instrs (issue_perf.dispatch_instrs),
`endif
.operands_if (operands_if),
.dispatch_ready (dispatch_ready),
.fu_release (fu_release),
.dispatch_if (dispatch_if)
);

118 changes: 68 additions & 50 deletions hw/rtl/core/VX_scoreboard.sv
@@ -24,7 +24,7 @@ module VX_scoreboard import VX_gpu_pkg::*; #(
output reg [PERF_CTR_BITS-1:0] perf_stalls,
`endif

input wire [NUM_EX_UNITS-1:0] dispatch_ready,
input wire [NUM_EX_UNITS-1:0] fu_release,
VX_writeback_if.slave writeback_if,
VX_ibuffer_if.slave ibuffer_if [PER_ISSUE_WARPS],
VX_scoreboard_if.master scoreboard_if
@@ -38,6 +38,30 @@ module VX_scoreboard import VX_gpu_pkg::*; #(
localparam OUT_DATAW = $bits(scoreboard_t) - ISSUE_WIS_W;
localparam OUT_BUF = 3; // Use skid buffer (SIZE=2, OUT_REG=1)

// Per-FU dispatch credits: spent at issue, reclaimed on FU accept, so a
// credit covers ops still in operand collection (not yet at the queue).
wire [NUM_EX_UNITS-1:0] fu_issue;
wire [NUM_EX_UNITS-1:0] fu_goingfull;

// going-full (not full): a 1-slot guard band keeps outstanding <= queue
// depth despite the registered suppress lag, so an issued op never stalls
// in the shared operand-collection path and HoL-blocks another FU.
for (genvar e = 0; e < NUM_EX_UNITS; ++e) begin : g_fu_goingfull
VX_pending_size #(
.SIZE (DISPATCH_QSIZE)
) fu_pending (
.clk (clk),
.reset (reset),
.incr (fu_issue[e]),
.decr (fu_release[e]),
`UNUSED_PIN (empty),
`UNUSED_PIN (alm_empty),
`UNUSED_PIN (full),
.alm_full (fu_goingfull[e]),
`UNUSED_PIN (size)
);
end

VX_ibuffer_if staging_if [PER_ISSUE_WARPS]();
wire [PER_ISSUE_WARPS-1:0] operands_ready;

@@ -105,10 +129,8 @@ module VX_scoreboard import VX_gpu_pkg::*; #(
end
end

// Operand busy-check uses inuse after writeback release only, keeping the
// arbiter grant off the deep cone. The staging reserve goes into
// inuse_regs_n (which drives the registers) and is re-added to the busy
// check as a shallow conflict gate below.
// Writeback release feeds wb_inuse_regs; the staging reserve is added on
// top to form inuse_regs_n, which the busy check reads directly.
reg [NUM_REGS-1:0] wb_inuse_regs;
reg [NUM_XREGS-1:0] wb_inuse_xregs;
always @(*) begin
@@ -133,46 +155,47 @@ module VX_scoreboard import VX_gpu_pkg::*; #(
end
end

// in_use_mask = inuse_regs_n masked by the operand-dependency set
// (the ibuffer instr on a fire, else the staging instr), shared by the
// regs_busy reduction and the per-operand operands_busy check.
wire [REG_TYPES-1:0][RV_REGS-1:0] in_use_mask;
wire [REG_TYPES-1:0] rd_resv_hit; // staged rd vs operand set
for (genvar i = 0; i < REG_TYPES; ++i) begin : g_in_use_mask
wire [RV_REGS-1:0] ibf_reg_mask = ibf_opd_mask[0][i] | ibf_opd_mask[1][i] | ibf_opd_mask[2][i] | ibf_opd_mask[3][i];
wire [RV_REGS-1:0] stg_reg_mask = stg_opd_mask[0][i] | stg_opd_mask[1][i] | stg_opd_mask[2][i] | stg_opd_mask[3][i];
wire [RV_REGS-1:0] regs_mask = ibuffer_fire ? ibf_reg_mask : stg_reg_mask;
assign in_use_mask[i] = wb_inuse_regs[i * RV_REGS +: RV_REGS] & regs_mask;
assign rd_resv_hit[i] = |(stg_opd_mask[0][i] & regs_mask);
assign in_use_mask[i] = inuse_regs_n[i * RV_REGS +: RV_REGS] & regs_mask;
end

wire [REG_TYPES-1:0] regs_busy;
for (genvar i = 0; i < REG_TYPES; ++i) begin : g_regs_busy
assign regs_busy[i] = (in_use_mask[i] != 0);
assign regs_busy[i] = (| in_use_mask[i]);
end

for (genvar i = 0; i < NUM_OPDS; ++i) begin : g_operands_busy
wire [REG_TYPE_BITS-1:0] rtype = get_reg_type(stg_opds[i]);
assign operands_busy[i] = (in_use_mask[rtype] & stg_opd_mask[i][rtype]) != 0;
assign operands_busy[i] = | (in_use_mask[rtype] & stg_opd_mask[i][rtype]);
end


wire [NUM_XREGS-1:0] xregs_mask = ibuffer_fire ? ibf_xregs_mask : stg_xregs_mask;
wire [NUM_XREGS-1:0] xregs_busy = wb_inuse_xregs & xregs_mask;

// Shallow gates equivalent to reserving rd / wr_xregs in the busy reduce.
wire rd_resv_conflict = staging_fire && staging_if[w].data.wb && (|rd_resv_hit);
wire x_resv_conflict = staging_fire && (|(staging_if[w].data.wr_xregs & xregs_mask));
wire xregs_busy = | (inuse_xregs_n & xregs_mask);

wire [EX_BITS-1:0] ex_sel = ibuffer_fire ? ibuffer_if[w].data.ex_type : staging_if[w].data.ex_type;
reg operands_ready_r;

// Readiness folds data hazards and FU-congestion into one flop; FU-lock
// is enforced downstream by masking the arbiter requests.
wire data_ready = ~((|regs_busy) || xregs_busy);
wire operands_ready_n = data_ready && ~fu_goingfull[ex_sel];

always @(posedge clk) begin
if (reset) begin
inuse_regs <= '0;
inuse_xregs <= '0;
end else begin
inuse_regs <= inuse_regs_n;
inuse_regs <= inuse_regs_n;
inuse_xregs <= inuse_xregs_n;
end
operands_ready_r <= (regs_busy == 0) && !rd_resv_conflict
&& (xregs_busy == 0) && !x_resv_conflict;
operands_ready_r <= operands_ready_n;
end

assign operands_ready[w] = operands_ready_r;
@@ -214,24 +237,18 @@ module VX_scoreboard import VX_gpu_pkg::*; #(
end

wire [PER_ISSUE_WARPS-1:0] arb_valid_in;
wire [PER_ISSUE_WARPS-1:0] arb_suppress;
wire [PER_ISSUE_WARPS-1:0][OUT_DATAW-1:0] arb_data_in;
wire [PER_ISSUE_WARPS-1:0] arb_ready_in;

reg [NUM_EX_UNITS-1:0] fu_locked;

wire [PER_ISSUE_WARPS-1:0] fu_lock_block;
for (genvar w = 0; w < PER_ISSUE_WARPS; ++w) begin : g_fu_lock_block
wire [EX_BITS-1:0] w_ex = staging_if[w].data.ex_type;
// Block warps when FU is locked. fu_lock=1 means acquire request.
assign fu_lock_block[w] = fu_locked[w_ex] && staging_if[w].data.fu_lock;
end
// FU lock: a sequence must not interleave with another warp at the same FU.
// fu_locked ('1 = open, one-hot = locked) gates arb_valid_in so only the lock
// holder is requested while it holds the lock.
reg [PER_ISSUE_WARPS-1:0] fu_locked;

for (genvar w = 0; w < PER_ISSUE_WARPS; ++w) begin : g_arb_data_in
// valid: data-hazard + FU-lock check (drives age tracking in GTO)
assign arb_valid_in[w] = staging_if[w].valid && operands_ready[w] && ~fu_lock_block[w];
// suppress: FU-full check (skips selection without resetting age)
assign arb_suppress[w] = ~dispatch_ready[staging_if[w].data.ex_type];
// operands_ready carries data-hazard + FU-congestion; fu_locked adds the
// FU-lock gate so only the lock holder is requested during a sequence.
assign arb_valid_in[w] = staging_if[w].valid && operands_ready[w] && fu_locked[w];

assign arb_data_in[w] = {
staging_if[w].data.uuid,
@@ -253,28 +270,22 @@ module VX_scoreboard import VX_gpu_pkg::*; #(
assign staging_if[w].ready = arb_ready_in[w] && operands_ready[w];
end

// GTO arbiter with suppress: FU-stalled warps keep aging but are
// skipped for selection, preserving their priority for when the FU drains.
// Only suppress when at least one warp can issue to a free FU; otherwise
// let all warps through so the pipeline buffers absorb transient stalls.

localparam LOG_NUM_REQS = `LOG2UP(PER_ISSUE_WARPS);

wire any_unsuppressed = |(arb_valid_in & ~arb_suppress);
wire [PER_ISSUE_WARPS-1:0] eff_suppress = any_unsuppressed ? arb_suppress : '0;

wire arb_valid;
wire [LOG_NUM_REQS-1:0] arb_index;
wire [PER_ISSUE_WARPS-1:0] arb_onehot;
wire arb_ready;

VX_gto_arbiter #(
.NUM_REQS (PER_ISSUE_WARPS)
// Matrix arbiter scales better past 8 requesters; RR is cheaper below that.
VX_generic_arbiter #(
.NUM_REQS (PER_ISSUE_WARPS),
.TYPE ((PER_ISSUE_WARPS > 8) ? "M" : "R"),
.STICKY (1) // Greedy
) out_arb (
.clk (clk),
.reset (reset),
.requests (arb_valid_in),
.suppress (eff_suppress),
.grant_valid (arb_valid),
.grant_index (arb_index),
.grant_onehot (arb_onehot),
@@ -294,8 +305,8 @@ module VX_scoreboard import VX_gpu_pkg::*; #(

assign arb_ready = ready_out_w;

// FU lock: prevent warp interleaving during multi-uop sequences.
// 10=acquire (first uop), 00=middle, 01=release (last uop), 11=default.
// A sequence carries fu_lock on its first uop (acquire) and fu_unlock on its
// last (release); 11 = single-uop default.

wire issue_fire = valid_out_w && ready_out_w;

@@ -308,18 +319,25 @@ module VX_scoreboard import VX_gpu_pkg::*; #(
assign staging_fu_unlock_vec[w] = staging_if[w].data.fu_unlock;
end

wire [EX_BITS-1:0] issue_ex = staging_ex_vec[arb_index];

for (genvar e = 0; e < NUM_EX_UNITS; ++e) begin : g_fu_issue
assign fu_issue[e] = issue_fire && (issue_ex == EX_BITS'(e));
end

// Lock to the granted warp on acquire, hold across its sequence, reopen on
// release. arb_onehot is the granted warp, registered into fu_locked.
wire issue_fu_lock = staging_fu_lock_vec[arb_index];
wire issue_fu_unlock = staging_fu_unlock_vec[arb_index];
wire [EX_BITS-1:0] issue_ex = staging_ex_vec[arb_index];

always @(posedge clk) begin
if (reset) begin
fu_locked <= '0;
fu_locked <= '1;
end else if (issue_fire) begin
if (issue_fu_lock && ~issue_fu_unlock) begin
fu_locked[issue_ex] <= 1'b1;
end else if (~issue_fu_lock && issue_fu_unlock) begin
fu_locked[issue_ex] <= 1'b0;
fu_locked <= arb_onehot;
end else if (issue_fu_unlock) begin
fu_locked <= '1;
end
end
end
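Review note on the credit scheme in VX_scoreboard.sv: the comment there says a 1-slot guard band keeps outstanding ops within the queue depth despite the registered suppress lag. The Python sketch below is a simplified behavioral model of one FU's counter, not the RTL. It assumes (hypothetically) that `alm_full` asserts when the count reaches `SIZE - 1`, that a warp always wants to issue, and that readiness is sampled one cycle late, as `operands_ready_r` is. The names `simulate`, `threshold`, and `release_prob` are illustrative only.

```python
import random


def simulate(size, threshold, cycles, release_prob, seed=0):
    """Return the peak number of outstanding ops for a single FU.

    size         -- dispatch queue depth (DISPATCH_QSIZE in the diff)
    threshold    -- issue is allowed while count < threshold
                    (size - 1 models going-full, size models plain full)
    release_prob -- chance per cycle that the FU accepts an op (hypothetical)
    """
    rng = random.Random(seed)
    count = 0        # credits spent at issue and not yet reclaimed
    ready_r = False  # registered readiness, one cycle behind the counter
    peak = 0
    for _ in range(cycles):
        issue = ready_r  # assumption: some warp is always waiting
        release = count > 0 and rng.random() < release_prob
        ready_n = count < threshold  # computed from this cycle's count
        count += int(issue) - int(release)
        ready_r = ready_n
        peak = max(peak, count)
    return peak


if __name__ == "__main__":
    depth = 4
    print("going-full guard:", simulate(depth, depth - 1, 100, 0.0))
    print("plain full flag: ", simulate(depth, depth, 100, 0.0))
```

Under these assumptions, and with the FU never draining, the guard-band variant peaks at exactly the queue depth, while gating on a plain full flag overshoots by one op because of the one-cycle lag. That extra op is the one that would sit in the shared operand-collection path.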
1 change: 0 additions & 1 deletion hw/rtl/libs/VX_generic_arbiter.sv
@@ -104,7 +104,6 @@ module VX_generic_arbiter #(
.clk (clk),
.reset (reset),
.requests (requests),
.suppress ({NUM_REQS{1'b0}}),
.grant_valid (grant_valid),
.grant_index (grant_index),
.grant_onehot (grant_onehot),
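Review note on the FU lock: with the `suppress` port gone, the scoreboard gates warps by masking arbiter requests with `fu_locked`, which the diff now indexes by warp (`'1` = open, one-hot = locked). The Python sketch below models only that mask update and request gating as read from the diff. The function names and the integer bitmask encoding are illustrative assumptions, not part of the RTL.

```python
def next_lock_mask(mask, num_warps, granted, fu_lock, fu_unlock):
    """Return the lock mask after an issue fires for warp `granted`.

    fu_lock / fu_unlock follow the diff: (1,0) acquire, (0,0) middle,
    (0,1) release, (1,1) single-uop default.
    """
    all_open = (1 << num_warps) - 1
    if fu_lock and not fu_unlock:
        return 1 << granted  # lock to the granted warp (one-hot)
    if fu_unlock:
        return all_open      # release or single-uop default reopens
    return mask              # middle uop: hold the current mask


def arbiter_requests(mask, ready):
    """Warps requested at the arbiter: ready warps that pass the lock mask."""
    return ready & mask


if __name__ == "__main__":
    warps = 4
    mask = (1 << warps) - 1                          # reset value: open
    mask = next_lock_mask(mask, warps, 2, 1, 0)      # warp 2 acquires
    print(bin(arbiter_requests(mask, 0b1111)))       # only warp 2 requested
    mask = next_lock_mask(mask, warps, 2, 0, 0)      # middle uop holds
    mask = next_lock_mask(mask, warps, 2, 0, 1)      # release reopens
    print(bin(arbiter_requests(mask, 0b1111)))
```

In this model the lock holder is the only warp requested between acquire and release, regardless of which FU the other warps target. That matches the per-warp width of `fu_locked` in the diff, but whether that breadth is intended is a question for the author.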