Skip to content

Tracer ze universal#505

Open
TApplencourt wants to merge 46 commits into
develfrom
tracer_ze_universal
Open

Tracer ze universal#505
TApplencourt wants to merge 46 commits into
develfrom
tracer_ze_universal

Conversation

@TApplencourt

Copy link
Copy Markdown
Collaborator

Suppor:

  • re-submiting event
  • command list executed multiple time
  • Ze barrier event

What a nightmare but it's working.

Will push my test another PR. Need to clean-it up first.

TApplencourt and others added 30 commits May 29, 2026 15:52
When the env var is set, _get_profiling_event's successful
ZE_EVENT_CREATE_PTR calls increment a global counter; _lib_cleanup
prints the total to stderr ('THAPI: injected events: N') at process
exit.

Useful for the bats infra to lock in baseline per-test injection
counts: any change to the v2 path (e.g. the upcoming lazy fallback
for user-event reuse) that inflates injection beyond the baseline
fails its regression check.

Both the counter increment and the stderr print are gated by the
env var read at init time so the steady-state hot path pays nothing.
A single ze_event_handle_t can be the signal event of multiple Appends
on the same cmdlist — the tracer's lazy capture inserts a per-Append
Query for each occurrence, producing one event_profiling_results
tracepoint per use. Previously eventToBtxDesct[hEvent] held a single
tuple, so the second event_profiling overwrote the first, and all
subsequent results attributed to the last kernel paired with that
handle (typical symptom: 4 Appends signaling one event tallied as
4 calls of the last kernel, 0 calls of the other three).

Switch the map's value type to std::deque<btx_event_desct_t>:
  - event_profiling pushes_back metadata for that hEvent.
  - event_profiling_result pops_front, attributes to that metadata,
    then push_backs the same entry. Rotation keeps the deque shape
    stable across cmdlist resubmits — N Appends produce N deque
    entries at build time; each Execute generates N results that
    cycle through those N entries in FIFO order.
  - zeCommandQueueExecuteCommandLists's per-cl walk updates every
    entry in the deque (not just the first).

New regression test interval_profiling_shared_event covers the bug
directly: 4 Barriers signaling the same hEvent, 4 results — expected
output has 4 distinct device-side ts (post-fix); pre-fix would have
collapsed all 4 to the last entry's ts.
The per-hEvent FIFO deque previously rotated (pop_front + push_back) on
every event_profiling_result, intended to let a single set of build-time
pushes serve M*N results across M cmdlist resubmits. But for the common
case (1 push, 1 pop), rotation leaves stale entries in the deque, so the
next push reads the wrong metadata. Symptom: 3 distinct kernel runs
attributed as (busy_a=2, busy_b=1) instead of (busy_a=1, busy_b=1,
busy_c=1).

Switch to pop-and-discard: each Append pushes once, each result pops
once. Resubmit-without-rebuild of the same cmdlist is no longer
supported here; it is a deferred case on the tracer side as well.

Also add zeKernelDestroy_entry to erase kernelToDesct[hKernel]. Drivers
recycle freed kernel handle addresses; without the erase, the old entry
silently survives until overwritten by the next Create. Defensive even
though the rotation fix is what unblocks the busy-timing reproducer.
Every profiled Append (Launch*, Memcpy*, Barrier, SignalEvent, ...) is
rewritten to:

  inj_ev = inject_event(cl)         // fresh wrapper from per-context pool
  user_signal = hSignalEvent        // saved (may be NULL)
  <user Append, with hSignalEvent swapped to inj_ev>
  AppendQueryKernelTimestamps(cl, 1, &inj_ev, slab, &slot_off,
                              hSignalEvent=user_signal,
                              waitEvents=1 &inj_ev)

The Query waits on inj_ev (so it runs after the user op completes) and
signals the user's original event (so user wait semantics are preserved).
Per-cl slot list records (inj_ev, attribution_event, slab_offset). On
sync (queue / event / fence / cl-host) the slabs of every tracked cl
are drained — each slot emits one event_profiling_results to the
attribution event (user's, or inj_ev if user passed NULL).

Identical for in-order / OOO and regular / immediate cmdlists: 1
injected event per Append, 1 extra API call per Append. No host-side
KT read; no per-event-handle bookkeeping. Cross-cmdlist event reuse,
mid-chain re-signal, and shared signal events all work because each
Append has its own slab slot independent of which user event it
attributes to.

Resubmit of the same regular cmdlist without rebuild between Executes
is deferred: the slab slots get overwritten by the second Execute.
Reset followed by re-Append works (slot list resets on Reset).

This commit:
  - adds the slab pool + slot list to _ze_command_list_obj_data
  - adds _universal_record_append, _cl_drain, _on_sync_drain_{cl,all}
  - swaps the model profiling_prologue/epilogue to the universal path
  - wires drain hooks to sync APIs (queue/event/fence/cl-host)
  - removes the legacy per-Append injected-event-attached-to-cl path:
    _on_destroy_event, _on_reset_event, _unregister_ze_event,
    _dump_and_reset_our_event, _profile_event_results, _event_cleanup,
    _on_execute_command_lists, cl_data->events, _ZE_EXECUTED,
    _ZE_IMMEDIATE_CMD, _ze_events hash, FIND/ADD_ZE_EVENT macros
  - drops zeEventDestroy/zeEventHostReset/zeCommandQueueExecuteCommandLists
    model hooks (no consumers left)

The injected wrapper still goes through the per-context PUT_ZE_EVENT
free pool so the underlying ze_event_handle is reused — only the
slot-bookkeeping is freshly allocated per Append.

35/46 reproducer tests pass (was 24/46 with the prior v2 partial path).
The 11 remaining failures are all in the deferred resubmit-without-
rebuild class.
Replaces the previous "drain everything + status-gate" approach with
explicit happens-before tracking. Sync points walk the dependency graph
from the synced anchor and drain only the slots reachable.

Per-slot bookkeeping gains:
  waits[]     user wait events copied at Append, stable across Executes
  preds[]     pointers to predecessor slots, computed at instantiate
  live        per-run: true between instantiate and drain

Per-cl bookkeeping gains:
  in_flight_q queue of the most recent un-drained Execute (NULL if none)
  mtx         serializes Execute prologue (force-sync + instantiate)
  is_immediate, is_in_order

Global:
  _ze_latest: hash mapping event_handle -> most recent slot whose
              attr matches (used to resolve happens-before edges)

Algorithm:
  Append:    record slot, copy user waits, for immediate cls
             instantiate inline
  Execute:   if in_flight_q set, force-sync that queue + drain_cl,
             then instantiate every slot for the new run, stamp
             in_flight_q
  Sync(ev):  drain latest[ev] (recurses on preds)
  Sync(q):   drain every cl whose in_flight_q == q
  Sync(cl):  drain that cl

Drain reads the slab slot, emits event_profiling_results attributed to
the user's signal (or to our injected event if user passed NULL),
clears live + preds, removes from latest[] if still the head. Build-
time fields (inj, attr, off, waits) stay so the next Execute
re-instantiates without re-Appending.

zeCommandListReset/Destroy and zeContextDestroy hooks dropped: the user
must Sync before any of those, which means our slots are already
drained. zeFenceHostSynchronize hook deferred.

Removed status-gating, HostReset(inj) (driver does not un-signal
KERNEL_TIMESTAMP events anyway — see tests/bugs/host_reset_kts_event),
slot survival across drain (replaced by per-run live + dep walk).

On the matrix: 28/40 pass, 12 fail. The 12 are all TALLY MISMATCH:
tracer emits the right number of event_profiling_results (verified via
the new bats helper that runs iprof -t and iprof -j separately), but
the tally collapses duplicates with the same hEvent.
A single hEvent can be the signal event of N Appends in one cl build
phase, and the cl can be re-Executed M times, producing M*N result
events. Previous attempts (single-entry overwrite, FIFO deque + rotate,
FIFO + pop_front) each handled some cases and broke others.

The model that fits all four scenarios in the matrix is a ring with a
cursor:

  push:    if cursor > 0 (we consumed results since the last push),
           the prior ring belongs to a finished build phase — clear
           and reset cursor. Then append the new metadata.
  result:  read entries[cursor], advance cursor, wrap to 0 on overflow.

The wrap handles resubmit (1 push, N results re-read the same ring).
The phase-reset handles a rebuild after results consumed (new push
implies the old entries are stale).

New tally tests added in this directory:
  interval_profiling_resubmit_event           1 push, 2 results
  interval_profiling_shared_event_resubmit    2 pushes, 4 results
  interval_profiling_shared_event_xphase      2 pushes, 4 results, 1 push, 3 results

Existing interval_profiling_shared_event (4 pushes, 4 results) still
passes. End-to-end matrix on PVC: 41/41 pass.
Three new tally-side regression tests (raw tracepoint stream input,
expected output diff'd at make-check time):

  interval_profiling_resubmit_event           single Append re-fired
                                              by a closed cl on two
                                              Executes
  interval_profiling_shared_event_resubmit    two Appends share one
                                              event, cl re-executed
  interval_profiling_shared_event_xphase      shared event used in
                                              two distinct build
                                              phases (the cursor's
                                              new-phase reset path)

All three exercise the per-event cursor ring and would have failed
under the prior single-entry-overwrite or pop_front tallies.
Profiled Appends on a copy-only user cl previously triggered a driver
abort because AppendQueryKernelTimestamps is rejected on copy engines.
Move the Query op to a per-(context, device) tracer-owned immediate
ASYNC compute cl (lazy-created, first compute queue group). The user
cl body now just signals an injected event; the shadow cl's Query
waits on it and writes the timestamps to the slab.

To preserve the user's signal contract, AppendBarrier on the user cl
chains user_signal off inj. To drain correctly, each slot owns its
own shadow_done event the Query signals, and drain host-syncs on it
before reading the slab.

Execute-side bookkeeping (force-sync-prior + Append-Query +
instantiate + claim in_flight_q) all moves to the Execute *epilogue*
under cl_data->mtx as a single critical section. Two reasons:
* Append-Query on the shadow cl before the user cl is in flight
  deadlocks when both share the engine (only one compute group on
  this hardware) — the pending shadow Query holds the engine and the
  user cl can never dispatch.
* Concurrent Executes on the same cl from two threads need the
  force-sync-prior to observe a sibling's claim-in_flight_q
  atomically (regression test: inorder_reg_Event_11).
Slots stored raw pointers to each other in preds[] and were referenced
from the global latest[ev] -> slot map. The slot array was a realloc'd
buffer starting at cap 8 and doubling; on any realloc that moved the
buffer, every stored slot pointer became dangling, and the dep-graph
walks at drain time silently skipped any slot allocated before the
last realloc.

In practice this lost timing records non-uniformly: a sync that
walked preds back through the chain hit a dangling pointer and the
recursion ended early (or hit garbage that happened to look not-live).
Standalone reproducer in tests/bugs/missing_drain_dag/repro.c:
3 waves of 12 Appends each. Before this fix, drain emits 20/36
deterministically (slots 8..11 of each wave). After: 36/36.

Fix is to allocate the slot array once at first use, sized to the
slab cap (64). The slab itself is already hard-capped at 64, so
allowing slot growth past it gained nothing.
_slot_drain used to recurse with the caller's cl_data, but pred
slots can live in another cl when the user signal chain crosses
cls (cl_A signal=e1, cl_B wait=e1 signal=e2, sync(e2) only).
Recursing with caller's cl_data read cl_B->slab + slot_A->off and
emitted slot_A's tracepoint with cl_B's timestamps.

Each slot now carries an owner back-pointer set at _cl_slot_append.
_slot_drain drops its cl_data argument and reads from s->owner->slab,
so cross-cl preds resolve to the right slab. _on_sync_drain_event
collapses from the cl-walk to a direct s->owner->mtx lock.

Reproducer: tests/matrix/inorder_reg_Event_15 (busy_b/busy_a>=3
ratio); fails before this commit, passes after. Full matrix 43/43.
Code review flagged the unguarded recursion. In practice cycles are
impossible: in-order preds index strictly downward, and latest[]
preds always point to slots published earlier. Forming a cycle
requires the user to declare two Appends each waiting on the other's
signal event, which deadlocks L0 itself before we ever drain.

The live-clear-before-recurse would also stop a cycle if one
appeared, so we are belt-and-suspenders by accident. Comment only,
no code change.
The previous commit said live-clear-before-recurse would stop a
cycle. live is cleared AFTER the recursion (see below), not before.
The construction argument alone is the reason no guard is needed;
remove the misleading sentence.
Two single-use helpers (_cl_slab_ensure, _cl_slots_grow) are inlined
into their only caller. The cap-check and the two lazy allocations
read better adjacent than dispatched through helpers.

Side-effect fix: previous code bumped n_slots BEFORE attempting the
slab alloc, so a failing first alloc left a permanent hole at index 0
(n_slots=1 but no slab). New order: cap-check, slots-alloc,
slab-alloc, slot fill-in, THEN n_slots++. An OOM at any allocation
returns NULL with cl_data unchanged.

Rollbacks in _universal_record_append still apply: they undo a fully
successful append after a downstream Barrier/shadow Append fails,
which is correct under either ordering.

Net -10 lines. Full matrix 43/43.
The helper had two callers (immediate-Append shadow lookup, Execute
epilogue shadow lookup) and just lazy-cached the device handle.
Inlined both with the same lazy-cache idiom; the helper and
forward-declared cached_device-read disappear.

The two call sites pass different command_list values to the L0 call:
the Append epilogue uses its local `command_list` parameter, the
Execute epilogue uses `phCommandLists[i]`. The old helper read
cl_data->ptr (set at create time, same handle), so behavior matches.

Full matrix 43/43.
Four near-identical early-return blocks shared a cleanup pattern that
varied only in how much state had been claimed (event(s), cl from
hash, slot). Two labels capture the two boundaries:

  fail_locked: a slot was allocated AND/OR the cl hash entry was
               claimed AND mtx is held. Rollback slot if present,
               unlock, ADD_ZE_CL, fall through.
  fail:       events still owned, no cl claim. PUT them back.

Net -12 lines.

Full matrix 43/43.
C99 forbids an empty __VA_ARGS__ which forced the no-args fork. GCC's
`, ##__VA_ARGS__` extension swallows the dangling comma — already used
elsewhere in THAPI (utils/tracepoint_gen.rb).

Verified clean compile with -DTHAPI_DEBUG (which actually exercises
the variadic body) in addition to the default no-op branch.

Full matrix 43/43.
Added in d5603e4 as a regression guard for a planned bats assertion
that never landed. No test, internal code, or external doc references
it. Easy to reintroduce on demand.

Full matrix 43/43.
cap_slots was set once in _cl_slot_append to _ZE_SLAB_SLOTS_INITIAL
and never read. The old _on_sync_drain_event walked the cls array
using `s >= slots && s < slots + cap_slots` to find a slot's owner,
but that walk went away when slots gained an owner back-pointer
(cbb928b). Field has been dead since.

Full matrix 43/43.
After THAPI_REPORT_INJECTED_EVENTS went away (322402d), _lib_cleanup
had an empty body. Its registration plumbing (_do_cleanup,
THAPI_ATTRIBUTE_DESTRUCTOR, the atexit call, the THAPI_USE_DESTRUCTORS
ifdef pair) was all in service of running it. The CUDA backend still
has a non-empty _lib_cleanup, so its copy of the same plumbing stays.

Full matrix 43/43.
The lazy cached_device / cached_context resolve dance appeared 4 times
(2 callsites x 2 fields) with subtle variations. Two tiny helpers
collapse all four. Walks back commit 83191b9's inline-_cl_get_device
with a better factoring that handles both fields together.

The Execute epilogue's body was one 30-line for-loop with 5 distinct
phases per cl. Extracting _on_execute_one_cl makes the loop a
one-liner and lets the per-cl prose live with the code that
implements it. Matches the granularity of _on_sync_drain_cl /
_on_sync_drain_queue.

Full matrix 43/43.
Previously _universal_record_append wired user_signal -> inj via
AppendBarrier AFTER reserving a slot. If slot reservation failed
(e.g. _ZE_SLAB_SLOTS_INITIAL=64 cap hit by the 65th Append on an
immediate cl), the function jumped to fail_locked and skipped the
barrier — meaning user_signal was never signaled by anything on the
cl, and the user's Sync(user_signal) hung forever.

Reorder so the barrier Append runs first. We may still lose the
timing tracepoint when the slot can't be reserved, but the user's
event still fires and Sync no longer hangs.

Reproducer: tests/matrix/ooo_imm_Event_06. Previously: timed out
after 20s, 0 tracepoints. Now: 64 tracepoints out of expected 65
(cap still loses one slot; the actual cap removal is the next
commit's job).
Slots gain a `refs` field: the count of downstream slots whose
preds[] points here AND that have not yet been drained.
_slot_instantiate atomically increments each pred's refs;
_slot_drain decrements them at cleanup. When a slot reaches
live==0 && refs==0 it is reclaimable.

Reclaim is implemented for immediate cls only. Regular cls bake inj
into the cl body — recycling inj would corrupt the next Execute
round, so their events are reclaimed only at cl destroy (Phase-3).
For immediate slots, _slot_release PUTs inj and shadow_done back to
the per-context pool and frees waits. Slot storage itself stays for
now; Phase-3 will free it along with chunked slot/slab storage that
lifts the 64-Append cap for imm cls.

Full matrix 43/43 (Event_06 still fails on cap until Phase-3).
Imm cls used to deadlock the user program at 65 Appends in one phase
(_cl_slot_append returned NULL above _ZE_SLAB_SLOTS_INITIAL=64 and
_universal_record_append's fail path stopped chaining user_signal off
inj). The previous commit unblocked the chain; this commit lifts the
cap altogether for imm cls and reclaims slot/event storage as ops
complete.

Storage: cl_data->slots/slab pair is replaced by a utlist DL chain of
_ze_slab_chunk{}, each holding 64 slots and a 64-ts slab. Imm cls
grow chunk-by-chunk on demand; regular cls stop at one chunk (their
inj events are baked into the closed cl body, so adding a chunk
post-Close would create slots the body doesn't address).

Lifetime: slots gain a `refs` atomic counter, incremented at
_slot_instantiate per pred edge, decremented at _slot_drain. A slot
is reclaimable iff live==0 AND refs==0. _slot_release PUTs inj +
shadow_done back to the per-context pool, frees waits, and
decrements its chunk's n_held. A chunk frees itself when n_held
reaches 0 AND it isn't the active tail. Regular-cl slots are not
released at drain — their events come back at cl destroy.

Destroy: zeCommandListDestroy hook walks chunks, PUTs events back,
frees chunks + cl_data. Re-added because the chunked layout makes
the leak-on-destroy bound by per-cl lifetime instead of program
lifetime.

Memory shape: O(in-flight slots) at any time, matching the typical
"one long-lived imm cl with many syncs" use case.

Full matrix 45/45.
The tracer was leaving its own L0 objects (event pools/events used as
the recycle pool, shadow command lists keyed by (ctx, device)) alive
inside the user's context. When the user app destroyed the context
before destroying a command list that had our injected events baked
into it (the Intel libomptarget L0 plugin does this at _dl_fini), the
following zeCommandListDestroy segfaulted inside libze_intel_gpu.so.

Add a zeContextDestroy prologue that, while the context is still
valid, walks _ze_cls / _ze_shadow_cls / _ze_event_pools and frees
everything we own keyed to the dying context.
_slot_drain on the last slot of a chunk calls _slot_release, which decrements
n_held to 0 and frees the chunk. The next iteration of the inner for loop then
reads c->n_used and accesses &c->slots[i] on freed memory.

Pin the chunk with an extra n_held bump during the inner loop. Drop the ref
after the loop and free the chunk in _cl_drain itself if it was the last
holder. The chunk-free condition (n_held == 0 && c != tail) is unchanged.

Found by ASAN running ooo_imm_Event_07 (cross-cl chunk-mutation stress).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The L0 driver records each zeCommandListAppendQueryKernelTimestamps in
cl-internal storage (~10 KB per call) and only releases it on
zeCommandListReset/Destroy — never lazily, even after the device has
completed the op. The shadow cl is per-(context, device) and lives for
the context's lifetime, so an app that profiles many Appends grows
unbounded driver memory.

Track live_queries (under sh->mtx): bumped in _shadow_append_query,
decremented in _slot_drain after the shadow_done host-sync. When the
counter hits 0 inside _slot_drain, the shadow cl is idle from our
perspective and we Reset it to reclaim the driver state. Cross-cl case
is handled: with cls A and B each holding a Query in flight, A.sync()
decrements 2→1 (no Reset), B.sync() decrements 1→0 (Reset).

The slot stores its shadow cl so drain knows where to decrement; same
sh* is set in both Append paths (immediate cl and Execute prologue for
regular cls).

Drive-by: split two existing `if (--x == 0 && ...)` sites in
_slot_release and _cl_drain into two statements for consistent style.

Validated with mem_const_steady (added in tests repo): heap delta drops
from 1.93 MB/iter to 32 B/iter over 100 iters of (create cl + 200
Appends + sync + destroy cl). ASAN sweep across 46 correctness tests:
clean (no UAF/OOB/leaks attributable to the tracer). Full correctness
matrix: 46/46 pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When the user's command list is on a COMPUTE queue group, the
AppendQueryKernelTimestamps now lives in the user cl body itself,
signaling user_signal directly. This collapses two Appends per
profiled op (Barrier + shadow QKT) into one, removes the per-Append
shadow_done fence event allocation and its drain-time host-sync, and
drops the per-Execute shadow re-Append loop for regular compute cls.

Detection runs at zeCommandListCreate{,Immediate} via the desc's
commandQueueGroupOrdinal against a generalized per-device queue-group
flag cache. cl_data->is_compute=0 (the shadow path) is the safe
fallback for copy-only cls and any case where the group flags can't
be determined.

Also: add _ZE_MUST() and apply it to every tracer-issued operational
L0 call (Barrier, QKT Append, EventHostSync/Reset, CommandListReset,
QueueSync, pool event reset). These are calls the tracer adds on the
user's behalf where failure means either a user hang (Barrier chain)
or a non-self-consistent trace. Defensive: print + abort so the bug
surfaces under sanitizers/CI rather than ship bad data. Driver query
calls (Get*Handle, GetCommandQueueGroupProperties) keep their graceful
fallbacks — they can fail transiently during teardown.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Top-of-file algorithm sketch now names the two QKT placements and
shows their ASCII shape. The previous claim that "we use the shadow
cl uniformly for all engines so the code path is identical" is gone —
that was true before the inline path and misleading afterwards.

- Algorithm section: split "place a Query" off; spell out per-path
  behavior at Execute (shadow re-Appends; inline already in cl body)
  and at drain (shadow host-syncs shadow_done; inline does not).
- New "QKT placement" section with the two diagrams, anchored from
  cl_data->is_compute.
- struct _ze_slot: sh and shadow_done documented as shadow-path only.
- struct _ze_shadow_cl preamble: re-cast as the copy-only path.
- _universal_record_append: redundant function-level diagram replaced
  with a pointer to the top-of-file section.
- Slab-chunk regular-cl note: mention the QKT is also baked into the
  closed cl body on the inline path (additional reason chunks can't
  grow after Close).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
TApplencourt and others added 16 commits June 9, 2026 21:06
…ent_multithreaded_01

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The inline QKT path (compute user cls) instantiates each slot at
Append time so the QKT can be baked into the cl body. The Execute
hook then re-instantiated the same slot unconditionally, re-running
the in-order pred walk and picking up later-appended live siblings
as predecessors. Two siblings on an in-order cl ended up mutually
referencing each other in preds[], and _slot_drain's recursive
walk infinite-looped, crashing the user with SIGSEGV from stack
overflow.

Gate the Execute-time _slot_instantiate on !slot->live. Inline
slots arrive live=1 from Append and are skipped; shadow-path
slots always arrive live=0 (regular cls aren't instantiated at
Append at all; second+ Execute rounds see slots reset to live=0
by the preceding _cl_drain), so the guard is a no-op for them.

Tests added to thapi_ze_test that fail before / pass after:
  inorder_reg_Event_02 (2 Appends share one user signal event)
  inorder_reg_Event_05 (Event_02 + HostReset between 2 Executes)
  inorder_reg_Event_10 (Event_02 + cl resubmit N=2)
  inorder_reg_Event_11 (3 cls / 3 queues, 2 Appends each)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Defensive cleanup on top of the previous double-instantiate fix. Three
related changes, one commit:

1. Introduce _THAPI_LOG (always-on; THAPI(func:line) prefix; flushes
   stderr so the line lands before abort) and _THAPI_ASSERT (logs +
   aborts unconditionally — not gated on NDEBUG, since silently
   dropping the check would let the bug ship bad data). _ZE_MUST is
   rewritten in terms of _THAPI_ASSERT, and the existing THAPI_DBGLOG
   now routes through _THAPI_LOG when THAPI_DEBUG is set.

2. _slot_instantiate asserts !s->live at entry. Re-instantiating a
   live slot leaks its preds[] and lets the in-order pred walk pick
   up later-appended siblings as predecessors, which infinite-loops
   _slot_drain. The prior fix added a guard at the one known caller;
   the assert turns the rule into an invariant of the function so the
   next caller can't trip the same trap silently.

3. Drop the !slot->live guard around _slot_instantiate in
   _on_execute_one_cl in favor of a same-condition `continue` at the
   top of the per-slot body. Already-live slots have nothing left to
   do this Execute (dep-graph entry still valid; inline-path QKT is
   baked into the cl body and re-fires automatically), so they
   shouldn't be running the shadow_append_query branch either. Cleaner
   and removes the if/assert contradiction at the call site.

All 51 correctness + 1 bench tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…e + dead macros

_slot_publish(cl_data, s, sh) routes on s->shadow_done — the slot data
itself is now the source of truth for "shadow vs inline". The Execute-time
is_compute branch and the per-slot is_compute branch in _on_execute_one_cl
collapse to one data-driven publish. _universal_record_append's shadow-
immediate tail collapses to the same call. The inline-vs-shadow fork at
Append remains: chaining user_signal via the QKT itself avoids an extra
zeCommandListAppendBarrier per kernel.

cached_device + _cl_cache_device go away — the only call sites are on the
shadow path (immediate-copy Append, regular-copy Execute), both cold; the
extra ZE_COMMAND_LIST_GET_DEVICE_HANDLE_PTR call is acceptable for one
fewer field and one fewer cache to reason about. cached_context stays:
load-bearing for _on_destroy_context's per-cl sweep.

Delete _ZE_ERROR_MSG / _ZE_ERROR_MSG_NOTERMINATE / _ERROR_MSG: zero
callers, and _ERROR_MSG was syntactically broken (orphan do/while).

All 52 correctness + bench tests pass.
The map is "event -> the most recent slot that signaled it". The old
"latest" name didn't say what it was the latest OF. The new name does.

Mechanical rename across the global, struct, mutex, three helpers, and
the algorithm/struct/comment shorthand. No semantic change. All 53 tests
pass.
cl_data was being yanked out of the global hash at the top of every
profiled Append and reinserted at every exit, taking _ze_cls_mutex three
times per Append (DEL + ADD on each return path + the cleanup ADD on
fail_with_cl). Replace with a single FIND_ZE_CL: cl_data stays in the
hash, the global mutex is touched once instead of three times.

Safe because L0 forbids racing Append against Destroy on the same cl
handle (both carry the not-thread-safe-per-cl-handle restriction), so
cl_data cannot be torn out by a concurrent _on_destroy_command_list.
cl_data->mtx still serializes us against another Append/Execute on
this same cl.

This was the single biggest contention point in the N-threads-N-CLs
workload (inorder_imm_Event_multithreaded_01 and friends).
…e->epilogue

Previously each profiled Append issued zeCommandListGetContextHandle three
times: once inside the generated prologue's _get_profiling_event, once at
the top of _universal_record_append, and once at the first Execute via
_cl_cache_context. Same handle every time.

Fetch it once in the generated prologue, pass it to _get_profiling_event,
also pass it to _universal_record_append. _universal_record_append now
publishes cached_context up front so the first _on_execute_one_cl hits
the cache too. Net: one driver call per Append instead of three.

The cached_context write is unlocked. Safe: every writer stores the same
value (the cl's true context), and the only reader (_on_destroy_context's
per-cl sweep) is gated on a user contract forbidding concurrent Append +
DestroyContext on the same context.

All 51 correctness tests pass.
…context

Since cached_context is now published at the top of _universal_record_append
(before any slot exists), the L0-fetch fallback inside _cl_cache_context is
provably dead — _on_execute_one_cl's only caller path is gated on
slot->shadow_done, which can only be set by a prior shadow-path Append
that already populated cached_context.

Replace the call with a direct field read and delete the helper. One
fewer indirection, one fewer "what if the cache is empty here?" branch
to think about.

All 51 correctness tests pass.
…de pools mutex

Two changes, both targeting work that was happening with global mutexes held:

1. PUT_ZE_EVENT was a 20+ line macro spanning hash lookup, allocation,
   error-path L0 destroys, HostReset, and prepend. Convert to a function
   (debugger can step in, callers no longer pay the macro-expansion
   readability tax). Rename all 7 call sites (helpers + ze_model.rb
   generator); no shim macro left behind.

2. Move two pieces of work out from under the locks they shouldn't be
   under:
   - _put_ze_event: HostReset is thread-safe — issue it before taking
     _ze_event_pools_mutex instead of serializing an L0 round-trip
     behind a global lock. Pre-allocate the bucket entry outside the
     lock too, freeing the unused copy if we lose the publish race
     (same pattern _qgroup_cache_get already uses).
   - _event_latest_signaled_set: calloc outside the lock for the same
     reason; same lose-the-race-free-our-copy handling.

All 51 correctness tests pass.
FIND_ZE_CL / ADD_ZE_CL / FIND_AND_DEL_ZE_CL  -> _cl_find / _cl_add / _cl_find_and_del
GET_ZE_EVENT                                  -> _get_ze_event
GET_ZE_EVENT_WRAPPER / PUT_ZE_EVENT_WRAPPER  -> _get_ze_event_wrapper / _put_ze_event_wrapper

All six were thin lock-wrapped hash/list ops in macro form, with the
output threaded through an `out` parameter. Convert to ordinary
functions returning the value; debugger steps in, callers don't pay the
macro-expansion readability tax, and the lock-held region is easier to
audit when it's a function body instead of a macro you have to mentally
expand at each call site. No shims left behind — every call site updated.

While here: _get_ze_event_wrapper now allocates the fresh wrapper
outside the wrappers mutex (matches the [pre-alloc | lock | publish |
free-our-copy-on-race] pattern already established by _put_ze_event and
_event_latest_signaled_set), so the freelist mutex never wraps a heap
call.

All 51 correctness tests pass.
The "DL_DELETE the chunk + zeMemFree the slab (sometimes) + free the
chunk struct" body was open-coded in 5 places: _universal_record_append's
fail_locked rollback, _slot_release, _cl_drain, _on_destroy_command_list,
and _on_destroy_context step 1. They differed only in whether to issue
zeMemFree on the slab (skipped in _on_destroy_context because the ctx
is dying and the driver reclaims).

Extract _cl_chunk_free(cl_data, c, free_slab). Cleaner site-by-site,
and the `free_slab=0` call site documents the "ctx-dying skip-zeMemFree"
invariant in one place (the helper's comment) instead of relying on
readers to spot the missing zeMemFree at a single open-coded site.

No behavior change. All 51 correctness tests pass.
_on_destroy_command_list and _on_destroy_context step 1 were ~20 lines
of near-identical code each: walk all chunks, free per-slot events/waits/
preds, free the chunk, then free cl_data itself. They differed only in
two ctx-scoped resource decisions that move in lockstep:

  - event wrappers: recycle to the per-ctx pool (ctx alive) vs
    recycle wrapper struct only (ctx dying, pool is about to die too)
  - chunk slab: zeMemFree (ctx alive) vs skip (ctx dying)

Extract _cl_data_destroy(cl_data, int ctx_dying) that captures both
choices. _on_destroy_command_list passes 0; _on_destroy_context step 1
passes 1. The "why dying differs from alive" reasoning now lives in one
place (the helper's comment) instead of being split across two sites
where future readers had to diff them to spot the invariant.

Caller still owns hash removal (single-cl: _cl_find_and_del; per-ctx:
HASH_DEL inside the iter) since those are genuinely different. Lock
order unchanged.

All 51 correctness tests pass.
The dep-graph caches cross-cl slot pointers in s->preds[], so any drain
may mutate ANY cl's chunks. A per-cl mtx scheme requires cross-cl lock
acquisition with ordering rules. One mutex covering all tracer state
makes that go away — drain freely follows pred pointers; Append on
different cls serializes through the same lock.

Once we had a single mutex for cl_data state, every other per-domain
lock (event freelists, event-pool registry, qgroup cache, shadow-cl
registry, per-shadow-cl L0 Append serializer, latest-signaled map) was
guarding its own little island while every caller already held
_ze_state_mutex. Folded them all in.

Removed:
  cl_data->mtx          (per-cl)
  _ze_cls_mutex         (cl registry)
  _ze_event_wrappers_mutex
  _ze_event_pools_mutex
  _ze_event_latest_signaled_mutex
  _ze_qgroup_cache_mutex
  _ze_shadow_cls_mutex
  sh->mtx               (per shadow cl)

Kept:
  _ze_state_mutex       (the one)
  ze_closures_mutex     (separate domain — ffi closures)

Fixes two TSan-confirmed bugs in the process:

  R3: _on_destroy_context read cl_data->cached_context under
      _ze_cls_mutex while _universal_record_append wrote it without
      any lock. Different mutexes, observable race.

  Bug 2: _slot_drain recursed through s->preds[i] into another cl's
      slots and mutated chunk->n_held / slot->live while
      _cl_slot_append on that cl mutated the same bytes. Reproduced
      by ooo_imm_Event_multithreaded_01 / _04.

All the _cl_*, _get_*, _put_*, _event_latest_signaled_*, _qgroup_cache_get,
_get_shadow_cl, _shadow_append_query, _get_profiling_event helpers run
under the assumption that the caller holds _ze_state_mutex. Entry points
that used to call these without the lock (_on_create_command_list calling
_ordinal_is_compute, _on_destroy_context's three steps, the generated
profiling_prologue calling _get_profiling_event) take it now.

Other simplifications that fall out:
  - _Atomic on cached_context reverts to plain ze_context_handle_t.
  - __atomic_* on slot->refs reverts to plain ++/--.
  - The "allocate outside lock, race-publish inside" pattern in
    _event_latest_signaled_set / _put_ze_event / _get_ze_event_wrapper
    / _get_shadow_cl all collapse to straight HASH/DL ops.
  - sh->live_queries no longer needs its own mtx — manipulated under
    the state mutex like everything else.

Perf trade: Append on different cls and the various freelist accesses
serialize through one mutex. The L0 calls inside the critical section
(zeCommandListAppendBarrier or AppendQueryKernelTimestamps) are short —
the GPU just queues, doesn't execute. Worth it for the dramatic
simplification: one mutex, zero lock-ordering rules, zero atomics, zero
cross-cl acquisitions.

Net -81 lines on the helpers file. 10/10 multithreaded tests under TSan
report 0 races. 51/51 correctness tests pass; the
imm_Event_multithreaded_01 case that exposed an unlocked-prologue
regression mid-development passes 5/5 on loop.
Add a "Concurrency" section to the algorithm header that explains why a
single global mutex covers all tracer state (cross-cl pred edges in the
dep graph would force any per-cl scheme into multi-mutex acquisition
with ordering rules; one global mutex sidesteps that entirely; the perf
trade is bounded because the held region is just short L0 queue ops).

With that in place, scrub per-function "Caller holds _ze_state_mutex"
boilerplate that the section comment now covers, fix the one stale
"lock cl.mtx" line in the algorithm pseudocode, and tighten the longer
docstrings on _cl_data_destroy / _on_destroy_context / _on_execute_one_cl
/ _universal_record_append / shadow-cl helpers.

Pure comments diff: -44 lines, no code change. 52/52 correctness tests
still pass.
zeCommandListAppendSignalEvent's payload field is `hEvent`, not
`hSignalEvent`, so it never qualifies for the hSignalEvent_* matching
sets in btx_zematching_model.yaml and hSignalEvent_rest_entry_callback
never fires for it. threadToLastLaunchInfo therefore retains whatever
prior Append last populated it (typically MemoryFill / Kernel / Memcpy
on the same thread). The next event_profiling — which IS emitted for
AppendSignalEvent — pushes a ring entry tagged with that stale
commandName, and the downstream device tally counts the signal's
profiling result as one more MemoryFill/Kernel/etc.

Symptom in the wild: a trace with 12000 host AppendMemoryFill calls
reports 16002 device MemoryFill(D) — the extra 4002 are AppendSignalEvent
results mis-attributed to the prior fill on each of two reused events.

Add a dedicated zeCommandListAppendSignalEvent_entry_callback that
refreshes threadToLastLaunchInfo with the real command name and a new
btx_event_t::SIGNAL tag, then short-circuit event_profiling_result_callback
on SIGNAL so no device-tally record is emitted (AppendSignalEvent is a
host-side signal and does no GPU work to time). The ring entry itself is
still pushed and the cursor still advances, so subsequent profiling
results for the same event handle land on the correct slot.

Verified against the full bats correctness suite (53/53) and a new
reproducer in thapi_ze_test (inorder_imm_Event_08).
The four shared_event / resubmit BTX tests are synthetic inputs fed
straight into the interval filter — they don't include Create/Execute
because the test_wrapper converter can't synthesize struct values, and
that omission makes the trace read as physically impossible (4 results
from 2 Appends with no Execute in between). Add a short header to each
explaining what ring/cursor branch it exercises.

interval_profiling_fast assumed event_profiling_results could arrive
inside the Append_exit window. With the new tracer, results are only
emitted at sync time, so that interleaving is no longer reachable —
delete the test and drop it from TRACE_COMMON.

To allow header comments in the .thapi_text_pretty files, teach
thapi_log_to_bt_source_component.rb to skip blank/# lines.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant