Tracer ze universal by TApplencourt · Pull Request #505 · argonne-lcf/THAPI

TApplencourt · 2026-06-12T21:51:41Z

Suppor:

re-submiting event
command list executed multiple time
Ze barrier event

What a nightmare but it's working.

Will push my test another PR. Need to clean-it up first.

When the env var is set, _get_profiling_event's successful ZE_EVENT_CREATE_PTR calls increment a global counter; _lib_cleanup prints the total to stderr ('THAPI: injected events: N') at process exit. Useful for the bats infra to lock in baseline per-test injection counts: any change to the v2 path (e.g. the upcoming lazy fallback for user-event reuse) that inflates injection beyond the baseline fails its regression check. Both the counter increment and the stderr print are gated by the env var read at init time so the steady-state hot path pays nothing.

A single ze_event_handle_t can be the signal event of multiple Appends on the same cmdlist — the tracer's lazy capture inserts a per-Append Query for each occurrence, producing one event_profiling_results tracepoint per use. Previously eventToBtxDesct[hEvent] held a single tuple, so the second event_profiling overwrote the first, and all subsequent results attributed to the last kernel paired with that handle (typical symptom: 4 Appends signaling one event tallied as 4 calls of the last kernel, 0 calls of the other three). Switch the map's value type to std::deque<btx_event_desct_t>: - event_profiling pushes_back metadata for that hEvent. - event_profiling_result pops_front, attributes to that metadata, then push_backs the same entry. Rotation keeps the deque shape stable across cmdlist resubmits — N Appends produce N deque entries at build time; each Execute generates N results that cycle through those N entries in FIFO order. - zeCommandQueueExecuteCommandLists's per-cl walk updates every entry in the deque (not just the first). New regression test interval_profiling_shared_event covers the bug directly: 4 Barriers signaling the same hEvent, 4 results — expected output has 4 distinct device-side ts (post-fix); pre-fix would have collapsed all 4 to the last entry's ts.

The per-hEvent FIFO deque previously rotated (pop_front + push_back) on every event_profiling_result, intended to let a single set of build-time pushes serve M*N results across M cmdlist resubmits. But for the common case (1 push, 1 pop), rotation leaves stale entries in the deque, so the next push reads the wrong metadata. Symptom: 3 distinct kernel runs attributed as (busy_a=2, busy_b=1) instead of (busy_a=1, busy_b=1, busy_c=1). Switch to pop-and-discard: each Append pushes once, each result pops once. Resubmit-without-rebuild of the same cmdlist is no longer supported here; it is a deferred case on the tracer side as well. Also add zeKernelDestroy_entry to erase kernelToDesct[hKernel]. Drivers recycle freed kernel handle addresses; without the erase, the old entry silently survives until overwritten by the next Create. Defensive even though the rotation fix is what unblocks the busy-timing reproducer.

Every profiled Append (Launch*, Memcpy*, Barrier, SignalEvent, ...) is rewritten to: inj_ev = inject_event(cl) // fresh wrapper from per-context pool user_signal = hSignalEvent // saved (may be NULL) <user Append, with hSignalEvent swapped to inj_ev> AppendQueryKernelTimestamps(cl, 1, &inj_ev, slab, &slot_off, hSignalEvent=user_signal, waitEvents=1 &inj_ev) The Query waits on inj_ev (so it runs after the user op completes) and signals the user's original event (so user wait semantics are preserved). Per-cl slot list records (inj_ev, attribution_event, slab_offset). On sync (queue / event / fence / cl-host) the slabs of every tracked cl are drained — each slot emits one event_profiling_results to the attribution event (user's, or inj_ev if user passed NULL). Identical for in-order / OOO and regular / immediate cmdlists: 1 injected event per Append, 1 extra API call per Append. No host-side KT read; no per-event-handle bookkeeping. Cross-cmdlist event reuse, mid-chain re-signal, and shared signal events all work because each Append has its own slab slot independent of which user event it attributes to. Resubmit of the same regular cmdlist without rebuild between Executes is deferred: the slab slots get overwritten by the second Execute. Reset followed by re-Append works (slot list resets on Reset). This commit: - adds the slab pool + slot list to _ze_command_list_obj_data - adds _universal_record_append, _cl_drain, _on_sync_drain_{cl,all} - swaps the model profiling_prologue/epilogue to the universal path - wires drain hooks to sync APIs (queue/event/fence/cl-host) - removes the legacy per-Append injected-event-attached-to-cl path: _on_destroy_event, _on_reset_event, _unregister_ze_event, _dump_and_reset_our_event, _profile_event_results, _event_cleanup, _on_execute_command_lists, cl_data->events, _ZE_EXECUTED, _ZE_IMMEDIATE_CMD, _ze_events hash, FIND/ADD_ZE_EVENT macros - drops zeEventDestroy/zeEventHostReset/zeCommandQueueExecuteCommandLists model hooks (no consumers left) The injected wrapper still goes through the per-context PUT_ZE_EVENT free pool so the underlying ze_event_handle is reused — only the slot-bookkeeping is freshly allocated per Append. 35/46 reproducer tests pass (was 24/46 with the prior v2 partial path). The 11 remaining failures are all in the deferred resubmit-without- rebuild class.

Replaces the previous "drain everything + status-gate" approach with explicit happens-before tracking. Sync points walk the dependency graph from the synced anchor and drain only the slots reachable. Per-slot bookkeeping gains: waits[] user wait events copied at Append, stable across Executes preds[] pointers to predecessor slots, computed at instantiate live per-run: true between instantiate and drain Per-cl bookkeeping gains: in_flight_q queue of the most recent un-drained Execute (NULL if none) mtx serializes Execute prologue (force-sync + instantiate) is_immediate, is_in_order Global: _ze_latest: hash mapping event_handle -> most recent slot whose attr matches (used to resolve happens-before edges) Algorithm: Append: record slot, copy user waits, for immediate cls instantiate inline Execute: if in_flight_q set, force-sync that queue + drain_cl, then instantiate every slot for the new run, stamp in_flight_q Sync(ev): drain latest[ev] (recurses on preds) Sync(q): drain every cl whose in_flight_q == q Sync(cl): drain that cl Drain reads the slab slot, emits event_profiling_results attributed to the user's signal (or to our injected event if user passed NULL), clears live + preds, removes from latest[] if still the head. Build- time fields (inj, attr, off, waits) stay so the next Execute re-instantiates without re-Appending. zeCommandListReset/Destroy and zeContextDestroy hooks dropped: the user must Sync before any of those, which means our slots are already drained. zeFenceHostSynchronize hook deferred. Removed status-gating, HostReset(inj) (driver does not un-signal KERNEL_TIMESTAMP events anyway — see tests/bugs/host_reset_kts_event), slot survival across drain (replaced by per-run live + dep walk). On the matrix: 28/40 pass, 12 fail. The 12 are all TALLY MISMATCH: tracer emits the right number of event_profiling_results (verified via the new bats helper that runs iprof -t and iprof -j separately), but the tally collapses duplicates with the same hEvent.

A single hEvent can be the signal event of N Appends in one cl build phase, and the cl can be re-Executed M times, producing M*N result events. Previous attempts (single-entry overwrite, FIFO deque + rotate, FIFO + pop_front) each handled some cases and broke others. The model that fits all four scenarios in the matrix is a ring with a cursor: push: if cursor > 0 (we consumed results since the last push), the prior ring belongs to a finished build phase — clear and reset cursor. Then append the new metadata. result: read entries[cursor], advance cursor, wrap to 0 on overflow. The wrap handles resubmit (1 push, N results re-read the same ring). The phase-reset handles a rebuild after results consumed (new push implies the old entries are stale). New tally tests added in this directory: interval_profiling_resubmit_event 1 push, 2 results interval_profiling_shared_event_resubmit 2 pushes, 4 results interval_profiling_shared_event_xphase 2 pushes, 4 results, 1 push, 3 results Existing interval_profiling_shared_event (4 pushes, 4 results) still passes. End-to-end matrix on PVC: 41/41 pass.

Three new tally-side regression tests (raw tracepoint stream input, expected output diff'd at make-check time): interval_profiling_resubmit_event single Append re-fired by a closed cl on two Executes interval_profiling_shared_event_resubmit two Appends share one event, cl re-executed interval_profiling_shared_event_xphase shared event used in two distinct build phases (the cursor's new-phase reset path) All three exercise the per-event cursor ring and would have failed under the prior single-entry-overwrite or pop_front tallies.

Profiled Appends on a copy-only user cl previously triggered a driver abort because AppendQueryKernelTimestamps is rejected on copy engines. Move the Query op to a per-(context, device) tracer-owned immediate ASYNC compute cl (lazy-created, first compute queue group). The user cl body now just signals an injected event; the shadow cl's Query waits on it and writes the timestamps to the slab. To preserve the user's signal contract, AppendBarrier on the user cl chains user_signal off inj. To drain correctly, each slot owns its own shadow_done event the Query signals, and drain host-syncs on it before reading the slab. Execute-side bookkeeping (force-sync-prior + Append-Query + instantiate + claim in_flight_q) all moves to the Execute *epilogue* under cl_data->mtx as a single critical section. Two reasons: * Append-Query on the shadow cl before the user cl is in flight deadlocks when both share the engine (only one compute group on this hardware) — the pending shadow Query holds the engine and the user cl can never dispatch. * Concurrent Executes on the same cl from two threads need the force-sync-prior to observe a sibling's claim-in_flight_q atomically (regression test: inorder_reg_Event_11).

Slots stored raw pointers to each other in preds[] and were referenced from the global latest[ev] -> slot map. The slot array was a realloc'd buffer starting at cap 8 and doubling; on any realloc that moved the buffer, every stored slot pointer became dangling, and the dep-graph walks at drain time silently skipped any slot allocated before the last realloc. In practice this lost timing records non-uniformly: a sync that walked preds back through the chain hit a dangling pointer and the recursion ended early (or hit garbage that happened to look not-live). Standalone reproducer in tests/bugs/missing_drain_dag/repro.c: 3 waves of 12 Appends each. Before this fix, drain emits 20/36 deterministically (slots 8..11 of each wave). After: 36/36. Fix is to allocate the slot array once at first use, sized to the slab cap (64). The slab itself is already hard-capped at 64, so allowing slot growth past it gained nothing.

_slot_drain used to recurse with the caller's cl_data, but pred slots can live in another cl when the user signal chain crosses cls (cl_A signal=e1, cl_B wait=e1 signal=e2, sync(e2) only). Recursing with caller's cl_data read cl_B->slab + slot_A->off and emitted slot_A's tracepoint with cl_B's timestamps. Each slot now carries an owner back-pointer set at _cl_slot_append. _slot_drain drops its cl_data argument and reads from s->owner->slab, so cross-cl preds resolve to the right slab. _on_sync_drain_event collapses from the cl-walk to a direct s->owner->mtx lock. Reproducer: tests/matrix/inorder_reg_Event_15 (busy_b/busy_a>=3 ratio); fails before this commit, passes after. Full matrix 43/43.

Code review flagged the unguarded recursion. In practice cycles are impossible: in-order preds index strictly downward, and latest[] preds always point to slots published earlier. Forming a cycle requires the user to declare two Appends each waiting on the other's signal event, which deadlocks L0 itself before we ever drain. The live-clear-before-recurse would also stop a cycle if one appeared, so we are belt-and-suspenders by accident. Comment only, no code change.

The previous commit said live-clear-before-recurse would stop a cycle. live is cleared AFTER the recursion (see below), not before. The construction argument alone is the reason no guard is needed; remove the misleading sentence.

Two single-use helpers (_cl_slab_ensure, _cl_slots_grow) are inlined into their only caller. The cap-check and the two lazy allocations read better adjacent than dispatched through helpers. Side-effect fix: previous code bumped n_slots BEFORE attempting the slab alloc, so a failing first alloc left a permanent hole at index 0 (n_slots=1 but no slab). New order: cap-check, slots-alloc, slab-alloc, slot fill-in, THEN n_slots++. An OOM at any allocation returns NULL with cl_data unchanged. Rollbacks in _universal_record_append still apply: they undo a fully successful append after a downstream Barrier/shadow Append fails, which is correct under either ordering. Net -10 lines. Full matrix 43/43.

The helper had two callers (immediate-Append shadow lookup, Execute epilogue shadow lookup) and just lazy-cached the device handle. Inlined both with the same lazy-cache idiom; the helper and forward-declared cached_device-read disappear. The two call sites pass different command_list values to the L0 call: the Append epilogue uses its local `command_list` parameter, the Execute epilogue uses `phCommandLists[i]`. The old helper read cl_data->ptr (set at create time, same handle), so behavior matches. Full matrix 43/43.

Four near-identical early-return blocks shared a cleanup pattern that varied only in how much state had been claimed (event(s), cl from hash, slot). Two labels capture the two boundaries: fail_locked: a slot was allocated AND/OR the cl hash entry was claimed AND mtx is held. Rollback slot if present, unlock, ADD_ZE_CL, fall through. fail: events still owned, no cl claim. PUT them back. Net -12 lines. Full matrix 43/43.

C99 forbids an empty __VA_ARGS__ which forced the no-args fork. GCC's `, ##__VA_ARGS__` extension swallows the dangling comma — already used elsewhere in THAPI (utils/tracepoint_gen.rb). Verified clean compile with -DTHAPI_DEBUG (which actually exercises the variadic body) in addition to the default no-op branch. Full matrix 43/43.

Added in d5603e4 as a regression guard for a planned bats assertion that never landed. No test, internal code, or external doc references it. Easy to reintroduce on demand. Full matrix 43/43.

cap_slots was set once in _cl_slot_append to _ZE_SLAB_SLOTS_INITIAL and never read. The old _on_sync_drain_event walked the cls array using `s >= slots && s < slots + cap_slots` to find a slot's owner, but that walk went away when slots gained an owner back-pointer (cbb928b). Field has been dead since. Full matrix 43/43.

After THAPI_REPORT_INJECTED_EVENTS went away (322402d), _lib_cleanup had an empty body. Its registration plumbing (_do_cleanup, THAPI_ATTRIBUTE_DESTRUCTOR, the atexit call, the THAPI_USE_DESTRUCTORS ifdef pair) was all in service of running it. The CUDA backend still has a non-empty _lib_cleanup, so its copy of the same plumbing stays. Full matrix 43/43.

The lazy cached_device / cached_context resolve dance appeared 4 times (2 callsites x 2 fields) with subtle variations. Two tiny helpers collapse all four. Walks back commit 83191b9's inline-_cl_get_device with a better factoring that handles both fields together. The Execute epilogue's body was one 30-line for-loop with 5 distinct phases per cl. Extracting _on_execute_one_cl makes the loop a one-liner and lets the per-cl prose live with the code that implements it. Matches the granularity of _on_sync_drain_cl / _on_sync_drain_queue. Full matrix 43/43.

Previously _universal_record_append wired user_signal -> inj via AppendBarrier AFTER reserving a slot. If slot reservation failed (e.g. _ZE_SLAB_SLOTS_INITIAL=64 cap hit by the 65th Append on an immediate cl), the function jumped to fail_locked and skipped the barrier — meaning user_signal was never signaled by anything on the cl, and the user's Sync(user_signal) hung forever. Reorder so the barrier Append runs first. We may still lose the timing tracepoint when the slot can't be reserved, but the user's event still fires and Sync no longer hangs. Reproducer: tests/matrix/ooo_imm_Event_06. Previously: timed out after 20s, 0 tracepoints. Now: 64 tracepoints out of expected 65 (cap still loses one slot; the actual cap removal is the next commit's job).

Slots gain a `refs` field: the count of downstream slots whose preds[] points here AND that have not yet been drained. _slot_instantiate atomically increments each pred's refs; _slot_drain decrements them at cleanup. When a slot reaches live==0 && refs==0 it is reclaimable. Reclaim is implemented for immediate cls only. Regular cls bake inj into the cl body — recycling inj would corrupt the next Execute round, so their events are reclaimed only at cl destroy (Phase-3). For immediate slots, _slot_release PUTs inj and shadow_done back to the per-context pool and frees waits. Slot storage itself stays for now; Phase-3 will free it along with chunked slot/slab storage that lifts the 64-Append cap for imm cls. Full matrix 43/43 (Event_06 still fails on cap until Phase-3).

Imm cls used to deadlock the user program at 65 Appends in one phase (_cl_slot_append returned NULL above _ZE_SLAB_SLOTS_INITIAL=64 and _universal_record_append's fail path stopped chaining user_signal off inj). The previous commit unblocked the chain; this commit lifts the cap altogether for imm cls and reclaims slot/event storage as ops complete. Storage: cl_data->slots/slab pair is replaced by a utlist DL chain of _ze_slab_chunk{}, each holding 64 slots and a 64-ts slab. Imm cls grow chunk-by-chunk on demand; regular cls stop at one chunk (their inj events are baked into the closed cl body, so adding a chunk post-Close would create slots the body doesn't address). Lifetime: slots gain a `refs` atomic counter, incremented at _slot_instantiate per pred edge, decremented at _slot_drain. A slot is reclaimable iff live==0 AND refs==0. _slot_release PUTs inj + shadow_done back to the per-context pool, frees waits, and decrements its chunk's n_held. A chunk frees itself when n_held reaches 0 AND it isn't the active tail. Regular-cl slots are not released at drain — their events come back at cl destroy. Destroy: zeCommandListDestroy hook walks chunks, PUTs events back, frees chunks + cl_data. Re-added because the chunked layout makes the leak-on-destroy bound by per-cl lifetime instead of program lifetime. Memory shape: O(in-flight slots) at any time, matching the typical "one long-lived imm cl with many syncs" use case. Full matrix 45/45.

The tracer was leaving its own L0 objects (event pools/events used as the recycle pool, shadow command lists keyed by (ctx, device)) alive inside the user's context. When the user app destroyed the context before destroying a command list that had our injected events baked into it (the Intel libomptarget L0 plugin does this at _dl_fini), the following zeCommandListDestroy segfaulted inside libze_intel_gpu.so. Add a zeContextDestroy prologue that, while the context is still valid, walks _ze_cls / _ze_shadow_cls / _ze_event_pools and frees everything we own keyed to the dying context.

_slot_drain on the last slot of a chunk calls _slot_release, which decrements n_held to 0 and frees the chunk. The next iteration of the inner for loop then reads c->n_used and accesses &c->slots[i] on freed memory. Pin the chunk with an extra n_held bump during the inner loop. Drop the ref after the loop and free the chunk in _cl_drain itself if it was the last holder. The chunk-free condition (n_held == 0 && c != tail) is unchanged. Found by ASAN running ooo_imm_Event_07 (cross-cl chunk-mutation stress). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The L0 driver records each zeCommandListAppendQueryKernelTimestamps in cl-internal storage (~10 KB per call) and only releases it on zeCommandListReset/Destroy — never lazily, even after the device has completed the op. The shadow cl is per-(context, device) and lives for the context's lifetime, so an app that profiles many Appends grows unbounded driver memory. Track live_queries (under sh->mtx): bumped in _shadow_append_query, decremented in _slot_drain after the shadow_done host-sync. When the counter hits 0 inside _slot_drain, the shadow cl is idle from our perspective and we Reset it to reclaim the driver state. Cross-cl case is handled: with cls A and B each holding a Query in flight, A.sync() decrements 2→1 (no Reset), B.sync() decrements 1→0 (Reset). The slot stores its shadow cl so drain knows where to decrement; same sh* is set in both Append paths (immediate cl and Execute prologue for regular cls). Drive-by: split two existing `if (--x == 0 && ...)` sites in _slot_release and _cl_drain into two statements for consistent style. Validated with mem_const_steady (added in tests repo): heap delta drops from 1.93 MB/iter to 32 B/iter over 100 iters of (create cl + 200 Appends + sync + destroy cl). ASAN sweep across 46 correctness tests: clean (no UAF/OOB/leaks attributable to the tracer). Full correctness matrix: 46/46 pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

When the user's command list is on a COMPUTE queue group, the AppendQueryKernelTimestamps now lives in the user cl body itself, signaling user_signal directly. This collapses two Appends per profiled op (Barrier + shadow QKT) into one, removes the per-Append shadow_done fence event allocation and its drain-time host-sync, and drops the per-Execute shadow re-Append loop for regular compute cls. Detection runs at zeCommandListCreate{,Immediate} via the desc's commandQueueGroupOrdinal against a generalized per-device queue-group flag cache. cl_data->is_compute=0 (the shadow path) is the safe fallback for copy-only cls and any case where the group flags can't be determined. Also: add _ZE_MUST() and apply it to every tracer-issued operational L0 call (Barrier, QKT Append, EventHostSync/Reset, CommandListReset, QueueSync, pool event reset). These are calls the tracer adds on the user's behalf where failure means either a user hang (Barrier chain) or a non-self-consistent trace. Defensive: print + abort so the bug surfaces under sanitizers/CI rather than ship bad data. Driver query calls (Get*Handle, GetCommandQueueGroupProperties) keep their graceful fallbacks — they can fail transiently during teardown. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Top-of-file algorithm sketch now names the two QKT placements and shows their ASCII shape. The previous claim that "we use the shadow cl uniformly for all engines so the code path is identical" is gone — that was true before the inline path and misleading afterwards. - Algorithm section: split "place a Query" off; spell out per-path behavior at Execute (shadow re-Appends; inline already in cl body) and at drain (shadow host-syncs shadow_done; inline does not). - New "QKT placement" section with the two diagrams, anchored from cl_data->is_compute. - struct _ze_slot: sh and shadow_done documented as shadow-path only. - struct _ze_shadow_cl preamble: re-cast as the copy-only path. - _universal_record_append: redundant function-level diagram replaced with a pointer to the top-of-file section. - Slab-chunk regular-cl note: mention the QKT is also baked into the closed cl body on the inline path (additional reason chunks can't grow after Close). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ent_multithreaded_01 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The inline QKT path (compute user cls) instantiates each slot at Append time so the QKT can be baked into the cl body. The Execute hook then re-instantiated the same slot unconditionally, re-running the in-order pred walk and picking up later-appended live siblings as predecessors. Two siblings on an in-order cl ended up mutually referencing each other in preds[], and _slot_drain's recursive walk infinite-looped, crashing the user with SIGSEGV from stack overflow. Gate the Execute-time _slot_instantiate on !slot->live. Inline slots arrive live=1 from Append and are skipped; shadow-path slots always arrive live=0 (regular cls aren't instantiated at Append at all; second+ Execute rounds see slots reset to live=0 by the preceding _cl_drain), so the guard is a no-op for them. Tests added to thapi_ze_test that fail before / pass after: inorder_reg_Event_02 (2 Appends share one user signal event) inorder_reg_Event_05 (Event_02 + HostReset between 2 Executes) inorder_reg_Event_10 (Event_02 + cl resubmit N=2) inorder_reg_Event_11 (3 cls / 3 queues, 2 Appends each) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Defensive cleanup on top of the previous double-instantiate fix. Three related changes, one commit: 1. Introduce _THAPI_LOG (always-on; THAPI(func:line) prefix; flushes stderr so the line lands before abort) and _THAPI_ASSERT (logs + aborts unconditionally — not gated on NDEBUG, since silently dropping the check would let the bug ship bad data). _ZE_MUST is rewritten in terms of _THAPI_ASSERT, and the existing THAPI_DBGLOG now routes through _THAPI_LOG when THAPI_DEBUG is set. 2. _slot_instantiate asserts !s->live at entry. Re-instantiating a live slot leaks its preds[] and lets the in-order pred walk pick up later-appended siblings as predecessors, which infinite-loops _slot_drain. The prior fix added a guard at the one known caller; the assert turns the rule into an invariant of the function so the next caller can't trip the same trap silently. 3. Drop the !slot->live guard around _slot_instantiate in _on_execute_one_cl in favor of a same-condition `continue` at the top of the per-slot body. Already-live slots have nothing left to do this Execute (dep-graph entry still valid; inline-path QKT is baked into the cl body and re-fires automatically), so they shouldn't be running the shadow_append_query branch either. Cleaner and removes the if/assert contradiction at the call site. All 51 correctness + 1 bench tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…e + dead macros _slot_publish(cl_data, s, sh) routes on s->shadow_done — the slot data itself is now the source of truth for "shadow vs inline". The Execute-time is_compute branch and the per-slot is_compute branch in _on_execute_one_cl collapse to one data-driven publish. _universal_record_append's shadow- immediate tail collapses to the same call. The inline-vs-shadow fork at Append remains: chaining user_signal via the QKT itself avoids an extra zeCommandListAppendBarrier per kernel. cached_device + _cl_cache_device go away — the only call sites are on the shadow path (immediate-copy Append, regular-copy Execute), both cold; the extra ZE_COMMAND_LIST_GET_DEVICE_HANDLE_PTR call is acceptable for one fewer field and one fewer cache to reason about. cached_context stays: load-bearing for _on_destroy_context's per-cl sweep. Delete _ZE_ERROR_MSG / _ZE_ERROR_MSG_NOTERMINATE / _ERROR_MSG: zero callers, and _ERROR_MSG was syntactically broken (orphan do/while). All 52 correctness + bench tests pass.

The map is "event -> the most recent slot that signaled it". The old "latest" name didn't say what it was the latest OF. The new name does. Mechanical rename across the global, struct, mutex, three helpers, and the algorithm/struct/comment shorthand. No semantic change. All 53 tests pass.

cl_data was being yanked out of the global hash at the top of every profiled Append and reinserted at every exit, taking _ze_cls_mutex three times per Append (DEL + ADD on each return path + the cleanup ADD on fail_with_cl). Replace with a single FIND_ZE_CL: cl_data stays in the hash, the global mutex is touched once instead of three times. Safe because L0 forbids racing Append against Destroy on the same cl handle (both carry the not-thread-safe-per-cl-handle restriction), so cl_data cannot be torn out by a concurrent _on_destroy_command_list. cl_data->mtx still serializes us against another Append/Execute on this same cl. This was the single biggest contention point in the N-threads-N-CLs workload (inorder_imm_Event_multithreaded_01 and friends).

…e->epilogue Previously each profiled Append issued zeCommandListGetContextHandle three times: once inside the generated prologue's _get_profiling_event, once at the top of _universal_record_append, and once at the first Execute via _cl_cache_context. Same handle every time. Fetch it once in the generated prologue, pass it to _get_profiling_event, also pass it to _universal_record_append. _universal_record_append now publishes cached_context up front so the first _on_execute_one_cl hits the cache too. Net: one driver call per Append instead of three. The cached_context write is unlocked. Safe: every writer stores the same value (the cl's true context), and the only reader (_on_destroy_context's per-cl sweep) is gated on a user contract forbidding concurrent Append + DestroyContext on the same context. All 51 correctness tests pass.

…context Since cached_context is now published at the top of _universal_record_append (before any slot exists), the L0-fetch fallback inside _cl_cache_context is provably dead — _on_execute_one_cl's only caller path is gated on slot->shadow_done, which can only be set by a prior shadow-path Append that already populated cached_context. Replace the call with a direct field read and delete the helper. One fewer indirection, one fewer "what if the cache is empty here?" branch to think about. All 51 correctness tests pass.

…de pools mutex Two changes, both targeting work that was happening with global mutexes held: 1. PUT_ZE_EVENT was a 20+ line macro spanning hash lookup, allocation, error-path L0 destroys, HostReset, and prepend. Convert to a function (debugger can step in, callers no longer pay the macro-expansion readability tax). Rename all 7 call sites (helpers + ze_model.rb generator); no shim macro left behind. 2. Move two pieces of work out from under the locks they shouldn't be under: - _put_ze_event: HostReset is thread-safe — issue it before taking _ze_event_pools_mutex instead of serializing an L0 round-trip behind a global lock. Pre-allocate the bucket entry outside the lock too, freeing the unused copy if we lose the publish race (same pattern _qgroup_cache_get already uses). - _event_latest_signaled_set: calloc outside the lock for the same reason; same lose-the-race-free-our-copy handling. All 51 correctness tests pass.

FIND_ZE_CL / ADD_ZE_CL / FIND_AND_DEL_ZE_CL -> _cl_find / _cl_add / _cl_find_and_del GET_ZE_EVENT -> _get_ze_event GET_ZE_EVENT_WRAPPER / PUT_ZE_EVENT_WRAPPER -> _get_ze_event_wrapper / _put_ze_event_wrapper All six were thin lock-wrapped hash/list ops in macro form, with the output threaded through an `out` parameter. Convert to ordinary functions returning the value; debugger steps in, callers don't pay the macro-expansion readability tax, and the lock-held region is easier to audit when it's a function body instead of a macro you have to mentally expand at each call site. No shims left behind — every call site updated. While here: _get_ze_event_wrapper now allocates the fresh wrapper outside the wrappers mutex (matches the [pre-alloc | lock | publish | free-our-copy-on-race] pattern already established by _put_ze_event and _event_latest_signaled_set), so the freelist mutex never wraps a heap call. All 51 correctness tests pass.

The "DL_DELETE the chunk + zeMemFree the slab (sometimes) + free the chunk struct" body was open-coded in 5 places: _universal_record_append's fail_locked rollback, _slot_release, _cl_drain, _on_destroy_command_list, and _on_destroy_context step 1. They differed only in whether to issue zeMemFree on the slab (skipped in _on_destroy_context because the ctx is dying and the driver reclaims). Extract _cl_chunk_free(cl_data, c, free_slab). Cleaner site-by-site, and the `free_slab=0` call site documents the "ctx-dying skip-zeMemFree" invariant in one place (the helper's comment) instead of relying on readers to spot the missing zeMemFree at a single open-coded site. No behavior change. All 51 correctness tests pass.

_on_destroy_command_list and _on_destroy_context step 1 were ~20 lines of near-identical code each: walk all chunks, free per-slot events/waits/ preds, free the chunk, then free cl_data itself. They differed only in two ctx-scoped resource decisions that move in lockstep: - event wrappers: recycle to the per-ctx pool (ctx alive) vs recycle wrapper struct only (ctx dying, pool is about to die too) - chunk slab: zeMemFree (ctx alive) vs skip (ctx dying) Extract _cl_data_destroy(cl_data, int ctx_dying) that captures both choices. _on_destroy_command_list passes 0; _on_destroy_context step 1 passes 1. The "why dying differs from alive" reasoning now lives in one place (the helper's comment) instead of being split across two sites where future readers had to diff them to spot the invariant. Caller still owns hash removal (single-cl: _cl_find_and_del; per-ctx: HASH_DEL inside the iter) since those are genuinely different. Lock order unchanged. All 51 correctness tests pass.

The dep-graph caches cross-cl slot pointers in s->preds[], so any drain may mutate ANY cl's chunks. A per-cl mtx scheme requires cross-cl lock acquisition with ordering rules. One mutex covering all tracer state makes that go away — drain freely follows pred pointers; Append on different cls serializes through the same lock. Once we had a single mutex for cl_data state, every other per-domain lock (event freelists, event-pool registry, qgroup cache, shadow-cl registry, per-shadow-cl L0 Append serializer, latest-signaled map) was guarding its own little island while every caller already held _ze_state_mutex. Folded them all in. Removed: cl_data->mtx (per-cl) _ze_cls_mutex (cl registry) _ze_event_wrappers_mutex _ze_event_pools_mutex _ze_event_latest_signaled_mutex _ze_qgroup_cache_mutex _ze_shadow_cls_mutex sh->mtx (per shadow cl) Kept: _ze_state_mutex (the one) ze_closures_mutex (separate domain — ffi closures) Fixes two TSan-confirmed bugs in the process: R3: _on_destroy_context read cl_data->cached_context under _ze_cls_mutex while _universal_record_append wrote it without any lock. Different mutexes, observable race. Bug 2: _slot_drain recursed through s->preds[i] into another cl's slots and mutated chunk->n_held / slot->live while _cl_slot_append on that cl mutated the same bytes. Reproduced by ooo_imm_Event_multithreaded_01 / _04. All the _cl_*, _get_*, _put_*, _event_latest_signaled_*, _qgroup_cache_get, _get_shadow_cl, _shadow_append_query, _get_profiling_event helpers run under the assumption that the caller holds _ze_state_mutex. Entry points that used to call these without the lock (_on_create_command_list calling _ordinal_is_compute, _on_destroy_context's three steps, the generated profiling_prologue calling _get_profiling_event) take it now. Other simplifications that fall out: - _Atomic on cached_context reverts to plain ze_context_handle_t. - __atomic_* on slot->refs reverts to plain ++/--. - The "allocate outside lock, race-publish inside" pattern in _event_latest_signaled_set / _put_ze_event / _get_ze_event_wrapper / _get_shadow_cl all collapse to straight HASH/DL ops. - sh->live_queries no longer needs its own mtx — manipulated under the state mutex like everything else. Perf trade: Append on different cls and the various freelist accesses serialize through one mutex. The L0 calls inside the critical section (zeCommandListAppendBarrier or AppendQueryKernelTimestamps) are short — the GPU just queues, doesn't execute. Worth it for the dramatic simplification: one mutex, zero lock-ordering rules, zero atomics, zero cross-cl acquisitions. Net -81 lines on the helpers file. 10/10 multithreaded tests under TSan report 0 races. 51/51 correctness tests pass; the imm_Event_multithreaded_01 case that exposed an unlocked-prologue regression mid-development passes 5/5 on loop.

Add a "Concurrency" section to the algorithm header that explains why a single global mutex covers all tracer state (cross-cl pred edges in the dep graph would force any per-cl scheme into multi-mutex acquisition with ordering rules; one global mutex sidesteps that entirely; the perf trade is bounded because the held region is just short L0 queue ops). With that in place, scrub per-function "Caller holds _ze_state_mutex" boilerplate that the section comment now covers, fix the one stale "lock cl.mtx" line in the algorithm pseudocode, and tighten the longer docstrings on _cl_data_destroy / _on_destroy_context / _on_execute_one_cl / _universal_record_append / shadow-cl helpers. Pure comments diff: -44 lines, no code change. 52/52 correctness tests still pass.

zeCommandListAppendSignalEvent's payload field is `hEvent`, not `hSignalEvent`, so it never qualifies for the hSignalEvent_* matching sets in btx_zematching_model.yaml and hSignalEvent_rest_entry_callback never fires for it. threadToLastLaunchInfo therefore retains whatever prior Append last populated it (typically MemoryFill / Kernel / Memcpy on the same thread). The next event_profiling — which IS emitted for AppendSignalEvent — pushes a ring entry tagged with that stale commandName, and the downstream device tally counts the signal's profiling result as one more MemoryFill/Kernel/etc. Symptom in the wild: a trace with 12000 host AppendMemoryFill calls reports 16002 device MemoryFill(D) — the extra 4002 are AppendSignalEvent results mis-attributed to the prior fill on each of two reused events. Add a dedicated zeCommandListAppendSignalEvent_entry_callback that refreshes threadToLastLaunchInfo with the real command name and a new btx_event_t::SIGNAL tag, then short-circuit event_profiling_result_callback on SIGNAL so no device-tally record is emitted (AppendSignalEvent is a host-side signal and does no GPU work to time). The ring entry itself is still pushed and the cursor still advances, so subsequent profiling results for the same event handle land on the correct slot. Verified against the full bats correctness suite (53/53) and a new reproducer in thapi_ze_test (inorder_imm_Event_08).

The four shared_event / resubmit BTX tests are synthetic inputs fed straight into the interval filter — they don't include Create/Execute because the test_wrapper converter can't synthesize struct values, and that omission makes the trace read as physically impossible (4 results from 2 Appends with no Execute in between). Add a short header to each explaining what ring/cursor branch it exercises. interval_profiling_fast assumed event_profiling_results could arrive inside the Append_exit window. With the new tracer, results are only emitted at sync time, so that interleaving is no longer reachable — delete the test and drop it from TRACE_COMMON. To allow header comments in the .thapi_text_pretty files, teach thapi_log_to_bt_source_component.rb to skip blank/# lines. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

TApplencourt and others added 30 commits May 29, 2026 15:52

ze: drop wrong claim about live-clear stopping cycles

6a855b8

The previous commit said live-clear-before-recurse would stop a cycle. live is cleared AFTER the recursion (see below), not before. The construction argument alone is the reason no guard is needed; remove the misleading sentence.

ze: drop THAPI_REPORT_INJECTED_EVENTS debug counter

322402d

Added in d5603e4 as a regression guard for a planned bats assertion that never landed. No test, internal code, or external doc references it. Easy to reintroduce on demand. Full matrix 43/43.

rubocopt

8810022

Clang-format

48e3843

TApplencourt and others added 16 commits June 9, 2026 21:06

ze: include <alloca.h>; rename inorder_reg_Event_11 -> inorder_reg_Ev…

c83b597

…ent_multithreaded_01 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tracer ze universal#505

Tracer ze universal#505
TApplencourt wants to merge 46 commits into
develfrom
tracer_ze_universal

TApplencourt commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

TApplencourt commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant