CI 2.0 — catalog-driven, driver-sliceable test architecture (pytest) by tinebp · Pull Request #375 · vortexgpgpu/vortex

tinebp · 2026-06-19T13:55:15Z

What

Introduces CI 2.0: Vortex tests become declarative YAML data run by pytest, replacing the imperative, driver-pinned bash in ci/regression.sh (1382 lines, 401 driver-pinned invocations). blackbox.sh stays the untouched executor. Lands alongside the legacy ci.yml (manual-dispatch only) so nothing breaks during migration.

Design docs: docs/proposals/ci_2.0_architecture.md (workflow layer) + docs/proposals/regression_sh_2.0.md (engine).

Why

The driver (simx/rtlsim/xrtsim/opaesim) is hard-coded into every line, so you can't run "simx only" without editing 401 lines — yet rtlsim (~168 runs, the Verilator long pole) dominates cost. With CI 2.0 the driver slice is just pytest -m "simx"; push runs simx, PR adds rtlsim, nightly runs everything.
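The event-to-driver policy described above can be sketched as a small lookup that produces the pytest marker expression. This is only an illustration: the `DRIVERS_BY_EVENT` table, the event names, and the `marker_expr` helper are hypothetical and not taken from the PR, and "everything" is assumed to mean the four drivers named above.

```python
# Hypothetical policy table: which drivers each CI event exercises.
# Event names and the exact nightly set are assumptions for illustration.
DRIVERS_BY_EVENT = {
    "push": ["simx"],
    "pull_request": ["simx", "rtlsim"],
    "schedule": ["simx", "rtlsim", "xrtsim", "opaesim"],
}


def marker_expr(event: str) -> str:
    """Build a pytest -m expression that selects the drivers for an event."""
    return " or ".join(DRIVERS_BY_EVENT[event])


if __name__ == "__main__":
    for event in DRIVERS_BY_EVENT:
        print(f'{event}: pytest -m "{marker_expr(event)}"')
```

Keeping the policy as data means a change such as adding a driver to PR runs touches one table entry rather than every test invocation.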

What's here

Engine

ci/vxcatalog.py — catalog core (load/expand/filter/render, (driver,configs) build-key dedup)
ci/catalog/*.yaml — all 29 categories, 388 specs (22 extracted from the bash; 7 script/build categories via via: script)
ci/conftest.py / ci/test_vortex.py — pytest harness (one marker per value, ambient-XLEN filter, build-once sim_build fixture, needs-provisioning skip)
ci/catalog_query.py — planner (matrix / select / lint)
ci/extract_catalog.py — drafts catalogs from regression.sh.in
ci/run-tests — friendly wrapper → pytest marker expressions
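As a rough sketch of what "expand" and "(driver,configs) build-key dedup" could look like, the snippet below fans a spec out across drivers and configs and then groups the resulting cases by build key. The schema (`name`, `drivers`, `configs`), the sample entries, and both function names are assumptions made for illustration; the real ci/vxcatalog.py may differ.

```python
from itertools import product

# Hypothetical in-memory catalog; the real catalogs are YAML files.
CATALOG = [
    {"name": "vecadd", "drivers": ["simx", "rtlsim"],
     "configs": ["", "-DNUM_THREADS=8"]},
    {"name": "sgemm", "drivers": ["simx"], "configs": ["-DNUM_THREADS=8"]},
]


def expand(catalog):
    """Yield one concrete case per (spec, driver, configs) combination."""
    for spec in catalog:
        for driver, configs in product(spec["drivers"], spec["configs"]):
            yield {"name": spec["name"], "driver": driver, "configs": configs}


def build_keys(cases):
    """Group cases by (driver, configs): each key needs only one sim build."""
    groups = {}
    for case in cases:
        groups.setdefault((case["driver"], case["configs"]), []).append(case["name"])
    return groups


cases = list(expand(CATALOG))
keys = build_keys(cases)
```

Here five expanded cases collapse to four builds, because `vecadd` and `sgemm` share the `("simx", "-DNUM_THREADS=8")` key.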

Workflow

.github/workflows/ci-v2.yml — plan(catalog_query) → build(per xlen) → tests(pytest -m per cell, JUnit) → complete
.github/actions/setup-vortex — composite action (profile-scoped cache + deps)
configure — copies the catalog + pytest config into the build tree

Validation

Structural (no toolchain): catalog lint OK (388 specs / 29 categories); pytest collects 387 (cupbop is xlen-64-only); marker slicing correct (push 40 / PR 74 / schedule 119 cells). One smoke run executed 4 simx amo specs end-to-end green (both blackbox + make-run styles). Real per-category sim execution / parity-vs-legacy runs on CI.

Migration status

22 categories are extractor-drafts (faithful to the bash, worth a review); 7 script/build categories (unittest, synthesis, vector, dtm, sst, gem5, cupbop) currently delegate to legacy via via: script — native per-spec migration is Phase D. ci-v2.yml is manual-dispatch until validated on a runner.

🤖 Generated with Claude Code

Vortex tests become declarative YAML data run by pytest, replacing the imperative, driver-pinned bash in ci/regression.sh (1382 lines, 401 driver-pinned invocations). blackbox.sh stays the untouched executor. The driver slice the whole effort was chasing is now just `pytest ci -m "simx"`.

Engine (ci/), the conventional pytest layout, no config file:
- testcase.py: model + planner CLI (lint | matrix | select); no pytest dep
- conftest.py: hooks/fixtures (markers registered dynamically from the data, parametrize + ambient-XLEN filter, build-once sim_build fixture)
- test_runner.py: the single test_case (shells out; needs-provisioning skip)
- testcases/*.yaml: all 29 categories, 388 cases (22 transcribed from the bash; 7 script/build categories via via:script -> legacy)

Workflow (.github/):
- workflows/ci.yml: catalog-driven: plan(testcase.py matrix by event) -> build(per xlen) -> tests(pytest ci -m per cell, JUnit) -> complete
- actions/setup-vortex: composite action (profile-scoped cache + deps + pip)
- workflows/apptainer-ci.yml: separate minimal env-smoke: composite action + in-container `pytest ci -m "regression and simx"`, weekly offset (Wed) + container-path triggers

configure: copies ci/testcases/ into the build tree (harness .py + conftest ride the existing ci/ copy).

Design: docs/designs/continuous_integration.md.

Markers register dynamically in conftest.py (--strict-markers catches -m typos); test_runner.py is auto-discovered by the test_ prefix; the run passes `ci` as the path, so no pyproject.toml/pytest.ini is needed.

Validated (structural, no toolchain): lint OK (388 cases / 29 categories); pytest collects 387 (cupbop xlen64-only); marker slicing correct (push 40 / PR 74 / schedule 119 cells); all workflow YAML valid. Real sim execution / parity runs on CI.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
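The dynamic marker registration mentioned in this commit can be approximated with two standard pytest hooks, `pytest_configure` and `pytest_generate_tests`. The sketch below is not the PR's conftest.py: the `CASES` list, its field names, and the `marker_names` helper are hypothetical stand-ins for data that would be loaded from the YAML test cases.

```python
# conftest.py (sketch). CASES is a hypothetical stand-in for the loaded YAML.
CASES = [
    {"name": "vecadd", "category": "regression", "driver": "simx"},
    {"name": "vecadd", "category": "regression", "driver": "rtlsim"},
    {"name": "amo", "category": "amo", "driver": "simx"},
]


def marker_names(cases):
    """Collect one marker name per distinct category and driver value."""
    names = set()
    for case in cases:
        names.update((case["category"], case["driver"]))
    return sorted(names)


def pytest_configure(config):
    # Register data-derived markers so --strict-markers accepts them and
    # rejects any name that does not come from the data (for example a typo).
    for name in marker_names(CASES):
        config.addinivalue_line("markers", f"{name}: generated from test-case data")


def pytest_generate_tests(metafunc):
    import pytest  # lazy import keeps the helpers importable without pytest

    if "case" in metafunc.fixturenames:
        params = [
            pytest.param(
                case,
                marks=[getattr(pytest.mark, case["category"]),
                       getattr(pytest.mark, case["driver"])],
                id=f'{case["category"]}-{case["name"]}-{case["driver"]}',
            )
            for case in CASES
        ]
        metafunc.parametrize("case", params)
```

With hooks along these lines, a single `test_case(case)` function would be parametrized once per entry, and an expression such as `-m "regression and simx"` would select by the generated markers.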

The divide-free per-thread coordinate ripple (lane 0 = warp base, each later lane steps +1 along X with single wrap into Y/Z) was a single combinational chain feeding the cta_warp_ram write -- 37 logic levels that cannot close at 300 MHz once the launch grid is a real runtime value (it was only hidden in the core unit-test DUT, which constant-folds the grid). Pipeline the ripple TID_STEP lanes per cycle. CTA dispatch is infrequent and cta_warp_ram is read many cycles after a warp launches, so the added write latency is hidden. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
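A behavioral model may help make the ripple concrete. The Python sketch below is an assumption-laden illustration rather than the RTL: function and parameter names are invented, and `tid_step` merely chunks the lanes the way the commit describes processing TID_STEP lanes per cycle.

```python
def ripple_coords(base, dims, num_lanes):
    """Expand a warp's base (x, y, z) into per-lane coordinates.

    Lane 0 takes the base; each later lane steps +1 along X, wrapping once
    into Y and then Z. No division or modulo is used.
    """
    x, y, z = base
    dim_x, dim_y, _ = dims
    coords = []
    for _ in range(num_lanes):
        coords.append((x, y, z))
        x += 1
        if x == dim_x:        # X overflowed: wrap into Y
            x = 0
            y += 1
            if y == dim_y:    # Y overflowed: wrap into Z
                y = 0
                z += 1
    return coords


def ripple_pipelined(base, dims, num_lanes, tid_step):
    """Group the same sequence into per-cycle chunks of tid_step lanes."""
    coords = ripple_coords(base, dims, num_lanes)
    return [coords[i:i + tid_step] for i in range(0, num_lanes, tid_step)]


stages = ripple_pipelined(base=(2, 1, 0), dims=(4, 2, 3), num_lanes=4, tid_step=2)
```

Chunking shortens the dependent chain evaluated in any one cycle to `tid_step` increments, at the cost of extra cycles before the last lane's coordinates are available.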

Move the per-CTA/per-warp context tables (cta_ctx_ram, cta_warp_ram), the divide-free thread-coordinate expansion pipeline, the wid->cta_id map, and the CTA-CSR read-back out of VX_scheduler and into VX_cta_dispatch, so all CTA launch and context state has a single owner. VX_scheduler keeps only the launch handshake (warp activation + mscratch latch via cta_param) and wires the dispatcher read-back into sched_csr_if. cta_csrs is demoted to an internal signal; a narrow cta_param output replaces it on the module boundary. Pure structural change: same RAMs, same pipeline depth, same launch handshake; no functional/timing/IPC change. rtlsim vecadd/sgemm/sgemm_tcu_wg pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Drop needs:-based auto-skip from the pytest runner: a missing dependency is now a real (red) failure, not a silent skip, and every build warning escalated to an error stays a failure. Clean the sim build dir before each new CONFIGS so a stale Verilator obj_dir can't produce spurious lint errors. Update the CI 2.0 design doc to match. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> (cherry picked from commit 36956fc3f3e453b2543295e69685ad6fd8900d27)

Add VX_fanout_buffer (a combinational counterpart to VX_reset_relay) plus FANOUT_BUFFER/FANOUT_BUFFER_EX macros, and use them to give each FMA/div/sqrt IP its own preserved clock-enable copy across every backend (Quartus en, Vivado aclken, RTL enable). This keeps the high-fanout enable as local distributed routing instead of being merged onto a single global buffer, which on the U55C was stretched into a congested cross-die path at NT16. Also refactors the FMA is_d selector in VX_fpu_std to use VX_shift_register. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> (cherry picked from commit 4572fa5f452769c7c40777f500fb9f3aa79b6851)

Model per-FU dispatch-queue back-pressure with credits: a credit is spent when a uop issues into operand collection and returned when the FU accepts it, so warp suppression now counts in-flight ops still in operand collection (matching the RTL scoreboard) instead of only what already reached the queue. Size the per-FU dispatch queues by VX_CFG_DISPATCH_QUEUE_SIZE rather than a hardcoded 2, and wire that depth into the dispatcher's output channels (the buf_size arg was stored but unused; the dead member is removed). Also clarify comments only in cache.cpp (dirty_mask), opae_sim.cpp and xrt_sim.cpp (host-priority backoff) -- no behavior change in those. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> (cherry picked from commit 21359c6f55842beea757ffeda194a4cee9e2c20c)
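The credit scheme can be illustrated with a small counter model. This is a Python sketch of the idea only (the simulator itself is C++), and the class and method names are hypothetical: a credit is spent at issue into operand collection and returned when the FU accepts the op.

```python
class DispatchCredits:
    """Credit counter for one functional unit's dispatch queue (sketch)."""

    def __init__(self, queue_size):
        self.capacity = queue_size   # e.g. the configured dispatch-queue depth
        self.credits = queue_size    # one credit per free queue slot

    def can_issue(self):
        """A warp targeting this FU is suppressed when no credit is left."""
        return self.credits > 0

    def issue(self):
        """Spend a credit when a uop enters operand collection."""
        if self.credits == 0:
            raise RuntimeError("issue without an available credit")
        self.credits -= 1

    def fu_accept(self):
        """Return a credit when the FU accepts the uop."""
        if self.credits == self.capacity:
            raise RuntimeError("credit returned with nothing in flight")
        self.credits += 1

    @property
    def in_flight(self):
        """Ops issued but not yet accepted, including those still collecting operands."""
        return self.capacity - self.credits


fu = DispatchCredits(queue_size=2)
fu.issue()
fu.issue()
blocked = not fu.can_issue()   # both credits spent, so issue is suppressed
fu.fu_accept()
```

Because the credit is taken at issue rather than at queue entry, ops still in operand collection count against the limit, which is the behavior the commit attributes to the RTL scoreboard.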

The K-major (transposing) DXA load enumerated one GMEM read PER ELEMENT, re-reading the same cache line up to 8x (the model even counted the waste as gmem_dedup). Coalesce the read span to the cache line, matching the RTL addr_gen which reads per line: one cache-line read fans out to its scattered SMEM destinations. On the write side, gather the scattered K-major elements that land in the same LMEM block into one byte-masked block write per beat (the per-core LMEM port accepts a full block/cycle, banked), so the engine drains at SMEM bandwidth instead of one element per beat. This models achievable write bandwidth ahead of the current RTL smem_wr (1 elem/beat) -- a known SimX-ahead-of-RTL gap to be matched on the RTL side. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> (cherry picked from commit ae97cef300dfd5845c258f5e3589c1983399039f)
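The read-side coalescing can be sketched as grouping element addresses by cache line, so each line is read once and fans out to the elements it covers. The 64-byte line size, the sample addresses, and the function name are assumptions chosen for illustration, not values from the model.

```python
def coalesce_reads(elem_addrs, line_size=64):
    """Map each cache-line base address to the element indices it serves.

    One entry per line means one GMEM read per line instead of one per element.
    """
    lines = {}
    for index, addr in enumerate(elem_addrs):
        line_base = addr - (addr % line_size)
        lines.setdefault(line_base, []).append(index)
    return lines


# Five element reads that touch only two cache lines.
reads = coalesce_reads([0, 8, 64, 72, 16])
```

In this example five per-element reads reduce to two line reads; the index lists stand in for the scattered SMEM destinations each line fans out to.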

…platforms

Migrate the OpenCL stack from PoCL-as-libOpenCL (direct-linked) to PoCL built ICD-only, so the system ocl-icd loader discovers the Vortex platform via a vendor .icd and Vortex can run alongside other OpenCL platforms (resolves the ICD-mode request). Device drivers are linked statically into libpocl and the install tree is relocatable.

- docs/building_toolchain.md: PoCL recipe now ENABLE_ICD=ON, POCL_ICD_ABSOLUTE_PATH=OFF, INSTALL_OPENCL_HEADERS=ON (keeps ENABLE_LOADABLE_DRIVERS=OFF). Drops the manual CL-header copy; documents the ICD layout, static driver, relocatable kernel-lib lookup, and OCL_ICD_VENDORS.
- ci/toolchain_install.sh.in: after extracting the PoCL bundle, regenerate the vendor .icd to the relocated libpocl path.
- tests/opencl/common.mk: link the system ocl-icd loader (-lOpenCL) and set OCL_ICD_VENDORS at run time; pin OCL_ICD_LIB_DIR ahead of any other vendor loader (e.g. CUDA). Validated end-to-end: vecadd run-simx PASSED through the loader against a static ICD install.
- tests/hip/common.mk: remove the LD_PRELOAD=libOpenCL.so shim (no longer needed now that the loader sees PoCL via the .icd); discover Vortex via OCL_ICD_VENDORS. chipStar already links the system loader.

Note: shipping this requires rebuilding/re-hosting the prebuilt PoCL bundle ICD-only; the local changes take effect once that bundle is in place.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> (cherry picked from commit e98d908ef1053cae78cdb7c2e1f84ee60d23afa0)

…sudo

Support the portable, cross-loader registration path for real deployments while keeping the test harness/CI sudo-free.

- ci/register_icd.sh: optional helper (run by the user with sudo) that installs/removes /etc/OpenCL/vendors/pocl-vortex.icd pointing at the relocated libpocl. Standard /etc/OpenCL/vendors convention -> works with both ocl-icd and the Khronos loader, and lets any app discover Vortex alongside other platforms with no per-process env var. Not invoked by CI.
- docs/building_toolchain.md: document the two paths -- per-user OCL_ICD_VENDORS (no sudo, ocl-icd-specific, used by the harness/CI) vs. system-wide sudo registration (portable, recommended for deployment). Notes that OCL_ICD_VENDORS is an ocl-icd extension, not OpenCL-spec, and replaces the system vendor scan.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> (cherry picked from commit ac16321ced84e8a9ae3a8b2e07cf5efa54883ae9)

detect_osversion() only recognized Ubuntu and CentOS 7, so RHEL-family hosts (e.g. RHEL 8.10 on CRNCH Rogues-Gallery FPGA nodes) fell through to "unsupported" and configure aborted before generating config.mk. Map the RHEL family (rhel/redhat/rocky/almalinux) and CentOS Stream 8/9 to the centos/7 prebuilt bundle, whose glibc 2.17 binaries run on these newer glibc releases. The --osversion override remains available. Verified detect_osversion against synthetic os-release files for RHEL 8.10, Rocky 9.3, AlmaLinux 8.9 and CentOS Stream 8 (all -> centos/7), with Ubuntu/CentOS-7 detection and the unsupported fallback unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> (cherry picked from commit ee6147ba4a388ba6accb73d13f0bbe3f1b9b37b5)
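The RHEL-family mapping can be modeled in a few lines. The sketch below is a Python illustration, not the configure script's actual `detect_osversion()`: the function names are hypothetical, the parsing of `ID` and `VERSION_ID` from os-release text is an assumed approach, and Ubuntu detection is deliberately left out.

```python
# IDs the commit lists as the RHEL family.
RHEL_FAMILY = {"rhel", "redhat", "rocky", "almalinux"}


def parse_os_release(text):
    """Parse KEY=VALUE lines from os-release content, stripping quotes."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip('"').strip("'")
    return fields


def rhel_family_bundle(text):
    """Return "centos/7" for CentOS 7, CentOS Stream 8/9 and the RHEL family.

    Returns None otherwise; Ubuntu handling is omitted from this sketch.
    """
    fields = parse_os_release(text)
    os_id = fields.get("ID", "").lower()
    major = fields.get("VERSION_ID", "").split(".")[0]
    if os_id in RHEL_FAMILY:
        return "centos/7"
    if os_id == "centos" and major in {"7", "8", "9"}:
        return "centos/7"
    return None


bundle = rhel_family_bundle('NAME="Red Hat Enterprise Linux"\nID="rhel"\nVERSION_ID="8.10"\n')
```

Feeding synthetic os-release text to a pure function like this mirrors the verification approach the commit describes, without needing the actual hosts.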

The OPAE flow targets discontinued Intel PAC cards (Arria 10 / Stratix 10), depends on Intel-supplied platform files (e.g. platform_if.vh from the OPAE PIM), and is no longer maintained or CI-tested, so its platform/memory config can be broken on current toolchains. Add a deprecation banner pointing users to the supported Xilinx Alveo / XRT flow. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> (cherry picked from commit 5e2ee808a5b544e6e6a6c01c738f73fde745d63a)

The hardcoded per-core PERF/IPC examples in simulation.md were taken from an older microarchitecture and could not be reproduced by users (reported IPC was ~2x lower), and recent simx revisions print a single aggregate PERF line rather than the per-core breakdown shown. Add a note that the instruction/cycle/IPC figures are illustrative and depend on configuration, input size, and revision, and document the current single-line format, so they are not treated as fixed targets. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> (cherry picked from commit e88f96ba6266acf1b6082cd5f68aa599a5e2f49e)

tinebp force-pushed the ci_v2_clean branch 3 times, most recently from c2daeed to 2d04263, on June 20, 2026 00:16

tinebp force-pushed the ci_v2_clean branch from 2d04263 to 157b88e on June 20, 2026 01:03

tinebp and others added 13 commits June 19, 2026 18:33

Merge remote-tracking branch 'origin/master' into ci_v2_clean

2eaa4e6

code cleanup

718c643


tinebp wants to merge 14 commits into master from ci_v2_clean
