CI 2.0 — catalog-driven, driver-sliceable test architecture (pytest)#375

Open
tinebp wants to merge 14 commits into master from ci_v2_clean


tinebp (Collaborator) commented Jun 19, 2026

What

Introduces CI 2.0: Vortex tests become declarative YAML data run by pytest, replacing the imperative, driver-pinned bash in ci/regression.sh (1382 lines, 401 driver-pinned invocations). blackbox.sh stays the untouched executor. Lands alongside the legacy ci.yml (manual-dispatch only) so nothing breaks during migration.

Design docs: docs/proposals/ci_2.0_architecture.md (workflow layer) + docs/proposals/regression_sh_2.0.md (engine).

Why

The driver (simx/rtlsim/xrtsim/opaesim) is hard-coded into every line, so you can't run "simx only" without editing 401 lines — yet rtlsim (~168 runs, the Verilator long pole) dominates cost. With CI 2.0 the driver slice is just pytest -m "simx"; push runs simx, PR adds rtlsim, nightly runs everything.
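
The event-to-slice policy above can be pictured as a small lookup that yields a pytest marker expression. This is a minimal sketch, not the PR's planner: `DRIVERS_BY_EVENT` and `marker_expr` are hypothetical names, and the event keys (`push`, `pull_request`, `schedule`) are assumed from GitHub Actions conventions.

```python
# Hypothetical mapping from CI trigger to the drivers it should exercise.
# The driver names come from the PR text; the table itself is an assumption.
DRIVERS_BY_EVENT = {
    "push": ["simx"],
    "pull_request": ["simx", "rtlsim"],
    "schedule": ["simx", "rtlsim", "xrtsim", "opaesim"],
}


def marker_expr(event: str) -> str:
    """Build a pytest -m expression that selects every driver for an event."""
    return " or ".join(DRIVERS_BY_EVENT[event])


if __name__ == "__main__":
    for event in DRIVERS_BY_EVENT:
        print(f'{event}: pytest -m "{marker_expr(event)}"')
```

Keeping the policy as data means widening or narrowing a slice is a one-line table change rather than an edit to each test invocation.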

What's here

Engine

  • ci/vxcatalog.py — catalog core (load/expand/filter/render, (driver,configs) build-key dedup)
  • ci/catalog/*.yaml: all 29 categories, 388 specs (22 extracted from the bash; 7 script/build categories delegated to legacy through via: script)
  • ci/conftest.py / ci/test_vortex.py — pytest harness (one marker per value, ambient-XLEN filter, build-once sim_build fixture, needs-provisioning skip)
  • ci/catalog_query.py — planner (matrix / select / lint)
  • ci/extract_catalog.py — drafts catalogs from regression.sh.in
  • ci/run-tests — friendly wrapper → pytest marker expressions
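
The (driver, configs) build-key dedup mentioned for the catalog core can be sketched as a grouping step: specs that share a key can reuse one simulator build. This is an illustration only; the `Spec` type, its field names, and the sample config strings are assumptions, not the contents of ci/vxcatalog.py.

```python
from typing import Dict, List, NamedTuple, Tuple


class Spec(NamedTuple):
    """Hypothetical, simplified view of one expanded catalog entry."""
    name: str
    driver: str
    configs: str


def group_by_build_key(specs: List[Spec]) -> Dict[Tuple[str, str], List[str]]:
    """Group spec names by (driver, configs) so each key is built only once."""
    groups: Dict[Tuple[str, str], List[str]] = {}
    for spec in specs:
        groups.setdefault((spec.driver, spec.configs), []).append(spec.name)
    return groups


if __name__ == "__main__":
    specs = [
        Spec("vecadd", "simx", "-DNUM_CORES=2"),
        Spec("sgemm", "simx", "-DNUM_CORES=2"),
        Spec("vecadd", "rtlsim", "-DNUM_CORES=2"),
    ]
    for key, names in group_by_build_key(specs).items():
        print(key, names)
```

A build-once fixture could then be keyed on the same tuple, so three specs here would trigger two builds instead of three.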

Workflow

  • .github/workflows/ci-v2.yml: plan(catalog_query) → build(per xlen) → tests(pytest -m per cell, JUnit) → complete
  • .github/actions/setup-vortex — composite action (profile-scoped cache + deps)
  • configure — copies the catalog + pytest config into the build tree

Validation

Structural (no toolchain): catalog lint OK (388 specs / 29 categories); pytest collects 387 (cupbop is xlen-64-only); marker slicing correct (push 40 / PR 74 / schedule 119 cells). One smoke run executed 4 simx amo specs end-to-end green (both blackbox + make-run styles). Real per-category sim execution / parity-vs-legacy runs on CI.

Migration status

22 categories are extractor-drafts (faithful to the bash, worth a review); 7 script/build categories (unittest, synthesis, vector, dtm, sst, gem5, cupbop) currently delegate to legacy via via: script — native per-spec migration is Phase D. ci-v2.yml is manual-dispatch until validated on a runner.

🤖 Generated with Claude Code

tinebp force-pushed the ci_v2_clean branch 3 times, most recently from c2daeed to 2d04263, on June 20, 2026 00:16
Vortex tests become declarative YAML data run by pytest, replacing the
imperative, driver-pinned bash in ci/regression.sh (1382 lines, 401 driver-
pinned invocations). blackbox.sh stays the untouched executor. The driver
slice the whole effort was chasing is now just `pytest ci -m "simx"`.

Engine (ci/) — the conventional pytest layout, no config file:
- testcase.py     model + planner CLI (lint | matrix | select); no pytest dep
- conftest.py     hooks/fixtures: markers registered dynamically from the data,
                  parametrize + ambient-XLEN filter, build-once sim_build fixture
- test_runner.py  the single test_case (shells out; needs-provisioning skip)
- testcases/*.yaml  all 29 categories, 388 cases (22 transcribed from the bash;
                  7 script/build categories via via:script -> legacy)

Workflow (.github/)
- workflows/ci.yml         catalog-driven: plan(testcase.py matrix by event)
                           -> build(per xlen) -> tests(pytest ci -m per cell,
                           JUnit) -> complete
- actions/setup-vortex     composite action (profile-scoped cache + deps + pip)
- workflows/apptainer-ci.yml  separate minimal env-smoke: composite action +
                           in-container `pytest ci -m "regression and simx"`,
                           weekly offset (Wed) + container-path triggers

configure: copies ci/testcases/ into the build tree (harness .py + conftest
ride the existing ci/ copy). Design: docs/designs/continuous_integration.md.

Markers register dynamically in conftest.py (--strict-markers catches -m typos);
test_runner.py is auto-discovered by the test_ prefix; the run passes `ci` as
the path — so no pyproject.toml/pytest.ini is needed.
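
Dynamic marker registration of this kind can be sketched with pytest's `config.addinivalue_line("markers", ...)` call, which is how a plugin or conftest declares markers without an ini file. The helper below is a hedged illustration, not the PR's conftest.py: `register_markers` and the `category`/`driver` field names are assumptions.

```python
def register_markers(config, cases):
    """Register one pytest marker per distinct value found in the case data.

    `config` is expected to be a pytest Config object (or anything exposing
    addinivalue_line); `cases` is a list of dicts loaded from the YAML data.
    The field names used here are assumptions for illustration.
    """
    seen = set()
    for case in cases:
        for field in ("category", "driver"):
            value = case.get(field)
            if value and value not in seen:
                seen.add(value)
                config.addinivalue_line(
                    "markers", f"{value}: cases whose {field} is {value}"
                )
    return sorted(seen)


# In a conftest.py this would typically be called from the pytest_configure
# hook, e.g.:
#
#     def pytest_configure(config):
#         register_markers(config, load_cases())   # load_cases is hypothetical
```

With the markers declared this way, running under `--strict-markers` rejects a mistyped `-m` name instead of silently selecting nothing.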

Validated (structural, no toolchain): lint OK (388 cases / 29 categories);
pytest collects 387 (cupbop xlen64-only); marker slicing correct (push 40 /
PR 74 / schedule 119 cells); all workflow YAML valid. Real sim execution /
parity runs on CI.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
tinebp and others added 13 commits June 19, 2026 18:33
The divide-free per-thread coordinate ripple (lane 0 = warp base, each later lane steps +1 along X with single wrap into Y/Z) was a single combinational chain feeding the cta_warp_ram write -- 37 logic levels that cannot close at 300 MHz once the launch grid is a real runtime value (it was only hidden in the core unit-test DUT, which constant-folds the grid). Pipeline the ripple TID_STEP lanes per cycle. CTA dispatch is infrequent and cta_warp_ram is read many cycles after a warp launches, so the added write latency is hidden.
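
The ripple itself can be modeled in a few lines. The sketch below is a behavioral illustration in Python, not the RTL: it assumes 3-D dimensions `(dim_x, dim_y, dim_z)`, ignores the TID_STEP pipelining, and uses hypothetical function names. It checks the divide-free stepping against the straightforward divide/modulo computation.

```python
def ripple_coords(base, dims, num_lanes):
    """Divide-free expansion: lane 0 is the warp base, each later lane steps
    +1 along X, wrapping once into Y and then Z when a dimension is exhausted."""
    x, y, z = base
    dim_x, dim_y, _dim_z = dims
    coords = []
    for _ in range(num_lanes):
        coords.append((x, y, z))
        x += 1
        if x == dim_x:          # single wrap from X into Y
            x = 0
            y += 1
            if y == dim_y:      # single wrap from Y into Z
                y = 0
                z += 1
    return coords


def divide_coords(linear, dims):
    """Reference computation using division and modulo."""
    dim_x, dim_y, _dim_z = dims
    return (linear % dim_x, (linear // dim_x) % dim_y, linear // (dim_x * dim_y))


if __name__ == "__main__":
    dims = (3, 2, 4)
    base_linear = 4
    lanes = ripple_coords(divide_coords(base_linear, dims), dims, 8)
    reference = [divide_coords(base_linear + i, dims) for i in range(8)]
    print(lanes == reference)
```

Each lane depends on the previous one, which is why the unpipelined form becomes a long combinational chain as the lane count grows.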

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Move the per-CTA/per-warp context tables (cta_ctx_ram, cta_warp_ram), the
divide-free thread-coordinate expansion pipeline, the wid->cta_id map, and
the CTA-CSR read-back out of VX_scheduler and into VX_cta_dispatch, so all
CTA launch and context state has a single owner. VX_scheduler keeps only the
launch handshake (warp activation + mscratch latch via cta_param) and wires
the dispatcher read-back into sched_csr_if. cta_csrs is demoted to an internal
signal; a narrow cta_param output replaces it on the module boundary.

Pure structural change: same RAMs, same pipeline depth, same launch handshake;
no functional/timing/IPC change. rtlsim vecadd/sgemm/sgemm_tcu_wg pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Drop needs:-based auto-skip from the pytest runner: a missing dependency is now
a real (red) failure, not a silent skip, and every build warning escalated to an
error stays a failure. Clean the sim build dir before each new CONFIGS so a stale
Verilator obj_dir can't produce spurious lint errors. Update the CI 2.0 design
doc to match.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit 36956fc3f3e453b2543295e69685ad6fd8900d27)
Add VX_fanout_buffer (a combinational counterpart to VX_reset_relay) plus FANOUT_BUFFER/FANOUT_BUFFER_EX macros, and use them to give each FMA/div/sqrt IP its own preserved clock-enable copy across every backend (Quartus en, Vivado aclken, RTL enable). This keeps the high-fanout enable as local distributed routing instead of being merged onto a single global buffer, which on the U55C was stretched into a congested cross-die path at NT16. Also refactors the FMA is_d selector in VX_fpu_std to use VX_shift_register.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit 4572fa5f452769c7c40777f500fb9f3aa79b6851)
Model per-FU dispatch-queue back-pressure with credits: a credit is spent
when a uop issues into operand collection and returned when the FU accepts
it, so warp suppression now counts in-flight ops still in operand
collection (matching the RTL scoreboard) instead of only what already
reached the queue. Size the per-FU dispatch queues by
VX_CFG_DISPATCH_QUEUE_SIZE rather than a hardcoded 2, and wire that depth
into the dispatcher's output channels (the buf_size arg was stored but
unused; the dead member is removed).
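
The credit scheme can be illustrated with a small counter. This is a Python sketch of the idea only (SimX itself is C++), and the class and method names are hypothetical: a credit is spent at issue into operand collection and returned when the FU accepts the uop, so in-flight uops count against the queue depth.

```python
class DispatchCredits:
    """Hypothetical per-FU credit counter sized by the dispatch queue depth."""

    def __init__(self, queue_size: int):
        self.queue_size = queue_size
        self.credits = queue_size

    def can_issue(self) -> bool:
        """A warp may issue to this FU only while a credit is available."""
        return self.credits > 0

    def issue(self) -> None:
        """Spend a credit when a uop enters operand collection."""
        if self.credits == 0:
            raise RuntimeError("issue without credit")
        self.credits -= 1

    def accept(self) -> None:
        """Return a credit when the FU accepts the uop."""
        if self.credits == self.queue_size:
            raise RuntimeError("accept without outstanding uop")
        self.credits += 1


if __name__ == "__main__":
    fu = DispatchCredits(queue_size=2)
    fu.issue()
    fu.issue()
    print(fu.can_issue())   # both credits are in flight
    fu.accept()
    print(fu.can_issue())   # one credit returned
```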

Also clarify comments only in cache.cpp (dirty_mask), opae_sim.cpp and
xrt_sim.cpp (host-priority backoff) -- no behavior change in those.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit 21359c6f55842beea757ffeda194a4cee9e2c20c)
The K-major (transposing) DXA load enumerated one GMEM read PER ELEMENT,
re-reading the same cache line up to 8x (the model even counted the waste
as gmem_dedup). Coalesce the read span to the cache line, matching the RTL
addr_gen which reads per line: one cache-line read fans out to its
scattered SMEM destinations.
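
Read coalescing of this kind amounts to grouping element addresses by the cache line that contains them. The following is a minimal sketch under stated assumptions: a 64-byte line, byte addresses, and a hypothetical `coalesce_reads` helper; it is not the SimX model's code.

```python
from typing import Dict, List


def coalesce_reads(element_addrs: List[int], line_size: int = 64) -> Dict[int, List[int]]:
    """Map each cache-line base address to the element addresses it serves.

    One entry in the result corresponds to one GMEM read; the listed elements
    are the scattered destinations that read fans out to.
    """
    lines: Dict[int, List[int]] = {}
    for addr in element_addrs:
        line_base = addr - (addr % line_size)
        lines.setdefault(line_base, []).append(addr)
    return lines


if __name__ == "__main__":
    # Eight 8-byte elements in one line, plus one element in the next line.
    addrs = [i * 8 for i in range(8)] + [64]
    reads = coalesce_reads(addrs)
    print(len(addrs), "element accesses ->", len(reads), "line reads")
```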

On the write side, gather the scattered K-major elements that land in the
same LMEM block into one byte-masked block write per beat (the per-core
LMEM port accepts a full block/cycle, banked), so the engine drains at
SMEM bandwidth instead of one element per beat. This models achievable
write bandwidth ahead of the current RTL smem_wr (1 elem/beat) -- a known
SimX-ahead-of-RTL gap to be matched on the RTL side.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit ae97cef300dfd5845c258f5e3589c1983399039f)
…platforms

Migrate the OpenCL stack from PoCL-as-libOpenCL (direct-linked) to PoCL built
ICD-only, so the system ocl-icd loader discovers the Vortex platform via a
vendor .icd and Vortex can run alongside other OpenCL platforms (resolves the
ICD-mode request). Device drivers are linked statically into libpocl and the
install tree is relocatable.

- docs/building_toolchain.md: PoCL recipe now ENABLE_ICD=ON,
  POCL_ICD_ABSOLUTE_PATH=OFF, INSTALL_OPENCL_HEADERS=ON (keeps
  ENABLE_LOADABLE_DRIVERS=OFF). Drops the manual CL-header copy; documents the
  ICD layout, static driver, relocatable kernel-lib lookup, and OCL_ICD_VENDORS.
- ci/toolchain_install.sh.in: after extracting the PoCL bundle, regenerate the
  vendor .icd to the relocated libpocl path.
- tests/opencl/common.mk: link the system ocl-icd loader (-lOpenCL) and set
  OCL_ICD_VENDORS at run time; pin OCL_ICD_LIB_DIR ahead of any other vendor
  loader (e.g. CUDA). Validated end-to-end: vecadd run-simx PASSED through the
  loader against a static ICD install.
- tests/hip/common.mk: remove the LD_PRELOAD=libOpenCL.so shim (no longer
  needed now that the loader sees PoCL via the .icd); discover Vortex via
  OCL_ICD_VENDORS. chipStar already links the system loader.

Note: shipping this requires rebuilding/re-hosting the prebuilt PoCL bundle
ICD-only; the local changes take effect once that bundle is in place.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit e98d908ef1053cae78cdb7c2e1f84ee60d23afa0)
…sudo

Support the portable, cross-loader registration path for real deployments
while keeping the test harness/CI sudo-free.

- ci/register_icd.sh: optional helper (run by the user with sudo) that
  installs/removes /etc/OpenCL/vendors/pocl-vortex.icd pointing at the
  relocated libpocl. Standard /etc/OpenCL/vendors convention -> works with
  both ocl-icd and the Khronos loader, and lets any app discover Vortex
  alongside other platforms with no per-process env var. Not invoked by CI.
- docs/building_toolchain.md: document the two paths -- per-user OCL_ICD_VENDORS
  (no sudo, ocl-icd-specific, used by the harness/CI) vs. system-wide sudo
  registration (portable, recommended for deployment). Notes that
  OCL_ICD_VENDORS is an ocl-icd extension, not OpenCL-spec, and replaces the
  system vendor scan.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit ac16321ced84e8a9ae3a8b2e07cf5efa54883ae9)
detect_osversion() only recognized Ubuntu and CentOS 7, so RHEL-family
hosts (e.g. RHEL 8.10 on CRNCH Rogues-Gallery FPGA nodes) fell through to
"unsupported" and configure aborted before generating config.mk. Map the
RHEL family (rhel/redhat/rocky/almalinux) and CentOS Stream 8/9 to the
centos/7 prebuilt bundle, whose glibc 2.17 binaries run on these newer
glibc releases. The --osversion override remains available.

Verified detect_osversion against synthetic os-release files for RHEL 8.10,
Rocky 9.3, AlmaLinux 8.9 and CentOS Stream 8 (all -> centos/7), with
Ubuntu/CentOS-7 detection and the unsupported fallback unchanged.
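
The RHEL-family mapping can be sketched against synthetic os-release text. This is a Python illustration of the rule described above, not the configure script: the function names are hypothetical, and Ubuntu handling is deliberately left out because the commit message does not state its bundle name.

```python
from typing import Dict, Optional

RHEL_FAMILY = {"rhel", "redhat", "rocky", "almalinux"}


def parse_os_release(text: str) -> Dict[str, str]:
    """Parse KEY=value lines from os-release content, stripping quotes."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip('"')
    return fields


def rhel_family_bundle(text: str) -> Optional[str]:
    """Return "centos/7" for RHEL-family hosts and CentOS 7/Stream 8/9.

    Returns None for anything this sketch does not cover (including Ubuntu,
    which the real script handles separately).
    """
    fields = parse_os_release(text)
    os_id = fields.get("ID", "").lower()
    major = fields.get("VERSION_ID", "").split(".")[0]
    if os_id in RHEL_FAMILY:
        return "centos/7"
    if os_id == "centos" and major in {"7", "8", "9"}:
        return "centos/7"
    return None


if __name__ == "__main__":
    print(rhel_family_bundle('ID="rhel"\nVERSION_ID="8.10"\n'))
    print(rhel_family_bundle('ID="rocky"\nVERSION_ID="9.3"\n'))
```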

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit ee6147ba4a388ba6accb73d13f0bbe3f1b9b37b5)
The OPAE flow targets discontinued Intel PAC cards (Arria 10 / Stratix 10),
depends on Intel-supplied platform files (e.g. platform_if.vh from the OPAE
PIM), and is no longer maintained or CI-tested, so its platform/memory config
can be broken on current toolchains. Add a deprecation banner pointing users to
the supported Xilinx Alveo / XRT flow.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit 5e2ee808a5b544e6e6a6c01c738f73fde745d63a)
The hardcoded per-core PERF/IPC examples in simulation.md were taken from an
older microarchitecture and could not be reproduced by users (reported IPC was
~2x lower), and recent simx revisions print a single aggregate PERF line rather
than the per-core breakdown shown. Add a note that the instruction/cycle/IPC
figures are illustrative and depend on configuration, input size, and revision,
and document the current single-line format, so they are not treated as fixed
targets.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit e88f96ba6266acf1b6082cd5f68aa599a5e2f49e)