diff --git a/.agents/skills/test-release-canary/SKILL.md b/.agents/skills/test-release-canary/SKILL.md
new file mode 100644
index 000000000..d0826536a
--- /dev/null
+++ b/.agents/skills/test-release-canary/SKILL.md
@@ -0,0 +1,119 @@
+---
+name: test-release-canary
+description: Manually dispatch and iterate on the Release Canary workflow that smoke-tests published OpenShell artifacts (install.sh on macOS/Ubuntu/Fedora, Helm chart on kind) after each Release Dev publish. Use when changing `.github/workflows/release-canary.yml`, validating a release before tagging, debugging a canary failure, or reproducing a canary job locally. Trigger keywords - release canary, release-canary, canary failed, canary dispatch, test release canary, post-release smoke, install.sh canary, helm chart canary, kind canary, dispatch canary.
+---
+
+# Test Release Canary
+
+The Release Canary (`.github/workflows/release-canary.yml`) smoke-tests the artifacts a `Release Dev` run just published. It is the last automated checkpoint before tagging a public release: if the canary is red, the published `dev` artifacts do not install on a stock environment.
+
+## What the canary verifies
+
+| Job | Runner | Verifies |
+|---|---|---|
+| `macos` | `macos-latest-xlarge` | `install.sh` resolves the Homebrew formula, brew installs the cask, and `openshell status` reaches the brew-services–backed local gateway with the VM driver. |
+| `ubuntu` | `ubuntu-latest` | `install.sh` installs the Debian package, the post-install systemd user service starts, and `openshell status` reaches the local gateway with the Docker driver. |
+| `fedora` | `fedora:latest` container | `install.sh` installs the RPM packages, the local gateway starts under Podman, and `openshell status` succeeds. |
+| `kubernetes` | `ubuntu-latest` + kind | `helm install oci://ghcr.io/nvidia/openshell/helm-chart --version 0.0.0-dev` succeeds in a kind cluster, the gateway pod becomes Ready, port-forward exposes 8080, and the released CLI registers the in-cluster gateway and runs `openshell status` against it. |
+
+`install.sh` defaults to the *latest tagged* release, so the canary is checking that the most recent public release still installs, not the just-published `dev` build. The `kubernetes` job is the exception: it pins to `0.0.0-dev` chart + `:dev` images.
+
+## Trigger paths
+
+The workflow has two triggers:
+
+```yaml
+on:
+  workflow_dispatch:
+  workflow_run:
+    workflows: ["Release Dev"]
+    types: [completed]
+```
+
+- **Automatic.** Every successful `Release Dev` run (on `main` or a manual dispatch of Release Dev) fires the canary. Each job gates on `github.event.workflow_run.conclusion == 'success'` so a failed Release Dev does not run the canary.
+- **Manual.** `workflow_dispatch` lets you run the canary on demand against any branch's workflow definition.
+
+When dispatched manually, `github.event.workflow_run.head_sha` is empty and the workflow falls back to `github.sha` (the branch tip) for the `install.sh` URL.
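The `head_sha` fallback described in "Trigger paths" can be sketched in plain shell. This is an illustration only, not the workflow's actual step: the helper names `resolve_install_sha` and `install_url` are hypothetical, the `https://` scheme is assumed, and the two arguments merely stand in for `github.event.workflow_run.head_sha` and `github.sha`.

```shell
# Hypothetical helper: choose the commit that install.sh is fetched from.
# $1 stands in for github.event.workflow_run.head_sha (empty on manual dispatch).
# $2 stands in for github.sha (the branch tip).
resolve_install_sha() {
  # ${1:-$2} expands to $2 when $1 is unset or empty.
  printf '%s\n' "${1:-$2}"
}

# Hypothetical helper: build the raw install.sh URL for a commit.
# The https:// scheme is an assumption of this sketch.
install_url() {
  printf 'https://raw.githubusercontent.com/NVIDIA/OpenShell/%s/install.sh\n' "$1"
}

# Manual dispatch: head_sha is empty, so the branch tip ("abc123" here) is used.
install_url "$(resolve_install_sha "" "abc123")"
```

With a non-empty first argument, as in an automatic `workflow_run` trigger, the same helper would return that value instead of the branch tip.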
+
+## Manual dispatch
+
+Run the canary as-is on the current branch:
+
+```shell
+gh workflow run release-canary.yml --ref "$(git branch --show-current)"
+```
+
+Watch the run that starts:
+
+```shell
+sleep 5 # let GitHub register the dispatch
+gh run list --workflow release-canary.yml --limit 1
+gh run watch "$(gh run list --workflow release-canary.yml --limit 1 --json databaseId --jq '.[0].databaseId')"
+```
+
+View only failed jobs after completion:
+
+```shell
+gh run view --log-failed
+```
+
+## Iterating on the canary itself
+
+When you change `release-canary.yml` on a branch, a manual dispatch on that branch tests *your branch's workflow logic* against *main's published artifacts* (`0.0.0-dev` chart, `:dev` images, latest tagged install.sh assets). This is what you want for iterating on the canary: you're validating that the canary still works against known-good artifacts.
+
+Note that `install.sh` is pulled from `raw.githubusercontent.com/NVIDIA/OpenShell/${head_sha}/install.sh`, so changes to `install.sh` on your branch *are* exercised even though the binaries it downloads are from the latest public tag.
+
+## Testing artifacts from a specific SHA
+
+`Release Dev` publishes two chart versions for every dev build (see `.github/actions/release-helm-oci/action.yml:89-102`):
+
+- `oci://ghcr.io/nvidia/openshell/helm-chart:0.0.0-dev`: floating, overwritten on every main push.
+- `oci://ghcr.io/nvidia/openshell/helm-chart:0.0.0-dev.`: immutable, `appVersion` set to the same SHA so it pulls `ghcr.io/nvidia/openshell/gateway:` and `:supervisor:`.
+
+To smoke-test the chart for a specific dev build, dispatch `Release Dev` on the branch first, then run the kind canary steps locally pointed at the SHA-pinned chart (see "Local kind reproduction" below). The release-canary workflow itself does not currently expose `chart_version` / `image_tag` inputs.
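As a rough sketch of how a SHA-pinned chart version might be assembled before a local run, the helpers below print (rather than execute) a `helm install` command. Both function names are hypothetical, and the exact form of the suffix after `0.0.0-dev.` is not spelled out above, so `BUILD_SUFFIX` is treated as an opaque, caller-supplied value.

```shell
# Hypothetical helper: build a pinned chart version string.
# $1 is BUILD_SUFFIX, an assumed caller-supplied value; its exact format
# is not specified in this document.
pinned_chart_version() {
  if [ -z "$1" ]; then
    echo "usage: pinned_chart_version BUILD_SUFFIX" >&2
    return 1
  fi
  printf '0.0.0-dev.%s\n' "$1"
}

# Hypothetical helper: print the helm command instead of running it,
# so the sketch has no side effects on a cluster.
print_pinned_install() {
  version="$(pinned_chart_version "$1")" || return 1
  printf 'helm install openshell oci://ghcr.io/nvidia/openshell/helm-chart --version %s --namespace openshell --create-namespace\n' "$version"
}

# Example with a placeholder suffix.
print_pinned_install "abc123"
```

Printing the command first makes it easy to check the version string before anything touches a cluster.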
+
+## Local kind reproduction
+
+The `kubernetes` job can be reproduced on any machine with Docker and `mise install`-provided `kubectl` + `helm`:
+
+```shell
+kind create cluster --name release-canary-local
+
+helm install openshell oci://ghcr.io/nvidia/openshell/helm-chart \
+  --version 0.0.0-dev \
+  --namespace openshell --create-namespace \
+  --set server.disableTls=true \
+  --set pkiInitJob.enabled=false \
+  --wait --timeout 5m
+
+kubectl wait --namespace openshell \
+  --for=condition=Ready pod \
+  --selector="app.kubernetes.io/name=openshell,app.kubernetes.io/instance=openshell" \
+  --timeout=300s
+
+kubectl port-forward --namespace openshell svc/openshell 8080:8080 &
+openshell gateway add http://127.0.0.1:8080 --local --name kind
+openshell status
+```
+
+Swap `0.0.0-dev` for `0.0.0-dev.` to pin to a specific dev build. Tear down with `kind delete cluster --name release-canary-local`.
+
+Loopback registration auto-derives the gateway name to `openshell` if `--name` is omitted, which collides with the `install.sh`-installed local gateway, so always pass `--name kind` (or another distinct name) when registering in addition to a local install.
+
+## Diagnosing failures
+
+| Symptom | Likely cause | Where to look |
+|---|---|---|
+| `macos`/`ubuntu`/`fedora` job fails on `install.sh` | Latest tagged release missing an asset, checksum mismatch, or `install.sh` regression on this branch. | Job log around the `curl … install.sh \| sh` step. |
+| `macos`/`ubuntu`/`fedora` job fails on `openshell status` | Local gateway service did not start (systemd/brew/podman). Often a driver issue. | Service logs in the job log; `OPENSHELL_DRIVERS` env in the "Ensure …" step. |
+| `kubernetes` job fails on `helm install --wait` | Chart did not deploy in 5 min, usually because of an image pull failure or a failing readiness probe. | "Diagnostics on failure" step dumps `helm status`, manifest, pod describe, pod logs. |
+| `kubernetes` job fails on `kubectl wait` | Gateway pod stuck `CrashLoopBackOff` or `ImagePullBackOff`. | Diagnostics dump; check `:dev` image existence at `ghcr.io/nvidia/openshell/gateway`. |
+| `kubernetes` job fails on `openshell gateway add` or `status` | Port-forward not reachable, or CLI/gateway proto mismatch. | `port-forward.log` and `openshell gateway list` in the diagnostics dump. |
+
+The `kubernetes` job's diagnostics step (only runs `if: failure()`) emits, in order: helm status, rendered manifest, `kubectl get all`, pod descriptions, pod logs (200 lines per container), port-forward log, gateway list, CLI version. Read it top-to-bottom; most failures become apparent by the manifest or pod logs.
+
+## Related
+
+- `helm-dev-environment` skill: local k3d-based dev environment (more featureful than the canary's kind cluster, but uses Skaffold-built local images, not published artifacts).
+- `watch-github-actions` skill: generic `gh run` workflow monitoring.
+- `debug-openshell-cluster` skill: runtime gateway/sandbox diagnostics that pair with the kind job's diagnostics dump.
diff --git a/CI.md b/CI.md
index d1b4fd176..aad9ee835 100644
--- a/CI.md
+++ b/CI.md
@@ -117,6 +117,16 @@ The bot's full administrator documentation is internal to NVIDIA. The only comma
 | `.github/workflows/e2e-gate-check.yml` | Reusable gate logic shared by E2E and GPU E2E. |
 | `.github/workflows/e2e-label-help.yml` | When a `test:e2e*` label is applied, posts a PR comment telling the maintainer the next manual step (re-run an existing workflow run, or `/ok to test ` to refresh the mirror). |
 
+## Release workflows
+
+These workflows run after merge to publish dev/tagged artifacts and verify them. They are not PR-gated.
+
+| File | Role |
+|---|---|
+| `.github/workflows/release-dev.yml` | Publishes the rolling `dev` build on every push to `main`. Builds gateway/supervisor images and binaries, packages, wheels, and pushes the Helm chart as `oci://ghcr.io/nvidia/openshell/helm-chart:0.0.0-dev` (plus an immutable `0.0.0-dev.` pin). Also dispatchable manually. |
+| `.github/workflows/release-tag.yml` | Publishes a tagged public release. |
+| `.github/workflows/release-canary.yml` | Smoke-tests published artifacts on `macos`, `ubuntu`, `fedora`, and `kubernetes` (kind + Helm) runners. Triggers automatically when `Release Dev` succeeds, and via `workflow_dispatch` on any branch (`gh workflow run release-canary.yml --ref `). The `kubernetes` job pins to `0.0.0-dev` artifacts; the other jobs install the latest tagged release via `install.sh`. See the `test-release-canary` skill for the manual-dispatch playbook and local kind reproduction. |
+
 ## Required status contexts
 
 Require these statuses in the branch ruleset for push-based CI:
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index ef07c4495..00c1a6713 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -75,6 +75,7 @@ Skills live in `.agents/skills/`. Your agent's harness can discover and load the
 | Reviewing | `review-github-pr` | Summarize PR diffs and key design decisions |
 | Reviewing | `review-security-issue` | Assess security issues for severity and remediation |
 | Reviewing | `watch-github-actions` | Monitor CI pipeline status and logs |
+| Reviewing | `test-release-canary` | Dispatch and iterate on the Release Canary workflow that smoke-tests published artifacts |
 | Triage | `triage-issue` | Assess, classify, and route community-filed issues |
 | Platform | `generate-sandbox-policy` | Generate YAML sandbox policies from requirements or API docs |
 | Platform | `tui-development` | Development guide for the ratatui-based terminal UI |