99-mellanox: fix "ifaces[id]: unbound variable" — align arrays, skip interface-less devices#271

Open
elordahl wants to merge 1 commit into NVIDIA:main from elordahl:fix/mellanox-hook-array-skew

@elordahl elordahl commented Jun 8, 2026

Problem

conf/hooks/99-mellanox.sh builds drivers/devices, ifaces, and umads/issms from three independent sysfs globs and assumes they are equal-length and index-aligned. When a Mellanox PCI function exposes an infiniband_verbs entry but no infiniband/ class device, ifaces[] ends up shorter than devices[]. This happens for a BlueField DPU, an SF/SR-IOV representor, a down port, or (commonly on RoCE + Kubernetes nodes) a VF whose RDMA device has been moved into a pod's network namespace by rdma-cni. The mount loop only range-checks id against ${#devices[@]}, so it dereferences an unset ifaces[id] and, under set -euo pipefail, aborts:

/etc/enroot/hooks.d/99-mellanox.sh: line 88: ifaces[id]: unbound variable
[ERROR] /etc/enroot/hooks.d/99-mellanox.sh exited with return code 1

This blocks every container launch on the affected node. Before the abort, the skew also silently mis-pairs devices[] with the wrong ifaces[] entry for indices past the first gap. It was observed breaking NCCL all_reduce_perf_mpi, alltoall_perf_mpi, etc. On SR-IOV/RoCE Kubernetes nodes this is the normal steady state whenever a pod holds a VF.
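
The failure mode can be reproduced in isolation. The sketch below is not the hook's code: the function name reproduce_skew and the device names are made up for illustration. It only shows the same pattern, a loop bounded by one array that indexes into a shorter one under set -u.

```bash
#!/usr/bin/env bash
# Illustration only: hypothetical names and data, not taken from 99-mellanox.sh.
reproduce_skew() (
    set -euo pipefail
    devices=(uverbs0 uverbs1 uverbs2)   # three verbs devices on the host
    ifaces=(mlx5_0 mlx5_2)              # mlx5_1 is in another network namespace
    for id in "${!devices[@]}"; do
        # The loop is bounded by devices[] only, so id=2 reads an unset ifaces[2].
        echo "${devices[id]} -> ${ifaces[id]}"
    done
)
```

Running reproduce_skew pairs uverbs1 with mlx5_2 (the wrong interface) on the second iteration, then exits non-zero on the third with an "unbound variable" error.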

Fix

  • Enumerate per PCI function, anchored on infiniband_verbs, and resolve the interface and management nodes from the same <bdf> directory — so the arrays stay index-aligned regardless of which sysfs sub-entries exist.
  • A device with no InfiniBand interface in the current namespace is skipped with a warning (common::log WARN … ; continue) rather than aborting the launch. Such a device is normally a tenant-owned VF (its ib_device lives in another netns), a DPU, or an SF. umad/issm entries are likewise guarded.
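
A minimal sketch of the per-function enumeration described above, assuming a directory layout of <root>/<bdf>/infiniband_verbs/<uverbsN> and <root>/<bdf>/infiniband/<iface>. The function name enumerate_aligned, its root argument, the plain echo warning (standing in for common::log), and the /dev/infiniband/ prefix are assumptions for illustration, not the code in this PR.

```bash
#!/usr/bin/env bash
# Sketch only: enumerate_aligned is a hypothetical helper.
# It fills the global arrays devices[] and ifaces[] in lockstep.
enumerate_aligned() {
    local root=$1 verbs bdf entry iface
    devices=()
    ifaces=()
    for verbs in "${root}"/*/infiniband_verbs/*; do
        [ -e "${verbs}" ] || continue          # the glob matched nothing
        bdf=${verbs%/infiniband_verbs/*}       # per-function <bdf> directory
        iface=""
        for entry in "${bdf}"/infiniband/*; do
            if [ -e "${entry}" ]; then
                iface=${entry##*/}
                break
            fi
        done
        if [ -z "${iface}" ]; then
            # Fail open: warn and skip instead of aborting the launch.
            echo "WARN: ${bdf##*/}: no InfiniBand interface, skipping" >&2
            continue
        fi
        # Both arrays are appended together, so index i always refers
        # to the same PCI function in each of them.
        devices+=("/dev/infiniband/${verbs##*/}")
        ifaces+=("${iface}")
    done
}
```

Because an entry is appended to both arrays or to neither, a function that lacks an interface cannot shift the indices of the functions that follow it.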

Open question (feedback welcome)

This makes the hook fail-open: a genuinely degraded NIC is skipped (and logged) rather than halting the launch. The hook can't distinguish a real fault from a benign tenant VF/DPU/SF (all present as verbs-without-interface), so enforcing "I require N HCAs" is left to the caller (e.g. NCCL via NCCL_IB_HCA). A stricter physfn-aware variant (skip VFs, but error on a PF with no interface) is one alternative.
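
For discussion, the physfn-aware alternative could look roughly like the sketch below. It relies on sysfs giving each SR-IOV VF a physfn symlink to its parent PF; the helper name missing_iface_policy and its "skip"/"error" output are hypothetical. It does not cover DPUs or SFs, which would still land in the "error" branch.

```bash
#!/usr/bin/env bash
# Sketch only: missing_iface_policy is a hypothetical helper, not part of this PR.
# Argument: the sysfs directory of a PCI function that has no InfiniBand interface.
missing_iface_policy() {
    local bdf=$1
    if [ -L "${bdf}/physfn" ]; then
        # A physfn link marks the function as a VF: likely tenant-owned, so skip it.
        echo "skip"
    else
        # No physfn link: treated here as a PF, so a missing interface is a fault.
        echo "error"
    fi
}
```

The caller would continue on "skip" and abort the hook on "error", which restores fail-closed behavior for PFs only.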

@elordahl (Author)

@flx42 to review

Three independent sysfs globs (infiniband_verbs, infiniband,
infiniband_mad) built the parallel arrays assuming equal counts and
aligned ordering. When a PCI function exposed a verbs device but no
infiniband/ class entry, ifaces[] ended up shorter than devices[]. The
mount loop only range-checked against ${#devices[@]}, so it dereferenced
an unset ifaces[id] and, under set -euo pipefail, aborted the hook:

  /etc/enroot/hooks.d/99-mellanox.sh: line 88: ifaces[id]: unbound variable
  [ERROR] /etc/enroot/hooks.d/99-mellanox.sh exited with return code 1

Before the abort, the skew also silently mis-paired devices[] with the
wrong ifaces[] entry for every id past the first gap.

On SR-IOV/RoCE nodes this is the normal steady state, not a hardware
fault: when a pod starts, the SR-IOV + ovs-cni + rdma-cni chain moves an
assigned VF's RDMA device into the pod's network namespace. The VF's
infiniband_verbs char node still enumerates on the host while its
ib_device leaves the host /sys/class/infiniband and the per-function
infiniband/ directory, so devices[] > ifaces[] for as long as the pod
holds the VF. Observed on a Blackwell Spectrum-X node where a training
pod claimed all eight port-0 VFs and any concurrent
MELLANOX_VISIBLE_DEVICES=all launch hit the crash; originally seen
breaking NCCL alltoall_perf_mpi, all_gather_perf_mpi, all_reduce_perf_mpi,
and reduce_scatter_perf_mpi.

Fix: enumerate per PCI function anchored on infiniband_verbs and resolve
the iface and management nodes from the same <bdf> directory, so the
arrays are always index-aligned regardless of which sysfs sub-entries are
present. A device whose interface is absent in the current namespace is
skipped with a warning rather than treated as a fatal error: an
interfaceless verbs node cannot be mounted (its /sys/class/infiniband
entry does not exist here), and on shared SR-IOV nodes its absence is
expected. umad/issm entries are guarded with [ -n ] since their absence
is less critical.

Signed-off-by: Eric Lordahl <elordahl@nvidia.com>
@elordahl elordahl force-pushed the fix/mellanox-hook-array-skew branch from 916ae2d to 86f78d2 Compare June 12, 2026 15:20
@elordahl elordahl changed the title 99-mellanox: fix array skew and abort on degraded NIC 99-mellanox: fix verbs/iface array skew, skip interfaceless devices Jun 12, 2026
@elordahl elordahl changed the title 99-mellanox: fix verbs/iface array skew, skip interfaceless devices 99-mellanox: fix "ifaces[id]: unbound variable" — align arrays, skip interface-less devices Jun 13, 2026