99-mellanox: fix "ifaces[id]: unbound variable" — align arrays, skip interface-less devices#271
Open
elordahl wants to merge 1 commit into
Open
99-mellanox: fix "ifaces[id]: unbound variable" — align arrays, skip interface-less devices#271elordahl wants to merge 1 commit into
elordahl wants to merge 1 commit into
Conversation
Author
|
@flx42 to review |
Three independent sysfs globs (infiniband_verbs, infiniband,
infiniband_mad) built the parallel arrays assuming equal counts and
aligned ordering. When a PCI function exposed a verbs device but no
infiniband/ class entry, ifaces[] ended up shorter than devices[]. The
mount loop only range-checked against ${#devices[@]}, so it dereferenced
an unset ifaces[id] and, under set -euo pipefail, aborted the hook:
/etc/enroot/hooks.d/99-mellanox.sh: line 88: ifaces[id]: unbound variable
[ERROR] /etc/enroot/hooks.d/99-mellanox.sh exited with return code 1
Before the abort, the skew also silently mis-paired devices[] with the
wrong ifaces[] entry for every id past the first gap.
On SR-IOV/RoCE nodes this is the normal steady state, not a hardware
fault: when a pod starts, the SR-IOV + ovs-cni + rdma-cni chain moves an
assigned VF's RDMA device into the pod's network namespace. The VF's
infiniband_verbs char node still enumerates on the host while its
ib_device leaves the host /sys/class/infiniband and the per-function
infiniband/ directory, so devices[] > ifaces[] for as long as the pod
holds the VF. Observed on a Blackwell Spectrum-X node where a training
pod claimed all eight port-0 VFs and any concurrent
MELLANOX_VISIBLE_DEVICES=all launch hit the crash; originally seen
breaking NCCL alltoall_perf_mpi, all_gather_perf_mpi, all_reduce_perf_mpi,
and reduce_scatter_perf_mpi.
Fix: enumerate per PCI function anchored on infiniband_verbs and resolve
the iface and management nodes from the same <bdf> directory, so the
arrays are always index-aligned regardless of which sysfs sub-entries are
present. A device whose interface is absent in the current namespace is
skipped with a warning rather than treated as a fatal error: an
interfaceless verbs node cannot be mounted (its /sys/class/infiniband
entry does not exist here), and on shared SR-IOV nodes its absence is
expected. umad/issm entries are guarded with [ -n ] since their absence
is less critical.
Signed-off-by: Eric Lordahl <elordahl@nvidia.com>
916ae2d to
86f78d2
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
conf/hooks/99-mellanox.shbuildsdrivers/devices,ifaces, andumads/issmsfrom three independent sysfs globs and assumes they are equal-length and index-aligned. When a Mellanox PCI function exposes aninfiniband_verbsentry but noinfiniband/class device — a BlueField DPU, an SF/SR-IOV representor, a down port, or (commonly on RoCE + Kubernetes nodes) a VF whose RDMA device has been moved into a pod's network namespace byrdma-cni—ifaces[]ends up shorter thandevices[]. The mount loop only range-checksidagainst${#devices[@]}, so it dereferences an unsetifaces[id]and, underset -euo pipefail, aborts:This blocks every container launch on the affected node (and, before the abort, silently mis-pairs
devices[]with the wrongifaces[]entry for indices past the first gap). Observed breaking NCCLall_reduce_perf_mpi,alltoall_perf_mpi, etc. On SR-IOV/RoCE Kubernetes nodes this is the normal steady state whenever a pod holds a VF.Fix
infiniband_verbs, and resolve the interface and management nodes from the same<bdf>directory — so the arrays stay index-aligned regardless of which sysfs sub-entries exist.common::log WARN … ; continue) rather than aborting the launch. Such a device is normally a tenant-owned VF (itsib_devicelives in another netns), a DPU, or an SF.umad/issmentries are likewise guarded.Open question (feedback welcome)
This makes the hook fail-open: a genuinely degraded NIC is skipped (and logged) rather than halting the launch. The hook can't distinguish a real fault from a benign tenant VF/DPU/SF (all present as verbs-without-interface), so enforcing "I require N HCAs" is left to the caller (e.g. NCCL via
NCCL_IB_HCA). A stricterphysfn-aware variant (skip VFs, but error on a PF with no interface) is one alternative.