Describe the bug
When the nvidia-driver-daemonset pod is terminated on a node after a GPU has been hot-removed at the PCI level, and the GPU is then re-attached, the replacement driver pod fails to install the driver. The nvidia-installer step inside the new pod aborts with:
ERROR: Failure creating directory '/lib/firmware/nvidia' : (No such file or directory)
WARNING: Unable to find installed file '/lib/firmware/nvidia/580.126.20/gsp_tu10x.bin' (No such file or directory).
WARNING: Unable to find installed file '/lib/firmware/nvidia/580.126.20/gsp_ga10x.bin' (No such file or directory).
…and the kernel's firmware_class search path is also left in a stale state from the previous pod:
Configuring the following firmware search path in '/sys/module/firmware_class/parameters/path': /run/nvidia/driver/lib/firmware
WARNING: A search path is already configured in /sys/module/firmware_class/parameters/path
Retaining the current configuration
A possible root cause is visible in the outgoing pod's logs, where the _shutdown → _unmount_rootfs step fails:
Caught signal
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
umount: ???: umount failed: No such file or directory.
_unmount_rootfs in ubuntu24.04/nvidia-driver runs umount -l -R ${RUN_DIR}/driver. Because the GPU was hot-removed at the PCI level before pod termination, the kernel has orphaned device dentries under /run/nvidia/driver/dev/..., and umount -R aborts on the first one with ENOENT, leaving the rest of the rbind tree mounted on the host.
The new pod then inherits a partially-mounted, partially-stale /run/nvidia/driver tree on the host, which the kubelet binds back into the new pod as /lib/firmware (the daemonset has nv-firmware mounted from hostPath /run/nvidia/driver/lib/firmware). The bind ends up pointing at a "ghost" directory inside the leftover rbind, so mkdir /lib/firmware/nvidia inside the new pod returns ENOENT. The leftover firmware search path in /sys/module/firmware_class/parameters/path from the previous pod is the second symptom of the same incomplete teardown.
set -e is suspended inside _shutdown's if _unload_driver; then … block, so the failed umount does not cause a non-zero exit and there is no retry, fallback, or host-side cleanup.
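The errexit behaviour described above can be reproduced in isolation. The sketch below is a generic shell illustration, not the driver script: teardown and fake_umount are hypothetical names. The shell ignores set -e for every command inside a function when that function is called as an if condition (or as a non-final part of an && / || list), so a failing step does not stop the function.

```shell
set -e

# Hypothetical stand-in for a umount call that fails.
fake_umount() { return 1; }

# Hypothetical teardown step: the first command fails, but the second still
# runs whenever errexit is being ignored for this call.
teardown() {
    fake_umount
    echo "continued after failure"
}

# teardown is called as an `if` condition, so errexit is ignored inside it
# and its exit status is that of the final echo.
if teardown; then
    result="teardown reported success"
else
    result="teardown reported failure"
fi
echo "$result"
```

With errexit ignored, the failing call is skipped over and the caller only sees the status of the last command, which is why a failed unmount can go unnoticed.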
Notably this reproduces with stock GPU Operator (no third-party composable-resource integration in the driver-pod path, no script modifications). The hot-detach/re-attach was performed via an external mechanism and the daemonset termination/recreation was triggered using the operator's own deploy label.
To Reproduce
On a worker node with a GPU attached and stock GPU Operator running healthy:
- Confirm baseline: kubectl -n gpu-operator get pod -l app=nvidia-driver-daemonset -o wide shows the driver pod Running, and kubectl -n gpu-operator logs <pod> ends with Done, now waiting for signal.
- Hot-detach the GPU from the worker. In our environment this was done via an external composable-disaggregation orchestrator.
- Force the driver daemonset pod to terminate by setting the deploy label on the node to false. This triggers the same operator-driven SIGTERM that NFD-driven label removal would trigger, without waiting for NFD's poll interval:
kubectl label node <worker> nvidia.com/gpu.deploy.driver=false --overwrite
- Watch the pod terminate with kubectl -n gpu-operator logs <pod> -f. The final lines are:
Caught signal
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
umount: ???: umount failed: No such file or directory.
The pod terminates without completing cleanup.
- Hot-re-attach the GPU (external orchestrator in our case).
- A new nvidia-driver-daemonset-* pod is created. Tail its logs: nvidia-installer fails as quoted above and the firmware search path is reported as already configured.
On the host, before step 5 (re-attaching the GPU), the leftover state is directly observable:
mount | grep /run/nvidia/driver # rbind tree still present despite pod 1 termination
cat /sys/module/firmware_class/parameters/path # still set to /run/nvidia/driver/lib/firmware
ls /run/nvidia/nvidia-driver.state /run/nvidia/nvidia-driver.pid 2>/dev/null
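To make the leftover mounts easier to inspect, a small helper along these lines could be used. This is only a sketch: mounts_under is a hypothetical name, and the optional second argument (a mountinfo file) exists so the function can be tried against a saved copy. It relies on field 5 of mountinfo being the mount point, and it matches on a path boundary so that a sibling such as /run/nvidia/driver-old is not picked up.

```shell
# List mount points at or below a directory, deepest first.
# Usage: mounts_under <dir> [mountinfo-file]
mounts_under() {
    awk -v d="${1%/}" '
        # Field 5 of mountinfo is the mount point. Keep the directory itself
        # and anything below it, but not siblings such as "<dir>-old".
        $5 == d || index($5, d "/") == 1 { print $5 }
    ' "${2:-/proc/self/mountinfo}" | LC_ALL=C sort -r
}
```

On the host this would be called as mounts_under /run/nvidia/driver /proc/1/mountinfo. An empty result means nothing is mounted under the tree any more.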
A manual host-side cleanup before step 5 (re-attaching the GPU) makes the new pod start cleanly, confirming the host-state leak is the cause:
awk '$5 ~ "^/run/nvidia/driver"{print $5}' /proc/1/mountinfo | sort -r \
| xargs -r -n1 umount -l -f 2>/dev/null
rm -rf /run/nvidia/driver
rm -f /run/nvidia/nvidia-driver.state /run/nvidia/nvidia-driver.pid
: > /sys/module/firmware_class/parameters/path
Expected behavior
After a GPU hot-detach + driver-pod termination + hot-re-attach + driver-pod recreation cycle, the new driver pod should install and load the driver successfully, the same way it does on a fresh boot. Specifically:
- The terminating pod's _shutdown should leave the host's /run/nvidia/driver tree fully unmounted and removed, and should clear /sys/module/firmware_class/parameters/path, regardless of whether the underlying GPU is still present at the PCI level.
- The new pod's mkdir /lib/firmware/nvidia (inside nvidia-installer) must succeed.
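One possible shape for a more tolerant teardown is sketched below. It is an assumption about what a fallback could look like, not the operator's code and not a tested fix: unmount_tree and the UMOUNT override are hypothetical. The idea is to try the recursive lazy unmount first and, if that fails, detach each remaining mount point individually, deepest first, without stopping at the first error.

```shell
# Hypothetical fallback for a failed recursive unmount; not the operator's code.
# UMOUNT can be overridden (for example UMOUNT=echo) for a dry run.
UMOUNT="${UMOUNT:-umount}"

# Usage: unmount_tree <dir> [mountinfo-file]
unmount_tree() {
    # Fast path: the same recursive lazy unmount quoted in this report.
    if $UMOUNT -l -R "$1" 2>/dev/null; then
        return 0
    fi
    # Fallback: detach each remaining mount point, deepest first, and keep
    # going when one of them fails instead of stopping at the first error.
    failed=0
    for mp in $(awk -v d="${1%/}" '$5 == d || index($5, d "/") == 1 { print $5 }' \
                    "${2:-/proc/self/mountinfo}" | LC_ALL=C sort -r); do
        $UMOUNT -l "$mp" 2>/dev/null || failed=$((failed + 1))
    done
    # Succeed only if every individual unmount succeeded.
    [ "$failed" -eq 0 ]
}
```

Running UMOUNT=echo unmount_tree /run/nvidia/driver /proc/1/mountinfo prints the command the fast path would run without touching any mounts. The sketch does not handle mount points containing whitespace, which mountinfo reports in escaped form.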
Environment (please provide the following information):
- GPU Operator Version: v26.3.2 (driver image 580.126.20)
- OS: Ubuntu 24.04.4 LTS
- Kernel Version: 6.8.0-117-generic
- Container Runtime Version: containerd containerd.io v2.2.3
- Kubernetes Distro and Version: Kubernetes 1.36.1, CNI: flannel
Values used for gpu-operator deployment:
USER-SUPPLIED VALUES:
dcgmExporter:
  enabled: false
devicePlugin:
  enabled: false
driver:
  enabled: true
sandboxDevicePlugin:
  enabled: false
toolkit:
  enabled: true
  env:
  - name: CONTAINERD_SOCKET
    value: /var/run/containerd/containerd.sock
  - name: CONTAINERD_CONFIG
    value: /etc/containerd/config.toml
  - name: CONTAINERD_SET_AS_DEFAULT
    value: "true"