[Bug]: Driver pod fails to recreate after GPU hot-detach/re-attach: `/lib/firmware/nvidia` ENOENT and stale `firmware_class` search path #2559

**Describe the bug**

When the `nvidia-driver-daemonset` pod is terminated on a node after a GPU has been hot-removed at the PCI level, and the GPU is then re-attached, the **replacement** driver pod fails to install the driver. The `nvidia-installer` step inside the new pod aborts with:

```
ERROR: Failure creating directory '/lib/firmware/nvidia' : (No such file or directory)
WARNING: Unable to find installed file '/lib/firmware/nvidia/580.126.20/gsp_tu10x.bin' (No such file or directory).
WARNING: Unable to find installed file '/lib/firmware/nvidia/580.126.20/gsp_ga10x.bin' (No such file or directory).
```

In addition, the kernel's `firmware_class` search path is left in a stale state from the previous pod:

```
Configuring the following firmware search path in '/sys/module/firmware_class/parameters/path': /run/nvidia/driver/lib/firmware
WARNING: A search path is already configured in /sys/module/firmware_class/parameters/path
         Retaining the current configuration
```

The likely root cause is visible in the **outgoing** pod's logs, where the `_shutdown` → `_unmount_rootfs` step fails:

```
Caught signal
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
umount: ???: umount failed: No such file or directory.
```

`_unmount_rootfs` in `ubuntu24.04/nvidia-driver` runs `umount -l -R ${RUN_DIR}/driver`. Because the GPU was hot-removed at the PCI level before pod termination, the kernel has orphaned device dentries under `/run/nvidia/driver/dev/...`, and `umount -R` aborts on the first one with ENOENT, leaving the rest of the rbind tree mounted on the host.
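
One possible direction, shown only as a sketch: unmount each mount point under the tree individually, deepest first, and keep going when one of them fails. `list_mounts_under` and `unmount_tree` are hypothetical helper names, not functions from the `nvidia-driver` script, and the sketch assumes `/proc/1/mountinfo` reflects the host mount namespace, as in the manual cleanup later in this report.

```shell
# Sketch only: list_mounts_under and unmount_tree are hypothetical helpers,
# not functions from the nvidia-driver script.

# Print every mount point at or below $1, deepest path first.
# $2 is the mountinfo file to read (defaults to the PID 1 view).
list_mounts_under() {
    local prefix="$1"
    local mountinfo="${2:-/proc/1/mountinfo}"
    # Field 5 of mountinfo is the mount point. Match the prefix itself or
    # anything below it, but not siblings such as "<prefix>-other".
    awk -v p="$prefix" '$5 == p || index($5, p "/") == 1 { print $5 }' "$mountinfo" \
        | LC_ALL=C sort -r
}

# Lazily unmount each mount point on its own, so that a single ENOENT does
# not leave the rest of the tree mounted. Returns non-zero if any umount failed.
unmount_tree() {
    local rc=0 mp
    for mp in $(list_mounts_under "$1" "${2:-/proc/1/mountinfo}"); do
        umount -l "$mp" || { echo "umount failed for $mp" >&2; rc=1; }
    done
    return "$rc"
}
```

With these helpers, `unmount_tree "${RUN_DIR}/driver"` would attempt every mount point and report a failure at the end instead of stopping at the first orphaned entry. Whether this fully clears the orphaned device dentries has not been verified here.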

The new pod then inherits a partially-mounted, partially-stale `/run/nvidia/driver` tree on the host, which the kubelet binds back into the new pod as `/lib/firmware` (the daemonset has `nv-firmware` mounted from hostPath `/run/nvidia/driver/lib/firmware`). The bind ends up pointing at a "ghost" directory inside the leftover rbind, so `mkdir /lib/firmware/nvidia` inside the new pod returns ENOENT. The leftover firmware search path in `/sys/module/firmware_class/parameters/path` from the previous pod is the second symptom of the same incomplete teardown.

`set -e` is suspended inside `_shutdown`'s `if _unload_driver; then …` block, so the failed `umount` does not cause a non-zero exit and there is no retry, fallback, or host-side cleanup.
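
As an illustration of what explicit failure handling could look like, the sketch below runs each cleanup command through a wrapper that logs a failure and records it without skipping the remaining steps. `run_cleanup_step` and `cleanup_failed` are hypothetical names that do not exist in the `nvidia-driver` script; this is not a proposed patch, only a way to make the missing behavior concrete.

```shell
# Sketch only: run_cleanup_step and cleanup_failed are hypothetical names,
# not part of the nvidia-driver script.
cleanup_failed=0

# Run one cleanup command. On failure, log it and remember that something
# went wrong, but always return 0 so later steps still run.
run_cleanup_step() {
    local name="$1" rc=0
    shift
    "$@" || rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "cleanup step '${name}' failed with exit status ${rc}" >&2
        cleanup_failed=1
    fi
    return 0
}

# Example wiring (commented out because it touches real host state):
#   run_cleanup_step unmount-rootfs umount -l -R "${RUN_DIR}/driver"
#   run_cleanup_step remove-pid-file rm -f /run/nvidia/nvidia-driver.pid
#   exit "${cleanup_failed}"
```

The point of the design is that the exit status of the failed `umount` becomes visible to the caller, which could then retry or fall back to a per-mount-point unmount.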

Notably, this reproduces with **stock GPU Operator** (no third-party composable-resource integration in the driver-pod path, no script modifications). The hot-detach/re-attach was performed via an external mechanism, and the daemonset termination/recreation was triggered using the operator's own deploy label.

**To Reproduce**

On a worker node with a GPU attached and stock GPU Operator running healthy:

1. Confirm baseline: `kubectl -n gpu-operator get pod -l app=nvidia-driver-daemonset -o wide` shows the driver pod `Running`, and `kubectl -n gpu-operator logs <pod>` ends with `Done, now waiting for signal`.
2. Hot-detach the GPU from the worker. In our environment this was done via an external composable-disaggregation orchestrator.
3. Force the driver daemonset pod to terminate by setting the driver deploy label on the node to `false`. This triggers exactly the same operator-driven SIGTERM that NFD-driven label removal would trigger, without waiting for NFD's poll interval:
   ```
   kubectl label node <worker> nvidia.com/gpu.deploy.driver=false --overwrite
   ```
4. Watch the pod terminate: `kubectl -n gpu-operator logs <pod> -f`. Final lines are:
   ```
   Caught signal
   Stopping NVIDIA persistence daemon...
   Unloading NVIDIA driver kernel modules...
   Unmounting NVIDIA driver rootfs...
   umount: ???: umount failed: No such file or directory.
   ```
   The pod terminates without completing cleanup.
5. Hot-re-attach the GPU (external orchestrator in our case).
6. A new `nvidia-driver-daemonset-*` pod is created. Tail its logs: `nvidia-installer` fails as quoted above, and the firmware search path is reported as already configured.

On the host, before step 5, the leftover state is directly observable:

```
mount | grep /run/nvidia/driver                  # rbind tree still present despite pod 1 termination
cat /sys/module/firmware_class/parameters/path   # still set to /run/nvidia/driver/lib/firmware
ls /run/nvidia/nvidia-driver.state /run/nvidia/nvidia-driver.pid 2>/dev/null
```
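
The same three checks can be wrapped in a small function that returns non-zero when anything is left behind, which may be convenient for scripting the reproduction. This is a sketch: `report_leftovers` is a hypothetical helper, and its arguments simply default to the host paths used in this report.

```shell
# Sketch only: report_leftovers is a hypothetical helper. Arguments default
# to the host paths used in this report.
# Returns 0 when the host looks clean, 1 when leftover state is found.
report_leftovers() {
    local root="${1:-/run/nvidia}"
    local fw_param="${2:-/sys/module/firmware_class/parameters/path}"
    local mountinfo="${3:-/proc/1/mountinfo}"
    local found=0 f

    # Any mount point at or below <root>/driver counts as a leftover mount.
    if awk -v p="${root}/driver" \
        '$5 == p || index($5, p "/") == 1 { hit = 1 } END { exit !hit }' \
        "$mountinfo" 2>/dev/null; then
        echo "leftover mounts under ${root}/driver"
        found=1
    fi

    # The search path counts as set if it holds anything besides whitespace.
    if [ -r "$fw_param" ] && [ -n "$(tr -d '[:space:]' < "$fw_param")" ]; then
        echo "firmware search path still set: $(cat "$fw_param")"
        found=1
    fi

    for f in "${root}/nvidia-driver.state" "${root}/nvidia-driver.pid"; do
        if [ -e "$f" ]; then
            echo "leftover file: $f"
            found=1
        fi
    done

    return "$found"
}
```

Running `report_leftovers` on the host between steps 4 and 5 should print one line per leftover item and exit non-zero if the teardown was incomplete.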

A manual host-side cleanup before step 5 makes the new pod start cleanly, confirming the host-state leak is the cause:

```
awk '$5 ~ "^/run/nvidia/driver"{print $5}' /proc/1/mountinfo | sort -r \
    | xargs -r -n1 umount -l -f 2>/dev/null
rm -rf /run/nvidia/driver
rm -f /run/nvidia/nvidia-driver.state /run/nvidia/nvidia-driver.pid
: > /sys/module/firmware_class/parameters/path
```

**Expected behavior**

After a GPU hot-detach + driver-pod termination + hot-re-attach + driver-pod recreation cycle, the new driver pod should install and load the driver successfully, the same way it does on a fresh boot. Specifically:

- The terminating pod's `_shutdown` should leave the host's `/run/nvidia/driver` tree fully unmounted and removed, and should clear `/sys/module/firmware_class/parameters/path`, regardless of whether the underlying GPU is still present at the PCI level.
- The new pod's `mkdir /lib/firmware/nvidia` (inside `nvidia-installer`) must succeed.

**Environment (please provide the following information):**

- GPU Operator Version: v26.3.2 (driver image 580.126.20)
- OS: Ubuntu 24.04.4 LTS
- Kernel Version: 6.8.0-117-generic
- Container Runtime Version: containerd containerd.io v2.2.3
- Kubernetes Distro and Version: Kubernetes 1.36.1, CNI: flannel

Values used for gpu-operator deployment:

```
USER-SUPPLIED VALUES:
dcgmExporter:
  enabled: false
devicePlugin:
  enabled: false
driver:
  enabled: true
sandboxDevicePlugin:
  enabled: false
toolkit:
  enabled: true
  env:
  - name: CONTAINERD_SOCKET
    value: /var/run/containerd/containerd.sock
  - name: CONTAINERD_CONFIG
    value: /etc/containerd/config.toml
  - name: CONTAINERD_SET_AS_DEFAULT
    value: "true"
```
