[Bug]: Driver pod fails to recreate after GPU hot-detach/re-attach: /lib/firmware/nvidia ENOENT and stale firmware_class search path #2559

@wilkowskia

Describe the bug

When a GPU is hot-removed at the PCI level, the nvidia-driver-daemonset pod on that node is terminated, and the GPU is then re-attached, the replacement driver pod fails to install the driver. The nvidia-installer step inside the new pod aborts with:

ERROR: Failure creating directory '/lib/firmware/nvidia' : (No such file or directory)
WARNING: Unable to find installed file '/lib/firmware/nvidia/580.126.20/gsp_tu10x.bin' (No such file or directory).
WARNING: Unable to find installed file '/lib/firmware/nvidia/580.126.20/gsp_ga10x.bin' (No such file or directory).

…and the kernel's firmware_class search path is also left in a stale state from the previous pod:

Configuring the following firmware search path in '/sys/module/firmware_class/parameters/path': /run/nvidia/driver/lib/firmware
WARNING: A search path is already configured in /sys/module/firmware_class/parameters/path
         Retaining the current configuration

A possible root cause is visible in the outgoing pod's logs, where its _shutdown_unmount_rootfs step fails:

Caught signal
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
umount: ???: umount failed: No such file or directory.

_unmount_rootfs in ubuntu24.04/nvidia-driver runs umount -l -R ${RUN_DIR}/driver. Because the GPU was hot-removed at the PCI level before pod termination, the kernel has orphaned device dentries under /run/nvidia/driver/dev/..., and umount -R aborts on the first one with ENOENT, leaving the rest of the rbind tree mounted on the host.
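One way to avoid the all-or-nothing behaviour of a single recursive umount is to unmount each mount point individually, deepest first, and continue past failures. The following is a minimal sketch of that idea, not the driver script's implementation: list_mounts_under and unmount_tree are hypothetical helper names, and the sketch assumes mount points contain no whitespace (mountinfo escapes spaces as \040, which is not decoded here).

```shell
#!/bin/sh
# list_mounts_under PREFIX
# Reads mountinfo-formatted text on stdin and prints every mount point at or
# below PREFIX, deepest first. Field 5 of mountinfo is the mount point.
# Matching "PREFIX" or "PREFIX/..." avoids catching siblings such as PREFIX-old.
list_mounts_under() {
    awk -v p="${1%/}" '$5 == p || index($5, p "/") == 1 { print $5 }' |
        LC_ALL=C sort -r   # a child path always sorts after its parent, so -r puts children first
}

# unmount_tree PREFIX (hypothetical helper)
# Lazily unmounts each mount point on its own, so one ENOENT does not leave
# the rest of the tree mounted. Returns non-zero if any unmount failed.
unmount_tree() {
    list_mounts_under "$1" < /proc/self/mountinfo | (
        status=0
        while IFS= read -r mp; do
            umount -l "$mp" 2>/dev/null || {
                echo "warning: could not unmount $mp" >&2
                status=1
            }
        done
        exit "$status"     # exits only this subshell; becomes the pipeline status
    )
}
```

Under these assumptions, a call such as unmount_tree /run/nvidia/driver (as root, on the host or in a container sharing the host mount namespace) would attempt every mount in the tree and report the ones it could not remove, instead of stopping at the first orphaned entry.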

The new pod then inherits a partially-mounted, partially-stale /run/nvidia/driver tree on the host, which the kubelet binds back into the new pod as /lib/firmware (the daemonset has nv-firmware mounted from hostPath /run/nvidia/driver/lib/firmware). The bind ends up pointing at a "ghost" directory inside the leftover rbind, so mkdir /lib/firmware/nvidia inside the new pod returns ENOENT. The leftover firmware search path in /sys/module/firmware_class/parameters/path from the previous pod is the second symptom of the same incomplete teardown.

set -e is suspended inside _shutdown's if _unload_driver; then … block, so the failed umount does not cause a non-zero exit and there is no retry, fallback, or host-side cleanup.
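The shell rule behind this can be shown in isolation. The sketch below uses hypothetical stand-in functions (fake_umount, fake_shutdown), not the driver script's own code, and it assumes the shutdown function is invoked in a context where errexit is ignored, here the left side of an && list; whether the driver script calls _shutdown that way would need checking against the script. In such a context, a failing command inside the function neither aborts the script nor prevents the function from returning 0.

```shell
#!/bin/sh
# Hypothetical stand-ins; these are not the driver script's functions.
errexit_demo() (
    set -e
    fake_umount() { return 1; }      # stands in for the failing umount
    fake_shutdown() {
        if true; then                # stands in for a successful module unload
            fake_umount              # fails, but errexit is being ignored here
            echo "continued after failed unmount"
            return 0
        fi
        return 1
    }
    # Left side of &&: errexit is ignored for everything fake_shutdown runs.
    fake_shutdown && echo "shutdown reported success"
)
errexit_demo
```

In bash this prints both messages: the failed fake_umount is silently skipped over and the caller sees success. Checking the unmount's exit status explicitly, rather than relying on set -e, would be one way to surface the failure.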

Notably this reproduces with stock GPU Operator (no third-party composable-resource integration in the driver-pod path, no script modifications). The hot-detach/re-attach was performed via an external mechanism and the daemonset termination/recreation was triggered using the operator's own deploy label.

To Reproduce

On a worker node with a GPU attached and stock GPU Operator running healthy:

  1. Confirm baseline: kubectl -n gpu-operator get pod -l app=nvidia-driver-daemonset -o wide shows the driver pod Running, and kubectl -n gpu-operator logs <pod> ends with Done, now waiting for signal.
  2. Hot-detach the GPU from the worker. In our environment this was done via an external composable-disaggregation orchestrator.
  3. Force the driver daemonset pod to terminate by setting the deploy label on the node to false. This triggers exactly the same operator-driven SIGTERM that NFD-driven label removal would trigger, without waiting for NFD's poll interval:
    kubectl label node <worker> nvidia.com/gpu.deploy.driver=false --overwrite
    
  4. Watch the pod terminate: kubectl -n gpu-operator logs <pod> -f. Final lines are:
    Caught signal
    Stopping NVIDIA persistence daemon...
    Unloading NVIDIA driver kernel modules...
    Unmounting NVIDIA driver rootfs...
    umount: ???: umount failed: No such file or directory.
    
    The pod terminates without completing cleanup.
  5. Hot-re-attach the GPU (external orchestrator in our case).
  6. A new nvidia-driver-daemonset-* pod is created. Tail its logs — nvidia-installer fails as quoted above and the firmware search path is reported as already configured.

On the host, before step 5, the leftover state is directly observable:

mount | grep /run/nvidia/driver                  # rbind tree still present despite pod 1 termination
cat /sys/module/firmware_class/parameters/path   # still set to /run/nvidia/driver/lib/firmware
ls /run/nvidia/nvidia-driver.state /run/nvidia/nvidia-driver.pid 2>/dev/null

A manual host-side cleanup before step 5 makes the new pod start cleanly, confirming the host-state leak is the cause:

awk '$5 ~ "^/run/nvidia/driver"{print $5}' /proc/1/mountinfo | sort -r \
    | xargs -r -n1 umount -l -f 2>/dev/null
rm -rf /run/nvidia/driver
rm -f /run/nvidia/nvidia-driver.state /run/nvidia/nvidia-driver.pid
: > /sys/module/firmware_class/parameters/path

Expected behavior

After a GPU hot-detach + driver-pod termination + hot-re-attach + driver-pod recreation cycle, the new driver pod should install and load the driver successfully, the same way it does on a fresh boot. Specifically:

  • The terminating pod's _shutdown should leave the host's /run/nvidia/driver tree fully unmounted and removed, and should clear /sys/module/firmware_class/parameters/path, regardless of whether the underlying GPU is still present at the PCI level.
  • The new pod's mkdir /lib/firmware/nvidia (inside nvidia-installer) must succeed.
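The teardown expectation above could be checked on the host with something like the following sketch. check_teardown is a hypothetical helper, not part of GPU Operator; it takes the mountinfo file, the firmware search path file, and the run directory as arguments so it can be pointed at the real host paths or at test fixtures. It assumes paths without whitespace.

```shell
#!/bin/sh
# check_teardown MOUNTINFO_FILE FW_PATH_FILE RUN_DIR (hypothetical helper)
# Prints one line per leftover item and returns non-zero if any is found.
check_teardown() (
    mountinfo="$1"
    fw_path_file="$2"
    run_dir="${3%/}"
    rc=0

    # Any mount point at or below RUN_DIR/driver means the rbind tree leaked.
    if awk -v p="$run_dir/driver" \
        '$5 == p || index($5, p "/") == 1 { found = 1 } END { exit !found }' \
        "$mountinfo"; then
        echo "leftover mounts under $run_dir/driver"
        rc=1
    fi

    # The firmware search path should be empty after a clean shutdown.
    if [ -n "$(tr -d '[:space:]' < "$fw_path_file")" ]; then
        echo "firmware search path still set"
        rc=1
    fi

    # State and PID files named in this report should be gone.
    for f in nvidia-driver.state nvidia-driver.pid; do
        if [ -e "$run_dir/$f" ]; then
            echo "stale file $run_dir/$f"
            rc=1
        fi
    done

    exit "$rc"
)
```

On an affected node, it would be run after the driver pod terminates as check_teardown /proc/1/mountinfo /sys/module/firmware_class/parameters/path /run/nvidia.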

Environment (please provide the following information):

  • GPU Operator Version: v26.3.2 (driver image 580.126.20)
  • OS: Ubuntu 24.04.4 LTS
  • Kernel Version: 6.8.0-117-generic
  • Container Runtime Version: containerd containerd.io v2.2.3
  • Kubernetes Distro and Version: Kubernetes 1.36.1, CNI: flannel

Values used for gpu-operator deployment:

USER-SUPPLIED VALUES:
dcgmExporter:
  enabled: false
devicePlugin:
  enabled: false
driver:
  enabled: true
sandboxDevicePlugin:
  enabled: false
toolkit:
  enabled: true
  env:
  - name: CONTAINERD_SOCKET
    value: /var/run/containerd/containerd.sock
  - name: CONTAINERD_CONFIG
    value: /etc/containerd/config.toml
  - name: CONTAINERD_SET_AS_DEFAULT
    value: "true"

Labels: bug, needs-triage
