Support Red Hat NVIDIA Container Toolkit path #1350

Merged: michael-balint merged 1 commit into master from dholt/container-toolkit-rhel on May 28, 2026

@dholt commented May 28, 2026

Summary

  • add Red Hat-family support to the native nvidia_container_toolkit role
  • route RHEL/Rocky 8+ hosts through NVIDIA Container Toolkit instead of legacy nvidia.nvidia_docker
  • refresh RPM airgap guidance for EL9/Rocky-era CUDA and Container Toolkit repositories
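
The routing rule in the second bullet can be sketched in shell. This is only an illustration of the decision, not the PR's Ansible implementation: `pick_gpu_container_role` is a hypothetical helper, its two arguments are assumed to carry the values of the Ansible facts `ansible_os_family` and `ansible_distribution_major_version`, and keeping releases older than 8 on the legacy role is inferred from the "8+" wording rather than confirmed by the diff.

```shell
# Hypothetical helper, not part of DeepOps: prints the role a Red Hat-family
# host would be routed to. $1 = OS family (e.g. "RedHat"), $2 = major version.
pick_gpu_container_role() {
  # Other OS families are outside the scope of this sketch.
  if [ "$1" != "RedHat" ]; then
    echo "os family not covered by this sketch: $1" >&2
    return 1
  fi
  # Reject empty or non-numeric major versions before comparing.
  case "$2" in
    ''|*[!0-9]*)
      echo "major version must be a number: $2" >&2
      return 2
      ;;
  esac
  if [ "$2" -ge 8 ]; then
    echo "nvidia_container_toolkit"
  else
    # Assumption: releases older than 8 stay on the legacy role.
    echo "nvidia.nvidia_docker"
  fi
}
```

For a Rocky 9 host, `pick_gpu_container_role RedHat 9` selects `nvidia_container_toolkit`, which matches the role selection reported in the validation below.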

Validation

  • git diff --check
  • YAML parse of changed Ansible files
  • ansible-playbook -i <artifact inventory> playbooks/container/nvidia-docker.yml --syntax-check -e hostlist=all -e ansible_python_interpreter=/usr/bin/python3
  • ./scripts/deepops/ansible-lint-roles.sh
  • Live Rocky Linux 9.7 validation on Colossus ipp1-1946 (A100X x2):
    • Docker install passed after rerunning with --flush-cache; the first run exposed a stale Ansible fact cache from a reused inventory alias, fixed in private maintenance harness commit a66fb18
    • playbooks/nvidia-software/nvidia-driver.yml installed NVIDIA driver 580.159.04; because exact headers for the provisioned kernel were unavailable, the role updated the Rocky kernel to 5.14.0-687.10.1.el9_8.0.1 and rebooted, after which nvidia-smi passed
    • playbooks/container/nvidia-docker.yml selected nvidia_container_toolkit on Rocky 9.7 and skipped legacy nvidia.nvidia_docker
    • idempotence rerun: ok=16 changed=0 unreachable=0 failed=0 skipped=4
    • nvidia-container-cli --version: 1.19.1
    • /etc/docker/daemon.json sets default-runtime: nvidia
    • docker run --rm --gpus all nvcr.io/nvidia/cuda:13.0.2-base-ubuntu24.04 nvidia-smi saw both NVIDIA A100X GPUs with driver 580.159.04
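
Two of the live checks above (the idempotence recap and the daemon.json default runtime) could be scripted roughly as follows. Both function names are hypothetical and not part of DeepOps. The sed expression is a rough match that assumes the key and its value sit on one line; a JSON-aware tool would be more robust.

```shell
# Hypothetical helpers for re-checking a validated host; names are illustrative.

# Succeeds only if an Ansible PLAY RECAP line reports changed=0,
# unreachable=0 and failed=0. $1 = the recap line as a single string.
recap_is_idempotent() {
  for recap_field in changed unreachable failed; do
    printf '%s\n' "$1" |
      grep -Eq "(^|[[:space:]])${recap_field}=0([[:space:]]|\$)" || return 1
  done
}

# Prints the value of "default-runtime" from a Docker daemon.json-style file.
# $1 = path to the file. Prints nothing if the key is absent.
docker_default_runtime() {
  sed -n 's/.*"default-runtime"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1"
}

# Example usage on a target host (commands taken from the checks above):
#   recap_is_idempotent "ok=16 changed=0 unreachable=0 failed=0 skipped=4"
#   [ "$(docker_default_runtime /etc/docker/daemon.json)" = "nvidia" ]
#   docker run --rm --gpus all nvcr.io/nvidia/cuda:13.0.2-base-ubuntu24.04 nvidia-smi
```

The functions only inspect text, so they can be exercised against saved playbook output or a copy of daemon.json without a GPU host.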

Artifact directory: /home/dholt/deepops-release-artifacts/26.07/20260528-161023-container-toolkit-rhel-live-validation

@michael-balint merged commit b3f1510 into master on May 28, 2026 (31 checks passed)
@dholt deleted the dholt/container-toolkit-rhel branch on May 28, 2026 23:23