Commit 7dabd91

Improve GPU-aware section in the docs (#927)
1 parent 4a0bcc0 commit 7dabd91

5 files changed

Lines changed: 160 additions & 14 deletions

File tree

docs/examples/alltoall_test_cuda.jl
docs/examples/alltoall_test_cuda_multigpu.jl
docs/examples/alltoall_test_rocm.jl
docs/examples/alltoall_test_rocm_multigpu.jl
docs/src/usage.md

docs/examples/alltoall_test_cuda.jl

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
# This example confirms that your MPI implementation has CUDA support enabled.

using MPI
using CUDA

MPI.Init()

comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

size = MPI.Comm_size(comm)
dst = mod(rank+1, size)
src = mod(rank-1, size)
println("rank=$rank, size=$size, dst=$dst, src=$src")

N = 4

send_mesg = CuArray{Float64}(undef, N)
recv_mesg = CuArray{Float64}(undef, N)

fill!(send_mesg, Float64(rank))
CUDA.synchronize()

println("start sending...")
MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)
println("recv_mesg on proc $rank: $recv_mesg")
rank==0 && println("done.")
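The test above is launched like any other MPI.jl script, e.g. with the `mpiexecjl` wrapper described in the usage docs. A sketch of a typical invocation (the project path, rank count, and file name are placeholders to adjust for your setup):

```shell
# Sketch: run the CUDA-aware test on 4 ranks using MPI.jl's mpiexecjl wrapper
# (assumes the wrapper is installed and the example is saved as alltoall_test_cuda.jl)
mpiexecjl --project=/path/to/project -n 4 julia alltoall_test_cuda.jl
```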
docs/examples/alltoall_test_cuda_multigpu.jl

Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
# This example confirms that your CUDA-aware MPI implementation can use multiple Nvidia GPUs (one GPU per rank).

using MPI
using CUDA

MPI.Init()

comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# select the device (mainly relevant if there is >1 GPU per node),
# using the node-local communicator to retrieve the node-local rank
comm_l = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
rank_l = MPI.Comm_rank(comm_l)

# select device
gpu_id = CUDA.device!(rank_l)
# use the default device instead if the scheduler exposes a different GPU per rank (e.g. SLURM `--gpus-per-task=1`)
# gpu_id = CUDA.device!(0)

size = MPI.Comm_size(comm)
dst = mod(rank+1, size)
src = mod(rank-1, size)
println("rank=$rank rank_loc=$rank_l (gpu_id=$gpu_id), size=$size, dst=$dst, src=$src")

N = 4

send_mesg = CuArray{Float64}(undef, N)
recv_mesg = CuArray{Float64}(undef, N)

fill!(send_mesg, Float64(rank))
CUDA.synchronize()

rank==0 && println("start sending...")
MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)
println("recv_mesg on proc $rank_l: $recv_mesg")
rank==0 && println("done.")
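When the scheduler already exposes one GPU per rank, the commented-out default-device path above applies and no node-local communicator is needed. A hypothetical SLURM launch for that case (the flags shown are assumptions; partitions, accounts, and GPU options are site-specific):

```shell
# Sketch: launch 4 ranks with one GPU per task under SLURM, so each rank
# only sees its own device and can simply use the default device
srun -n 4 --gpus-per-task=1 julia --project alltoall_test_cuda_multigpu.jl
```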
docs/examples/alltoall_test_rocm.jl

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
# This example confirms that your MPI implementation has ROCm support enabled.

using MPI
using AMDGPU

MPI.Init()

comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

size = MPI.Comm_size(comm)
dst = mod(rank+1, size)
src = mod(rank-1, size)
println("rank=$rank, size=$size, dst=$dst, src=$src")

N = 4

send_mesg = ROCArray{Float64}(undef, N)
recv_mesg = ROCArray{Float64}(undef, N)

fill!(send_mesg, Float64(rank))
AMDGPU.synchronize()

println("start sending...")
MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)
println("recv_mesg on proc $rank: $recv_mesg")
rank==0 && println("done.")
docs/examples/alltoall_test_rocm_multigpu.jl

Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
# This example confirms that your ROCm-aware MPI implementation can use multiple AMD GPUs (one GPU per rank).

using MPI
using AMDGPU

MPI.Init()

comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# select the device (mainly relevant if there is >1 GPU per node),
# using the node-local communicator to retrieve the node-local rank
comm_l = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
rank_l = MPI.Comm_rank(comm_l)

# select device
device = AMDGPU.device_id!(rank_l+1)
# use the default device instead if the scheduler exposes a different GPU per rank (e.g. SLURM `--gpus-per-task=1`)
# device = AMDGPU.device_id!(1)
gpu_id = AMDGPU.device_id(AMDGPU.device())

size = MPI.Comm_size(comm)
dst = mod(rank+1, size)
src = mod(rank-1, size)
println("rank=$rank rank_loc=$rank_l (gpu_id=$gpu_id - $device), size=$size, dst=$dst, src=$src")

N = 4

send_mesg = ROCArray{Float64}(undef, N)
recv_mesg = ROCArray{Float64}(undef, N)

fill!(send_mesg, Float64(rank))
AMDGPU.synchronize()

rank==0 && println("start sending...")
MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)
println("recv_mesg on proc $rank: $recv_mesg")
rank==0 && println("done.")

docs/src/usage.md

Lines changed: 30 additions & 14 deletions
@@ -74,33 +74,49 @@ with:
 $ mpiexecjl --project=/path/to/project -n 20 julia script.jl
 ```
 
-## CUDA-aware MPI support
+## GPU-aware MPI support
 
-If your MPI implementation has been compiled with CUDA support, then `CUDA.CuArray`s (from the
-[CUDA.jl](https://github.com/JuliaGPU/CUDA.jl) package) can be passed directly as
-send and receive buffers for point-to-point and collective operations (they may also work with one-sided operations, but these are not often supported).
+If your MPI implementation has been compiled with CUDA or ROCm support, then `CUDA.CuArray`s (from
+[CUDA.jl](https://github.com/JuliaGPU/CUDA.jl)) or `AMDGPU.ROCArray`s (from [AMDGPU.jl](https://github.com/JuliaGPU/AMDGPU.jl)) can be passed directly as
+send and receive buffers for point-to-point and collective operations (they may also work with one-sided operations, but these are not often supported). In most cases, GPU-aware MPI requires a [system-provided MPI installation](@ref using_system_mpi).
 
-Successfully running the [alltoall\_test\_cuda.jl](https://gist.github.com/luraess/0063e90cb08eb2208b7fe204bbd90ed2)
-should confirm that your MPI implementation has CUDA support enabled. Moreover, successfully running the
-[alltoall\_test\_cuda\_multigpu.jl](https://gist.github.com/luraess/ed93cc09ba04fe16f63b4219c1811566) should confirm that
+!!! note "Preloads"
+    On Cray machines, you may need to set the following preloads in the preferences:
+    ```
+    preloads = ["libmpi_gtl_hsa.so"]
+    preloads_env_switch = "MPICH_GPU_SUPPORT_ENABLED"
+    ```
+
+### CUDA
+
+Successfully running the [alltoall\_test\_cuda.jl](../examples/alltoall_test_cuda.jl) example
+should confirm that your MPI implementation has CUDA support enabled. Moreover, successfully running the
+[alltoall\_test\_cuda\_multigpu.jl](../examples/alltoall_test_cuda_multigpu.jl) example should confirm that
 your CUDA-aware MPI implementation can use multiple Nvidia GPUs (one GPU per rank).
 
 If using OpenMPI, the status of CUDA support can be checked via the
 [`MPI.has_cuda()`](@ref) function.
 
-## ROCm-aware MPI support
+### ROCm
 
-If your MPI implementation has been compiled with ROCm support (AMDGPU), then `AMDGPU.ROCArray`s (from the
-[AMDGPU.jl](https://github.com/JuliaGPU/AMDGPU.jl) package) can be passed directly as send and receive buffers for point-to-point and collective operations (they may also work with one-sided operations, but these are not often supported).
-
-Successfully running the [alltoall\_test\_rocm.jl](https://gist.github.com/luraess/c228ec08629737888a18c6a1e397643c)
-should confirm that your MPI implementation has ROCm (AMDGPU) support enabled. Moreover, successfully running the
-[alltoall\_test\_rocm\_multigpu.jl](https://gist.github.com/luraess/a47931d7fb668bd4348a2c730d5489f4) should confirm that
+Successfully running the [alltoall\_test\_rocm.jl](../examples/alltoall_test_rocm.jl) example
+should confirm that your MPI implementation has ROCm (AMDGPU) support enabled. Moreover, successfully running the
+[alltoall\_test\_rocm\_multigpu.jl](../examples/alltoall_test_rocm_multigpu.jl) example should confirm that
 your ROCm-aware MPI implementation can use multiple AMD GPUs (one GPU per rank).
 
 If using OpenMPI, the status of ROCm support can be checked via the
 [`MPI.has_rocm()`](@ref) function.
 
+### Multiple GPUs per node
+
+In a configuration with multiple GPUs per node, mapping a GPU to each node-local MPI rank can be achieved either (1) on the application side, using a node-local communicator (`MPI.COMM_TYPE_SHARED`), or (2) on the system side, by setting device visibility accordingly.
+For (1), the node-local rank `rank_loc` can be used to select the GPU device:
+```
+comm_loc = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
+rank_loc = MPI.Comm_rank(comm_loc)
+```
+For (2), one can use the default device, making sure to handle device visibility in the scheduler or via the `CUDA_VISIBLE_DEVICES`/`ROCR_VISIBLE_DEVICES` environment variables.
+
 ## Writing MPI tests
 
 It is recommended to use the `mpiexec()` wrapper when writing your package tests in `runtests.jl`: