Commit 7dabd91

Improve GPU-aware section in the docs (#927)
1 parent 4a0bcc0 commit 7dabd91

5 files changed

Lines changed: 160 additions & 14 deletions

File tree

docs/examples/alltoall_test_cuda.jl
docs/examples/alltoall_test_cuda_multigpu.jl
docs/examples/alltoall_test_rocm.jl
docs/examples/alltoall_test_rocm_multigpu.jl
docs/src/usage.md

docs/examples/alltoall_test_cuda.jl

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
# This example confirms that your MPI implementation has CUDA support enabled.

using MPI
using CUDA

MPI.Init()

comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

size = MPI.Comm_size(comm)
dst = mod(rank+1, size)
src = mod(rank-1, size)
println("rank=$rank, size=$size, dst=$dst, src=$src")

N = 4

send_mesg = CuArray{Float64}(undef, N)
recv_mesg = CuArray{Float64}(undef, N)

fill!(send_mesg, Float64(rank))
CUDA.synchronize()

println("start sending...")
MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)
println("recv_mesg on proc $rank: $recv_mesg")
rank==0 && println("done.")
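The test above is launched like any other MPI.jl script, e.g. with the `mpiexecjl` wrapper described in the usage docs. A sketch of a typical invocation (the project path, rank count, and file name are placeholders to adjust for your setup):

```shell
# Sketch: run the CUDA-aware test on 4 ranks using MPI.jl's mpiexecjl wrapper
# (assumes the wrapper is installed and the example is saved as alltoall_test_cuda.jl)
mpiexecjl --project=/path/to/project -n 4 julia alltoall_test_cuda.jl
```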
docs/examples/alltoall_test_cuda_multigpu.jl

Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
# This example confirms that your CUDA-aware MPI implementation can use multiple Nvidia GPUs (one GPU per rank).

using MPI
using CUDA

MPI.Init()

comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# select the device (mainly relevant if there is >1 GPU per node),
# using the node-local communicator to retrieve the node-local rank
comm_l = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
rank_l = MPI.Comm_rank(comm_l)

# select device
gpu_id = CUDA.device!(rank_l)
# use the default device instead if the scheduler exposes a different GPU per rank (e.g. SLURM `--gpus-per-task=1`)
# gpu_id = CUDA.device!(0)

size = MPI.Comm_size(comm)
dst = mod(rank+1, size)
src = mod(rank-1, size)
println("rank=$rank rank_loc=$rank_l (gpu_id=$gpu_id), size=$size, dst=$dst, src=$src")

N = 4

send_mesg = CuArray{Float64}(undef, N)
recv_mesg = CuArray{Float64}(undef, N)

fill!(send_mesg, Float64(rank))
CUDA.synchronize()

rank==0 && println("start sending...")
MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)
println("recv_mesg on proc $rank_l: $recv_mesg")
rank==0 && println("done.")
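When the scheduler already exposes one GPU per rank, the commented-out default-device path above applies and no node-local communicator is needed. A hypothetical SLURM launch for that case (the flags shown are assumptions; partitions, accounts, and GPU options are site-specific):

```shell
# Sketch: launch 4 ranks with one GPU per task under SLURM, so each rank
# only sees its own device and can simply use the default device
srun -n 4 --gpus-per-task=1 julia --project alltoall_test_cuda_multigpu.jl
```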
docs/examples/alltoall_test_rocm.jl

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
# This example confirms that your MPI implementation has ROCm support enabled.

using MPI
using AMDGPU

MPI.Init()

comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

size = MPI.Comm_size(comm)
dst = mod(rank+1, size)
src = mod(rank-1, size)
println("rank=$rank, size=$size, dst=$dst, src=$src")

N = 4

send_mesg = ROCArray{Float64}(undef, N)
recv_mesg = ROCArray{Float64}(undef, N)

fill!(send_mesg, Float64(rank))
AMDGPU.synchronize()

println("start sending...")
MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)
println("recv_mesg on proc $rank: $recv_mesg")
rank==0 && println("done.")
docs/examples/alltoall_test_rocm_multigpu.jl

Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
# This example confirms that your ROCm-aware MPI implementation can use multiple AMD GPUs (one GPU per rank).

using MPI
using AMDGPU

MPI.Init()

comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# select the device (mainly relevant if there is >1 GPU per node),
# using the node-local communicator to retrieve the node-local rank
comm_l = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
rank_l = MPI.Comm_rank(comm_l)

# select device
device = AMDGPU.device_id!(rank_l+1)
# use the default device instead if the scheduler exposes a different GPU per rank (e.g. SLURM `--gpus-per-task=1`)
# device = AMDGPU.device_id!(1)
gpu_id = AMDGPU.device_id(AMDGPU.device())

size = MPI.Comm_size(comm)
dst = mod(rank+1, size)
src = mod(rank-1, size)
println("rank=$rank rank_loc=$rank_l (gpu_id=$gpu_id - $device), size=$size, dst=$dst, src=$src")

N = 4

send_mesg = ROCArray{Float64}(undef, N)
recv_mesg = ROCArray{Float64}(undef, N)

fill!(send_mesg, Float64(rank))
AMDGPU.synchronize()

rank==0 && println("start sending...")
MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)
println("recv_mesg on proc $rank: $recv_mesg")
rank==0 && println("done.")

docs/src/usage.md

Lines changed: 30 additions & 14 deletions
@@ -74,33 +74,49 @@ with:
 $ mpiexecjl --project=/path/to/project -n 20 julia script.jl
 ```
 
-## CUDA-aware MPI support
+## GPU-aware MPI support
 
-If your MPI implementation has been compiled with CUDA support, then `CUDA.CuArray`s (from the
-[CUDA.jl](https://github.com/JuliaGPU/CUDA.jl) package) can be passed directly as
-send and receive buffers for point-to-point and collective operations (they may also work with one-sided operations, but these are not often supported).
+If your MPI implementation has been compiled with CUDA or ROCm support, then `CUDA.CuArray`s (from
+[CUDA.jl](https://github.com/JuliaGPU/CUDA.jl)) or `AMDGPU.ROCArray`s (from [AMDGPU.jl](https://github.com/JuliaGPU/AMDGPU.jl)) can be passed directly as
+send and receive buffers for point-to-point and collective operations (they may also work with one-sided operations, but these are not often supported). In most cases, GPU-aware MPI requires a [system-provided MPI installation](@ref using_system_mpi).
 
-Successfully running the [alltoall\_test\_cuda.jl](https://gist.github.com/luraess/0063e90cb08eb2208b7fe204bbd90ed2)
-should confirm that your MPI implementation has CUDA support enabled. Moreover, successfully running the
-[alltoall\_test\_cuda\_multigpu.jl](https://gist.github.com/luraess/ed93cc09ba04fe16f63b4219c1811566) should confirm that
+!!! note "Preloads"
+    On Cray machines, you may need to set the following preloads in the preferences:
+    ```
+    preloads = ["libmpi_gtl_hsa.so"]
+    preloads_env_switch = "MPICH_GPU_SUPPORT_ENABLED"
+    ```
+
+### CUDA
+
+Successfully running the [alltoall\_test\_cuda.jl](../examples/alltoall_test_cuda.jl) example
+should confirm that your MPI implementation has CUDA support enabled. Moreover, successfully running the
+[alltoall\_test\_cuda\_multigpu.jl](../examples/alltoall_test_cuda_multigpu.jl) example should confirm that
 your CUDA-aware MPI implementation can use multiple Nvidia GPUs (one GPU per rank).
 
 If using OpenMPI, the status of CUDA support can be checked via the
 [`MPI.has_cuda()`](@ref) function.
 
-## ROCm-aware MPI support
+### ROCm
 
-If your MPI implementation has been compiled with ROCm support (AMDGPU), then `AMDGPU.ROCArray`s (from the
-[AMDGPU.jl](https://github.com/JuliaGPU/AMDGPU.jl) package) can be passed directly as send and receive buffers for point-to-point and collective operations (they may also work with one-sided operations, but these are not often supported).
-
-Successfully running the [alltoall\_test\_rocm.jl](https://gist.github.com/luraess/c228ec08629737888a18c6a1e397643c)
-should confirm that your MPI implementation has ROCm (AMDGPU) support enabled. Moreover, successfully running the
-[alltoall\_test\_rocm\_multigpu.jl](https://gist.github.com/luraess/a47931d7fb668bd4348a2c730d5489f4) should confirm that
+Successfully running the [alltoall\_test\_rocm.jl](../examples/alltoall_test_rocm.jl) example
+should confirm that your MPI implementation has ROCm (AMDGPU) support enabled. Moreover, successfully running the
+[alltoall\_test\_rocm\_multigpu.jl](../examples/alltoall_test_rocm_multigpu.jl) example should confirm that
 your ROCm-aware MPI implementation can use multiple AMD GPUs (one GPU per rank).
 
 If using OpenMPI, the status of ROCm support can be checked via the
 [`MPI.has_rocm()`](@ref) function.
 
+### Multiple GPUs per node
+
+In a configuration with multiple GPUs per node, mapping a GPU to each node-local MPI rank can be achieved either (1) on the application side, using a node-local communicator (`MPI.COMM_TYPE_SHARED`), or (2) on the system side, by setting device visibility accordingly.
+For (1), the node-local rank `rank_loc` can be used to select the GPU device:
+```
+comm_loc = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
+rank_loc = MPI.Comm_rank(comm_loc)
+```
+For (2), one can use the default device, making sure to handle device visibility in the scheduler or via the `CUDA_VISIBLE_DEVICES`/`ROCR_VISIBLE_DEVICES` environment variables.
+
 ## Writing MPI tests
 
 It is recommended to use the `mpiexec()` wrapper when writing your package tests in `runtests.jl`: