
Commit 622597c: Resolve merge conflict
2 parents b274a5e + 05c7f88

7 files changed

Lines changed: 172 additions & 23 deletions


.github/workflows/UnitTests.yml

Lines changed: 3 additions & 8 deletions
@@ -2,8 +2,9 @@ name: Unit Tests
 
 on:
   pull_request:
-    paths:
+    paths: &paths
       - '.github/workflows/UnitTests.yml'
+      - 'Project.toml'
       - 'bin/**'
       - 'deps/**'
       - 'src/**'
@@ -13,13 +14,7 @@ on:
     branches:
       - master
       - release-*
-    paths:
-      - '.github/workflows/UnitTests.yml'
-      - 'bin/**'
-      - 'deps/**'
-      - 'src/**'
-      - 'test/**'
-      - 'lib/**'
+    paths: *paths
 
 concurrency:
   # Skip intermediate builds: all builds except for builds on the `master` branch

Project.toml

Lines changed: 1 addition & 1 deletion
@@ -34,7 +34,7 @@ Distributed = "1"
 DocStringExtensions = "0.8, 0.9"
 Libdl = "1"
 MPIABI_jll = "0.1.1"
-MPICH_jll = "4"
+MPICH_jll = "4, 5"
 MPIPreferences = "0.1.8"
 MPItrampoline_jll = "5"
 OpenMPI_jll = "4, 5"
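The compat entry above now admits MPICH_jll 5 as well as 4. To see which MPI binary, ABI, and library version MPI.jl actually picks up in a given environment, `MPI.versioninfo()` prints a summary; a minimal sketch (the output depends on the configured binary):

```julia
# Sketch: print which MPI binary, ABI, and library version MPI.jl is configured
# to use (useful after compat changes such as the MPICH_jll bound above).
using MPI
MPI.versioninfo()
```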
docs/examples/alltoall_test_cuda.jl

Lines changed: 27 additions & 0 deletions

@@ -0,0 +1,27 @@
+# This example demonstrates that your MPI implementation has CUDA support enabled.
+
+using MPI
+using CUDA
+
+MPI.Init()
+
+comm = MPI.COMM_WORLD
+rank = MPI.Comm_rank(comm)
+
+size = MPI.Comm_size(comm)
+dst = mod(rank+1, size)
+src = mod(rank-1, size)
+println("rank=$rank, size=$size, dst=$dst, src=$src")
+
+N = 4
+
+send_mesg = CuArray{Float64}(undef, N)
+recv_mesg = CuArray{Float64}(undef, N)
+
+fill!(send_mesg, Float64(rank))
+CUDA.synchronize()
+
+println("start sending...")
+MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)
+println("recv_mesg on proc $rank: $recv_mesg")
+rank==0 && println("done.")
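The updated usage docs note that, with OpenMPI, CUDA awareness can be queried via `MPI.has_cuda()`. A minimal pre-flight sketch along those lines (assuming an MPI build that exposes this query):

```julia
# Sketch: query whether the loaded MPI library reports CUDA support.
# `MPI.has_cuda()` is documented for OpenMPI; other implementations may not expose this query.
using MPI
MPI.Init()
if MPI.has_cuda()
    println("CUDA-aware MPI detected")
else
    println("MPI does not report CUDA support; alltoall_test_cuda.jl may fail")
end
MPI.Finalize()
```

If the check passes but the example still fails on a Cray system, the GTL preloads mentioned in the updated usage docs may be missing.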
docs/examples/alltoall_test_cuda_multigpu.jl

Lines changed: 38 additions & 0 deletions

@@ -0,0 +1,38 @@
+# This example demonstrates that your CUDA-aware MPI implementation can use multiple Nvidia GPUs (one GPU per rank).
+
+using MPI
+using CUDA
+
+MPI.Init()
+
+comm = MPI.COMM_WORLD
+rank = MPI.Comm_rank(comm)
+
+# select device (specifically relevant if >1 GPU per node)
+# using the node-local communicator to retrieve the node-local rank
+comm_l = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
+rank_l = MPI.Comm_rank(comm_l)
+
+# select device
+gpu_id = CUDA.device!(rank_l)
+# use the default device if the scheduler exposes a different GPU per rank (e.g. SLURM `--gpus-per-task=1`)
+# gpu_id = CUDA.device!(0)
+
+size = MPI.Comm_size(comm)
+dst = mod(rank+1, size)
+src = mod(rank-1, size)
+println("rank=$rank rank_loc=$rank_l (gpu_id=$gpu_id), size=$size, dst=$dst, src=$src")
+
+N = 4
+
+send_mesg = CuArray{Float64}(undef, N)
+recv_mesg = CuArray{Float64}(undef, N)
+
+fill!(send_mesg, Float64(rank))
+CUDA.synchronize()
+
+rank==0 && println("start sending...")
+MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)
+println("recv_mesg on proc $rank_l: $recv_mesg")
+rank==0 && println("done.")
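The example maps one GPU per node-local rank via `CUDA.device!(rank_l)`, which assumes at least as many visible GPUs as ranks on each node. A hypothetical variant (not part of the file above) that wraps the node-local rank around the visible device count:

```julia
# Hypothetical variant of the device selection above: wrap the node-local rank
# around the number of visible devices so oversubscribed nodes still map to a valid GPU.
using MPI
using CUDA

MPI.Init()
comm   = MPI.COMM_WORLD
rank   = MPI.Comm_rank(comm)
comm_l = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
rank_l = MPI.Comm_rank(comm_l)

ngpus = length(CUDA.devices())     # visible GPUs on this node
CUDA.device!(mod(rank_l, ngpus))   # CUDA.device! takes a 0-based index here
println("rank=$rank rank_loc=$rank_l uses $(CUDA.device()) of $ngpus visible GPUs")
```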
docs/examples/alltoall_test_rocm.jl

Lines changed: 27 additions & 0 deletions

@@ -0,0 +1,27 @@
+# This example demonstrates that your MPI implementation has ROCm support enabled.
+
+using MPI
+using AMDGPU
+
+MPI.Init()
+
+comm = MPI.COMM_WORLD
+rank = MPI.Comm_rank(comm)
+
+size = MPI.Comm_size(comm)
+dst = mod(rank+1, size)
+src = mod(rank-1, size)
+println("rank=$rank, size=$size, dst=$dst, src=$src")
+
+N = 4
+
+send_mesg = ROCArray{Float64}(undef, N)
+recv_mesg = ROCArray{Float64}(undef, N)
+
+fill!(send_mesg, Float64(rank))
+AMDGPU.synchronize()
+
+println("start sending...")
+MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)
+println("recv_mesg on proc $rank: $recv_mesg")
+rank==0 && println("done.")
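Analogously to the CUDA case, the usage docs point to `MPI.has_rocm()` (with OpenMPI) for checking ROCm awareness; a minimal sketch, assuming the query is available:

```julia
# Sketch: query whether the loaded MPI library reports ROCm support.
# `MPI.has_rocm()` is documented for OpenMPI; other implementations may not expose this query.
using MPI
MPI.Init()
if MPI.has_rocm()
    println("ROCm-aware MPI detected")
else
    println("MPI does not report ROCm support; alltoall_test_rocm.jl may fail")
end
MPI.Finalize()
```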
docs/examples/alltoall_test_rocm_multigpu.jl

Lines changed: 38 additions & 0 deletions

@@ -0,0 +1,38 @@
+# This example demonstrates that your ROCm-aware MPI implementation can use multiple AMD GPUs (one GPU per rank).
+
+using MPI
+using AMDGPU
+
+MPI.Init()
+
+comm = MPI.COMM_WORLD
+rank = MPI.Comm_rank(comm)
+
+# select device (specifically relevant if >1 GPU per node)
+# using the node-local communicator to retrieve the node-local rank
+comm_l = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
+rank_l = MPI.Comm_rank(comm_l)
+
+# select device
+device = AMDGPU.device_id!(rank_l+1)
+# use the default device if the scheduler exposes a different GPU per rank (e.g. SLURM `--gpus-per-task=1`)
+# device = AMDGPU.device_id!(1)
+gpu_id = AMDGPU.device_id(AMDGPU.device())
+
+size = MPI.Comm_size(comm)
+dst = mod(rank+1, size)
+src = mod(rank-1, size)
+println("rank=$rank rank_loc=$rank_l (gpu_id=$gpu_id - $device), size=$size, dst=$dst, src=$src")
+
+N = 4
+
+send_mesg = ROCArray{Float64}(undef, N)
+recv_mesg = ROCArray{Float64}(undef, N)
+
+fill!(send_mesg, Float64(rank))
+AMDGPU.synchronize()
+
+rank==0 && println("start sending...")
+MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)
+println("recv_mesg on proc $rank: $recv_mesg")
+rank==0 && println("done.")
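As in the CUDA variant, device selection here assumes at most one rank per visible GPU; a hypothetical wrap-around variant (`AMDGPU.device_id!` uses 1-based ids) could look like:

```julia
# Hypothetical variant of the device selection above: wrap the node-local rank
# around the number of visible devices; AMDGPU device ids are 1-based.
using MPI
using AMDGPU

MPI.Init()
comm   = MPI.COMM_WORLD
rank   = MPI.Comm_rank(comm)
comm_l = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
rank_l = MPI.Comm_rank(comm_l)

ngpus = length(AMDGPU.devices())            # visible GPUs on this node
AMDGPU.device_id!(mod(rank_l, ngpus) + 1)   # 1-based device id
println("rank=$rank rank_loc=$rank_l uses device id $(AMDGPU.device_id(AMDGPU.device())) of $ngpus")
```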

docs/src/usage.md

Lines changed: 38 additions & 14 deletions
@@ -74,33 +74,57 @@ with:
 $ mpiexecjl --project=/path/to/project -n 20 julia script.jl
 ```
 
-## CUDA-aware MPI support
+!!! note "Juliaup"
+    If you use `juliaup` as a manager for your `julia` installation and you want to use a non-default channel with `mpiexecjl`,
+    you need to use the environment variable `JULIAUP_CHANNEL` instead of the `+` syntax.
 
-If your MPI implementation has been compiled with CUDA support, then `CUDA.CuArray`s (from the
-[CUDA.jl](https://github.com/JuliaGPU/CUDA.jl) package) can be passed directly as
-send and receive buffers for point-to-point and collective operations (they may also work with one-sided operations, but these are not often supported).
+    ```shell
+    JULIAUP_CHANNEL=1.12 mpiexecjl --project=/path/to/project -n 20 julia script.jl
+    ```
 
-Successfully running the [alltoall\_test\_cuda.jl](https://gist.github.com/luraess/0063e90cb08eb2208b7fe204bbd90ed2)
-should confirm your MPI implementation to have the CUDA support enabled. Moreover, successfully running the
-[alltoall\_test\_cuda\_multigpu.jl](https://gist.github.com/luraess/ed93cc09ba04fe16f63b4219c1811566) should confirm
+## GPU-aware MPI support
+
+If your MPI implementation has been compiled with CUDA or ROCm support, then `CUDA.CuArray`s (from
+[CUDA.jl](https://github.com/JuliaGPU/CUDA.jl)) or `AMDGPU.ROCArray`s (from [AMDGPU.jl](https://github.com/JuliaGPU/AMDGPU.jl)) can be passed directly as
+send and receive buffers for point-to-point and collective operations (they may also work with one-sided operations, but these are not often supported). GPU-aware MPI in most cases requires a [system provided MPI installation](@ref using_system_mpi).
+
+!!! note "Preloads"
+    On Cray machines, you may need to ensure that the following preloads are set in the preferences:
+    ```
+    preloads = ["libmpi_gtl_hsa.so"]
+    preloads_env_switch = "MPICH_GPU_SUPPORT_ENABLED"
+    ```
+
+### CUDA
+
+Successfully running the [alltoall\_test\_cuda.jl](../examples/alltoall_test_cuda.jl)
+should confirm that your MPI implementation has CUDA support enabled. Moreover, successfully running the
+[alltoall\_test\_cuda\_multigpu.jl](../examples/alltoall_test_cuda_multigpu.jl) should confirm
 your CUDA-aware MPI implementation to use multiple Nvidia GPUs (one GPU per rank).
 
 If using OpenMPI, the status of CUDA support can be checked via the
 [`MPI.has_cuda()`](@ref) function.
 
-## ROCm-aware MPI support
-
-If your MPI implementation has been compiled with ROCm support (AMDGPU), then `AMDGPU.ROCArray`s (from the
-[AMDGPU.jl](https://github.com/JuliaGPU/AMDGPU.jl) package) can be passed directly as send and receive buffers for point-to-point and collective operations (they may also work with one-sided operations, but these are not often supported).
+### ROCm
 
-Successfully running the [alltoall\_test\_rocm.jl](https://gist.github.com/luraess/c228ec08629737888a18c6a1e397643c)
-should confirm your MPI implementation to have the ROCm support (AMDGPU) enabled. Moreover, successfully running the
-[alltoall\_test\_rocm\_multigpu.jl](https://gist.github.com/luraess/a47931d7fb668bd4348a2c730d5489f4) should confirm
+Successfully running the [alltoall\_test\_rocm.jl](../examples/alltoall_test_rocm.jl)
+should confirm that your MPI implementation has ROCm support (AMDGPU) enabled. Moreover, successfully running the
+[alltoall\_test\_rocm\_multigpu.jl](../examples/alltoall_test_rocm_multigpu.jl) should confirm
 your ROCm-aware MPI implementation to use multiple AMD GPUs (one GPU per rank).
 
 If using OpenMPI, the status of ROCm support can be checked via the
 [`MPI.has_rocm()`](@ref) function.
 
+### Multiple GPUs per node
+
+In a configuration with multiple GPUs per node, mapping the GPU ID to the node-local MPI rank can be achieved either (1) on the application side, using a node-local communicator (`MPI.COMM_TYPE_SHARED`), or (2) on the system side, by setting device visibility accordingly.
+For (1), using the node-local rank `rank_loc` is a way to select the GPU device:
+```
+comm_loc = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)
+rank_loc = MPI.Comm_rank(comm_loc)
+```
+For (2), one can use the default device, but make sure to handle device visibility in the scheduler or via `CUDA/ROCM_VISIBLE_DEVICES`.
+
 ## Writing MPI tests
 
 It is recommended to use the `mpiexec()` wrapper when writing your package tests in `runtests.jl`:
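A minimal sketch of such a `runtests.jl` (the script path and process count below are placeholders, not taken from this commit):

```julia
# Sketch of a runtests.jl that launches an MPI test script through the mpiexec() wrapper.
# The script path and process count are placeholders.
using Test
using MPI

nprocs = 2
testscript = joinpath(@__DIR__, "mpi_script.jl")   # hypothetical MPI test script

@testset "MPI script" begin
    run(`$(mpiexec()) -n $nprocs $(Base.julia_cmd()) --project $testscript`)
    @test true
end
```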
