
Commit 7339b90

use WORLD_SIZE instead of device_count; this supports both the case where the number of GPUs we train on is smaller than the number of GPUs available, and also multinode training, where device_count() only counts local GPUs. may be a bugfix
1 parent f08abb4 commit 7339b90

1 file changed: train.py (4 additions & 2 deletions)
@@ -89,8 +89,10 @@
     torch.cuda.set_device(device)
     master_process = ddp_rank == 0 # this process will do logging, checkpointing etc.
     seed_offset = ddp_rank # each process gets a different seed
-    assert gradient_accumulation_steps % torch.cuda.device_count() == 0
-    gradient_accumulation_steps //= torch.cuda.device_count()
+    # world_size number of processes will be training simultaneously, so we can scale
+    # down the desired gradient accumulation iterations per process proportionally
+    assert gradient_accumulation_steps % ddp_world_size == 0
+    gradient_accumulation_steps //= ddp_world_size
 else:
     # if not ddp, we are running on a single gpu, and one process
     master_process = True
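
For context, ddp_world_size is read from the WORLD_SIZE environment variable that torchrun sets for every spawned process: it counts training processes across all nodes, whereas torch.cuda.device_count() only counts GPUs visible on the local machine. Below is a minimal sketch of the setup this hunk sits in, assuming the standard torchrun environment variables; variable names follow train.py, the gradient_accumulation_steps value is illustrative, and the exact surrounding code at this commit may differ.

import os
import torch
from torch.distributed import init_process_group

gradient_accumulation_steps = 40  # illustrative: desired total micro-steps per optimizer step

# torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for each process it launches
ddp = int(os.environ.get('RANK', -1)) != -1  # is this a distributed run?
if ddp:
    init_process_group(backend='nccl')
    ddp_rank = int(os.environ['RANK'])              # global rank across all nodes
    ddp_local_rank = int(os.environ['LOCAL_RANK'])  # rank within this node
    ddp_world_size = int(os.environ['WORLD_SIZE'])  # total number of training processes
    device = f'cuda:{ddp_local_rank}'
    torch.cuda.set_device(device)
    master_process = ddp_rank == 0  # this process will do logging, checkpointing etc.
    seed_offset = ddp_rank  # each process gets a different seed
    # ddp_world_size processes train simultaneously, so scale down the
    # gradient accumulation iterations per process proportionally
    assert gradient_accumulation_steps % ddp_world_size == 0
    gradient_accumulation_steps //= ddp_world_size
else:
    # if not ddp, we are running on a single gpu, and one process
    master_process = True
    seed_offset = 0
    ddp_world_size = 1
    device = 'cuda'

Launched as, say, torchrun --nproc_per_node=2 train.py on an 8-GPU machine, WORLD_SIZE is 2 even though torch.cuda.device_count() reports 8, so each process accumulates 40 // 2 = 20 micro-steps and the effective global batch size stays as configured; the same reasoning holds for multinode runs, where device_count() can only ever see the local node's GPUs.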
