DeepSpeed
b891ec2a - engine.py - save_checkpoint: only rank-0 should create the save dir (#4536)

Commit
2 years ago
engine.py - save_checkpoint: only rank-0 should create the save dir (#4536)

* engine.py - save_checkpoint: only rank-0 should create the save dir

  On some NFS setups, creating the save dir from every rank may introduce a race and a deadlock between the ranks. We found that limiting the creation to rank 0 prevents this, while the barrier that follows ensures the other ranks do not proceed before the dir is created.

* save_checkpoint: in case of local storage, only local rank 0 should mkdir

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
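
The pattern this commit message describes (one designated rank creates the directory, then a barrier holds everyone until it exists) can be sketched as follows. This is a minimal illustration and not the actual engine.py code: the names `should_create_dir`, `prepare_save_dir`, `shared_fs`, and `run_simulation` are hypothetical, and the ranks are simulated with threads and a `threading.Barrier` so the example runs without a distributed launcher.

```python
import os
import tempfile
import threading


def should_create_dir(rank, local_rank, shared_fs):
    """Decide which rank creates the save dir (hypothetical helper).

    Shared storage (e.g. NFS): only global rank 0 creates it.
    Local storage: local rank 0 on each node creates its own copy.
    """
    return rank == 0 if shared_fs else local_rank == 0


def prepare_save_dir(save_dir, rank, local_rank, shared_fs, barrier):
    """Create save_dir on the designated rank, then synchronize all ranks."""
    if should_create_dir(rank, local_rank, shared_fs):
        os.makedirs(save_dir, exist_ok=True)
    # No rank proceeds past this point until every rank has arrived,
    # so the directory is guaranteed to exist afterwards.
    barrier()
    return os.path.isdir(save_dir)


def run_simulation(world_size=4):
    """Simulate world_size ranks on one node with threads (assumption:
    a single shared directory, so shared_fs=True)."""
    root = tempfile.mkdtemp()
    save_dir = os.path.join(root, "checkpoints", "step_100")
    barrier = threading.Barrier(world_size)
    results = [None] * world_size

    def worker(rank):
        results[rank] = prepare_save_dir(
            save_dir, rank, local_rank=rank, shared_fs=True, barrier=barrier.wait
        )

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

In a real multi-process job the `barrier` argument would presumably be something like `torch.distributed.barrier` rather than a thread barrier; the sketch only shows the ordering guarantee, namely that every rank sees the directory after the barrier returns.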