DeepSpeed
c88af214 - [MiCS] [Fix] saving and loading model checkpoint logic for MiCS sharding (#3440)

Commit
2 years ago
* fix mics save checkpoint hanging
* MiCS load_checkpoint
* copyright
* fix for torch-1.9.0 all_reduce_coalesced api does not support nccl backend
* Naming alignment
* adding more test conditions for mics shard size
* test with different shard sizes
* adding assertion for better error msg

---------

Co-authored-by: Zhen Zhang <zhzhn@amazon.com>