DeepSpeed
07e76bd4 - Fix universal checkpoint loading for stage3 when the world size expands. (#7599)

Commit
68 days ago
When the world size expands from 2 to 4 and the checkpoint is converted to a universal checkpoint, loading from that universal checkpoint fails. A new rank, for example rank 3, tries to load the model file `zero_pp_rank_3_mp_rank_00_model_states.pt`, but that file was not produced by the previous run, which had only two ranks.

For stage3, just load the first file instead, that is `zero_pp_rank_0_mp_rank_00_model_states`.

The existing unit test `TestZeROUniversalCheckpointDP::test_dp_world_size_2to4` covers this case.

---------

Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
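
The file-selection rule described in this commit can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code changed by the commit: the helper name `stage3_model_states_path` and its parameters are hypothetical, and the `.pt` suffix on the rank 0 file is assumed from the rank 3 file name quoted in the message.

```python
import os


def stage3_model_states_path(ckpt_dir, dp_rank, load_universal):
    """Hypothetical helper: choose the stage3 model-states file for a rank.

    Assumption: when loading from a universal checkpoint, every rank reads
    the rank 0 file, because files for ranks added by a world size
    expansion (e.g. rank 3 after growing from 2 to 4) were never written.
    """
    # Universal checkpoint: always use the first file; otherwise use the
    # file that belongs to this rank.
    rank = 0 if load_universal else dp_rank
    filename = f"zero_pp_rank_{rank}_mp_rank_00_model_states.pt"
    return os.path.join(ckpt_dir, filename)


if __name__ == "__main__":
    # Rank 3 exists only after expansion, so it reads the rank 0 file.
    print(stage3_model_states_path("ckpt", dp_rank=3, load_universal=True))
```

Always reading the rank 0 file keeps the rule independent of how many ranks wrote the original checkpoint, which is the situation the commit message describes.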