DeepSpeed
c946a342 - Fix sort of zero checkpoint files (#5342)

Commit
1 year ago
Fix sort of zero checkpoint files (#5342)

The conversion from a regular checkpoint to a universal one relies on sorting the zero checkpoint files to merge sharded optimizer states. This merge can silently produce wrong results because the files are sorted in alphabetical order.

The merging logic assumes that the files are given in this order:

1. pp_index=0 tp_index=0 dp_index=0
2. pp_index=0 tp_index=0 dp_index=1
...

The optimizer state of a parameter can be sharded across multiple ranks. If it is sharded across dp_index 9-11, the files will be:

- bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt
- bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt
- bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt

Because they are sorted in alphabetical order, the script merges the sharded fragments in the order [10, 11, 9].

This PR fixes the sort by extracting the dp ranks from the file names and sorting the files with the ranks treated as numbers.

Fix #5283

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
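The difference between the two orderings can be reproduced with a minimal sketch. This is not the code from the PR: the regular expression and the names `rank_key` and `sort_zero_files` are hypothetical, and the sketch assumes that the first number in the file name is the dp rank and the second is the mp (tensor-parallel) rank, as the example file names above suggest.

```python
import re

# Hypothetical pattern for the file names quoted in the commit message.
# Assumption: the first number is the dp rank, the second the mp rank.
_RANK_PATTERN = re.compile(r"zero_pp_rank_(\d+)_mp_rank_(\d+)_optim_states\.pt$")


def rank_key(filename):
    """Return (mp_rank, dp_rank) as integers so ranks compare numerically."""
    match = _RANK_PATTERN.search(filename)
    if match is None:
        raise ValueError(f"unexpected zero checkpoint file name: {filename}")
    dp_rank, mp_rank = match.groups()
    # mp rank first, dp rank second: dp_index varies fastest in the
    # order the merging logic expects.
    return int(mp_rank), int(dp_rank)


def sort_zero_files(filenames):
    """Sort zero checkpoint files by numeric rank instead of alphabetically."""
    return sorted(filenames, key=rank_key)


files = [
    "bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt",
    "bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt",
    "bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt",
]

alphabetical = sorted(files)        # dp ranks come out as 10, 11, 9
numeric = sort_zero_files(files)    # dp ranks come out as 9, 10, 11

print([rank_key(f)[1] for f in alphabetical])
print([rank_key(f)[1] for f in numeric])
```

Raising on an unrecognized file name, rather than skipping it, is a design choice in this sketch: a silently dropped shard would reintroduce the kind of silent merge error described above.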