Fix sort of zero checkpoint files (#5342)
The conversion from a regular checkpoint to a universal one relies on
sorting the zero checkpoint files to merge sharded optimizer states.
This merge can silently produce wrong results because the files are
sorted in alphabetical order.
The merging logic assumes that the files are given in this order:
1. pp_index=0 tp_index=0 dp_index=0
2. pp_index=0 tp_index=0 dp_index=1
...
The optimizer state of a parameter can be sharded across multiple ranks.
If it is sharded across dp_index 9-11, the files will be the following
(the number after pp_rank_ in these file names is the dp rank):
- bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt
- bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt
- bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt
As they are sorted in alphabetical order, the script merges the sharded
fragments in the order [10, 11, 9].
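The misordering can be reproduced with plain string sorting. This is a
minimal sketch that uses only the three file names listed above; the
variable names are illustrative and do not come from the conversion
script.

```python
# File names from the example above, listed in the intended dp-rank order.
files = [
    "bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt",
    "bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt",
    "bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt",
]

# Plain sorted() compares strings character by character, so "1" < "9"
# places ranks 10 and 11 ahead of rank 9.
alphabetical = sorted(files)

for name in alphabetical:
    print(name)
```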
This PR fixes the sort by extracting the dp ranks from the file names
and sorting the files with the ranks treated as numbers.
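A rough sketch of a numeric sort, not the exact code in this PR. The
helper name sort_zero_files_sketch and the regular expression are
assumptions made for illustration, and ordering by (mp rank, dp rank) is
an assumption based on the expected order above, where dp_index varies
fastest.

```python
import re

# Assumed file name layout: ..._pp_rank_<dp rank>_mp_rank_<mp rank>_...
# (pattern written for this example only).
_RANK_PATTERN = re.compile(r"_pp_rank_(\d+)_mp_rank_(\d+)_")


def sort_zero_files_sketch(files):
    """Return files ordered by (mp rank, dp rank), comparing ranks as integers."""

    def rank_key(path):
        match = _RANK_PATTERN.search(path)
        if match is None:
            raise ValueError(f"cannot extract ranks from file name: {path}")
        dp_rank = int(match.group(1))
        mp_rank = int(match.group(2))
        return (mp_rank, dp_rank)

    return sorted(files, key=rank_key)


shards = [
    "bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt",
    "bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt",
    "bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt",
]
ordered = sort_zero_files_sketch(shards)
```

With this key the three example files come back in the order 9, 10, 11.
Raising on an unmatched name is a choice made for the sketch so that an
unexpected file is not merged in an arbitrary position.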
Fixes #5283
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>