DeepSpeed
a84d07c5 - MOE: Fix save checkpoint when TP > 1 (#5157)

Commit
1 year ago
MOE: Fix save checkpoint when TP > 1 (#5157)

When using MOE, currently, only mp_rank_00_model_states.pt is saved. This fails when using TP > 1. Fix it by saving all required mp_rank_xx_model_states.pt files.

Signed-off-by: Moshe Island <misland@habana.ai>
Co-authored-by: Moshe Island <misland@habana.ai>
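
To illustrate the file naming the message refers to, here is a minimal sketch that lists the model-state files one would expect for a given tensor-parallel (TP) size. It assumes one file per TP rank and a two-digit, zero-padded rank index, as suggested by `mp_rank_00` and `mp_rank_xx`. The helper names `model_states_filename` and `expected_model_state_files` are hypothetical and are not DeepSpeed APIs.

```python
from typing import List


def model_states_filename(mp_rank: int) -> str:
    """Build the model-states file name for one rank (hypothetical helper).

    Assumes the rank index is zero-padded to two digits.
    """
    return f"mp_rank_{mp_rank:02d}_model_states.pt"


def expected_model_state_files(tp_size: int) -> List[str]:
    """List the files expected when every TP rank saves its own model states.

    Assumes one file per rank, which is an illustration of the commit
    message, not a description of DeepSpeed's internal logic.
    """
    if tp_size < 1:
        raise ValueError("tp_size must be at least 1")
    return [model_states_filename(rank) for rank in range(tp_size)]


if __name__ == "__main__":
    # With TP = 2, a checkpoint holding only the rank-0 file is incomplete.
    for name in expected_model_state_files(2):
        print(name)
```

A check like this could be run against a checkpoint directory listing to spot a missing rank file before attempting to load with TP > 1.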