DeepSpeed
d2b1d7fc - Universal checkpoint for zero stage 3 (#5475)

Commit
1 year ago
Universal checkpoint for zero stage 3 (#5475)

This PR enables the universal checkpoint for zero stage 3.

Notes:
- The current implementation supports Data parallelism.
- Development is ongoing for universal checkpoint Stage 3 with tensor-slicing model parallelism.
- Pipeline parallelism is not supported by ZeRO Stage 3, and hence is not included in this universal checkpoint implementation.

In this PR:
- I've updated `deepspeed/checkpoint/ds_to_universal.py` to support converting Zero checkpoints into Universal checkpoints.
- I've updated `deepspeed/runtime/zero/stage3.py` to enable loading Universal checkpoints using the Stage 3 optimizer.

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
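
The two steps described in this PR (convert a ZeRO checkpoint to a Universal checkpoint, then load it with the Stage 3 optimizer) might be driven roughly as sketched below. This is a minimal sketch under stated assumptions, not the PR's own usage instructions: the `--input_folder` / `--output_folder` flags and the `checkpoint.load_universal` config key are assumed from DeepSpeed's converter script and config schema and should be checked against the installed version, and the checkpoint paths and helper names are hypothetical.

```python
import json
import sys


def build_convert_command(input_folder, output_folder):
    """Build the argv for the converter script named in the commit message.

    Assumption: the script accepts --input_folder and --output_folder.
    """
    return [
        sys.executable,
        "deepspeed/checkpoint/ds_to_universal.py",
        "--input_folder", input_folder,
        "--output_folder", output_folder,
    ]


def build_ds_config():
    """Build a minimal DeepSpeed config dict for ZeRO Stage 3.

    Assumption: "checkpoint.load_universal" is the key that makes the
    engine treat the checkpoint being loaded as a Universal checkpoint.
    """
    return {
        "train_micro_batch_size_per_gpu": 1,
        "zero_optimization": {"stage": 3},
        "checkpoint": {"load_universal": True},
    }


# Hypothetical checkpoint locations, for illustration only.
cmd = build_convert_command("ckpt/global_step100", "ckpt/global_step100_universal")
print(" ".join(cmd))
print(json.dumps(build_ds_config(), indent=2))
```

The sketch only prints the command and config rather than running them, so it can be inspected without a GPU or a DeepSpeed install. Per the notes in the commit message, this path covers data parallelism only.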