Universal Checkpoint for Sequence Parallelism (#4752)
This PR extends the [universal
checkpoint](https://github.com/microsoft/Megatron-DeepSpeed/tree/main/examples_deepspeed/universal_checkpointing)
to support DS sequence parallelism and training scenarios where pipeline
parallelism is not enabled.
The attached TensorBoard chart shows a training scenario (validation
curve) in which a GPT model is pre-trained with data parallelism
across 4 GPUs, with checkpoints saved at the 100th and 200th
iterations. The checkpoint from the 100th iteration is later loaded
for continual pre-training under a different configuration with more
GPU resources: data parallelism = 4 GPUs, sequence parallelism = 2
GPUs (8 GPUs in total).
<img width="1783" alt="Screenshot 2023-11-28 at 9 11 55 AM"
src="https://github.com/microsoft/DeepSpeed/assets/16696152/817141b9-2b37-4a3b-9a47-07324877a4eb">
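The resume-with-a-different-topology workflow described above can be sketched as follows. Script paths, flag names, and iteration numbers are drawn from the Megatron-DeepSpeed universal checkpointing example linked above and should be treated as illustrative, not exact:

```shell
# Sketch of the continual pre-training workflow (illustrative paths/flags,
# based on the Megatron-DeepSpeed universal checkpointing example).

# 1. Convert the DeepSpeed checkpoint saved at iteration 100 into the
#    universal format.
python tools/convert_checkpoint/ds_to_universal.py \
    --input_folder  checkpoints/gpt/global_step100 \
    --output_folder checkpoints/gpt/global_step100_universal

# 2. Resume pre-training from the universal checkpoint with a different
#    topology: data parallelism = 4, sequence parallelism = 2 (8 GPUs
#    total), pipeline parallelism disabled.
deepspeed --num_gpus 8 pretrain_gpt.py \
    --ds-sequence-parallel-size 2 \
    --load checkpoints/gpt/global_step100_universal \
    --universal-checkpoint \
    # ... remaining model/training arguments as in the original run
```

The key point is that the universal format is topology-agnostic: the converted checkpoint can be reloaded under a parallelism layout different from the one it was saved with.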
---------
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>