DeepSpeed
08e0733e - Support MoE for pipeline models (#5338)

Committed 1 year ago
Support MoE for pipeline models (#5338)

This PR enhances DeepSpeed to support MoE for pipeline models (e.g. GPTModelPipe from Megatron-DeepSpeed).

Main changes:
  • Enhance expert group creation for pipeline models (both flavors: DP/PP/EP and DP/TP/PP/EP); see the first sketch after this message.
  • Fix MoE save/load checkpoint for PipelineModule-based models.
  • Display the MoE loss for PipelineModule-based models.
  • Support gradient reduction with BF16_Optimizer for PipelineModule. Note that this commit also fixes a gradient-reduction error when using Megatron-DeepSpeed GPTModelPipe with BF16_Optimizer for a dense (non-MoE) model.
  • When using no-drop tokens, all-reduce the capacity (op=max) over the expert parallel group instead of the world group; see the second sketch after this message.

Signed-off-by: Moshe Island <misland@habana.ai>
Co-authored-by: Moshe Island <misland@habana.ai>
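To illustrate the expert-group enhancement, here is a minimal sketch of how ranks could be partitioned into expert parallel groups when a pipeline dimension is present. The helper name and the rank layout (consecutive ranks per pipeline stage, no tensor parallelism) are assumptions made for illustration; this is not DeepSpeed's actual groups.py logic.

# Hypothetical sketch of DP/PP/EP group construction; not DeepSpeed's groups.py.
# Assumes consecutive ranks share a pipeline stage (an assumption for this sketch).
def build_expert_parallel_ranks(world_size: int,
                                pipeline_parallel_size: int,
                                expert_parallel_size: int):
    """Partition each pipeline stage's ranks into expert parallel groups."""
    ranks_per_stage = world_size // pipeline_parallel_size
    expert_groups = []
    for stage in range(pipeline_parallel_size):
        stage_ranks = list(range(stage * ranks_per_stage, (stage + 1) * ranks_per_stage))
        # Split the stage's data parallel ranks into chunks of expert_parallel_size.
        for i in range(0, ranks_per_stage, expert_parallel_size):
            expert_groups.append(stage_ranks[i:i + expert_parallel_size])
    return expert_groups

# Example: 8 ranks, 2 pipeline stages, expert parallel size 2
# -> [[0, 1], [2, 3], [4, 5], [6, 7]]

And a sketch of the no-drop-token capacity synchronization from the last bullet: when token dropping is disabled, the locally required capacity can differ across ranks, so it is all-reduced with op=max. The function and argument names here are placeholders; the relevant detail is the group argument, which scopes the reduction to the expert parallel group rather than the world group.

import torch
import torch.distributed as dist

def sync_capacity(local_capacity: torch.Tensor, expert_parallel_group) -> torch.Tensor:
    """Agree on the maximum required capacity across the expert parallel group only."""
    dist.all_reduce(local_capacity, op=dist.ReduceOp.MAX, group=expert_parallel_group)
    return local_capacity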
Files changed:
  • deepspeed/moe/layer.py
  • deepspeed/moe/mappings.py
  • deepspeed/moe/sharded_moe.py
  • deepspeed/ops/transformer/inference/moe_inference.py
  • deepspeed/runtime/activation_checkpointing/checkpointing.py
  • deepspeed/runtime/bf16_optimizer.py
  • deepspeed/runtime/engine.py
  • deepspeed/runtime/pipe/engine.py
  • deepspeed/runtime/pipe/module.py
  • deepspeed/runtime/utils.py
  • deepspeed/runtime/zero/stage_1_and_2.py
  • deepspeed/utils/bwc.py
  • deepspeed/utils/groups.py
  • tests/unit/utils/test_groups.py