DeepSpeed
56c52238 - bf16+pipeline parallelism (#1801)

Commit message:

* bf16 updates
* Got bf16 working
* fp32 reduction; flattened tensors
* bf16+zero_stage_1 first cut
* Finish zero_stage 1 sharding
* Matching fp16 with debugging codes
* Matching loss with fp16
* Fix gradient clipping
* bf16 gradient clipping fix; bf16 checkpoint save/load
* Unscale grad norm
* Fix grad norm scaling
* Enable loading fp16_zero_1 into bf16_zero_1 engine and vice versa
* Fix clip_grad key error
* Reduce tied weight gradients
* Fix grad norm for MoE
* Reduce specified gradients
* Use O(n) instead of O(n^2)
* Remove optimizer restriction for bf16
* Link bf16 & fp32 params
* Clip gradients of last stage tied weights
* Simplify tied weights reduction logic
* Also clip all TP rank parameters
* lp to hp mapping
* Link lp/hp/optim state; refresh links after checkpoint load
* Remove debug prints
* Simplify zero_grad logic
* fp32 accessors
* Fix update bug

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
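Since the commit wires bf16 support through the runtime config (`config.py`, `constants.py`) rather than a new API, a minimal sketch of how bf16 training with ZeRO stage 1 and a pipeline-parallel model might be enabled after this change is shown below. The config keys and the `PipelineModule`/`initialize` calls follow DeepSpeed's public interface; the model, sizes, and optimizer settings are illustrative placeholders, not code from this commit.

```python
# Hypothetical sketch: bf16 + ZeRO stage 1 with a pipeline-parallel model.
# Config values and the toy model are placeholders, not taken from the commit.
import torch
import deepspeed
from deepspeed.pipe import PipelineModule

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "gradient_clipping": 1.0,           # global gradient clipping threshold
    "bf16": {"enabled": True},          # bf16 training, used in place of the "fp16" section
    "zero_optimization": {"stage": 1},  # bf16 + ZeRO stage 1 sharding covered by this commit
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

# A toy model expressed as a list of layers so DeepSpeed can split it into pipeline stages.
layers = [torch.nn.Linear(1024, 1024) for _ in range(8)]
net = PipelineModule(layers=layers, loss_fn=torch.nn.MSELoss(), num_stages=2)

engine, _, _, _ = deepspeed.initialize(
    model=net,
    model_parameters=net.parameters(),
    config=ds_config,
)

# The pipeline engine pulls micro-batches from an iterator and runs the
# forward/backward/step schedule internally, e.g.:
# loss = engine.train_batch(data_iter=iter(train_loader))
```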
Files changed:

  • .gitignore
  • deepspeed/checkpoint/constants.py
  • deepspeed/runtime/bf16_optimizer.py
  • deepspeed/runtime/config.py
  • deepspeed/runtime/constants.py
  • deepspeed/runtime/engine.py
  • deepspeed/runtime/fp16/fused_optimizer.py
  • deepspeed/runtime/pipe/engine.py
  • deepspeed/runtime/pipe/module.py
  • deepspeed/runtime/utils.py
  • deepspeed/runtime/zero/stage3.py
  • deepspeed/runtime/zero/stage_1_and_2.py
  • deepspeed/runtime/zero/utils.py