DeepSpeed
56c52238 - bf16+pipeline parallelism (#1801)

Commit

3 years ago

bf16+pipeline parallelism (#1801)

* bf16 updates
* Got bf16 working
* fp32 reduction; flattened tensors
* bf16+zero_stage_1 first cut
* finish zero_stage 1 sharding
* Matching fp16 with debugging codes
* Matching loss with fp16
* Fix gradient clipping
* bf16 gradient clipping fix; bf16 checkpoint save/load
* Unscale grad norm
* Fix grad norm scaling
* Enable loading fp16_zero_1 into bf16_zero_1 engine and vice versa
* Fix clip_grad key error
* Reduce tied weight gradients
* Fix grad norm for moe
* Reduce specified gradients
* Use O(n) instead of O(n^2)
* Remove optimizer restriction for bf16
* Link bf16 & fp32 params
* Clip gradients of last stage tied weights
* Simplify tied weights reduction logic
* Also clip all tp rank parameters
* lp to hp mapping
* Link lp/hp/optim state; Refresh links after checkpoint load
* Remove debug print
* Remove debug print
* Simplify zero_grad logic
* fp32 accessors
* Fix update bug

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

References

#1801 - bf16+pipeline parallelism

Author

tjruwase

Parents

9bf1e9af

Files (13)

.gitignore
deepspeed
- checkpoint
  - constants.py
- runtime
  - bf16_optimizer.py
  - config.py
  - constants.py
  - engine.py
  - fp16
    - fused_optimizer.py
  - pipe
    - engine.py
    - module.py
  - utils.py
  - zero
    - stage3.py
    - stage_1_and_2.py
    - utils.py
