DeepSpeed
56c52238 - bf16+pipeline parallelism (#1801)

bf16+pipeline parallelism (#1801)

* bf16 updates
* Got bf16 working
* fp32 reduction; flattened tensors (see the first sketch below)
* bf16+zero_stage_1 first cut
* finish zero_stage 1 sharding
* Matching fp16 with debugging codes
* Matching loss with fp16
* Fix gradient clipping
* bf16 gradient clipping fix; bf16 checkpoint save/load
* Unscale grad norm
* Fix grad norm scaling
* Enable loading fp16_zero_1 into bf16_zero_1 engine and vice versa
* Fix clip_grad key error
* Reduce tied weight gradients
* Fix grad norm for moe
* Reduce specified gradients
* Use O(n) instead of O(n^2)
* Remove optimizer restriction for bf16
* Link bf16 & fp32 params (see the second sketch below)
* Clip gradients of last stage tied weights
* Simplify tied weights reduction logic
* Also clip all tp rank parameters (see the third sketch below)
* lp to hp mapping
* Link lp/hp/optim state; Refresh links after checkpoint load
* Remove debug print
* Remove debug print
* Simplify zero_grad logic
* fp32 accessors
* Fix update bug

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
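The "fp32 reduction; flattened tensors" item is the core numerical trick: bf16 gradients are flattened into a single fp32 buffer before the data-parallel all-reduce, so the cross-rank sum happens at full precision instead of in bf16. A minimal sketch of the idea, assuming `torch.distributed` is already initialized; the function name and signature are illustrative, not DeepSpeed's API:

```python
import torch
import torch.distributed as dist

def reduce_bf16_grads_in_fp32(params, group=None):
    """Flatten bf16 grads into one fp32 buffer, all-reduce it,
    and scatter the averaged result back (illustrative sketch)."""
    grads = [p.grad for p in params if p.grad is not None]
    # Flatten in fp32 so the cross-rank sum keeps full precision.
    flat_fp32 = torch.cat([g.reshape(-1).float() for g in grads])
    flat_fp32.div_(dist.get_world_size(group=group))
    dist.all_reduce(flat_fp32, group=group)
    # Copy the reduced values back; copy_ casts fp32 -> bf16.
    offset = 0
    for g in grads:
        n = g.numel()
        g.copy_(flat_fp32[offset:offset + n].view_as(g))
        offset += n
```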
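The "lp to hp mapping" and "Link bf16 & fp32 params" items refer to keeping a high-precision fp32 ("hp") master copy that the optimizer updates while the model computes in low-precision bf16 ("lp"): each lp parameter is linked to its slice of a flat hp buffer and refreshed from it after each step. A minimal sketch under those assumptions; `_hp_view` and both function names are hypothetical, and moving gradients from lp to hp is omitted:

```python
import torch

def link_lp_to_hp(lp_params):
    """Build a flat fp32 master copy of the bf16 params and attach
    a view of its slice to each param (hypothetical attribute)."""
    flat_hp = torch.cat([p.detach().reshape(-1).float() for p in lp_params])
    flat_hp.requires_grad_(True)  # the optimizer updates this buffer
    offset = 0
    for p in lp_params:
        n = p.numel()
        # View shares storage with flat_hp, so optimizer updates
        # to the master are visible through the link.
        p._hp_view = flat_hp.detach()[offset:offset + n].view_as(p)
        offset += n
    return flat_hp

def refresh_lp_from_hp(lp_params):
    """After optimizer.step() on the fp32 master, push the updated
    values back into the bf16 copies used by forward/backward."""
    with torch.no_grad():
        for p in lp_params:
            p.copy_(p._hp_view)  # copy_ casts fp32 -> bf16
```

In this scheme the optimizer is constructed over the returned fp32 buffer rather than the bf16 parameters, which is what lets the optimizer restriction for bf16 be removed.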
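"Unscale grad norm" and "Also clip all tp rank parameters" concern computing one global gradient norm: with tensor parallelism each rank holds only a shard, so per-rank squared norms must be summed across the model-parallel group before every rank clips by the same factor. A simplified sketch; it ignores parameters replicated across tp ranks, which a real implementation must avoid double-counting:

```python
import torch
import torch.distributed as dist

def clip_grad_norm_across_ranks(params, max_norm, mp_group=None):
    """Compute a single global L2 grad norm across model-parallel
    ranks and clip every local shard by the same coefficient."""
    grads = [p.grad for p in params if p.grad is not None]
    # Each rank's contribution to the squared global norm, in fp32.
    local_sq = sum(g.float().pow(2).sum() for g in grads)
    dist.all_reduce(local_sq, group=mp_group)  # sum over tp ranks
    total_norm = local_sq.sqrt()
    clip_coef = float(max_norm) / (total_norm.item() + 1e-6)
    if clip_coef < 1.0:
        for g in grads:
            g.mul_(clip_coef)
    return total_norm
```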