bf16+pipeline parallelism (#1801)
* bf16 updates
* Got bf16 working
* fp32 reduction; flattened tensors
* bf16+zero_stage_1 first cut
* finish zero_stage 1 sharding
* Matching fp16 with debugging code
* Matching loss with fp16
* Fix gradient clipping
* bf16 gradient clipping fix
* bf16 checkpoint save/load
* Unscale grad norm
* Fix grad norm scaling
* Enable loading fp16_zero_1 into bf16_zero_1 engine and vice versa
* Fix clip_grad key error
* Reduce tied weight gradients
* Fix grad norm for moe
* Reduce specified gradients
* Use O(n) instead of O(n^2)
* Remove optimizer restriction for bf16
* Link bf16 & fp32 params
* Clip gradients of last stage tied weights
* Simplify tied weights reduction logic
* Also clip all tp rank parameters
* lp to hp mapping
* Link lp/hp/optim state; Refresh links after checkpoint load
* Remove debug print
* Simplify zero_grad logic
* fp32 accessors
* Fix update bug
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>