ZeRO Gradient Accumulation Dtype. (#2847)
* Adding attributes for grad accum dtype.
* Accumulating reduction grads in stage 2 mode 2.
* Fixing missing colon.
* Tracking reduction grad move.
* Correct hooks.
* Name change updates.
* Using grad_accum in cpu offload functions.
* Addressing comments: putting bf16 optimizer back, removing hooks.
* Fixing missing pointer to grad accum.
* Renaming functions.
* More function renames.
* Adding reduction dtype.
* Updating for offload.
* Adding functionality for stage 3.
* Adding s3 test support.
* Add to MiCS optimizer.
* ZeRO++ tutorial PR (#3783)
* Removing need for grad_reduc attribute.
* Offload correctness.
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Heyang Qin <heyangqin@microsoft.com>