DeepSpeedZeroOptimizer: refactor bit16 flattening to support more accelerators (#4833)
Until now, the code offloaded device memory by replacing each
torch.nn.Parameter's data with new CPU storage: all params were
flattened on the host and then moved to the device. On some
accelerators, however, a torch.nn.Parameter that lives on the device
cannot be assigned CPU storage.
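
For illustration only, a minimal sketch of the old pattern (simplified;
the function name and signature are hypothetical, not the actual
DeepSpeed code):

```python
from torch._utils import _flatten_dense_tensors

def flatten_on_host_old(params, device):
    # Re-point every device param at CPU storage to free device memory.
    # This assignment is the step some accelerators cannot perform.
    for p in params:
        p.data = p.data.cpu()
    # Flatten the (now CPU-resident) params and move the single flat
    # buffer to the device.
    return _flatten_dense_tensors([p.data for p in params]).to(device)
```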
This PR instead copies each param's data into a new CPU tensor and
shrinks the device storage. Later, once the flat buffer has been moved
to the device, param.data becomes a view into that flat buffer.
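
A minimal sketch of the new flow (again simplified and hypothetical;
the storage-shrinking call shown stands in for whatever the actual
code uses):

```python
import torch
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors

def flatten_via_cpu_copy(params, device):
    # Copy each param's data into a fresh CPU tensor; the device param
    # itself is never assigned CPU storage.
    cpu_copies = [torch.empty_like(p.data, device="cpu").copy_(p.data)
                  for p in params]
    for p in params:
        # Shrink the device-side storage to release its memory.
        # (untyped_storage() is the PyTorch 2.x spelling; older versions
        # would use p.data.storage().resize_(0).)
        p.data.untyped_storage().resize_(0)
    # Flatten on the host, move the flat buffer to the device, and
    # re-point each param at its view of the flat buffer.
    flat = _flatten_dense_tensors(cpu_copies).to(device)
    for p, view in zip(params, _unflatten_dense_tensors(flat, cpu_copies)):
        p.data = view
    return flat
```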
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>