Guanhua/partial offload rebase v2 (#590) (#4636)
This PR introduces the Twin-Flow feature of ZeRO-Offload++, which speeds up
end-to-end training iterations by up to 6x on DGX-H100s.
This PR includes:
* Twin-Flow implementation inside ZeRO optimizer
* JSON config tutorial
* example using DeepSpeed
* unit tests
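
For reviewers who want a quick picture of the config side, here is a minimal sketch of what a Twin-Flow JSON config might look like, built as a Python dict. It assumes the partial-offload fraction is set through a `ratio` key under `zero_optimization.offload_optimizer` with ZeRO stage 3; the helper `build_twin_flow_config`, the `cpu_ratio` parameter, and the 0.3 value are illustrative and not taken from this PR. The JSON config tutorial included here is the authoritative reference for key names and valid values.

```python
import json


def build_twin_flow_config(cpu_ratio: float = 0.3) -> dict:
    """Build an illustrative DeepSpeed config dict with partial optimizer offload.

    cpu_ratio is a hypothetical parameter name: the fraction of optimizer
    state assumed to be offloaded to CPU, with the rest staying on GPU.
    """
    if not 0.0 <= cpu_ratio <= 1.0:
        raise ValueError("cpu_ratio must be between 0.0 and 1.0")
    return {
        # Placeholder batch size, chosen only to make the config complete.
        "train_micro_batch_size_per_gpu": 1,
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {
                "device": "cpu",
                "pin_memory": True,
                # Assumed key name for the Twin-Flow offload fraction.
                "ratio": cpu_ratio,
            },
        },
    }


config = build_twin_flow_config(0.3)
# Serialize to JSON, e.g. to save as a DeepSpeed config file.
config_json = json.dumps(config, indent=2)
print(config_json)
```

Under these assumptions, a ratio of 1.0 would correspond to offloading all optimizer state to CPU, and smaller values keep more of it on the GPU.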
cc @jeffra @awan-10 @tjruwase @mrwyattii
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>