fix pageable h2d memcpy (#5301)
ZeRO offload case
Fix the pageable h2d memcpy issue in the step process: the h2d memcpy now
uses pinned memory.
Speeds up h2d memcpy by 6x on a single GPU and by 4-5x on an 8-GPU node.
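
For illustration only, here is a minimal PyTorch sketch of the general technique (staging a host tensor in pinned memory before the h2d copy). It is not the code changed in this PR; the helper name `copy_to_device` is hypothetical, and the CPU fallback is there only so the sketch runs without a GPU.

```python
import torch


def copy_to_device(cpu_tensor: torch.Tensor, device: torch.device) -> torch.Tensor:
    """Hypothetical helper: copy a host tensor to `device` via pinned memory."""
    if device.type == "cuda":
        # Pinned (page-locked) host memory avoids the extra staging copy that
        # pageable memory needs and lets non_blocking=True run asynchronously.
        if not cpu_tensor.is_pinned():
            cpu_tensor = cpu_tensor.pin_memory()
        return cpu_tensor.to(device, non_blocking=True)
    # CPU-only fallback: nothing to pin, plain copy.
    return cpu_tensor.to(device)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
src = torch.arange(8, dtype=torch.float32)
dst = copy_to_device(src, device)
if device.type == "cuda":
    # Wait for the asynchronous copy before reading the result.
    torch.cuda.synchronize()
```

Note that `pin_memory()` itself copies into a new buffer, so in a hot path one would typically allocate the pinned buffer once and reuse it rather than pin on every call.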
cc @tjruwase
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Ubuntu <deepspeed@deepspeed-login.2d1icxc5dsxehnpuwt3ifc34ph.gvxx.internal.cloudapp.net>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>