fix memcpy issue on backward for zero-infinity (#6670)
This PR is similar to
[PR#5301](https://github.com/microsoft/DeepSpeed/pull/5301), which
reduces device-to-host (D2H) copy time by using pinned memory.
Previously, the D2H memcpy was the bottleneck during the final
backward pass of each iteration for ZeRO-Infinity (offload), as shown in
Trace-1. The new version eliminates the bottleneck, as shown in
Trace-2.
_Trace-1_
<img width="480" alt="image"
src="https://github.com/user-attachments/assets/891e3770-351b-4e03-8a59-b491bc44d03b">
_Trace-2_
<img width="192" alt="image"
src="https://github.com/user-attachments/assets/f1cf9037-77f8-42a6-adc8-d5c6bacde0aa">
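For readers unfamiliar with the technique, here is a minimal sketch of a D2H copy into a preallocated pinned host buffer. It is not the code changed in this PR. `make_host_buffer` and `copy_grad_to_host` are hypothetical helper names, and the sketch falls back to an ordinary CPU buffer when CUDA is unavailable.

```python
import torch


def make_host_buffer(numel: int, dtype: torch.dtype) -> torch.Tensor:
    """Allocate a reusable host buffer (hypothetical helper, not a DeepSpeed API)."""
    # Pinned (page-locked) memory can only be requested when CUDA is available.
    use_pinned = torch.cuda.is_available()
    return torch.empty(numel, dtype=dtype, device="cpu", pin_memory=use_pinned)


def copy_grad_to_host(grad: torch.Tensor, host_buffer: torch.Tensor) -> torch.Tensor:
    """Copy a flattened gradient into the host buffer (hypothetical helper)."""
    # non_blocking=True lets the copy run asynchronously with respect to the
    # host only when the destination is pinned; with pageable memory the copy
    # is staged and effectively synchronous.
    host_buffer.copy_(grad.reshape(-1), non_blocking=True)
    return host_buffer


device = "cuda" if torch.cuda.is_available() else "cpu"
grad = torch.arange(1024, dtype=torch.float32, device=device)

# Allocate once and reuse across iterations, so pinning cost is not paid per copy.
host_buffer = make_host_buffer(grad.numel(), grad.dtype)
copy_grad_to_host(grad, host_buffer)

if torch.cuda.is_available():
    # Wait for the asynchronous copy to finish before reading on the host.
    torch.cuda.synchronize()
```

Allocating the pinned buffer once and reusing it matters because page-locking memory is itself expensive. Pinning a fresh buffer on every backward pass could offset the gain from the faster copy.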
cc @tjruwase
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>