Fix the GPU memory usage of ZeRO-Offload (only update stage_1_and_2.py) (#7309)
Signed-off-by: Armin Zhu <mingzhengzhu1998@gmail.com>
Fix the GPU memory usage of ZeRO-Offload with stages 1 and 2. Before the
fix, GPU memory usage was about 3x that of params_FP16. This was caused
by the H2D data copy using a different data type. GPU memory usage is now
about 1x params_FP16, and the H2D memory copy needs a 16-bit pinned
memory buffer.
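
The 3x and 1x figures can be reproduced with rough arithmetic. The sketch
below is illustrative only: it assumes the extra memory came from a
temporary FP32 staging tensor on the GPU, which this message does not
state, and the helper name peak_gpu_bytes is hypothetical.

```python
from typing import Optional

# Illustrative arithmetic only. The FP32 staging tensor is an ASSUMPTION
# about where the extra memory came from; it is not stated in the commit.
FP16_BYTES = 2
FP32_BYTES = 4


def peak_gpu_bytes(num_params: int, staging_dtype_bytes: Optional[int]) -> int:
    """Bytes held on the GPU: FP16 params plus an optional staging tensor."""
    params_fp16 = num_params * FP16_BYTES
    staging = 0 if staging_dtype_bytes is None else num_params * staging_dtype_bytes
    return params_fp16 + staging


num_params = 1_000_000
params_fp16 = num_params * FP16_BYTES

# Before (assumed): the copy lands on the GPU as FP32, then is cast to FP16.
before = peak_gpu_bytes(num_params, FP32_BYTES)
# After: a 16-bit pinned host buffer is copied straight into the FP16 params.
after = peak_gpu_bytes(num_params, None)

ratio_before = before / params_fp16
ratio_after = after / params_fp16
print(ratio_before, ratio_after)
```

Under that assumption the ratios come out to 3.0 and 1.0, matching the
"about 3x" and "about 1x" figures above.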