Reduce the device bubble introduced by heavy loop synchronization in coalesced fetch/release(z3_leaf_module) (#6694)
depend on https://github.com/microsoft/DeepSpeed/pull/6649
When performing fetch/release operations on Z3 leaf modules, the loop
time is excessively long in fine-grained module. Compared to non-leaf
modules, Z3 leaf modules may include a larger number of parameters.
Although each loop unit does not consume much time, the overall loop
length can be significant.

**The fetch time is impacted by:**
Post-allgather operations (narrow, slice ,cat, difficult to avoid)
Memory pressure(record_stream/fetch event create&sync)
**The release time is impacted by:**
slice
Free parameter record_stream
Considering the fine-grained leaf modules, where each parameter is
relatively small, we can treat the parameters within each leaf module as
a unified entity to handle memory pressure. This approach can
approximately halve the CPU time required for fetch/release operations.
---------
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>