DeepSpeed
b0040b6c - Reduce the device bubble introduced by heavy loop synchronization in coalesced fetch/release(z3_leaf_module) (#6694)

Commit
336 days ago
Depends on https://github.com/microsoft/DeepSpeed/pull/6649

When performing fetch/release operations on Z3 leaf modules, the loop time is excessively long for fine-grained modules. Compared to non-leaf modules, Z3 leaf modules may include a larger number of parameters. Although each loop iteration does not consume much time, the total time spent in the loop can be significant.

![image](https://github.com/user-attachments/assets/9891835a-2620-47f3-aba6-ea22b8905d1c)

**The fetch time is impacted by:**
- Post-allgather operations (narrow, slice, cat; difficult to avoid)
- Memory pressure (record_stream / fetch event create & sync)

**The release time is impacted by:**
- slice
- Free parameter record_stream

Because each parameter in a fine-grained leaf module is relatively small, we can treat the parameters within each leaf module as a unified entity when handling memory pressure. This approach can approximately halve the CPU time required for fetch/release operations.

---------

Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
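
The idea of treating a leaf module's parameters as one unit can be illustrated with a toy model. This is a minimal sketch, not DeepSpeed code: `FakeStream`, `fetch_per_param`, and `fetch_coalesced` are hypothetical names, and the stream only counts bookkeeping calls instead of touching a device. It assumes the per-parameter slicing still has to happen, while the `record_stream`/event bookkeeping can be done once per module on a shared flat buffer.

```python
from dataclasses import dataclass


@dataclass
class FakeStream:
    """Hypothetical stand-in for a device stream; it only counts calls."""
    record_stream_calls: int = 0
    events_created: int = 0

    def record_stream(self, buffer):
        self.record_stream_calls += 1

    def new_event(self):
        self.events_created += 1


def fetch_per_param(stream, param_sizes):
    """Baseline: each parameter gets its own buffer, record_stream call and event."""
    buffers = []
    for size in param_sizes:
        buf = bytearray(size)
        stream.record_stream(buf)  # one bookkeeping call per parameter
        stream.new_event()         # one event per parameter
        buffers.append(buf)
    return buffers


def fetch_coalesced(stream, param_sizes):
    """Coalesced: one flat buffer per leaf module, parameters are views into it."""
    flat = bytearray(sum(param_sizes))
    whole = memoryview(flat)
    views, offset = [], 0
    for size in param_sizes:
        # Slicing is still done per parameter; only the bookkeeping is hoisted.
        views.append(whole[offset:offset + size])
        offset += size
    stream.record_stream(flat)  # a single call covers every parameter
    stream.new_event()          # a single event covers every parameter
    return flat, views


sizes = [16, 32, 8, 64]  # made-up parameter sizes for one leaf module

baseline = FakeStream()
fetch_per_param(baseline, sizes)

coalesced = FakeStream()
flat, views = fetch_coalesced(coalesced, sizes)

print(baseline.record_stream_calls, coalesced.record_stream_calls)
```

In this sketch the number of bookkeeping calls drops from one per parameter to one per module, while the loop that carves out per-parameter views remains. How much CPU time that saves in practice depends on the real cost of each call, which this toy model does not measure.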