Autotune ZenFlow affinity (#7506)

Commit

169 days ago

This PR addresses the following ZenFlow optimizer core binding issue: https://github.com/deepspeedai/DeepSpeed/issues/7478

With this PR, the ZenFlow optimizer worker derives its core binding from the DeepSpeed core binding mechanism. The algorithm is as follows:

1. Each DeepSpeed rank gets its core binding from the DeepSpeed command-line flag `--bind_cores_to_rank`, which assigns the physical CPU cores to different workers.
2. When spawning the ZenFlow optimizer worker, DeepSpeed splits the current CPU affinity list into two sublists: pt_affinity and zf_affinity.
3. zf_affinity is used to set the affinity of the ZenFlow optimizer worker; pt_affinity is used to set the affinity of the current PyTorch process.
4. By default, one core is reserved for each PyTorch process, and the rest are used by the ZenFlow optimizer worker. The number of cores reserved for the PyTorch process can be changed with the ZenFlow config variable `pt_reserved_cores`.

---------

Signed-off-by: Guokai Ma <guokai.ma@gmail.com>
Signed-off-by: Ma, Guokai <guokai.ma@intel.com>
Signed-off-by: aeeeeeep <aeeeeeep@proton.me>
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: aeeeeeep <aeeeeeep@proton.me>
Co-authored-by: Zhipeng Wang <zhipeng.rainbowserie@gmail.com>
Co-authored-by: Zhipeng Wang <zwanga@wustl.edu>
Co-authored-by: Peng Du <pedu@linkedin.com>
Co-authored-by: pengdurice <pengduhit@gmail.com>
Co-authored-by: Zhipeng Wang <zhipengbayern@gmail.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
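The affinity split described in steps 2 to 4 of the commit message can be illustrated with a minimal sketch. This is not the code from the PR: the helper name `split_affinity` is hypothetical, and the choice to reserve the lowest-numbered cores for the PyTorch process is an assumption made only for illustration. Only `pt_affinity`, `zf_affinity`, and `pt_reserved_cores` come from the commit message.

```python
import os


def split_affinity(cores, pt_reserved_cores=1):
    """Split a CPU affinity list into (pt_affinity, zf_affinity).

    Hypothetical helper, not the DeepSpeed implementation.
    Assumption: the first `pt_reserved_cores` cores (in sorted order)
    stay with the PyTorch process; the rest go to the ZenFlow worker.
    """
    cores = sorted(cores)
    # Both sides need at least one core, otherwise the split is meaningless.
    if not 0 < pt_reserved_cores < len(cores):
        raise ValueError(
            f"pt_reserved_cores={pt_reserved_cores} must be between 1 and "
            f"{len(cores) - 1} for an affinity list of {len(cores)} cores"
        )
    pt_affinity = cores[:pt_reserved_cores]
    zf_affinity = cores[pt_reserved_cores:]
    return pt_affinity, zf_affinity


if __name__ == "__main__" and hasattr(os, "sched_getaffinity"):
    # os.sched_getaffinity is available on Linux only; it returns the set
    # of cores the current process may run on (for example, the cores
    # assigned to this rank by --bind_cores_to_rank).
    current = os.sched_getaffinity(0)
    if len(current) > 1:
        pt, zf = split_affinity(current, pt_reserved_cores=1)
        print("pt_affinity:", pt)
        print("zf_affinity:", zf)
        # Applying the split would look like this (left commented out so
        # the sketch has no side effects):
        #   os.sched_setaffinity(0, pt)   # in the PyTorch process
        #   os.sched_setaffinity(0, zf)   # inside the optimizer worker
```

With a rank bound to cores 0 through 3 and the default of one reserved core, this sketch would keep core 0 for the PyTorch process and hand cores 1 through 3 to the optimizer worker.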

References

#7506 - Autotune ZenFlow affinity

Author

delock

Parents

66bf2a64

DeepSpeed 43537d0a - Autotune ZenFlow affinity (#7506)
