DeepSpeed
52b1d4d6 - Fix leaf module race condition (#7825)

Fix leaf module race condition (#7825)

Fix #7824

For [leaf modules](https://deepspeed.readthedocs.io/en/latest/training.html#configuring-zero-leaf-modules), ZeRO3 manages all parameters within the module uniformly. When a module returns multiple output tensors, PyTorch's autograd can trigger backward hooks from multiple threads concurrently. This causes race conditions when multiple threads simultaneously modify `__inflight_param_registry` and parameter states.

This PR ensures that for leaf modules, only one thread performs the actual parameter fetching work while other concurrent threads wait and return early, preventing the race condition.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
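The single-fetcher pattern the commit describes can be sketched as follows. This is a minimal, hypothetical illustration, not DeepSpeed's actual code: the class name, `fetch` signature, and `do_fetch` callback are invented for the example. The idea is the same, though: the first thread to arrive performs the parameter fetch, while concurrently arriving threads (e.g. backward hooks fired from multiple autograd threads) wait for it to finish and return early without touching the shared state.

```python
import threading


class LeafModuleFetchCoordinator:
    """Hypothetical sketch of the one-fetcher-per-module pattern.

    The first thread to call fetch() for a given module performs the
    work; threads that arrive while the fetch is in flight block until
    it completes and then return early, so the shared fetch state is
    only ever mutated by a single thread.
    """

    def __init__(self):
        self._lock = threading.Lock()
        # module id -> Event signalled when that module's fetch finishes
        self._in_flight = {}

    def fetch(self, module_id, do_fetch):
        """Run do_fetch() exactly once per concurrent burst of callers.

        Returns True if this thread performed the fetch, False if it
        waited on another thread's fetch and returned early.
        """
        with self._lock:
            event = self._in_flight.get(module_id)
            if event is None:
                # We are the designated fetcher for this module.
                event = threading.Event()
                self._in_flight[module_id] = event
                is_fetcher = True
            else:
                is_fetcher = False

        if not is_fetcher:
            # Another thread is fetching; wait, then return early.
            event.wait()
            return False

        try:
            do_fetch()  # only this thread mutates shared fetch state
        finally:
            with self._lock:
                del self._in_flight[module_id]
            event.set()  # release the waiting threads
        return True
```

A design note on why the waiters block rather than simply returning: a backward hook must not proceed before the parameters it needs are materialized, so the early-returning threads still wait for the event before continuing, mirroring the "wait and return early" behavior described in the commit message.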