Fix leaf module race condition (#7825)
Fixes #7824
For [leaf
modules](https://deepspeed.readthedocs.io/en/latest/training.html#configuring-zero-leaf-modules),
ZeRO-3 manages all of the module's parameters as a single unit. When such a
module returns multiple output tensors, PyTorch's autograd engine can fire the
corresponding backward hooks from multiple threads concurrently. These hooks
then race on shared state, simultaneously modifying
`__inflight_param_registry` and the parameters' partitioning states.
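A schematic of the hazard, not DeepSpeed's actual code: `inflight_param_registry` below is a plain dict standing in for `__inflight_param_registry`, and the barrier only widens the race window so the check-then-act failure reproduces deterministically.

```python
import threading

inflight_param_registry = {}    # stand-in for ZeRO-3's shared registry
barrier = threading.Barrier(2)  # holds both hooks inside the race window

def backward_hook(param_id):
    # Unsynchronized check-then-act: both hook threads pass the
    # membership test before either one inserts, so the fetch is
    # duplicated and the second pop() raises KeyError.
    if param_id not in inflight_param_registry:
        barrier.wait()  # both threads reach this point...
        inflight_param_registry[param_id] = "fetch-handle"
    barrier.wait()      # ...and both insert before either pops
    inflight_param_registry.pop(param_id)

threads = [threading.Thread(target=backward_hook, args=(0,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # one thread dies with KeyError: 0
```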
This PR ensures that, for leaf modules, only one thread performs the actual
parameter fetch; any other threads that arrive concurrently wait for it to
finish and then return early, eliminating the race.
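A minimal sketch of that single-fetcher pattern, with hypothetical names (`LeafModuleFetcher`, `do_fetch`); it is not the code in this PR, which lives in ZeRO-3's parameter coordination path. The first thread to arrive claims the fetch; later threads block on an event and return early once the owner has finished.

```python
import threading

class LeafModuleFetcher:
    """Sketch: serialize parameter fetching for a leaf module."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # module id -> Event set when the fetch completes

    def fetch(self, module, do_fetch):
        with self._lock:
            done = self._inflight.get(id(module))
            if done is None:
                # First thread in: claim ownership of the fetch.
                done = threading.Event()
                self._inflight[id(module)] = done
                owner = True
            else:
                owner = False
        if not owner:
            done.wait()  # concurrent hook: wait, then return early
            return
        try:
            do_fetch(module)  # only the owner mutates shared state
        finally:
            with self._lock:
                del self._inflight[id(module)]
            done.set()  # wake the waiting threads
```

Claiming ownership under the lock but waiting outside it keeps the critical section small while still guaranteeing the fetch runs exactly once per module.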
---------
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>