DeepSpeed
d7a3972f - Fix ZeRO-3 forward crash on modules with plain dict _parameters (#8009)

Commit
35 days ago
## Summary

Fixes #6961

ZeRO-3 forward crashes with `AttributeError: 'dict' object has no attribute '_in_forward'` since torch 2.5. PyTorch changed `nn.Module._parameters` from `OrderedDict` to plain `dict` (pytorch/pytorch#129164), and a plain `dict` does not allow attribute assignment.

DeepSpeed wraps every module into `ZeROOrderedDict` at engine init via `_inject_parameters`. Any module not present at that point keeps the plain dict and crashes the next forward. This includes a submodule attached after `deepspeed.initialize()` (PEFT/LoRA adapters), or a module restored by `deepspeed/compile/init_z3.py:35`.

The fix adds `ensure_zero_ordered_dict()` and calls it from the forward prologue. It wraps lazily, is idempotent, and keeps the original container so the deepcompile un-injection path still works. The epilogue gets an `isinstance` guard for modules that show up between the two hooks.

This only fixes the crash. Late-attached parameters are still not in the optimizer and not partitioned by ZeRO-3. For full ZeRO-3 semantics on a late adapter, build it inside `deepspeed.zero.Init()`.

## Tests

`tests/unit/runtime/zero/test_zero_late_module_attach.py`

- forward after attaching a Linear post-init, with `_parameters` forced to plain dict so the bug reproduces on any torch version
- repeated forwards do not re-wrap an already-wrapped module

Signed-off-by: Sung Hyun Cho <hope5487@gmail.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
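
For readers unfamiliar with the failure mode, the following is a minimal, library-free sketch of the idea described in the summary. It is not the DeepSpeed implementation: `TrackingDict`, `FakeModule`, `ensure_wrapped`, and the `_original_parameters` attribute are hypothetical stand-ins for `ZeROOrderedDict`, `nn.Module`, `ensure_zero_ordered_dict()`, and whatever the real code uses to keep the original container. It only illustrates why a plain `dict` rejects attribute assignment and how a lazy, idempotent wrap can avoid that.

```python
from collections import OrderedDict


class TrackingDict(OrderedDict):
    """Hypothetical stand-in for ZeROOrderedDict.

    A dict subclass has an instance __dict__, so it accepts attributes.
    """

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._in_forward = False


class FakeModule:
    """Hypothetical stand-in for a module whose _parameters is a plain dict."""

    def __init__(self):
        self._parameters = {"weight": 1.0}


def plain_dict_rejects_attributes():
    """Return True if assigning an attribute on a plain dict raises AttributeError."""
    params = {}
    try:
        params._in_forward = True
    except AttributeError:
        return True
    return False


def ensure_wrapped(module):
    """Wrap module._parameters lazily; return True only when a wrap happened."""
    params = module._parameters
    if isinstance(params, TrackingDict):
        return False  # already wrapped, so repeated calls are no-ops
    # Assumption: keep the original container so it could be restored later.
    module._original_parameters = params
    module._parameters = TrackingDict(params)
    return True


def forward_epilogue(module):
    """Clear the flag only if the container was actually wrapped (isinstance guard)."""
    if isinstance(module._parameters, TrackingDict):
        module._parameters._in_forward = False
```

In this sketch, calling `ensure_wrapped` at the start of every forward is cheap because the `isinstance` check short-circuits once the container has been replaced, and the same check in `forward_epilogue` keeps an unwrapped module from raising.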