DeepSpeed
6b8290a4 - fix(pipeline): set _running_engine_backward for non-last stage backward

Commit
70 days ago
In PipelineEngine._exec_backward_pass(), for non-last stages (e.g., stage 0), torch.autograd.backward() was called directly without setting _running_engine_backward=True. As a result, the post-backward hook (_backward_post_hook) raised a RuntimeError when needs_scaler=True, because it incorrectly detected a backward() call made without proper loss scaling.

The exception raised inside the callback caused the process to hang, which in turn made NCCL collective operations deadlock while waiting for all ranks.

Fix by setting _running_engine_backward=True before calling backward() for non-last stages, and resetting it in a finally block. Also switch to the tensor.backward(gradient) API style instead of torch.autograd.backward(), which integrates properly with DeepSpeed's hooks and loss scaling for non-scalar backward.

Fixes pipeline checkpoint tests timing out with ZeRO stage 1.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
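The flag-plus-finally pattern described in the commit can be sketched as follows. This is a simplified stand-in, not DeepSpeed's actual PipelineEngine: the class `MiniEngine` and its hook wiring are invented for illustration, while the flag name `_running_engine_backward` and the tensor.backward(gradient) call style come from the commit message.

```python
import torch

class MiniEngine:
    """Illustrative stand-in for a pipeline engine; not DeepSpeed's real class."""

    def __init__(self):
        # Flag the post-backward hook checks to confirm backward() was
        # driven by the engine (and thus went through loss scaling).
        self._running_engine_backward = False

    def _backward_post_hook(self):
        # In the bug, this check fired for non-last stages because the
        # flag was never set before torch.autograd.backward() ran.
        if not self._running_engine_backward:
            raise RuntimeError("backward() called outside engine loss scaling")

    def _exec_backward_pass(self, outputs, grad_tensors):
        # The fix: mark the backward pass as engine-driven before calling
        # it on a non-last stage, and always reset the flag afterwards.
        self._running_engine_backward = True
        try:
            # New API style: call backward() on the tensor itself with an
            # explicit gradient, since non-last-stage outputs are non-scalar.
            outputs.backward(gradient=grad_tensors)
            self._backward_post_hook()  # now passes, flag is set
        finally:
            self._running_engine_backward = False

# Usage: a non-scalar output, as a non-last pipeline stage would produce.
x = torch.ones(3, requires_grad=True)
y = x * 2
engine = MiniEngine()
engine._exec_backward_pass(y, torch.ones(3))
print(x.grad.tolist())  # gradient of y = 2*x w.r.t. x is 2.0 per element
```

Resetting the flag in a finally block matters because an exception escaping backward() would otherwise leave the flag set, corrupting the check for subsequent micro-batches.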