fix(pipeline): set _running_engine_backward for non-last stage backward

In PipelineEngine._exec_backward_pass(), for non-last stages (e.g., Stage 0),
torch.autograd.backward() was called directly without setting
_running_engine_backward=True. This caused the post-backward hook
(_backward_post_hook) to raise a RuntimeError when needs_scaler=True
because it incorrectly detected that backward() was called without
proper loss scaling.

The exception raised inside the hook left that rank stuck mid-step, which
in turn caused NCCL collective operations on the remaining ranks to
deadlock while waiting for it.

Fix this by setting _running_engine_backward=True before calling backward()
for non-last stages, and resetting it in a finally block.
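The flag-guard pattern described above can be sketched as follows. This is
a simplified, hypothetical model (the class and method names mirror
DeepSpeed's PipelineEngine, but the bodies here are illustrative only):

```python
# Minimal sketch of the fix: guard engine-driven backward() with a flag,
# and reset it in a finally block so an exception cannot leave it stale.
class Engine:
    def __init__(self):
        self._running_engine_backward = False

    def _backward_post_hook(self):
        # Fires after backward(); rejects backward() calls that bypassed
        # the engine (and thus its loss scaling) when a scaler is needed.
        if not self._running_engine_backward:
            raise RuntimeError("backward() called without engine loss scaling")

    def _exec_backward_pass(self, run_backward):
        self._running_engine_backward = True   # the fix: set before backward()
        try:
            run_backward()                     # stand-in for tensor.backward(grad)
            self._backward_post_hook()         # hook now sees the flag set
        finally:
            self._running_engine_backward = False  # reset even if backward raises
```

The try/finally matters: without it, a failing backward() would leave the
flag set to True and mask the very misuse the hook is meant to catch.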

Also switch to the tensor.backward(gradient) API style instead of
torch.autograd.backward(), which integrates properly with DeepSpeed's
hooks and loss scaling for non-scalar backward.

Fixes pipeline checkpoint tests timing out with ZeRO stage 1.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>