DeepSpeed
aeb10bb1 - Fix ZeRO-1/2 CPU-offloaded gradient loss with multiple backward() per step (#7981)

## Summary

ZeRO-1/2 + `offload_optimizer` + `gradient_accumulation_steps=1` with multiple `engine.backward()` calls per optimizer step (via `set_gradient_accumulation_boundary()`, formalized in #7665) silently drops all but the last backward's gradients. `copy_grads_in_partition` only called `async_accumulate_grad_in_cpu_via_gpu` under `if gradient_accumulation_steps > 1`, so with `ga_steps=1` the reduced gradients from intermediate backwards were never stored. The boundary `async_inplace_copy_grad_to_fp32_buffer_from_gpu` then overwrote (rather than added to) the fp32 buffer with the last chunk only. ZeRO-3 + offload and non-offload ZeRO-1/2 are unaffected.

## Fix

Replace the `ga > 1` gate with one that fires exactly when a CPU accumulator is needed:

```python
if self.micro_step_id > 0 or not self.is_gradient_accumulation_boundary:
    self.async_accumulate_grad_in_cpu_via_gpu(param)
```

- `ga_steps=1` + single `backward()` → skipped. No CPU buffer, no extra copy; the fast path is preserved.
- `ga_steps=1` + multiple backwards → accumulates correctly across calls (see the sketch at the end of this message).
- `ga_steps>1` → identical to prior behaviour.

## Measurement

2x H100, 3-layer MLP, Adam, lr=1e-3, N=4 backwards/step, ga_steps=1.

Max param diff vs. no-offload reference:

|        | fp32                            | bf16     |
| ------ | ------------------------------- | -------- |
| Before | 2.00e-03 (wrong, ~2 × lr)       | —        |
| After  | 7.45e-09 (noise)                | 0.00e+00 |

## Tests

New `tests/unit/v1/zero/test_zero2_offload_multi_backward.py`, parametrized over ZeRO-1/2:

- multi-backward offload matches no-offload
- single-backward unchanged
- multi-step state-leak guard
- single-backward allocates no CPU buffer (perf guard)
- `ga_steps>1` + offload unchanged (#7967 regression guard)

---------

Signed-off-by: Sung Hyun Cho <hope5487@gmail.com>
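For context, a minimal sketch of the multi-backward training-loop pattern this fix targets. `model`, `micro_batches`, and the config values are illustrative placeholders, not from the PR; only `set_gradient_accumulation_boundary()`, `backward()`, and `step()` are the engine calls discussed above.

```python
# Sketch: multiple backward() calls per optimizer step with ZeRO-2 +
# CPU-offloaded optimizer and gradient_accumulation_steps=1 -- the
# configuration whose intermediate gradients were previously dropped.
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
    },
}

# `model` is a placeholder torch.nn.Module whose forward returns a loss.
engine, _, _, _ = deepspeed.initialize(model=model,
                                       model_parameters=model.parameters(),
                                       config=ds_config)

for i, batch in enumerate(micro_batches):  # N backwards per optimizer step
    # Mark only the last backward as the optimizer-step boundary (#7665).
    engine.set_gradient_accumulation_boundary(i == len(micro_batches) - 1)
    loss = engine(batch)
    engine.backward(loss)  # pre-fix: only the last call's gradients survived
engine.step()
```

With the fix, each non-boundary backward's reduced partition is accumulated into the CPU buffer, so `engine.step()` sees the sum of all N gradients rather than only the last chunk.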