Fix BF16_Optimizer last-microbatch grad leak under ZeRO-1 (#7985)
# Fix BF16_Optimizer last-microbatch grad leak under ZeRO-1
## Summary
`DeepSpeedEngine._backward_epilogue` calls `allreduce_gradients()`
**before** `optimizer.backward_epilogue()`. For `BF16_Optimizer` (used
when `bf16` model + `grad_accum_dtype: fp32` + ZeRO stage 1) without
`immediate_grad_update`, this means the **boundary microbatch's gradient
is added to the rank-local fp32 accumulator AFTER the cross-rank
allreduce, so it is silently skipped from the average.**
The bias is `(world_size − 1) / world_size × 1 /
gradient_accumulation_steps` of the per-step gradient. Because the bias
scales with per-microbatch grad weight, training trajectories visibly
diverge depending on `per_device_train_batch_size` even with identical
effective batch size — the symptom users see is loss / grad-norm curves
drifting apart between otherwise-equivalent configs.
The bug is reproducible in DeepSpeed 0.18.6 through current `master`
(0.18.10 at time of writing).
## Fix
Swap the order so `optimizer.backward_epilogue()` runs before
`allreduce_gradients()`, with `exit_backward()` left after.
`exit_backward()` only manages backward-hook state
(`_backward_hook_state`); it has no ordering dependency on the gradient
accumulator.
```python
def _backward_epilogue(self):
self._stop_timers(self.engine_timers.backward_inner_timers)
self._start_timers(self.engine_timers.backward_reduce_timers)
# NEW: run backward_epilogue() before allreduce so the boundary microbatch
# grad lands in the optimizer accumulator that gets reduced.
if isinstance(self.optimizer, ZeROOptimizer):
self.optimizer.backward_epilogue()
if self.enable_backward_allreduce and not self.inside_no_sync_ctxt:
self.allreduce_gradients()
if isinstance(self.optimizer, ZeROOptimizer):
self.optimizer.exit_backward()
see_memory_usage("Engine after backward", force=self.memory_breakdown())
self._stop_timers(self.engine_timers.backward_reduce_timers)
self._stop_timers(self.engine_timers.backward_timers)
```
Diff: +10 / −1, single file (`deepspeed/runtime/engine.py`).
## Root cause walkthrough
The bug requires **both** of the following to be true:
1. The accumulator that `optimizer.backward_epilogue()` mutates is the
**same tensor** that `engine.allreduce_gradients()` later reduces, AND
2. The accumulator is updated *only* by `optimizer.backward_epilogue()`
(no per-param hooks updating it inline during backward).
Both conditions hold for `BF16_Optimizer` without
`immediate_grad_update`:
- It maintains a **separate fp32 accumulator**
(`fp32_groups_gradients_flat`) — distinct from `param.grad`.
- Its `backward_epilogue()` calls `update_hp_grads()` which casts each
param's bf16 `lp.grad` to fp32 and adds it into that accumulator (and
only this code path fills the accumulator when
`immediate_grad_update=False`).
- `engine.allreduce_gradients()` → `buffered_allreduce_fallback()` →
`optimizer.get_grads_for_reduction()` returns **the same
`non_expert_gradients` list = `fp32_groups_gradients_flat`**.
So on the gradient-accumulation boundary microbatch:
1. `loss.backward()` populates bf16 `lp.grad` for that microbatch.
2. `_backward_epilogue` first calls `allreduce_gradients()`. The fp32
accumulator at this point contains microbatches `0..ga-2`'s grads
(summed locally on each rank). The allreduce averages **only that**
across ranks.
3. `_backward_epilogue` then calls `optimizer.backward_epilogue()` →
`update_hp_grads()` → adds the boundary microbatch's local `lp.grad` to
the now-allreduced accumulator.
Result, per rank `i`:
```
fp32_buffer_rank_i = avg_ranks(Σ_{m=0..ga-2} grad_m) + local_grad_{ga-1}_rank_i
└────── shared across ranks ─────┘ └─── rank-private leak ───┘
```
ZeRO-1 partitions optimizer states across ranks, so each rank then runs
`optimizer.step()` on its slice of this rank-divergent buffer;
`update_lp_params()` allgathers the bf16 params back. The effective
gradient applied to parameter `p` is:
```
g_p = avg_ranks(prior microbatches' grad for p) + local_grad_last_for_p_at_owning_rank
```
i.e. the boundary microbatch's contribution captures only `1 /
world_size` of the cross-rank average, biasing the global gradient by
`(world_size − 1) / world_size × 1 / ga_steps`.
## Impact on other optimizers (no behavior change)
| Optimizer | Accumulator | How accumulator is filled | Reduction path |
Affected by current bug? | Affected by this fix? |
|---|---|---|---|---|---|
| `BF16_Optimizer` (immediate_grad_update=False) | separate
`fp32_groups_gradients_flat` | only via `optimizer.backward_epilogue()`
→ `update_hp_grads()` | `engine.allreduce_gradients` →
`buffered_allreduce_fallback` → `optimizer.get_grads_for_reduction()`
returns the same fp32 buffer | **Yes — leak** | **Yes — leak fixed** |
| `BF16_Optimizer` (immediate_grad_update=True) | same fp32 buffer |
per-param hooks (`create_grad_acc_hooks`) fire inline during backward |
same allreduce path | No (hooks already filled buffer before allreduce)
| No-op (`update_hp_grads` early-returns when `immediate_grad_update`) |
| `DeepSpeedZeroOptimizer_Stage1And2` (ZeRO-1, default for
bf16+bf16-grad-accum) | `param.grad` directly + ipg buckets | hooks fire
inline during backward (`overlap_comm=True` default), or
`reduce_gradients()` walks all params at boundary |
`engine.allreduce_gradients` takes the `if hasattr(self.optimizer,
'reduce_gradients')` branch → `optimizer.reduce_gradients()` walks all
params; the boundary microbatch's grad is already on `param.grad`
(autograd populates this before `_backward_epilogue` runs) | No | No-op
(`Stage1And2.backward_epilogue` does not mutate the reduction buffer) |
| `DeepSpeedZeroOptimizer_Stage3` | partitioned via
`overlapping_partition_gradients_reduce_epilogue()` | reduce-scatter
inline during backward via hooks | `engine.allreduce_gradients` takes
the `if zero_optimization_partition_gradients()` branch → calls
overlapping epilogue, which is fed by hooks | No | No-op |
In short: the fix is functionally relevant **only** for `BF16_Optimizer`
without `immediate_grad_update`. For every other ZeRO optimizer the
change is observably a no-op because their `backward_epilogue` does not
mutate the buffer being reduced.
## Reproducer
The minimum reproducer is a 2-rank standalone script that runs one
gradient-accumulation cycle and prints the per-rank fp32 accumulator
norm at each microbatch and immediately before `engine.step()`. With the
bug present the per-rank values **disagree at the boundary microbatch
and going into the optimizer step**; with the fix they agree.
Save as `probe_bf16_grad_accum.py`:
```python
"""Probe whether DeepSpeed's BF16_Optimizer leaks the boundary microbatch grad
out of the cross-rank average. Run with two ranks, e.g.:
deepspeed --num_gpus 2 probe_bf16_grad_accum.py
or
accelerate launch --num_processes 2 --num_machines 1 probe_bf16_grad_accum.py
"""
import os
import torch
import torch.nn as nn
import deepspeed
import torch.distributed as dist
def main():
rank = int(os.environ.get("RANK", "0"))
world = int(os.environ.get("WORLD_SIZE", "1"))
GA_STEPS = 4
HIDDEN = 64
BATCH = 4
torch.manual_seed(0) # SAME init across ranks
model = nn.Sequential(
nn.Linear(HIDDEN, HIDDEN),
nn.GELU(),
nn.Linear(HIDDEN, HIDDEN),
).to(torch.bfloat16).cuda()
ds_config = {
"bf16": {"enabled": True}, # set "immediate_grad_update": True to also bypass the bug
"data_types": {"grad_accum_dtype": "fp32"},
"communication_data_type": "fp32",
"zero_optimization": {
"stage": 1,
"overlap_comm": True,
"contiguous_gradients": True,
"reduce_scatter": True,
"allgather_partitions": True,
"allgather_bucket_size": 200000000,
"reduce_bucket_size": 200000000,
},
"train_micro_batch_size_per_gpu": BATCH,
"gradient_accumulation_steps": GA_STEPS,
"train_batch_size": BATCH * world * GA_STEPS,
"gradient_clipping": 0.0,
"steps_per_print": 9999,
}
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, betas=(0.9, 0.95))
engine, _, _, _ = deepspeed.initialize(model=model, optimizer=optimizer, config=ds_config)
bf16_opt = engine.optimizer
if rank == 0:
print(f"[INFO] DeepSpeed optimizer class: {type(bf16_opt).__name__}", flush=True)
print(f"[INFO] grad_acc_dtype = {bf16_opt.grad_acc_dtype}", flush=True)
print(f"[INFO] world={world} ga={GA_STEPS} batch={BATCH}", flush=True)
def buffer_summary(label):
bufs = bf16_opt.fp32_groups_gradients_flat
local_norm = sum(b.detach().to(torch.float64).norm().item() ** 2 for b in bufs) ** 0.5
rank_buf = torch.tensor([local_norm], device="cuda", dtype=torch.float64)
gathered = [torch.zeros_like(rank_buf) for _ in range(world)]
dist.all_gather(gathered, rank_buf)
if rank == 0:
vals = [g.item() for g in gathered]
diffs = [(v - vals[0]) / vals[0] * 100 if vals[0] != 0 else 0 for v in vals]
print(f"[{label}] per_rank={vals} diff_pct_vs_rank0={diffs}", flush=True)
dist.barrier()
buffer_summary("init (zero)")
# Different inputs per rank so per-rank grads differ.
torch.manual_seed(100 + rank)
inputs = [torch.randn(BATCH, HIDDEN, device="cuda", dtype=torch.bfloat16) for _ in range(GA_STEPS)]
targets = [torch.randn(BATCH, HIDDEN, device="cuda", dtype=torch.bfloat16) for _ in range(GA_STEPS)]
for i in range(GA_STEPS):
is_boundary = (i == GA_STEPS - 1)
engine.set_gradient_accumulation_boundary(is_boundary=is_boundary) # what accelerate's wrapper does
out = engine(inputs[i])
loss = ((out - targets[i]) ** 2).mean()
engine.backward(loss)
buffer_summary(f"after backward microbatch {i} (boundary={is_boundary})")
buffer_summary("BEFORE engine.step()")
engine.step()
if __name__ == "__main__":
main()
```
## Verification
### Probe (synthetic, 2 GPUs, 1 grad-accum cycle)
Run on `master` (bug):
```
[init (zero)] per_rank=[0.0, 0.0 ] diff = 0%
[after microbatch 0 (boundary=False)] per_rank=[0.1495, 0.1378] diff = -7.78% (local-only accumulation)
[after microbatch 1 (boundary=False)] per_rank=[0.1998, 0.2098] diff = +5.03%
[after microbatch 2 (boundary=False)] per_rank=[0.2322, 0.2434] diff = +4.82%
[after microbatch 3 (boundary=True)] per_rank=[0.2206, 0.2123] diff = -3.80% ← bug
[BEFORE engine.step()] per_rank=[0.2206, 0.2123] diff = -3.80% ← bug
```
Run on this PR (fixed):
```
[init (zero)] per_rank=[0.0, 0.0 ] diff = 0%
[after microbatch 0 (boundary=False)] per_rank=[0.1495, 0.1378] diff = -7.78%
[after microbatch 1 (boundary=False)] per_rank=[0.1998, 0.2098] diff = +5.03%
[after microbatch 2 (boundary=False)] per_rank=[0.2322, 0.2434] diff = +4.82%
[after microbatch 3 (boundary=True)] per_rank=[0.1942, 0.1942] diff = 0.00% ← fixed
[BEFORE engine.step()] per_rank=[0.1942, 0.1942] diff = 0.00% ← fixed
```
The same agreement is reproduced by the existing `bf16: {
immediate_grad_update: true }` workaround, which uses per-param hooks to
fill the fp32 accumulator inline during backward (and is therefore not
affected by the `_backward_epilogue` ordering).
### End-to-end training (HuggingFace Trainer + accelerate + DeepSpeed,
2× A100)
A small custom Qwen3-derived model (~64M params, bf16, ZeRO-1 with
`grad_accum_dtype: fp32`), 10 optimizer steps, identical seed and data
ordering, identical effective batch size (`global_batch_size = 64`),
only `per_device_train_batch_size` varies (so
`gradient_accumulation_steps = global_batch_size / (per_device *
world_size)` differs).
| Configuration | Run A: per_device=2, ga=16 | Run B: per_device=8, ga=4
| Final loss gap |
|---|---|---|---|
| DeepSpeed `master` + `grad_accum_dtype: fp32` (broken) | `train_loss =
6.896` | `train_loss = 6.999` | **0.103** |
| No DeepSpeed (DDP + native grad-accum, control) | `train_loss =
6.9035` | `train_loss = 6.9037` | 0.0002 (bf16 noise) |
| `master` + `bf16.immediate_grad_update: true` (existing workaround) |
`train_loss = 6.9057` | `train_loss = 6.9057` | < 0.0001 |
| **This PR + original config** | `train_loss = 6.9057` | `train_loss =
6.9059` | 0.0002 (bf16 noise) |
The broken case also produces qualitatively misleading instabilities —
e.g. at step 5 in the broken run, B's grad-norm spikes to **17.0** vs
A's **1.35** (≈ 12× ratio), while in the fixed case the two grad-norm
trajectories agree to within bf16 noise at every step.
Per-step loss / grad-norm trajectories under the fixed engine (this PR),
for completeness:
| step | A loss | A gnorm | B loss | B gnorm |
|---|---|---|---|---|
| 1 | 9.1907 | 8.0613 | 9.1907 | 8.0615 |
| 2 | 8.1962 | 5.0553 | 8.1961 | 5.0561 |
| 3 | 7.2035 | 2.2668 | 7.2035 | 2.2683 |
| 4 | 7.0588 | 3.1079 | 7.0618 | 3.1118 |
| 5 | 6.5661 | 2.6192 | 6.5627 | 2.4634 |
| 6 | 6.3097 | 1.8798 | 6.3086 | 1.8913 |
| 7 | 6.1332 | 1.2811 | 6.1317 | 1.2584 |
| 8 | 6.1297 | 2.8776 | 6.1305 | 2.9574 |
| 9 | 6.1739 | 1.4336 | 6.1748 | 1.4687 |
| 10 | 6.0950 | 1.5941 | 6.0957 | 1.6031 |
## Notes
- `BF16_Optimizer` users on `master` who are not pinning
`per_device_train_batch_size` may see silently degraded training when
sweeping per-device batch sizes (the symptom that triggered this
investigation). The bug also makes per-rank model weights briefly
diverge between the optimizer step and the next `update_lp_params()`
allgather, which means cross-rank invariants (e.g. asserts that compare
per-rank state) can flip behavior depending on
`gradient_accumulation_steps`.
- Tested on DeepSpeed 0.18.6 (where the bug was first observed) and
confirmed unchanged on `master` (0.18.10).
- No new tests are added in this PR, but a regression test that asserts
cross-rank fp32 buffer agreement after the boundary microbatch in
`BF16_Optimizer` would be a natural follow-up.
---------
Signed-off-by: Max Yu <18641481+maxyu1115@users.noreply.github.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
Co-authored-by: Masahiro Tanaka <mtanaka@anyscale.com>