Fix DeepCompile ZeRO-1 grad target lifetime (#8036)
DeepCompile ZeRO-1 kept compile-time reduce target buffers alive into
the optimizer step, so backward gradient storage stayed allocated
alongside optimizer temporaries. This PR fixes the issue by making DeepCompile
ZeRO-1 reduce targets follow the normal step-local ZeRO partition
gradient-buffer lifetime, instead of preserving cloned target storage
from compile setup.
The actual code changes are:
- During compile initialization, register empty DeepCompile ZeRO-1
gradient targets, then bind them to the step-local flat ZeRO partition
gradient buffer and per-parameter views when gradients are ready to
synchronize.
- After ZeRO-1 builds the optimizer-facing fp32 gradient partition,
release the DeepCompile registry references and clear reduce bucket
storage after backward synchronization.
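
The lifetime change described in the two bullets above can be sketched as follows. This is a minimal illustration, not the DeepSpeed implementation: `FlatGradBuffer`, `GradTargetRegistry`, `run_step`, and the `layout` mapping are hypothetical names, and `array`/`memoryview` stand in for tensors and tensor views.

```python
import weakref
from array import array


class FlatGradBuffer:
    """Hypothetical stand-in for the step-local flat partition gradient buffer."""

    def __init__(self, numel):
        self.data = array("f", [0.0] * numel)

    def view(self, offset, numel):
        # A memoryview slice shares storage with the flat buffer (no copy).
        return memoryview(self.data)[offset:offset + numel]


class GradTargetRegistry:
    """Hypothetical registry mapping parameter names to reduce targets."""

    def __init__(self, names):
        # Compile time: register empty targets, so no storage is held yet.
        self.targets = {name: None for name in names}
        self.flat = None

    def bind(self, flat, layout):
        # Gradients ready to synchronize: bind targets to step-local storage.
        self.flat = flat
        for name, (offset, numel) in layout.items():
            self.targets[name] = flat.view(offset, numel)

    def release(self):
        # Drop every reference so the flat buffer can be freed.
        for name, view in self.targets.items():
            if view is not None:
                view.release()
            self.targets[name] = None
        self.flat = None


def run_step(registry, layout, numel):
    flat = FlatGradBuffer(numel)
    flat_ref = weakref.ref(flat)  # lets the caller observe the buffer's lifetime
    registry.bind(flat, layout)
    # Stand-in for backward: write gradients through the per-parameter views.
    for view in registry.targets.values():
        for i in range(len(view)):
            view[i] = 1.0
    # Stand-in for building the optimizer-facing gradient partition (a copy).
    optimizer_partition = list(flat.data)
    # Release registry references before the optimizer step would run.
    registry.release()
    return optimizer_partition, flat_ref
```

Under CPython reference counting, `flat_ref()` is expected to return `None` once `run_step` returns, which is the property the fix aims for: nothing registered at compile time keeps gradient storage alive into the optimizer step.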
| | Step 10-30 avg (s) | Peak alloc (GiB) |
| --- | ---: | ---: |
| Without this PR | 0.858 | 43.594 |
| With this PR | 0.859 | 39.366 |
Fine-tuning-style training (8xH100, Qwen3-8B with random weights, bs/GPU=1,
seq=4096, GAS=1) produced only finite loss values over 1000 steps.
---------
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>