DeepSpeed
11eeb7cd - Fix DeepCompile ZeRO-1 grad target lifetime (#8036)

Commit
1 day ago
DeepCompile ZeRO-1 kept compile-time reduce target buffers alive into the optimizer step, causing backward gradient storage to overlap with optimizer temporaries.

This PR fixes the issue by making DeepCompile ZeRO-1 reduce targets follow the normal step-local ZeRO partition gradient-buffer lifetime, instead of preserving cloned target storage from compile setup.

The actual code changes are:

- During compile initialization, register empty DeepCompile ZeRO-1 gradient targets, then bind them to the step-local flat ZeRO partition gradient buffer and per-parameter views when gradients are ready to synchronize.
- After ZeRO-1 builds the optimizer-facing fp32 gradient partition, release the DeepCompile registry references and clear reduce bucket storage after backward synchronization.

| | Step 10-30 avg sec | Peak alloc GiB |
| --- | --- | ---: |
| Without this PR | 0.858 | 43.594 |
| With this PR | 0.859 | 39.366 |

Fine-tuning style training (8xH100, Qwen3-8B random weights, bs/GPU=1, seq=4096, GAS=1) showed only finite loss values for 1000 steps.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
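The lifetime change described in the commit message can be illustrated with a toy, standard-library-only sketch. This is not DeepSpeed code: `GradTargetRegistry`, `run_step`, and the use of `array`/`memoryview` in place of GPU tensors are all assumptions made for illustration. It only shows the pattern of registering empty targets up front, binding them to views of a step-local flat buffer, and dropping those references once the optimizer-facing copy exists.

```python
from array import array


class GradTargetRegistry:
    """Hypothetical stand-in for a compile-time registry of reduce targets."""

    def __init__(self, param_sizes):
        # Compile time: sizes are known, but no gradient storage is held yet.
        self.param_sizes = dict(param_sizes)
        self.targets = {name: None for name in self.param_sizes}

    def bind(self, flat_buffer):
        # Point each target at a per-parameter view into the step-local buffer.
        view = memoryview(flat_buffer)
        offset = 0
        for name, size in self.param_sizes.items():
            self.targets[name] = view[offset:offset + size]
            offset += size

    def release(self):
        # Drop every view so the registry no longer keeps the buffer alive.
        for name, target in self.targets.items():
            if target is not None:
                target.release()
            self.targets[name] = None


def run_step(registry, grads_by_param):
    """One toy step: bind, write gradients, copy out, release."""
    total = sum(registry.param_sizes.values())
    flat = array("f", [0.0] * total)  # step-local flat gradient buffer
    registry.bind(flat)

    # "Backward": gradients are written through the per-parameter views.
    for name, values in grads_by_param.items():
        target = registry.targets[name]
        for i, value in enumerate(values):
            target[i] = value

    # Build the optimizer-facing copy, then release the registry references
    # so `flat` can be freed when this function returns.
    optimizer_grads = array("f", flat)
    registry.release()
    return optimizer_grads


registry = GradTargetRegistry({"weight": 2, "bias": 1})
print(list(run_step(registry, {"weight": [1.0, 2.0], "bias": [3.0]})))
```

The point of the design is that the registry holds no storage of its own between steps. In the sketch, once `release()` has run, the only remaining reference to the flat buffer is the local variable inside `run_step`.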