Fix ZeRO stage 1 and add stage 2 support with DeepCompile (#7366)
This PR fixes the behavior of DeepCompile's ZeRO stage 1 and adds stage
2 support.
DeepCompile's ZeRO1 currently performs allreduce at every iteration, even
when the iteration is not a gradient accumulation boundary. This
significantly degrades performance when gradient accumulation is enabled.
This PR fixes the issue by performing allreduce only at gradient
accumulation boundaries.
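The effect of the fix on communication volume can be illustrated with a small, framework-free sketch. It is not DeepCompile's implementation: the helper names `is_accumulation_boundary` and `count_allreduce_calls` are hypothetical, and the allreduce is replaced by a counter so the example runs without a distributed setup.

```python
def is_accumulation_boundary(micro_step: int, grad_accum_steps: int) -> bool:
    """Return True when the 0-based micro_step closes an accumulation window."""
    return (micro_step + 1) % grad_accum_steps == 0


def count_allreduce_calls(num_micro_steps: int, grad_accum_steps: int,
                          boundary_only: bool) -> int:
    """Count how many allreduce calls a training loop would issue.

    boundary_only=False models the behavior before the fix (one allreduce
    per micro-step); boundary_only=True models the behavior after the fix.
    """
    calls = 0
    for micro_step in range(num_micro_steps):
        # A real loop would run forward/backward here and accumulate
        # gradients locally.
        if not boundary_only or is_accumulation_boundary(micro_step, grad_accum_steps):
            calls += 1  # stand-in for an allreduce of the accumulated gradients
    return calls


if __name__ == "__main__":
    print(count_allreduce_calls(8, 4, boundary_only=False))  # before the fix
    print(count_allreduce_calls(8, 4, boundary_only=True))   # after the fix
```

With 8 micro-steps and 4 accumulation steps, the boundary-only variant issues one allreduce per optimizer step instead of one per micro-step.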
Because the existing ZeRO1 behavior (allreduce at every iteration) is
similar to ZeRO2, this PR also adds ZeRO2 support to DeepCompile. The
ZeRO stage can now be set to 2 with DeepCompile.
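As a rough sketch of what such a setup might look like, the dictionary below combines ZeRO stage 2 with DeepCompile. The `compile.deepcompile` key is assumed from DeepSpeed's DeepCompile documentation rather than taken from this PR, and the batch size and accumulation values are purely illustrative.

```python
import json

# Hypothetical DeepSpeed config; the values are illustrative, not recommendations.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 4,
    "zero_optimization": {"stage": 2},   # stage 2 is newly supported by this PR
    "compile": {"deepcompile": True},    # assumed key for enabling DeepCompile
}

if __name__ == "__main__":
    # Serialize to JSON, e.g. to save as a config file passed to DeepSpeed.
    print(json.dumps(ds_config, indent=2))
```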
The loss values, performance, and memory usage were verified using this
[verification tool](https://github.com/tohtana/ds_verify_loss)
([results](https://github.com/tohtana/ds_verify_loss/blob/main/results/results_20250617_035117/report.md)).
---------
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>