Add sequential allgather optimization for ZeRO-3 (#7661)
* Perform allgather operations on parameters sequentially instead of
coalescing them into large buckets.
* Significantly reduce peak memory usage under high memory
pressure.
* Improve performance by minimizing temporary buffer requirements.
* Enable the behavior with a new boolean flag in the
`zero_optimization` section of the config:
```json
"zero_optimization": {
  "stage3_allgather_sequential": true
}
```
* The optimization is disabled by default.
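
For reference, a minimal sketch of how the flag might sit in a full config when the config is built in Python. The helper name `build_zero3_config` and the `train_batch_size` value are illustrative assumptions and not part of this change; only the `stage3_allgather_sequential` key comes from it.

```python
import json


def build_zero3_config(sequential_allgather: bool = False) -> dict:
    """Return a minimal DeepSpeed-style config dict (hypothetical helper)."""
    return {
        # Illustrative batch size; not taken from this change.
        "train_batch_size": 8,
        "zero_optimization": {
            "stage": 3,  # the new flag targets ZeRO stage 3
            # Flag added by this change; stays off unless set explicitly.
            "stage3_allgather_sequential": sequential_allgather,
        },
    }


if __name__ == "__main__":
    # Print JSON that could be saved to a config file.
    print(json.dumps(build_zero3_config(sequential_allgather=True), indent=2))
```

Keeping the flag as an explicit function argument makes it easy to compare runs with and without sequential allgather.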
---------
Signed-off-by: aeeeeeep <aeeeeeep@proton.me>