BF16 optimizer: Improve device utilization by immediate grad update (#4975)
Enabled gradient accumulation in the BF16 optimizer that updates fp32
gradients as soon as they are available.
This improves device utilization on some back-ends by parallelizing the
workload across engines.
To enable the feature, set the new config flag "immediate_grad_update"
in the "bf16" section of the DeepSpeed config.json (default is false).
Example:
  "bf16": {
    "enabled": true,
    "immediate_grad_update": true
  }
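If the config is built in Python rather than edited by hand, the same flag can be set on a dict before it is written out as JSON. The snippet below is only a sketch: the helper name `set_immediate_grad_update` is hypothetical and not part of DeepSpeed, and the only keys it relies on are the "bf16" section and "immediate_grad_update" flag described above.

```python
import copy
import json


def set_immediate_grad_update(config, enabled=True):
    """Hypothetical helper (not a DeepSpeed API): return a copy of a
    DeepSpeed-style config dict with bf16.immediate_grad_update set."""
    updated = copy.deepcopy(config)  # leave the caller's dict untouched
    # Create the "bf16" section if it is missing, then set the flag.
    updated.setdefault("bf16", {})["immediate_grad_update"] = enabled
    return updated


base = {"bf16": {"enabled": True}}
ds_config = set_immediate_grad_update(base)

# Serialize in the same shape as the config.json example above.
print(json.dumps(ds_config, indent=2))
```

The helper returns a copy, so the original dict keeps the default behavior (flag absent, which is treated as false).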
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>