Support fp32 gradient accumulation for bf16 models (#2566)
* Allow a bf16 model to use fp32 as the gradient accumulation data type
* Allow fp32 gradient accumulation with a bfloat16 model in AMP mode
* Alternative fix for the gradient accumulation type mismatch: with the ZeRO optimizer, the gradient accumulation type must equal the model data type
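
The rule described above can be sketched as a small dtype check. This is an illustration only, not the code from this change: the function name `resolve_grad_accum_dtype`, the string dtype labels, the `zero_enabled` flag, and the default of falling back to the model dtype are all assumptions made for the example.

```python
from typing import Optional

# Hypothetical dtype labels used only for this sketch.
SUPPORTED_DTYPES = ("fp32", "fp16", "bf16")


def resolve_grad_accum_dtype(model_dtype: str,
                             requested: Optional[str] = None,
                             zero_enabled: bool = False) -> str:
    """Pick the gradient accumulation dtype for a given model dtype.

    Hypothetical helper: it mirrors the rules in the bullets above,
    not the actual implementation.
    """
    if model_dtype not in SUPPORTED_DTYPES:
        raise ValueError(f"unsupported model dtype: {model_dtype}")

    # Assumption: with no explicit request, accumulate in the model dtype.
    if requested is None:
        return model_dtype

    if requested not in SUPPORTED_DTYPES:
        raise ValueError(f"unsupported grad accum dtype: {requested}")

    # With the ZeRO optimizer, the accumulation dtype must match the model dtype.
    if zero_enabled and requested != model_dtype:
        raise ValueError(
            f"ZeRO requires grad accum dtype == model dtype "
            f"(got {requested} vs {model_dtype})")

    # Otherwise a mixed combination such as bf16 model + fp32 accumulation is allowed.
    return requested
```

Under these assumptions, `resolve_grad_accum_dtype("bf16", "fp32")` returns `"fp32"`, while the same call with `zero_enabled=True` raises `ValueError`.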
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>