Using cublasGemmBatchedEx/cublasGemmStridedBatchedEx for training (#4731)
* Use the cuBLAS extension API for FP16
* Using cublasGemmBatchedEx/cublasGemmStridedBatchedEx for training
To avoid accuracy loss, the accumulation needs to be done in FP32 for training.
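
A minimal sketch of the pattern this change relies on (not the exact kernel from this PR): a strided-batched GEMM with FP16 inputs and outputs that accumulates in FP32 via the cuBLAS Ex API. The wrapper name, shapes, leading dimensions, and strides below are illustrative assumptions.

```c
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Hypothetical helper: FP16 in/out, FP32 accumulation (column-major, no transpose).
cublasStatus_t HalfGemmStridedBatchedFp32Acc(
    cublasHandle_t handle,
    int m, int n, int k, int batch,
    const __half* A, const __half* B, __half* C) {
  // alpha/beta must match the compute type, so they are float here.
  const float alpha = 1.0f;
  const float beta  = 0.0f;
  long long strideA = (long long)m * k;
  long long strideB = (long long)k * n;
  long long strideC = (long long)m * n;
  return cublasGemmStridedBatchedEx(
      handle, CUBLAS_OP_N, CUBLAS_OP_N,
      m, n, k,
      &alpha,
      A, CUDA_R_16F, m, strideA,
      B, CUDA_R_16F, k, strideB,
      &beta,
      C, CUDA_R_16F, m, strideC,
      batch,
      CUBLAS_COMPUTE_32F,        // FP32 accumulation (CUDA 11+; older
                                 // toolkits pass CUDA_R_32F here instead)
      CUBLAS_GEMM_DEFAULT);
}
```

cublasGemmBatchedEx follows the same pattern but takes arrays of matrix pointers instead of fixed strides.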
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>