enable mkl_gemm_bf16bf16f32 in cpublas::gemm (#107196)
This patch adds a wrapper around `mkl_gemm_bf16bf16f32`, which is used in the flash attention kernel on 4th gen Intel Xeon.
A fallback path is also implemented in `cpublas::gemm` for the case where `mkl_gemm_bf16bf16f32` is not available.
The primary target of this change is to help build kernels in `scaled_dot_product_attention`, e.g. flash attention and efficient attention. In the attention kernel, `attn = q @ k.T`, where q and k are given as bfloat16 and attn is float32. This benefits both performance and accuracy, since attn is used to compute the lazy softmax, which has to be done in float32.
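To make the bf16 * bf16 -> fp32 shape of `attn = q @ k.T` concrete, here is a minimal scalar sketch. It is not the kernel from this patch: `bf16_bits`, `to_bf16`, `from_bf16` and `attn_scores_sketch` are hypothetical names, bfloat16 is emulated as the upper 16 bits of a float32, and a row-major layout is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical bfloat16 stand-in: the upper 16 bits of an IEEE-754 float32.
using bf16_bits = std::uint16_t;

// float32 -> bfloat16 with round-to-nearest-even (NaN handling omitted).
inline bf16_bits to_bf16(float f) {
  std::uint32_t u;
  std::memcpy(&u, &f, sizeof(u));
  u += 0x7FFFu + ((u >> 16) & 1u);
  return static_cast<bf16_bits>(u >> 16);
}

// bfloat16 -> float32 is exact: pad the low 16 bits with zeros.
inline float from_bf16(bf16_bits h) {
  std::uint32_t u = static_cast<std::uint32_t>(h) << 16;
  float f;
  std::memcpy(&f, &u, sizeof(f));
  return f;
}

// attn = q @ k.T, i.e. attn[i][j] = sum_p q[i][p] * k[j][p].
// q is m x d and k is n x d (row-major, bfloat16); attn is m x n (float32).
std::vector<float> attn_scores_sketch(int m, int n, int d,
                                      const std::vector<bf16_bits>& q,
                                      const std::vector<bf16_bits>& k) {
  assert(q.size() == static_cast<std::size_t>(m) * d);
  assert(k.size() == static_cast<std::size_t>(n) * d);
  std::vector<float> attn(static_cast<std::size_t>(m) * n, 0.0f);
  for (int i = 0; i < m; ++i) {
    for (int j = 0; j < n; ++j) {
      float acc = 0.0f;  // accumulate and store in float32
      for (int p = 0; p < d; ++p) {
        acc += from_bf16(q[static_cast<std::size_t>(i) * d + p]) *
               from_bf16(k[static_cast<std::size_t>(j) * d + p]);
      }
      attn[static_cast<std::size_t>(i) * n + j] = acc;
    }
  }
  return attn;
}
```

For example, with q = [1, 2^-8] and k = [1, 1], the float32 score is 1.00390625, whereas rounding that score to bfloat16 would give 1.0. Keeping attn in float32 preserves the small term for the softmax that follows.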
This patch also adds the OpenBLAS routine `sbgemm_`, which likewise has a bf16 * bf16 -> fp32 signature. Since the OpenBLAS routine is named differently from MKL's, we cannot use `sbgemm_` with MKL.
In the fallback path, the computation takes two steps: first do the gemm with beta = 0, then add beta * C in full precision. The idea, from @peterbell10, is not to truncate C to bfloat16, which avoids unnecessary accuracy loss.
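The two steps of the fallback can be sketched as follows. This is an illustration under stated assumptions, not the code in `cpublas::gemm`: `gemm_bf16_fallback_sketch` and the bfloat16 helpers are hypothetical, the layout is row-major for readability, and the naive triple loop stands in for whatever gemm the fallback actually calls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical bfloat16 stand-in: the upper 16 bits of an IEEE-754 float32.
using bf16_bits = std::uint16_t;

// float32 -> bfloat16 with round-to-nearest-even (NaN handling omitted).
inline bf16_bits to_bf16(float f) {
  std::uint32_t u;
  std::memcpy(&u, &f, sizeof(u));
  u += 0x7FFFu + ((u >> 16) & 1u);
  return static_cast<bf16_bits>(u >> 16);
}

// bfloat16 -> float32 is exact: pad the low 16 bits with zeros.
inline float from_bf16(bf16_bits h) {
  std::uint32_t u = static_cast<std::uint32_t>(h) << 16;
  float f;
  std::memcpy(&f, &u, sizeof(f));
  return f;
}

// C = alpha * A * B + beta * C, with A (m x k) and B (k x n) in bfloat16
// and C (m x n) in float32, all row-major (an assumption of this sketch).
void gemm_bf16_fallback_sketch(int m, int n, int k, float alpha,
                               const std::vector<bf16_bits>& a,
                               const std::vector<bf16_bits>& b,
                               float beta, std::vector<float>& c) {
  assert(a.size() == static_cast<std::size_t>(m) * k);
  assert(b.size() == static_cast<std::size_t>(k) * n);
  assert(c.size() == static_cast<std::size_t>(m) * n);

  // Step 1: gemm with beta = 0 into a float32 temporary.
  std::vector<float> tmp(c.size(), 0.0f);
  for (int i = 0; i < m; ++i) {
    for (int j = 0; j < n; ++j) {
      float acc = 0.0f;
      for (int p = 0; p < k; ++p) {
        acc += from_bf16(a[static_cast<std::size_t>(i) * k + p]) *
               from_bf16(b[static_cast<std::size_t>(p) * n + j]);
      }
      tmp[static_cast<std::size_t>(i) * n + j] = alpha * acc;
    }
  }

  // Step 2: add beta * C in float32; C is never truncated to bfloat16.
  // Following the usual BLAS convention, C is not read when beta == 0.
  for (std::size_t idx = 0; idx < c.size(); ++idx) {
    c[idx] = (beta == 0.0f) ? tmp[idx] : tmp[idx] + beta * c[idx];
  }
}
```

Because C is only ever read and written as float32, a value such as 1.00390625 in C survives the update intact. It would collapse to 1.0 if C were first rounded to bfloat16.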
ref: https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-c/2023-0/cblas-gemm-bf16bf16f32.html
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107196
Approved by: https://github.com/jgong5, https://github.com/peterbell10