Prefer contiguous output from mkldnn_bf16_gemm (#82968)
In https://github.com/pytorch/pytorch/pull/65840#issuecomment-1207843020 it was reported that `mkldnn_bf16_gemm` resulted in extra reorder calls. This appears to be caused by the fortran-contiguous (column-major) strides on the output tensor. Rearranging the matmul so that the output is written with c-contiguous (row-major) strides removes these extra reorders.
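
For illustration only, here is a minimal Python sketch (not the actual ATen change) of the layout identity such a rearrangement relies on: since `C = A @ B` implies `C.T = B.T @ A.T`, a kernel that naturally produces a column-major result can be handed the swapped, transposed operands so that the buffer the caller sees is already row-major and no reorder is needed afterwards. The helper names below are hypothetical.

```python
import torch

def gemm_fortran_out(a, b):
    # Stand-in for a BLAS-style routine whose output is column-major:
    # the transpose of the result is contiguous, the result itself is not.
    return (a @ b).t().contiguous().t()

def gemm_c_contiguous_out(a, b):
    # Same product, computed as (B.T @ A.T).T, so the memory filled by
    # the column-major kernel is already row-major for the caller.
    return gemm_fortran_out(b.t(), a.t()).t()

a = torch.randn(128, 64, dtype=torch.bfloat16)
b = torch.randn(64, 256, dtype=torch.bfloat16)

c_fortran = gemm_fortran_out(a, b)
c_row = gemm_c_contiguous_out(a, b)

print(c_fortran.is_contiguous())  # False: column-major strides, reorder needed later
print(c_row.is_contiguous())      # True: downstream reorder is unnecessary
# Values agree up to accumulation-order rounding.
print(torch.allclose(c_fortran.float(), c_row.float(), atol=1e-2))
```
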
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82968
Approved by: https://github.com/malfet