[PyTorch] Port Caffe2 opti for BatchMatMul batch size 1 to baddbmm (#51057)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51057
Caffe2 has an
[optimization](https://github.com/pytorch/pytorch/blob/f8eefbdf7a229abbb864e47e0b664c7628d80224/caffe2/operators/batch_matmul_op.h#L192)
for the case where the batch size is 1: it calls the underlying `gemm`
BLAS function directly instead of `gemm_batched`. This diff ports that
optimization to `baddbmm_mkl`.
Note that I have very little linear algebra background and am just
going off existing code and cblas API documentation, so please
review without assuming I know what I'm doing with the math itself.
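The idea behind the special case is that for batch size 1, `baddbmm` reduces to a single matrix multiply plus a scaled add, so one plain `gemm` call suffices. A minimal NumPy sketch of that dispatch (this is an illustration of the semantics, not the actual MKL/C++ code path; the function name `baddbmm_like` is hypothetical):

```python
import numpy as np

def baddbmm_like(input_, batch1, batch2, beta=1.0, alpha=1.0):
    """Sketch of baddbmm semantics: out = beta * input_ + alpha * (batch1 @ batch2).

    Mirrors the Caffe2-style batch-size-1 special case: when there is only
    one batch element, do a single 2-D matmul (a stand-in for one gemm call)
    instead of a batched multiply (a stand-in for gemm_batched).
    """
    if batch1.shape[0] == 1:
        # Single "gemm" on the 2-D slices; re-add the batch dim afterwards.
        out = beta * input_[0] + alpha * (batch1[0] @ batch2[0])
        return out[np.newaxis, ...]
    # General case: batched multiply over all batch elements.
    return beta * input_ + alpha * np.matmul(batch1, batch2)
```

Both branches produce the same result for batch size 1; the special case just avoids the batched-BLAS entry point.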
ghstack-source-id: 120342923
Reviewed By: hlu1
Differential Revision: D26056613
fbshipit-source-id: feef80344b96601fc2bd0a2e8c8f6b57510d7856