improve mkldnn matmul performance when one input is a contiguous tensor but its strides are not the default contiguous strides (#99511)
Given the following case:
```
import torch
a = torch.empty_strided([64, 1, 33], [33, 3, 1], dtype=torch.bfloat16).fill_(1)
b = torch.randn(64, 33, 256).to(dtype=torch.bfloat16)
y = torch.ops.aten.bmm(a, b)
```
`a` is a contiguous tensor, but its strides are not the default contiguous strides ([33, 33, 1]), so the oneDNN matmul always runs a non-optimized path:
```
onednn_verbose,exec,cpu,matmul,gemm:jit,undef,src_bf16::blocked:abc:f0 wei_bf16::blocked:abc:f0 dst_bf16::blocked:abc:f0,attr-scratchpad:user ,,64x1x33:64x33x256:64x1x256,7.28711
```
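The verbose output above can be reproduced by enabling oneDNN's primitive logging when running the snippet (`repro.py` is a hypothetical file name holding the snippet above):
```
DNNL_VERBOSE=1 python repro.py
```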
This PR converts the inputs' strides to the default contiguous strides before calling oneDNN, so that an optimized path is taken:
```
onednn_verbose,exec,cpu,matmul,brg:avx512_core_amx_bf16,undef,src_bf16::blocked:abc:f0 wei_bf16::blocked:abc:f0 dst_bf16::blocked:abc:f0,attr-scratchpad:user ,,64x1x33:64x33x256:64x1x256,3.06396
```
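A minimal Python-level sketch of the idea (the actual fix lives in the C++ mkldnn matmul bindings; `default_contiguous_strides` and `normalize_strides` here are illustrative helpers, not PyTorch APIs):
```python
import torch

def default_contiguous_strides(sizes):
    # Row-major (C-contiguous) strides computed from sizes.
    strides = [1] * len(sizes)
    for i in range(len(sizes) - 2, -1, -1):
        strides[i] = strides[i + 1] * sizes[i + 1]
    return strides

def normalize_strides(t):
    # A tensor can be contiguous in memory while reporting non-default
    # strides: for a dim of size 1 the stride is never used for
    # addressing, so PyTorch's contiguity check skips it. Restriding
    # such a tensor to the default layout is safe and copy-free.
    default = default_contiguous_strides(list(t.size()))
    if t.is_contiguous() and list(t.stride()) != default:
        return t.as_strided(t.size(), default)
    return t

a = torch.empty_strided([64, 1, 33], [33, 3, 1], dtype=torch.bfloat16).fill_(1)
print(a.stride())                     # (33, 3, 1)
print(normalize_strides(a).stride())  # (33, 33, 1)
```
Because the normalized strides describe exactly the same memory layout, `as_strided` gives oneDNN the canonical stride description it expects for the optimized kernel without copying any data.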
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99511
Approved by: https://github.com/mingfeima, https://github.com/jgong5