Remove useless contiguous calls from torch.matmul (#54616)
Summary:
This significantly reduces the memory usage of matmul when one of the inputs has an expanded (broadcast) batch dimension.
For example, it reduces the peak memory usage of
```
a = torch.rand(1, 1024, 1024, device="cuda")
b = torch.rand(1024, 1024, 1, device="cuda")
# a's singleton batch dimension is broadcast to 1024 to match b
out = torch.matmul(a, b)
```
from roughly 4GB down to about 16MB (the 4GB came from materializing the broadcast-expanded `a` as a contiguous tensor).
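A minimal sketch, assuming a CUDA device is available, of how the peak-memory figure can be checked with `torch.cuda.max_memory_allocated`:
```
import torch

# reset the allocator's peak-memory counter before running the workload
torch.cuda.reset_peak_memory_stats()
a = torch.rand(1, 1024, 1024, device="cuda")
b = torch.rand(1024, 1024, 1, device="cuda")
out = torch.matmul(a, b)
torch.cuda.synchronize()
print(torch.cuda.max_memory_allocated() / 1024 ** 2, "MiB")
```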
It also fixes the same problem when `b` is not batched (i.e. when `b` is a plain 2D matrix).
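A minimal sketch of that case, assuming `a` is broadcast-expanded along the batch dimension and `b` is 2D:
```
a = torch.rand(1, 1024, 1024, device="cuda").expand(1024, -1, -1)
b = torch.rand(1024, 1, device="cuda")
# before this change, the expanded `a` could be materialized as a contiguous
# ~4GB tensor; now the expanded view is used as-is
out = torch.matmul(a, b)  # shape: (1024, 1024, 1)
```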
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54616
Reviewed By: ailzhang
Differential Revision: D27327056
Pulled By: albanD
fbshipit-source-id: 4bb5f4015aeab4174148512f3c5b8d1ffa97bf54