Improved matmul tests
Let's make sure we don't break anything in the next PRs of the stack.
Also, comprehensive testing of matmul on CPU and CUDA was long overdue.
Running these tests, we see that the `out=` variant of matmul is broken
when used on 4D tensors. This hints at how few people actually use the
`out=` variants...
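
For illustration, a minimal sketch of the kind of check that exposes the
`out=` problem. This is not the test added in this PR; the helper name
`check_matmul_out` and the shapes are hypothetical, and it only assumes
the public `torch.matmul` and `torch.allclose` APIs:

```python
import torch


def check_matmul_out(shape_a, shape_b, device="cpu"):
    """Return True if matmul with `out=` matches the functional variant.

    Hypothetical helper for illustration only, not part of the test suite.
    """
    a = torch.randn(shape_a, device=device)
    b = torch.randn(shape_b, device=device)

    # Reference result from the functional variant.
    expected = torch.matmul(a, b)

    # Preallocate an output of the right shape and dtype, then write into it.
    out = torch.empty_like(expected)
    torch.matmul(a, b, out=out)

    return out.shape == expected.shape and torch.allclose(
        out, expected, rtol=1e-4, atol=1e-5
    )


# 4D batched case, and a 4D x 2D broadcasting case.
ok_batched = check_matmul_out((2, 3, 4, 5), (2, 3, 5, 6))
ok_broadcast = check_matmul_out((2, 3, 4, 5), (5, 6))
```

On a build where the `out=` path is broken for 4D inputs, a check like
this would be expected to fail or raise; pass `device="cuda"` to run the
same check on a GPU, assuming one is available.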
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75193
Approved by: https://github.com/ngimel