Micro-optimisations for matmul 2.0: Electric boogaloo
This PR implements the bulk of
https://github.com/pytorch/pytorch/pull/64387
Part of the optimisations were already merged in
https://github.com/pytorch/pytorch/pull/72230
These optimisations include:
- Make the code `const` correct.
- Create `DimVector`s more efficiently (e.g. prefer `append` over
`insert`).
- Access sizes of the tensors via `sizes().front()` / `sizes().back()`
/ `sizes().end()[-2]`.
- Do not create intermediary tensors / vectors when it can be avoided.
- Call `reshape` rather than `expect_contiguous` + `view`.
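A couple of the shape-handling points above can be sketched in isolation. This is a minimal, self-contained illustration (not the actual ATen code): `c10::DimVector` is modelled with `std::vector<int64_t>`, and `matmul_output_shape` is a hypothetical helper. It shows building the output shape by appending at the end rather than inserting at the front, and reading the trailing dimensions via `back()` / `end()[-2]` without materialising an intermediate copy of the sizes.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-in for c10::DimVector in this sketch.
using ShapeRef = std::vector<int64_t>;

// Hypothetical helper: compute the output shape of a batched matmul
// (assumes both inputs have at least 2 dims and matching batch dims).
ShapeRef matmul_output_shape(const ShapeRef& a, const ShapeRef& b) {
    ShapeRef out;
    out.reserve(a.size());
    // Append the batch dims of `a` (all but the last two) one by one;
    // appending at the end is cheap, unlike inserting at the front.
    for (auto it = a.begin(); it + 2 != a.end(); ++it) {
        out.push_back(*it);
    }
    out.push_back(a.end()[-2]);  // rows of a, read without copying sizes
    out.push_back(b.back());     // cols of b
    return out;
}
```

With `a` of shape `[2, 3, 4, 5]` and `b` of shape `[2, 3, 5, 6]`, this yields `[2, 3, 4, 6]`.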
On top of these, it fixes a correctness issue in `matmul_out`, where the
out parameter was not resized correctly when passed to the backends.
This involves removing the use of `set_` from the calling code, as
requested by ezyang, and it accounts for most of the complexity that
this PR adds.
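To make the `set_` point concrete, here is a hedged sketch of the semantics involved (the names `FakeTensor` and `resize_out_for_matmul` are hypothetical, not the ATen API): instead of re-binding the out tensor's storage with `set_`, the caller resizes the out tensor in place to the expected result shape before handing it to the backend, so the tensor the user passed in keeps its identity.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy model of a tensor for this sketch: only its shape matters here.
struct FakeTensor {
    std::vector<int64_t> sizes;
    // Models Tensor::resize_: the same object is reshaped in place.
    void resize_(const std::vector<int64_t>& new_sizes) { sizes = new_sizes; }
};

// Hypothetical helper modelling the fix: resize `out` in place when its
// shape does not match the expected result, rather than swapping its
// storage with set_.
void resize_out_for_matmul(FakeTensor& out,
                           const std::vector<int64_t>& expected) {
    if (out.sizes != expected) {
        out.resize_(expected);
    }
}
```

The design point is that an in-place resize preserves the identity of the out tensor visible to the caller, whereas `set_` would silently alias it to different storage.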
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75197
Approved by: https://github.com/mruberry