Expand the coverage of test_addmm and test_addmm_sizes (#43831)
Summary:
- This test is very fast and very important, so there is no reason to mark it as slowTest
- This test should also run on CUDA
- This test should check alpha and beta support
- This test should check `out=` support
- The manual reference computation should use a list instead of index_put, because a list is much faster
- The precision for TF32 needs to be fixed; that will be done in a future PR.
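
For illustration only, a minimal sketch of what a list-based manual reference for addmm with alpha and beta could look like. It is written in plain Python, not taken from the test itself; the name `addmm_reference` and the sample matrices are hypothetical.

```python
def addmm_reference(M, m1, m2, alpha=1, beta=1):
    """Compute beta * M + alpha * (m1 @ m2) using nested Python lists.

    Hypothetical helper: M is n x p, m1 is n x k, m2 is k x p.
    """
    n, k, p = len(m1), len(m2), len(m2[0])
    result = []
    for i in range(n):
        row = []
        for j in range(p):
            # Dot product of row i of m1 with column j of m2.
            acc = 0
            for l in range(k):
                acc += m1[i][l] * m2[l][j]
            row.append(beta * M[i][j] + alpha * acc)
        result.append(row)
    return result


# Sample inputs (assumed values, chosen only for the example).
M = [[1, 1], [1, 1]]
m1 = [[1, 2], [3, 4]]
m2 = [[5, 6], [7, 8]]
expected = addmm_reference(M, m1, m2, alpha=2, beta=3)
```

In a test, such a list would presumably be converted to a tensor once and compared against the result of `torch.addmm(M, m1, m2, alpha=alpha, beta=beta)`, and against the same call with `out=` supplied.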
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43831
Reviewed By: ailzhang
Differential Revision: D23435032
Pulled By: ngimel
fbshipit-source-id: d1b8350addf1e2fe180fdf3df243f38d95aa3f5a