Various overhead improvements to CUDA addmm (#55026)
Summary:
Add a fast path for the common case to `prepare_matrix_for_cublas`, index into `sizes()` instead of calling `size()`, and move some checks to where they belong so that they are not run on paths where they are guaranteed to pass.
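For illustration only, a minimal C++ sketch of a fast common case placed in front of a stride inspection. `MatrixView`, `BlasLayout`, and `classify_for_blas` are hypothetical names and are not the ATen code touched by this PR. The cached `is_contiguous` flag is an assumption about why the common case is cheap.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical stand-in for a 2-D tensor view; this is NOT at::Tensor.
struct MatrixView {
  int64_t sizes[2];
  int64_t strides[2];
  bool is_contiguous;  // assumed to be a cached flag (row-major, densely packed)
};

// How a column-major BLAS routine could consume the matrix.
enum class BlasLayout { NoTranspose, Transpose, NeedsCopy };

inline BlasLayout classify_for_blas(const MatrixView& m) {
  // Fast common case: a contiguous row-major matrix is a transposed
  // column-major matrix, so a single flag check settles it.
  if (m.is_contiguous) {
    return BlasLayout::Transpose;
  }
  // General case: read sizes and strides by index and accept any layout
  // with unit stride in one dimension and a large enough leading dimension.
  if (m.strides[0] == 1 &&
      m.strides[1] >= std::max<int64_t>(1, m.sizes[0])) {
    return BlasLayout::NoTranspose;
  }
  if (m.strides[1] == 1 &&
      m.strides[0] >= std::max<int64_t>(1, m.sizes[1])) {
    return BlasLayout::Transpose;
  }
  // Anything else would have to be copied into a dense layout first.
  return BlasLayout::NeedsCopy;
}
```

In this sketch only non-contiguous inputs pay for the stride comparisons, which is the kind of per-call overhead the summary refers to.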
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55026
Reviewed By: gchanan
Differential Revision: D27468945
Pulled By: ngimel
fbshipit-source-id: 79c9f7b3d61595536f603d6fb0316e6f21630f38