Minimize the cases where we have to cpu_zero. (#33570)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33570
In this PR, we are a bit more careful about avoiding zeroing the output. The analysis is as follows:
1) `mm` doesn't need `zero_` because it never calls `scal`, which is the source of the underlying problem.
2) For `mv`, which does call `scal` (in certain cases), we can move the zeroing to the one place where it actually matters, namely when the scalar value is 0.
In that case we just run the non-BLAS version of the code.
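
As a rough illustration of why a scalar value of 0 is the problematic case, here is a minimal Python sketch. It assumes the issue is that scaling an output buffer by 0 does not clear NaN or Inf values left over from uninitialized memory; that reading is an assumption, not something stated in this PR. The helpers `scal_naive` and `scal_zero_aware` are hypothetical names used only for this example and are not PyTorch or BLAS functions.

```python
import math

def scal_naive(alpha, y):
    """Scale y in place by alpha with a plain multiply loop."""
    for i in range(len(y)):
        y[i] = alpha * y[i]
    return y

def scal_zero_aware(alpha, y):
    """Hypothetical variant: treat alpha == 0 as 'overwrite with zeros'."""
    if alpha == 0:
        for i in range(len(y)):
            y[i] = 0.0
    else:
        scal_naive(alpha, y)
    return y

# Stand-in for an output buffer that was allocated but never initialized.
uninit = [float("nan"), float("inf"), 1.0]

naive = scal_naive(0.0, list(uninit))       # NaN and Inf survive the multiply
safe = scal_zero_aware(0.0, list(uninit))   # every element becomes 0.0
```

Under IEEE 754 arithmetic, `0.0 * nan` and `0.0 * inf` are both NaN, so only the zero-aware path leaves a clean buffer. Handling the zero case at that single point would make an unconditional zeroing of the output beforehand unnecessary.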
Test Plan: Imported from OSS
Differential Revision: D20007665
Pulled By: gchanan
fbshipit-source-id: 1f3a56954501aa9b2940d2f4b35095b2f60089a8