Increase size limit on calling CublasLt in addmm by 32x (#82922)
Summary:
Increase the size limit for taking the cuBLASLt fast path in linear/addmm by 32x.
Why?
Discovered this when looking at the performance of a linear layer with 512 input and 512 output features and an input of size [1024, 82, 512]. It was slow.
Did a sweep on inputs and discovered a perf cliff between input sizes [799, 82, 512] and [800, 82, 512]. There is a check that calls into cuBLASLt only when dim 1 of the input (the leading dimension after flattening to 2D) is < 65535. So 799 * 82 = 65518 takes the fast path, while 800 * 82 = 65600 falls to the slow path.
With cuBLASLt we get a single sgemm. Without cuBLASLt, there is a copy followed by an sgemm, which can be almost 2x slower. This change makes linear/addmm roughly 1.6-1.9x faster for the affected sizes. However, we only increased the limit: inputs that exceed the new limit will still take the slow path.
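For illustration, a minimal sketch of the dispatch arithmetic (the names and the standalone function are hypothetical; the real check lives in ATen's CUDA blas dispatch):

```python
# Hypothetical sketch of the fast-path condition; illustrative names only.
OLD_LIMIT = 65535
NEW_LIMIT = 65535 * 32  # limit after this change

def takes_lt_fastpath(flattened_batch_dim: int, limit: int) -> bool:
    # cuBLASLt fast path is taken only when the flattened leading
    # dimension of the input is below the limit.
    return flattened_batch_dim < limit

assert takes_lt_fastpath(799 * 82, OLD_LIMIT)      # 65518 -> fast path
assert not takes_lt_fastpath(800 * 82, OLD_LIMIT)  # 65600 -> slow path
assert takes_lt_fastpath(800 * 82, NEW_LIMIT)      # fast path after this change
```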
Test Plan: CI, manual testing with linear.
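A rough sketch of the kind of manual linear testing described above, assuming a CUDA device is available (shapes taken from the sweep; exact timings will vary by hardware):

```python
import torch

# Time a 512 -> 512 linear on inputs just below and just above the old
# 65535-element cliff (799 * 82 = 65518 vs 800 * 82 = 65600).
lin = torch.nn.Linear(512, 512).cuda()

for rows in (799, 800):
    x = torch.randn(rows, 82, 512, device="cuda")
    for _ in range(10):  # warm-up
        lin(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(100):
        lin(x)
    end.record()
    torch.cuda.synchronize()
    print(f"input [{rows}, 82, 512]: {start.elapsed_time(end) / 100:.3f} ms/iter")
```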
Differential Revision: D38478430
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82922
Approved by: https://github.com/ngimel, https://github.com/malfet