Increase size limit on calling CublasLt in addmm by 32x (#82922)
Summary:
Increase the size limit for taking the cuBLASLt fast path in linear/addmm by 32x.
Why?
Discovered this when looking at the performance of a linear layer with 512 input and 512 output features and an input of size [1024, 82, 512]. It was slow.
Did a sweep on inputs and discovered a perf cliff between input sizes [799, 82, 512] and [800, 82, 512]. There is a check that calls into cuBLASLt only when dim 1 of the input (the leading dimension after flattening to 2D) is < 65535. So 799 * 82 = 65518 takes the fast path, while 800 * 82 = 65600 falls to the slow path.
With cuBLASLt we get a single sgemm. Without cuBLASLt, there is a copy followed by an sgemm, which can be almost 2x slower. This change makes linear/addmm roughly 1.6-1.9x faster for the affected sizes. However, we only increased the limit: inputs that exceed the new limit will still take the slow path.
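For illustration, a minimal sketch of the dispatch arithmetic (the names and the standalone function are hypothetical; the real check lives in ATen's CUDA blas dispatch):

```python
# Hypothetical sketch of the fast-path condition; illustrative names only.
OLD_LIMIT = 65535
NEW_LIMIT = 65535 * 32  # limit after this change

def takes_lt_fastpath(flattened_batch_dim: int, limit: int) -> bool:
    # cuBLASLt fast path is taken only when the flattened leading
    # dimension of the input is below the limit.
    return flattened_batch_dim < limit

assert takes_lt_fastpath(799 * 82, OLD_LIMIT)      # 65518 -> fast path
assert not takes_lt_fastpath(800 * 82, OLD_LIMIT)  # 65600 -> slow path
assert takes_lt_fastpath(800 * 82, NEW_LIMIT)      # fast path after this change
```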
Test Plan: CI, manual testing with linear.
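A rough sketch of the kind of manual linear testing described above, assuming a CUDA device is available (shapes taken from the sweep; exact timings will vary by hardware):

```python
import torch

# Time a 512 -> 512 linear on inputs just below and just above the old
# 65535-element cliff (799 * 82 = 65518 vs 800 * 82 = 65600).
lin = torch.nn.Linear(512, 512).cuda()

for rows in (799, 800):
    x = torch.randn(rows, 82, 512, device="cuda")
    for _ in range(10):  # warm-up
        lin(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(100):
        lin(x)
    end.record()
    torch.cuda.synchronize()
    print(f"input [{rows}, 82, 512]: {start.elapsed_time(end) / 100:.3f} ms/iter")
```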
Differential Revision: D38478430
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82922
Approved by: https://github.com/ngimel, https://github.com/malfet