[pytorch] use cublas lt interface for bias fusion (#72148)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72148
Measured how much the cuBLASLt interface helps the linear benchmark in PARAM bench (https://github.com/facebookresearch/param/).
On a V100 GPU:
for b in 512 1024; do for i in {1..5}; do param_bench/train/compute/pt/pytorch_linear.py --device gpu --dtype=float16 --hidden-size 1024 --batch-size ${b}; done; done
Before this commit
batch size 512: median 21.4 TF/s (20.7, 20.6, 21.8, 21.6, 21.4)
batch size 1024: median 40.1 TF/s (39.4, 39.3, 40.2, 40.4, 40.1)
After this commit
batch size 512: median 23.6 TF/s (23.2, 23.5, 23.8, 23.9, 23.6) 10.3% speedup
batch size 1024: median 41.6 TF/s (42.7, 41.6, 40.4, 41.3, 41.9) 3.7% speedup
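For reference, the medians and speedups can be recomputed from the five samples listed for each batch size. This is a minimal sketch using only the Python standard library. The names `before`, `after`, and `speedup_percent` are illustrative and are not part of this PR or of PARAM bench.

```python
from statistics import median

# Throughput samples in TF/s, copied from the runs listed above.
before = {
    512: [20.7, 20.6, 21.8, 21.6, 21.4],
    1024: [39.4, 39.3, 40.2, 40.4, 40.1],
}
after = {
    512: [23.2, 23.5, 23.8, 23.9, 23.6],
    1024: [42.7, 41.6, 40.4, 41.3, 41.9],
}


def speedup_percent(before_runs, after_runs):
    """Relative gain of the median 'after' throughput over the median 'before' throughput."""
    return (median(after_runs) / median(before_runs) - 1.0) * 100.0


for batch_size in before:
    print(
        f"batch size {batch_size}: "
        f"median {median(before[batch_size]):.1f} -> {median(after[batch_size]):.1f} TF/s, "
        f"{speedup_percent(before[batch_size], after[batch_size]):.1f}% speedup"
    )
```

The median is used rather than the mean so that a single slow or fast run does not skew the comparison.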
Reviewed By: jasonjk-park, jianyuh
Differential Revision: D33928147
fbshipit-source-id: cecc51a27f4b07a7f8cb728d48eebfc4e41ea823
(cherry picked from commit 2b71db6199c49b2461bc0d4c2647644b76b29d5d)