Unfuse bias add before pointwise ops (#106912)
I get a 2% inference speedup in HF with this PR. I checked whether there were any models where unfusing was slower than the cublas gelu fusion and did not see any, which surprised me. Sorry for the cublas-activation API churn 😬
Kicking off another run with cublas 12; it's possible that the results have changed since.
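For readers unfamiliar with the rewrite, here is a rough plain-Python sketch of the idea, not the actual Inductor pass. It assumes the pattern of interest is a bias-adding matmul followed by a pointwise op such as gelu; all helper names below are hypothetical and chosen for illustration only. The fused form produces the biased matmul result and then runs the activation as a separate pass, while the unfused form does a plain matmul and folds the bias add into the same pointwise pass as the activation. Both compute the same values:

```python
import math


def matmul(a, b):
    """Plain-Python matrix multiply on lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gelu(v):
    """Exact (erf-based) GELU for a single scalar."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def addmm_then_gelu(bias, x, w):
    """Fused-bias form: matmul with bias added, then a separate gelu pass."""
    biased = [[v + b for v, b in zip(row, bias)] for row in matmul(x, w)]
    return [[gelu(v) for v in row] for row in biased]


def mm_then_bias_gelu(bias, x, w):
    """Unfused form: plain matmul, then bias add and gelu in one pointwise pass."""
    return [[gelu(v + b) for v, b in zip(row, bias)] for row in matmul(x, w)]


# Illustrative inputs (made up for this sketch).
x = [[1.0, 2.0], [3.0, 4.0]]
w = [[0.5, -1.0], [0.25, 2.0]]
bias = [0.1, -0.2]

fused = addmm_then_gelu(bias, x, w)
unfused = mm_then_bias_gelu(bias, x, w)
```

The sketch only shows that the two forms are numerically interchangeable; which one is faster depends on the kernels actually generated, which is what the benchmark runs above measure.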
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106912
Approved by: https://github.com/jansel
ghstack dependencies: #106911