use legacy unrolled kernel for non-trivial offset calc cases (#71710)
Summary:
This change leads to across-the-board improvements on Pascal GPUs and to large perf improvements for some broadcasting patterns and datatypes on V100, along with 3-5% regressions for some other patterns. The most common pattern that improves on V100 is half-precision x+bias, which improves by ~5%. Full V100 results are in https://docs.google.com/spreadsheets/d/1K67x-6_TPT9Yt6533NfECEhUyfbqBxLH9M5Z3gymzXE/edit#gid=1218963246, and the benchmarking script is in https://gist.github.com/ngimel/986ee84a1dd234a0485e99544e0fc8b6
Most importantly, it reduces CUDA context size by 40 MB.
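
For readers unfamiliar with the term, a "non-trivial offset calc" case is one where an operand's memory offset cannot simply be taken as the linear output index, as with the broadcast bias in x+bias. The sketch below is a pure-Python illustration of that idea only; it is not the PyTorch kernel code, and the helper names broadcast_strides and element_offset are hypothetical.

```python
def broadcast_strides(shape, out_shape):
    """Element strides of a contiguous operand of `shape` viewed as `out_shape`.

    Illustrative helper (hypothetical name): size-1 dimensions get stride 0,
    so every output element along them reads the same input element.
    """
    padded = (1,) * (len(out_shape) - len(shape)) + tuple(shape)
    strides = []
    running = 1
    for size, out_size in zip(reversed(padded), reversed(out_shape)):
        if size == 1:
            strides.append(0)          # broadcast (or singleton) dimension
        elif size == out_size:
            strides.append(running)    # ordinary contiguous dimension
        else:
            raise ValueError("shapes are not broadcastable")
        running *= size
    return tuple(reversed(strides))


def element_offset(linear_index, out_shape, strides):
    """Map a linear output index to an operand offset (hypothetical name).

    This per-element divide/modulo work is what makes the case non-trivial.
    """
    offset = 0
    for size, stride in zip(reversed(out_shape), reversed(strides)):
        linear_index, coord = divmod(linear_index, size)
        offset += coord * stride
    return offset


out_shape = (2, 3)
x_strides = broadcast_strides((2, 3), out_shape)   # same shape as the output
bias_strides = broadcast_strides((3,), out_shape)  # broadcast over rows

x_offsets = [element_offset(i, out_shape, x_strides) for i in range(6)]
bias_offsets = [element_offset(i, out_shape, bias_strides) for i in range(6)]
```

For x, the offsets equal the linear index, which is the trivial case. For bias, the offsets repeat for each row, so they have to be computed per element.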
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71710
Reviewed By: mruberry
Differential Revision: D33769330
Pulled By: ngimel
fbshipit-source-id: 5a7942261e06003ca79bfa3b071106aab1a8a4bc
(cherry picked from commit f9b51b48112b25353c928711974537a0792516c8)
Author: Natalia Gimelshein