CUDA Kernels: Use per-operator headers (1/4) (#71212)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71212
Splitting this into multiple PRs to keep the diffs more manageable.
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D33949896
Pulled By: malfet
fbshipit-source-id: b11e5effa44d660932b8c21ccab6ece3e48e848c
(cherry picked from commit b866a2b5dafb8d8af061190080c367099c12b178)