CUDA Kernels: Use per-operator headers (4/4) (#71215)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71215
Splitting this into multiple PRs to keep the diffs more manageable.
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D33949902
Pulled By: malfet
fbshipit-source-id: e737245fb9ebba3c301ee644fba447ea5ddfdfba
(cherry picked from commit a3102cc0d6795e25d3132f8750e092ee2fac59e7)