CUDA Kernels: Use per-operator headers (3/4) (#71214)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71214
Splitting this into multiple PRs to keep the diffs more manageable.
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D33949897
Pulled By: malfet
fbshipit-source-id: 3ba6b4b8083fe97e11644688f9d90a4ec217fedc
(cherry picked from commit 5a3e986cfea686e360be484da0087e1b3e2d1ea9)