Move some CUB templates out of the header file (#67650)
Summary:
CUB routines are expensive to compile and are used by multiple
operators throughout the cuda folder, so it makes sense to compile
them once, in one centralized place, where possible (i.e. when no
custom operators are involved).
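As a rough illustration of the pattern described above (not the actual
PyTorch code), the sketch below keeps only a template declaration in what
would be the header, and puts the definition plus explicit instantiations
in what would be a single source file. The names `sketch::inclusive_sum`,
`scan_utils.h`, and `scan_utils.cpp` are hypothetical, and a plain host-side
loop stands in for the expensive CUB call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// ---- Part 1: would live in a header (hypothetical scan_utils.h) ----
namespace sketch {

// Declaration only. Files that include the header never see the body,
// so they do not pay to compile it.
template <typename T>
void inclusive_sum(const T* in, T* out, std::size_t n);

}  // namespace sketch

// ---- Part 2: would live in ONE source file (hypothetical scan_utils.cpp) ----
namespace sketch {

// Definition. In the real change this is where the CUB routine would be
// called; a simple running sum is used here as a stand-in (assumption).
template <typename T>
void inclusive_sum(const T* in, T* out, std::size_t n) {
  assert(n == 0 || (in != nullptr && out != nullptr));
  T running = T(0);
  for (std::size_t i = 0; i < n; ++i) {
    running += in[i];
    out[i] = running;
  }
}

// Explicit instantiations: the template is compiled here, once, for each
// supported element type. The list of types is illustrative only.
template void inclusive_sum<std::int32_t>(const std::int32_t*, std::int32_t*,
                                          std::size_t);
template void inclusive_sum<float>(const float*, float*, std::size_t);

}  // namespace sketch
```

In a real split across files, a caller using a type outside the
instantiated set would fail at link time, which is presumably why routines
that take custom operators cannot be centralized this way.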
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67650
Reviewed By: mruberry
Differential Revision: D32259660
Pulled By: ngimel
fbshipit-source-id: 5f7dbdb134297e1ffdc1c7fc5aefee70a2fa5422