Remove native_functions.yaml dependency from some reduction operators (#64173)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64173
This one also required restructuring the code a bit to move the kernel
code into separate files. So, I've mainly focused on CUDA, which is
where the real build-time issues are.
Test Plan: Imported from OSS
Reviewed By: jbschlosser, ezyang
Differential Revision: D30728581
Pulled By: dagitses
fbshipit-source-id: a69eea5b4100d16165a02660dde200c8f648683d