Use explicit templates in CUDALoops kernels (#41059)
Summary:
Follow-up to https://github.com/pytorch/pytorch/pull/40992
Use explicit templates instead of lambdas to reduce binary size by 100-200Kb per arch per CU without affecting perf, namely:
BinaryMulDivKernel.cu 3.8Mb -> 3.5Mb
CompareEQKernel.cu 1.8Mb -> 1.7Mb
BinaryAddSubKernel.cu 2.0Mb -> 1.8Mb
BinaryBitwiseOpsKernels.cu 2.6Mb -> 2.3Mb
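
For illustration only, here is a minimal sketch of the lambda-to-functor pattern described above. It is not the code from this PR: `apply_binary`, `add_with_functor`, `add_with_lambda`, and the `HOST_DEVICE` macro are hypothetical names, and `apply_binary` is a host-side stand-in for a templated elementwise kernel launcher. `AddFunctor` here is likewise an assumed shape, not a copy of any functor in the tree. The summary does not explain the mechanism behind the size reduction, so the comments describe only what differs between the two forms.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Expands to CUDA qualifiers under nvcc and to nothing under a host compiler,
// so the sketch builds either way. (Hypothetical helper macro.)
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical stand-in for an elementwise kernel launcher. It is templated
// on the callable type, so every distinct callable type passed in produces
// a separate instantiation.
template <typename T, typename Func>
std::vector<T> apply_binary(const std::vector<T>& a,
                            const std::vector<T>& b,
                            const Func& f) {
  assert(a.size() == b.size());
  std::vector<T> out(a.size());
  for (std::size_t i = 0; i < a.size(); ++i) {
    out[i] = f(a[i], b[i]);
  }
  return out;
}

// Explicit functor template: a named type that depends only on T.
template <typename T>
struct AddFunctor {
  explicit AddFunctor(T alpha) : alpha_(alpha) {}
  HOST_DEVICE T operator()(T a, T b) const { return a + alpha_ * b; }

 private:
  T alpha_;
};

// After: the launcher is instantiated with the named type AddFunctor<T>.
template <typename T>
std::vector<T> add_with_functor(const std::vector<T>& a,
                                const std::vector<T>& b, T alpha) {
  return apply_binary<T>(a, b, AddFunctor<T>(alpha));
}

// Before: the launcher is instantiated with an unnamed closure type that is
// unique to this lambda expression and its enclosing function.
template <typename T>
std::vector<T> add_with_lambda(const std::vector<T>& a,
                               const std::vector<T>& b, T alpha) {
  return apply_binary<T>(a, b, [alpha](T x, T y) { return x + alpha * y; });
}
```

Both forms compute the same result; only the type the launcher is instantiated with changes, which matches the claim that perf is unaffected.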
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41059
Differential Revision: D22458928
Pulled By: malfet
fbshipit-source-id: cca623bb6e769cfe372977b08463d98b1a02dd14