Use explicit templates in CUDALoops kernels (#44286)
Summary:
Reland attempt of https://github.com/pytorch/pytorch/pull/41059
Use explicit templates instead of lambdas to reduce binary size by 100-200Kb per arch per CU without affecting perf, namely:
BinaryMulDivKernel.cu 3.8Mb -> 3.5Mb
CompareEQKernel.cu 1.8Mb -> 1.7Mb
BinaryAddSubKernel.cu 2.0Mb -> 1.8Mb
BinaryBitwiseOpsKernels.cu 2.6Mb -> 2.3Mb
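
For illustration, here is a minimal host-side C++ sketch of the pattern, not the actual kernel code from this change. The names binary_loop, add_with_lambda, add_with_functor and AddFunctor are hypothetical, and a plain loop stands in for the GPU launch. Each lambda expression has its own unique closure type, while a named functor template is the same type wherever it is used; this may be one factor behind the size reduction, though the mechanism is not stated above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a generic elementwise kernel launcher.
// It is instantiated once per distinct (T, Op) pair.
template <typename T, typename Op>
void binary_loop(const std::vector<T>& a, const std::vector<T>& b,
                 std::vector<T>& out, Op op) {
  for (std::size_t i = 0; i < a.size(); ++i) {
    out[i] = op(a[i], b[i]);
  }
}

// Lambda style: the closure type is unique to this lambda expression,
// so binary_loop is instantiated with an unnamed, call-site-specific type.
template <typename T>
void add_with_lambda(const std::vector<T>& a, const std::vector<T>& b,
                     std::vector<T>& out) {
  binary_loop(a, b, out, [](T x, T y) { return x + y; });
}

// Explicit template style: AddFunctor<T> is a named type that is
// identical at every place it is used.
template <typename T>
struct AddFunctor {
  T operator()(T x, T y) const { return x + y; }
};

template <typename T>
void add_with_functor(const std::vector<T>& a, const std::vector<T>& b,
                      std::vector<T>& out) {
  binary_loop(a, b, out, AddFunctor<T>());
}
```

Both variants compute the same result; only the type passed as Op differs.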
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44286
Reviewed By: ngimel
Differential Revision: D23859691
Pulled By: malfet
fbshipit-source-id: 2c4e86f35e0f94a62294dc5d52a3ba364db23e2d