Use explicit templates in `gpu_kernel_with_scalars` (#40992)
Summary:
This trick should have no effect on performance, but it reduces the size of kernels that use the template by about 10%.
For example, the size of BinaryMulDivKernel.cu.o compiled with the CUDA-10.1 toolchain for sm_75 was 4.2 MB before the change and 3.8 MB after.
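The message above does not show the code, so the following is only a rough sketch of what "explicit templates" could look like. It assumes the idea is to bind the scalar operand with a named functor template rather than an ad hoc lambda. The names `BoundFirstArg`, `BoundSecondArg`, `apply_unary`, `apply_with_scalar_lhs` and `apply_with_scalar_rhs` are hypothetical and are not taken from the PR, and a plain host loop stands in for the kernel launch:

```cpp
#include <functional>

// Hypothetical functor (illustration only): binds the FIRST argument of a
// binary operation to a fixed scalar, yielding a unary operation.
template <typename arg1_t, typename arg2_t, typename return_t, typename func_t>
struct BoundFirstArg {
  BoundFirstArg(func_t f, arg1_t a) : f_(f), a_(a) {}
  return_t operator()(arg2_t b) const { return f_(a_, b); }
  func_t f_;
  arg1_t a_;
};

// Hypothetical functor (illustration only): binds the SECOND argument.
template <typename arg1_t, typename arg2_t, typename return_t, typename func_t>
struct BoundSecondArg {
  BoundSecondArg(func_t f, arg2_t b) : f_(f), b_(b) {}
  return_t operator()(arg1_t a) const { return f_(a, b_); }
  func_t f_;
  arg2_t b_;
};

// Stand-in for an element-wise kernel launch: applies op to each element
// in place. A real GPU kernel would be launched here instead of a host loop.
template <typename T, typename op_t>
void apply_unary(T* data, int n, const op_t& op) {
  for (int i = 0; i < n; ++i) {
    data[i] = op(data[i]);
  }
}

// Computes data[i] = f(scalar, data[i]) using the explicit functor type.
template <typename T, typename func_t>
void apply_with_scalar_lhs(T* data, int n, T scalar, func_t f) {
  apply_unary(data, n, BoundFirstArg<T, T, T, func_t>(f, scalar));
}

// Computes data[i] = f(data[i], scalar) using the explicit functor type.
template <typename T, typename func_t>
void apply_with_scalar_rhs(T* data, int n, T scalar, func_t f) {
  apply_unary(data, n, BoundSecondArg<T, T, T, func_t>(f, scalar));
}
```

In this sketch the type of the bound operation is spelled out from its template arguments, so every call site with the same argument types and `func_t` names the same functor type. Whether that is the mechanism behind the size reduction reported above is an assumption, not something the message states.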
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40992
Differential Revision: D22398733
Pulled By: malfet
fbshipit-source-id: 6576f4da00dc5fc2575b2313577f52c6571d5e6f