Add acc_gpu_kernel_with_scalars and port add to use it (#63884)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63884
See https://dev-discuss.pytorch.org/t/cuda-loops-case-study-code-generation-vs-templates/302
for explanation of what's going on here.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D30545296
Pulled By: ezyang
fbshipit-source-id: f0da52153ae63599fe1d57e90e73f50ca2116939