Add symmetric version of gpu_kernel_with_scalars
`gpu_kernel_with_scalars` produces 3 calls to `gpu_kernel` for
`f(a, b)`, where either of `a` or `b` can be a CPU scalar. If `f` happens to
be symmetric (i.e. `f(a, b) == f(b, a)`), then only 2 calls to
`gpu_kernel` are needed, reducing the CUDA context size.
On my build targeting a single CUDA architecture, this reduces
`torch_cuda_cu.so` by 24.5 MB.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78989
Approved by: https://github.com/ngimel