Exploit symmetry in comparison operators to reduce the number of kernels
`gpu_kernel_with_scalars` generates three `gpu_kernel` calls to compute
`f(a, b)`, where either `a` or `b` can be a scalar constant: one binary
kernel, plus one unary kernel for each scalar position. We can cut this
to two by exploiting the symmetry of the comparison operators so that
only one unary kernel is created, e.g. by rewriting `a < b` as `b > a`.
On my build for a single CUDA architecture, this reduces the size of
`torch_cuda_cu.so` by 1.8 MB.
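
The operand swap described above can be sketched in plain host-side C++.
This is only an illustration under stated assumptions, not the code in
this change: `CmpOp`, `mirror`, `apply_cmp`, `compare_tensor_scalar` and
`compare_scalar_tensor` are hypothetical names, and `std::vector<int>`
stands in for a tensor. The point is that the scalar-on-the-left case
needs no loop of its own; it forwards to the tensor-on-the-left loop
with the operator mirrored.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical operator tag; not a PyTorch type.
enum class CmpOp { LT, LE, GT, GE };

// Mirror an operator so that `a op b` is equivalent to `b mirror(op) a`.
constexpr CmpOp mirror(CmpOp op) {
  switch (op) {
    case CmpOp::LT: return CmpOp::GT;
    case CmpOp::LE: return CmpOp::GE;
    case CmpOp::GT: return CmpOp::LT;
    case CmpOp::GE: return CmpOp::LE;
  }
  return op;  // not reached for valid enum values
}

// Evaluate a single comparison.
constexpr bool apply_cmp(CmpOp op, int a, int b) {
  switch (op) {
    case CmpOp::LT: return a < b;
    case CmpOp::LE: return a <= b;
    case CmpOp::GT: return a > b;
    case CmpOp::GE: return a >= b;
  }
  return false;  // not reached for valid enum values
}

// The one "unary" loop: tensor operand on the left, scalar on the right.
std::vector<bool> compare_tensor_scalar(CmpOp op,
                                        const std::vector<int>& a,
                                        int b) {
  std::vector<bool> out(a.size());
  for (std::size_t i = 0; i < a.size(); ++i) {
    out[i] = apply_cmp(op, a[i], b);
  }
  return out;
}

// Scalar on the left: reuse the loop above by swapping the operands and
// mirroring the operator, instead of writing a second loop.
std::vector<bool> compare_scalar_tensor(CmpOp op,
                                        int a,
                                        const std::vector<int>& b) {
  return compare_tensor_scalar(mirror(op), b, a);
}
```

For example, `compare_scalar_tensor(CmpOp::LT, 2, t)` evaluates `2 < t[i]`
by running the tensor-scalar loop as `t[i] > 2`.
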
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78990
Approved by: https://github.com/ngimel