Reduce binary size of TensorCompare.cu (#68835)
Summary:
This PR makes four changes:
1) Eliminates `where` instantiations for the deprecated `byte` condition dtype and casts `condition` to `bool` in that case instead. This adds a perf penalty for callers still using the deprecated dtype.
2) Makes the `clamp_{min/max}.Tensor` overloads reuse the `clamp_{min/max}.Scalar` kernels when the limit argument is a CPU scalar, instead of instantiating `gpu_kernel_with_scalars`.
3) Unifies all clamp_scalar kernels into a single kernel whose lambda picks the correct operation. I've verified that this doesn't degrade kernel performance.
4) Eliminates the redundant TensorIterator construction that the `clamp` structured kernel performed when only `min` or `max` was specified.
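
The idea behind item 3 can be pictured with a small host-side C++ sketch. It is only an illustration under stated assumptions, not the code in this PR: `ClampMode`, `elementwise`, and `clamp_scalar` are hypothetical names, and `elementwise` merely stands in for a GPU elementwise launcher. The point is that the operation is selected by a runtime value inside one lambda, so the template is instantiated once per dtype rather than once per dtype and per operation.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical names throughout; this is a CPU stand-in, not the ATen code.
enum class ClampMode { Min, Max, MinMax };

// Stand-in for an elementwise kernel launcher: applies `op` to every element.
// Each distinct lambda type passed here produces a separate instantiation.
template <typename T, typename F>
void elementwise(std::vector<T>& data, F op) {
  for (auto& v : data) {
    v = op(v);
  }
}

// A single lambda handles all three clamp variants, so `elementwise` is
// instantiated once per element type T instead of three times.
// `lo` is ignored for ClampMode::Max and `hi` is ignored for ClampMode::Min.
template <typename T>
void clamp_scalar(std::vector<T>& data, ClampMode mode, T lo, T hi) {
  elementwise(data, [=](T v) -> T {
    switch (mode) {
      case ClampMode::Min:
        return std::max(v, lo);  // enforce the lower bound only
      case ClampMode::Max:
        return std::min(v, hi);  // enforce the upper bound only
      case ClampMode::MinMax:
        return std::min(std::max(v, lo), hi);  // enforce both bounds
    }
    return v;  // unreachable for valid modes; keeps compilers quiet
  });
}
```

The trade-off in this sketch is one extra branch per element on a value that is uniform across the whole launch, in exchange for fewer instantiations.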
This reduces the cubin size for TensorCompare.cu on V100 from 15751920 bytes to 7691120 bytes (about 51% smaller), with a corresponding reduction in compile time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68835
Reviewed By: mruberry
Differential Revision: D32839241
Pulled By: ngimel
fbshipit-source-id: 0acde5af10a767264afbdb24684b137c5544b8d9