pytorch
ff1c1d41 - Reduce binary size of TensorCompare.cu (#68835)

2 years ago

Reduce binary size of TensorCompare.cu (#68835)

Summary:
This PR does several things:
1) Eliminates `where` instantiations for the deprecated `byte` (uint8) condition dtype, and casts `condition` to `bool` in that case. This imposes a performance penalty on callers still using the deprecated byte condition.
2) Makes the `clamp_{min/max}.Tensor` overloads reuse the `clamp_{min/max}.Scalar` kernels when the limit argument is a CPU scalar, instead of instantiating `gpu_kernel_with_scalars`.
3) Unifies all clamp_scalar kernels into a single kernel whose lambda picks the correct operation. I've verified that this doesn't degrade kernel performance.
4) Eliminates a redundant TensorIterator construction that the `clamp` structured kernel was performing when only `min` or `max` was specified.

This reduces the cubin size for TensorCompare.cu on V100 from 15751920 bytes to 7691120 bytes (a reduction of about 51%), with a corresponding reduction in compile time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68835
Reviewed By: mruberry
Differential Revision: D32839241
Pulled By: ngimel
fbshipit-source-id: 0acde5af10a767264afbdb24684b137c5544b8d9
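The commit message describes these changes in prose only. As an illustration, here is a minimal host-side C++ sketch (plain loops over `std::vector`, not PyTorch's CUDA kernels or TensorIterator) of how items 1 to 3 reduce the number of template instantiations. Every name in it (`ClampMode`, `clamp_scalar_kernel`, `LimitArg`, `clamp_min_tensor`, `where_kernel`, `where_byte`) is hypothetical, and broadcasting and dtype dispatch are omitted.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical enum: which clamp operation the single kernel performs.
enum class ClampMode { Min, Max, Both };

// Item 3 (sketch): one kernel template per element type serves clamp,
// clamp_min and clamp_max. The mode is a runtime argument, so the compiler
// emits one loop per T instead of one per (T, operation) pair.
template <typename T>
void clamp_scalar_kernel(std::vector<T>& data, T lo, T hi, ClampMode mode) {
  auto op = [mode, lo, hi](T x) {
    switch (mode) {
      case ClampMode::Min: return std::max(x, lo);    // clamp_min
      case ClampMode::Max: return std::min(x, hi);    // clamp_max
      default: return std::min(std::max(x, lo), hi);  // clamp
    }
  };
  for (T& x : data) x = op(x);
}

// Hypothetical stand-in for a Tensor limit argument: its values and whether
// it lives on the CPU.
template <typename T>
struct LimitArg {
  std::vector<T> values;
  bool on_cpu;
};

// Item 2 (sketch): clamp_min with a tensor limit. A one-element CPU limit is
// unwrapped and routed to the scalar kernel, so no separate tensor-tensor
// kernel is needed for that case.
template <typename T>
void clamp_min_tensor(std::vector<T>& data, const LimitArg<T>& min) {
  if (min.on_cpu && min.values.size() == 1) {
    // The `hi` argument is ignored in Min mode.
    clamp_scalar_kernel(data, min.values[0], T{}, ClampMode::Min);
    return;
  }
  // General path (broadcasting omitted): elementwise against a same-sized limit.
  for (std::size_t i = 0; i < data.size(); ++i) {
    data[i] = std::max(data[i], min.values[i]);
  }
}

// Item 1 (sketch): the `where` kernel is only instantiated for bool conditions.
template <typename T>
std::vector<T> where_kernel(const std::vector<bool>& cond,
                            const std::vector<T>& a,
                            const std::vector<T>& b) {
  std::vector<T> out(cond.size());
  for (std::size_t i = 0; i < cond.size(); ++i) out[i] = cond[i] ? a[i] : b[i];
  return out;
}

// Deprecated uint8 conditions are converted to bool first. The extra pass is
// the performance cost; in exchange, no uint8-condition kernel is instantiated.
template <typename T>
std::vector<T> where_byte(const std::vector<std::uint8_t>& cond,
                          const std::vector<T>& a,
                          const std::vector<T>& b) {
  std::vector<bool> as_bool(cond.size());
  for (std::size_t i = 0; i < cond.size(); ++i) as_bool[i] = cond[i] != 0;
  return where_kernel(as_bool, a, b);
}
```

In this sketch, `mode` is the same for every element, so the branch inside the lambda is uniform; on a GPU all threads of a warp would take the same path, which is consistent with the commit's report that unifying the kernels did not degrade kernel performance.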

References

#69791 - Merge from master

Author

ngimel

Committer

desertfire

Parents

6c5251c2
