reduce number of randperm template instantiations (#58362)
Summary:
Per title. The benchmarks in https://github.com/pytorch/pytorch/issues/54113 do not regress. The size of torch_cuda_cu_generated_Randperm.cu.o goes from 8562152 to 2585792 for a single architecture (about 70% smaller), and compilation time also decreases.
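For readers unfamiliar with how a template instantiation count can be cut, here is a minimal host-side C++ sketch of one common technique: dispatching the heavy routine on element size rather than element type. This is an illustration under assumptions, not necessarily the exact change made in this PR; the names `permute_bytes` and `permute` are hypothetical and do not come from the PyTorch source.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical "heavy" routine, templated on element size in bytes rather
// than on element type. int32_t and float (both 4 bytes) therefore share a
// single instantiation instead of producing two.
template <std::size_t N>
void permute_bytes(const unsigned char* src,
                   unsigned char* dst,
                   const std::vector<std::size_t>& perm) {
  for (std::size_t i = 0; i < perm.size(); ++i) {
    // Copy element perm[i] of the input into slot i of the output.
    std::memcpy(dst + i * N, src + perm[i] * N, N);
  }
}

// Thin typed wrapper (also hypothetical). It is instantiated per type, but
// it only forwards to the size-based routine, so it adds little code.
template <typename T>
std::vector<T> permute(const std::vector<T>& values,
                       const std::vector<std::size_t>& perm) {
  static_assert(std::is_trivially_copyable<T>::value,
                "byte-wise copying requires a trivially copyable type");
  std::vector<T> result(perm.size());
  permute_bytes<sizeof(T)>(
      reinterpret_cast<const unsigned char*>(values.data()),
      reinterpret_cast<unsigned char*>(result.data()),
      perm);
  return result;
}
```

With this layout, calling `permute` on a `std::vector<float>` and on a `std::vector<std::int32_t>` both resolve to `permute_bytes<4>`, so the number of heavy instantiations is bounded by the number of distinct element sizes rather than the number of element types.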
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58362
Reviewed By: heitorschueroff
Differential Revision: D28477697
Pulled By: ngimel
fbshipit-source-id: 32dbe44ca6b3807668d548512d7484f8488834c4
Author: Natalia Gimelshein