fix nonzero perf regression (#58468)
Summary:
https://github.com/pytorch/pytorch/issues/55292 introduced a perf regression for CUDA nonzero; this PR fixes it. nvcc is still pretty bad at unrolling loops whose bounds are not known at compile time, which makes the `write_indices` kernels ~5x slower than they should be.
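For illustration only, here is a minimal host-side C++ sketch of the general pattern, not the actual diff in this PR. It contrasts an index-unraveling loop whose trip count depends on a runtime `ndim` with one that always iterates up to a compile-time cap and skips unused dimensions. The names `kMaxDims`, `Dims`, `unravel_runtime_bound`, and `unravel_fixed_bound` are hypothetical, and the cap of 8 is an assumed value. In a CUDA kernel the second form is the kind of loop that a `#pragma unroll` hint can fully unroll.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical compile-time cap on tensor rank (name and value are assumptions).
constexpr int kMaxDims = 8;

// Hypothetical fixed-size container for tensor sizes, passed by value to a kernel.
struct Dims {
  int64_t sizes[kMaxDims];
};

// Variant A: the trip count depends on the runtime value `ndim`,
// so the compiler cannot know how many iterations to unroll.
inline void unravel_runtime_bound(int64_t flat, const Dims& dims, int ndim,
                                  int64_t* out) {
  assert(ndim >= 0 && ndim <= kMaxDims);
  int64_t div = 1;
  for (int dim = ndim - 1; dim >= 0; --dim) {
    const int64_t size = dims.sizes[dim];
    out[dim] = (flat / div) % size;  // row-major coordinate along `dim`
    div *= size;
  }
}

// Variant B: the loop always runs kMaxDims times (a compile-time constant)
// and skips dimensions the tensor does not have. Same result as variant A,
// but the bound is known, so the loop can be unrolled completely.
inline void unravel_fixed_bound(int64_t flat, const Dims& dims, int ndim,
                                int64_t* out) {
  assert(ndim >= 0 && ndim <= kMaxDims);
  int64_t div = 1;
  for (int dim = kMaxDims - 1; dim >= 0; --dim) {
    if (dim >= ndim) continue;  // unused trailing slot, nothing to write
    const int64_t size = dims.sizes[dim];
    out[dim] = (flat / div) % size;
    div *= size;
  }
}
```

Both variants compute the same coordinates; the only difference is whether the loop bound is visible to the compiler. Whether this yields a speedup depends on the compiler and target, so treat it as a sketch of the idea rather than a measured result.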
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58468
Reviewed By: mruberry
Differential Revision: D28511147
Pulled By: ngimel
fbshipit-source-id: fe7303ec77da1abbe5e874093eca247b3919616f
Author
Natalia Gimelshein