sparse_mask: faster, with support for uncoalesced masks (#91964)
This PR updates `sparse_mask` to be:
* about 30% faster on CUDA.
* able to support uncoalesced masks.
* much shorter code-wise.
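
A minimal sketch of the uncoalesced-mask case, assuming a PyTorch build that includes this change. The tensor names and values are illustrative only and are not taken from the PR. The mask has unique but unsorted indices and is never coalesced:

```python
import torch

dense = torch.arange(1.0, 10.0).reshape(3, 3)

# Indices are unique but deliberately unsorted. A COO tensor built this way
# reports is_coalesced() == False until .coalesce() is called on it.
indices = torch.tensor([[2, 0, 1],
                        [0, 1, 2]])
values = torch.ones(3)
mask = torch.sparse_coo_tensor(indices, values, (3, 3))

# Select the entries of `dense` at the positions listed in `mask`.
masked = dense.sparse_mask(mask)

# Reference result built by plain indexing, for comparison.
expected = torch.zeros(3, 3)
expected[indices[0], indices[1]] = dense[indices[0], indices[1]]
```

Here `masked.to_dense()` should match `expected`. Masks with duplicate indices are left out of the sketch on purpose, since how duplicates are treated is not described here.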
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91964
Approved by: https://github.com/cpuhrsch, https://github.com/pearu