Improve performance of advanced indexing backward (#20557)
Summary:
This PR improves performance of advanced indexing backward, partially solving #15245 (performance is still worse than gather, but not by such outrageous margins). Before this change, using the benchmarking harness from #15245 on CUDA 10/V100:
```
Indexing is faster by at most -270.61607820767887 us on N: 16 D: 256 K: 1
Indexing is slower by at most 11127.466280784833 us on N: 16 D: 4096 K: 4096
```
After:
```
Indexing is faster by at most 23.524456737696028 us on N: 512 D: 4096 K: 4096
Indexing is slower by at most 186.24056029472553 us on N: 16 D: 1024 K: 4096
```
The strategy is to reuse the embedding backward kernel, adapting it to handle leading unindexed dimensions by launching additional threadblocks, and also allowing it to handle slices bigger than `65K*128`, which is hardly ever a problem for embedding. Still, integer indexing is baked into the kernel and is important for performance, so for now tensors with more than 2G elements are not supported.
The main savings come from not having to expand the index to all unindexed dimensions, and from sorting only the unexpanded index rather than sorting the expanded index together with the incoming gradient values.
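As a rough illustration of that idea (a pure-Python sketch, not the CUDA kernel from this PR; the function names `index_backward_sorted` and `index_backward_naive` are hypothetical), only the K-element index is sorted, and each run of equal indices is then accumulated into one output row while the gradient rows stay in place:

```python
# Illustrative sketch only; these helpers are hypothetical, not PyTorch APIs.

def index_backward_sorted(grad, index, num_rows):
    """Accumulate grad[k] into out[index[k]], sorting only the K-element index."""
    dim = len(grad[0]) if grad else 0
    out = [[0.0] * dim for _ in range(num_rows)]
    # Sort positions 0..K-1 by their target row; gradient rows are not moved.
    order = sorted(range(len(index)), key=lambda k: index[k])
    start = 0
    while start < len(order):
        row = index[order[start]]
        end = start
        # Walk the run of sorted positions that share the same target row.
        while end < len(order) and index[order[end]] == row:
            for d in range(dim):
                out[row][d] += grad[order[end]][d]
            end += 1
        start = end
    return out


def index_backward_naive(grad, index, num_rows):
    """Reference: scatter-add every gradient element individually."""
    dim = len(grad[0]) if grad else 0
    out = [[0.0] * dim for _ in range(num_rows)]
    for k, row in enumerate(index):
        for d in range(dim):
            out[row][d] += grad[k][d]
    return out
```

In this sketch the sort touches K keys regardless of the slice width, whereas sorting an expanded index would touch K times the slice size; duplicate indices (rows 2 and 2 in `[2, 0, 2]`, for example) end up summed into the same output row.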
There are ways to make the sorting overhead smaller (thanks mcarilli for suggestions), but I'll get to it when it becomes a real problem, or rather, when CUDA graphs force us to get rid of thrust::sort calls.
I've also added tests for indexing backward; previously, tests for index_put_ and indexing backward were non-existent.
This PR also fixes #20457 by casting indices to `self` backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20557
Differential Revision: D15582434
Pulled By: ezyang
fbshipit-source-id: 91e8f2769580588ec7d18823d99a26f1c0da8e2a
Author
Natalia Gimelshein