Improve performance of index_select by avoiding item (#63008)
Summary:
Partially fixes https://github.com/pytorch/pytorch/issues/61788
From a CUDA perspective: item already pulls all Tensor content onto the host, but one element at a time, and each of those memory transfers is very expensive. With this change the transfer happens all at once.
From a CPU perspective: item carries a lot of per-call overhead as a native function compared with simply reading through a pointer.
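To illustrate the difference between the two access patterns, here is a toy sketch in plain Python. It is not the ATen code touched by this PR: FakeDeviceBuffer, select_per_item and select_bulk are hypothetical names invented for the example, and the "transfer" counter only stands in for a device-to-host copy.

```python
class FakeDeviceBuffer:
    """Hypothetical stand-in for a device-side index tensor.

    It only counts how many host transfers a caller triggers.
    """

    def __init__(self, values):
        self._values = list(values)
        self.transfers = 0

    def __len__(self):
        return len(self._values)

    def item(self, i):
        # Models one device-to-host copy per element read.
        self.transfers += 1
        return self._values[i]

    def to_host(self):
        # Models a single bulk copy of the whole buffer.
        self.transfers += 1
        return list(self._values)


def select_per_item(src, index):
    # Old pattern: one item() call (one transfer) per index element.
    return [src[index.item(i)] for i in range(len(index))]


def select_bulk(src, index):
    # New pattern: copy the index once, then read it like plain memory.
    host_index = index.to_host()
    return [src[i] for i in host_index]


src = [10, 20, 30, 40]
slow_index = FakeDeviceBuffer([3, 0, 2])
fast_index = FakeDeviceBuffer([3, 0, 2])

slow = select_per_item(src, slow_index)
fast = select_bulk(src, fast_index)
```

Both functions return the same selection; the per-item version triggers one transfer per index element, while the bulk version triggers one in total.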
Overall there are still many performance gains to be had, but this is a small change that should take us into a more usable landscape. This doesn't land a separate benchmark; I don't think one is necessary to decide on the benefit of this change (we'll also see whether it shows up indirectly), but it is still a good follow-up item.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63008
Reviewed By: zou3519
Differential Revision: D30211160
Pulled By: cpuhrsch
fbshipit-source-id: 70b752be5df51afc66b5aa1c77135d1205520cdd