Optimize index_select for 1D inputs (#35243)
Summary:
`gather` turns out to be much faster than `index_select` for this function (anywhere from 2-10x faster in my testing). We do have to match the shape of the generated indices; however, this does not affect performance, since `.expand` does not copy the underlying buffer.
I experimented with a custom kernel, but its improvement over this implementation didn't justify the approach: it would have added significant complexity and reduced the use of shared infrastructure in the PyTorch codebase.
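For illustration, the `gather`-based idea described above can be sketched roughly as follows. This is not the code from the PR: the helper name `index_select_via_gather` and the example tensors are hypothetical. It only shows how a 1D index can be reshaped and expanded (without copying) so that `gather` returns the same result as `index_select`.

```python
import torch


def index_select_via_gather(src, dim, index):
    """Hypothetical helper: emulate torch.index_select using gather."""
    if index.dim() != 1:
        raise ValueError("index must be 1D")
    dim = dim % src.dim()  # normalize negative dims

    # Reshape the 1D index to [1, ..., n, ..., 1] with n placed at `dim`.
    view_shape = [1] * src.dim()
    view_shape[dim] = index.numel()

    # Expand to src's shape on every other dimension. `expand` returns a
    # view with zero strides, so the index buffer is not copied.
    out_shape = list(src.shape)
    out_shape[dim] = index.numel()
    expanded = index.view(view_shape).expand(out_shape)

    return src.gather(dim, expanded), expanded


src = torch.arange(12.0).reshape(3, 4)
index = torch.tensor([2, 0, 2])  # int64, as gather requires

result, expanded = index_select_via_gather(src, 1, index)
reference = torch.index_select(src, 1, index)
```

Here `result` and `reference` should hold the same values, and `expanded` shares its storage with `index`, which is why matching the shape adds no copy.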
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35243
Differential Revision: D20629914
Pulled By: robieta
fbshipit-source-id: 7841b6a40ffd2b32e544f54ef2529904d76864b8