Fix perf bug with indexed assignment (index_put_) (#24083)
Summary:
TensorIterator was incorrectly moving the stride-0 dimension to the
innermost position for indexed assignment:
a[idx] = b
Note that the corresponding read was still fast:
c = a[idx]
This was noticed by adamlerer.
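For readers less familiar with the two spellings in the title, here is a minimal sketch showing that indexed assignment with a tensor index and an explicit index_put_ call write the same values. The sizes and the names a2 and same are illustrative assumptions, not taken from the PR.

```python
import torch

# Small sizes so the check runs instantly; the benchmark in this commit uses N = 300000.
N = 1000
a = torch.zeros(N, 128)
b = torch.randn(N, 128)
idx = torch.randperm(N)  # unique row indices, so the result is deterministic

# Indexed assignment with a tensor index.
a[idx] = b

# The same write expressed as an explicit index_put_ call on a fresh tensor.
a2 = torch.zeros(N, 128)
a2.index_put_((idx,), b)

# Both spellings should leave identical contents.
same = torch.equal(a, a2)
print(same)
```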
```
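# Run in IPython or Jupyter: %timeit is an IPython magic, not plain Python.
import torch

N = 300000
torch.set_num_threads(1)  # single thread for stable timings
a = torch.zeros(N, 128)
b = torch.zeros(N, 128)
idx = torch.arange(N)

%timeit c = a[idx]  # indexed read; before and after: ~91.3 ms
%timeit a[idx] = b  # indexed assignment; before: 4.38 sec, after: 44.1 ms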
```
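The snippet above relies on the IPython %timeit magic. As a rough sketch for plain Python, the same comparison can be made with the standard timeit module. The helper name time_indexing and its parameters are hypothetical, and the numbers it returns will differ by machine and PyTorch version.

```python
import timeit

import torch


def time_indexing(n=300000, width=128, repeats=3):
    """Return (read_seconds, write_seconds), best of `repeats` single runs.

    Hypothetical helper for illustration; not part of PyTorch or this PR.
    """
    torch.set_num_threads(1)  # match the single-threaded setup used above
    a = torch.zeros(n, width)
    b = torch.zeros(n, width)
    idx = torch.arange(n)

    def read():
        # Indexed read: allocates and fills a new output tensor.
        return a[idx]

    def write():
        # Indexed assignment: writes into the existing tensor `a`.
        a[idx] = b

    read_s = min(timeit.repeat(read, number=1, repeat=repeats))
    write_s = min(timeit.repeat(write, number=1, repeat=repeats))
    return read_s, write_s
```

Calling time_indexing() with its defaults uses the same sizes as the benchmark in this commit; pass a smaller n for a quick smoke test.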
Note that, after the fix, the indexed read is slower than the indexed
assignment on my computer because the read has to allocate a new output
(which is zeroed by the kernel). The indexed assignment doesn't allocate
any new Tensors.
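One way to observe the allocation difference described above is to compare storage addresses with Tensor.data_ptr(). This is only a sketch with small, arbitrary sizes; the variable names are illustrative assumptions.

```python
import torch

a = torch.zeros(1000, 128)
b = torch.ones(1000, 128)
idx = torch.arange(1000)

# Indexed assignment writes into the existing storage of `a`.
ptr_before = a.data_ptr()
a[idx] = b
write_in_place = a.data_ptr() == ptr_before

# Indexed read with a tensor index returns a newly allocated tensor.
c = a[idx]
read_allocates = c.data_ptr() != a.data_ptr()

print(write_in_place, read_allocates)
```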
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24083
Differential Revision: D16805440
Pulled By: colesbury
fbshipit-source-id: 70a2e74ae79691afbfa9f75b3d7d1e6806f603f5