Use indexing instead of reshape for broadcasting (#91722)
This is needed for the MLIR rewrite.
This replaces
```
xindex = xoffset + tl.reshape(tl.arange(0, XBLOCK), [XBLOCK, 1])
```
with
```
xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
```
so the code is a bit more readable and compiles with Triton master (which doesn't currently support the first construct).
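For illustration only, here is a rough NumPy analogue of the two forms. It assumes NumPy's reshape and `None`-indexing semantics mirror Triton's for this case; `XBLOCK` and `xoffset` are hypothetical values, and this is not the generated kernel code.

```python
import numpy as np

XBLOCK = 4   # hypothetical block size, chosen for the example
xoffset = 8  # hypothetical starting offset of this block

# Old form: explicit reshape to a column of shape (XBLOCK, 1).
via_reshape = xoffset + np.reshape(np.arange(0, XBLOCK), [XBLOCK, 1])

# New form: indexing with None adds the trailing axis instead.
via_index = xoffset + np.arange(0, XBLOCK)[:, None]

# Both forms give the same (XBLOCK, 1) column, ready to broadcast
# against a row-shaped index along the other dimension.
assert via_reshape.shape == via_index.shape == (XBLOCK, 1)
assert np.array_equal(via_reshape, via_index)
```

Under these assumptions the two expressions are interchangeable, so the change should not alter the computed indices.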
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91722
Approved by: https://github.com/desertfire
Author: Natalia Gimelshein