[inductor] Inline ComputedBuffer computation when there are no reads (#102000)
When inductor compiles the following example,
```python
import torch

def flip(x):
    idx = torch.arange(x.shape[0] - 1, -1, -1, device=x.device)
    return x[idx], idx
```
Returning `idx` forces it to be realized into a `ComputedBuffer`,
and the downstream index call then inserts a load of that buffer
followed by an `indirect_indexing`:
```python
tmp0 = tl.load(in_ptr0 + (x1), None)
tmp1 = triton_helpers.promote_to_tensor(tmp0)
tl.device_assert((0 <= tmp1) & (tmp1 < 128), "index out of bounds: 0 <= tmp1 < 128")
tmp2 = tl.load(in_ptr1 + (x0 + (128*tmp0)), None)
```
However, since the buffer's computation has no reads, we can inline
its index expression and instead get direct indexing (and half the loads):
```python
tmp0 = tl.load(in_ptr0 + (127 + ((-1)*x0)), None)
```
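As a rough illustration of why the two forms agree, here is a plain-Python model of the 1-D case. It is only a sketch: `flip_indirect` and `flip_direct` are hypothetical names, not inductor code, and the size 128 is taken from the kernels above.

```python
def flip_indirect(x):
    """Model of the realized-buffer path: load the index, check it, then load x."""
    n = len(x)
    idx = list(range(n - 1, -1, -1))  # materialized index buffer
    out = []
    for x0 in range(n):
        tmp0 = idx[x0]  # first load: the index value
        assert 0 <= tmp0 < n, "index out of bounds"
        out.append(x[tmp0])  # second load: indirect indexing
    return out, idx


def flip_direct(x):
    """Model of the inlined path: the index is computed, not loaded."""
    n = len(x)
    out = [x[(n - 1) + (-1) * x0] for x0 in range(n)]  # single load per element
    idx = list(range(n - 1, -1, -1))  # idx is still returned to the caller
    return out, idx


x = list(range(1000, 1128))  # 128 elements, matching the bound in the kernels
assert flip_indirect(x) == flip_direct(x)
```

In this model the inlined form does one read of `x` per output element and needs no bounds check, because the index is a closed-form expression of `x0`.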
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102000
Approved by: https://github.com/lezcano