[inductor] Add lowering for as_strided_scatter (#88379)
Ref pytorch/torchdynamo#327
Using `as_strided` inherently requires in-memory data movement; however, this
lowering allows those memory operations to be fused with any preceding
computation. For example:
```python
def f(a, b):
    return torch.as_strided_scatter(
        a * 8 + 10,
        b * 2 - 4,
        size=(a.numel() // 2,),
        stride=(2,))
```
Before this PR, the example compiles to two kernels plus a call to
`aten.as_strided_scatter`; with this PR it compiles to just two kernels and no
additional operator calls.
In theory this could be a decomposition, but in practice I saw the
`output_view.copy_(src)` being optimized out in some cases when it was
implemented that way.
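For reference, the decomposition alluded to above can be sketched roughly as
follows. This is a minimal illustration of the operator's semantics, not the
actual inductor lowering, and the helper name `as_strided_scatter_ref` is made
up for this example:

```python
import torch

def as_strided_scatter_ref(input, src, size, stride, storage_offset=0):
    # Clone the input, take a strided view of the clone, and copy src
    # into that view -- this copy_ is the step that was observed being
    # optimized out when written as a decomposition.
    out = input.clone()
    out.as_strided(size, stride, storage_offset).copy_(src)
    return out

a = torch.arange(8.)
src = torch.tensor([10., 20., 30., 40.])
out = as_strided_scatter_ref(a, src, size=(4,), stride=(2,))
# Should match torch.as_strided_scatter(a, src, size=(4,), stride=(2,))
```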
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88379
Approved by: https://github.com/jansel