[inductor] New approach for computing triton load/store masks (#91241)
This PR is a new version of #89566 that fixes a test failure.
I couldn't get ghstack to cooperate on updating that PR after re-opening it,
so I started a new one.
This changes the way masks for loads/stores are computed in the triton backend of inductor.
The new approach iterates over all variables used in the indexing expression and adds the corresponding mask variables to the set that will be used. For indexing variables like `x0`, `y1` and `r3`, it adds `xmask`, `ymask` and `rmask` respectively.
For indexing variables like `tmp5` (i.e., indirect indexing), it uses the new `mask_vars` attribute of the corresponding `TritonCSEVariable` object, which is populated when the variable is created.
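
As a rough illustration of the rule described above, here is a minimal standalone sketch. It is not the inductor code: `FakeCSEVariable` and `collect_mask_vars` are hypothetical names, and the index expression is handled as a plain string rather than a sympy expression.

```python
import re


class FakeCSEVariable:
    """Hypothetical stand-in for TritonCSEVariable: a name plus its mask_vars."""

    def __init__(self, name, mask_vars=()):
        self.name = name
        self.mask_vars = set(mask_vars)


def collect_mask_vars(index_expr, cse_vars):
    """Return the mask names needed for a load/store at index_expr.

    index_expr: indexing expression as a string (assumption for this sketch).
    cse_vars:   dict mapping names like "tmp5" to FakeCSEVariable objects.
    """
    masks = set()
    for name in re.findall(r"\b[A-Za-z_]\w*", index_expr):
        if name in cse_vars:
            # Indirect indexing: reuse the masks recorded on the variable.
            masks |= cse_vars[name].mask_vars
        elif name[0] in "xyr" and name[1:].isdigit():
            # Direct indexing: x0 -> xmask, y1 -> ymask, r3 -> rmask.
            masks.add(name[0] + "mask")
    return masks
```

For example, under these assumptions `collect_mask_vars("x0 + 64*r3", {})` yields only `xmask` and `rmask`, so a mask term for `y` is not added when no `y` variable appears in the index.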
I started working on this with the aim of fixing https://github.com/pytorch/torchdynamo/issues/1654, which has since been fixed by #89524 with a different approach, making this change less necessary. Note, however, that #89524 fixes the issue by broadcasting the indices that are being loaded to a larger size, while this approach fixes it by making the mask have only the necessary terms.
Relative to #89566, the only change is that the mask variables of the
arguments are not included when the function being called is `tl.where`,
because `tl.where` is often used precisely to make sure the output
variable has valid values.
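
A minimal sketch of how this exception could look when the mask variables of a new result are derived from its arguments. The names `Var` and `mask_vars_for_result` are hypothetical and only illustrate the idea; the real logic lives in inductor's triton codegen and may be structured differently.

```python
class Var:
    """Hypothetical stand-in for a CSE variable carrying mask_vars."""

    def __init__(self, name, mask_vars=()):
        self.name = name
        self.mask_vars = set(mask_vars)


def mask_vars_for_result(func_name, args):
    """Compute mask_vars for the result of calling func_name on args."""
    if func_name == "tl.where":
        # tl.where is typically used to replace invalid lanes with valid
        # values, so the result does not inherit its arguments' masks.
        return set()
    result = set()
    for arg in args:
        # Constants and other plain arguments carry no mask_vars.
        result |= getattr(arg, "mask_vars", set())
    return result
```

With this sketch, an ordinary call such as an add of an `xmask`-dependent and an `rmask`-dependent variable yields `{"xmask", "rmask"}`, while the same arguments passed to `tl.where` yield an empty set.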
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91241
Approved by: https://github.com/ngimel