In inductor triton generated code, avoid masking when numel=1 (#91254)
This implements an idea from @lezcano: if a generated Triton kernel has `xnumel=1`, then `xmask` reduces to `0<1`, which is always true, so it can be dropped from all `load`/`store`/`where` calls.
The `xnumel=1` case comes up relatively often when generating code for reductions. @lezcano reported some performance gains in micro-benchmarks (see comment below), and the change itself is very simple.
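A minimal sketch of the idea, not the actual inductor codegen: the helper `emit_load` and its `numels` argument are hypothetical names introduced only for illustration. It builds a Triton-style `tl.load` line as a string and leaves out the mask of any dimension whose numel is 1.

```python
def emit_load(name, index, numels):
    """Return a Triton-style load line as a string (illustrative only).

    `numels` maps a dimension prefix such as "x" or "r" to its size.
    This is a hypothetical helper, not inductor's real API.
    """
    # A dimension with numel == 1 has a mask that is always true,
    # so its mask is omitted from the generated code.
    masks = [f"{prefix}mask" for prefix, numel in numels.items() if numel != 1]
    if masks:
        return f"tl.load({name} + ({index}), {' & '.join(masks)})"
    # No dimension needs masking: emit an unmasked load.
    return f"tl.load({name} + ({index}))"


# Reduction with xnumel=1: only rmask remains.
print(emit_load("in_ptr0", "r1", {"x": 1, "r": 1024}))
# Both dimensions larger than 1: both masks are kept.
print(emit_load("in_ptr0", "r1 + 1024*x0", {"x": 8, "r": 1024}))
```

With `xnumel=1` the first call yields `tl.load(in_ptr0 + (r1), rmask)`, while the second keeps `xmask & rmask`.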
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91254
Approved by: https://github.com/jansel, https://github.com/ngimel