Second part of splitting #91254 in two (#92749)
This handles disabling masks when numel is a multiple of BLOCK.
It currently introduces a performance regression, but the Triton
it generates does not seem to have any issues: all the change does
is remove xmask from loads/stores in cases where it can safely
be removed. The regression therefore seems to be coming from some
issue in the Triton optimizer.
FWIW, if you try this change with current Triton master (instead of
the pinned version) it does _not_ cause a performance regression.
However, upgrading to Triton master by itself already causes
significant performance regressions, so just bumping up the pin
is not an option.
I'm going to leave this PR open until we manage to move
the Triton pin past the big refactoring. Once we do that,
I will check whether it still causes a performance regression.
UPDATE:
The Triton pin has been moved and I retried this PR. As expected, there is no longer a performance regression for hf_Bert:
```
tspin python benchmarks/dynamo/torchbench.py --performance --backend inductor --float16 --training --batch-size-file $(realpath benchmarks/dynamo/torchbench_models_list.txt) --only hf_Bert -n 5 --diff-branch viable/strict 2> err
batch size: 16
cuda train hf_Bert numel_BLOCK 1.175x p=0.00
batch size: 16
cuda train hf_Bert viable/strict 1.161x p=0.00
```
Re-opening this; I expect it should be okay to merge now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92749
Approved by: https://github.com/jansel