Second part of splitting #91254 in two (#92749)
This handles disabling masks when numel is a multiple of BLOCK.
It currently introduces a performance regression, but the Triton
it generates does not seem to have any issues: all the change does
is remove xmask from loads/stores in cases where it can safely
be removed. The regression therefore seems to be coming from some
issue in the Triton optimizer.
FWIW, if you try this change with current Triton master (instead of
the pinned version) it does _not_ cause a performance regression.
However, upgrading to Triton master by itself already causes
significant performance regressions, so just bumping up the pin
is not an option.
I'm going to leave this PR open until we manage to move
the Triton pin past the big refactoring. Once we do that,
I will check whether it still causes a performance regression.
UPDATE:
The Triton pin has been moved and I retried this PR. As expected, there is no longer a performance regression for hf_Bert:
```
tspin python benchmarks/dynamo/torchbench.py --performance --backend inductor --float16 --training --batch-size-file $(realpath benchmarks/dynamo/torchbench_models_list.txt) --only hf_Bert -n 5 --diff-branch viable/strict 2> err
batch size: 16
cuda train hf_Bert numel_BLOCK 1.175x p=0.00
batch size: 16
cuda train hf_Bert viable/strict 1.161x p=0.00
```
Re-opening this; I expect it should be okay to merge now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92749
Approved by: https://github.com/jansel