pytorch
2cc6ae19 - squash xblock for persistent inner reduction (#102444)

Commit

1 year ago

squash xblock for persistent inner reduction (#102444)

Currently, layer norm kernel performance is pretty bad due to a Triton perf bug (https://gist.github.com/ngimel/c1e7f70f8268f038e710e835b0065f63). Since XBLOCK for persistent reduction is `1`, we can drop this dimension and operate on 1D tensors, which improves the performance of layer norm kernels considerably.

Perf results (approx. 4% on hf): http://hud.pytorch.org/benchmark/compilers?startTime=Mon%2C%2022%20May%202023%2001%3A27%3A25%20GMT&stopTime=Mon%2C%2029%20May%202023%2001%3A27%3A25%20GMT&suite=torchbench&mode=training&dtype=amp&lBranch=ngimel/persistent_1d&lCommit=1d5175f5e682f37aae15fd217bc3767e1788bacf&rBranch=main&rCommit=c9f4f01981fd73fcc7c27676cc50230cd1b5bc22

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102444
Approved by: https://github.com/jansel
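To illustrate why the x dimension can be dropped when XBLOCK is `1`, here is a minimal pure-Python sketch. It is not the Inductor or Triton code from this commit. The function names `layer_norm_2d` and `layer_norm_1d` and the `eps` default are hypothetical, chosen only for illustration. The sketch normalizes one row twice, once as a tile of shape [XBLOCK, RBLOCK] with XBLOCK equal to 1 and once as a 1D vector of length RBLOCK, and shows that the results match.

```python
import math


def layer_norm_2d(tile, eps=1e-5):
    """Normalize each row of a [XBLOCK][RBLOCK] tile over the inner (r) axis.

    Hypothetical helper: mimics a kernel that keeps an explicit x dimension.
    """
    out = []
    for row in tile:  # loop over the x dimension (length XBLOCK)
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        rstd = 1.0 / math.sqrt(var + eps)
        out.append([(v - mean) * rstd for v in row])
    return out


def layer_norm_1d(row, eps=1e-5):
    """Same computation with the x dimension squashed: input is [RBLOCK].

    Hypothetical helper: mimics a kernel that operates on 1D tensors.
    """
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    rstd = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * rstd for v in row]


row = [1.0, 2.0, 3.0, 4.0]
out_2d = layer_norm_2d([row])  # tile of shape [1, 4], i.e. XBLOCK == 1
out_1d = layer_norm_1d(row)    # vector of shape [4]

# With a single row, the 2D result is just the 1D result wrapped in a list.
print(out_2d[0] == out_1d)
```

Because the outer dimension has length one, it carries no information, so removing it changes only the shape of the intermediate values and not the arithmetic. The performance gain described in the commit message comes from how Triton compiles the 1D form, which this sketch does not model.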

References

gh/willfengg/1/base

Author

Natalia Gimelshein

Committer

pytorchmergebot

Parents

3c2519ab
