SemanticDiff

pytorch
6696a75e - [inductor] make thread order consistent with loop order (#106827)

Commit View On GitHub

Login via GitHub
Home
Pricing
FAQ
Install

Login via GitHub

Commit

1 year ago

[inductor] make thread order consistent with loop order (#106827) I found that for a tiled kernel for tensor with shape [a, b], we map 'a' with XBLOCK and 'b' with YBLOCK. However, 'a' actually should be the outer looper while 'b' corresponding to the inner loop. This order is picked by our loop ordering algorithm. Mapping 'a' with XBLOCK has the semantic like assigning 'a' to the inner loop instead. For a simple 'A + B.t()' kernel, making the loop order consistent can brings 1.027x speedup ( 1.938ms -> 1.887ms speedup) . Here are the dump of kernels: - before fix: https://gist.github.com/shunting314/4dacf73cf495cdd7e84dede7c3e0872d - after fix (this one is done manually): https://gist.github.com/shunting314/441e8839d24e1878c313e539b1ebd551 I tried this on DistillGPT2 and found perf is neutral. But that because DistillGPT2 has a single tiled pointwise kernel in it's backward graph. Will check the dashboard. Pull Request resolved: https://github.com/pytorch/pytorch/pull/106827 Approved by: https://github.com/jansel

Author

shunting314

shunting314

Committer

pytorchmergebot

pytorchmergebot

Parents

FAQ Terms Privacy Refunds Impressum

Loading