Improve compare kernel (#29743)
Summary:
Currently, the way the compare kernels handle dtypes is odd (this behavior was introduced in https://github.com/pytorch/pytorch/pull/28427 and I only realized it today):
Let's say `a, b` are two float tensors on CUDA.
If you do `a < b`, this is what would happen inside the loop:
- Step 1: Fetch `a` and `b`, dynamically cast them from `float` to `float` (i.e. check the scalar type at runtime to figure out whether a cast is needed; it isn't, so nothing happens)
- Step 2: compute `a < b`, get a `bool` result
- Step 3: statically cast the result into `float`
- Step 4: do a dynamic cast of the result from `float` to `bool` and store the value
And if you do `a.lt_(b)`, this is what would happen:
- Step 1: Fetch `a` and `b`, no casting
- Step 2: compute `a < b`, get a `bool` result
- Step 3: statically cast the result into `float`
- Step 4: store the result to memory, no casting
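The two per-element loops above can be modeled in plain Python. This is only an illustrative sketch of the steps, not the actual TensorIterator code; `maybe_cast`, `lt_out_of_place`, and `lt_inplace` are hypothetical names:

```python
def maybe_cast(value, src, dst):
    # Dynamic cast: a runtime dtype check; a no-op when the dtypes match.
    return value if src == dst else dst(value)

def lt_out_of_place(a, b):
    """Models `a < b` on current master (both inputs are Python floats)."""
    x = maybe_cast(a, float, float)  # Step 1: dynamic cast, float -> float is a no-op
    y = maybe_cast(b, float, float)
    r = x < y                        # Step 2: compare, producing a bool
    r = float(r)                     # Step 3: static cast back to the compute dtype
    return maybe_cast(r, float, bool)  # Step 4: dynamic cast float -> bool on store

def lt_inplace(a, b):
    """Models `a.lt_(b)` on current master: no dynamic casts at all."""
    r = a < b        # Step 2: compare, producing a bool
    r = float(r)     # Step 3: static cast back to float
    return r         # Step 4: store into the float tensor, no casting
```

The point of the sketch is that the out-of-place path pays a runtime dtype check per element even though both casts are no-ops or trivially known at compile time.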
Although the dynamic casting happens in registers, it still hurts performance a bit (~8%).
This PR fixes this issue. Now, for compare kernels, if the output is bool and the inputs share a dtype, there is no dynamic casting; otherwise, each input and the output are dynamically cast. That is, the dynamic-casting behavior of the two cases described above is swapped.
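The new dispatch decision can be summarized in a few lines. This is a sketch of the rule as described above, not the actual C++ predicate in the kernel; `needs_dynamic_cast` is a hypothetical helper name:

```python
def needs_dynamic_cast(input_dtypes, output_dtype):
    # Fast path from this PR: a comparison writing bool from same-dtype
    # inputs can skip per-element dynamic casting entirely.
    same_inputs = len(set(input_dtypes)) == 1
    if output_dtype is bool and same_inputs:
        return False
    # Otherwise (e.g. in-place compare storing into a float tensor, or
    # mixed-dtype inputs), every input and the output get a dynamic cast.
    return True
```

So `a < b` on two float tensors takes the fast path, while `a.lt_(b)` (bool result stored into a float tensor) now takes the dynamically-cast path.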
Benchmark of `a < b` on a tensor of 1000000000 fp32 elements:
- Before https://github.com/pytorch/pytorch/issues/28427: 6.35 ms
- Current master: 6.88 ms
- With this PR: 6.36 ms
Benchmark on `a.lt_(b)` does not show any difference across versions.
Besides this, what worries me most is that, with type promotion, the logic for tensor iterator is becoming very complicated, and it is hard to tell whether one change causes a performance regression elsewhere. I suggest we create a script that benchmarks tensor iterator as a whole, review that code, and put it somewhere inside the repository (maybe under `/tools` or `/test/scripts`?); whenever we are unsure about performance, we could run it to check. (Not on this PR, but on PRs after the script lands: if there are performance concerns, the PR author should run the script manually, and the reviewer should remind the author to do so if necessary.) If this sounds like a good idea, I will send a PR for the script.
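For the record, a minimal sketch of what the core of such a benchmark harness could look like (the `bench` helper and its defaults are assumptions, and a real TensorIterator benchmark would sweep ops, dtypes, and sizes, and synchronize CUDA before reading the clock):

```python
import time

def bench(fn, iters=10):
    # Run `fn` once to warm up (CUDA context init, caches, JIT), then
    # return the best wall-clock time in seconds over `iters` runs.
    fn()
    best = float("inf")
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best
```

Taking the minimum over several runs, rather than the mean, is a common choice for microbenchmarks because it filters out one-off scheduling noise.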
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29743
Differential Revision: D18671269
Pulled By: ngimel
fbshipit-source-id: 89a9c1c8b5fd45d5ae8fe907d65c2fe1a7dfd2dc