Prototype benchmarking util (#38338)
Summary:
This is the prototype for the modular utils that we've been discussing. It is admittedly a large PR, but a good fraction of it is documentation and examples. I've trimmed it a bit around the edges since we last discussed the design (for instance, Timer is no longer Fuzzer-aware), but it's mostly the same.
In addition to the library and hermetic examples, I've included `examples.end_to_end`, which tests https://github.com/pytorch/pytorch/pull/38061 over a variety of shapes, dtypes, degrees of broadcasting, and layouts. (CC crcrpar) I only ran CPU benchmarks, as I'm not yet set up on a GPU machine. [Results from my devserver](https://gist.github.com/robieta/d1a8e1980556dc3f4f021c9f7c3738e2)
Key takeaways:
1) For contiguous Tensors with larger dtypes (fp32 and fp64) and heavy reuse of the mask due to broadcasting, improvements are significant. (Presumably due to better vectorization?)
2) There is ~1.5 us of extra overhead, which dominates for small kernels.
3) Cases with lower write intensity (int8, lower mask fraction, etc.) or non-contiguous layouts seem to suffer.
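For context, the measurement pattern that tooling like this wraps (time a statement repeatedly, aggregate over several repeats, and sweep over configurations) can be sketched with the standard library alone. The `measure` helper below is purely illustrative and is not the API added in this PR:

```python
import statistics
import timeit

def measure(stmt, setup="pass", number=100, repeat=5):
    """Time `stmt` and return the median per-invocation time in seconds.

    Taking the median over several repeats is more robust to scheduling
    noise than a single run, which matters when (as in takeaway 2) the
    effect being measured is on the order of a microsecond.
    """
    timer = timeit.Timer(stmt=stmt, setup=setup)
    # Each element of `runs` is the total time for `number` invocations.
    runs = timer.repeat(repeat=repeat, number=number)
    return statistics.median(runs) / number

# Example sweep over problem sizes, loosely mirroring how the
# end-to-end example sweeps shapes/dtypes/layouts for a kernel.
for n in (100, 10_000):
    t = measure(f"[x * 2 for x in range({n})]")
    print(f"n={n:>6}: {t * 1e6:.2f} us per loop")
```

In the real utility, the sweep axes are Tensor shapes, dtypes, broadcasting patterns, and memory layouts rather than list sizes, but the timing/aggregation loop is the same shape.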
Hopefully this serves as a proof of concept for how this tooling can be used to tune kernels and assess PRs. Looking forward to thoughts and feedback.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38338
Differential Revision: D21551048
Pulled By: robieta
fbshipit-source-id: 6c50e5439a04eac98b8a2355ef731852ba0500db