Prototype benchmarking util (#38338)
Summary:
This is the prototype for the modular utils that we've been discussing. It is admittedly a large PR, but a good fraction of it is documentation and examples. I've trimmed it a bit around the edges since we last discussed the design (for instance, Timer is no longer Fuzzer-aware), but it's mostly the same.
In addition to the library and hermetic examples, I've included `examples.end_to_end`, which tests https://github.com/pytorch/pytorch/pull/38061 over a variety of shapes, dtypes, degrees of broadcasting, and layouts. (CC crcrpar) I only ran CPU benchmarks, as I'm not yet set up on a GPU machine. [Results from my devserver](https://gist.github.com/robieta/d1a8e1980556dc3f4f021c9f7c3738e2)
Key takeaways:
1) For contiguous Tensors with larger dtypes (fp32 and fp64) and heavy reuse of the mask due to broadcasting, improvements are significant. (Presumably due to better vectorization?)
2) There is ~1.5 us of extra overhead, which dominates for small kernels.
3) Cases with lower write intensity (int8, lower mask fraction, etc.) or non-contiguous layouts seem to suffer.
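For context, the measurement pattern that tooling like this wraps (time a statement repeatedly, aggregate over several repeats, and sweep over configurations) can be sketched with the standard library alone. The `measure` helper below is purely illustrative and is not the API added in this PR:

```python
import statistics
import timeit

def measure(stmt, setup="pass", number=100, repeat=5):
    """Time `stmt` and return the median per-invocation time in seconds.

    Taking the median over several repeats is more robust to scheduling
    noise than a single run, which matters when (as in takeaway 2) the
    effect being measured is on the order of a microsecond.
    """
    timer = timeit.Timer(stmt=stmt, setup=setup)
    # Each element of `runs` is the total time for `number` invocations.
    runs = timer.repeat(repeat=repeat, number=number)
    return statistics.median(runs) / number

# Example sweep over problem sizes, loosely mirroring how the
# end-to-end example sweeps shapes/dtypes/layouts for a kernel.
for n in (100, 10_000):
    t = measure(f"[x * 2 for x in range({n})]")
    print(f"n={n:>6}: {t * 1e6:.2f} us per loop")
```

In the real utility, the sweep axes are Tensor shapes, dtypes, broadcasting patterns, and memory layouts rather than list sizes, but the timing/aggregation loop is the same shape.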
Hopefully this serves as a proof of concept for how this tooling can be used to tune kernels and assess PRs. Looking forward to thoughts and feedback.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38338
Differential Revision: D21551048
Pulled By: robieta
fbshipit-source-id: 6c50e5439a04eac98b8a2355ef731852ba0500db