A heuristic to avoid perf incompatible MKLDNN formats for binary ops (#56089)
Summary:
After adding new ops to the set of fusible ops, mobilenetv3 slows down from 1200ms to **9000ms** without this fix.
This happens when one of the inputs is expanded (broadcast)
and converted to nchw/nhwc while the second argument
is in a blocked format. In this case, MKLDNN falls back to its
reference implementation for the binary operation that follows
the broadcast, which can be up to about 100x slower.
The fix is a very simple heuristic: if one argument of a binary op
is in nchw and the other is in a blocked format, convert the nchw
argument to the blocked format of the other argument.
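
A minimal Python sketch of the heuristic described above, for illustration only. It is not the code in this PR: `Arg`, `is_blocked`, and `plan_reorders` are hypothetical names, and formats are modeled as plain strings (a tag such as `nChw8c` stands in for any blocked layout).

```python
from dataclasses import dataclass

# Assumption: anything that is not a plain layout counts as "blocked".
PLAIN_FORMATS = {"nchw", "nhwc"}


@dataclass(frozen=True)
class Arg:
    """Hypothetical stand-in for one input of a binary op."""
    name: str
    fmt: str  # e.g. "nchw", "nhwc", or a blocked tag such as "nChw8c"


def is_blocked(fmt: str) -> bool:
    return fmt not in PLAIN_FORMATS


def plan_reorders(lhs: Arg, rhs: Arg) -> list:
    """Return (arg_name, target_format) pairs to reorder before the op.

    If one argument is in nchw and the other is blocked, the nchw one
    is converted to the other's blocked format. Otherwise do nothing.
    """
    if lhs.fmt == "nchw" and is_blocked(rhs.fmt):
        return [(lhs.name, rhs.fmt)]
    if rhs.fmt == "nchw" and is_blocked(lhs.fmt):
        return [(rhs.name, lhs.fmt)]
    return []
```

For example, `plan_reorders(Arg("a", "nchw"), Arg("b", "nChw8c"))` asks for `a` to be reordered to `nChw8c`, while two arguments that already share a format need no reorder.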
* MKLDNN_VERBOSE without the issue:
[test_mobilenet_nopool.txt](https://github.com/pytorch/pytorch/files/6319528/test_mobilenet_nopool.txt)
* MKLDNN_VERBOSE with the issue (note the times for `ref` operations):
[test_mobilenet_pool.txt](https://github.com/pytorch/pytorch/files/6319529/test_mobilenet_pool.txt)
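
To spot the slow `ref` operations in a verbose log like the ones linked above, a small script along these lines may help. It is a sketch under assumptions: it expects comma-separated lines whose first field ends in `_verbose`, whose second field is `exec`, whose last field is the execution time in milliseconds, and whose implementation field is `ref` or starts with `ref:`. The exact field layout varies between MKLDNN versions, and `slow_ref_ops` is a hypothetical helper name.

```python
def slow_ref_ops(log_text: str, min_ms: float = 0.0) -> list:
    """Return (time_ms, line) for exec lines that used a `ref` kernel,
    slowest first. The line layout is an assumption (see above)."""
    hits = []
    for raw in log_text.splitlines():
        line = raw.strip()
        fields = line.split(",")
        # Skip anything that is not a "<prefix>_verbose,exec,..." line.
        if len(fields) < 3 or not fields[0].endswith("_verbose"):
            continue
        if fields[1] != "exec":
            continue
        try:
            elapsed_ms = float(fields[-1])  # assumed: time is the last field
        except ValueError:
            continue
        uses_ref = any(f == "ref" or f.startswith("ref:") for f in fields[2:-1])
        if uses_ref and elapsed_ms >= min_ms:
            hits.append((elapsed_ms, line))
    return sorted(hits, reverse=True)
```

Calling `slow_ref_ops(open("test_mobilenet_pool.txt").read(), min_ms=1.0)` on a downloaded copy of the log would list the `ref` operations that took at least 1 ms, provided the log matches the assumed layout.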
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56089
Reviewed By: eellison
Differential Revision: D27796688
Pulled By: Krovatkin
fbshipit-source-id: fc34d76358ce899e3b1f2b69efb9b5c38f5af1ad