Avoid calling tensor.numel() in for loops (#27298)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27298
PR #26908 toggles NonVariableTypeMode in the ATen dispatcher, which is where
USE_STATIC_DISPATCH takes place.
This causes an issue with numel(): it now goes through the dispatcher and is likely no longer inlined.
The thread-local state is also expensive to read/write this many times, which hurts perf.
PR #27274 is another approach to fix this and has more details.
Test Plan:
Quantized MobileNetV2 perf before this change:
Main run finished. Milliseconds per iter: 28.6782. Iters per second: 34.8696
Perf after this change:
Main run finished. Milliseconds per iter: 22.2585. Iters per second: 44.9267
Imported from OSS
Differential Revision: D17742565
fbshipit-source-id: 43c6045cc001c46916ba339555c9d809a2537eff