Changes to qadd for perf improvement.
Summary:
qadd calls .contiguous() on its input tensors. By default this produces NCHW
format (for 4D tensors). We should instead call
.contiguous(input.suggest_memory_format())
Output allocation is also done in NCHW format, which forces the subsequent
conv to do a memcpy to convert the data to NHWC.
Together these two issues mean the majority of qadd's time in the FBNET_A
model is spent copying.
Fixing these reduces runtime on an S8 phone from 17 ms to 15 ms, narrowing the
latency gap between C2 and PT from ~24% to ~9.5%.
Also note that the contract for ops is that they return their output tensor in
the same memory format as the input.
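To make the cost concrete, here is a pure-Python sketch (no PyTorch required) of how NCHW and NHWC strides differ for the same logical shape, with a simplified stand-in for ATen's Tensor::suggest_memory_format(). The helper names are hypothetical; the real logic lives in ATen.

```python
def contiguous_strides(dims):
    # Row-major strides for the given physical dimension order.
    strides = [1] * len(dims)
    for i in range(len(dims) - 2, -1, -1):
        strides[i] = strides[i + 1] * dims[i + 1]
    return strides

def nchw_strides(n, c, h, w):
    # Physical layout N,C,H,W (the default "contiguous" layout).
    return contiguous_strides([n, c, h, w])

def nhwc_strides(n, c, h, w):
    # Physical layout N,H,W,C; strides reported in logical NCHW order.
    sn, sh, sw, sc = contiguous_strides([n, h, w, c])
    return [sn, sc, sh, sw]

def suggest_memory_format(shape, strides):
    # Simplified stand-in for Tensor::suggest_memory_format():
    # report channels-last when the strides match the NHWC layout.
    n, c, h, w = shape
    if strides == nhwc_strides(n, c, h, w):
        return "channels_last"
    return "contiguous"  # NCHW

shape = (1, 8, 4, 4)
print(nchw_strides(*shape))   # [128, 16, 4, 1]
print(nhwc_strides(*shape))   # [128, 1, 32, 8]
print(suggest_memory_format(shape, nhwc_strides(*shape)))  # channels_last
```

A bare .contiguous() on an NHWC tensor rewrites it into the NCHW strides above (a full copy), and an NCHW-allocated output then forces the following NHWC conv to copy again; passing suggest_memory_format() avoids both copies.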
Test Plan:
Apply on top of diff D20721889.
bento console --file mobile-vision/projects/model_zoo/scripts/run_create_model_benchmark.py
Note: there are many calls to .contiguous() without a memory-format argument in
aten/src/ATen/native/quantized/cpu.
All of them should be replaced with .contiguous(input.suggest_memory_format())
wherever applicable (most likely all elementwise ops).
The same should apply to output allocation.
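One way to find the remaining call sites is a small audit script. This is a sketch run on an in-memory sample (the sample lines and helper name are made up for illustration); in practice you would walk the files under aten/src/ATen/native/quantized/cpu in a PyTorch checkout.

```python
import re

# Hypothetical sample standing in for real source files.
SAMPLE = """\
Tensor a = qa.contiguous();                            // needs a format
Tensor b = qb.contiguous(qb.suggest_memory_format());  // already fixed
"""

def find_bare_contiguous(text):
    # Flag only .contiguous() calls with an empty argument list;
    # calls that already pass a memory format are left alone.
    return [i + 1 for i, line in enumerate(text.splitlines())
            if re.search(r'\.contiguous\(\)', line)]

print(find_bare_contiguous(SAMPLE))  # [1]
```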
Reviewed By: dreiss
Differential Revision: D20794692
fbshipit-source-id: 6b81012497721d48e7d6a5efcc402f315b1dfe77