Changes to qadd for perf improvement.
Summary:
qadd calls .contiguous() on its input tensors. By default this produces NCHW
format (for 4D tensors). We should instead call
.contiguous(input.suggest_memory_format())
Output allocation is also done in NCHW format, which forces the subsequent
conv to do a memcpy to convert the data to NHWC.
Together these two issues mean the majority of qadd's time in the FBNET_A
model is spent copying.
Fixing these reduces runtime on an S8 phone from 17 ms to 15 ms, narrowing the
latency gap between C2 and PT from ~24% to ~9.5%.
Also note that the contract for ops is that they return their output tensor in
the same memory format as the input.
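To make the cost concrete, here is a pure-Python sketch (no PyTorch required) of how NCHW and NHWC strides differ for the same logical shape, with a simplified stand-in for ATen's Tensor::suggest_memory_format(). The helper names are hypothetical; the real logic lives in ATen.

```python
def contiguous_strides(dims):
    # Row-major strides for the given physical dimension order.
    strides = [1] * len(dims)
    for i in range(len(dims) - 2, -1, -1):
        strides[i] = strides[i + 1] * dims[i + 1]
    return strides

def nchw_strides(n, c, h, w):
    # Physical layout N,C,H,W (the default "contiguous" layout).
    return contiguous_strides([n, c, h, w])

def nhwc_strides(n, c, h, w):
    # Physical layout N,H,W,C; strides reported in logical NCHW order.
    sn, sh, sw, sc = contiguous_strides([n, h, w, c])
    return [sn, sc, sh, sw]

def suggest_memory_format(shape, strides):
    # Simplified stand-in for Tensor::suggest_memory_format():
    # report channels-last when the strides match the NHWC layout.
    n, c, h, w = shape
    if strides == nhwc_strides(n, c, h, w):
        return "channels_last"
    return "contiguous"  # NCHW

shape = (1, 8, 4, 4)
print(nchw_strides(*shape))   # [128, 16, 4, 1]
print(nhwc_strides(*shape))   # [128, 1, 32, 8]
print(suggest_memory_format(shape, nhwc_strides(*shape)))  # channels_last
```

A bare .contiguous() on an NHWC tensor rewrites it into the NCHW strides above (a full copy), and an NCHW-allocated output then forces the following NHWC conv to copy again; passing suggest_memory_format() avoids both copies.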
Test Plan:
Apply on top of diff D20721889.
bento console --file mobile-vision/projects/model_zoo/scripts/run_create_model_benchmark.py
Note: there are many calls to .contiguous() without a memory-format argument in
aten/src/ATen/native/quantized/cpu.
All of them should be replaced with .contiguous(input.suggest_memory_format())
wherever applicable (most likely all elementwise ops).
The same should apply to output allocation.
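One way to find the remaining call sites is a small audit script. This is a sketch run on an in-memory sample (the sample lines and helper name are made up for illustration); in practice you would walk the files under aten/src/ATen/native/quantized/cpu in a PyTorch checkout.

```python
import re

# Hypothetical sample standing in for real source files.
SAMPLE = """\
Tensor a = qa.contiguous();                            // needs a format
Tensor b = qb.contiguous(qb.suggest_memory_format());  // already fixed
"""

def find_bare_contiguous(text):
    # Flag only .contiguous() calls with an empty argument list;
    # calls that already pass a memory format are left alone.
    return [i + 1 for i, line in enumerate(text.splitlines())
            if re.search(r'\.contiguous\(\)', line)]

print(find_bare_contiguous(SAMPLE))  # [1]
```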
Reviewed By: dreiss
Differential Revision: D20794692
fbshipit-source-id: 6b81012497721d48e7d6a5efcc402f315b1dfe77