Perf optimization for conv and gemm kernels. (#37626)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37626
Rescheduled instructions to hide load latency, particularly at the start
of the kernel, where we have latency-bound chains.
It seems to improve perf for aarch32.
Also did some instruction rescheduling for the aarch64 gemm kernel. Not clear if
this actually helps with perf, especially on OOO CPUs, but worth a try.
Test Plan:
qnnpack tests
q8gemm-test
Imported from OSS
Differential Revision: D21339037
fbshipit-source-id: 0469581a0e3bd3fd04f15200c2171fc8c264722b