[QNNPACK, Sparsity] Added prepacking-based aarch32 kernels (#50589)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50589
Adds 1. an input prepacking kernel and 2. compute kernels that process
the prepacked activations.
The hunch is that input prepacking will help with 1. cache locality and
2. avoiding many address-compute instructions.
The cache-locality benefit mainly comes from the fact that we use mr=8
and nr=4.
With mr=8, the kernel touches 8 strided activation rows at once, which
likely results in cache-line evictions since the cache associativity is
likely 4. Laying out the transposed activations blocked by mr=8 places
all of the transposed activations in one contiguous block (a sketch
follows below).
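A minimal sketch of the idea in scalar C (hypothetical and illustrative,
not the actual QNNPACK aarch32 kernel; the function name and signature
are assumptions): each block of mr=8 activation rows is transposed into
one contiguous k-major panel, so the 8 values consumed per k step sit
next to each other in memory.

    #include <stddef.h>
    #include <stdint.h>

    /*
     * Hypothetical prepack sketch: transpose each block of mr=8 rows of
     * the row-major activation matrix a (m x k, row stride a_stride)
     * into a contiguous k-major panel. Tail rows are zero-padded so
     * every panel stays mr-aligned.
     */
    static void prepack_activations_mr8(
        size_t m, size_t k,
        const uint8_t* a, size_t a_stride,
        uint8_t* a_packed) {
      const size_t mr = 8;
      for (size_t mb = 0; mb < m; mb += mr) {
        for (size_t ki = 0; ki < k; ki++) {
          for (size_t mi = 0; mi < mr; mi++) {
            const size_t row = mb + mi;
            *a_packed++ = (row < m) ? a[row * a_stride + ki] : 0;
          }
        }
      }
    }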
The downside is that we now transpose all the blocks regardless of
whether they participate in the compute. However, it is likely that the
entire activation matrix participates in the compute for some output
block.
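To illustrate the address-compute claim, a hypothetical scalar reference
of an 8x4 micro-kernel consuming such a panel (the real compute kernels
are aarch32 assembly, and the real weights are sparse rather than the
dense w assumed here): the activation pointer only increments, so no
per-row address arithmetic is needed inside the k loop.

    #include <stddef.h>
    #include <stdint.h>

    /*
     * Hypothetical scalar reference, not the actual kernel: accumulate
     * one mr=8 x nr=4 output tile from a packed activation panel. w is
     * assumed dense with nr=4 values per k step, for illustration only.
     */
    static void gemm_ukernel_8x4_packed(
        size_t k,
        const uint8_t* a_packed,  /* panel from prepack_activations_mr8 */
        const uint8_t* w,
        int32_t acc[8][4]) {
      for (size_t ki = 0; ki < k; ki++) {
        for (size_t mi = 0; mi < 8; mi++) {
          const int32_t a_val = (int32_t) *a_packed++;  /* contiguous read */
          for (size_t ni = 0; ni < 4; ni++) {
            acc[mi][ni] += a_val * (int32_t) w[ki * 4 + ni];
          }
        }
      }
    }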
Also adds a benchmark.
Test Plan:
q8gemm-sparse-test
fully-connected-test-sparse
Imported from OSS
Reviewed By: AshkanAliabadi
Differential Revision: D25925502
fbshipit-source-id: b2c36419a2c5d23b4a49f25f9ee41cee8397c3be