[QNNPACK, Sparsity] Sparse kernel with 4x8 blocking (#50590)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50590
The larger blocking across the M dimension (8, as in the previous PR) likely
introduces wasted compute on the shapes being benchmarked.
Here we introduce 4x8 (mr x nr) blocking. This helps 1) by packing
less data for small values of M, and 2) by letting the compute kernel write
the same number of bytes but more contiguously. The benefit is not certain,
but it likely helps.
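To illustrate why smaller row blocking helps for small M: an mr x nr micro-kernel processes full mr-row tiles, so M is effectively rounded up to a multiple of mr and the padded rows are wasted work. A minimal sketch (not QNNPACK code; the function names are hypothetical):

```python
# Sketch of padding waste from rounding M up to the row-blocking factor mr.
# Assumes a micro-kernel that always processes complete mr-row tiles.
import math

def padded_rows(M, mr):
    # Rows actually computed after padding M up to a multiple of mr.
    return math.ceil(M / mr) * mr

def wasted_fraction(M, mr):
    # Fraction of computed rows that are pure padding.
    p = padded_rows(M, mr)
    return (p - M) / p

# For a small M such as 4, 8-row blocking wastes half of the computed
# rows, while 4-row blocking wastes none.
print(wasted_fraction(4, 8))  # 0.5
print(wasted_fraction(4, 4))  # 0.0
```

This is only the row-padding side of the trade-off; the kernel's contiguous-write behavior along nr is a separate effect.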
Test Plan:
q8gemm-sparse-test
fully-connected-sparse-test
Imported from OSS
Reviewed By: AshkanAliabadi
Differential Revision: D25925499
fbshipit-source-id: 01c661ceea38bd6ee8321bb85cf1d5da5de4e984