[PyTorch Edge] Use Parallelization in Internal Quantized Matmul (#73247)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73247
Split the multiplication over the outer (batch) dimensions: each 2D slice is independent, so the slices can be computed on separate threads (see the sketch below).
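For context, here is a minimal sketch of the approach, assuming contiguous row-major batched inputs; `ruy_single_qmatmul` is a hypothetical stand-in for the per-slice Ruy call in the actual kernel, not the real function name:
```cpp
#include <ATen/Parallel.h>
#include <cstdint>

// Hypothetical wrapper around one 2D Ruy quantized matmul; the real kernel
// builds ruy::Matrix views and MulParams for each slice.
void ruy_single_qmatmul(const std::uint8_t* a, const std::uint8_t* b,
                        std::uint8_t* c, int64_t m, int64_t k, int64_t n);

// Split a batched quantized matmul over the outer (batch) dimension.
// at::parallel_for distributes the batch index range across PyTorch's
// intra-op thread pool; each task handles its slices independently.
void batched_qmatmul(const std::uint8_t* a, const std::uint8_t* b,
                     std::uint8_t* c, int64_t num_batches,
                     int64_t m, int64_t k, int64_t n) {
  at::parallel_for(0, num_batches, /*grain_size=*/1,
                   [&](int64_t begin, int64_t end) {
    for (int64_t batch = begin; batch < end; ++batch) {
      ruy_single_qmatmul(a + batch * m * k,
                         b + batch * k * n,
                         c + batch * m * n,
                         m, k, n);
    }
  });
}
```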
ghstack-source-id: 151250864
Test Plan:
From fbcode:
```buck test caffe2/test:quantization -- test_qmatmul```
Performance Improvement Summary:
For the matmuls used by the Transformer Model:
- This diff makes qmatmul ~53% faster than the preceding diff (Ruy without parallelization)
- This entire diff stack makes qmatmul ~75% faster than the naive implementation
(see below for details)
**Detailed Benchmarking Results:**
*Benchmarking was done on a model that performs matmuls of the same shapes and counts as the Transformer Model, as determined in D30901505*
*Notebook in which Benchmarking was performed: https://www.internalfb.com/intern/anp/view/?id=1582075&revision_id=537916317667891*
- Ruy QMatMul, Parallelization within PyTorch (this diff, v5): [7.5257ms](https://www.internalfb.com/intern/aibench/details/621856970876663)
- Ruy QMatMul, No Parallelization (D33735479, v18): [16.0261ms](https://www.internalfb.com/intern/aibench/details/867786467365069)
- Naive QMatMul (on master branch (base of D33332098), v22): [30.9919ms](https://www.internalfb.com/intern/aibench/details/418359955621359)
Experiments using Ruy's threadpool (these degraded performance and were abandoned; see the note after this list):
- Ruy QMatMul, with Ruy Threadpool 4 threads (D34110676, v1): [59.8889ms](https://www.internalfb.com/intern/aibench/details/487293857402229)
- Ruy QMatMul, Parallelization within PyTorch and with Ruy Threadpool 4 threads (D34111050, v1): [624.8932 ms (?!)](https://www.internalfb.com/intern/aibench/details/330231112631355)
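For reference, "Ruy Threadpool 4 threads" means letting Ruy parallelize a single `Mul` internally via its `ruy::Context`, roughly as sketched below (illustrative types and values only). The severe slowdown when this is combined with `at::parallel_for` is consistent with two nested thread pools oversubscribing the same cores, though the diff does not pin down the cause:
```cpp
#include <cstdint>
#include "ruy/ruy.h"

// Sketch: enable Ruy's internal threadpool for one quantized matmul.
// With max_num_threads > 1, Ruy splits a single Mul across its own
// worker threads; nesting this inside at::parallel_for stacks two
// thread pools on the same cores.
void ruy_mul_with_threads(
    const ruy::Matrix<std::uint8_t>& lhs,
    const ruy::Matrix<std::uint8_t>& rhs,
    const ruy::MulParams<std::int32_t, std::uint8_t>& params,
    ruy::Matrix<std::uint8_t>* dst) {
  ruy::Context context;
  context.set_max_num_threads(4);  // 4 threads, as in the experiment above
  ruy::Mul(lhs, rhs, params, &context, dst);
}
```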
Reviewed By: kimishpatel
Differential Revision: D34012771
fbshipit-source-id: 79d137f295b05812968ab53fdf9798606f3f4e63
(cherry picked from commit 2634593b9f55c4cba18a94a1b0571de28d206637)