pytorch
fb47cff7 - [PyTorch Edge] Use Parallelization in Internal Quantized Matmul (#73247)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73247

Split up the multiplication over the outer dimensions.

ghstack-source-id: 151250864

Test Plan: From fbcode:

```
buck test caffe2/test:quantization -- test_qmatmul
```

Performance Improvement Summary: For the matmuls used by the Transformer Model:

- This diff makes qmatmul ~53% faster than the preceding diff (Ruy without parallelization)
- The entire diff stack makes qmatmul ~75% faster than the naive implementation (see below for details)

**Detailed Benchmarking Results:**

*Benchmarking was done on a model that performs matmuls of the same shapes and counts as the Transformer Model, as determined in D30901505.*

*Notebook in which the benchmarking was performed: https://www.internalfb.com/intern/anp/view/?id=1582075&revision_id=537916317667891*

- Ruy QMatMul, parallelization within PyTorch (this diff, v5): [7.5257 ms](https://www.internalfb.com/intern/aibench/details/621856970876663)
- Ruy QMatMul, no parallelization (D33735479, v18): [16.0261 ms](https://www.internalfb.com/intern/aibench/details/867786467365069)
- Naive QMatMul (on the master branch, base of D33332098, v22): [30.9919 ms](https://www.internalfb.com/intern/aibench/details/418359955621359)

Experiments using the Ruy threadpool (which performed poorly and was abandoned):

- Ruy QMatMul with the Ruy threadpool, 4 threads (D34110676, v1): [59.8889 ms](https://www.internalfb.com/intern/aibench/details/487293857402229)
- Ruy QMatMul with parallelization within PyTorch and the Ruy threadpool, 4 threads (D34111050, v1): [624.8932 ms (?!)](https://www.internalfb.com/intern/aibench/details/330231112631355)

Reviewed By: kimishpatel

Differential Revision: D34012771

fbshipit-source-id: 79d137f295b05812968ab53fdf9798606f3f4e63

(cherry picked from commit 2634593b9f55c4cba18a94a1b0571de28d206637)
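The core idea of the change, splitting the multiplication over the outer (batch) dimensions so each 2-D slice can be computed on a separate thread, can be sketched in plain Python. This is a hypothetical illustration only, not the commit's C++ code; `matmul_2d` and `batched_matmul_parallel` are made-up names, and a thread pool stands in for PyTorch's internal parallelization.

```python
from concurrent.futures import ThreadPoolExecutor

def matmul_2d(a, b):
    """Naive 2-D matmul on nested lists: (M, K) x (K, N) -> (M, N)."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner))
             for j in range(cols)] for row in a]

def batched_matmul_parallel(a, b, num_threads=4):
    """Split a batched matmul over the outer dimension.

    Each (M, K) x (K, N) slice is independent, so the slices can be
    dispatched to worker threads and computed in parallel.
    """
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(matmul_2d, a, b))
```

Because the per-slice multiplications share no state, the split is embarrassingly parallel; the benchmark numbers above suggest this is where most of the ~53% speedup over the unparallelized Ruy version comes from.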