Add batch interface to floating point GEMM (#7323)
Currently, a high-dimensional MatMul is computed by calling GEMM multiple times sequentially, with a synchronization barrier between each call. This change adds a batch interface so these GEMMs are dispatched together and executed in parallel, removing the barriers between adjacent GEMM operations.
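To illustrate the idea (this is a minimal sketch, not the actual MLAS implementation; `GemmParams`, `GemmRows`, and `BatchGemm` are hypothetical names), the key point is that all GEMMs in the batch are flattened into one parallel dispatch with a single join at the end, instead of one parallel call and one barrier per GEMM:

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Parameters for one GEMM in the batch: C = A * B, row-major.
// Illustrative only; not the actual MLAS types.
struct GemmParams {
  const float* A;
  const float* B;
  float* C;
  size_t M, N, K;
};

// Naive single-threaded GEMM over a range of output rows.
static void GemmRows(const GemmParams& p, size_t row_begin, size_t row_end) {
  for (size_t i = row_begin; i < row_end; ++i) {
    for (size_t j = 0; j < p.N; ++j) {
      float sum = 0.0f;
      for (size_t k = 0; k < p.K; ++k) {
        sum += p.A[i * p.K + k] * p.B[k * p.N + j];
      }
      p.C[i * p.N + j] = sum;
    }
  }
}

// Batched dispatch: the whole batch runs on one set of worker threads,
// so there is a single join at the end rather than a barrier after
// every individual GEMM.
void BatchGemm(const std::vector<GemmParams>& batch, size_t num_threads) {
  std::vector<std::thread> workers;
  for (size_t t = 0; t < num_threads; ++t) {
    workers.emplace_back([&batch, t, num_threads] {
      // Round-robin the GEMMs across threads; a real implementation
      // would partition by cost (M*N*K) and also split rows within
      // a single large GEMM.
      for (size_t g = t; g < batch.size(); g += num_threads) {
        GemmRows(batch[g], 0, batch[g].M);
      }
    });
  }
  for (auto& w : workers) w.join();  // the only synchronization point
}
```

With the sequential scheme, threads that finish their share of one GEMM early must idle at the barrier before the next GEMM starts; batching lets them pick up work from other GEMMs instead, which is why the benefit grows with thread count.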
Performance was tested with the BERT and T5 models. BERT shows no noticeable difference, since the heavy lifting is done by the attention operator, which is not changed in this PR. T5 shows no regression at a low thread count (4 threads), and the improvement becomes more pronounced at higher thread counts (8-16): a 10% end-to-end speedup with 16 threads. Profiling shows the most expensive MatMul operators in T5 achieve around a 20% speedup with 16 threads.
Co-authored-by: Chen Fu <fuchen@microsoft.com>