[js/webgpu] Provide a naive vectorized matmul algorithm (#18758)
### Description
This PR adds a naive vectorized matmul algorithm. In most situations we
still use the workgroup-memory-optimized matmul, but when N and K are
very small, the workgroup-optimized kernel cannot fully utilize the
underlying hardware because of its 32x32 tile size. For very small N/K
we therefore switch to the naive vectorized matmul algorithm to improve
hardware execution-unit utilization.
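The dispatch idea can be sketched in plain TypeScript. This is an illustrative sketch, not the actual shader code: the threshold check and the `useNaiveMatmul`/`naiveMatmul` names are assumptions for demonstration, and the real kernel runs as vectorized WGSL on the GPU rather than as a scalar CPU loop.

```typescript
// Illustrative sketch only: mirrors the idea of falling back to a naive
// matmul when N or K is too small to fill a 32x32 workgroup tile.
const TILE_SIZE = 32; // tile size used by the workgroup-optimized path

// Hypothetical heuristic: pick the naive path for very small N/K.
function useNaiveMatmul(n: number, k: number): boolean {
  return n < TILE_SIZE || k < TILE_SIZE;
}

// Reference naive matmul over flat row-major arrays:
// C[m x n] = A[m x k] * B[k x n].
function naiveMatmul(
  a: Float32Array,
  b: Float32Array,
  m: number,
  k: number,
  n: number
): Float32Array {
  const c = new Float32Array(m * n);
  for (let i = 0; i < m; i++) {
    for (let j = 0; j < n; j++) {
      let sum = 0;
      for (let p = 0; p < k; p++) {
        sum += a[i * k + p] * b[p * n + j];
      }
      c[i * n + j] = sum;
    }
  }
  return c;
}
```

On the GPU, the vectorized version would load `vec4` elements per invocation instead of scalars, but the selection logic is the same: small N/K routes to the naive kernel so more execution units stay busy.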
With this PR, a matmul with input0 [1, 36864, 3], input1 [1, 3, 3], and
input2 [3] improves from 4.34 ms to under 1 ms on Intel Gen9 GPUs.