onnxruntime
607d5e4d - [WebGPU] Implement Split-K on Conv|MatMul (#26461)

Commit

38 days ago

[WebGPU] Implement Split-K on Conv|MatMul (#26461)

### Description

This patch implements the `Split-K` optimization on `Conv|MatMul`. With `Split-K` we can redistribute the computation across multiple workgroups when `K` is large, which increases parallelism on platforms where `Split-K` is confirmed to be useful.

1. Support `Split-K` in `MakeMatMulPackedVec4Source()` to split a workgroup with large `K` into smaller ones. In this patch we only support `Split-K` with `batch_size == 1` and `vec4` on `Conv|MatMul`.
2. Support `Split-K` in `MatMulWriteFnSource()` (add the partial result to the output with atomic built-in functions).
3. Implement `SplitKConfig` to decide whether `Split-K` should be used, and to hold all the related thresholds.
4. Implement `MatMulFillBiasBeforeSplitKProgram` to initialize the output with `bias` or 0 when `Split-K` is used.

### Motivation and Context

In the current implementation, when `K` or `dim_inner` is large, each invocation always performs the computation sequentially in one very large loop, which may not make full use of all EUs on a GPU. With `Split-K` we can split such a large amount of computation (`K`) across multiple workgroups, each handling less computation (`kSplitK`, smaller than `K`), which can greatly improve parallelism.

With this patch we get about a 15% performance improvement on `efficientnet-lite-f16-demo` and a 9% improvement on `mobilenetv2-12-f16-demo` on Lunar Lake and Meteor Lake.
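
As a rough illustration of the Split-K idea described in this commit message, the following is a minimal CPU-side sketch in plain Python, not the WGSL shader code from the patch. The function names `fill_bias`, `split_k_matmul`, and `reference_matmul` and the `split_k` parameter are hypothetical and chosen only for this example. The sketch assumes a single batch: the output is first initialized with the bias (or 0), then each chunk of `K` is processed independently and its partial result is added into the output. On the GPU each chunk would run in its own workgroup and the addition would be atomic; here the chunks simply run one after another.

```python
def fill_bias(m, n, bias=None):
    """Initialize an m x n output with a per-column bias, or zeros.

    Loosely mirrors the role the commit message gives to the
    "fill bias before Split-K" step (hypothetical helper).
    """
    return [[(bias[j] if bias is not None else 0.0) for j in range(n)]
            for _ in range(m)]


def split_k_matmul(a, b, split_k, bias=None):
    """Compute a @ b (+ bias) by splitting the inner dimension K into chunks.

    a: m x k list of lists, b: k x n list of lists.
    split_k: chunk size along K (hypothetical parameter name).
    """
    m, k, n = len(a), len(b), len(b[0])
    out = fill_bias(m, n, bias)

    # Each iteration of this loop stands in for one group of workgroups
    # that handles only the slice [k_start, k_end) of the K dimension.
    for k_start in range(0, k, split_k):
        k_end = min(k_start + split_k, k)
        for i in range(m):
            for j in range(n):
                partial = 0.0
                for kk in range(k_start, k_end):
                    partial += a[i][kk] * b[kk][j]
                # On a GPU this accumulation would need an atomic add,
                # because several workgroups write to the same element.
                out[i][j] += partial
    return out


def reference_matmul(a, b, bias=None):
    """Plain matmul with one full-length loop over K, for comparison."""
    m, k, n = len(a), len(b), len(b[0])
    return [[(bias[j] if bias is not None else 0.0)
             + sum(a[i][kk] * b[kk][j] for kk in range(k))
             for j in range(n)]
            for i in range(m)]
```

Because the output must already hold the bias (or 0) before any partial result is added, the initialization has to complete before the Split-K chunks start accumulating, which is why the sketch performs it as a separate first step.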

References

#26461 - [WebGPU] Implement Split-K on Conv|MatMul

Author

Jiawei-Shao

Parents

81a04ca4
