[CUDA] fp16 intB gemm (#24854)
### Description
* Add the fpA-intB gemm kernel from TensorRT-LLM's
WeightOnlyGroupwiseQuantGemmPlugin.
* Add prepacking that converts weights/scales/zero_points so MatMulNBits
can use the kernel (see the sketch below).
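
For context, the math the fused kernel computes is standard weight-only groupwise dequantization followed by an fp16 GEMM. The NumPy sketch below is illustrative only: it uses unpacked 4-bit values and hypothetical names, not the actual prepacked layout produced by this PR.

```python
import numpy as np

def dequant_groupwise(q_weight, scales, zero_points, group_size):
    # q_weight:    (K, N) int4 values stored unpacked as uint8 (0..15)
    # scales:      (K // group_size, N) fp16 per-group scales
    # zero_points: (K // group_size, N) per-group zero points (0..15)
    K, _ = q_weight.shape
    g = np.arange(K) // group_size  # group index for each row along K
    w = q_weight.astype(np.float16) - zero_points[g].astype(np.float16)
    return w * scales[g]

def fpA_intB_gemm_reference(a_fp16, q_weight, scales, zero_points, group_size):
    # fp16 activations times dequantized int4 weights; the fused CUDA
    # kernel computes this without materializing the fp16 weight matrix.
    return a_fp16 @ dequant_groupwise(q_weight, scales, zero_points, group_size)
```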
Limitations:
* Only the fp16 kernel is enabled. BF16 support will be added later.
* Zero points are required. Scales-only support may be added later.
* Bias is not enabled since the previous MatMulNBits kernel does not
support bias.
### Motivation and Context
To improve LLM inference performance.
Initial results show 2.2x throughput on prompt processing and 1.25x
throughput on token generation with the onnxruntime-genai
benchmark_e2e.py benchmark running phi-4-mini-instruct on an A100.