onnxruntime
42869cac - FP16 inference performance improvement on CPU (#25680)

FP16 inference performance improvement on CPU (#25680)

Author: Masaru Ito [masaru.ito.mi@jp.fujitsu.com](mailto:masaru.ito.mi@jp.fujitsu.com)

### Description

1. Added FP16 support for the Add, Sub, Mul, and Div operators, enabling Gelu fusion and the fusion of Erf with its surrounding ops.
2. Enabled the FP16-to-FP32 Cast (performance improved from 15 s to 0.27 s).
3. Added eigen::fp16 support in layer normalization (performance improved from 22 s to 0.6 s).
4. Enabled the FP16 Transpose call in MLAS (performance improved from 3 s to 0.47 s).

### Steps taken to measure the performance numbers

1. Build and install the onnxruntime wheel.
2. Download the INT/Float E5 model from Hugging Face.
3. Convert the model to FP16 ONNX (see the conversion sketch below).
4. Create a Python script to run inference.
5. Use a session option to collect operator-level timings (see the profiling sketch below).

Measurements were taken on an AWS Graviton3E machine (c7gn) with 64 cores.

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
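
For illustration, a minimal sketch of the "convert the model to FP16 ONNX" step, assuming the `onnx` and `onnxconverter-common` packages; the file names are placeholders and the package choice is an assumption, not part of this change.

```python
# Hypothetical FP16 conversion script (illustration only, not part of this commit).
# Assumes onnx and onnxconverter-common are installed; model paths are placeholders.
import onnx
from onnxconverter_common import float16

fp32_model = onnx.load("e5_fp32.onnx")
# keep_io_types=True keeps graph inputs/outputs in FP32 and inserts Cast nodes,
# so the calling code can keep feeding the same tensor dtypes as before.
fp16_model = float16.convert_float_to_float16(fp32_model, keep_io_types=True)
onnx.save(fp16_model, "e5_fp16.onnx")
```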
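
And a minimal sketch of the inference script with operator-level timings enabled via `SessionOptions.enable_profiling`; the model path, input names, and shapes are assumptions for an E5-style encoder, not taken from this change.

```python
# Hypothetical inference + profiling script (illustration only, not part of this commit).
import numpy as np
import onnxruntime as ort

opts = ort.SessionOptions()
opts.enable_profiling = True  # write a JSON trace with per-operator timings

sess = ort.InferenceSession(
    "e5_fp16.onnx", sess_options=opts, providers=["CPUExecutionProvider"]
)

# Dummy tokenized batch; a real run would use the model's tokenizer.
batch, seq_len = 1, 128
feeds = {
    "input_ids": np.ones((batch, seq_len), dtype=np.int64),
    "attention_mask": np.ones((batch, seq_len), dtype=np.int64),
}
outputs = sess.run(None, feeds)

profile_path = sess.end_profiling()  # path of the chrome-trace JSON file
print("Operator-level timings written to:", profile_path)
```

Loading the resulting JSON into chrome://tracing (or parsing it directly) shows per-node durations, which is how per-operator costs such as Cast, LayerNormalization, and Transpose can be compared before and after the change.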