FP16 inference performance improvement on CPU (#25680)
Author: Masaru Ito
[masaru.ito.mi@jp.fujitsu.com](mailto:masaru.ito.mi@jp.fujitsu.com)
### Description
1. Added FP16 support for the Add, Sub, Mul, and Div operators, which enables fusion of Gelu and of Erf and its surrounding ops.
2. Enabled the FP16 to FP32 Cast (performance improved from 15 s to 0.27 s).
3. Added Eigen FP16 support in LayerNormalization (performance improved from 22 s to 0.6 s).
4. Enabled the FP16 Transpose call in MLAS (performance improved from 3 s to 0.47 s).
### Steps taken to measure the performance numbers
1. Build ONNX Runtime and install the resulting Python wheel.
2. Download the E5 model (intfloat) from Hugging Face.
3. Convert the model to FP16 ONNX (see the conversion sketch below).
4. Create a Python script to run inference (see the sketch after this list).
5. Analyze operator-level timings using the session option that enables profiling (also shown below).

Measurements were taken on an AWS Graviton3E machine (c7gn) with 64 cores.
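A minimal sketch of the FP16 conversion step (step 3). The PR does not name the exact tool used; this assumes the commonly used `onnxconverter-common` helper and a hypothetical file name `e5.onnx` for the exported E5 model.

```python
# Sketch of the FP16 conversion step; tool choice and file names are assumptions.
import onnx
from onnxconverter_common import float16

model_fp32 = onnx.load("e5.onnx")        # FP32 ONNX model exported from Hugging Face
model_fp16 = float16.convert_float_to_float16(
    model_fp32,
    keep_io_types=True,                  # keep FP32 graph inputs/outputs; Cast nodes are inserted
)
onnx.save(model_fp16, "e5_fp16.onnx")
```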
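A minimal inference and profiling sketch for steps 4 and 5, assuming the FP16 model saved above and BERT-style input names (`input_ids`, `attention_mask`); adjust these to the actual exported model. The session option referenced is `SessionOptions.enable_profiling`, which writes a JSON trace with per-operator durations.

```python
# Sketch of running inference and collecting operator-level timings on CPU.
import time
import numpy as np
import onnxruntime as ort

so = ort.SessionOptions()
so.enable_profiling = True               # emit a per-operator timing profile

sess = ort.InferenceSession("e5_fp16.onnx", so, providers=["CPUExecutionProvider"])

batch, seq_len = 1, 512
feed = {
    "input_ids": np.ones((batch, seq_len), dtype=np.int64),
    "attention_mask": np.ones((batch, seq_len), dtype=np.int64),
}

start = time.perf_counter()
outputs = sess.run(None, feed)
print(f"latency: {time.perf_counter() - start:.3f} s")

profile_file = sess.end_profiling()      # path of the JSON trace that was written
print("profile written to", profile_file)
```

The resulting JSON trace can be inspected (for example in chrome://tracing) to compare per-operator durations before and after the change.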
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>