FP16 inference performance improvement on CPU (#25680)
Author: Masaru Ito
[masaru.ito.mi@jp.fujitsu.com](mailto:masaru.ito.mi@jp.fujitsu.com)
### Description
1. Added FP16 support for the Add, Sub, Mul, and Div operators, which enables fusion of Gelu and of Erf and its surrounding ops.
2. Enabled the FP16 to FP32 Cast (performance improved from 15 s to 0.27 s).
3. Added Eigen FP16 support in LayerNormalization (performance improved from 22 s to 0.6 s).
4. Enabled the FP16 Transpose call in MLAS (performance improved from 3 s to 0.47 s).
### Steps taken to measure the performance numbers
1. Build ONNX Runtime and install the resulting Python wheel.
2. Download the E5 model (intfloat) from Hugging Face.
3. Convert the model to FP16 ONNX (see the conversion sketch below).
4. Create a Python script to run inference (see the sketch after this list).
5. Analyze operator-level timings using the session option that enables profiling (also shown below).

Measurements were taken on an AWS Graviton3E machine (c7gn) with 64 cores.
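A minimal sketch of the FP16 conversion step (step 3). The PR does not name the exact tool used; this assumes the commonly used `onnxconverter-common` helper and a hypothetical file name `e5.onnx` for the exported E5 model.

```python
# Sketch of the FP16 conversion step; tool choice and file names are assumptions.
import onnx
from onnxconverter_common import float16

model_fp32 = onnx.load("e5.onnx")        # FP32 ONNX model exported from Hugging Face
model_fp16 = float16.convert_float_to_float16(
    model_fp32,
    keep_io_types=True,                  # keep FP32 graph inputs/outputs; Cast nodes are inserted
)
onnx.save(model_fp16, "e5_fp16.onnx")
```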
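A minimal inference and profiling sketch for steps 4 and 5, assuming the FP16 model saved above and BERT-style input names (`input_ids`, `attention_mask`); adjust these to the actual exported model. The session option referenced is `SessionOptions.enable_profiling`, which writes a JSON trace with per-operator durations.

```python
# Sketch of running inference and collecting operator-level timings on CPU.
import time
import numpy as np
import onnxruntime as ort

so = ort.SessionOptions()
so.enable_profiling = True               # emit a per-operator timing profile

sess = ort.InferenceSession("e5_fp16.onnx", so, providers=["CPUExecutionProvider"])

batch, seq_len = 1, 512
feed = {
    "input_ids": np.ones((batch, seq_len), dtype=np.int64),
    "attention_mask": np.ones((batch, seq_len), dtype=np.int64),
}

start = time.perf_counter()
outputs = sess.run(None, feed)
print(f"latency: {time.perf_counter() - start:.3f} s")

profile_file = sess.end_profiling()      # path of the JSON trace that was written
print("profile written to", profile_file)
```

The resulting JSON trace can be inspected (for example in chrome://tracing) to compare per-operator durations before and after the change.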
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>