onnxruntime
530a1fbb - [QNN EP] Add BFloat16 dtype support in QNN EP (#26987)

### Description
- The QNN NPU backend supports the BFloat16 dtype for many operators.
- The QNN EP adds a new session option, "htp_bf16_enable", that lets users request that a Float32 graph be processed in BFloat16 precision.
- When "htp_bf16_enable" is set, the QNN EP lowers the incoming Float32 ORT graph into a BFloat16 QNN graph.
- Partitions that fall back to the ORT CPU EP still receive Float32.
- The lowered QNN graph still accepts Float32 inputs, outputs, and constant initializers; the QNN EP inserts Cast operators to perform the necessary precision switches.

### Motivation and Context
- This enables running accuracy-sensitive Float32 models in BFloat16 precision on the Qualcomm NPU accelerator, improving inference time relative to computing in Float32.

---------

Co-authored-by: Ashwath Shankarnarayan <ashwshan@qti.qualcomm.com>
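A minimal sketch of how the new option might be passed from Python, assuming it is supplied through the QNN EP's provider-options dictionary in the usual way (the exact mechanism, and the "backend_path" value shown, are assumptions for illustration; only "htp_bf16_enable" comes from this PR):

```python
# Hypothetical usage: request BFloat16 lowering when running on the QNN HTP backend.
# The provider-options tuple form follows the standard onnxruntime convention;
# whether "htp_bf16_enable" is a provider option or a session config entry
# should be confirmed against the QNN EP documentation.
qnn_provider_options = {
    "backend_path": "QnnHtp.dll",   # assumed HTP (NPU) backend library name
    "htp_bf16_enable": "1",         # process the Float32 graph in BFloat16 precision
}
providers = [("QNNExecutionProvider", qnn_provider_options)]

# Session creation requires QNN-capable hardware, so it is shown commented out:
# import onnxruntime as ort
# sess = ort.InferenceSession("model.onnx", providers=providers)
# Inputs and outputs remain Float32; the EP inserts Casts internally.
```

Note that the model's inputs, outputs, and initializers stay Float32 from the caller's perspective; only the internal QNN graph computes in BFloat16.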