[QNN EP] Add BFloat16 dtype support in QNN EP (#26987)
### Description
- The QNN NPU backend supports the BFloat16 dtype for many operators.
- QNN EP adds a new session option, "htp_bf16_enable", that lets users
request processing of a Float32 graph in BFloat16 precision.
- When the user specifies "htp_bf16_enable", QNN EP lowers the incoming
Float32 ORT graph into a BFloat16 QNN graph.
- Partitions that fall back to the ORT CPU EP still run in Float32.
- The lowered QNN graph still accepts Float32 inputs, outputs, and
constant initializers; QNN EP inserts Cast operators to perform the
necessary precision conversions.
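
As a usage sketch: this PR names the option "htp_bf16_enable", but exactly how it is passed (as a QNN provider option in the `providers` list, versus via `SessionOptions.add_session_config_entry`) and which values it accepts are assumptions here; confirm against the QNN EP documentation. A plausible provider-option shape, mirroring existing QNN options such as `backend_path`:

```python
# Hypothetical sketch: enabling BFloat16 lowering through QNN EP options.
# Only the option name "htp_bf16_enable" comes from this PR; the key/value
# placement below is an assumption, not the confirmed API.
qnn_provider_options = {
    "backend_path": "QnnHtp.dll",   # HTP (NPU) backend library
    "htp_bf16_enable": "1",         # assumed: lower Float32 graph to BFloat16
}
providers = [
    ("QNNExecutionProvider", qnn_provider_options),
    "CPUExecutionProvider",         # fallback partitions still run in Float32
]

# With onnxruntime installed, a session would then be created as:
#   import onnxruntime as ort
#   session = ort.InferenceSession("model.onnx", providers=providers)
# Model inputs/outputs stay Float32; QNN EP inserts Casts internally.
```

Note that because the lowered graph still exposes Float32 inputs and outputs, callers would not need to change their feed/fetch dtypes when toggling this option.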
### Motivation and Context
- This enables running accuracy-sensitive Float32 models in BFloat16
precision on the Qualcomm NPU accelerator, improving inference time
relative to Float32 execution.
---------
Co-authored-by: Ashwath Shankarnarayan <ashwshan@qti.qualcomm.com>