Attention CUDA BFloat16 Support (#25974)
### Description
Adds BFloat16 support to the CUDA Attention operator: the kernel implementations are
extended to accept BF16 input/output tensors.
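
For context, a minimal sketch of what such an extension typically amounts to, assuming the common pattern of templated CUDA kernels instantiated per element type; the kernel and names below are illustrative, not the actual ONNX Runtime Attention code:

```cpp
// Illustrative only: a toy templated kernel standing in for an attention
// sub-step. BF16 support is added by instantiating the same kernel for
// __nv_bfloat16, mirroring the existing float/half paths.
#include <cuda_bf16.h>
#include <cuda_fp16.h>

template <typename T>
__global__ void ScaleKernel(const T* in, T* out, float scale, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    // Compute in float; store back in the narrow storage type T.
    out[i] = static_cast<T>(static_cast<float>(in[i]) * scale);
  }
}

// Existing instantiations (FP32 / FP16).
template __global__ void ScaleKernel<float>(const float*, float*, float, int);
template __global__ void ScaleKernel<half>(const half*, half*, float, int);

// New instantiation enabling BF16 input/output tensors.
template __global__ void ScaleKernel<__nv_bfloat16>(const __nv_bfloat16*,
                                                    __nv_bfloat16*, float, int);
```

The real change presumably follows the same idea at the level of the Attention CUDA kernels and their element-type dispatch, with intermediate computation kept in higher precision where needed.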
### Motivation and Context
We already have BFloat16 support for GQA (Group Query Attention), but not for the
regular Attention operator, which many models (e.g. the visual encoder of Gemma 3)
require for inference because BF16 offers FP32-like numerical stability at lower
memory and compute cost.
---------
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>