Use CUDART_VERSION reduction compatibility in GQA attention (#28296)
### Description
Update `onnxruntime/contrib_ops/cuda/bert/gqa_unfused_attention.cu` to
match the CUDA attention compatibility pattern already used elsewhere in
the repo.
- Replace the local reduction functors with the established
`CUDART_VERSION >= 12090` guards.
- Use `::cuda::maximum()` and `::cuda::std::plus()` for CUDA 12.9+.
- Keep `cub::Max()` and `cub::Sum()` as the fallback for older toolkits.
### Motivation and Context
This keeps the GQA unfused attention kernel consistent with nearby CUDA
attention code and avoids the CUDA 12.9+ deprecation of the old CUB
reduction functors, while preserving compatibility with older CUDA
toolkits.
Validation:
- `git diff --check`
- Code review validation: no comments
- CodeQL validation: no analyzable language changes detected
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>