Update CUDA TopK kernel registration to opset 24 with BFloat16 support (#27735)
- [x] Cap the existing CUDA TopK kernel registration at the versioned
opset range [11, 23] and add a new registration for opset 24
- [x] Add BFloat16 support for CUDA TopK opset 24 (topk_impl_bf16.cu,
helpers, NumericLimits)
- [x] Add BFloat16 test cases for TopK opset 24
- [x] Fix CUB build error: map BFloat16 → __nv_bfloat16 for
BlockRadixSort and DeviceRadixSort
- [x] Add CubSortType trait in topk_impl.cuh
- [x] Update RadixTopK kernel to use CubSortType for BlockRadixSort
- [x] Update TopKImpl DeviceRadixSort calls to use CubSortType pointers
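
For reviewers, a minimal sketch of what a `CubSortType`-style trait could look like. It is not the code in topk_impl.cuh: `BFloat16` and `NvBFloat16Stub` below are hypothetical stand-ins for the ONNX Runtime and CUDA (`__nv_bfloat16`) types so the snippet compiles on the host without the CUDA toolkit, and `AsCubPointer` is an illustrative helper name.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical stand-ins (assumptions, not the real headers):
struct BFloat16 { uint16_t val; };      // stands in for the ORT BFloat16 type
struct NvBFloat16Stub { uint16_t x; };  // stands in for __nv_bfloat16

// Primary template: element types are handed to CUB unchanged.
template <typename T>
struct CubSortType {
  using type = T;
};

// Specialization: BFloat16 is sorted as the native CUDA bfloat16 type,
// which CUB's radix sort knows how to turn into sortable bit patterns.
template <>
struct CubSortType<BFloat16> {
  using type = NvBFloat16Stub;
};

// Reinterpreting buffers is only reasonable if the layouts match.
static_assert(sizeof(BFloat16) == sizeof(NvBFloat16Stub), "size mismatch");
static_assert(alignof(BFloat16) == alignof(NvBFloat16Stub), "alignment mismatch");

// Illustrative helper: view a T* buffer as the type the sort routine expects.
template <typename T>
typename CubSortType<T>::type* AsCubPointer(T* p) {
  return reinterpret_cast<typename CubSortType<T>::type*>(p);
}
```

With a trait of this shape, the block-level and device-level sort calls can be written once against `typename CubSortType<T>::type` instead of branching on the element type.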
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>