[Build] Fix clang build issues for CPU and CUDA builds (#27669)
## Description
This PR fixes clang-specific build failures that show up in both the
standalone clang build and the CUDA clang build. It keeps the
build-system changes targeted, prefers source fixes where the warnings
indicate real type or declaration issues, and avoids broader warning
suppression than necessary for the CUDA provider target.
## Summary of Changes
### Build System
| File | Change |
|------|--------|
| `cmake/CMakeLists.txt` | Stop forwarding `-Wshorten-64-to-32` through CUDA host compilation where the GNU host compiler does not recognize it. |
| `cmake/onnxruntime_providers_cuda.cmake` | Add targeted clang `-Wno-error` handling for warning classes that are currently triggered by CUDA provider code and third-party CUDA headers under clang. |
### CPU / Common clang fixes
| File | Change |
|------|--------|
| `onnxruntime/core/common/cpuid_info.cc` | Replace the clang-incompatible `__builtin_cpu_supports("waitpkg")` path with the CPUID-bit check for TPAUSE detection. |
| `onnxruntime/test/framework/allocation_planner_test.cc` | Refactor `typeid` assertions to avoid clang's potentially-evaluated-expression warning while keeping test coverage unchanged. |
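
As a rough illustration of the two CPU-side fixes above, the sketch below shows one way to test for WAITPKG (the feature flag that covers TPAUSE) by reading CPUID leaf 7, subleaf 0, ECX bit 5, and one way to keep a side-effecting expression out of a `typeid` operand. The names `HasWaitPkgBit`, `CpuSupportsTpause`, `Base`, `Derived`, and `IsDerived` are hypothetical and not taken from the changed files; the sketch assumes a GCC- or clang-style `<cpuid.h>` on x86 and is not the code in this PR.

```cpp
#include <cstdint>
#include <memory>
#include <typeinfo>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>  // GCC/clang helper header; MSVC would need <intrin.h> instead
#endif

// WAITPKG (TPAUSE, UMONITOR, UMWAIT) is reported in
// CPUID leaf 7, subleaf 0, ECX bit 5.
constexpr bool HasWaitPkgBit(std::uint32_t ecx) { return ((ecx >> 5) & 1u) != 0; }

// Hypothetical helper: reads CPUID directly instead of calling
// __builtin_cpu_supports("waitpkg").
inline bool CpuSupportsTpause() {
#if defined(__x86_64__) || defined(__i386__)
  unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
  // __get_cpuid_count returns 0 when the requested leaf is not available.
  if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) == 0) return false;
  return HasWaitPkgBit(ecx);
#else
  return false;  // not an x86 target
#endif
}

// Hypothetical polymorphic types standing in for the objects under test.
struct Base {
  virtual ~Base() = default;
};
struct Derived : Base {};

// typeid evaluates its operand when it is a glvalue of polymorphic type, so
// clang warns if that operand contains a call such as unique_ptr::operator*.
// Binding the result to a reference first keeps the call outside typeid.
inline bool IsDerived(const std::unique_ptr<Base>& p) {
  const Base& ref = *p;
  return typeid(ref) == typeid(Derived);
}
```

Splitting the bit test from the CPUID query keeps the bit position checkable on any host, including machines that do not implement WAITPKG.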
### CUDA provider and contrib fixes
| File | Change |
|------|--------|
| `onnxruntime/contrib_ops/cuda/utils/dump_cuda_tensor.h` | Mark the `IConsoleDumper` overrides explicitly while leaving CUDA-only overloads unchanged. |
| `onnxruntime/contrib_ops/cuda/bert/group_query_attention.cc` | Use `template` on the dependent `GetAttrOrDefault` call so clang parses it correctly. |
| `onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_api.cc` | Make narrowing conversions to flash-attention parameter fields explicit. |
| `onnxruntime/contrib_ops/cuda/quantization/matmul_nbits.cc` | Make the `nbits_` conversion explicit when calling the CUDA helper. |
| `onnxruntime/contrib_ops/cuda/quantization/moe_quantization.cc` | Restrict the GCC-only warning pragma so clang does not treat it as an unknown warning option. |
| `onnxruntime/contrib_ops/cuda/transformers/generation_device_helper.cc` | Fix explicit state-field assignments to use the actual `int` field type. |
| `onnxruntime/core/providers/cuda/cuda_mempool_arena.h` | Remove an unused private field that clang flagged in the CUDA provider build. |
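
Three of the source-level patterns in the table above can be sketched in isolation: the `template` disambiguator on a dependent member call, an explicit narrowing cast, and a pragma guarded so that only GCC sees it. Everything here is illustrative: `FakeKernelInfo`, `ReadAttr`, `FakeParams`, `SetSeqlen`, `AddOne`, and the attribute name are hypothetical stand-ins, and `-Wmaybe-uninitialized` is only an assumed example of a GCC-only warning, not necessarily the one touched in this PR.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for an attribute reader with a member template.
struct FakeKernelInfo {
  template <typename T>
  T GetAttrOrDefault(const char* /*name*/, T default_value) const {
    return default_value;
  }
};

// Inside a template, `info` has a dependent type, so the `template` keyword
// tells the parser that `<` opens a template argument list.
template <typename Info>
std::int64_t ReadAttr(const Info& info) {
  return info.template GetAttrOrDefault<std::int64_t>("some_attr", std::int64_t{-1});
}

// Hypothetical parameter struct with a 32-bit field.
struct FakeParams {
  int seqlen = 0;
};

// Explicit narrowing: the cast states that the 64-bit value is expected to
// fit in an int, instead of relying on an implicit conversion.
inline void SetSeqlen(FakeParams& params, std::size_t seqlen) {
  params.seqlen = static_cast<int>(seqlen);
}

// clang also defines __GNUC__, so a GCC-only pragma needs the extra check.
#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wmaybe-uninitialized"  // assumed example warning
#endif

inline int AddOne(int x) { return x + 1; }

#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC diagnostic pop
#endif
```

The guard on `__clang__` matters because testing `__GNUC__` alone would still hand the pragma to clang, which reports warning names it does not know as unknown warning options.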
## Testing
Tested CPU and CUDA 12.8 builds on Azure Linux with:
- clang 18.1.8
- gcc 13.2
- cmake 4.2.3
Example for CPU build:
```
export CC=clang
export CXX=clang++
bash build.sh --config RelWithDebInfo --parallel --cmake_extra_defines onnxruntime_BUILD_UNIT_TESTS=ON
```
## Motivation and Context
Clang is stricter than GCC/MSVC in a few areas that affect this tree:
CUDA host flag forwarding, explicit narrowing, dependent template
parsing, warnings emitted from third-party CUDA headers, and RTTI/typeid
expressions in tests. The goal here is to keep the staged fix minimal
and maintainable by correcting real source issues where practical and
confining warning downgrades to the CUDA provider target where
third-party header noise is currently unavoidable.