[Build] Fix debug build (#27659)
# Description
This PR addresses several build warnings and a build error in the CUDA
provider, primarily focused on improving the stability of Debug builds.
## Changes
### CUDA Provider Fixes
- **Fix signedness comparison warnings**:
  - In
    [tile.cc](https://github.com/microsoft/onnxruntime/blob/d2a67f87288aa3429a34a50c38f06933e1518683/onnxruntime/core/providers/cuda/tensor/tile.cc),
    changed the `axis` loop variable type from `size_t` to `int32_t` to
    match `input_rank`.
  - In
    [pad.cc](https://github.com/microsoft/onnxruntime/blob/d2a67f87288aa3429a34a50c38f06933e1518683/onnxruntime/core/providers/cuda/tensor/pad.cc),
    converted `p_pads->size()` to `int32_t` using `narrow` and updated the
    loop variable type to resolve signedness warnings across template
    instantiations.
- **Fix GQA build error**:
  - Added a missing include for `common.cuh` in
    [group_query_attention_qkv.cuh](https://github.com/microsoft/onnxruntime/blob/d2a67f87288aa3429a34a50c38f06933e1518683/onnxruntime/contrib_ops/cuda/bert/group_query_attention_qkv.cuh).
    This resolves the `identifier "CUDA_KERNEL_ASSERT" is undefined` error
    encountered in Debug builds.
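
For readers unfamiliar with the signedness fix, the pattern can be sketched roughly as follows. This is a minimal illustration, not the code from `tile.cc` or `pad.cc`: `NarrowToInt32` is a hypothetical stand-in for a checked narrowing cast, and `CountNonZeroPads` is an invented function used only to show a loop whose index and bound share one signed type.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a checked narrowing cast: throws instead of
// silently truncating when the value does not fit in int32_t.
inline int32_t NarrowToInt32(std::size_t value) {
  if (value > static_cast<std::size_t>(std::numeric_limits<int32_t>::max())) {
    throw std::overflow_error("value does not fit in int32_t");
  }
  return static_cast<int32_t>(value);
}

// Invented example: counts non-zero entries among the first 2 * rank pads.
// The size is narrowed once, so the loop compares int32_t with int32_t and
// no signed/unsigned comparison warning is emitted.
inline int32_t CountNonZeroPads(const std::vector<int64_t>& pads, int32_t rank) {
  const int32_t pads_size = NarrowToInt32(pads.size());
  int32_t count = 0;
  for (int32_t i = 0; i < pads_size && i < 2 * rank; ++i) {
    if (pads[static_cast<std::size_t>(i)] != 0) {
      ++count;
    }
  }
  return count;
}
```

Narrowing once before the loop keeps the check out of the hot path and makes the intended index type explicit at a single point.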
### Test Improvements
- **Rotary Embedding Tests**:
  - Skipped out-of-bounds position ID tests in
    [rotary_embedding_op_test.cc](https://github.com/microsoft/onnxruntime/blob/d2a67f87288aa3429a34a50c38f06933e1518683/onnxruntime/test/providers/cpu/llm/rotary_embedding_op_test.cc)
    and
    [test/contrib_ops/rotary_embedding_op_test.cc](https://github.com/microsoft/onnxruntime/blob/d2a67f87288aa3429a34a50c38f06933e1518683/onnxruntime/test/contrib_ops/rotary_embedding_op_test.cc)
    for Debug builds. This is necessary because CUDA device-side asserts
    (enabled in Debug mode) can poison the CUDA context when encountering
    out-of-bounds indices, causing subsequent tests to fail.
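
The skip logic described above might look roughly like the sketch below. It assumes that `NDEBUG` distinguishes Release from Debug builds, which may not be the macro the actual tests check. `kDeviceAssertsEnabled` and `SkipReasonForOutOfBoundsTest` are hypothetical names introduced for illustration.

```cpp
#include <string>

// Assumption: NDEBUG is defined for Release builds and undefined for Debug
// builds, so device-side asserts are treated as active only in Debug.
#ifdef NDEBUG
constexpr bool kDeviceAssertsEnabled = false;
#else
constexpr bool kDeviceAssertsEnabled = true;
#endif

// Hypothetical helper: returns a non-empty reason when an out-of-bounds
// position ID test should be skipped, and an empty string otherwise.
inline std::string SkipReasonForOutOfBoundsTest(bool runs_on_cuda) {
  if (runs_on_cuda && kDeviceAssertsEnabled) {
    return "out-of-bounds position IDs would trip a CUDA device-side assert";
  }
  return "";
}
```

In a GoogleTest body, a non-empty reason could be forwarded with `GTEST_SKIP() << reason;` before any kernel is launched, so the CUDA context is never exposed to the failing assert.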
### Minor Cleanup
- Simplified initializer list usage in
[graph_test.cc](https://github.com/microsoft/onnxruntime/blob/d2a67f87288aa3429a34a50c38f06933e1518683/onnxruntime/test/ir/graph_test.cc)
to avoid a build error such as:
```
inlined from ‘constexpr void std::vector<_Tp, _Alloc>::resize(size_type) [with _Tp = onnxruntime::NodeArg*; _Alloc = std::allocator<onnxruntime::NodeArg*>]’ at /usr/include/c++/13.2.0/bits/stl_vector.h:1013:21,
inlined from ‘virtual void onnxruntime::test::GraphTest_GraphConstruction_CheckGraphInputOutputOrderMaintained_Test::TestBody()’ at /home/tlwu/git/onnxruntime/onnxruntime/test/ir/graph_test.cc:1214:16:
/usr/include/c++/13.2.0/bits/stl_uninitialized.h:1132:28: error: ‘void* __builtin_memmove(void*, const void*, long unsigned int)’ forming offset 8 is out of the bounds [0, 8] [-Werror=array-bounds=]
1132 | __builtin_memmove(__result, __first, __count * sizeof(_Tp));
```
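
The exact edit to `graph_test.cc` is not reproduced here. As a hedged illustration only, one way to keep `std::vector::resize` out of the inlined path that the diagnostic points at is to construct the vector with its final contents in a single initializer list. `NodeArg` below is a hypothetical stand-in type and `MakeArgs` is an invented helper.

```cpp
#include <vector>

// Hypothetical stand-in for onnxruntime::NodeArg, used only so the sketch
// compiles on its own.
struct NodeArg {
  int id;
};

// Builds the vector with its final size and contents in one step, so no
// later resize() call (and no memmove of existing elements) is needed.
inline std::vector<NodeArg*> MakeArgs(NodeArg* first, NodeArg* second) {
  return {first, second};
}
```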