[CUDA] Add Validation of batch_indices in RoiAlign (#27603)
## Description
This PR implements a device-side bounds check for `batch_indices` in the
RoiAlign CUDA operator. This is a follow-up to
https://github.com/microsoft/onnxruntime/pull/27543, which fixed the
same vulnerability in the CPU implementation.
Previously, `CheckROIAlignValidInput()` only validated `batch_indices`
when they were accessible on the host (CPU). For the CUDA EP,
`batch_indices` reside in GPU memory, so host-side validation would
require an expensive GPU-to-CPU copy, which could also break CUDA graph
capture.
This change:
1. Passes `batch_size` from the host to the CUDA kernel.
2. Adds a check within the `RoIAlignForward` kernel to ensure `0 <=
batch_index < batch_size`.
3. If an invalid `batch_index` is encountered, the kernel sets the
output value for that specific RoI element to 0 and returns early for
that thread.
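
For reviewers, the check described in the steps above can be modeled on the host roughly as follows. This is a minimal sketch, not the kernel code from this PR: `IsValidBatchIndex`, `RoiAlignForwardModel`, `elems_per_roi`, and `placeholder_value` are hypothetical names introduced for illustration, the sequential loop stands in for the kernel's per-thread execution, and the bilinear pooling itself is replaced by a placeholder write.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: the range check 0 <= batch_index < batch_size.
inline bool IsValidBatchIndex(int64_t batch_index, int64_t batch_size) {
  return batch_index >= 0 && batch_index < batch_size;
}

// Hypothetical host-side model of the kernel's control flow. Each loop
// iteration plays the role of one CUDA thread handling one output element.
// `elems_per_roi` stands in for channels * pooled_height * pooled_width.
std::vector<float> RoiAlignForwardModel(const std::vector<int64_t>& batch_indices,
                                        int64_t batch_size,
                                        int64_t elems_per_roi,
                                        float placeholder_value) {
  assert(batch_size >= 0 && elems_per_roi > 0);
  const std::size_t per_roi = static_cast<std::size_t>(elems_per_roi);
  std::vector<float> output(batch_indices.size() * per_roi, -1.0f);

  for (std::size_t index = 0; index < output.size(); ++index) {
    const std::size_t n = index / per_roi;  // RoI this element belongs to
    if (!IsValidBatchIndex(batch_indices[n], batch_size)) {
      output[index] = 0.0f;  // invalid RoI: write zero, never touch the input
      continue;              // models the early return for this thread
    }
    // Assumption: the real kernel performs bilinear pooling here.
    output[index] = placeholder_value;
  }
  return output;
}
```

With `batch_size = 2` and `batch_indices = {0, 5, -1, 1}`, the model zeroes every element of the second and third RoIs and leaves the first and fourth to the normal path. The input tensor is never indexed with an out-of-range batch index.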
## Impact
- **Vulnerability fixed:** Heap out-of-bounds read on GPU.
- **Performance:** Expected to be negligible; the check is a single
range comparison per thread inside the existing kernel.
- **Compatibility:** No changes to ONNX models or public APIs.
## Validation
- Existing `RoiAlignTest` suite.
- Added two new test cases: `BatchIndicesOutOfRange_CUDA` and
`BatchIndicesNegative_CUDA` to verify that the CUDA provider correctly
handles out-of-range batch indices.
- Verified that the CUDA provider handles opset 10 without falling back
to the CPU EP for these tests.