# Fix RoiAlign heap out-of-bounds read via unchecked batch_indices (#27543)
## Description
Add value-range validation for `batch_indices` in the RoiAlign operator
to prevent out-of-bounds heap reads from maliciously crafted ONNX
models.
`CheckROIAlignValidInput()` previously validated tensor shapes but never
checked that the **values** in `batch_indices` fall within `[0,
batch_size)`. An attacker could supply `batch_indices` values outside
the batch dimension of the input tensor `X` (negative or `>=
batch_size`), causing the kernel to read out-of-bounds heap memory at:
- **CPU:** `roialign.cc:212` — `roi_batch_ind` used as unchecked index
into `bottom_data`
- **CUDA:** `roialign_impl.cu:109` — `batch_indices_ptr[n]` used as
unchecked index into `bottom_data` on GPU
## Impact
- **Vulnerability type:** Heap out-of-bounds read
- **Impact:** Arbitrary heap memory read, potential information
disclosure, program crash
- **Trigger:** Construct `batch_indices` with values ≥ `batch_size` or <
0
- **Affected providers:** CPU and CUDA (both call
`CheckROIAlignValidInput()`)
## Changes
### `onnxruntime/core/providers/cpu/object_detection/roialign.cc`
- Added per-element bounds check in `CheckROIAlignValidInput()`: each
`batch_indices[i]` must satisfy `0 <= value < X.shape[0]`
- Returns `INVALID_ARGUMENT` with a descriptive error message on
violation
- Guarded by `batch_indices_ptr->Location().device.Type() ==
OrtDevice::CPU` so it only runs when the tensor data is host-accessible
(CPU EP and CropAndResize). For the CUDA EP, `batch_indices` lives in
GPU memory and cannot be safely dereferenced on the host.
### `onnxruntime/test/providers/cpu/object_detection/roialign_test.cc`
- Added `BatchIndicesOutOfRange` test: `batch_indices={1}` with
`batch_size=1` (exercises `>= batch_size` path)
- Added `BatchIndicesNegative` test: `batch_indices={-1}` (exercises `<
0` path)
## Known Limitation
The CUDA execution path is **not** protected by this bounds check
because `batch_indices` is a GPU tensor and cannot be read on the host.
Adding a device-side bounds check would require passing `batch_size`
into the CUDA kernel — this is tracked as a follow-up.
Note: Using `.InputMemoryType(OrtMemTypeCPUInput, 2)` was considered but
rejected because it would force a GPU→CPU transfer of `batch_indices`,
breaking CUDA graph capture for models like Mask R-CNN where
`batch_indices` is produced by upstream GPU ops.
## Validation
- Full `RoiAlignTest.*` suite passes (12/12 tests) on CPU build
- Full `RoiAlignTest.*` suite passes (12/12 tests) on CUDA build
- No regressions in existing positive or negative tests