fix: out of bounds access for resize operation (#27419)
### Description
This PR fixes:
* An out-of-bounds write in CUDA Resize for LINEAR mode when running
trilinear paths (3D/5D)
* A shared-memory race condition in the reduction kernel
### Root cause
1. The temporary dims-mapping buffer for LINEAR mode was sized using
only H+W, while the trilinear coordinate mapping kernel writes D+H+W
entries.
2. A shared-memory race in the block-level reduction loop inside
`reduction_functions.cu`.
The condition allowed threads outside the active lower half to update
shared memory in the same stride phase, creating overlapping read/write
hazards.
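
The hazard in point 2 can be illustrated on the host with a small index model. This is only a sketch, not the code from `reduction_functions.cu`: the function names are hypothetical, the looser guard `t + stride < block_size` is an assumed stand-in for the original condition (which is not quoted here), and the model assumes a power-of-two block with a halving stride. It counts shared-memory slots that one thread writes while another thread reads them in the same stride phase.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Hypothetical model of one stride phase: thread t performs
//   shared[t] += shared[t + stride]
// if it passes the guard. Returns the number of slots that are both
// written by one thread and read by another in that phase.
//   lower_half_only == true  -> guard is t < stride
//   lower_half_only == false -> guard is t + stride < block_size (assumed, for contrast)
inline std::size_t CountPhaseHazards(std::size_t block_size, std::size_t stride,
                                     bool lower_half_only) {
  std::set<std::size_t> writes;
  std::set<std::size_t> reads;
  for (std::size_t t = 0; t < block_size; ++t) {
    const bool active = lower_half_only ? (t < stride) : (t + stride < block_size);
    if (!active) continue;
    writes.insert(t);          // destination slot of this thread
    reads.insert(t + stride);  // partner slot read by this thread
  }
  std::size_t overlap = 0;
  for (std::size_t slot : writes) overlap += reads.count(slot);
  return overlap;
}

// Sums the overlapping slots over all phases of a halving-stride reduction.
inline std::size_t CountReductionHazards(std::size_t block_size, bool lower_half_only) {
  assert(block_size > 0 && (block_size & (block_size - 1)) == 0);  // power of two
  std::size_t total = 0;
  for (std::size_t stride = block_size / 2; stride > 0; stride /= 2) {
    total += CountPhaseHazards(block_size, stride, lower_half_only);
  }
  return total;
}
```

With the `t < stride` guard, the written slots `[0, stride)` and the read slots `[stride, 2 * stride)` never intersect, so every phase is hazard-free. With the looser guard, the first phase is still clean, but later phases write slots that other threads read in the same phase.
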
My colleague @korbinian-mechlem-snkeos noticed these warnings from
compute-sanitizer:
> ========= Invalid __global__ write of size 4 bytes
========= at void
onnxruntime::cuda::_ResizeTrilinearCoordinateMapping<float,
onnxruntime::cuda::TransformCoordinate_HALF_PIXEL>(long long, long long,
long long, long long, long long, long long, float, float, float, float,
float, float, float, float, float, unsigned long long, bool, const T2 &,
onnxruntime::cuda::LinearMappingInfo *)+0x400
========= by thread (17,0,0) in block (2,0,0)
========= Address 0xb28fff7cc is out of bounds
========= and is 205 bytes after the nearest allocation at 0xb28fff400
of size 768 bytes
========= Saved host backtrace up to driver entry point at kernel launch
time
AND
> ========= Warning: Race reported between Read access at void
onnxruntime::cuda::detail::reduce_matrix_columns_kernel<float, float,
float, onnxruntime::cuda::Identity, onnxruntime::cuda::Identity,
(bool)0>(int, int, const T1 *, T2 *, T3 *, int *)+0xe80
========= and Write access at void
onnxruntime::cuda::detail::reduce_matrix_columns_kernel<float, float,
float, onnxruntime::cuda::Identity, onnxruntime::cuda::Identity,
(bool)0>(int, int, const T1 *, T2 *, T3 *, int *)+0xea0 [337920 hazards]
### Motivation and Context
Update LINEAR buffer size calculation to:
* use H+W for bilinear (2D/4D)
* use D+H+W for trilinear (3D/5D)
This prevents invalid global writes and intermittent CUDA memory errors in
trilinear resize workloads.
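
The sizing rule above can be sketched as a small host-side helper. This is a minimal illustration, not the actual change in the Resize implementation: `LinearMappingEntryCount` and its parameters are hypothetical names, and it assumes the buffer holds one mapping entry per index along each interpolated spatial axis.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: number of mapping entries needed for LINEAR mode.
//   rank 2 or 4 -> bilinear, entries = H + W
//   rank 3 or 5 -> trilinear, entries = D + H + W
// `depth` is ignored for the bilinear ranks.
inline std::size_t LinearMappingEntryCount(std::size_t rank, std::int64_t depth,
                                           std::int64_t height, std::int64_t width) {
  assert(rank >= 2 && rank <= 5);
  assert(depth >= 0 && height >= 0 && width >= 0);
  const bool is_trilinear = (rank == 3 || rank == 5);
  const std::int64_t entries = is_trilinear ? depth + height + width : height + width;
  return static_cast<std::size_t>(entries);
}
```

Sizing the buffer with `height + width` for a rank-5 input would leave it `depth` entries short, which matches the kind of out-of-bounds write reported above.
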
@johannes-rehm-snkeos