Fix CUDA ReduceSum crash on empty tensors with explicit axes

Remove the overly strict assertion that rejected reducing along a
zero-sized dimension even when the axes are given explicitly. Reducing
the zero-sized axis of a {N, 0} input with keepdims=false produces
shape {N} filled with the identity value (0 for sum), which is
mathematically valid.
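
As an illustration of the expected semantics only, here is a minimal
pure-Python sketch. It is not the ONNX Runtime code; the helper name
reduce_sum_axis1 is hypothetical, and the tensor is modeled as a list
of N rows with K elements each.

```python
from typing import List, Union


def reduce_sum_axis1(
    rows: List[List[float]], keepdims: bool = False
) -> Union[List[float], List[List[float]]]:
    """Sum an N x K matrix along axis 1 (hypothetical helper)."""
    # Summing an empty row yields the start value 0.0, which is the
    # additive identity, so K == 0 needs no special case here.
    sums = [sum(row, 0.0) for row in rows]
    if keepdims:
        # Keep the reduced axis as size 1: shape {N, 1}.
        return [[s] for s in sums]
    # Drop the reduced axis: shape {N}.
    return sums
```

Under these assumptions, reduce_sum_axis1([[], [], []]) models a
{3, 0} input and returns three zeros, matching the shape {N} result
described above.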

The CPU implementation already handles this case via
check_and_reduce_empty_set_input(). The CUDA path now allows
PrepareForReduce to succeed, and ReduceComputeCore (line 369) already
handles input_count==0 correctly.

This fixes CUDA inference for models with dynamic KV cache where
past_sequence_length=0 during prefill (e.g., Gemma4 via ORT GenAI).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>