Fix buffer bounds in WhisperDecoderSubgraph::CreateInitialFeeds (#29239)
### Description
`WhisperDecoderSubgraph::CreateInitialFeeds` constructed the decoder
initial feeds using a single value that mixed a **byte count** with an
**element count**. The total size was computed as `cur_len *
batch_beam_size * sizeof(int)` (bytes) and then reused as:
- the element count for the int32 staging buffer (`MakeUniquePtr<int>`),
and
- the element count for the `gsl::span<int>` source/destination passed
to the device copy.
Because the `input_ids` tensor is allocated for exactly `batch_beam_size
* cur_len` int32 elements, the spans claimed 4x the real extent, so the
device copy ran past the end of the buffer. The per-beam `memcpy` also
used the same combined value as its length instead of a single
sequence's byte size.
The fix mirrors the T5 sibling (`subgraph_t5_decoder.cc`), which already
separates the element count (used for the spans/staging allocation) from
the per-sequence byte count (used for the `memcpy`).
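For reviewers, the element-count versus byte-count split can be sketched in isolation. This is a minimal illustration, not the code in `subgraph_whisper_decoder.cc`: `PackInputIds` is a hypothetical helper, and `std::vector` stands in for the allocator-backed staging buffer and `gsl::span` used by the real implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper (assumption, not an ONNX Runtime API): copies the first
// cur_len tokens of every beam's sequence into one contiguous int32 buffer.
std::vector<int32_t> PackInputIds(const std::vector<std::vector<int32_t>>& sequences,
                                  std::size_t cur_len) {
  const std::size_t batch_beam_size = sequences.size();

  // Element count: sizes the staging buffer and any span built over it.
  const std::size_t total_size = cur_len * batch_beam_size;

  // Byte count of ONE sequence: the length of each per-beam memcpy.
  const std::size_t sequence_bytes = cur_len * sizeof(int32_t);

  std::vector<int32_t> staging(total_size);
  for (std::size_t i = 0; i < batch_beam_size; ++i) {
    assert(sequences[i].size() >= cur_len);
    // Destination offset is in elements, copy length is in bytes.
    std::memcpy(staging.data() + i * cur_len, sequences[i].data(), sequence_bytes);
  }
  return staging;
}
```

Using `total_size * sizeof(int32_t)` as an element count anywhere in this sketch would reproduce the 4x over-extent described above.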
### Changes
- `subgraph_whisper_decoder.cc`: `total_size` is now the element count
`cur_len * batch_beam_size`; introduced `sequence_bytes = cur_len *
sizeof(int32_t)` for the per-beam `memcpy`. The staging buffer and spans
use `int32_t` consistently to match the `int32_t` tensors/sequences.
- Added regression test
`BeamSearchTest.DummyWhisperWithSequenceInputIds` (CPU, and CUDA under
`USE_CUDA`) exercising the `use_sequence_as_input_ids` path, with a
deterministic dummy model and its generator script. The test validates
both the `sequences` and `scores` outputs.
### Related bool-tensor normalization fixes
While exercising the Whisper path, bool tensors copied from raw data
could hold non-canonical byte values (anything non-zero rather than
strictly `{0, 1}`), causing provider-dependent behavior. To keep the fix
self-contained, the following normalization changes are included:
- `tensorprotoutils.cc`: `UnpackTensor<bool>` normalizes raw-data bytes
to `{0, 1}` (with a `static_assert(sizeof(bool) == 1)` guarding the
byte-wise loop).
- `compress_impl.cu` (CUDA `Compress`): the prefix-sum sizing predicate
normalizes bool bytes to `{0, 1}` so the output sizing agrees with the
element-selection truthiness check. Since bool initializers are now
normalized on unpack, the remaining exposure is runtime-produced bool
condition tensors.
- Added `CompressTest.Compress_cuda_non_canonical_bool_condition` (under
`USE_CUDA`), which feeds a raw `0xFF` condition byte through a
session-level run (`OpTester` normalizes bool inputs and so cannot
reproduce this) and asserts the Compress output is sized by truthiness
rather than by the sign-extended byte value.
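The sizing mismatch can be shown on the host with a small sketch. It is not the CUDA kernel in `compress_impl.cu`: the three functions below are hypothetical names chosen for illustration, and the sketch assumes the usual two's-complement conversion in which a `0xFF` byte read as `int8_t` becomes `-1`.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: maps every non-zero byte to 1, as a normalizing
// unpack of raw bool data would.
std::vector<uint8_t> NormalizeBoolBytes(const std::vector<uint8_t>& raw) {
  std::vector<uint8_t> out(raw.size());
  for (std::size_t i = 0; i < raw.size(); ++i) {
    out[i] = raw[i] != 0 ? 1 : 0;
  }
  return out;
}

// Sizing by summing sign-extended bytes: wrong for non-canonical bools,
// because 0xFF contributes -1 instead of 1.
int CountBySignedSum(const std::vector<uint8_t>& condition) {
  int sum = 0;
  for (uint8_t b : condition) {
    sum += static_cast<int8_t>(b);
  }
  return sum;
}

// Sizing by truthiness: agrees with an element-selection check of `b != 0`.
int CountByTruthiness(const std::vector<uint8_t>& condition) {
  int sum = 0;
  for (uint8_t b : condition) {
    sum += (b != 0) ? 1 : 0;
  }
  return sum;
}
```

With a condition of `{0xFF, 0x00, 0x01}`, two elements are truthy, but the signed sum cancels to zero, so a sizing pass based on it would disagree with the selection pass.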
### Motivation
The decoder shares one implementation file across CPU/CUDA/ROCm, so this
single change covers all execution providers. The previous behavior
could overrun the staging/feed buffers for models that drive the
sequence-as-input-ids decoder path.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>