Fix OOB reads in SoftmaxCrossEntropyLoss via label bounds validation (#28004)
### Description
Fix out-of-bounds reads in `SoftmaxCrossEntropyLoss` and
`SoftmaxCrossEntropyLossGrad` (CPU EP) when label values are outside
`[0, C)`. This is the same class of bug that #27568 fixed for
`SparseSoftmaxCrossEntropy`.
The CUDA kernels currently validate labels only via
`CUDA_KERNEL_ASSERT`, which is a no-op in release builds. CUDA hardening
is out of scope for this PR.
### Changes
- Forward: bounds check folded into the three per-sample loops (after
`ignore_index` skip, before any `weight_data[label]` /
`log_prob_data[i*C + label]` access).
- Backward: single upfront serial bounds check (parallel-for lambdas
cannot return `Status`); a code comment explains why.
- Validate `weight_shape[0] == C`.
- Move `weight_data[label]` access after `ignore_index` check in grad
weighted paths.
- `N_D * C` wrapped in `SafeInt`; `gsl::narrow<int>` for `N_D` and `C`.
Overflow / truncation returns `INVALID_ARGUMENT`.
- `Eigen::Index` size guard: `ORT_ENFORCE` -> `ORT_RETURN_IF`.
- `IsScalar(ignore_index)` check: `ORT_ENFORCE` -> `ORT_RETURN_IF_NOT`
in both forward and backward.
- Pre-existing wrong-sized `memset` in backward (`sizeof(T1) * N_D`)
corrected to `sizeof(T1) * probability_shape.Size()`. The previous code
was effectively redundant (subsequent parallel-for paths overwrite all
`N_D * C` entries) so this is cleanup, not an active OOB.
- Renamed `weight_smaple` -> `weight_sample`.
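
For reviewers who want the shape of the checks without opening the diff, here is a minimal standalone sketch (C++17). It is not the code from this PR. `Error`, `CheckedMul`, `ValidateLabels`, and `WeightedNllSum` are hypothetical names; `std::optional<std::string>` stands in for `Status`, and the manual overflow test stands in for `SafeInt`. It illustrates the ordering described above: skip `ignore_index`, then bounds-check, then index.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a Status return value: empty means OK.
using Error = std::optional<std::string>;

// Overflow-checked product of two non-negative sizes (stand-in for SafeInt).
bool CheckedMul(int64_t a, int64_t b, int64_t& out) {
  if (a < 0 || b < 0) return false;
  if (a != 0 && b > std::numeric_limits<int64_t>::max() / a) return false;
  out = a * b;
  return true;
}

// Backward-style check: one serial pass over all labels before any
// parallel work starts, so the error can be returned to the caller.
Error ValidateLabels(const std::vector<int64_t>& labels, int64_t C,
                     int64_t ignore_index) {
  for (std::size_t i = 0; i < labels.size(); ++i) {
    const int64_t label = labels[i];
    if (label == ignore_index) continue;  // ignored labels are exempt
    if (label < 0 || label >= C) {
      return Error("label " + std::to_string(label) + " at index " +
                   std::to_string(i) + " is outside [0, C)");
    }
  }
  return std::nullopt;
}

// Forward-style check: validation folded into the per-sample loop.
// Computes sum over non-ignored i of -weight[label] * log_prob[i*C + label].
Error WeightedNllSum(const std::vector<float>& log_prob,
                     const std::vector<int64_t>& labels,
                     const std::vector<float>& weight, int64_t C,
                     int64_t ignore_index, float& loss_out) {
  const int64_t N_D = static_cast<int64_t>(labels.size());
  int64_t total = 0;
  if (!CheckedMul(N_D, C, total)) return Error("N_D * C overflows");
  if (static_cast<int64_t>(log_prob.size()) != total) {
    return Error("log_prob size must equal N_D * C");
  }
  if (static_cast<int64_t>(weight.size()) != C) {
    return Error("weight size must equal C");
  }

  float loss = 0.0f;
  for (int64_t i = 0; i < N_D; ++i) {
    const int64_t label = labels[static_cast<std::size_t>(i)];
    if (label == ignore_index) continue;  // 1. skip ignored samples
    if (label < 0 || label >= C) {        // 2. bounds check
      return Error("label out of range");
    }
    // 3. only now is it safe to index with the label
    loss -= weight[static_cast<std::size_t>(label)] *
            log_prob[static_cast<std::size_t>(i * C + label)];
  }
  loss_out = loss;
  return std::nullopt;
}
```

The in-loop form avoids a second pass over the labels. The upfront form suits code whose main loop runs inside a callback that has no way to propagate an error.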
### Tests
11 regression tests in `cross_entropy_test.cc`:
- Label too large (forward + grad, int64 + int32)
- Negative label
- Label too large with weights (MEAN and SUM reductions)
- Higher-dim logit `[2,4,2,3]` with label `[2,2,3]`
- `SoftmaxCrossEntropyLossInternal` and
`SoftmaxCrossEntropyLossInternalGrad` with `ignore_index` as a runtime
tensor input
---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>