Fix Potential Integer Truncation Leading to Heap Out-of-Bounds Read/Write (#27544)
### Description
This pull request refactors several tensor operation kernels
(`GatherND`, `ScatterND`, and `GatherGrad`) to improve type safety and
consistency in parallelized code execution. The main change is replacing
`int` loop indices with `ptrdiff_t` so that large indices are no longer
truncated, which could otherwise lead to heap out-of-bounds reads or writes.
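
To illustrate the kind of truncation this change guards against, here is a minimal sketch. The helper names `SurvivesIntCast` and `ByteOffset` are hypothetical and are not part of ONNX Runtime; the sketch assumes a platform where `int` is narrower than 64 bits.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical helper (not an ONNX Runtime API): reports whether a 64-bit
// index keeps its value when narrowed to `int`.
inline bool SurvivesIntCast(std::int64_t index) {
  return index >= std::numeric_limits<int>::min() &&
         index <= std::numeric_limits<int>::max();
}

// Hypothetical helper: byte offset of element `index` in a buffer whose
// elements are `element_size` bytes wide, computed in 64-bit arithmetic so
// the result is not narrowed along the way.
inline std::int64_t ByteOffset(std::int64_t index, std::int64_t element_size) {
  return index * element_size;
}
```

An index that fails `SurvivesIntCast` would take a different value after a `static_cast<int>`, so any offset derived from it could point outside the intended buffer. Keeping the index in a pointer-sized type such as `ptrdiff_t` avoids that narrowing step.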
### Parallelization and Type Safety Improvements
* Updated lambda functions and parallel loop indices in `gather_nd.cc`
(`GatherNDBase::PrepareForCompute`, `GatherND::GatherNumber`, and
`GatherND::GatherString`) to use `ptrdiff_t` instead of `int64_t`, and
rewrote the index arithmetic with explicit casts to maintain correctness.
[[1]](diffhunk://#diff-a456934cd8ef2c51197e04af32ecbef5b531dae83f7f8c2aca46802b7a5e7b7bL96-R100)
[[2]](diffhunk://#diff-a456934cd8ef2c51197e04af32ecbef5b531dae83f7f8c2aca46802b7a5e7b7bL121-R121)
[[3]](diffhunk://#diff-a456934cd8ef2c51197e04af32ecbef5b531dae83f7f8c2aca46802b7a5e7b7bL192-R216)
* Refactored `scatter_nd.cc` (`ScatterNDDispatchTarget`) to use
`ptrdiff_t` for loop indices and index arithmetic in all reduction
cases, ensuring consistent type usage in parallel execution.
* Modified `gather_grad.cc` (`GatherGrad::ComputeImpl`) to use
`ptrdiff_t` for parallel loop indices, aligning with the changes in
other tensor kernels.
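
The loop pattern described in the bullets above can be sketched roughly as follows. `GatherRange` is a hypothetical stand-in for the body handed to a parallel-for; it is not the actual kernel code, and the flat `std::vector` buffers are an assumption made to keep the example self-contained.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a parallel-for body (not the real kernel):
// copies input[offsets[i]] into output[i] for every i in [first, last).
// The range bounds and the loop index stay `std::ptrdiff_t` throughout,
// so no narrowing to `int` takes place.
inline void GatherRange(const std::vector<float>& input,
                        const std::vector<std::int64_t>& offsets,
                        std::vector<float>& output,
                        std::ptrdiff_t first, std::ptrdiff_t last) {
  for (std::ptrdiff_t i = first; i < last; ++i) {
    // Explicit casts make each signed-to-unsigned conversion visible.
    const auto slot = static_cast<std::size_t>(i);
    output[slot] = input[static_cast<std::size_t>(offsets[slot])];
  }
}
```

Calling `GatherRange` on disjoint sub-ranges, as a thread pool would, fills disjoint parts of `output`, so the chunks do not depend on each other.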
### Motivation and Context
A similar issue was fixed in
https://github.com/microsoft/onnxruntime/pull/27444