Support SCELossInternal/SCELossInternalGrad with larger-sized inputs (#16363)
### Support SCELoss/SCELossGrad with larger-sized inputs
#### Motivation and Context: Run a bigger batch size for the Bloom model.
For the Bloom-560M model, ORT can potentially raise the batch size from
the initial 6 to 10. SCELoss/SCELossGrad's input size is Bsz x 1023
x 250680. When Bsz is bigger than 8, the total element count can no
longer be represented by the int32_t those kernels use to pass the total
element count. The resulting silent overflow either triggers exceptions
indirectly elsewhere, or produces wrong results with no error at all.
#### Changes in this PR
- For the SCELossInternal/SCELossGradInternal CUDA kernels, use uint64_t
to pass the total element count and element indices when the total
element count is bigger than int32::max().
- For the SCELossInternal/SCELossGradInternal CPU kernels:
  - always use uint64_t to pass the element count.
  - update the Eigen functions involved in the two kernels'
implementations to pass the element count as `ptrdiff_t` instead of the
original `int`.
- Parallelize the SCELossInternal/SCELossGradInternal CPU kernels;
otherwise they are extremely slow when handling this many elements.
- Other changes needed:
  - Add `CompareOrtValueNumerals` to compare two OrtValues with
different data types (float and float16) without the caller explicitly
converting to the lower-precision type. The comparison also runs in
parallel, which reduces the comparison time for the large UT case from
~22 s to ~1.6 s.
  - The check in `IsResultCloselyMatch` is buggy for nan/inf cases; fix
those bugs.
  - The cross entropy tests run the CPU baseline in float, then compare
its result against the float16 results of the CUDA runs. This causes a
precision issue when checking the results: the randomized input data is
represented in float, the CPU uses it directly, but CUDA uses a float16
version of it, so the inputs themselves already differ. As the test data
count increases, this makes the comparison fail even at a 1e-2
tolerance. The fix: generate the data in float16, convert it to float
for the CPU run, and use the float16 data directly for the CUDA runs.
When comparing the outputs, cast the CPU float output back to float16
and then compare it with the CUDA outputs.
  - `RandomValueGenerator` takes about 20 s for the large size, so
`ParallelRandomValueGenerator` is added to generate the random input in
parallel; it takes under 2 s to prepare the input data.
#### Non-goals
`SoftmaxCrossEntropyLoss` and `SoftmaxCrossEntropyLossGrad` are not
covered in this PR.