onnxruntime
8fc3037f - Support SCELossInternal/SCELossInternalGrad run with larger sized input (#16363)

### Support SCELoss/SCELossGrad run with larger sized input

#### Motivation and Context

Run a bigger batch size for the Bloom model. For the Bloom-560M model, ORT can now run a batch size of 10, up from the initial 6. SCELoss/SCELossGrad's input size is Bsz × 1023 × 250680. When Bsz is larger than 8, the total element count cannot be represented by `int32_t`, which those kernels use to pass the total element count. The overflow is silent and leads to indirect exceptions elsewhere, or to wrong results with no error at all.

#### Changes in this PR

- For the SCELossInternal/SCELossGradInternal CUDA kernels, use `uint64_t` to pass the total element count and element index when the total element count exceeds `int32::max()`.
- For the SCELossInternal/SCELossGradInternal CPU kernels:
  - always use `uint64_t` to pass the element count;
  - update the Eigen functions involved in the two kernels' implementations to pass the element count as `ptrdiff_t` instead of the original `int`.
- Parallelize the SCELossInternal/SCELossGradInternal CPU kernels; otherwise they are extremely slow when handling this many elements.
- Other changes needed:
  - Add `CompareOrtValueNumerals` to compare two `OrtValue`s with different data types (float or float16) without the caller explicitly converting to the lower-precision type. The comparison also runs in parallel, which reduces the comparison time for the large UT case from 22s to ~1.6s.
  - The `IsResultCloselyMatch` check is buggy for nan/inf cases, so fix those bugs.
  - The cross entropy tests run a CPU baseline in float and compare its result against float16 results from CUDA runs, but there is a precision issue in that check.
Because the randomized input data is represented in float, the CPU uses it directly while CUDA uses a float16 version of it, so the inputs themselves differ in precision; as the test data count grows, this makes the comparison fail even at a 1e-2 tolerance. The fix: generate the data in float16, convert it to float for the CPU run, and use the float16 data directly for the CUDA runs. When comparing the outputs, cast the CPU float results back to float16 and compare them with the CUDA outputs.
- `RandomValueGenerator` takes about 20 seconds for the large input size, so `ParallelRandomValueGenerator` is added to generate the random input in parallel; it prepares the input data in under 2s.

#### Non-goals

`SoftmaxCrossEntropyLoss` and `SoftmaxCrossEntropyLossGrad` are not covered in this PR.
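The overflow described in the motivation is easy to check numerically. A minimal sketch in plain Python (simulating signed 32-bit wraparound; the Bsz × 1023 × 250680 shape comes from the PR description):

```python
INT32_MAX = 2**31 - 1

def wrap_int32(x: int) -> int:
    """Simulate storing an arbitrary integer in a signed 32-bit int (two's-complement wrap)."""
    return ((x + 2**31) % 2**32) - 2**31

seq_len, vocab = 1023, 250680  # SCELoss input is Bsz x 1023 x 250680

for bsz in (8, 9, 10):
    total = bsz * seq_len * vocab
    # A kernel passing `total` through int32_t would actually see `wrap_int32(total)`.
    print(f"Bsz={bsz}: total={total}, fits_int32={total <= INT32_MAX}, "
          f"int32_value={wrap_int32(total)}")
```

At Bsz = 8 the count (2,051,565,120) still fits; at Bsz = 9 it exceeds `INT32_MAX` and wraps to a negative value, which is exactly the silent corruption the PR fixes.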
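The input-generation fix for the float/float16 precision mismatch can be sketched with NumPy. This is an illustration of the scheme the PR describes, not the actual `ParallelRandomValueGenerator` code:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1 << 16

# Old scheme: generate in float32. The CUDA path consumes a rounded float16
# copy, so the two backends do not even start from the same inputs.
x32 = rng.standard_normal(n).astype(np.float32)
x16 = x32.astype(np.float16)
input_diff = np.abs(x32 - x16.astype(np.float32)).max()
assert input_diff > 0  # inputs already disagree before any kernel runs

# Fixed scheme: generate in float16 first. Upcasting float16 -> float32 is
# exact, so the CPU (float32) and CUDA (float16) runs see identical values,
# and the CPU output can be cast back to float16 for comparison.
y16 = rng.standard_normal(n).astype(np.float16)
y32 = y16.astype(np.float32)
assert np.array_equal(y32.astype(np.float16), y16)  # lossless round trip
```

The key property is that every float16 value is exactly representable in float32, so the float16 → float32 → float16 round trip is lossless, while float32 → float16 is not.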