SCELoss(SCELossGrad) support half(float) input float(half) output (#13972)
### Description
A follow up change for
https://github.com/microsoft/onnxruntime/pull/13616.
SoftmaxCrossEntropyLossInternal/SoftmaxCrossEntropyLossInternalGrad
support different type for input and output.
Add SCELoss(SCELossGrad) support half(float) input float(half) output
### Test Note
#### Add tests for variant input and output types. To add such tests,
have to refactor existing testing code for sce loss and scelossinternal
gradient.
Originally,
FP32 input and output, the CPU kernels, runs with CPU kernels the
baseline, CUDA/RCOM then runs with same data, user CompareTester to
compare with CPU run results.
FP16 input and output, the CPU kernels (did not have half kernels), runs
with Cast_to_float->CPU kernel->cast_to_half as the baseline, CUDA/RCOM
then runs with same data but using Half implementation, user
CompareTester to compare with CPU run results.
Now, we want the support run different input and output types. The
proposed change here is, to run CPU kernels always with float input and
output as baseline (because CPU only have float type kernels impl), this
step is the very first thing for every test.
Then, we run CUDA/RCOM kernels using half_input_half_output,
float_input_float_output, half_input_float_output,
float_input_half_output if there is corresponding kernel registered.
Afterwards, compare the CUDA/ROCM run results with CPU float baselines.
Be noted, there is one thing that deserved a special note:
CompareOpTester's result compare can be loose than OpTester's.
Roughly speaking: the former tolerant diff <= atol +
rtol*expected_value, while the later one telerant diff < atol && diff <
rtol*expected_value. When the expected value is super small in many
cases of our tests cases, the former one can pass but the later one
fails. So the refactoring also move the check outside of OpTester,
explicitly check the values using the way CompareOPTester did (to align
the previous behaviour).
### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->