Bugfix for SimplifiedLayerNormalization (#12975)
This PR fixes https://github.com/microsoft/onnxruntime/issues/12930
and https://github.com/microsoft/onnxruntime/issues/12579.
In detail:
- For CPU EP, the current impl of SimplifiedLayerNormalization doesn't
support input and scale having different data types, so a sub-graph
that contains a Cast Op will not be fused. This guarantees that the
inputs and the output all have the same data type.
- For CUDA EP, add (fp16, float) support to the (T, V) type constraints,
so that all combinations of fp16 and float are supported by the impl.
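
The two rules above can be sketched as a small decision function. This is
an illustration only, not the code in this PR: `can_fuse`,
`CUDA_TYPE_PAIRS`, and the op-type lists are hypothetical names, and it
assumes T is the input type and V is the scale type.

```python
from itertools import product

# Hypothetical constants for illustration; the strings match the EP names
# onnxruntime exposes, but nothing here calls into onnxruntime.
CPU_EP = "CPUExecutionProvider"
CUDA_EP = "CUDAExecutionProvider"

# Assumption: (T, V) = (input type, scale type). With this change the CUDA
# kernel is described as accepting every combination of fp16 and float.
CUDA_TYPE_PAIRS = set(product(("float16", "float"), repeat=2))


def can_fuse(ep, subgraph_op_types, input_type, scale_type):
    """Return True if the matched pattern may be fused on the given EP."""
    if ep == CPU_EP:
        # The CPU kernel needs input and scale to share one data type, so
        # any Cast inside the matched sub-graph blocks the fusion.
        return "Cast" not in subgraph_op_types
    if ep == CUDA_EP:
        # The CUDA kernel handles mixed types, so only the pair matters.
        return (input_type, scale_type) in CUDA_TYPE_PAIRS
    return False


pattern = ["Pow", "ReduceMean", "Add", "Sqrt", "Div", "Mul"]
print(can_fuse(CPU_EP, pattern, "float", "float"))
print(can_fuse(CPU_EP, ["Cast"] + pattern, "float16", "float"))
print(can_fuse(CUDA_EP, ["Cast"] + pattern, "float16", "float"))
```

Under these assumptions, a pattern with a Cast is left unfused on CPU but
can still be fused on CUDA with an fp16 input and a float scale.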
With the fix, the original model can be run with
SimplifiedLayerNormalization, which also helps to improve the perf.