Add microbenchmark for layer normalization and improve latency (#22223)
- Added a microbenchmark for the `LayerNormalization` MLFloat16 support
added in https://github.com/microsoft/onnxruntime/pull/22063.
- Updated the `LayerNormalization` MLFloat16 implementation to reduce
latency. In the runs below, real time per iteration drops from roughly
14.6-15.6 ms to 6.8-7.3 ms.
```
----------------------------------------------------------------------------------------------
Original MLFloat16 support                                    Time             CPU   Iterations
----------------------------------------------------------------------------------------------
BM_LayerNormalization<MLFloat16, float>/1/real_time       15599 us        15625 us           47
BM_LayerNormalization<MLFloat16, float>/1/real_time       14714 us        14824 us           39
BM_LayerNormalization<MLFloat16, float>/1/real_time       14634 us        14688 us           50
----------------------------------------------------------------------------------------------
Updated MLFloat16 support                                     Time             CPU   Iterations
----------------------------------------------------------------------------------------------
BM_LayerNormalization<MLFloat16, float>/1/real_time        7276 us         7254 us           84
BM_LayerNormalization<MLFloat16, float>/1/real_time        6820 us         6720 us           93
BM_LayerNormalization<MLFloat16, float>/1/real_time        6840 us         6882 us           84
```
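
For reference, the speedup implied by the numbers above can be worked out with a small helper. This is only a sketch for checking the arithmetic, not part of the benchmark or of ONNX Runtime: `kOriginalUs`, `kUpdatedUs`, `MeanUs`, and `Speedup` are hypothetical names, and the values are the real-time columns copied from the output above.

```cpp
#include <array>
#include <cstddef>
#include <numeric>

// Real-time measurements in microseconds, copied from the benchmark output.
// The names below are illustrative and do not exist in ONNX Runtime.
constexpr std::array<double, 3> kOriginalUs{15599.0, 14714.0, 14634.0};
constexpr std::array<double, 3> kUpdatedUs{7276.0, 6820.0, 6840.0};

// Arithmetic mean of a fixed-size set of timings.
template <std::size_t N>
double MeanUs(const std::array<double, N>& runs) {
  return std::accumulate(runs.begin(), runs.end(), 0.0) / static_cast<double>(N);
}

// Ratio of mean "before" time to mean "after" time (higher is better).
template <std::size_t N>
double Speedup(const std::array<double, N>& before,
               const std::array<double, N>& after) {
  return MeanUs(before) / MeanUs(after);
}
```

Calling `Speedup(kOriginalUs, kUpdatedUs)` on these three runs gives a ratio a little above 2x. With only three samples per configuration this should be read as a rough indication, not a precise measurement.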