[webgpu] optimize SkipLayerNormalization operator (#24164)
When batch_size and sequence_length are both 1, split the work along
hidden_size to improve parallelism.
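
The idea can be illustrated with a small CPU-side sketch. This is not the WebGPU kernel from this change; it is a minimal Python model, assuming a single row (batch_size = sequence_length = 1), no optional bias input, and a hypothetical `num_chunks` parameter standing in for the number of independent workers. Each chunk of hidden_size produces a partial sum and a partial sum of squares, and the partials are then combined into the mean and variance for the row.

```python
import math


def skip_layer_norm_row(x, skip, gamma, beta, epsilon=1e-5, num_chunks=4):
    """Normalize one row of (x + skip), reducing over hidden_size in chunks.

    Illustrative only: `num_chunks` and `epsilon` are assumptions of this
    sketch, not values taken from the actual operator implementation.
    """
    hidden_size = len(x)
    summed = [a + b for a, b in zip(x, skip)]

    # Each chunk stands in for one independent worker (e.g. a GPU workgroup).
    chunk = math.ceil(hidden_size / num_chunks)
    partials = []
    for start in range(0, hidden_size, chunk):
        part = summed[start:start + chunk]
        partials.append((sum(part), sum(v * v for v in part)))

    # Combine the partial results into the row-wide mean and variance.
    total = sum(p[0] for p in partials)
    total_sq = sum(p[1] for p in partials)
    mean = total / hidden_size
    variance = max(total_sq / hidden_size - mean * mean, 0.0)
    inv_std = 1.0 / math.sqrt(variance + epsilon)

    return [(v - mean) * inv_std * g + b
            for v, g, b in zip(summed, gamma, beta)]
```

With a single row, a scheme that assigns one worker per row would leave all but one worker idle; chunking the reduction over hidden_size gives the other workers something to do, at the cost of a final combine step.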