SkipSimplifiedLayerNorm + QuickGelu bfloat16 CUDA implementation (#24772)
### Description
Adds bfloat16 CUDA implementations of SkipSimplifiedLayerNorm and QuickGelu (#24772).
### Motivation and Context