[CUDA EP] Add hardswish op and add bf16 support for hardsigmoid (#25562)
### Description
Add a HardSwish operator, defined as x * HardSigmoid(x).
Add bf16 support for HardSigmoid.
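
For reference, the definition above can be sketched in plain Python. This is not the CUDA kernel from this change, only a scalar model of the math; it assumes the ONNX operator defaults (HardSigmoid: alpha = 0.2, beta = 0.5; HardSwish: alpha = 1/6, beta = 0.5), and the function names are hypothetical.

```python
def hard_sigmoid(x: float, alpha: float = 0.2, beta: float = 0.5) -> float:
    """HardSigmoid: clamp(alpha * x + beta, 0, 1). Defaults assumed from the ONNX spec."""
    return max(0.0, min(1.0, alpha * x + beta))


def hard_swish(x: float) -> float:
    """HardSwish: x * HardSigmoid(x), with alpha = 1/6 and beta = 0.5 (assumed ONNX values)."""
    return x * hard_sigmoid(x, alpha=1.0 / 6.0, beta=0.5)


def hard_swish_unfused(xs):
    """Two passes over the data: a HardSigmoid pass, then an elementwise multiply."""
    gate = [hard_sigmoid(x, alpha=1.0 / 6.0, beta=0.5) for x in xs]
    return [x * g for x, g in zip(xs, gate)]


def hard_swish_fused(xs):
    """One pass over the data: both steps computed per element."""
    return [hard_swish(x) for x in xs]
```

Both variants should produce the same values; the fused form only avoids materializing the intermediate HardSigmoid result.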
### Motivation and Context
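HardSwish is currently implemented in the CUDA EP as HardSigmoid + Mul, i.e. two separate elementwise operations.
A fused HardSwish should take roughly half the time of HardSigmoid + Mul, since it makes one pass over the tensor instead of two.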
---------
Co-authored-by: kaiyu <kaiyu@bytedance.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>