Improve rpb cuda kernel (#14195)
### Description
Average latency (ms) of the float16 relative position bias CUDA kernel on
V100:
Kernel\Seq_Len | 16 | 32 | 64 | 128 | 256 | 384 | 512 | 768 | 1024 | 2048 | 4096
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
Before | 0.0494 | 0.0654 | 0.1519 | 0.4322 | 1.1865 | 2.4091 | 4.3676 | 14.912 | 36.517 | 142.09 | 561.80
After | 0.0483 | 0.0651 | 0.1294 | 0.3858 | 1.1128 | 2.2988 | 3.8391 | 14.290 | 34.542 | 136.13 | 529.54
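
To read the numbers above as relative improvements, the following is a minimal sketch that computes the percentage latency reduction per sequence length. The latencies are copied from the table; the helper name `latency_reduction_pct` is a hypothetical name chosen for illustration and is not part of this change.

```python
# Latencies in ms, copied from the Before/After rows of the table above.
seq_lens = [16, 32, 64, 128, 256, 384, 512, 768, 1024, 2048, 4096]
before = [0.0494, 0.0654, 0.1519, 0.4322, 1.1865, 2.4091,
          4.3676, 14.912, 36.517, 142.09, 561.80]
after = [0.0483, 0.0651, 0.1294, 0.3858, 1.1128, 2.2988,
         3.8391, 14.290, 34.542, 136.13, 529.54]


def latency_reduction_pct(old_ms, new_ms):
    """Return how much lower new_ms is than old_ms, as a percentage of old_ms."""
    return 100.0 * (old_ms - new_ms) / old_ms


# Map each sequence length to its percentage latency reduction.
reductions = {
    n: latency_reduction_pct(b, a)
    for n, b, a in zip(seq_lens, before, after)
}

for n, pct in reductions.items():
    print(f"seq_len={n:>4}: {pct:5.2f}% lower latency")
```

By this measure every sequence length improves, with the largest relative gain at sequence length 64 and the smallest at 32.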
### Motivation and Context
Follows up on this review comment:
https://github.com/microsoft/onnxruntime/pull/14149/#discussion_r1063152021
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>