Optimize gated GRU CUDA kernel (#15525)
### Description
Optimizes the gated GRU CUDA kernel. The performance improvement was observed with Tulrv6 on an A100 GPU.

### Motivation and Context
---------
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>