Support T5 Beam Search with DecoderMaskedMHA (#15386)
### Description
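tldr:
Speedup, computed as (latency before / latency after) - 1 from one run per model:
t5-small: 37.8% (average latency 104.74 ms to 76.01 ms, i.e. 27.4% lower)
t5-base: 24.5% (average latency 200.93 ms to 161.40 ms, i.e. 19.7% lower)
Benchmark on V100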
Each result is a single run (`test_times`: 1), so `latency_variance` is 0.00 and the 90th, 95th, and 99th percentile latencies equal the average. `parity` is True in every case.

| Model | Run | Average latency (ms) | QPS |
|---|---|---|---|
| T5-small | Before | 104.74 | 19.10 |
| T5-small | After | 76.01 | 26.31 |
| T5-base | Before | 200.93 | 9.95 |
| T5-base | After | 161.40 | 12.39 |
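
For readers checking the headline percentages, the following is a minimal sketch of the arithmetic, assuming the figures are derived from the `average_latency_ms` values reported above. The helper names `speedup_pct` and `latency_reduction_pct` are hypothetical and are not part of the benchmark script.

```python
# Average latencies in milliseconds, copied from the V100 benchmark output.
BEFORE_MS = {"t5-small": 104.74, "t5-base": 200.93}
AFTER_MS = {"t5-small": 76.01, "t5-base": 161.40}


def speedup_pct(before_ms: float, after_ms: float) -> float:
    """Throughput gain in percent: (before / after - 1) * 100."""
    return (before_ms / after_ms - 1.0) * 100.0


def latency_reduction_pct(before_ms: float, after_ms: float) -> float:
    """Share of the original latency that was removed, in percent."""
    return (1.0 - after_ms / before_ms) * 100.0


for model in BEFORE_MS:
    before, after = BEFORE_MS[model], AFTER_MS[model]
    print(
        f"{model}: speedup {speedup_pct(before, after):.1f}%, "
        f"latency reduction {latency_reduction_pct(before, after):.1f}%"
    )
```

The two metrics differ because the speedup is measured relative to the new latency and the reduction relative to the old one. The quoted 37.8% and 24.5% match the first definition.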
### Motivation and Context
---------
Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>