onnxruntime
34f22daf - Support T5 Beam Search with DecoderMaskedMHA (#15386)

Commit
2 years ago
### Description

tldr: Latency improvement, measured as the before/after average-latency ratio (which matches the QPS gain):

- t5-small: 37.8%
- t5-base: 24.5%

Benchmark on V100 (a single test run per configuration, so every percentile equals the average):

| Model | Run | test_times | latency_variance | latency_90_percentile | latency_95_percentile | latency_99_percentile | average_latency_ms | QPS | parity |
|---|---|---|---|---|---|---|---|---|---|
| T5-small ORT | Before | 1 | 0.00 | 104.74 | 104.74 | 104.74 | 104.74 | 19.10 | True |
| T5-base ORT | Before | 1 | 0.00 | 200.93 | 200.93 | 200.93 | 200.93 | 9.95 | True |
| T5-small ORT | After | 1 | 0.00 | 76.01 | 76.01 | 76.01 | 76.01 | 26.31 | True |
| T5-base ORT | After | 1 | 0.00 | 161.40 | 161.40 | 161.40 | 161.40 | 12.39 | True |

---------

Co-authored-by: Ubuntu <wy@v100-2.0cdb2e52twzevn1i4fi45bylyg.jx.internal.cloudapp.net>
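The headline percentages can be reproduced from the reported average latencies. The sketch below is not part of the commit; the helper names `speedup_pct` and `latency_reduction_pct` are made up for illustration. It shows that the quoted 37.8% and 24.5% correspond to the before/after latency ratio (a throughput gain), while the fraction of latency actually removed is smaller.

```python
# Average latencies (ms) copied from the benchmark output in this commit message.
before_ms = {"t5-small": 104.74, "t5-base": 200.93}
after_ms = {"t5-small": 76.01, "t5-base": 161.40}


def speedup_pct(before: float, after: float) -> float:
    """Throughput gain in percent: (before / after - 1) * 100."""
    return (before / after - 1.0) * 100.0


def latency_reduction_pct(before: float, after: float) -> float:
    """Share of the original latency that was removed, in percent."""
    return (1.0 - after / before) * 100.0


# Map each model to (speedup %, latency reduction %), rounded to one decimal.
summary = {
    model: (
        round(speedup_pct(before_ms[model], after_ms[model]), 1),
        round(latency_reduction_pct(before_ms[model], after_ms[model]), 1),
    )
    for model in before_ms
}

for model, (gain, cut) in summary.items():
    print(f"{model}: speedup {gain}%, latency reduction {cut}%")
```

Under this reading, t5-small gains 37.8% in throughput while its latency drops by 27.4%; for t5-base the figures are 24.5% and 19.7%. Since each configuration was timed once (`test_times` is 1), these numbers are a single-sample indication rather than a statistically robust measurement.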